1. simplify rand in dsl/package.scala (commit: 8ff4b97274e58f5944506b25481c6eb44238a4cd) (details)
2. [SPARK-24696][SQL] ColumnPruning rule fails to remove extra Project (commit: 3c0af793f9e050f5d8dfb2f5daab6c0043c39748) (details)
Commit 8ff4b97274e58f5944506b25481c6eb44238a4cd by gatorsmile
simplify rand in dsl/package.scala
(cherry picked from commit d54d8b86301581142293341af25fd78b3278a2e8)
Signed-off-by: Xiao Li <>
(commit: 8ff4b97274e58f5944506b25481c6eb44238a4cd)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala (diff)
Commit 3c0af793f9e050f5d8dfb2f5daab6c0043c39748 by gatorsmile
[SPARK-24696][SQL] ColumnPruning rule fails to remove extra Project
The ColumnPruning rule tries adding an extra Project if an input node
produces more fields than needed, but as a post-processing step it
needs to remove the lower Project in the "Project - Filter - Project"
pattern; otherwise it would conflict with PushPredicatesThroughProject
and thus cause an infinite optimization loop. The current
post-processing method is defined as:
```
private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan =
  plan transform {
    case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
        if p2.outputSet.subsetOf(child.outputSet) =>
      p1.copy(child = f.copy(child = child))
  }
```
This method works well when there is only one Filter, but not when
there are two or more. In that case, there is a deterministic filter
and a non-deterministic filter, so they stay as separate Filter nodes
and cannot be combined together.
A simplified illustration of the optimization process that forms the
infinite loop is shown below (F1 stands for the 1st filter, F2 for the
2nd filter, P for a Project, S for a scan of the relation, and
PredicatePushDown as abbrev. of PushPredicatesThroughProject):
```
                          F1 - F2 - P - S
PredicatePushDown     =>  F1 - P - F2 - S
ColumnPruning         =>  F1 - P - F2 - P - S
                      =>  F1 - P - F2 - S         (Project removed)
PredicatePushDown     =>  P - F1 - F2 - S
ColumnPruning         =>  P - F1 - P - F2 - S
                      =>  P - F1 - P - F2 - P - S
                      =>  P - F1 - F2 - P - S     (only one Project removed)
RemoveRedundantProject => F1 - F2 - P - S         (goes back to the loop start)
```
So the problem is that the ColumnPruning rule adds a Project under a
Filter (and fails to remove it in the end), and that new Project
triggers PushPredicateThroughProject. Once the filters have been pushed
through the Project, a new Project is added by the ColumnPruning
rule, and this goes on and on. The fix is that when adding Projects,
the rule still applies top-down, but later, when removing extra Projects,
the process should go bottom-up to ensure all extra Projects can be matched.
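The top-down vs. bottom-up distinction can be sketched with a toy plan model (the `Plan`, `Project`, `Filter`, and `Scan` classes below are illustrative stand-ins, not Catalyst's real `LogicalPlan` classes): a single top-down pass of the Project-removal rewrite collapses only the topmost "Project - Filter - Project" and leaves a nested Project behind, while a bottom-up pass removes them all.

```scala
// Toy plan tree (hypothetical; for illustration only).
sealed trait Plan
case object Scan extends Plan
final case class Project(child: Plan) extends Plan
final case class Filter(child: Plan) extends Plan

// The rewrite: collapse Project - Filter - Project into Project - Filter.
def rule(p: Plan): Plan = p match {
  case Project(Filter(Project(c))) => Project(Filter(c))
  case other                       => other
}

// Top-down (like Catalyst's transformDown): apply the rule at the node
// first, then recurse into the result's children.
def transformDown(p: Plan): Plan = rule(p) match {
  case Project(c) => Project(transformDown(c))
  case Filter(c)  => Filter(transformDown(c))
  case Scan       => Scan
}

// Bottom-up (like Catalyst's transformUp): rewrite the children first,
// then apply the rule at the node, so the match sees already-cleaned subtrees.
def transformUp(p: Plan): Plan = {
  val withNewChildren = p match {
    case Project(c) => Project(transformUp(c))
    case Filter(c)  => Filter(transformUp(c))
    case Scan       => Scan
  }
  rule(withNewChildren)
}

// P - F1 - P - F2 - P - S from the walkthrough above:
val plan = Project(Filter(Project(Filter(Project(Scan)))))
println(transformDown(plan)) // Project(Filter(Filter(Project(Scan)))) -- one Project left behind
println(transformUp(plan))   // Project(Filter(Filter(Scan)))          -- both extra Projects removed
```

The top-down pass misses the lower pair because, after the top-level rewrite, the remaining "Project - Filter - Project" sits under a Filter where the pattern is never re-checked; the bottom-up pass has already collapsed the inner pair by the time it reaches the top.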
Added an optimization rule test in ColumnPruningSuite and an end-to-end
test in SQLQuerySuite.
Author: maryannxue <>
Closes #21674 from maryannxue/spark-24696.
(commit: 3c0af793f9e050f5d8dfb2f5daab6c0043c39748)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala (diff)