1. [SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation
Commit 40668c53ed799881db1f316ceaf2f978b294d8ed by dhyun
[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation
## What changes were proposed in this pull request?
The upper bound on the number of output rows of a group-by is computed by
multiplying the distinct counts of the group-by columns. However, a column
that contains only null values has a distinct count of 0, which makes the
estimated output row count 0. This is incorrect.
Ex:
col1 (distinct: 2, rowCount: 2)
col2 (distinct: 0, rowCount: 2)
=> group by col1, col2
Actual: output rows: 0
Expected: output rows: 2
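The arithmetic in the example above can be reproduced with a small standalone model. This is only an illustrative sketch in Python, not the code from AggregateEstimation.scala: the names `ColumnStat`, `naive_estimate`, and `fixed_estimate` are hypothetical, the null counts are inferred from "only null value" plus rowCount 2, and both the rule "an all-null column contributes one group" and the cap at the child row count are assumptions chosen to match the expected result of 2.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class ColumnStat:
    """Hypothetical per-column statistics (not Spark's ColumnStat class)."""
    distinct_count: int  # number of distinct non-null values
    null_count: int      # number of null values


def naive_estimate(child_row_count: int, stats: list[ColumnStat]) -> int:
    """Multiply the distinct counts directly, as described before the fix."""
    return min(child_row_count, prod(s.distinct_count for s in stats))


def fixed_estimate(child_row_count: int, stats: list[ColumnStat]) -> int:
    """Assumed fix: a column holding only nulls still forms one group."""
    def groups(s: ColumnStat) -> int:
        if s.distinct_count == 0 and s.null_count > 0:
            return 1  # assumption: the null values form a single group
        return s.distinct_count

    # Assumption: the estimate never exceeds the number of input rows.
    return min(child_row_count, prod(groups(s) for s in stats))


# The example from the description: col1 has 2 distinct values,
# col2 contains only nulls, and the child has 2 rows.
col1 = ColumnStat(distinct_count=2, null_count=0)
col2 = ColumnStat(distinct_count=0, null_count=2)

print(naive_estimate(2, [col1, col2]))  # 0 (the reported wrong estimate)
print(fixed_estimate(2, [col1, col2]))  # 2 (the expected estimate)
```

Under these assumptions the naive product collapses to 0 as soon as any group-by column is entirely null, while the adjusted version yields the expected 2.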
## How was this patch tested?
A corresponding unit test has been added, and a manual test has been run
in our TPC-DS benchmark environment.
Closes #24286 from pengbo/master.
Lead-authored-by: pengbo <bo.peng1019@gmail.com>
Co-authored-by: mingbo_pb <mingbo.pb@alibaba-inc.com>
Signed-off-by: Dongjoon Hyun
(cherry picked from commit c58a4fed8d79aff9fbac9f9a33141b2edbfb0cea)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AggregateEstimation.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/AggregateEstimationSuite.scala