Changes

Summary

  1. [SPARK-25164][SQL] Avoid rebuilding column and path list for each column (commit: 8db935f9724d0820a010340651a4d61da1bfa916)
Commit 8db935f9724d0820a010340651a4d61da1bfa916 by gatorsmile
[SPARK-25164][SQL] Avoid rebuilding column and path list for each column
in parquet reader
## What changes were proposed in this pull request?
VectorizedParquetRecordReader::initializeInternal rebuilds the column
list and path list once for each column. Therefore, it indirectly
iterates 2 * colCount * colCount times for each parquet file.
This inefficiency impacts jobs that read parquet-backed tables with many
columns and many files. Jobs that read tables with few columns or few
files are not impacted.
This PR changes initializeInternal so that it builds each list only
once.
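The difference between the two access patterns can be sketched in isolation. This is a minimal illustration, not the actual Spark or Parquet code: the `Schema` class, its `getColumns` and `getPaths` accessors, and the `elementsBuilt` counter are hypothetical stand-ins, written under the assumption that each accessor call rebuilds its list from scratch.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnListSketch {

    /** Hypothetical schema whose accessors rebuild their list on every call. */
    static final class Schema {
        private final int colCount;
        long elementsBuilt = 0; // total list elements created so far

        Schema(int colCount) {
            this.colCount = colCount;
        }

        List<Integer> getColumns() {
            return build();
        }

        List<Integer> getPaths() {
            return build();
        }

        // Builds a fresh list of colCount entries and counts the work done.
        private List<Integer> build() {
            List<Integer> list = new ArrayList<>(colCount);
            for (int i = 0; i < colCount; i++) {
                list.add(i);
                elementsBuilt++;
            }
            return list;
        }
    }

    /** Before: both lists are rebuilt inside the per-column loop. */
    static long rebuildPerColumn(int colCount) {
        Schema schema = new Schema(colCount);
        for (int i = 0; i < colCount; i++) {
            schema.getColumns().get(i);
            schema.getPaths().get(i);
        }
        return schema.elementsBuilt;
    }

    /** After: each list is built once, before the loop. */
    static long buildOnce(int colCount) {
        Schema schema = new Schema(colCount);
        List<Integer> columns = schema.getColumns();
        List<Integer> paths = schema.getPaths();
        for (int i = 0; i < colCount; i++) {
            columns.get(i);
            paths.get(i);
        }
        return schema.elementsBuilt;
    }

    public static void main(String[] args) {
        int colCount = 600;
        System.out.println("rebuilt per column: " + rebuildPerColumn(colCount));
        System.out.println("built once:         " + buildOnce(colCount));
    }
}
```

For 600 columns, the first variant builds 2 * 600 * 600 = 720,000 list elements and the second builds 1,200, which is why the cost only becomes noticeable for very wide tables.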
I ran benchmarks on my laptop with 1 worker thread, running this query:
    sql("select * from parquet_backed_table where id1 = 1").collect
There is roughly one matching row for every 425 rows, and the
matching rows are spread fairly evenly throughout the table (that is,
every page for column `id1` has at least one matching row).
6000 columns, 1 million rows, 67 32M files:
master | branch | improvement
-------|---------|-----------
10.87 min | 6.09 min | 44%
6000 columns, 1 million rows, 23 98M files:
master | branch | improvement
-------|---------|-----------
7.39 min | 5.80 min | 21%
600 columns, 10 million rows, 67 32M files:
master | branch | improvement
-------|---------|-----------
1.95 min | 1.96 min | -0.5%
60 columns, 100 million rows, 67 32M files:
master | branch | improvement
-------|---------|-----------
0.55 min | 0.55 min | 0%
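The improvement column appears to be the relative reduction in runtime, (master - branch) / master. That formula is an assumption inferred from the reported numbers, not something the commit message states. A minimal sketch, with the hypothetical helper `improvementPercent`, recomputes it from the runtimes above:

```java
import java.util.Locale;

public class ImprovementSketch {

    /** Relative runtime reduction in percent (assumed formula, see lead-in). */
    static double improvementPercent(double masterMinutes, double branchMinutes) {
        return (masterMinutes - branchMinutes) / masterMinutes * 100.0;
    }

    public static void main(String[] args) {
        // {master, branch} runtimes in minutes, copied from the benchmark tables.
        double[][] runs = {
            {10.87, 6.09},
            {7.39, 5.80},
            {1.95, 1.96},
            {0.55, 0.55},
        };
        for (double[] run : runs) {
            System.out.println(String.format(Locale.ROOT,
                "%.2f min -> %.2f min: %.1f%%",
                run[0], run[1], improvementPercent(run[0], run[1])));
        }
    }
}
```

The recomputed values land close to the reported column (about 44% for the first row). Small differences are expected because the published runtimes are already rounded.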
## How was this patch tested?
- sql unit tests
- pyspark-sql tests
Closes #22188 from bersprockets/SPARK-25164.
Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
The file was modified: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java