Changes

Summary

  1. [SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL benchmarks
Commit f5118f81e395bde0cd8253dbef6a9e6455c3958a by dhyun
[SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL
benchmarks
### What changes were proposed in this pull request?
In this PR, I propose to replace `.collect()`, `.count()` and
`.foreach(_ => ())` in SQL benchmarks with writes to the `NoOp`
datasource. I added an implicit class to `SqlBasedBenchmark` that
provides the `.noop()` method, so a benchmark can call `ds.noop()`.
That call expands to `ds.write.format("noop").mode(Overwrite).save()`.
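For readers who drive Spark from Python, the same idea can be sketched as a small helper. This is only an illustration, not the code added by this PR (which is a Scala implicit class in `SqlBasedBenchmark`): the function name `noop` is hypothetical, and the sketch assumes `df` is a PySpark `DataFrame` on a Spark version that ships the built-in `noop` data source.

```python
def noop(df):
    """Execute the query plan behind `df` and discard every row.

    Assumption: `df` is a pyspark.sql.DataFrame and the running Spark
    version provides the built-in "noop" data source.
    """
    # Python analogue of the Scala chain quoted in the description:
    #   ds.write.format("noop").mode(Overwrite).save()
    # Nothing is returned to the driver, so the measured time excludes
    # the row conversion and data transfer that collect() performs.
    df.write.format("noop").mode("overwrite").save()
```

In a benchmark body, `noop(df)` would take the place of `df.collect()` or `df.count()`.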
### Why are the changes needed?
To avoid the additional overhead of `collect()` and other actions. For
example, `.collect()` has to convert values to their external types and
pull the data to the driver. This overhead can hide actual performance
regressions or improvements in the benchmarked operations.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Re-ran all modified benchmarks on Amazon EC2:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |
- Ran `TPCDSQueryBenchmark` using the instructions from PR #26049:
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests
# Generate data. (JDK8)
$ git clone git@github.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1  # This needs Spark 2.4
```
- Other benchmarks were run by the script:
```
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd
benchmarks = [
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
   ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
   ['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
   ['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
   ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]
print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
   print("Run benchmark: %s" % b[1])
   run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
Closes #27078 from MaxGekk/noop-in-benchmarks.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala
The file was modified sql/core/benchmarks/CSVBenchmark-results.txt
The file was modified sql/core/benchmarks/DataSourceReadBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/MiscBenchmark-results.txt
The file was modified sql/core/benchmarks/TPCDSQueryBenchmark-results.txt
The file was modified sql/core/benchmarks/IntervalBenchmark-jdk11-results.txt
The file was modified external/avro/benchmarks/AvroReadBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/DateTimeBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/MakeDateTimeBenchmark-jdk11-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmark.scala
The file was modified sql/core/benchmarks/JsonBenchmark-results.txt
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
The file was modified sql/core/benchmarks/AggregateBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/RangeBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/RangeBenchmark-results.txt
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/execution/benchmark/ObjectHashAggregateExecBenchmark.scala
The file was modified sql/core/benchmarks/UDFBenchmark-jdk11-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/MakeDateTimeBenchmark.scala
The file was added sql/core/benchmarks/WideSchemaBenchmark-jdk11-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/WideTableBenchmark.scala
The file was added sql/hive/benchmarks/OrcReadBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/OrcV2NestedSchemaPruningBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/IntervalBenchmark-results.txt
The file was modified sql/core/benchmarks/FilterPushdownBenchmark-results.txt
The file was modified sql/core/benchmarks/OrcV2NestedSchemaPruningBenchmark-results.txt
The file was modified sql/core/benchmarks/JoinBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/WideSchemaBenchmark-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/WideSchemaBenchmark.scala
The file was modified sql/core/benchmarks/AggregateBenchmark-results.txt
The file was modified sql/core/benchmarks/InExpressionBenchmark-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BloomFilterBenchmark.scala
The file was modified sql/core/benchmarks/BloomFilterBenchmark-jdk11-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala
The file was modified sql/core/benchmarks/CSVBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/JoinBenchmark-results.txt
The file was modified sql/core/benchmarks/OrcNestedSchemaPruningBenchmark-jdk11-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala
The file was modified sql/core/benchmarks/MiscBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/ExtractBenchmark-results.txt
The file was modified sql/core/benchmarks/DateTimeBenchmark-results.txt
The file was modified sql/hive/benchmarks/OrcReadBenchmark-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/ExtractBenchmark.scala
The file was added sql/core/benchmarks/JsonBenchmark-jdk11-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/SqlBasedBenchmark.scala
The file was modified external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroReadBenchmark.scala
The file was modified sql/core/benchmarks/MakeDateTimeBenchmark-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/NestedSchemaPruningBenchmark.scala
The file was modified sql/hive/benchmarks/ObjectHashAggregateExecBenchmark-results.txt
The file was modified sql/core/benchmarks/DataSourceReadBenchmark-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DateTimeBenchmark.scala
The file was added sql/core/benchmarks/TPCDSQueryBenchmark-jdk11-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UDFBenchmark.scala
The file was modified external/avro/benchmarks/AvroReadBenchmark-results.txt
The file was modified sql/core/benchmarks/BloomFilterBenchmark-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/MiscBenchmark.scala
The file was added sql/core/benchmarks/FilterPushdownBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/WideTableBenchmark-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
The file was added sql/core/benchmarks/InExpressionBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/WideTableBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/ParquetNestedSchemaPruningBenchmark-results.txt
The file was added sql/hive/benchmarks/ObjectHashAggregateExecBenchmark-jdk11-results.txt
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/IntervalBenchmark.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InExpressionBenchmark.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/RangeBenchmark.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
The file was modified sql/core/benchmarks/OrcNestedSchemaPruningBenchmark-results.txt
The file was modified sql/core/benchmarks/ParquetNestedSchemaPruningBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/ExtractBenchmark-jdk11-results.txt
The file was modified sql/core/benchmarks/UDFBenchmark-results.txt