SuccessChanges

Summary

  1. [SPARK-32175][CORE] Fix the order between initialization for (commit: 9be088357eff4328248b29a3a49a816756745345) (details)
  2. [SPARK-32477][CORE] JsonProtocol.accumulablesToJson should be (commit: 5eab8d27e686fac5688ba4482599e3652ec17882) (details)
  3. [SPARK-32449][ML][PYSPARK] Add summary to (commit: 40e6a5bbb0dbedae4a270f830aafd4cb310a8fe2) (details)
  4. [SPARK-32346][SQL] Support filters pushdown in Avro datasource (commit: d897825d2d0430cb52ae9ac0f6fd742582041682) (details)
  5. [SPARK-32476][CORE] ResourceAllocator.availableAddrs should be (commit: 9dc02378518b4b0b8a069684f575ed40813fa417) (details)
  6. [SPARK-30322][DOCS] Add stage level scheduling docs (commit: e926d419d305c9400f6f2426ca3e8d04a9180005) (details)
  7. [SPARK-32332][SQL] Support columnar exchanges (commit: a025a89f4ef3a05d7e70c02f03a9826bb97eceac) (details)
  8. [SPARK-32397][BUILD] Allow specifying of time for build to keep time (commit: 50911df08eb7a27494dc83bcec3d09701c2babfe) (details)
  9. [SPARK-32487][CORE] Remove j.w.r.NotFoundException from `import` in (commit: 163867435a6af1e9a37521e34ea41b07168f4730) (details)
  10. [SPARK-32248][BUILD] Recover Java 11 build in Github Actions (commit: 08a66f8fd0df38280dfd54bb79aa8a8ae1272fc9) (details)
  11. [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties (commit: 89d9b7cc64f01de9b3e88352d6a1979852873a77) (details)
  12. [SPARK-32455][ML] LogisticRegressionModel prediction optimization (commit: 81b0785fb2d9a2d45d4366a58a3c30fe478c299a) (details)
  13. [SPARK-32431][SQL] Check duplicate nested columns in read from in-built (commit: 99a855575c3a5554443a27385caf49661cc7f139) (details)
  14. [SPARK-32478][R][SQL] Error message to show the schema mismatch in (commit: e1d73210341a314601a953e6ac483112660874e6) (details)
  15. [SPARK-32412][SQL] Unify error handling for spark thrift server (commit: 510a1656e650246a708d3866c8a400b7a1b9f962) (details)
  16. [SPARK-32488][SQL] Use @parser::members and @lexer::members to avoid (commit: 30e3042dc5febf49123483184e6282fefde8ebc0) (details)
  17. [SPARK-32491][INFRA] Do not install SparkR in test-only mode in testing (commit: 1f7fe5415e88cc289b44e366cd4e74290784db5f) (details)
  18. [SPARK-32493][INFRA] Manually install R instead of using setup-r in (commit: e0c8bd07af6ea2873c77ae6428b3ab4ee68e8e32) (details)
  19. [SPARK-32496][INFRA] Include GitHub Action file as the changes in (commit: 12f443cd99a91689dc5b44b6794205289ef2d998) (details)
  20. [SPARK-32227] Fix regression bug in load-spark-env.cmd with Spark 3.0.0 (commit: 743772095273b464f845efefb3eb59284b06b9be) (details)
  21. [SPARK-32497][INFRA] Installs qpdf package for CRAN check in GitHub (commit: 32f4ef005fd590e0e7c319b43a459cb3828bba5a) (details)
  22. [SPARK-32489][CORE] Pass `core` module UTs in Scala 2.13 (commit: 7cf3b54a2a7528e815841015af50e08ce4515cb9) (details)
  23. [SPARK-32199][SPARK-32198] Reduce job failures during decommissioning (commit: 366a1789333bac97643159857a206bcd773761a4) (details)
  24. [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite (commit: 6032c5b0320fe70455586f4ce863d5d9361b5e07) (details)
  25. [SPARK-32175][SPARK-32175][FOLLOWUP] Remove flaky test added in (commit: 9d7b1d935f7a2b770d8b2f264cfe4a4db2ad64b6) (details)
  26. [SPARK-32482][SS][TESTS] Eliminate deprecated poll(long) API calls to (commit: f6027827a49a52f6586f73aee9e4067659f650b6) (details)
  27. [SPARK-32421][SQL] Add code-gen for shuffled hash join (commit: ae82768c1396bfe626b52c4eac33241a9eb91f54) (details)
  28. [SPARK-32468][SS][TESTS] Fix timeout config issue in Kafka connector (commit: 813532d10310027fee9e12680792cee2e1c2b7c7) (details)
  29. [SPARK-32160][CORE][PYSPARK] Add a config to switch allow/disallow to (commit: 8014b0b5d61237dc4851d4ae9927778302d692da) (details)
  30. [SPARK-32406][SQL][FOLLOWUP] Make RESET fail against static and core (commit: f4800406a455b33afa7b1d62d10f236da4cd1f83) (details)
  31. [SPARK-31418][CORE][FOLLOW-UP][MINOR] Fix log messages to print stage id (commit: 4eaf3a0a23d66f234e98062852223d81ec770fbe) (details)
  32. [SPARK-31894][SS][FOLLOW-UP] Rephrase the config doc (commit: 354313b6bc89149f97b7cebf6249abd9e3e87724) (details)
  33. [SPARK-32083][SQL] AQE coalesce should at least return one partition (commit: 1c6dff7b5fc171c190feea0d8f7d323e330d9151) (details)
Commit 9be088357eff4328248b29a3a49a816756745345 by tgraves
[SPARK-32175][CORE] Fix the order between initialization for
ExecutorPlugin and starting heartbeat thread
### What changes were proposed in this pull request?
This PR changes the order between ExecutorPlugin initialization and starting the heartbeat thread in Executor.
### Why are the changes needed?
In the current master, the heartbeat thread in an executor starts after plugin initialization, so if the initialization takes a long time, no heartbeat is sent to the driver and the executor will be removed from the cluster.
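A minimal sketch of the idea (the `ExecutorSketch` class and `initPlugins` hook below are hypothetical names, not the actual `Executor` code): start the heartbeat thread before running potentially slow plugin initialization, so a long plugin init no longer looks like a dead executor to the driver.
```scala
// Hypothetical sketch, not the real org.apache.spark.executor.Executor code.
class ExecutorSketch(heartbeatTask: Runnable, initPlugins: () => Unit) {
  def start(): Unit = {
    // 1. Start heartbeating first so the driver keeps seeing this executor as alive.
    val heartbeater = new Thread(heartbeatTask, "executor-heartbeater")
    heartbeater.setDaemon(true)
    heartbeater.start()
    // 2. Only then run ExecutorPlugin initialization, which may take a long time.
    initPlugins()
  }
}
```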
### Does this PR introduce _any_ user-facing change?
Yes. Plugins for executors will be allowed to take a long time for initialization.
### How was this patch tested?
New testcase.
Closes #29002 from sarutak/fix-heartbeat-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by:
Thomas Graves <tgraves@apache.org>
(commit: 9be088357eff4328248b29a3a49a816756745345)
The file was modified core/src/main/scala/org/apache/spark/executor/Executor.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/TestUtils.scala (diff)
Commit 5eab8d27e686fac5688ba4482599e3652ec17882 by dongjoon
[SPARK-32477][CORE] JsonProtocol.accumulablesToJson should be
deterministic
### What changes were proposed in this pull request?
This PR aims to make `JsonProtocol.accumulablesToJson` deterministic.
### Why are the changes needed?
Currently, `JsonProtocol.accumulablesToJson` is nondeterministic, so `JsonProtocolSuite` itself is also using mixed test cases in terms of
`"Accumulables": [ ... ]`.
Not only is this nondeterministic, but it also causes a UT failure in `JsonProtocolSuite` in Scala 2.13.
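A minimal sketch of how such output can be made deterministic (the simplified `Acc` record and field names are hypothetical; the real code works on Spark's `AccumulableInfo` inside `JsonProtocol`): sort by a stable key before building the JSON array, so the result no longer depends on the iteration order of a hash-based collection.
```scala
import org.json4s.JsonAST.{JArray, JValue}
import org.json4s.JsonDSL._

// Hypothetical simplified accumulable record used only for illustration.
case class Acc(id: Long, name: String, value: String)

// Sorting by id before serialization makes the JSON array order deterministic.
def accumulablesToJsonSketch(accs: Iterable[Acc]): JArray =
  JArray(accs.toList.sortBy(_.id).map { a =>
    val json: JValue = ("ID" -> a.id) ~ ("Name" -> a.name) ~ ("Value" -> a.value)
    json
  })
```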
### Does this PR introduce _any_ user-facing change?
Yes; however, this only fixes nondeterministic behavior.
### How was this patch tested?
- Scala 2.12: Pass the GitHub Action or Jenkins.
- Scala 2.13: Do the following.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none
-DwildcardSuites=org.apache.spark.util.JsonProtocolSuite
```
**BEFORE**
```
*** 1 TEST FAILED ***
```
**AFTER**
```
All tests passed.
```
Closes #29282 from dongjoon-hyun/SPARK-32477.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 5eab8d27e686fac5688ba4482599e3652ec17882)
The file was modified core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/JsonProtocol.scala (diff)
Commit 40e6a5bbb0dbedae4a270f830aafd4cb310a8fe2 by srowen
[SPARK-32449][ML][PYSPARK] Add summary to
MultilayerPerceptronClassificationModel
### What changes were proposed in this pull request? Add training
summary to MultilayerPerceptronClassificationModel...
### Why are the changes needed? So that users can get the training process status, such as the loss value of each iteration and the total number of iterations.
### Does this PR introduce _any_ user-facing change? Yes
MultilayerPerceptronClassificationModel.summary
MultilayerPerceptronClassificationModel.evaluate
### How was this patch tested? new tests
Closes #29250 from huaxingao/mlp_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
(commit: 40e6a5bbb0dbedae4a270f830aafd4cb310a8fe2)
The file was modified python/docs/source/reference/pyspark.ml.rst (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala (diff)
The file was modified python/pyspark/ml/classification.py (diff)
The file was modified python/pyspark/ml/tests/test_training_summary.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/ann/ANNSuite.scala (diff)
Commit d897825d2d0430cb52ae9ac0f6fd742582041682 by gengliang.wang
[SPARK-32346][SQL] Support filters pushdown in Avro datasource
### What changes were proposed in this pull request? In the PR, I propose to support pushed down filters in the Avro datasource, V1 and V2:
1. Added a new SQL config `spark.sql.avro.filterPushdown.enabled` to control filter pushdown to the Avro datasource. It is on by default (see the usage sketch below).
2. Renamed `CSVFilters` to `OrderedFilters`.
3. `OrderedFilters` is used in `AvroFileFormat` (DSv1) and in `AvroPartitionReaderFactory` (DSv2).
4. Modified `AvroDeserializer` to return None from the `deserialize` method when pushdown filters return `false`.
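A hedged usage sketch of the new flag (the file path and column name below are made up; the config name `spark.sql.avro.filterPushdown.enabled` is the one added by this PR, and the spark-avro module must be on the classpath):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("avro-pushdown-sketch").getOrCreate()

// The flag is on by default; it is set explicitly here only for illustration.
spark.conf.set("spark.sql.avro.filterPushdown.enabled", "true")

// Hypothetical Avro input and column name.
val df = spark.read.format("avro").load("/tmp/events.avro")
val filtered = df.filter(col("status") === "ok")

// With pushdown enabled, the filter can be evaluated while deserializing Avro
// records instead of materializing every row first.
filtered.explain()
filtered.show()
```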
### Why are the changes needed? The changes improve performance on
synthetic benchmarks up to **2** times on JDK 11:
```
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Filters pushdown:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------
w/o filters                                9614           9669          54          0.1        9614.1       1.0X
pushdown disabled                         10077          10141          66          0.1       10077.2       1.0X
w/ filters                                 4681           4713          29          0.2        4681.5       2.1X
```
### Does this PR introduce any user-facing change? No
### How was this patch tested?
- Added UT to `AvroCatalystDataConversionSuite` and `AvroSuite`
- Re-running `AvroReadBenchmark` using Amazon EC2:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5
(ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk` |
and `./dev/run-benchmarks`:
```python
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['avro/test',
     'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
Closes #29145 from MaxGekk/avro-filters-pushdown.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Gengliang Wang
<gengliang.wang@databricks.com>
(commit: d897825d2d0430cb52ae9ac0f6fd742582041682)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScanBuilder.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScan.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroPartitionReaderFactory.scala (diff)
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/OrderedFiltersSuite.scala
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala (diff)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala (diff)
The file was removed sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVFilters.scala
The file was modified external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroReadBenchmark.scala (diff)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala (diff)
The file was modified external/avro/benchmarks/AvroReadBenchmark-jdk11-results.txt (diff)
The file was modified external/avro/benchmarks/AvroReadBenchmark-results.txt (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/OrderedFilters.scala
The file was removed sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVFiltersSuite.scala
Commit 9dc02378518b4b0b8a069684f575ed40813fa417 by dongjoon
[SPARK-32476][CORE] ResourceAllocator.availableAddrs should be
deterministic
### What changes were proposed in this pull request?
This PR aims to make `ResourceAllocator.availableAddrs` deterministic.
### Why are the changes needed?
Currently, this function returns results nondeterministically due to the underlying `HashMap`. So, the test case itself creates a list `[0, 1, 2]` initially, but ends up comparing `[2, 1, 0]`.
Not only does this happen in 3.0.0, but it also causes UT failures in the Scala 2.13 environment.
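A minimal sketch of the fix's idea under hypothetical names (`AddressPoolSketch` is not Spark's `ResourceAllocator`): keep the hash-based bookkeeping, but expose the available addresses in sorted order so callers always see the same sequence.
```scala
import scala.collection.mutable

// Hypothetical simplified allocator used only to illustrate the determinism fix.
class AddressPoolSketch(addresses: Seq[String], slotsPerAddress: Int) {
  // HashMap iteration order is unspecified, which is what made the old output nondeterministic.
  private val availableSlots = mutable.HashMap(addresses.map(_ -> slotsPerAddress): _*)

  def availableAddrs: Seq[String] =
    availableSlots.toSeq.filter(_._2 > 0).map(_._1).sorted  // sort to get a stable order
}
```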
### Does this PR introduce _any_ user-facing change?
Yes, but this fixes the nondeterministic behavior.
### How was this patch tested?
- Scala 2.12: This should pass the UT with the modified test case.
- Scala 2.13: This can be tested like the following (at least
`JsonProtocolSuite`)
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none
-DwildcardSuites=org.apache.spark.deploy.JsonProtocolSuite
```
**BEFORE**
```
*** 2 TESTS FAILED ***
```
**AFTER**
```
All tests passed.
```
Closes #29281 from dongjoon-hyun/SPARK-32476.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 9dc02378518b4b0b8a069684f575ed40813fa417)
The file was modified core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/resource/ResourceAllocator.scala (diff)
Commit e926d419d305c9400f6f2426ca3e8d04a9180005 by tgraves
[SPARK-30322][DOCS] Add stage level scheduling docs
### What changes were proposed in this pull request?
Document the stage level scheduling feature.
### Why are the changes needed?
Document the stage level scheduling feature.
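For context, a hedged sketch of the API the new docs describe (assuming a running `SparkContext` named `sc` and a cluster manager that supports stage level scheduling; resource names such as "gpu" are examples):
```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Request different executor/task resources for a particular stage.
val executorReqs = new ExecutorResourceRequests().cores(4).resource("gpu", 1)
val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)
val profile = new ResourceProfileBuilder().require(executorReqs).require(taskReqs).build()

// RDD.withResources attaches the profile to the stages computing this RDD.
val rdd = sc.parallelize(1 to 100).map(_ * 2).withResources(profile)
rdd.collect()
```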
### Does this PR introduce _any_ user-facing change?
Documentation.
### How was this patch tested?
n/a docs only
Closes #29292 from tgravescs/SPARK-30322.
Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas
Graves <tgraves@apache.org>
(commit: e926d419d305c9400f6f2426ca3e8d04a9180005)
The file was modified docs/configuration.md (diff)
The file was modified docs/running-on-yarn.md (diff)
Commit a025a89f4ef3a05d7e70c02f03a9826bb97eceac by tgraves
[SPARK-32332][SQL] Support columnar exchanges
### What changes were proposed in this pull request?
This PR adds abstract classes for shuffle and broadcast, so that users
can provide their columnar implementations.
This PR updates several places to use the abstract exchange classes, and also updates `AdaptiveSparkPlanExec` so that the columnar rules can see exchange nodes.
This is an alternative to https://github.com/apache/spark/pull/29134 .
Close https://github.com/apache/spark/pull/29134
### Why are the changes needed?
To allow columnar exchanges.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new tests
Closes #29262 from cloud-fan/columnar.
Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Thomas
Graves <tgraves@apache.org>
(commit: a025a89f4ef3a05d7e70c02f03a9826bb97eceac)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CustomShuffleReaderExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeLocalShuffleReader.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/simpleCosting.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala (diff)
Commit 50911df08eb7a27494dc83bcec3d09701c2babfe by d_tsai
[SPARK-32397][BUILD] Allow specifying of time for build to keep time
consistent between modules
### What changes were proposed in this pull request?
Upgrade codehaus maven build helper to allow people to specify a time
during the build to avoid snapshot artifacts with different version
strings.
### Why are the changes needed?
During snapshot builds, Maven may assign different versions to different artifacts based on the time each individual sub-module starts building.
The timestamp is used as part of the version string when running `maven deploy` on a snapshot build. This results in different sub-modules having different version strings.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual build while specifying the current time; ensured the time is consistent in the sub-components.
Open question: Ideally I'd like to backport this as well since it's sort of a bug fix, and while it does change a dependency version, it's not one that is propagated. I'd like to hear folks' thoughts about this.
Closes #29274 from
holdenk/SPARK-32397-snapshot-artifact-timestamp-differences.
Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: DB Tsai
<d_tsai@apple.com>
(commit: 50911df08eb7a27494dc83bcec3d09701c2babfe)
The file was modified pom.xml (diff)
Commit 163867435a6af1e9a37521e34ea41b07168f4730 by dongjoon
[SPARK-32487][CORE] Remove j.w.r.NotFoundException from `import` in
[Stages|OneApplication]Resource
### What changes were proposed in this pull request?
This PR aims to remove `javax.ws.rs.NotFoundException` from two problematic `import` statements. All the other use cases are correct.
### Why are the changes needed?
In `StagesResource` and `OneApplicationResource`, there exist two
`NotFoundException`s.
- javax.ws.rs.NotFoundException
- org.apache.spark.status.api.v1.NotFoundException
To use `org.apache.spark.status.api.v1.NotFoundException` correctly, we should not import `javax.ws.rs.NotFoundException`. This causes UT failures in the Scala 2.13 environment.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Scala 2.12: Pass the GitHub Action or Jenkins.
- Scala 2.13: Do the following manually.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none
-DwildcardSuites=org.apache.spark.deploy.history.HistoryServerSuite
```
**BEFORE**
```
*** 4 TESTS FAILED ***
```
**AFTER**
```
*** 1 TEST FAILED ***
```
Closes #29293 from dongjoon-hyun/SPARK-32487.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 163867435a6af1e9a37521e34ea41b07168f4730)
The file was modified core/src/main/scala/org/apache/spark/status/api/v1/OneApplicationResource.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala (diff)
Commit 08a66f8fd0df38280dfd54bb79aa8a8ae1272fc9 by dongjoon
[SPARK-32248][BUILD] Recover Java 11 build in Github Actions
### What changes were proposed in this pull request?
This PR aims to recover Java 11 build in `GitHub Action`.
### Why are the changes needed?
This test coverage was removed before. Now, it's time to recover it.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action.
Closes #29295 from dongjoon-hyun/SPARK-32248.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 08a66f8fd0df38280dfd54bb79aa8a8ae1272fc9)
The file was modified .github/workflows/master.yml (diff)
Commit 89d9b7cc64f01de9b3e88352d6a1979852873a77 by gurwls223
[SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties
and fixing a thread leak issue in pinned thread mode
### What changes were proposed in this pull request?
This PR proposes:
1. To introduce `InheritableThread` class, that works identically with
`threading.Thread` but it can inherit the inheritable attributes of a
JVM thread such as `InheritableThreadLocal`.
    This was a problem from the pinned thread mode, see also
https://github.com/apache/spark/pull/24898. Now it works as below:
    ```python
    import pyspark

    spark.sparkContext.setLocalProperty("a", "hi")

    def print_prop():
        print(spark.sparkContext.getLocalProperty("a"))

    pyspark.InheritableThread(target=print_prop).start()
    ```
    ```
    hi
    ```
2. Also, it adds the resource leak fix into `InheritableThread`. Py4J
leaks the thread and does not close the connection from Python to JVM.
In `InheritableThread`, it manually closes the connections when PVM
garbage collection happens. So, JVM threads finish safely. I manually
verified by profiling but there's also another easy way to verify:
    ```bash
   PYSPARK_PIN_THREAD=true ./bin/pyspark
   ```
    ```python
   >>> from threading import Thread
   >>> Thread(target=lambda: spark.range(1000).collect()).start()
   >>> Thread(target=lambda: spark.range(1000).collect()).start()
   >>> Thread(target=lambda: spark.range(1000).collect()).start()
   >>> spark._jvm._gateway_client.deque
   deque([<py4j.clientserver.ClientServerConnection object at
0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at
0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at
0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at
0x11a015358>, <py4j.clientserver.ClientServerConnection object at
0x119fc00f0>])
   >>> Thread(target=lambda: spark.range(1000).collect()).start()
   >>> spark._jvm._gateway_client.deque
   deque([<py4j.clientserver.ClientServerConnection object at
0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at
0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at
0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at
0x11a015358>, <py4j.clientserver.ClientServerConnection object at
0x119fc08d0>, <py4j.clientserver.ClientServerConnection object at
0x119fc00f0>])
   ```
    This issue is fixed now.
3. Because now we have a fix for the issue here, it also proposes to
deprecate `collectWithJobGroup` which was a temporary workaround added
to avoid this leak issue.
### Why are the changes needed?
To support pinned thread mode properly without a resource leak, and to properly inherit local properties.
### Does this PR introduce _any_ user-facing change?
Yes, it adds an API `InheritableThread` class for pinned thread mode.
### How was this patch tested?
Manually tested as described above, and unit test was added as well.
Closes #28968 from HyukjinKwon/SPARK-32010.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 89d9b7cc64f01de9b3e88352d6a1979852873a77)
The file was modified python/pyspark/util.py (diff)
The file was modified docs/job-scheduling.md (diff)
The file was modified python/pyspark/__init__.py (diff)
The file was modified python/pyspark/rdd.py (diff)
The file was modified python/pyspark/context.py (diff)
The file was modified python/pyspark/tests/test_pin_thread.py (diff)
Commit 81b0785fb2d9a2d45d4366a58a3c30fe478c299a by huaxing
[SPARK-32455][ML] LogisticRegressionModel prediction optimization
### What changes were proposed in this pull request? For the binary `LogisticRegressionModel`: 1. Keep the variables `_threshold` and `_rawThreshold` instead of computing them for each instance; 2. In `raw2probabilityInPlace`, make use of the fact that the probabilities sum to 1.0.
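A hedged sketch of the second point for the binary case (a hypothetical standalone function, not the actual `raw2probabilityInPlace` implementation): apply the sigmoid once and derive the complementary probability as 1 - p.
```scala
// Hypothetical helper illustrating the binary-case shortcut.
def raw2probabilitySketch(rawPrediction: Array[Double]): Array[Double] = {
  // rawPrediction(1) holds the margin for the positive class in the binary case.
  val p1 = 1.0 / (1.0 + math.exp(-rawPrediction(1)))
  Array(1.0 - p1, p1)  // probabilities sum to 1.0, so only one sigmoid is needed
}
```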
### Why are the changes needed? For better performance.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Existing test suite and performance test in the REPL.
Closes #29255 from zhengruifeng/pred_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Huaxin
Gao <huaxing@us.ibm.com>
(commit: 81b0785fb2d9a2d45d4366a58a3c30fe478c299a)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (diff)
Commit 99a855575c3a5554443a27385caf49661cc7f139 by wenchen
[SPARK-32431][SQL] Check duplicate nested columns in read from in-built
datasources
### What changes were proposed in this pull request? When `spark.sql.caseSensitive` is `false` (the default), check that there are no duplicate column names on the same level (top level or nested levels) when reading from the built-in datasources Parquet, ORC, Avro and JSON. If such duplicate columns exist, throw the exception:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema:
```
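A hedged reproduction sketch (the path and field names are made up; assumes an active `SparkSession` named `spark`):
```scala
import org.apache.spark.sql.AnalysisException

spark.conf.set("spark.sql.caseSensitive", "false")

// Hypothetical JSON input whose nested struct has two fields differing only by case.
val path = "/tmp/dup-nested-columns.json"
import spark.implicits._
Seq("""{"StructColumn": {"camelcase": 0, "CamelCase": 1}}""")
  .toDF("value").write.mode("overwrite").text(path)

// After this change, reading the file fails instead of silently returning nulls:
// org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`
try {
  spark.read.json(path).show()
} catch {
  case e: AnalysisException => println(e.getMessage)
}
```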
### Why are the changes needed? To make handling of duplicate nested columns similar to handling of duplicate top-level columns, i.e. output the same error when `spark.sql.caseSensitive` is `false`:
```scala
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`
```
Checking of top-level duplicates was introduced by https://github.com/apache/spark/pull/17758.
### Does this PR introduce _any_ user-facing change? Yes. For the
example from SPARK-32431:
ORC:
```scala
java.io.IOException: Error reading file:
file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-c02c2f9a-0cdc-4859-94fc-b9c809ca58b1/part-00001-63e8c3f0-7131-4ec9-be02-30b3fdd276f4-c000.snappy.orc
at
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329)
at
org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
... Caused by: java.io.EOFException: Read past end of RLE integer from
compressed stream Stream for column 3 kind DATA position: 6 length: 6
range: 0 offset: 12 limit: 12 range 0 = 0 to 6 uncompressed: 3 to 3
at
org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
at
org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
```
JSON:
```scala
+------------+
|StructColumn|
+------------+
|        [,,]|
+------------+
```
Parquet:
```scala
+------------+
|StructColumn|
+------------+
|     [0,, 1]|
+------------+
```
Avro:
```scala
+------------+
|StructColumn|
+------------+
|        [,,]|
+------------+
```
After the changes, Parquet, ORC, JSON and Avro output the same error:
```scala
Found duplicate column(s) in the data schema: `camelcase`;
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the
data schema: `camelcase`;
at
org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:112)
at
org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:51)
at
org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:67)
```
### How was this patch tested? Run modified test suites:
```
$ build/sbt "sql/test:testOnly
org.apache.spark.sql.FileBasedDataSourceSuite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.*"
```
and added new UT to `SchemaUtilsSuite`.
Closes #29234 from MaxGekk/nested-case-insensitive-column.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 99a855575c3a5554443a27385caf49661cc7f139)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/util/SchemaUtilsSuite.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/NestedDataSourceSuite.scala
The file was modified docs/sql-migration-guide.md (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (diff)
Commit e1d73210341a314601a953e6ac483112660874e6 by gurwls223
[SPARK-32478][R][SQL] Error message to show the schema mismatch in
gapply with Arrow vectorization
### What changes were proposed in this pull request?
This PR proposes to:
1. Fix the error message when the output schema is mismatched with the R DataFrame from the given function. For example,
    ```R
   df <- createDataFrame(list(list(a=1L, b="2")))
   count(gapply(df, "a", function(key, group) { group }, structType("a
int, b int")))
   ```
    **Before:**
    ```
   Error in handleErrors(returnStatus, conn) :
     ...
     java.lang.UnsupportedOperationException
    ...
   ```
    **After:**
    ```
   Error in handleErrors(returnStatus, conn) :
    ...
    java.lang.AssertionError: assertion failed: Invalid schema from
gapply: expected IntegerType, IntegerType, got IntegerType, StringType
       ...
   ```
2. Update documentation about the schema matching for `gapply` and
`dapply`.
### Why are the changes needed?
To show which schema is not matched, and let users know what's going on.
### Does this PR introduce _any_ user-facing change?
Yes, error message is updated as above, and documentation is updated.
### How was this patch tested?
Manually tested and unit tests were added.
Closes #29283 from HyukjinKwon/r-vectorized-error.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: e1d73210341a314601a953e6ac483112660874e6)
The file was modified docs/sparkr.md (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala (diff)
The file was modified R/pkg/tests/fulltests/test_sparkSQL_arrow.R (diff)
Commit 510a1656e650246a708d3866c8a400b7a1b9f962 by wenchen
[SPARK-32412][SQL] Unify error handling for spark thrift server
operations
### What changes were proposed in this pull request?
Log error/warn message only once at the server-side for both sync and
async modes
### Why are the changes needed?
In
https://github.com/apache/spark/commit/b151194299f5ba15e0d9d8d7d2980fd164fe0822
we made the error logging for `SparkExecuteStatementOperation` with `runInBackground=true` not duplicated, but operations with `runInBackground=false` and other metadata operations are still logged twice, once in the operation's `runInternal` method and once in `ThriftCLIService`.
In this PR, I propose to refactor the logic to get a unified error handling approach.
### Does this PR introduce _any_ user-facing change?
Yes; when `spark.sql.hive.thriftServer.async=false` and people call sync APIs, the error message will be logged only once at the server side.
### How was this patch tested?
Locally verified the result in `target/unit-test.log` and added unit tests.
Closes #29204 from yaooqinn/SPARK-32412.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 510a1656e650246a708d3866c8a400b7a1b9f962)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkOperation.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerWithSparkContextSuite.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (diff)
Commit 30e3042dc5febf49123483184e6282fefde8ebc0 by wenchen
[SPARK-32488][SQL] Use @parser::members and @lexer::members to avoid
generating unused code
### What changes were proposed in this pull request?
This PR aims to update `SqlBase.g4` to avoid generating unused code.
Currently, ANTLR generates unused methods and variables; `isValidDecimal` and `isHint` are only used in the generated lexer. This PR changes the code to use `@parser::members` and `@lexer::members` to avoid it.
### Why are the changes needed?
To reduce unnecessary code.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29296 from maropu/UpdateSqlBase.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: 30e3042dc5febf49123483184e6282fefde8ebc0)
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala (diff)
Commit 1f7fe5415e88cc289b44e366cd4e74290784db5f by gurwls223
[SPARK-32491][INFRA] Do not install SparkR in test-only mode in testing
script
### What changes were proposed in this pull request?
This PR proposes to skip SparkR installation that is to run R linters
(see SPARK-8505) in the test-only mode at `dev/run-tests.py` script.
As of SPARK-32292, the test-only mode in `dev/run-tests.py` was
introduced, for example:
```
dev/run-tests.py --modules sql,core
```
which only runs the relevant tests and does not run other tests such as
linters. Therefore, we don't need to install SparkR when `--modules` are
specified.
### Why are the changes needed?
GitHub Actions build is currently failed as below:
```
ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
[error] running /home/runner/work/spark/spark/R/install-dev.sh ;
received return code 1
##[error]Process completed with exit code 10.
```
For some reason, it looks like GitHub Actions started to have R 3.4.4 installed by default; however, R 3.4 was dropped as of SPARK-32073. When SparkR tests are not needed, GitHub Actions still builds SparkR with a low R version, and this causes the test failure.
This PR partially fixes it by avoiding the installation of SparkR.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
GitHub Actions tests should run to confirm this fix is correct.
Closes #29300 from HyukjinKwon/install-r.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 1f7fe5415e88cc289b44e366cd4e74290784db5f)
The file was modified dev/run-tests.py (diff)
Commit e0c8bd07af6ea2873c77ae6428b3ab4ee68e8e32 by gurwls223
[SPARK-32493][INFRA] Manually install R instead of using setup-r in
GitHub Actions
### What changes were proposed in this pull request?
This PR proposes to manually install R instead of using `setup-r` which
seems broken. Currently, GitHub Actions uses its default R 3.4.4
installed, which we dropped as of SPARK-32073.
While I am here, I am also upgrading the R version to 4.0. Jenkins will test the old version and GitHub Actions will test the new version. AppVeyor uses R 4.0, but it does not check CRAN, which is important when we make a release.
### Why are the changes needed?
To recover GitHub Actions build.
### Does this PR introduce _any_ user-facing change?
No, dev-only
### How was this patch tested?
Manually tested at https://github.com/HyukjinKwon/spark/pull/15
Closes #29302 from HyukjinKwon/SPARK-32493.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: e0c8bd07af6ea2873c77ae6428b3ab4ee68e8e32)
The file was modified .github/workflows/master.yml (diff)
Commit 12f443cd99a91689dc5b44b6794205289ef2d998 by gurwls223
[SPARK-32496][INFRA] Include GitHub Action file as the changes in
testing
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/26556 excluded
`.github/workflows/master.yml`. So tests are skipped if the GitHub
Actions configuration file is changed.
As of SPARK-32245, we now run the regular tests via the testing script. We should include it in the tested changes to make sure the GitHub Actions build does not break due to changes such as Python versions.
### Why are the changes needed?
For better test coverage in GitHub Actions build.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
GitHub Actions in this PR will test.
Closes #29305 from HyukjinKwon/SPARK-32496.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 12f443cd99a91689dc5b44b6794205289ef2d998)
The file was modified dev/run-tests.py (diff)
Commit 743772095273b464f845efefb3eb59284b06b9be by gurwls223
[SPARK-32227] Fix regression bug in load-spark-env.cmd with Spark 3.0.0
### What changes were proposed in this pull request? Fix regression bug
in load-spark-env.cmd with Spark 3.0.0
### Why are the changes needed? cmd doesn't support setting an env variable twice, so `set SPARK_ENV_CMD=%SPARK_CONF_DIR%\%SPARK_ENV_CMD%` doesn't take effect, which caused the regression.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Manually tested. 1. Create a spark-env.cmd under the conf folder; inside it, `echo spark-env.cmd`. 2. Run the old load-spark-env.cmd: nothing is printed in the output. 3. Run the fixed load-spark-env.cmd: `spark-env.cmd` shows up in the output.
Closes #29044 from warrenzhu25/32227.
Lead-authored-by: Warren Zhu <zhonzh@microsoft.com> Co-authored-by:
Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 743772095273b464f845efefb3eb59284b06b9be)
The file was modified bin/load-spark-env.cmd (diff)
Commit 32f4ef005fd590e0e7c319b43a459cb3828bba5a by gurwls223
[SPARK-32497][INFRA] Installs qpdf package for CRAN check in GitHub
Actions
### What changes were proposed in this pull request?
CRAN check fails due to the size of the generated PDF docs as below:
```
...
WARNING
‘qpdf’ is needed for checks on size reduction of PDFs
...
Status: 1 WARNING, 1 NOTE
See ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’ for details.
```
This PR proposes to install `qpdf` in GitHub Actions.
Note that I cannot reproduce this locally with the same R version, so I am not documenting it for now.
Also, while I am here, I piggyback on this change to install SparkR when the modules include `sparkr`; it is rather a follow-up of SPARK-32491.
### Why are the changes needed?
To fix SparkR CRAN check failure.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
GitHub Actions will test it out.
Closes #29306 from HyukjinKwon/SPARK-32497.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 32f4ef005fd590e0e7c319b43a459cb3828bba5a)
The file was modified dev/run-tests.py (diff)
The file was modified .github/workflows/master.yml (diff)
Commit 7cf3b54a2a7528e815841015af50e08ce4515cb9 by dongjoon
[SPARK-32489][CORE] Pass `core` module UTs in Scala 2.13
### What changes were proposed in this pull request?
So far, we have fixed many things in the `core` module. This PR fixes the remaining UT failures in Scala 2.13.
- `OneApplicationResource.environmentInfo` will return a deterministic
result for `sparkProperties`, `hadoopProperties`, `systemProperties`,
and `classpathEntries`.
- `SubmitRestProtocolSuite` has Scala 2.13 answer in addition to the
existing Scala 2.12 answer, and uses the expected answer based on the
Scala runtime version.
### Why are the changes needed?
To support Scala 2.13.
### Does this PR introduce _any_ user-facing change?
Yes, `environmentInfo` is changed, but this fixes the nondeterministic behavior.
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Do the following.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13
```
**BEFORE**
```
Tests: succeeded 2612, failed 3, canceled 1, ignored 8, pending 0
*** 3 TESTS FAILED ***
```
**AFTER**
```
Tests: succeeded 2615, failed 0, canceled 1, ignored 8, pending 0
All tests passed.
```
Closes #29298 from dongjoon-hyun/SPARK-32489.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 7cf3b54a2a7528e815841015af50e08ce4515cb9)
The file was modified core/src/main/scala/org/apache/spark/status/api/v1/OneApplicationResource.scala (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/app_environment_expectation.json (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/rest/SubmitRestProtocolSuite.scala (diff)
Commit 366a1789333bac97643159857a206bcd773761a4 by hkarau
[SPARK-32199][SPARK-32198] Reduce job failures during decommissioning
### What changes were proposed in this pull request?
This PR reduces the prospect of a job loss during decommissioning. It
fixes two holes in the current decommissioning framework:
- (a) Loss of decommissioned executors is not treated as a job failure:
We know that the decommissioned executor would be dying soon, so its
death is clearly not caused by the application.
- (b) Shuffle files on the decommissioned host are cleared when the
first fetch failure is detected from a decommissioned host: This is a
bit tricky in terms of when to clear the shuffle state? Ideally you
want to clear it the millisecond before the shuffle service on the node
dies (or the executor dies when there is no external shuffle service) --
too soon and it could lead to some wastage and too late would lead to
fetch failures.
  The approach here is to do this clearing when the very first fetch
failure is observed on the decommissioned block manager, without waiting
for other blocks to also signal a failure.
### Why are the changes needed?
Without them decommissioning a lot of executors at a time leads to job
failures.
### Code overview
The task scheduler tracks the executors that were decommissioned along
with their
`ExecutorDecommissionInfo`. This information is used (a) for handling an `ExecutorProcessLost` error, or (b) by the `DAGScheduler` when handling a fetch failure.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new unit test `DecommissionWorkerSuite` to test the new behavior
by exercising the Master-Worker decommissioning. I chose to add a new
test since the setup logic was quite different from the existing
`WorkerDecommissionSuite`. I am open to changing the name of the newly
added test suite :-)
### Questions for reviewers
- Should I add a feature flag to guard these two behaviors ? They seem
safe to me that they should only get triggered by decommissioning, but
you never know :-).
Closes #29014 from agrawaldevesh/decom_harden.
Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com> Signed-off-by:
Holden Karau <hkarau@apple.com>
(commit: 366a1789333bac97643159857a206bcd773761a4)
The file was modified core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala (diff)
The file was added core/src/test/scala/org/apache/spark/deploy/DecommissionWorkerSuite.scala
The file was modified core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/ExecutorLossReason.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/scheduler/ExternalClusterManagerSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala (diff)
Commit 6032c5b0320fe70455586f4ce863d5d9361b5e07 by hkarau
[SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite
### What changes were proposed in this pull request?
This PR tries to fix the flakiness of `BlockManagerDecommissionIntegrationSuite`.
### Description of the problem
Make the block manager decommissioning test be less flaky
An interesting failure happens when migrateDuring = true (and persist or
shuffle is true):
- We schedule the job with tasks on executors 0, 1, 2.
- We wait 300 ms and decommission executor 0.
- If the task is not yet done on executor 0, it will now fail because
  the block manager won't be able to save the block. This condition is
  easy to trigger on a loaded machine where the GitHub checks run.
- The task will retry on a different executor (1 or 2) and its shuffle
  blocks will land there.
- No actual block migration happens here because the decommissioned
  executor technically failed before it could even produce a block.
To remove the above race, this change replaces the fixed wait for 300 ms
to wait for an actual task to succeed. When a task has succeeded, we
know its blocks would have been written for sure and thus its executor
would certainly be forced to migrate those blocks when it is
decommissioned.
The change always decommissions an executor on which a real task
finished successfully instead of picking the first executor. Because the
system may choose to schedule nothing on the first executor and instead
run the two tasks on one executor.
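A minimal sketch of the waiting pattern (a hypothetical listener class, not the suite's actual code): instead of sleeping a fixed 300 ms, register a listener, block until some task has finished successfully, and decommission the executor that ran it.
```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener used only to illustrate the approach.
class FirstSuccessfulTaskListener extends SparkListener {
  @volatile var executorToDecommission: Option[String] = None
  private val firstSuccess = new CountDownLatch(1)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.taskInfo.successful && executorToDecommission.isEmpty) {
      executorToDecommission = Some(taskEnd.taskInfo.executorId)
      firstSuccess.countDown()
    }
  }

  // Returns true if a task succeeded within the timeout; the test would then
  // decommission executorToDecommission instead of an arbitrary executor.
  def awaitFirstSuccess(timeoutSeconds: Long): Boolean =
    firstSuccess.await(timeoutSeconds, TimeUnit.SECONDS)
}
```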
### Why are the changes needed?
I have had bad luck with BlockManagerDecommissionIntegrationSuite and it
has failed several times on my PRs. So fixing it.
### Does this PR introduce _any_ user-facing change?
No, unit test only change.
### How was this patch tested?
Github checks. Ran this test 100 times, 10 at a time in parallel in a
script.
Closes #29226 from agrawaldevesh/block-manager-decom-flaky.
Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com> Signed-off-by:
Holden Karau <hkarau@apple.com>
(commit: 6032c5b0320fe70455586f4ce863d5d9361b5e07)
The file was modified core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala (diff)
Commit 9d7b1d935f7a2b770d8b2f264cfe4a4db2ad64b6 by sarutak
[SPARK-32175][SPARK-32175][FOLLOWUP] Remove flaky test added in
### What changes were proposed in this pull request?
This PR removes a test added in SPARK-32175(#29002).
### Why are the changes needed?
That test is flaky. It can be mitigated by increasing the timeout, but it would be simpler to remove the test. See also the
[discussion](https://github.com/apache/spark/pull/29002#issuecomment-666746857).
### Does this PR introduce _any_ user-facing change?
No.
Closes #29314 from sarutak/remove-flaky-test.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by:
Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: 9d7b1d935f7a2b770d8b2f264cfe4a4db2ad64b6)
The file was modified core/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala (diff)
Commit f6027827a49a52f6586f73aee9e4067659f650b6 by kabhwan.opensource
[SPARK-32482][SS][TESTS] Eliminate deprecated poll(long) API calls to
avoid infinite wait in tests
### What changes were proposed in this pull request? Structured Streaming Kafka connector tests are now using a deprecated `poll(long)` API which could cause an infinite wait. In this PR I've eliminated these calls and replaced them with `AdminClient`.
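A hedged sketch of the replacement pattern (broker address, topic, and timeout are examples; assumes Kafka clients 2.5+ where `AdminClient.listOffsets` is available): query offsets through `AdminClient` with an explicit timeout instead of blocking in a consumer `poll(long)` call.
```scala
import java.util.Properties
import java.util.concurrent.TimeUnit
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, OffsetSpec}
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // example broker
val admin = AdminClient.create(props)
try {
  val tp = new TopicPartition("test-topic", 0)  // example topic/partition
  val offsetSpecs = new java.util.HashMap[TopicPartition, OffsetSpec]()
  offsetSpecs.put(tp, OffsetSpec.latest())
  // KafkaFuture.get with a timeout fails fast instead of waiting forever.
  val latest = admin.listOffsets(offsetSpecs).partitionResult(tp).get(30, TimeUnit.SECONDS)
  println(s"latest offset for $tp: ${latest.offset()}")
} finally {
  admin.close()
}
```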
### Why are the changes needed? Deprecated `poll(long)` API calls.
### Does this PR introduce _any_ user-facing change? No.
### How was this patch tested? Existing unit tests.
Closes #29289 from gaborgsomogyi/SPARK-32482.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by:
Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: f6027827a49a52f6586f73aee9e4067659f650b6)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala (diff)
Commit ae82768c1396bfe626b52c4eac33241a9eb91f54 by wenchen
[SPARK-32421][SQL] Add code-gen for shuffled hash join
### What changes were proposed in this pull request?
Adding codegen for shuffled hash join. Shuffled hash join codegen is very similar to broadcast hash join codegen, so most of the code change is to refactor the existing codegen in `BroadcastHashJoinExec` into `HashJoin`.
Example codegen for query in
[`JoinBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala#L153):
```
def shuffleHashJoin(): Unit = {
  val N: Long = 4 << 20
  withSQLConf(
    SQLConf.SHUFFLE_PARTITIONS.key -> "2",
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "10000000",
    SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
    codegenBenchmark("shuffle hash join", N) {
      val df1 = spark.range(N).selectExpr(s"id as k1")
      val df2 = spark.range(N / 3).selectExpr(s"id * 3 as k2")
      val df = df1.join(df2, col("k1") === col("k2"))
      assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[ShuffledHashJoinExec]).isDefined)
      df.noop()
    }
  }
}
```
Shuffled hash join codegen:
```
== Subtree 3 / 3 (maxMethodCodeSize:113; maxConstantPoolSize:126(0.19%
used); numInnerClasses:0) ==
*(3) ShuffledHashJoin [k1#2L], [k2#6L], Inner, BuildRight
:- *(1) Project [id#0L AS k1#2L]
:  +- *(1) Range (0, 4194304, step=1, splits=1)
+- *(2) Project [(id#4L * 3) AS k2#6L]
  +- *(2) Range (0, 1398101, step=1, splits=1)
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage3(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=3
/* 006 */ final class GeneratedIteratorForCodegenStage3 extends
org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private scala.collection.Iterator inputadapter_input_0;
/* 010 */   private org.apache.spark.sql.execution.joins.HashedRelation
shj_relation_0;
/* 011 */   private
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[]
shj_mutableStateArray_0 = new
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
/* 012 */
/* 013 */   public GeneratedIteratorForCodegenStage3(Object[]
references) {
/* 014 */     this.references = references;
/* 015 */   }
/* 016 */
/* 017 */   public void init(int index, scala.collection.Iterator[]
inputs) {
/* 018 */     partitionIndex = index;
/* 019 */     this.inputs = inputs;
/* 020 */     inputadapter_input_0 = inputs[0];
/* 021 */     shj_relation_0 =
((org.apache.spark.sql.execution.joins.ShuffledHashJoinExec)
references[0] /* plan */).buildHashedRelation(inputs[1]);
/* 022 */     shj_mutableStateArray_0[0] = new
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0);
/* 023 */
/* 024 */   }
/* 025 */
/* 026 */   private void shj_doConsume_0(InternalRow inputadapter_row_0,
long shj_expr_0_0) throws java.io.IOException {
/* 027 */     // generate join key for stream side
/* 028 */
/* 029 */     // find matches from HashRelation
/* 030 */     scala.collection.Iterator shj_matches_0 = false ?
/* 031 */     null :
(scala.collection.Iterator)shj_relation_0.get(shj_expr_0_0);
/* 032 */     if (shj_matches_0 != null) {
/* 033 */       while (shj_matches_0.hasNext()) {
/* 034 */         UnsafeRow shj_matched_0 = (UnsafeRow)
shj_matches_0.next();
/* 035 */         {
/* 036 */           ((org.apache.spark.sql.execution.metric.SQLMetric)
references[1] /* numOutputRows */).add(1);
/* 037 */
/* 038 */           long shj_value_1 = shj_matched_0.getLong(0);
/* 039 */           shj_mutableStateArray_0[0].reset();
/* 040 */
/* 041 */           shj_mutableStateArray_0[0].write(0, shj_expr_0_0);
/* 042 */
/* 043 */           shj_mutableStateArray_0[0].write(1, shj_value_1);
/* 044 */         
append((shj_mutableStateArray_0[0].getRow()).copy());
/* 045 */
/* 046 */         }
/* 047 */       }
/* 048 */     }
/* 049 */
/* 050 */   }
/* 051 */
/* 052 */   protected void processNext() throws java.io.IOException {
/* 053 */     while ( inputadapter_input_0.hasNext()) {
/* 054 */       InternalRow inputadapter_row_0 = (InternalRow)
inputadapter_input_0.next();
/* 055 */
/* 056 */       long inputadapter_value_0 =
inputadapter_row_0.getLong(0);
/* 057 */
/* 058 */       shj_doConsume_0(inputadapter_row_0,
inputadapter_value_0);
/* 059 */       if (shouldStop()) return;
/* 060 */     }
/* 061 */   }
/* 062 */
/* 063 */ }
```
Broadcast hash join codegen for the same query (for reference here):
```
== Subtree 2 / 2 (maxMethodCodeSize:280; maxConstantPoolSize:218(0.33%
used); numInnerClasses:0) ==
*(2) BroadcastHashJoin [k1#2L], [k2#6L], Inner, BuildRight, false
:- *(2) Project [id#0L AS k1#2L]
:  +- *(2) Range (0, 4194304, step=1, splits=1)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint,
false]),false), [id=#22]
  +- *(1) Project [(id#4L * 3) AS k2#6L]
     +- *(1) Range (0, 1398101, step=1, splits=1)
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage2(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=2
/* 006 */ final class GeneratedIteratorForCodegenStage2 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private boolean range_initRange_0;
/* 010 */   private long range_nextIndex_0;
/* 011 */   private TaskContext range_taskContext_0;
/* 012 */   private InputMetrics range_inputMetrics_0;
/* 013 */   private long range_batchEnd_0;
/* 014 */   private long range_numElementsTodo_0;
/* 015 */   private org.apache.spark.sql.execution.joins.LongHashedRelation bhj_relation_0;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[4];
/* 017 */
/* 018 */   public GeneratedIteratorForCodegenStage2(Object[] references) {
/* 019 */     this.references = references;
/* 020 */   }
/* 021 */
/* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 023 */     partitionIndex = index;
/* 024 */     this.inputs = inputs;
/* 025 */
/* 026 */     range_taskContext_0 = TaskContext.get();
/* 027 */     range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 028 */     range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 029 */     range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 030 */     range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 031 */
/* 032 */     bhj_relation_0 = ((org.apache.spark.sql.execution.joins.LongHashedRelation) ((org.apache.spark.broadcast.TorrentBroadcast) references[1] /* broadcast */).value()).asReadOnlyCopy();
/* 033 */     incPeakExecutionMemory(bhj_relation_0.estimatedSize());
/* 034 */
/* 035 */     range_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0);
/* 036 */
/* 037 */   }
/* 038 */
/* 039 */   private void initRange(int idx) {
/* 040 */     java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
/* 041 */     java.math.BigInteger numSlice = java.math.BigInteger.valueOf(1L);
/* 042 */     java.math.BigInteger numElement = java.math.BigInteger.valueOf(4194304L);
/* 043 */     java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
/* 044 */     java.math.BigInteger start = java.math.BigInteger.valueOf(0L);
/* 045 */     long partitionEnd;
/* 046 */
/* 047 */     java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
/* 048 */     if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 049 */       range_nextIndex_0 = Long.MAX_VALUE;
/* 050 */     } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 051 */       range_nextIndex_0 = Long.MIN_VALUE;
/* 052 */     } else {
/* 053 */       range_nextIndex_0 = st.longValue();
/* 054 */     }
/* 055 */     range_batchEnd_0 = range_nextIndex_0;
/* 056 */
/* 057 */     java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
/* 058 */     .multiply(step).add(start);
/* 059 */     if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 060 */       partitionEnd = Long.MAX_VALUE;
/* 061 */     } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 062 */       partitionEnd = Long.MIN_VALUE;
/* 063 */     } else {
/* 064 */       partitionEnd = end.longValue();
/* 065 */     }
/* 066 */
/* 067 */     java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract(
/* 068 */       java.math.BigInteger.valueOf(range_nextIndex_0));
/* 069 */     range_numElementsTodo_0 = startToEnd.divide(step).longValue();
/* 070 */     if (range_numElementsTodo_0 < 0) {
/* 071 */       range_numElementsTodo_0 = 0;
/* 072 */     } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) {
/* 073 */       range_numElementsTodo_0++;
/* 074 */     }
/* 075 */   }
/* 076 */
/* 077 */   private void bhj_doConsume_0(long bhj_expr_0_0) throws java.io.IOException {
/* 078 */     // generate join key for stream side
/* 079 */
/* 080 */     // find matches from HashedRelation
/* 081 */     UnsafeRow bhj_matched_0 = false ? null: (UnsafeRow)bhj_relation_0.getValue(bhj_expr_0_0);
/* 082 */     if (bhj_matched_0 != null) {
/* 083 */       {
/* 084 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* numOutputRows */).add(1);
/* 085 */
/* 086 */         long bhj_value_2 = bhj_matched_0.getLong(0);
/* 087 */         range_mutableStateArray_0[3].reset();
/* 088 */
/* 089 */         range_mutableStateArray_0[3].write(0, bhj_expr_0_0);
/* 090 */
/* 091 */         range_mutableStateArray_0[3].write(1, bhj_value_2);
/* 092 */         append((range_mutableStateArray_0[3].getRow()));
/* 093 */
/* 094 */       }
/* 095 */     }
/* 096 */
/* 097 */   }
/* 098 */
/* 099 */   protected void processNext() throws java.io.IOException {
/* 100 */     // initialize Range
/* 101 */     if (!range_initRange_0) {
/* 102 */       range_initRange_0 = true;
/* 103 */       initRange(partitionIndex);
/* 104 */     }
/* 105 */
/* 106 */     while (true) {
/* 107 */       if (range_nextIndex_0 == range_batchEnd_0) {
/* 108 */         long range_nextBatchTodo_0;
/* 109 */         if (range_numElementsTodo_0 > 1000L) {
/* 110 */           range_nextBatchTodo_0 = 1000L;
/* 111 */           range_numElementsTodo_0 -= 1000L;
/* 112 */         } else {
/* 113 */           range_nextBatchTodo_0 = range_numElementsTodo_0;
/* 114 */           range_numElementsTodo_0 = 0;
/* 115 */           if (range_nextBatchTodo_0 == 0) break;
/* 116 */         }
/* 117 */         range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
/* 118 */       }
/* 119 */
/* 120 */       int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L);
/* 121 */       for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) {
/* 122 */         long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0;
/* 123 */
/* 124 */         bhj_doConsume_0(range_value_0);
/* 125 */
/* 126 */         if (shouldStop()) {
/* 127 */           range_nextIndex_0 = range_value_0 + 1L;
/* 128 */           ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localIdx_0 + 1);
/* 129 */           range_inputMetrics_0.incRecordsRead(range_localIdx_0 + 1);
/* 130 */           return;
/* 131 */         }
/* 132 */
/* 133 */       }
/* 134 */       range_nextIndex_0 = range_batchEnd_0;
/* 135 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0);
/* 136 */       range_inputMetrics_0.incRecordsRead(range_localEnd_0);
/* 137 */       range_taskContext_0.killTaskIfInterrupted();
/* 138 */     }
/* 139 */   }
/* 140 */
/* 141 */ }
```
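For reference, a codegen dump in the format shown above can be printed with Spark's debug helpers. The following is a minimal sketch, assuming a local session and an illustrative join query rather than the exact one used in the PR:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._  // adds debugCodegen() to Dataset

object CodegenDumpSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("codegen-dump")
      .getOrCreate()
    import spark.implicits._

    // Join two ranges so the physical plan contains a hash join stage.
    val left = spark.range(4 * 1024 * 1024).select($"id".as("k1"))
    val right = spark.range(1024).select($"id".as("k2"))
    val joined = left.join(right, $"k1" === $"k2")

    // Prints the whole-stage-generated Java code, including the
    // GeneratedIteratorForCodegenStage classes like the one above.
    joined.debugCodegen()

    spark.stop()
  }
}
```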
### Why are the changes needed?
Code generation for shuffled hash join helps save CPU cost. We added shuffled hash join codegen internally in our fork and saw a clear improvement in benchmarks compared to the current non-codegen code path.
Testing the example query in [`JoinBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala#L153) shows a 30% wall-clock time improvement over the existing non-codegen code path:
Enable shuffled hash join code-gen:
```
Running benchmark: shuffle hash join
  Running case: shuffle hash join wholestage off
  Stopped after 2 iterations, 1358 ms
  Running case: shuffle hash join wholestage on
  Stopped after 5 iterations, 2323 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join wholestage off                    649            679          43          6.5         154.7       1.0X
shuffle hash join wholestage on                     436            465          45          9.6         103.9       1.5X
```
Disable shuffled hash join codegen:
```
Running benchmark: shuffle hash join
  Running case: shuffle hash join wholestage off
  Stopped after 2 iterations, 1345 ms
  Running case: shuffle hash join wholestage on
  Stopped after 5 iterations, 2967 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join wholestage off                    646            673          37          6.5         154.1       1.0X
shuffle hash join wholestage on                     549            594          47          7.6         130.9       1.2X
```
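To exercise the shuffled hash join path outside the benchmark, a rough sketch is shown below. The `shuffle_hash` join hint and the `spark.sql.codegen.wholeStage` flag are standard Spark 3.x knobs (the latter is what the "wholestage on/off" rows above toggle); the table sizes here are illustrative only, not the benchmark's:

```
import org.apache.spark.sql.SparkSession

object ShuffledHashJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("shj-codegen")
      // Whole-stage codegen on; set to "false" to compare against the non-codegen path.
      .config("spark.sql.codegen.wholeStage", "true")
      .getOrCreate()

    val left = spark.range(10000000L).selectExpr("id % 1000 AS k", "id AS v1")
    val right = spark.range(1000L).selectExpr("id AS k", "id AS v2")

    // The shuffle_hash hint asks the planner for a shuffled hash join,
    // so the new codegen path (where supported) is exercised.
    val joined = left.join(right.hint("shuffle_hash"), "k")

    joined.explain(true) // the physical plan should show a ShuffledHashJoin
    println(joined.count())

    spark.stop()
  }
}
```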
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `WholeStageCodegenSuite`.
Closes #29277 from c21/codegen.
Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: ae82768c1396bfe626b52c4eac33241a9eb91f54)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala (diff)
Commit 813532d10310027fee9e12680792cee2e1c2b7c7 by kabhwan.opensource
[SPARK-32468][SS][TESTS] Fix timeout config issue in Kafka connector
tests
### What changes were proposed in this pull request?
While implementing SPARK-32032 I found a bug in Kafka:
https://issues.apache.org/jira/browse/KAFKA-10318. It will only cause issues later, once that bug is fixed, but it is worth addressing now because SPARK-32032 brings in `AdminClient`, where the code blows up with the mentioned `ConfigException`, and fixing it here reduces the code changes needed in that JIRA. In this PR I changed `default.api.timeout.ms` to `request.timeout.ms`, which fulfils this condition.
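For context, Kafka consumer settings such as `request.timeout.ms` reach the connector through `kafka.`-prefixed source options. A minimal sketch follows; the broker address and topic name are placeholders, not values from the affected test suites:

```
import org.apache.spark.sql.SparkSession

object KafkaTimeoutOptionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("kafka-timeout")
      .getOrCreate()

    // Options prefixed with "kafka." are passed to the underlying Kafka consumer,
    // so request.timeout.ms is used here instead of default.api.timeout.ms.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test-topic")
      .option("kafka.request.timeout.ms", "30000")
      .load()

    val query = df.selectExpr("CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```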
### Why are the changes needed?
It avoids problems later and reduces the size of the SPARK-32032 PR.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes #29272 from gaborgsomogyi/SPARK-32468.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by:
Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 813532d10310027fee9e12680792cee2e1c2b7c7)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala (diff)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala (diff)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDontFailOnDataLossSuite.scala (diff)
Commit 8014b0b5d61237dc4851d4ae9927778302d692da by gurwls223
[SPARK-32160][CORE][PYSPARK] Add a config to switch allow/disallow to
create SparkContext in executors
### What changes were proposed in this pull request?
This is a follow-up of #28986. This PR adds a config to switch
allow/disallow to create `SparkContext` in executors.
- `spark.driver.allowSparkContextInExecutors`
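A minimal sketch of where this flag would be set, using the config name quoted above; the job body here is only to make the sketch runnable and does not itself create a SparkContext on an executor:

```
import org.apache.spark.{SparkConf, SparkContext}

object AllowSparkContextInExecutorsSketch {
  def main(args: Array[String]): Unit = {
    // The flag below (name taken from the description above) re-enables the
    // legacy pattern of creating a SparkContext inside executor-side code.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("allow-sc-in-executors")
      .set("spark.driver.allowSparkContextInExecutors", "true")

    val sc = new SparkContext(conf)
    try {
      // Ordinary driver-side job, shown only for completeness.
      val sum = sc.parallelize(1 to 100, 2).reduce(_ + _)
      println(s"sum = $sum")
    } finally {
      sc.stop()
    }
  }
}
```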
### Why are the changes needed?
Some users or libraries actually create `SparkContext` in executors. We
shouldn't break their workloads.
### Does this PR introduce _any_ user-facing change?
Yes, users will be able to create `SparkContext` in executors with the
config enabled.
### How was this patch tested?
More tests are added.
Closes #29278 from ueshin/issues/SPARK-32160/add_configs.
Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 8014b0b5d61237dc4851d4ae9927778302d692da)
The file was modified python/pyspark/tests/test_context.py (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/SparkContextSuite.scala (diff)
The file was modified docs/core-migration-guide.md (diff)
The file was modified core/src/main/scala/org/apache/spark/SparkContext.scala (diff)
The file was modified python/pyspark/context.py (diff)
Commit f4800406a455b33afa7b1d62d10f236da4cd1f83 by gurwls223
[SPARK-32406][SQL][FOLLOWUP] Make RESET fail against static and core
configs
### What changes were proposed in this pull request?
This followup addresses comments from
https://github.com/apache/spark/pull/29202#discussion_r462054784
1. Make RESET on static SQL configs and Spark core configs fail, the same as the SET command. Note that core configs have to be pre-registered; otherwise they can still be SET/RESET.
2. Add test cases for configurations with optional default values.
### Why are the changes needed?
Behavior change, following suggestions from PMCs.
### Does this PR introduce _any_ user-facing change?
Yes, RESET will fail after this PR; before, it just did nothing because the static ones are static.
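As a rough illustration of the behavior change, assuming `spark.sql.warehouse.dir` as a representative static SQL config (the per-key RESET syntax comes from the parent SPARK-32406 change, and the exact error type is not spelled out in this description):

```
import org.apache.spark.sql.SparkSession

object ResetStaticConfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("reset-static-conf")
      .getOrCreate()

    // Mutable SQL configs can still be SET and RESET freely.
    spark.sql("SET spark.sql.shuffle.partitions=10")
    spark.sql("RESET spark.sql.shuffle.partitions")

    // spark.sql.warehouse.dir is a static SQL config; after this change
    // RESET on it should fail instead of silently doing nothing.
    try {
      spark.sql("RESET spark.sql.warehouse.dir")
    } catch {
      case e: Exception => println(s"RESET rejected: ${e.getMessage}")
    }

    spark.stop()
  }
}
```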
### How was this patch tested?
Added more tests.
Closes #29297 from yaooqinn/SPARK-32406-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: f4800406a455b33afa7b1d62d10f236da4cd1f83)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/SetCommand.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala (diff)
Commit 4eaf3a0a23d66f234e98062852223d81ec770fbe by gurwls223
[SPARK-31418][CORE][FOLLOW-UP][MINOR] Fix log messages to print stage id
instead of the object name
### What changes were proposed in this pull request?
Just a few log-line fixes where the object name was being logged instead of the stage ID.
### Why are the changes needed?
This makes debugging easier later.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Just log messages; existing tests should be enough.
Closes #29279 from venkata91/SPARK-31418-follow-up.
Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 4eaf3a0a23d66f234e98062852223d81ec770fbe)
The file was modified core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (diff)
Commit 354313b6bc89149f97b7cebf6249abd9e3e87724 by wenchen
[SPARK-31894][SS][FOLLOW-UP] Rephrase the config doc
### What changes were proposed in this pull request?
Address comment in https://github.com/apache/spark/pull/28707#discussion_r461102749
### Why are the changes needed?
Hide the implementation details in the config doc.
### Does this PR introduce _any_ user-facing change?
Config doc change.
### How was this patch tested?
Document only.
Closes #29315 from xuanyuanking/SPARK-31894-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: 354313b6bc89149f97b7cebf6249abd9e3e87724)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit 1c6dff7b5fc171c190feea0d8f7d323e330d9151 by wenchen
[SPARK-32083][SQL] AQE coalesce should at least return one partition
### What changes were proposed in this pull request?
This PR updates the AQE framework to at least return one partition
during coalescing.
This PR also updates `ShuffleExchangeExec.canChangeNumPartitions` to not
coalesce for `SinglePartition`.
### Why are the changes needed?
It's a bit risky to return 0 partitions, as sometimes that's different from empty data. For example, a global aggregate will return one result row even if the input table is empty. If there are 0 partitions, no task will be run and no result will be returned. More specifically, the global aggregate requires `AllTuples`, so we can't coalesce to 0 partitions.
This is not a real bug for now. The global aggregate will be planned as partial and final physical agg nodes. The partial agg will return at least one row, so the shuffle still has data. But it's better to fix this issue to avoid potential bugs in the future.
According to https://github.com/apache/spark/pull/28916, this change also fixes some perf problems.
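Conceptually, the fix is a lower bound of one on the coalesced partition count. The following standalone sketch illustrates the idea only; it is not the actual `ShufflePartitionsUtil` implementation:

```
object CoalesceAtLeastOneSketch {
  /** Illustrative only: group shuffle partitions by size up to a target,
    * but never return zero partitions, even for an empty shuffle. */
  def coalescedPartitionCount(partitionSizes: Seq[Long], targetSize: Long): Int = {
    var groups = 0
    var current = 0L
    partitionSizes.foreach { size =>
      if (groups == 0 || current + size > targetSize) {
        // Start a new coalesced partition.
        groups += 1
        current = size
      } else {
        current += size
      }
    }
    // The guard this PR adds conceptually: an empty shuffle still yields one
    // partition, so plans requiring AllTuples (e.g. global aggregate) run a task.
    math.max(groups, 1)
  }

  def main(args: Array[String]): Unit = {
    println(coalescedPartitionCount(Seq.empty[Long], targetSize = 64))      // 1, not 0
    println(coalescedPartitionCount(Seq(10L, 20L, 100L), targetSize = 64))  // 2
  }
}
```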
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated test.
Closes #29307 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: 1c6dff7b5fc171c190feea0d8f7d323e330d9151)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeLocalShuffleReader.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)