Changes

Summary

  1. [SPARK-35899][SQL] Utility to convert connector expressions to Catalyst (commit: 63cd131) (details)
  2. [SPARK-35466][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.data_type_ops.* (commit: a9ebfc5) (details)
  3. [SPARK-35879][CORE][SHUFFLE] Fix performance regression caused by collectFetchRequests (commit: 14d4dec) (details)
  4. [SPARK-35895][SQL] Support subtracting Intervals from TimestampWithoutTZ (commit: 645fb59) (details)
  5. [SPARK-35903][TESTS] Parameterize 'master' in TPCDSQueryBenchmark (commit: f68fbae) (details)
  6. [SPARK-35905][SQL][TESTS] Fix UT to clean up table/view in SQLQuerySuite (commit: 74637a6) (details)
  7. [SPARK-35904][SQL] Collapse above RebalancePartitions (commit: def29e5) (details)
  8. [SPARK-35893][TESTS] Add unit test case for MySQLDialect.getCatalystType (commit: b11b175) (details)
  9. [SPARK-35886][SQL] PromotePrecision should not overwrite genCode (commit: b89cd8d) (details)
  10. [SPARK-35909][DOCS] Fix broken Python Links in docs/sql-getting-started.md (commit: a7369b3) (details)
  11. [SPARK-35605][PYTHON] Move to_pandas_on_spark to the Spark DataFrame (commit: 03e6de2) (details)
  12. [SPARK-35901][PYTHON] Refine type hints in pyspark.pandas.window (commit: 8c401be) (details)
  13. [SPARK-35880][SS] Track the duplicates dropped count in dedupe operator (commit: 0da463e) (details)
  14. [SPARK-35318][SQL][FOLLOWUP] Hide the internal view properties for `show tblproperties` (commit: 378ac78) (details)
  15. [SPARK-35064][SQL] Group error in spark-catalyst (commit: 1c81ad2) (details)
  16. [SPARK-35258][SHUFFLE][YARN] Add new metrics to ExternalShuffleService for better monitoring (commit: 3255511) (details)
  17. Revert "[SPARK-35904][SQL] Collapse above RebalancePartitions" (commit: 108635a) (details)
  18. [SPARK-35728][SPARK-35778][SQL][TESTS] Check multiply/divide of day-time and year-month interval of any fields by a numeric (commit: 356aef4) (details)
  19. [SPARK-35898][SQL] Fix arrays and maps in RowToColumnConverter (commit: c660650) (details)
  20. [SPARK-35910][CORE][SHUFFLE] Update remoteBlockBytes based on merged block info to reduce task time (commit: 9c157a4) (details)
  21. [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark (commit: 5f0113e) (details)
  22. [SPARK-33898][SQL] Support SHOW CREATE TABLE In V2 (commit: 8fbbd2e) (details)
  23. [SPARK-35888][SQL] Add dataSize field in CoalescedPartitionSpec (commit: 622fc68) (details)
  24. [SPARK-34302][SQL] Migrate ALTER TABLE ... CHANGE COLUMN command to use UnresolvedTable to resolve the identifier (commit: 620fde4) (details)
  25. [SPARK-35876][SQL] ArraysZip should retain field names to avoid being re-written by analyzer/optimizer (commit: 880bbd6) (details)
  26. [SPARK-35920][BUILD] Upgrade to Chill 0.10.0 (commit: b999e6b) (details)
  27. [SPARK-35922][BUILD] Upgrade maven-shade-plugin to 3.2.4 (commit: c45a6f5) (details)
  28. [SPARK-35899][SQL][FOLLOWUP] Utility to convert connector expressions to Catalyst (commit: 8a21d2d) (details)
  29. [SPARK-35483][FOLLOWUP][TESTS] Enable docker_integration_tests for (commit: 57896e6) (details)
  30. [SPARK-34302][FOLLOWUP][SQL][TESTS] Update jdbc.v2.*IntegrationSuite (commit: 16e5035) (details)
  31. [SPARK-35483][FOLLOWUP][TESTS] Update run-tests.py doctest (commit: 0a7a6f7) (details)
  32. [SPARK-35916][SQL] Support subtraction among Date/Timestamp/TimestampWithoutTZ (commit: 7635114) (details)
  33. [SPARK-35721][PYTHON] Path level discover for python unittests (commit: 5db51ef) (details)
  34. [SPARK-35927][SQL] Remove type collection AllTimestampTypes (commit: 78e6263) (details)
  35. [SPARK-35923][SQL] Coalesce empty partition with mixed (commit: def7383) (details)
  36. [SPARK-35928][BUILD] Upgrade ASM to 9.1 (commit: 7e70282) (details)
  37. [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark (commit: 2702fb9) (details)
  38. [SPARK-35906][SQL] Remove order by if the maximum number of rows less (commit: 4a17e7a) (details)
  39. [SPARK-35924][BUILD][TESTS] Add Java 17 ea build test to GitHub action (commit: a6088e5) (details)
  40. Revert "[SPARK-35721][PYTHON] Path level discover for python unittests" (commit: 1f6e2f5) (details)
  41. [SPARK-35921][BUILD] ${spark.yarn.isHadoopProvided} in config.properties (commit: 05c6b8a) (details)
  42. [SPARK-32922][SHUFFLE][CORE] Adds support for executors to fetch local (commit: 9a5cd15) (details)
  43. [SPARK-35784][SS] Implementation for RocksDB instance (commit: 3257a30) (details)
  44. [SPARK-35873][PYTHON] Cleanup the version logic from the pandas API on (commit: 28a201a) (details)
  45. Revert "[SPARK-34549][BUILD] Upgrade aws kinesis to 1.14.0 and java sdk (commit: 7ad682a) (details)
  46. [SPARK-35943][PYTHON] Introduce Axis type alias (commit: 0a838dc) (details)
  47. [SPARK-35896][SS] Include more granular metrics for stateful operators (commit: 24b67ca) (details)
  48. [SPARK-35829][SQL] Clean up evaluates subexpressions and add more (commit: 064230d) (details)
  49. [SPARK-35946][PYTHON] Respect Py4J server in InheritableThread API (commit: 8d28839) (details)
  50. [SPARK-35937][SQL] Extracting date field from timestamp should work in (commit: ad4b679) (details)
  51. Revert "[SPARK-33995][SQL] Expose make_interval as a Scala function" (commit: 7668226) (details)
  52. [SPARK-35935][SQL] Prevent failure of `MSCK REPAIR TABLE` on table (commit: d28ca9c) (details)
  53. [SPARK-35947][INFRA] Increase JVM stack size in release-build.sh (commit: 5312008) (details)
  54. [SPARK-35948][INFRA] Simplify release scripts by removing Spark (commit: b218cc9) (details)
  55. [SPARK-33298][CORE][FOLLOWUP] Add Unstable annotation to (commit: 6bbfb45) (details)
  56. [SPARK-34365][AVRO] Add support for positional Catalyst-to-Avro schema (commit: 4dd41b9) (details)
  57. [SPARK-34920][CORE][SQL] Add error classes with SQLSTATE (commit: e3bd817) (details)
  58. [SPARK-35951][DOCS] Add since versions for Avro options in Documentation (commit: c6afd6e) (details)
  59. [SPARK-35932][SQL] Support extracting hour/minute/second from timestamp (commit: e88aa49) (details)
  60. [SPARK-35735][SQL] Take into account day-time interval fields in cast (commit: 2febd5c) (details)
  61. [SPARK-35953][SQL] Support extracting date fields from timestamp without (commit: 733e85f) (details)
  62. initial commit for skeleton ansible for jenkins worker config (commit: 2c94fbc) (details)
  63. [SPARK-35725][SQL] Support optimize skewed partitions in (commit: d46c1e3) (details)
  64. [SPARK-34859][SQL] Handle column index when using vectorized Parquet (commit: a5c8866) (details)
  65. [SPARK-35939][DOCS][PYTHON] Deprecate Python 3.6 in Spark documentation (commit: 9e39415) (details)
  66. [SPARK-35938][PYTHON] Add deprecation warning for Python 3.6 (commit: 5ad1261) (details)
  67. [SPARK-35944][PYTHON] Introduce Name and Label type aliases (commit: a98c8ae) (details)
  68. [SPARK-35888][SQL][FOLLOWUP] Return partition specs for all the shuffles (commit: cd6a463) (details)
  69. [SPARK-35065][SQL] Group exception messages in spark/sql (core) (commit: 5d74ace) (details)
  70. [SPARK-35714][FOLLOW-UP][CORE] Use a shared stopping flag for (commit: 868a594) (details)
  71. [SPARK-35950][WEBUI] Failed to toggle Exec Loss Reason in the executors (commit: dc85b0b) (details)
  72. [SPARK-35960][BUILD][TEST] Bump the scalatest version to 3.2.9 (commit: 34286ae) (details)
  73. [SPARK-35961][SQL] Only use local shuffle reader when (commit: ba0a479) (details)
  74. fix Spark version (commit: 6a2f434) (details)
  75. Revert "fix Spark version" (commit: 74c4641) (details)
  76. [SPARK-35962][DOCS] Deprecate old Java 8 versions prior to 8u201 (commit: 912d2b9) (details)
Commit 63cd1314d225d431b7b5ef9417e3af0c39a69a63 by dongjoon
[SPARK-35899][SQL] Utility to convert connector expressions to Catalyst

### What changes were proposed in this pull request?

This PR adds a utility to convert public connector expressions to Catalyst expressions.

Notable differences:
- Switched to `QueryCompilationErrors` from an explicit `AnalysisException`.
- Decoupled the resolving logic for v2 references into separate methods to use in other places.
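
A hedged sketch of what such a utility can look like (the object name follows the PR title, but the body is illustrative; the real code reports errors via `QueryCompilationErrors`, and also handles transforms):

```scala
import org.apache.spark.sql.catalyst.analysis.caseSensitiveResolution
import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.connector.expressions.{Expression => V2Expression, NamedReference}

object V2ExpressionUtilsSketch {
  // Convert a public connector expression into a Catalyst expression by
  // resolving v2 references against the query plan's output.
  def toCatalyst(expr: V2Expression, query: LogicalPlan): Expression = expr match {
    case ref: NamedReference => resolveRef(ref, query)
    case other =>
      throw new IllegalArgumentException(s"Unsupported connector expression: $other")
  }

  // The resolving logic lives in its own method so other places can reuse it.
  def resolveRef(ref: NamedReference, query: LogicalPlan): NamedExpression =
    query.resolve(ref.fieldNames.toSeq, caseSensitiveResolution).getOrElse {
      throw new IllegalArgumentException(s"Cannot resolve $ref from ${query.output}")
    }
}
```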

### Why are the changes needed?

These changes are needed as more and more places require this logic and it is better to implement it in a single place.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33096 from aokolnychyi/spark-35899.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 63cd131)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala (diff)
Commit a9ebfc5374b224867e067856f53d511d0cf25396 by ueshin
[SPARK-35466][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.data_type_ops.*

### What changes were proposed in this pull request?

Adds more type annotations in the files `python/pyspark/pandas/data_type_ops/*.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33094 from ueshin/issues/SPARK-35466/disallow_untyped_defs_data_ops.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: a9ebfc5)
The file was modified python/pyspark/pandas/data_type_ops/base.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/categorical_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/date_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/string_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/datetime_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/num_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/binary_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/null_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/complex_ops.py (diff)
The file was modified python/mypy.ini (diff)
The file was modified python/pyspark/pandas/data_type_ops/boolean_ops.py (diff)
Commit 14d4decf736297e2bf4d824ccbd604c9da49ccf4 by yao
[SPARK-35879][CORE][SHUFFLE] Fix performance regression caused by collectFetchRequests

### What changes were proposed in this pull request?

This PR fixes a perf regression on the executor side when creating fetch requests for a large number of initial partitions.

![image](https://user-images.githubusercontent.com/8326978/123270865-dd21e800-d532-11eb-8447-ad80e47b034f.png)

At NetEase, we had an online job that took `45min` to "fetch" about 100MB of shuffle data; it turned out the time was actually spent slowly collecting fetch requests. Normally, such a task should finish in seconds.

See the `DEBUG` log

```
21/06/22 11:52:26 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to kyuubi.163.org:
21/06/22 11:53:05 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 3941440 at BlockManagerId(12, .., 43559, None) with 19 blocks
21/06/22 11:53:44 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 3716400 at BlockManagerId(20, .., 38287, None) with 18 blocks
21/06/22 11:54:41 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 4559280 at BlockManagerId(6, .., 39689, None) with 22 blocks
21/06/22 11:55:08 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 3120160 at BlockManagerId(33, .., 39449, None) with 15 blocks
```

I also created a test case locally with my laptop's Docker env to provide a reproducible case.

```
bin/spark-sql --conf spark.kubernetes.file.upload.path=./ --master k8s://https://kubernetes.docker.internal:6443 --conf spark.kubernetes.container.image=yaooqinn/spark:v20210624-5 -c spark.kubernetes.context=docker-for-desktop_1 --num-executors 5 --driver-memory 5g --conf spark.kubernetes.executor.podNamePrefix=sparksql
```

```sql
SET spark.sql.adaptive.enabled=true;
SET spark.sql.shuffle.partitions=3000;
SELECT /*+ REPARTITION */ 1 as pid, id from range(1, 1000000, 1, 500);
SELECT /*+ REPARTITION(pid, id) */ 1 as pid, id from range(1, 1000000, 1, 500);
```

### Why are the changes needed?

Fix a perf regression which was introduced by SPARK-29292 (3ad4863673fc46080dda963be3055a3e554cfbc7) in v3.1.0.

3ad4863673fc46080dda963be3055a3e554cfbc7 supports compilation with Scala 2.13, but the performance loss is huge. We need to consider backporting this PR to branch-3.1.
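
A hedged, self-contained sketch (not the actual Spark code) of the regression shape: recomputing a collection sum inside the request-building loop turns one O(n) pass into O(n²), which is what the `blockInfos.map(_._2).sum` timing in the logs below points at.

```scala
import scala.collection.mutable.ArrayBuffer

final case class Block(id: String, size: Long)

// Regression shape: a full `.map(_.size).sum` pass per appended block -> O(n^2).
def buildRequestsSlow(blocks: Seq[Block], targetSize: Long): Seq[Seq[Block]] = {
  val requests = ArrayBuffer.empty[Seq[Block]]
  var current = ArrayBuffer.empty[Block]
  blocks.foreach { b =>
    current += b
    if (current.map(_.size).sum >= targetSize) { // re-walks `current` every time
      requests += current.toSeq
      current = ArrayBuffer.empty[Block]
    }
  }
  if (current.nonEmpty) requests += current.toSeq
  requests.toSeq
}

// Fix shape: carry a running total instead -> O(n).
def buildRequestsFast(blocks: Seq[Block], targetSize: Long): Seq[Seq[Block]] = {
  val requests = ArrayBuffer.empty[Seq[Block]]
  var current = ArrayBuffer.empty[Block]
  var currentSize = 0L
  blocks.foreach { b =>
    current += b
    currentSize += b.size
    if (currentSize >= targetSize) {
      requests += current.toSeq
      current = ArrayBuffer.empty[Block]
      currentSize = 0L
    }
  }
  if (current.nonEmpty) requests += current.toSeq
  requests.toSeq
}
```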

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Manually.

#### before
```log
21/06/23 13:54:22 DEBUG ShuffleBlockFetcherIterator: maxBytesInFlight: 50331648, targetRemoteRequestSize: 10066329, maxBlocksInFlightPerAddress: 2147483647
21/06/23 13:54:38 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2314708 at BlockManagerId(2, 10.1.3.114, 36423, None) with 86 blocks
21/06/23 13:54:59 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2636612 at BlockManagerId(3, 10.1.3.115, 34293, None) with 87 blocks
21/06/23 13:55:18 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2508706 at BlockManagerId(4, 10.1.3.116, 41869, None) with 90 blocks
21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2350854 at BlockManagerId(5, 10.1.3.117, 45787, None) with 85 blocks
21/06/23 13:55:34 INFO ShuffleBlockFetcherIterator: Getting 438 (11.8 MiB) non-empty blocks including 90 (2.5 MiB) local and 0 (0.0 B) host-local and 348 (9.4 MiB) remote blocks
21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Sending request for 87 blocks (2.5 MiB) from 10.1.3.115:34293
21/06/23 13:55:34 INFO TransportClientFactory: Successfully created connection to /10.1.3.115:34293 after 1 ms (0 ms spent in bootstraps)
21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Sending request for 90 blocks (2.4 MiB) from 10.1.3.116:41869
21/06/23 13:55:34 INFO TransportClientFactory: Successfully created connection to /10.1.3.116:41869 after 2 ms (0 ms spent in bootstraps)
21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Sending request for 85 blocks (2.2 MiB) from 10.1.3.117:45787
```
```log
21/06/23 14:00:45 INFO MapOutputTracker: Broadcast outputstatuses size = 411, actual size = 828997
21/06/23 14:00:45 INFO MapOutputTrackerWorker: Got the map output locations
21/06/23 14:00:45 DEBUG ShuffleBlockFetcherIterator: maxBytesInFlight: 50331648, targetRemoteRequestSize: 10066329, maxBlocksInFlightPerAddress: 2147483647
21/06/23 14:00:55 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1894389 at BlockManagerId(2, 10.1.3.114, 36423, None) with 99 blocks
21/06/23 14:01:04 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1919993 at BlockManagerId(3, 10.1.3.115, 34293, None) with 100 blocks
21/06/23 14:01:14 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1977186 at BlockManagerId(5, 10.1.3.117, 45787, None) with 103 blocks
21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1938336 at BlockManagerId(4, 10.1.3.116, 41869, None) with 101 blocks
21/06/23 14:01:23 INFO ShuffleBlockFetcherIterator: Getting 500 (9.1 MiB) non-empty blocks including 97 (1820.3 KiB) local and 0 (0.0 B) host-local and 403 (7.4 MiB) remote blocks
21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 101 blocks (1892.9 KiB) from 10.1.3.116:41869
21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 103 blocks (1930.8 KiB) from 10.1.3.117:45787
21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 99 blocks (1850.0 KiB) from 10.1.3.114:36423
21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 100 blocks (1875.0 KiB) from 10.1.3.115:34293
21/06/23 14:01:23 INFO ShuffleBlockFetcherIterator: Started 4 remote fetches in 37889 ms
```

#### After

```log
21/06/24 13:01:16 DEBUG ShuffleBlockFetcherIterator: maxBytesInFlight: 50331648, targetRemoteRequestSize: 10066329, maxBlocksInFlightPerAddress: 2147483647
21/06/24 13:01:16 INFO ShuffleBlockFetcherIterator: ==> Call blockInfos.map(_._2).sum: 40 ms
21/06/24 13:01:16 INFO ShuffleBlockFetcherIterator: ==> Call mergeFetchBlockInfo for shuffle_0_9_2990_2997/9: 0 ms
21/06/24 13:01:16 INFO ShuffleBlockFetcherIterator: ==> Call mergeFetchBlockInfo for shuffle_0_15_2395_2997/15: 0 ms
```

Closes #33063 from yaooqinn/SPARK-35879.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(commit: 14d4dec)
The file was modified core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala (diff)
Commit 645fb59652fad5a7b84a691e2446b396cf81048a by max.gekk
[SPARK-35895][SQL] Support subtracting Intervals from TimestampWithoutTZ

### What changes were proposed in this pull request?

Support the following operation:
- TimestampWithoutTZ - Year-Month interval

The following operations are actually supported since https://github.com/apache/spark/pull/33076/. This PR adds end-to-end tests for them:
- TimestampWithoutTZ - Calendar interval
- TimestampWithoutTZ - Daytime interval

### Why are the changes needed?

Support subtracting all 3 interval types from a timestamp without time zone

### Does this PR introduce _any_ user-facing change?

No, the timestamp without time zone type is not released yet.

### How was this patch tested?

Unit tests

Closes #33086 from gengliangwang/subtract.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 645fb59)
The file was modified sql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/datetime.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/datetime.sql (diff)
Commit f68fbae7abd49ef83fe58422e3f96cfa6e6c8f9f by dhyun
[SPARK-35903][TESTS] Parameterize 'master' in TPCDSQueryBenchmark

### What changes were proposed in this pull request?

Like SPARK-8397, this PR aims to parameterize TPCDSQueryBenchmark's Spark master by reusing `spark.sql.test.master`.

### Why are the changes needed?

This is helpful for testers.

### Does this PR introduce _any_ user-facing change?

No. This is a test environment.

### How was this patch tested?

Manually, I checked the performance difference with TPCDS 10g data.

Closes #33098 from dongjoon-hyun/SPARK-35903.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: f68fbae)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala (diff)
Commit 74637a6ca717c7d14a0511f026e85a2f538e99ae by dhyun
[SPARK-35905][SQL][TESTS] Fix UT to clean up table/view in SQLQuerySuite

### What changes were proposed in this pull request?
Fix a UT in SQLQuerySuite so that it cleans up the table/view it creates.
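
A hedged sketch of the cleanup pattern, using the `withTable`/`withView` helpers from Spark's `SQLTestUtils` (the test body is illustrative):

```scala
test("SPARK-33338: result should be cleaned up (illustrative)") {
  withTable("t") {
    withView("v") {
      sql("CREATE TABLE t(id INT) USING parquet")
      sql("CREATE VIEW v AS SELECT id FROM t")
      checkAnswer(sql("SELECT * FROM v"), Nil)
    } // view `v` is dropped here, even if the assertion above fails
  }   // table `t` is dropped here, keeping the shared session clean
}
```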

### Why are the changes needed?
The UT should not leave a table/view behind in the shared test session.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #33092 from AngersZhuuuu/SPARK-33338-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 74637a6)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
Commit def29e507521a0383b9fb4c0fcbba8784ee27e80 by dongjoon
[SPARK-35904][SQL] Collapse above RebalancePartitions

### What changes were proposed in this pull request?

1. Make `RebalancePartitions` extend `RepartitionOperation`.
2. Make `CollapseRepartition` support `RebalancePartitions`.

### Why are the changes needed?

`CollapseRepartition` can optimize `RebalancePartitions` if possible.
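
A hedged, simplified sketch of the rule's idea (the real `CollapseRepartition` handles more node combinations): once `RebalancePartitions` is a `RepartitionOperation`, a repartition stacked directly on top of one can drop the lower node.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Repartition, RepartitionOperation}
import org.apache.spark.sql.catalyst.rules.Rule

object CollapseRepartitionSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // The top repartition wins; the repartition-like node below it is redundant.
    case r @ Repartition(_, _, lower: RepartitionOperation) =>
      r.copy(child = lower.child)
  }
}
```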

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33099 from wangyum/SPARK-35904.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: def29e5)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseRepartitionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
Commit b11b175148592b27695b422e81a22881bf14dd87 by dongjoon
[SPARK-35893][TESTS] Add unit test case for MySQLDialect.getCatalystType

### What changes were proposed in this pull request?
Add unit test case for MySQLDialect.getCatalystType

### Why are the changes needed?
Add a unit test case for better coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit Test

Closes #33087 from zengruios/SPARK-35893.

Authored-by: zengruios <578395184@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: b11b175)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala (diff)
Commit b89cd8d75a0e78c6953cdd21c6e9c41495ed018f by viirya
[SPARK-35886][SQL] PromotePrecision should not overwrite genCode

### What changes were proposed in this pull request?

This patch fixes `PromotePrecision`, which overwrites `genCode`, where subexpression elimination should happen.

### Why are the changes needed?

`PromotePrecision` overwrites `genCode`, which is where subexpression elimination should happen. So if it is the top-most expression of a common subexpression, the subexpression is never replaced.
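
A hedged, self-contained model of the mechanism (not the actual Spark source): subexpression elimination works because the base `genCode` consults the codegen context's cache of common subexpressions before generating anything, so an operator overriding `genCode` itself, instead of `doGenCode`, bypasses that lookup.

```scala
final case class Code(src: String)

final class Ctx {
  // Code already generated for common subexpressions, keyed by expression.
  val commonSubexprs = scala.collection.mutable.Map.empty[Expr, Code]
}

abstract class Expr {
  // Entry point: consult the cache first. Overriding this skips elimination,
  // which is what PromotePrecision effectively did before the fix.
  def genCode(ctx: Ctx): Code =
    ctx.commonSubexprs.getOrElse(this, doGenCode(ctx))

  // Operators should only override this.
  protected def doGenCode(ctx: Ctx): Code
}
```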

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added test.

Closes #33103 from viirya/fix-precision.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: b89cd8d)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/decimalExpressions.scala (diff)
Commit a7369b3080ec3d76957df63cf905a68e41197ba3 by dongjoon
[SPARK-35909][DOCS] Fix broken Python Links in docs/sql-getting-started.md

### What changes were proposed in this pull request?

The hyperlinks in Python code blocks in [Spark SQL Guide - Getting Started](https://spark.apache.org/docs/latest/sql-getting-started.html) currently point to invalid addresses and return 404. This pull request fixes that issue by pointing them to correct links in Python API docs.

### Why are the changes needed?

Errors in documentation classify as bugs and hence need to be fixed.

### Does this PR introduce _any_ user-facing change?

Yes. This PR fixes documentation error in https://spark.apache.org/docs/latest/sql-getting-started.html

### How was this patch tested?

This patch was tested by cloning the repo from scratch and doing a clean local build of the docs after fixing the broken links.

Closes #33107 from dhruvildave/sql-doc.

Authored-by: Dhruvil Dave <dhruvil.dave@outlook.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: a7369b3)
The file was modified docs/sql-getting-started.md (diff)
Commit 03e6de2abe6b767e55de2761499ae937db588508 by gurwls223
[SPARK-35605][PYTHON] Move to_pandas_on_spark to the Spark DataFrame

### What changes were proposed in this pull request?

This PR proposes to move the `to_pandas_on_spark` function from `pyspark.pandas.frame` to `pyspark.sql.dataframe`, and adds the related tests to the PySpark DataFrame tests.

### Why are the changes needed?

Now that Koalas is ported into PySpark, we don't need the Spark auto-patching anymore.
Also, having `to_pandas_on_spark` belong to the pandas-on-Spark DataFrame doesn't make sense.

### Does this PR introduce _any_ user-facing change?

No, it's internal refactoring.

### How was this patch tested?

Added the related tests and manually checked that they pass.

Closes #33054 from itholic/SPARK-35605.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 03e6de2)
The file was modified python/pyspark/pandas/__init__.py (diff)
The file was modified python/pyspark/pandas/plot/core.py (diff)
The file was modified python/pyspark/sql/dataframe.pyi (diff)
The file was modified python/docs/source/reference/pyspark.sql.rst (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/sql/dataframe.py (diff)
The file was modified python/pyspark/sql/tests/test_dataframe.py (diff)
The file was modified python/docs/source/reference/pyspark.pandas/frame.rst (diff)
Commit 8c401beb806267d4c23aeb27ab8898dcc3a0f98d by gurwls223
[SPARK-35901][PYTHON] Refine type hints in pyspark.pandas.window

### What changes were proposed in this pull request?

Refines type hints in `pyspark.pandas.window`.

Also, some refactoring is included to clean up the type hierarchy of `Rolling` and `Expanding`.

### Why are the changes needed?

We can use stricter type hints for functions in `pyspark.pandas.window` by using generics.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33097 from ueshin/issues/SPARK-35901/window.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 8c401be)
The file was modified python/pyspark/pandas/series.py (diff)
The file was modified python/pyspark/pandas/window.py (diff)
The file was modified python/pyspark/pandas/groupby.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/pandas/generic.py (diff)
Commit 0da463e59304954515f003f98574c740b47b89fb by kabhwan.opensource
[SPARK-35880][SS] Track the duplicates dropped count in dedupe operator

### What changes were proposed in this pull request?

Add a metric to track the number of duplicates dropped in input in streaming deduplication operator. Also introduce a `StatefulOperatorCustomMetric` to allow stateful operators to output their own unique metrics in `StateOperatorProgress.customMetrics` in `StreamingQueryProgress`.

### Why are the changes needed?

1. Having the duplicates-dropped count helps monitor and debug incorrect-results issues and find reasons for state size increases in the dedupe operator.
2. New API `StatefulOperatorCustomMetric` allows stateful operators to expose their own unique metrics in `StateOperatorProgress.customMetrics` in `StreamingQueryProgress`

### Does this PR introduce _any_ user-facing change?

Yes. For deduplication stateful operator a new metric `numDuplicatesDropped` is shown in `StateOperatorProgress` within `StreamingQueryProgress`. Example `StreamingQueryProgress` output in JSON form.

```
{
  "id" : "510be3cd-a955-4faf-8456-d97c78d39af5",
  "runId" : "c170c4cd-04cb-4a28-b054-74020e3998e1",
  ...
  ,
  "stateOperators" : [ {
    "numRowsTotal" : 1,
    "numRowsUpdated" : 1,
    "numRowsDroppedByWatermark" : 0,
    "customMetrics" : {
      "loadedMapCacheHitCount" : 0,
      "loadedMapCacheMissCount" : 0,
      "numDuplicatesDropped" : 0,
      "stateOnCurrentVersionSizeBytes" : 392
    }
  }],
  ...
}
```
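
The metric can also be read programmatically off `StreamingQuery.lastProgress`; a hedged sketch (assuming a started `StreamingQuery`):

```scala
import org.apache.spark.sql.streaming.StreamingQuery

def reportDuplicatesDropped(query: StreamingQuery): Unit = {
  // lastProgress can be null before the first progress update is published.
  Option(query.lastProgress).foreach { progress =>
    progress.stateOperators.foreach { op =>
      println(s"numDuplicatesDropped = ${op.customMetrics.get("numDuplicatesDropped")}")
    }
  }
}
```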

### How was this patch tested?

Existing UTs for regression and added a UT.

Closes #33065 from vkorukanti/SPARK-35880.

Authored-by: Venki Korukanti <venki.korukanti@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(commit: 0da463e)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StateStoreMetricsTest.scala (diff)
Commit 378ac78bdfd197eb921c56b0a6f22cccd7cd42a1 by wenchen
[SPARK-35318][SQL][FOLLOWUP] Hide the internal view properties for `show tblproperties`

### What changes were proposed in this pull request?
PR #32441 hid the internal view properties for the DESCRIBE TABLE command, but the `SHOW TBLPROPERTIES` case on a view was not covered.
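
A hedged sketch of the filtering idea (the constant name is illustrative; the internal view properties in the example below all share the `view.` prefix):

```scala
// Drop internal view properties before SHOW TBLPROPERTIES produces its output.
val internalViewPropertyPrefix = "view." // illustrative constant name

def visibleProperties(props: Map[String, String]): Map[String, String] =
  props.filterNot { case (key, _) => key.startsWith(internalViewPropertyPrefix) }
```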

### Why are the changes needed?
Avoid confusing users with internal properties.

### Does this PR introduce _any_ user-facing change?
Yes
Before this change, the user will see the below output for `show tblproperties test_view`:
```
....
p1 v1
p2 v2
view.catalogAndNamespace.numParts 2
view.catalogAndNamespace.part.0 spark_catalog
view.catalogAndNamespace.part.1 default
view.query.out.col.0 c1
view.query.out.numCols 1
view.referredTempFunctionsNames []
view.referredTempViewNames []
...
```
After this change, the internal properties will be hidden.
```
....
p1 v1
p2 v2
...
```
### How was this patch tested?
existing UT

Closes #33016 from jerqi/hide_show_tblproperties.

Authored-by: RoryQi <1242949407@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 378ac78)
The file was modified sql/core/src/test/resources/sql-tests/results/show-tblproperties.sql.out (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
Commit 1c81ad20296d34f137238dadd67cc6ae405944eb by wenchen
[SPARK-35064][SQL] Group error in spark-catalyst

### What changes were proposed in this pull request?
This PR groups exception messages in sql/catalyst/src/main/scala/org/apache/spark/sql (except catalyst).
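
A hedged sketch of the grouping pattern (method names are hypothetical, and `RuntimeException` stands in for Spark's `AnalysisException`): each error message becomes a named factory method on a central errors object instead of an inline exception at the call site.

```scala
object QueryCompilationErrorsSketch {
  def columnNotFoundError(colName: String): Throwable =
    new RuntimeException(s"Column '$colName' does not exist")

  def unsupportedIfNotExistsError(tableName: String): Throwable =
    new RuntimeException(s"Cannot write: IF NOT EXISTS is not supported for table $tableName")
}

// A call site then reads:
//   throw QueryCompilationErrorsSketch.columnNotFoundError("c1")
```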

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.

### Does this PR introduce any user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32916 from dgd-contributor/SPARK-35064_catalyst_group_error.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 1c81ad2)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/execution/RowIterator.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/Catalogs.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/UDTRegistration.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/ReadOnlySQLConf.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala (diff)
Commit 3255511d52f0c9652b34de4f499ee5081f59e0a5 by mridulatgmail.com
[SPARK-35258][SHUFFLE][YARN] Add new metrics to ExternalShuffleService for better monitoring

### What changes were proposed in this pull request?
This adds two new additional metrics to `ExternalBlockHandler`:
- `blockTransferRate` -- for indicating the rate of transferring blocks, vs. the data within them
- `blockTransferAvgSize_1min` -- a 1-minute trailing average of block sizes transferred by the ESS

Additionally, this enhances `YarnShuffleServiceMetrics` to expose the histogram/`Snapshot` information from `Timer` metrics within `ExternalBlockHandler`.

### Why are the changes needed?
Currently `ExternalBlockHandler` exposes some useful metrics, but is lacking metrics around the rate of block transfers. We have `blockTransferRateBytes` to tell us the rate of _bytes_, but no metric to tell us the rate of _blocks_, which is especially relevant when running the ESS on HDDs that are sensitive to random reads. Many small block transfers can have a negative impact on performance, but won't show up as a spike in `blockTransferRateBytes` since the sizes are small. Thus the new metrics showing the average block size and the block transfer rate are very useful for monitoring the health/performance of the ESS, especially when running on HDDs.

For the `YarnShuffleServiceMetrics`, currently the three `Timer` metrics exposed by `ExternalBlockHandler` are being underutilized in a YARN-based environment -- they are basically treated as a `Meter`, only exposing rate-based information, when the metrics themselves are collected detailed histograms of timing information. We should expose this information for better observability.
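
A hedged sketch of the two new metrics using Dropwizard Metrics, which the ESS already uses (the sliding-window reservoir is one plausible way to get a 1-minute trailing view; the PR's actual implementation may differ):

```scala
import java.util.concurrent.TimeUnit

import com.codahale.metrics.{Histogram, Meter, MetricRegistry, SlidingTimeWindowArrayReservoir}

val registry = new MetricRegistry

// Rate of *blocks* transferred, complementing the existing blockTransferRateBytes.
val blockTransferRate: Meter = registry.meter("blockTransferRate")

// Sizes observed over the trailing minute; its snapshot mean is the average block size.
val blockTransferAvgSize1Min: Histogram =
  registry.register("blockTransferAvgSize_1min",
    new Histogram(new SlidingTimeWindowArrayReservoir(1, TimeUnit.MINUTES)))

def onBlockTransferred(sizeBytes: Long): Unit = {
  blockTransferRate.mark()                   // one block transferred
  blockTransferAvgSize1Min.update(sizeBytes) // feed the 1-minute window
}
```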

### Does this PR introduce _any_ user-facing change?
Yes, there are two entirely new metrics for the ESS, as documented in `monitoring.md`. Additionally in a YARN environment, `Timer` metrics exposed by the ESS will include more rich timing information.

### How was this patch tested?
New unit tests are added to verify that new metrics are showing up as expected.

We have been running this patch internally for approx. 1 year and have found it to be useful for monitoring the health of ESS and diagnosing performance issues.

Closes #32388 from xkrogen/xkrogen-SPARK-35258-ess-new-metrics.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: 3255511)
The file was modified resource-managers/yarn/src/test/scala/org/apache/spark/network/yarn/YarnShuffleServiceSuite.scala (diff)
The file was modified common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleServiceMetrics.java (diff)
The file was modified resource-managers/yarn/src/test/scala/org/apache/spark/network/yarn/YarnShuffleServiceMetricsSuite.scala (diff)
The file was modified common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalBlockHandlerSuite.java (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/ExternalShuffleServiceMetricsSuite.scala (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java (diff)
The file was modified docs/monitoring.md (diff)
Commit 108635af1708173a72bec0e36bf3f2cea5b088c4 by yumwang
Revert "[SPARK-35904][SQL] Collapse above RebalancePartitions"

This reverts commit def29e50
(commit: 108635a)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseRepartitionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
Commit 356aef48b8dc2fa97b28d960ae9e84a1b7a5752a by max.gekk
[SPARK-35728][SPARK-35778][SQL][TESTS] Check multiply/divide of day-time and year-month interval of any fields by a numeric

### What changes were proposed in this pull request?
[SPARK-35728](https://issues.apache.org/jira/browse/SPARK-35728): Add a test case to check multiply/divide of day-time intervals of any fields by a numeric.
[SPARK-35778](https://issues.apache.org/jira/browse/SPARK-35778): Add a test case to check multiply/divide of year-month intervals of any fields by a numeric.

### Why are the changes needed?
Improve test coverage

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests.

Closes #33080 from Peng-Lei/SPARK-35728-35778.

Lead-authored-by: PengLei <peng.8lei@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 356aef4)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/IntervalExpressionsSuite.scala (diff)
Commit c6606502a2e338c0e973e5772a8cc44126ae2fde by herman
[SPARK-35898][SQL] Fix arrays and maps in RowToColumnConverter

### What changes were proposed in this pull request?

This PR fixes support for arrays and maps in `RowToColumnConverter`. In particular this PR fixes two bugs:

1. `appendArray` in `WritableColumnVector` does not reserve any elements in its child arrays, which causes the assertion in `OffHeapColumnVector.putArray` to fail.
2. The nullability of the child columns is propagated incorrectly when creating the child converters of `ArrayConverter` and `MapConverter` in `RowToColumnConverter`.

This PR fixes these issues.
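
A hedged sketch of the nullability point (the types are real Spark `DataType`s; the helper is illustrative): each child converter must take its nullability from the element/value type, not from the parent field.

```scala
import org.apache.spark.sql.types._

// What nullability each child converter should be built with.
def childNullabilities(dt: DataType): Seq[Boolean] = dt match {
  case ArrayType(_, containsNull)       => Seq(containsNull)
  case MapType(_, _, valueContainsNull) => Seq(false, valueContainsNull) // map keys are never null
  case StructType(fields)               => fields.map(_.nullable).toSeq
  case _                                => Seq.empty
}
```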

### Why are the changes needed?

Both bugs cause an exception to be thrown.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

I added additional test cases to `ColumnVectorSuite` to catch the first bug, and I added `RowToColumnConverterSuite` to catch the both bugs (but specifically the second).

Closes #33108 from tomvanbussel/SPARK-35898.

Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
Signed-off-by: herman <herman@databricks.com>
(commit: c660650)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnVectorSuite.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/RowToColumnConverterSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala (diff)
Commit 9c157a490bb59e02dcf44b14b411ea5beb68c238 by dongjoon
[SPARK-35910][CORE][SHUFFLE] Update remoteBlockBytes based on merged block info to reduce task time

### What changes were proposed in this pull request?

Currently, we calculate `remoteBlockBytes` based on the original block info list, which is not efficient. It usually adds ~25% more time spent here.

If the original reducer size is big but the actual reducer size is small due to AQE's automatic partition coalescing, the reducer will take more time to calculate `remoteBlockBytes`.

We can reduce this cost by using the remote requests, which contain merged block info lists.

### Why are the changes needed?

improve task performance

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new unit tests and verified manually.

Closes #33109 from yaooqinn/SPARK-35910.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 9c157a4)
The file was modified core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala (diff)
Commit 5f0113e3a666f64c8b37cc889af846299a84960d by ueshin
[SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark

### What changes were proposed in this pull request?

This PR proposes to support creating a Column of a numpy literal value in pandas-on-Spark. It mainly consists of three changes:
- Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input.

```py
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
```
- Substitute `F.lit` by `SF.lit`, that is, use `lit` function defined in `pyspark.pandas.spark.functions` rather than `lit` function defined in `pyspark.sql.functions` to allow creating columns out of numpy literals.
- Enable numpy literals input in `isin` method

Non-goal:
- Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, thus numpy literals are disallowed as input (e.g. `to_replace` parameter in `replace` API). This PR doesn't aim to adjust all of them. This PR adjusts `isin` only, because the PR is inspired by that (as https://github.com/databricks/koalas/issues/2161).
- To complete mappings between all kinds of numpy literals and Spark data types should be a followup task.

### Why are the changes needed?

Spark (the `lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of a numpy literal value.
So the `lit` function defined in `pyspark.pandas.spark.functions` is adjusted to support that in pandas-on-Spark.

### Does this PR introduce _any_ user-facing change?

Yes.
Before:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
```

After:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0     True
1     True
2    False
3    False
4    False
Name: source, dtype: bool
```

### How was this patch tested?

Unit tests.

Closes #32955 from xinrong-databricks/datatypeops_literal.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 5f0113e)
The file was modified python/pyspark/pandas/spark/functions.py (diff)
The file was modified python/pyspark/pandas/base.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/datetime_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/categorical_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/num_ops.py (diff)
The file was modified python/pyspark/pandas/numpy_compat.py (diff)
The file was modified python/pyspark/pandas/namespace.py (diff)
The file was modified python/pyspark/pandas/series.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/boolean_ops.py (diff)
The file was modified python/pyspark/pandas/generic.py (diff)
The file was modified dev/sparktestsupport/modules.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/string_ops.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/pandas/window.py (diff)
The file was modified python/pyspark/pandas/plot/core.py (diff)
The file was modified python/pyspark/pandas/indexes/multi.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/date_ops.py (diff)
The file was added python/pyspark/pandas/tests/test_spark_functions.py
The file was modified python/pyspark/pandas/indexes/base.py (diff)
The file was modified python/pyspark/pandas/utils.py (diff)
The file was modified python/pyspark/pandas/indexing.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/base.py (diff)
The file was modified python/pyspark/pandas/groupby.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/binary_ops.py (diff)
The file was modified python/pyspark/pandas/tests/test_series.py (diff)
Commit 8fbbd2e6d72a4e0eb9e12abd840924afc44809f1 by yao
[SPARK-33898][SQL] Support SHOW CREATE TABLE In V2

### What changes were proposed in this pull request?
1. Implement the V2 execution node `ShowCreateTableExec`, similar to the V1 `ShowCreateTableCommand`.
2. `SHOW CREATE TABLE XXX AS SERDE` is not supported.
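
A hedged usage sketch, assuming a `SparkSession` named `spark` with a V2 catalog `testcat` already configured (catalog, namespace, and provider names are illustrative):

```scala
spark.sql("CREATE TABLE testcat.ns.tbl (id BIGINT, data STRING) USING foo")
// Now handled by the new V2 ShowCreateTableExec; prints one row containing
// the reconstructed CREATE TABLE statement.
spark.sql("SHOW CREATE TABLE testcat.ns.tbl").show(truncate = false)
```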

### Why are the changes needed?
[SPARK-33898](https://issues.apache.org/jira/browse/SPARK-33898)

### Does this PR introduce _any_ user-facing change?
Yes. Users can now execute the `SHOW CREATE TABLE` command on V2 tables.

### How was this patch tested?
Added two UTs, verified with:
1. `./dev/scalastyle`
2. Running the `DataSourceV2SQLSuite` tests

Closes #32931 from Peng-Lei/SPARK-33898.

Lead-authored-by: PengLei <18066542445@189.cn>
Co-authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
(commit: 8fbbd2e)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala
Commit 622fc686e26e03e8a8ee16568cfd3189fbfaa41e by wenchen
[SPARK-35888][SQL] Add dataSize field in CoalescedPartitionSpec

### What changes were proposed in this pull request?

* add `dataSize` field in `CoalescedPartitionSpec`
* add data size test suite in `ShufflePartitionsUtilSuite`

### Why are the changes needed?

Currently, the test suites around `CoalescedPartitionSpec` do not check the data size, because it doesn't contain a data size field.

We can add the data size to `CoalescedPartitionSpec` and then add test cases for better coverage.
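
A hedged sketch of the shape of the change (the real Spark class may differ in details):

```scala
// A post-shuffle partition covering reducer ids [startReducerIndex, endReducerIndex);
// carrying the byte size lets tests assert on coalesced partition sizes.
case class CoalescedPartitionSpec(
    startReducerIndex: Int,
    endReducerIndex: Int,
    dataSize: Option[Long] = None)
```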

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass CI

Closes #33079 from ulysses-you/SPARK-35888.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 622fc68)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CustomShuffleReaderExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala (diff)
Commit 620fde476777cdce4e55c3398d5aec44ba035de3 by wenchen
[SPARK-34302][SQL] Migrate ALTER TABLE ... CHANGE COLUMN command to use UnresolvedTable to resolve the identifier

### What changes were proposed in this pull request?

This PR proposes to migrate the `ALTER TABLE ... CHANGE COLUMN` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
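
A hedged sketch of the shape of such a command after the migration (the class is illustrative, not the actual Spark node):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// The parser builds the command with table = UnresolvedTable(Seq("ns", "tbl"), ...),
// and the analyzer's shared rules resolve that child (temp views first, then
// catalog lookup) before any command-specific handling runs.
case class AlterTableChangeColumnSketch(
    table: LogicalPlan,
    columnName: Seq[String],
    newComment: Option[String])
```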

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce _any_ user-facing change?

After this PR, the above `ALTER TABLE ... CHANGE COLUMN` commands will have a consistent resolution behavior.

### How was this patch tested?

Updated existing tests.

Closes #33113 from imback82/alter_change_column.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 620fde4)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/change-column.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
Commit 880bbd6aaaead2ed26dfad38425f4e7357769b6a by gurwls223
[SPARK-35876][SQL] ArraysZip should retain field names to avoid being re-written by analyzer/optimizer

### What changes were proposed in this pull request?

This PR fixes an issue that field names of structs generated by `arrays_zip` function could be unexpectedly re-written by analyzer/optimizer.
Here is an example.
```
val df = sc.parallelize(Seq((Array(1, 2), Array(3, 4)))).toDF("a1", "b1").selectExpr("arrays_zip(a1, b1) as zipped")
df.printSchema
root
|-- zipped: array (nullable = true)
|    |-- element: struct (containsNull = false)
|    |    |-- a1: integer (nullable = true)                                      // OK. a1 is expected name
|    |    |-- b1: integer (nullable = true)                                      // OK. b1 is expected name

df.explain
== Physical Plan ==
*(1) Project [arrays_zip(_1#3, _2#4) AS zipped#12]               // Not OK. field names are re-written as _1 and _2 respectively

df.write.parquet("/tmp/test.parquet")
val df2 = spark.read.parquet("/tmp/test.parquet")

df2.printSchema
root
|-- zipped: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- _1: integer (nullable = true)                                      // Not OK. a1 is expected but got _1
|    |    |-- _2: integer (nullable = true)                                      // Not OK. b1 is expected but got _2
```

This issue happens when aliases are eliminated by `AliasHelper.replaceAliasButKeepName` or `AliasHelper.trimNonTopLevelAliases` called via analyzer/optimizer
https://github.com/apache/spark/blob/b89cd8d75a0e78c6953cdd21c6e9c41495ed018f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L883
https://github.com/apache/spark/blob/b89cd8d75a0e78c6953cdd21c6e9c41495ed018f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3759
I investigated functions which can be affected by this issue, but so far I found only `arrays_zip`.

To fix this issue, this PR changes the definition of `ArraysZip` to retain field names to avoid being re-written by analyzer/optimizer.
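
A hedged sketch of the fix idea (the real `ArraysZip` is a full Catalyst expression; only the name-capturing constructor is shown): the field names are captured as literals at construction time, so later alias trimming cannot rewrite them.

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, NamedExpression}

case class ArraysZipSketch(children: Seq[Expression], names: Seq[Expression])

object ArraysZipSketch {
  def apply(children: Seq[Expression]): ArraysZipSketch =
    new ArraysZipSketch(children, children.zipWithIndex.map {
      case (u: NamedExpression, _) => Literal(u.name) // keep "a1", "b1", ...
      case (_, idx)                => Literal(idx.toString)
    })
}
```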

### Why are the changes needed?

This is apparently a bug.

### Does this PR introduce _any_ user-facing change?

No. After this change, the field names are no longer re-written, which should be the expected behavior for users.

### How was this patch tested?

New tests.

Closes #33106 from sarutak/arrays-zip-retain-names.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 880bbd6)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleIdCollection.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala (diff)
Commit b999e6bd902309600b82e3c516a4962918efe55b by dongjoon
[SPARK-35920][BUILD] Upgrade to Chill 0.10.0

### What changes were proposed in this pull request?

This PR aims to upgrade Chill to 0.10.0.

### Why are the changes needed?

This is a maintenance release with cross-compilation for Scala 2.12.14 and 2.13.6.
- https://github.com/twitter/chill/releases/tag/v0.10.0

### Does this PR introduce _any_ user-facing change?

No, this is a dependency change.

### How was this patch tested?

Pass the CIs.

Closes #33119 from dongjoon-hyun/SPARK-35920.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: b999e6b)
The file was modified pom.xml (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
Commit c45a6f5d0966b120e54429f4a19738cc11ad7056 by dongjoon
[SPARK-35922][BUILD] Upgrade maven-shade-plugin to 3.2.4

### What changes were proposed in this pull request?

This PR aims to upgrade `maven-shade-plugin` to 3.2.4.

### Why are the changes needed?

This is required to build with Java 17-ea.

Since `maven-shade-plugin` 3.2.3, `asm` 8.0 is used, so we should remove our custom dependency on 7.3.1.
- https://mvnrepository.com/artifact/org.apache.maven.plugins/maven-shade-plugin/3.2.4
- https://mvnrepository.com/artifact/org.apache.maven.plugins/maven-shade-plugin/3.2.3

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33122 from dongjoon-hyun/SPARK-35922.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: c45a6f5)
The file was modified pom.xml (diff)
Commit 8a21d2dcfed5df9e220054fdb958a3abdeb2e34e by dongjoon
[SPARK-35899][SQL][FOLLOWUP] Utility to convert connector expressions to Catalyst

### What changes were proposed in this pull request?

This PR addresses post-review comments on PR #33096:
- removes `private[sql]` modifier
- removes the option to pass a resolver to simplify the API

### Why are the changes needed?

These changes are needed to simplify the utility API.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33120 from aokolnychyi/spark-35899-follow-up.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 8a21d2d)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala (diff)
Commit 57896e662e9bd16cd2400afa53cea7981f42b714 by dongjoon
[SPARK-35483][FOLLOWUP][TESTS] Enable docker_integration_tests for catalyst/sql module changes too

### What changes were proposed in this pull request?

This PR aims to enable `docker_integration_tests` when `catalyst` and `sql` module changes additionally.

### Why are the changes needed?

Currently, `catalyst` and `sql` module changes do not trigger the JDBC integration test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33125 from dongjoon-hyun/SPARK-35483.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 57896e6)
The file was modified dev/sparktestsupport/modules.py (diff)
Commit 16e50356ee3dc4401c5cc9115411fe10128d4327 by dongjoon
[SPARK-34302][FOLLOWUP][SQL][TESTS] Update jdbc.v2.*IntegrationSuite

### What changes were proposed in this pull request?

This PR aims to update JDBC v2 integration suite by adding `catalogName`.

### Why are the changes needed?

To recover the integration test suite.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #33124 from dongjoon-hyun/SPARK-34302.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 16e5035)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala (diff)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala (diff)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala (diff)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresIntegrationSuite.scala (diff)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala (diff)
Commit 0a7a6f750c021a7fea10e3862898edf03a75c212 by dongjoon
[SPARK-35483][FOLLOWUP][TESTS] Update run-tests.py doctest

### What changes were proposed in this pull request?

This PR updates the doctests in `run-tests.py`.

### Why are the changes needed?

This should be consistent with the `modules.py` behavior.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the GitHub Action.

I checked manually.
```
$ python dev/run-tests.py
Cannot install SparkR as R was not found in PATH
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment local
[info] Found the following changed modules: root
[info] Setup the following environment variables for tests:

========================================================================
Running Apache RAT checks
========================================================================
RAT checks passed.
```

Closes #33127 from dongjoon-hyun/SPARK-35483-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 0a7a6f7)
The file was modified dev/run-tests.py (diff)
Commit 7635114d531639ccf70a88707251155a7122e54b by gengliang
[SPARK-35916][SQL] Support subtraction among Date/Timestamp/TimestampWithoutTZ

### What changes were proposed in this pull request?

Support the following operations:

- TimestampWithoutTZ - Date
- Date - TimestampWithoutTZ
- TimestampWithoutTZ - Timestamp
- Timestamp - TimestampWithoutTZ
- TimestampWithoutTZ - TimestampWithoutTZ

For subtraction between `TimestampWithoutTZ` and `Timestamp`, the `Timestamp` column is cast to `TimestampWithoutTZType`.
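
For illustration, a minimal PySpark sketch of the new behavior (the `TIMESTAMP_NTZ` literal syntax below is an assumption for illustration only, since the type is not user-visible yet):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Each subtraction below yields an interval; the TIMESTAMP_NTZ literal
# syntax is assumed here for illustration.
spark.sql("""
    SELECT TIMESTAMP_NTZ'2021-06-29 10:00:00' - DATE'2021-06-28'               AS ntz_minus_date,
           TIMESTAMP_NTZ'2021-06-29 10:00:00' - TIMESTAMP'2021-06-28 09:00:00' AS ntz_minus_ts
""").show(truncate=False)
```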

### Why are the changes needed?

Support basic subtraction among Date/Timestamp/TimestampWithoutTZ.

### Does this PR introduce _any_ user-facing change?

No, the timestamp without time zone type is not released yet.

### How was this patch tested?

Unit tests

Closes #33115 from gengliangwang/subtractTimestampWithoutTz.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 7635114)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/datetime.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/typeCoercion/native/promoteStrings.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/datetime.sql (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/typeCoercion/native/decimalPrecision.sql.out (diff)
Commit 5db51efa1a0d98ad74a94f7b73bcb1161817e0a5 by gurwls223
[SPARK-35721][PYTHON] Path level discover for python unittests

### What changes were proposed in this pull request?
Add path level discover for python unittests.

### Why are the changes needed?
Currently we need to specify Python test cases manually when we add a new test case. Sometimes we forget to add a test case to the module list, so it never gets executed.

Such as:
- pyspark-core pyspark.tests.test_pin_thread

Thus we need an auto-discovery mechanism that finds all test cases rather than specifying every case manually.
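
A minimal sketch of the idea with the standard library (the `start_dir` and pattern here are illustrative, not the exact logic added to `modules.py`):

```python
import unittest

# Discover every test_*.py module under a directory instead of listing
# each test case by hand.
suite = unittest.defaultTestLoader.discover(
    start_dir="python/pyspark/tests",  # illustrative path
    pattern="test_*.py",
)
print(suite.countTestCases())
```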

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add below code in end of `dev/sparktestsupport/modules.py`
```python
for m in sorted(all_modules):
    for g in sorted(m.python_test_goals):
        print(m.name, g)
```
Compare the result before and after:
https://www.diffchecker.com/iO3FvhKL

Closes #32867 from Yikun/SPARK_DISCOVER_TEST.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 5db51ef)
The file was modified dev/sparktestsupport/modules.py (diff)
The file was modified python/pyspark/pandas/tests/test_stats.py (diff)
The file was modified python/pyspark/pandas/tests/test_groupby.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_datetime.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames.py (diff)
The file was modified python/pyspark/pandas/tests/test_indexing.py (diff)
The file was modified python/pyspark/pandas/tests/test_series.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_base.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
Commit 78e6263cce949843c07c821356ea58c4c568ccdc by gengliang
[SPARK-35927][SQL] Remove type collection AllTimestampTypes

### What changes were proposed in this pull request?

Replace the type collection `AllTimestampTypes` with the new data type `AnyTimestampType`

### Why are the changes needed?

As discussed in https://github.com/apache/spark/pull/33115#discussion_r659866760, it is more convenient to have a new data type `AnyTimestampType` instead of using the type collection `AllTimestampTypes`:
1. It simplifies the pattern matching.
2. In the default type coercion rules, when implicitly casting a type to a TypeCollection type, Spark chooses the first convertible data type as the result. If we are going to make the default timestamp type configurable, having `AnyTimestampType` is better.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #33129 from gengliangwang/allTimestampTypes.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 78e6263)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala (diff)
Commit def738365e9c84a71dc10f24ac1ea60e6295b794 by wenchen
[SPARK-35923][SQL] Coalesce empty partition with mixed CoalescedPartitionSpec and PartialReducerPartitionSpec

### What changes were proposed in this pull request?

Skip empty partitions in `ShufflePartitionsUtil.coalescePartitionsWithSkew`.

### Why are the changes needed?

Since [SPARK-35447](https://issues.apache.org/jira/browse/SPARK-35447), we apply `OptimizeSkewedJoin` before `CoalesceShufflePartitions`. However, the order in which these two rules run makes a difference.

Say we have skewed partitions [0, 128MB, 0, 128MB, 0]:

* coalesce partitions first then optimize skewed partitions:
  [64MB, 64MB, 64MB, 64MB]
* optimize skewed partition first then coalesce partitions:
  [0, 64MB, 64MB, 0, 64MB, 64MB, 0]

So in `ShufflePartitionsUtil.coalescePartitionsWithSkew` we can coalesce across mixed `CoalescedPartitionSpec` and `PartialReducerPartitionSpec` entries when the `CoalescedPartitionSpec` is empty.
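
A toy model of the idea in plain Python (not the actual `ShufflePartitionsUtil` code), just to show how skipping empty partitions lets the remaining ones be grouped up to the target size:

```python
def coalesce_sizes(sizes, target):
    """Merge adjacent partition sizes up to `target`, skipping empty ones."""
    out, acc = [], 0
    for s in sizes:
        if s == 0:                    # skip empty partitions
            continue
        if acc and acc + s > target:  # close the current group
            out.append(acc)
            acc = 0
        acc += s
    if acc:
        out.append(acc)
    return out

mb = 1 << 20
# The example above: [0, 64MB, 64MB, 0, 64MB, 64MB, 0] -> two 128MB partitions
print(coalesce_sizes([0, 64 * mb, 64 * mb, 0, 64 * mb, 64 * mb, 0], 128 * mb))
```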

### Does this PR introduce _any_ user-facing change?

No, not released yet.

### How was this patch tested?

Add test.

Closes #33123 from ulysses-you/SPARK-35923.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: def7383)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala (diff)
Commit 7e7028282c817e0cdbcd402a3c89dbb487d38719 by dongjoon
[SPARK-35928][BUILD] Upgrade ASM to 9.1

### What changes were proposed in this pull request?

This PR aims to upgrade ASM to 9.1

### Why are the changes needed?

The latest `xbean-asm9-shaded` is built with ASM 9.1.

- https://mvnrepository.com/artifact/org.apache.xbean/xbean-asm9-shaded/4.20
- https://github.com/apache/geronimo-xbean/blob/5e0e3c0c6463f5b1e29185bdd9c2366c38371aa0/pom.xml#L67

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33130 from dongjoon-hyun/SPARK-35928.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 7e70282)
The file was modified core/pom.xml (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala (diff)
The file was modified graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala (diff)
The file was modified project/plugins.sbt (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified repl/pom.xml (diff)
The file was modified graphx/pom.xml (diff)
The file was modified pom.xml (diff)
The file was modified sql/core/pom.xml (diff)
Commit 2702fb9af0bbf61c97391c3dd41e9c39685e649e by ueshin
[SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark

### What changes were proposed in this pull request?

Cleaning up the type hints in pandas-on-Spark.

- Use a single file `_typing.py` for type variables or aliases
- Rename `IndexOpsLike` to `SeriesOrIndex`.
- Rename `T_Frame` and `T_IndexOps` to `FrameLike` and `IndexOpsLike` respectively
- Introduce `DataFrameOrSeries` for `Union[DataFrame, Series]`

### Why are the changes needed?

This is a cleanup for the mypy check stuff series.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33117 from ueshin/issues/SPARK-35859/cleanup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 2702fb9)
The file was modified python/pyspark/pandas/extensions.py (diff)
The file was modified python/pyspark/pandas/accessors.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/complex_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/null_ops.py (diff)
The file was modified python/pyspark/pandas/utils.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/datetime_ops.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/base.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/string_ops.py (diff)
The file was modified python/pyspark/pandas/window.py (diff)
The file was modified python/pyspark/pandas/base.py (diff)
The file was modified python/pyspark/pandas/series.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/date_ops.py (diff)
The file was modified python/pyspark/pandas/indexes/multi.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/num_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/boolean_ops.py (diff)
The file was modified python/pyspark/pandas/spark/accessors.py (diff)
The file was modified python/pyspark/pandas/groupby.py (diff)
The file was modified python/pyspark/pandas/internal.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/binary_ops.py (diff)
The file was modified python/pyspark/pandas/typedef/typehints.py (diff)
The file was modified python/pyspark/pandas/indexing.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/categorical_ops.py (diff)
The file was added python/pyspark/pandas/_typing.py
The file was modified python/pyspark/pandas/indexes/numeric.py (diff)
The file was modified python/pyspark/pandas/namespace.py (diff)
The file was modified python/pyspark/pandas/generic.py (diff)
The file was modified python/pyspark/pandas/indexes/base.py (diff)
Commit 4a17e7a5ae155c6a8586b457ff248be6d52d0a10 by dongjoon
[SPARK-35906][SQL] Remove order by if the maximum number of rows less than or equal to 1

### What changes were proposed in this pull request?

This PR removes `ORDER BY` if the maximum number of rows is less than or equal to 1. For example:
```scala
spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10").explain("cost")
```
Before this pr:
```
== Optimized Logical Plan ==
Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B)
+- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
   +- Project, Statistics(sizeInBytes=20.0 B)
      +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
```

After this pr:
```
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project, Statistics(sizeInBytes=20.0 B)
   +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33100 from wangyum/SPARK-35906.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 4a17e7a)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q61.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q96.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q92/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q96/simplified.txt (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q16/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q16/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94/simplified.txt (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q16.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q92/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q90.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q96.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q61/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q16.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q90/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q92.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q96/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q92.sf100/explain.txt (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q90.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q61/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q90/simplified.txt (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q61.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95/simplified.txt (diff)
Commit a6088e50363a2a39c2f14287b82d5e84a0efb53b by dongjoon
[SPARK-35924][BUILD][TESTS] Add Java 17 ea build test to GitHub action

### What changes were proposed in this pull request?
This PR aims to add Java 17-ea build test to GitHub action.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass newly added Java 17-ea GitHub action job.

Closes #33126 from williamhyun/SPARK-35924.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: a6088e5)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit 1f6e2f55d7896c9128f80a8f1ed4c317244d013b by ueshin
Revert "[SPARK-35721][PYTHON] Path level discover for python unittests"

This reverts commit 5db51efa1a0d98ad74a94f7b73bcb1161817e0a5.
(commit: 1f6e2f5)
The file was modified python/pyspark/pandas/tests/test_groupby.py (diff)
The file was modified python/pyspark/pandas/tests/test_stats.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_datetime.py (diff)
The file was modified python/pyspark/pandas/tests/test_series.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_base.py (diff)
The file was modified python/pyspark/pandas/tests/test_indexing.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby.py (diff)
The file was modified dev/sparktestsupport/modules.py (diff)
Commit 05c6b8acdc24bc6f982a63dfb5cca21cc9993312 by d_tsai
[SPARK-35921][BUILD] ${spark.yarn.isHadoopProvided} in config.properties is not edited if build with SBT

### What changes were proposed in this pull request?

This PR changes `SparkBuild.scala` so that a build with SBT edits `config.properties` in the `yarn` sub-module, just like a build with Maven does.

### Why are the changes needed?

The `yarn` sub-module contains `config.properties`:
```
spark.yarn.isHadoopProvided = ${spark.yarn.isHadoopProvided}
```

The `${spark.yarn.isHadoopProvided}` part is replaced with `true` or `false` at build time, depending on whether Hadoop is provided (specified by `-Phadoop-provided`).
The edited `config.properties` is loaded at runtime to control how the Hadoop-related classpath is populated.

This process works when building with Maven, but not with SBT.

If we build with SBT and deploy apps on YARN, the following warning appears and the classpath is not populated correctly.
```
21/06/29 10:51:20 WARN config.package: Can not load the default value of `spark.yarn.isHadoopProvided` from `org/apache/spark/deploy/yarn/config.properties` with error, java.lang.IllegalArgumentException: For input string: "${spark.yarn.isHadoopProvided}". Using `false` as a default value.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built with SBT, extracted `config.properties` from the build artifact, and confirmed that `${spark.yarn.isHadoopProvided}` was correctly replaced with `true` or `false`.
```
cat org/apache/spark/deploy/yarn/config.properties
spark.yarn.isHadoopProvided = false                                # In case build with -Pyarn and without -Phadoop-provided
spark.yarn.isHadoopProvided = true                                 # In case build with -Pyarn and -Phadoop-provided
```
I also confirmed the warning message shown above no longer appears.

Closes #33121 from sarutak/sbt-yarn-config-properties.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
(commit: 05c6b8a)
The file was modified project/SparkBuild.scala (diff)
Commit 9a5cd15e8726ccd93a550f90e8113b80fc6d0122 by mridulatgmail.com
[SPARK-32922][SHUFFLE][CORE] Adds support for executors to fetch local and remote merged shuffle data

### What changes were proposed in this pull request?
This is the shuffle fetch side change where executors can fetch local/remote push-merged shuffle data from shuffle services. This is needed for push-based shuffle - SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
The change adds support to the `ShuffleBlockFetcherIterator` to fetch push-merged block meta and shuffle chunks from local and remote ESS. If fetching any of these fails, the iterator falls back to fetching the original shuffle blocks that belonged to the push-merged block.

### Why are the changes needed?
These changes are needed for push-based shuffle. Refer to the SPIP in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).

### Does this PR introduce _any_ user-facing change?
When push-based shuffle is turned on, executors will fetch push-merged blocks from the remote shuffle service. The client logs will indicate this.

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Lead-authored-by: Chandni Singh chsingh@linkedin.com
Co-authored-by: Min Shen mshen@linkedin.com
Co-authored-by: Ye Zhou yezhou@linkedin.com

Closes #32140 from otterc/SPARK-32922.

Lead-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: otterc <singh.chandni@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: 9a5cd15)
The file was modified core/src/test/scala/org/apache/spark/storage/BlockIdSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockId.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManager.scala (diff)
The file was added core/src/main/scala/org/apache/spark/storage/PushBasedFetchHelper.scala
The file was modified core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/MapOutputTracker.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/serializer/SerializerManager.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/shuffle/BlockStoreShuffleReader.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala (diff)
Commit 3257a30e5399d4f366e4aae60b04371b31514fb4 by viirya
[SPARK-35784][SS] Implementation for RocksDB instance

### What changes were proposed in this pull request?
The implementation for the RocksDB instance, which is used in the RocksDB state store. It serves as a handler for the RocksDB instance and the RocksDBFileManager.

### Why are the changes needed?
Part of the RocksDB state store implementation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UT added.

Closes #32928 from xuanyuanking/SPARK-35784.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 3257a30)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala
The file was modified sql/core/pom.xml (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBLoader.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
Commit 28a201a442812a1d4b35887ef0e89e4e470ae969 by gurwls223
[SPARK-35873][PYTHON] Cleanup the version logic from the pandas API on Spark

### What changes were proposed in this pull request?

This PR proposes removing the legacy Koalas version from the pandas API on Spark package.

It also removes the Python version check logic, since pandas-on-Spark now follows PySpark's Python version.

### Why are the changes needed?

Since Koalas has been ported into PySpark, we don't need to keep the version logic for Koalas.

### Does this PR introduce _any_ user-facing change?

Legacy Koalas users should now follow the version from PySpark.

### How was this patch tested?

Manually built the package and confirmed it completes successfully.

Closes #33128 from itholic/SPARK-35873.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 28a201a)
The file was removed python/pyspark/pandas/version.py
The file was modified python/pyspark/pandas/__init__.py (diff)
Commit 7ad682aaa169983f60c0e48547825ed2b255777e by gurwls223
Revert "[SPARK-34549][BUILD] Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844"

### What changes were proposed in this pull request?

This PR reverts the change of SPARK-34549 (#31658).

### Why are the changes needed?

See #33133.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Closes #33145 from sarutak/revert-SPARK-34549.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 7ad682a)
The file was modified pom.xml (diff)
Commit 0a838dcd71c733289e60d9f74e8267027c7b2c4a by gurwls223
[SPARK-35943][PYTHON] Introduce Axis type alias

### What changes were proposed in this pull request?

Introduces an `Axis` type alias for the `axis` argument, for consistency.

### Why are the changes needed?

Many places take an `axis` argument. We should define an `Axis` type alias and reuse it for consistency.
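
A minimal sketch of the alias (the real definition lives in `python/pyspark/pandas/_typing.py`; the exact shape and the `validate_axis` helper here are illustrative):

```python
from typing import Optional, Union

Axis = Optional[Union[int, str]]  # e.g. 0/1 or "index"/"columns"

def validate_axis(axis: Axis = 0) -> int:
    # Normalize the axis argument to 0 or 1 (illustrative helper).
    axis = {None: 0, "index": 0, "columns": 1}.get(axis, axis)
    if axis not in (0, 1):
        raise ValueError("No axis named {}".format(axis))
    return axis

print(validate_axis("columns"))  # -> 1
```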

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33144 from ueshin/issues/SPARK-35943/axis.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 0a838dc)
The file was modified python/pyspark/pandas/namespace.py (diff)
The file was modified python/pyspark/pandas/utils.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/pandas/series.py (diff)
The file was modified python/pyspark/pandas/groupby.py (diff)
The file was modified python/pyspark/pandas/_typing.py (diff)
The file was modified python/pyspark/pandas/generic.py (diff)
The file was modified python/pyspark/pandas/base.py (diff)
Commit 24b67ca9a837250d25dbcd189b75c919c06aec26 by kabhwan.opensource
[SPARK-35896][SS] Include more granular metrics for stateful operators in StreamingQueryProgress

### What changes were proposed in this pull request?

Currently the `StateOperatorProgress` in `StreamingQueryProgress` is missing a few metrics.

### Why are the changes needed?

The main motivation is to find hotspots and to have better visibility into stateful operations. Detailed explanations are in [SPARK-35896](https://issues.apache.org/jira/browse/SPARK-35896).

### Does this PR introduce _any_ user-facing change?

Yes. The `StateOperatorProgress` entries within `StreamingQueryProgress` now contain additional fields as listed in [SPARK-35896](https://issues.apache.org/jira/browse/SPARK-35896). Example `StreamingQueryProgress` output in JSON form.
Before:
```
{
  "id" : "510be3cd-a955-4faf-8456-d97c78d39af5",
  ....
  "durationMs" : {
    "triggerExecution" : 2856,
    ....
  },
  "stateOperators" : [ {
    "numRowsTotal" : 1,
    "numRowsUpdated" : 1,
    "numRowsDroppedByWatermark" : 0,
    "customMetrics" : {
      "loadedMapCacheHitCount" : 0,
      "loadedMapCacheMissCount" : 0,
      "stateOnCurrentVersionSizeBytes" : 392
    }
  }],
  ....
}
```
After:
```
{
  "id" : "510be3cd-a955-4faf-8456-d97c78d39af5",
  ....
  "durationMs" : {
    "triggerExecution" : 2856,
    ....
  },
  "stateOperators" : [ {
    "operatorName" : "dedupe", <-- new
    "numRowsTotal" : 1,
    "numRowsUpdated" : 1, <-- new
    "allUpdatesTimeMs" : 56, <-- new
    "numRowsRemoved" : 2, <-- new
    "allRemovalsTimeMs" : 45, <-- new
    "commitTimeMs" : 40, <-- new
    "numRowsDroppedByWatermark" : 0,
    "numShufflePartitions" : 2, <-- new
    "numStateStoreInstances" : 2, <-- new
    "customMetrics" : {
      "loadedMapCacheHitCount" : 0,
      "loadedMapCacheMissCount" : 0,
      "stateOnCurrentVersionSizeBytes" : 392
    }
  }],
  ....
}
```

### How was this patch tested?

Existing tests for regressions. Added new UTs.

Closes #33091 from vkorukanti/SPARK-35896.

Lead-authored-by: Venki Korukanti <venki.korukanti@gmail.com>
Co-authored-by: Venki Korukanti <venki.korukanti@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(commit: 24b67ca)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryStatusAndProgressSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/streamingLimits.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsWithStateSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StateStoreMetricsTest.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingAggregationSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala (diff)
Commit 064230de972cfbd31d80bbb82de4f95a25fc9da7 by viirya
[SPARK-35829][SQL] Clean up evaluated subexpressions and add more flexibility to evaluate a particular subexpression

### What changes were proposed in this pull request?

This patch refactors the evaluation of subexpressions.

There are two changes:

1. Clean up subexpression code after evaluation to avoid duplicate evaluation.
2. Evaluate all child subexpressions when evaluating a subexpression.

### Why are the changes needed?

Currently `subexpressionEliminationForWholeStageCodegen` returns the generated code of subexpressions, and the caller simply puts the code into its code block. We need more flexible evaluation here. For example, for the Filter operator's subexpression evaluation, we may need to evaluate a particular subexpression for one predicate. The current approach cannot satisfy this requirement.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #32980 from viirya/subexpr-eval.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 064230d)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala (diff)
Commit 8d28839689614b497be06743ef04a70f815ae0cb by dongjoon
[SPARK-35946][PYTHON] Respect Py4J server in InheritableThread API

### What changes were proposed in this pull request?

Currently, we set the environment variable `PYSPARK_PIN_THREAD` at the client side of the `InheritableThread` API for Py4J (`python/pyspark/util.py`). If the Py4J gateway is created somewhere else (e.g., Zeppelin, etc.), it could introduce a breakage at:

```python
from pyspark import SparkContext
jvm = SparkContext._jvm
thread_connection = jvm._gateway_client.get_thread_connection()
# `AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'` (non-pinned thread mode)
# `get_thread_connection` is only in 'ClientServer' (pinned thread mode)
```

This PR proposes to check how the given gateway was created and apply the pinned-thread-mode behaviour accordingly, so we avoid breakage when the Py4J server/gateway is created separately elsewhere without pinned thread mode.

### Why are the changes needed?

To avoid any potential breakage.

### Does this PR introduce _any_ user-facing change?

No, the change happened only in the master (https://github.com/apache/spark/commit/fdd7ca5f4e35a906090f3c6b160bdba9ac9fd0ca).

### How was this patch tested?

This is actually a partial revert of https://github.com/apache/spark/commit/fdd7ca5f4e35a906090f3c6b160bdba9ac9fd0ca. As long as the existing tests pass, I guess we're all good.

I also manually tested to make doubly sure:

**Before**:

```python
>>> from pyspark import InheritableThread, inheritable_thread_target
>>> InheritableThread(lambda: 1).start()
>>> inheritable_thread_target(lambda: 1)()
Traceback (most recent call last):
  File "/.../python3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/.../python3.8/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/.../spark/python/pyspark/util.py", line 361, in copy_local_properties
    InheritableThread._clean_py4j_conn_for_current_thread()
  File "/.../spark/python/pyspark/util.py", line 381, in _clean_py4j_conn_for_current_thread
    thread_connection = jvm._gateway_client.get_thread_connection()
AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/util.py", line 324, in wrapped
    InheritableThread._clean_py4j_conn_for_current_thread()
  File "/.../spark/python/pyspark/util.py", line 381, in _clean_py4j_conn_for_current_thread
    thread_connection = jvm._gateway_client.get_thread_connection()
AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'
```

**After**:

```python
>>> from pyspark import InheritableThread, inheritable_thread_target
>>> InheritableThread(lambda: 1).start()
>>> inheritable_thread_target(lambda: 1)()
1
```

Closes #33147 from HyukjinKwon/SPARK-35946.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 8d28839)
The file was modified python/pyspark/util.py (diff)
Commit ad4b6796f6940e453feab796484694b6f6e69e54 by gengliang
[SPARK-35937][SQL] Extracting date field from timestamp should work in ANSI mode

### What changes were proposed in this pull request?

Add a new ANSI type coercion rule: when getting a date field from a Timestamp column, cast the column as Date type.

This is Spark's current hack to keep the implementation simple. In the default type coercion rules, the implicit cast rule does the work. However, the ANSI implicit cast rule doesn't allow converting Timestamp type to Date type, so we need this additional rule to make sure date field extraction from Timestamp columns works.
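
For example (a minimal sketch; a query like this failed to analyze under ANSI mode before the fix):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")
# The new rule casts the timestamp operand down to date before
# applying the date-field function.
spark.sql("SELECT year(TIMESTAMP '2021-06-30 12:00:00')").show()
```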

### Why are the changes needed?

Fix a bug.

### Does this PR introduce _any_ user-facing change?

No, the new type coercion rules are not released yet.

### How was this patch tested?

Unit test

Closes #33138 from gengliangwang/fixGetDateField.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: ad4b679)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercionSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/timestamp.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleIdCollection.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala (diff)
Commit 76682268d746e72f0e8aa4cc64860e0bfd90f1ed by max.gekk
Revert "[SPARK-33995][SQL] Expose make_interval as a Scala function"

### What changes were proposed in this pull request?
This reverts commit e6753c9402b5c40d9e2af662f28bd4f07a0bae17.

### Why are the changes needed?
The `make_interval` function aims to construct values of the legacy interval type `CalendarIntervalType`, which will be substituted by ANSI interval types (see SPARK-27790). Since the function has not been released yet, it is better not to expose it via the public API at all.

### Does this PR introduce _any_ user-facing change?
It should not, since the `make_interval` function has not been released yet.

### How was this patch tested?
By existing test suites, and GA/jenkins builds.

Closes #33143 from MaxGekk/revert-make_interval.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 7668226)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala (diff)
The file was removed sql/core/src/test/java/test/org/apache/spark/sql/JavaDateFunctionsSuite.java
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
Commit d28ca9cc9808828118be64a545c3478160fdc170 by max.gekk
[SPARK-35935][SQL] Prevent failure of `MSCK REPAIR TABLE` on table refreshing

### What changes were proposed in this pull request?
In the PR, I propose to catch all non-fatal exceptions coming from `refreshTable()` at the final stage of table repairing, and output an error message instead of failing with an exception.

### Why are the changes needed?
1. The uncaught exceptions from table refreshing might be considered a regression compared to previous Spark versions. Table refreshing was introduced by https://github.com/apache/spark/pull/31066.
2. This should improve the user experience with Spark SQL, for instance when `MSCK REPAIR TABLE` is performed in a chain of SQL commands where catching exceptions is difficult or even impossible.

### Does this PR introduce _any_ user-facing change?
Yes. Before the changes, the `MSCK REPAIR TABLE` command could fail with the exception portrayed in SPARK-35935. After the changes, the same command outputs an error message and completes successfully.

### How was this patch tested?
By existing test suites.

Closes #33137 from MaxGekk/msck-repair-catch-except.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: d28ca9c)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala (diff)
Commit 5312008cca074748cac78a57359c9ad9a729df92 by gurwls223
[SPARK-35947][INFRA] Increase JVM stack size in release-build.sh

### What changes were proposed in this pull request?

Like SPARK-35825, this PR aims to increase JVM stack size via `MAVEN_OPTS` in release-build.sh.

### Why are the changes needed?

This will mitigate the failures in the snapshot-publishing GitHub Action job and during releases.

- https://github.com/apache/spark/actions/workflows/publish_snapshot.yml (3-day consecutive failures)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33149 from dongjoon-hyun/SPARK-35947.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 5312008)
The file was modified dev/create-release/release-build.sh (diff)
Commit b218cc90cfa957bdbf443ed3dbdfdf660bc1312d by gurwls223
[SPARK-35948][INFRA] Simplify release scripts by removing Spark 2.4/Java7 parts

### What changes were proposed in this pull request?

This PR aims to clean up Spark 2.4 and Java7 code path from the release scripts.

### Why are the changes needed?

To simplify the logic.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33150 from dongjoon-hyun/SPARK-35948.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: b218cc9)
The file was modified dev/create-release/release-build.sh (diff)
The file was modified dev/create-release/release-util.sh (diff)
Commit 6bbfb45ffe75aa6c27a7bf3c3385a596637d1822 by gurwls223
[SPARK-33298][CORE][FOLLOWUP] Add Unstable annotation to `FileCommitProtocol`

### What changes were proposed in this pull request?

This is the followup from https://github.com/apache/spark/pull/33012#discussion_r659440833, where we want to add `Unstable` to `FileCommitProtocol` to give people a better idea of the API.

### Why are the changes needed?

Make it easier for people to follow and understand code. Clean up code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests, as no real logic change.

Closes #33148 from c21/bucket-followup.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 6bbfb45)
The file was modified core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala (diff)
Commit 4dd41b96785b1a937b9fd2e62f782d6975f3bb23 by gengliang
[SPARK-34365][AVRO] Add support for positional Catalyst-to-Avro schema matching

### What changes were proposed in this pull request?
Provide the (configurable) ability to perform Avro-to-Catalyst schema field matching using the position of the fields instead of their names. A new `option` is added for the Avro datasource, `positionalFieldMatching`, which instructs `AvroSerializer`/`AvroDeserializer` to perform positional field matching instead of matching by name.
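
A hedged sketch of the new option on the write path (assuming the `spark-avro` package is on the classpath; the schema and output path are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Catalyst fields (id, name) map onto Avro fields (key, value) by position.
avro_schema = """{
  "type": "record", "name": "rec", "fields": [
    {"name": "key",   "type": "long"},
    {"name": "value", "type": "string"}
  ]
}"""
df = spark.createDataFrame([(1, "a")], ["id", "name"])

(df.write.format("avro")
    .option("avroSchema", avro_schema)
    .option("positionalFieldMatching", "true")  # match by position, not name
    .save("/tmp/positional_avro"))              # illustrative path
```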

### Why are the changes needed?
This by-name matching is somewhat recent; prior to PR #24635, at least on the write path, schemas were matched positionally ("structural" comparison). While by-name matching is better behavior as a default, it is better to make this configurable by the user. Even at the time PR #24635 was handled, there was [interest in making this behavior configurable](https://github.com/apache/spark/pull/24635#issuecomment-494205251), but it appears it went unaddressed.

There is precedent for making this behavior configurable, as seen in PR #29737, which added this support for ORC. Besides this precedent, Hive's behavior is to match positionally ([ref](https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles)), so this is behavior that Hadoop/Hive ecosystem users are familiar with.

### Does this PR introduce _any_ user-facing change?
Yes, a new option is provided for the Avro datasource, `positionalFieldMatching`, which provides compatibility with Hive and pre-3.0.0 Spark behavior.

### How was this patch tested?
New unit tests are added within `AvroSuite`, `AvroSchemaHelperSuite`, and `AvroSerdeSuite`; and most of the existing tests within `AvroSerdeSuite` are adapted to perform the same test using by-name and positional matching to ensure feature parity.

Closes #31490 from xkrogen/xkrogen-SPARK-34365-avro-positional-field-matching.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 4dd41b9)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala (diff)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala (diff)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSchemaHelperSuite.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriter.scala (diff)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSerdeSuite.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala (diff)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroRowReaderSuite.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroPartitionReaderFactory.scala (diff)
The file was modified docs/sql-data-sources-avro.md (diff)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala (diff)
Commit e3bd817d65ef65c68e40a2937aab0ec70a4afb6f by wenchen
[SPARK-34920][CORE][SQL] Add error classes with SQLSTATE

### What changes were proposed in this pull request?

Unifies exceptions thrown from Spark under a single base trait `SparkError`, which unifies:
- Error classes
- Parametrized error messages
- SQLSTATE, as discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Add-error-IDs-td31126.html.
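
A hypothetical sketch of one entry in the new `error-classes.json`, written as a Python literal for illustration (the class name, message text, and field names are assumptions, not the file's exact contents):

```python
ERROR_CLASSES = {
    "DIVIDE_BY_ZERO": {
        "message": ["divide by zero"],  # parametrized error message parts
        "sqlState": "22012",            # standard SQLSTATE code
    }
}
```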

### Why are the changes needed?

- Adding error classes creates a consistent label for exceptions, even as error messages change
- Creating a single, centralized source-of-truth for parametrized error messages improves auditing for error message quality
- Adding SQLSTATE helps ODBC/JDBC users receive standardized error codes

### Does this PR introduce _any_ user-facing change?

Yes, changes ODBC experience by:
- Adding error classes to error messages
- Adding SQLSTATE to TStatus

### How was this patch tested?

Unit tests, as well as local tests with PyODBC.

Closes #32850 from karenfeng/SPARK-34920.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: e3bd817)
The file was modified sql/core/src/test/resources/sql-tests/results/udf/postgreSQL/udf-case.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerWithSparkContextSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/select_having.sql.out (diff)
The file was added core/src/main/resources/error/error-classes.json
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServerErrors.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/SparkException.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLDriver.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/case.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/interval.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala (diff)
The file was added core/src/test/scala/org/apache/spark/SparkErrorSuite.scala
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/int8.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala (diff)
The file was added core/src/main/scala/org/apache/spark/SparkError.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/udf/postgreSQL/udf-select_having.sql.out (diff)
The file was added core/src/main/resources/error/README.md
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala (diff)
Commit c6afd6ed5296980e81160e441a4e9bea98c74196 by gengliang
[SPARK-35951][DOCS] Add since versions for Avro options in Documentation

### What changes were proposed in this pull request?

Two new Avro options, `datetimeRebaseMode` and `positionalFieldMatching`, were added as of Spark 3.2.
We should document the since-version so that users know whether an option works in their Spark version.

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Manual preview on local setup.
<img width="828" alt="Screen Shot 2021-06-30 at 5 05 54 PM" src="https://user-images.githubusercontent.com/1097932/123934000-ba833b00-d947-11eb-9ca5-ce8ff8add74b.png">

<img width="711" alt="Screen Shot 2021-06-30 at 5 06 34 PM" src="https://user-images.githubusercontent.com/1097932/123934126-d4bd1900-d947-11eb-8d80-69df8f3d9900.png">

Closes #33153 from gengliangwang/version.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: c6afd6e)
The file was modified docs/sql-data-sources-avro.md (diff)
Commit e88aa49287592cee7f1aa804cf945a3e0ad99eda by gengliang
[SPARK-35932][SQL] Support extracting hour/minute/second from timestamp without time zone

### What changes were proposed in this pull request?

Support extracting hour/minute/second fields from timestamp without time zone values. In details, the following syntaxes are supported:

- extract [hour | minute | second] from timestampWithoutTZ
- date_part('[hour | minute | second]', timestampWithoutTZ)
- hour(timestampWithoutTZ)
- minute(timestampWithoutTZ)
- second(timestampWithoutTZ)
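
A hedged sketch of the syntaxes above (the `TIMESTAMP_NTZ` literal is assumed for illustration, since the type is not released yet):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    SELECT extract(hour FROM TIMESTAMP_NTZ'2021-06-30 17:25:45'),
           date_part('minute', TIMESTAMP_NTZ'2021-06-30 17:25:45'),
           second(TIMESTAMP_NTZ'2021-06-30 17:25:45')
""").show()
```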

### Why are the changes needed?

Support basic operations for the new timestamp type.

### Does this PR introduce _any_ user-facing change?

No, the timestamp without time zone type is not release yet.

### How was this patch tested?

Unit test

Closes #33136 from gengliangwang/field.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: e88aa49)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/extract.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/extract.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
Commit 2febd5c3f0c3a0c6660cfb340eb65316a1ca4acd by max.gekk
[SPARK-35735][SQL] Take into account day-time interval fields in cast

### What changes were proposed in this pull request?
Support taking day-time interval fields into account in cast.

### Why are the changes needed?
To conform to the SQL standard.

### Does this PR introduce _any_ user-facing change?
A user can use `cast(str, DayTimeInterval(DAY, HOUR))`, for instance.
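
A hedged illustration at the SQL level (the string format `'1 10'`, meaning 1 day 10 hours, is an assumption for this sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Cast a string to a day-time interval restricted to DAY TO HOUR fields.
spark.sql("SELECT CAST('1 10' AS INTERVAL DAY TO HOUR)").show(truncate=False)
```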

### How was this patch tested?
Added UT.

Closes #32943 from AngersZhuuuu/SPARK-35735.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 2febd5c)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala (diff)
Commit 733e85f1f48d719cfbc7b23852d7d772ec0e55ca by gengliang
[SPARK-35953][SQL] Support extracting date fields from timestamp without time zone

### What changes were proposed in this pull request?

Support extracting date fields from timestamp without time zone, which includes:
- year
- month
- day
- year of week
- week
- day of week
- quarter
- day of month
- day of year

### Why are the changes needed?

Support basic operations for the new timestamp type.

### Does this PR introduce _any_ user-facing change?

No, the timestamp without time zone type is not released yet.

### How was this patch tested?

Unit tests

Closes #33156 from gengliangwang/dateField.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 733e85f)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/extract.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/extract.sql (diff)
Commit 2c94fbc71e0811f0023a7776c82ff3d69d6cfad4 by incomplete
initial commit for skeleton ansible for jenkins worker config

### What changes were proposed in this pull request?
this is the skeleton of the ansible used to configure jenkins workers in the riselab/apache spark build system

### Why are the changes needed?
they are not needed, but will help the community understand how to build systems to test multiple versions of spark, as well as propose changes that i can integrate into the "production" riselab repo.  since we're sunsetting jenkins by EOY 2021, this will potentially be useful for migrating the build system.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ansible-lint and much wailing and gnashing of teeth.

Closes #32178 from shaneknapp/initial-ansible-commit.

Lead-authored-by: shane knapp <incomplete@gmail.com>
Co-authored-by: shane <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
(commit: 2c94fbc)
The file was added dev/ansible-for-test-node/roles/jenkins-worker/README.md
The file was added dev/ansible-for-test-node/roles/jenkins-worker/files/python_environments/base-py3-spec.txt
The file was added dev/ansible-for-test-node/roles/jenkins-worker/files/util_scripts/post_github_pr_comment.py
The file was modified dev/tox.ini (diff)
The file was added dev/ansible-for-test-node/roles/jenkins-worker/tasks/install_minikube.yml
The file was added dev/ansible-for-test-node/roles/jenkins-worker/tasks/jenkins_userspace.yml
The file was added dev/ansible-for-test-node/roles/jenkins-worker/tasks/install_docker.yml
The file was added dev/ansible-for-test-node/roles/jenkins-worker/tasks/install_build_packages.yml
The file was added dev/ansible-for-test-node/roles/jenkins-worker/tasks/main.yml
The file was added dev/ansible-for-test-node/roles/jenkins-worker/files/worker-limits.conf
The file was added dev/ansible-for-test-node/roles/jenkins-worker/files/python_environments/py36.txt
The file was added dev/ansible-for-test-node/roles/jenkins-worker/files/util_scripts/kill_zinc_nailgun.py
The file was added dev/ansible-for-test-node/roles/common/README.md
The file was added dev/ansible-for-test-node/roles/jenkins-worker/defaults/main.yml
The file was modified dev/.rat-excludes (diff)
The file was added dev/ansible-for-test-node/roles/common/tasks/system_packages.yml
The file was added dev/ansible-for-test-node/roles/jenkins-worker/tasks/cleanup.yml
The file was added dev/ansible-for-test-node/README.md
The file was added dev/ansible-for-test-node/roles/jenkins-worker/files/python_environments/base-py3-pip.txt
The file was added dev/ansible-for-test-node/roles/jenkins-worker/files/scripts/jenkins-gitcache-cron
The file was added dev/ansible-for-test-node/roles/jenkins-worker/files/python_environments/spark-py3k-spec.txt
The file was added dev/ansible-for-test-node/roles/jenkins-worker/files/util_scripts/session_lock_resource.py
The file was added dev/ansible-for-test-node/deploy-jenkins-worker.yml
The file was added dev/ansible-for-test-node/roles/jenkins-worker/files/python_environments/spark-py36-spec.txt
The file was added dev/ansible-for-test-node/roles/jenkins-worker/vars/main.yml
The file was added dev/ansible-for-test-node/roles/jenkins-worker/tasks/install_anaconda.yml
The file was added dev/ansible-for-test-node/roles/common/tasks/main.yml
The file was added dev/ansible-for-test-node/roles/jenkins-worker/files/python_environments/spark-py2-pip.txt
The file was added dev/ansible-for-test-node/roles/common/tasks/setup_local_userspace.yml
The file was added dev/ansible-for-test-node/roles/jenkins-worker/tasks/install_spark_build_packages.yml
Commit d46c1e38eccae67bc7ccaa920ceb3abdd8866d10 by wenchen
[SPARK-35725][SQL] Support optimize skewed partitions in RebalancePartitions

### What changes were proposed in this pull request?

* Add a new rule `OptimizeSkewInRebalancePartitions` in the AQE `queryStageOptimizerRules`
* Add a new config `spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled` to decide whether to enable the new rule

The new rule `OptimizeSkewInRebalancePartitions` only handles the two shuffle origins `REBALANCE_PARTITIONS_BY_NONE` and `REBALANCE_PARTITIONS_BY_COL` for the data skew issue, and reuses the existing config `ADVISORY_PARTITION_SIZE_IN_BYTES` to decide the target partition size.

### Why are the changes needed?

Currently, we don't support expanding partitions dynamically in AQE, which is unfriendly to jobs with data skew.

Say we have a simple query:
```
SELECT /*+ REBALANCE(col) */ * FROM table
```

If the column `col` is skewed, some shuffle partitions will handle much more data than others.

Since we haven't introduced an extra shuffle, we can optimize this case by expanding partitions in AQE.
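
A hedged usage sketch, assuming the config names as given above; `spark.sql.adaptive.advisoryPartitionSizeInBytes` is the usual SQL-conf spelling of `ADVISORY_PARTITION_SIZE_IN_BYTES`, and the table name is a placeholder:

```
# Illustrative sketch: enable the new rule, then rebalance by a possibly
# skewed column; skewed output partitions may then be expanded by AQE.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
# The existing advisory size decides the target partition size.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
df = spark.sql("SELECT /*+ REBALANCE(col) */ * FROM table")
```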

### Does this PR introduce _any_ user-facing change?

Yes, a new config

### How was this patch tested?

Add test

Closes #32883 from ulysses-you/expand-partition.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: d46c1e3)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewInRebalancePartitions.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
Commit a5c886619dd1573e96bbba058db099b47f0c147c by dongjoon
[SPARK-34859][SQL] Handle column index when using vectorized Parquet reader

### What changes were proposed in this pull request?

Make the current vectorized Parquet reader work with the column index introduced in Parquet 1.11. In particular, this PR makes the following changes:
1. In `ParquetReadState`, track row ranges returned via `PageReadStore.getRowIndexes` as well as the first row index for each page via `DataPage.getFirstRowIndex`.
2. Introduce a new API `ParquetVectorUpdater.skipValues` which skips a batch of values from a Parquet value reader. As part of the process, rename the existing `updateBatch` to `readValues` and `update` to `readValue` to keep the method names consistent.
3. Correspondingly, introduce new APIs `VectorizedValuesReader.skipXXX` for the different data types, along with their implementations. These are useful when the reader knows that a given batch of values can be skipped, for instance because the batch is not covered by the row ranges generated by column index filtering.
4. Change `VectorizedRleValuesReader` to handle column index filtering. This is done by comparing the range that is going to be read next within the current RLE/PACKED block (let's call this the block range) against the current row range. There are three cases (see the sketch after this list):
    * if the block range is before the current row range, skip all the values in the block range
    * if the block range is after the current row range, advance the row range and repeat the steps
    * if the block range overlaps with the current row range, only read the values within the overlapping area and skip the rest.
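
A minimal Python sketch of the three-way decision above; this is illustrative pseudo-logic only, not the actual Java implementation, and all names are made up:

```
# Pseudo-logic: match an RLE/PACKED block range against the current row
# range kept from the column index. Ranges are inclusive (start, end) pairs.
def plan_block(block_range, row_range):
    b_start, b_end = block_range
    r_start, r_end = row_range
    if b_end < r_start:
        # Block is entirely before the row range: skip all its values.
        return ("skip", b_start, b_end)
    if b_start > r_end:
        # Block is entirely after the row range: advance the range and retry.
        return ("advance_range",)
    # Overlap: read only the intersection, skip the rest of the block.
    return ("read", max(b_start, r_start), min(b_end, r_end))
```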

### Why are the changes needed?

[Parquet Column Index](https://github.com/apache/parquet-format/blob/master/PageIndex.md) is a new feature in Parquet 1.11 which allows very efficient filtering on page level (some benchmark numbers can be found [here](https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/)), especially when data is sorted. The feature is largely implemented in parquet-mr (via classes such as `ColumnIndex` and `ColumnIndexFilter`). In Spark, the non-vectorized Parquet reader can automatically benefit from the feature after upgrading to Parquet 1.11.x, without any code change. However, the same is not true for vectorized Parquet reader since Spark chose to implement its own logic such as reading Parquet pages, handling definition levels, reading values into columnar batches, etc.

Previously, [SPARK-26345](https://issues.apache.org/jira/browse/SPARK-26345) / (#31393) updated Spark to only scan pages filtered by column index from the parquet-mr side. This is done by calling the `ParquetFileReader.readNextFilteredRowGroup` and `ParquetFileReader.getFilteredRecordCount` APIs. The implementation, however, only works in a few limited cases: in the scenario where there are multiple columns whose type widths differ (e.g., `int` and `bigint`), it could return incorrect results. For this issue, please see SPARK-34859 for a detailed description.

In order to fix the above, Spark needs to leverage the API `PageReadStore.getRowIndexes` and `DataPage.getFirstRowIndex`. The former returns the indexes of all rows (note the difference between rows and values: for flat schema there is no difference between the two, but for nested schema they're different) after filtering within a Parquet row group. The latter returns the first row index within a single data page. With the combination of the two, one is able to know which rows/values should be filtered while scanning a Parquet page.

### Does this PR introduce _any_ user-facing change?

Yes. Now the vectorized Parquet reader should work correctly with column index.

### How was this patch tested?

Borrowed tests from #31998 and added a few more tests.

Closes #32753 from sunchao/SPARK-34859.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Li Xian <lxian2shell@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: a5c8866)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnIndexSuite.scala
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdater.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetReadState.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java (diff)
Commit 9e39415f3a2014444af8301faef9c48a43e45330 by gurwls223
[SPARK-35939][DOCS][PYTHON] Deprecate Python 3.6 in Spark documentation

### What changes were proposed in this pull request?

Deprecate Python 3.6 in Spark documentation

### Why are the changes needed?

According to https://endoflife.date/python, Python 3.6 will be EOL on 23 Dec, 2021.
We should prepare for the deprecation of Python 3.6 support in Spark in advance.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Manual tests.

Closes #33141 from xinrong-databricks/deprecate3.6_doc.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 9e39415)
The file was modified docs/rdd-programming-guide.md (diff)
The file was modified docs/index.md (diff)
Commit 5ad12611ec23d02b5988b92561b70dbeacf6af78 by gurwls223
[SPARK-35938][PYTHON] Add deprecation warning for Python 3.6

### What changes were proposed in this pull request?

Add deprecation warning for Python 3.6.
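
A hedged sketch of the kind of check this adds; the exact message and warning category used in `python/pyspark/context.py` may differ:

```
# Illustrative sketch only: warn when running on soon-to-be-EOL Python 3.6.
import sys
import warnings

if sys.version_info < (3, 7):
    warnings.warn(
        "Python 3.6 support is deprecated in Spark 3.2.",
        DeprecationWarning,
    )
```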

### Why are the changes needed?

According to https://endoflife.date/python, Python 3.6 will be EOL on 23 Dec, 2021.
We should prepare for the deprecation of Python 3.6 support in Spark in advance.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Manual tests.

Closes #33139 from xinrong-databricks/deprecate3.6_warn.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 5ad1261)
The file was modified python/pyspark/context.py (diff)
Commit a98c8ae57d2daeedf46c3be8eafa9170db9e4805 by gurwls223
[SPARK-35944][PYTHON] Introduce Name and Label type aliases

### What changes were proposed in this pull request?

Introduce `Name` and `Label` type aliases to distinguish what is expected instead of `Any` or `Union[Any, Tuple]` (a minimal sketch follows the definitions).

- `Label`: `Tuple[Any, ...]`
  Internal expression for name-like metadata, like `index_names`, `column_labels`, and `column_label_names` in `InternalFrame`, and similar internal structures.
- `Name`: `Union[Any, Label]`
  External expression for user-facing names, which can be scalar values or tuples.
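
A minimal sketch of the aliases as described above; the actual definitions live in `python/pyspark/pandas/_typing.py`, so this rendering is an approximation:

```
# Approximate rendering of the aliases described above.
from typing import Any, Tuple, Union

Label = Tuple[Any, ...]   # internal, name-like metadata, e.g. column_labels
Name = Union[Any, Label]  # external, user-facing name: a scalar or a tuple
```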

### Why are the changes needed?

Currently `Any` or `Union[Any, Tuple]` is used for name-like types, but type aliases should be used to make clear what is expected.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33159 from ueshin/issues/SPARK-35944/name_and_label.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: a98c8ae)
The file was modified python/pyspark/pandas/groupby.py (diff)
The file was modified python/pyspark/pandas/accessors.py (diff)
The file was modified python/pyspark/pandas/ml.py (diff)
The file was modified python/pyspark/pandas/indexes/multi.py (diff)
The file was modified python/pyspark/pandas/generic.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/pandas/indexes/base.py (diff)
The file was modified python/pyspark/pandas/indexes/numeric.py (diff)
The file was modified python/pyspark/pandas/series.py (diff)
The file was modified python/pyspark/pandas/_typing.py (diff)
The file was modified python/pyspark/pandas/indexing.py (diff)
The file was modified python/pyspark/pandas/namespace.py (diff)
The file was modified python/pyspark/pandas/utils.py (diff)
The file was modified python/pyspark/pandas/mlflow.py (diff)
The file was modified python/pyspark/pandas/base.py (diff)
The file was modified python/pyspark/pandas/internal.py (diff)
Commit cd6a4638110ef3f0db8b6366be680870dfb0bcad by wenchen
[SPARK-35888][SQL][FOLLOWUP] Return partition specs for all the shuffles

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/33079, to fix a bug in corner cases: `ShufflePartitionsUtil.coalescePartitions` should either return the shuffle spec for all the shuffles, or none.

If the input RDD has no partition, `mapOutputStatistics` is `None`, and we should still return shuffle specs with size 0.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a new test

Closes #33158 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: cd6a463)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala (diff)
Commit 5d74ace648422e7a9bff7774ac266372934023b9 by wenchen
[SPARK-35065][SQL] Group exception messages in spark/sql (core)

### What changes were proposed in this pull request?
This PR groups all exception messages in `sql/core/src/main/scala/org/apache/spark/sql`.
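
A language-agnostic illustration of the grouping pattern, written as a hypothetical Python sketch; the real objects are the Scala `QueryCompilationErrors`/`QueryExecutionErrors`, and the method name below is made up:

```
# Hypothetical sketch: centralize error construction in one place so that
# messages are standardized instead of scattered across call sites.
class QueryExecutionErrors:
    @staticmethod
    def unsupported_data_type_error(dt: str) -> ValueError:
        return ValueError(f"Unsupported data type: {dt}")

def to_columnar(dt: str) -> None:
    if dt == "interval":
        raise QueryExecutionErrors.unsupported_data_type_error(dt)
```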

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32958 from beliefer/SPARK-35065.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 5d74ace)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverWrapper.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArray.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVScan.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextWrite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/DerbyDialect.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousQueuedDataReader.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DropPartitionExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TruncatePartitionExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressionScheme.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryAdaptiveBroadcastExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamMicroBatchStream.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/pathFilters.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/AggregatingAccumulator.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileWrite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ResolveWriteToStream.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ForeachWriterTable.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonScan.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/TextSocketSourceProvider.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/AddPartitionExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaMergeUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/WriteToContinuousDataSourceExec.scala (diff)
Commit 868a59470650cc12272de0d0b04c6d98b1fe076d by yi.wu
[SPARK-35714][FOLLOW-UP][CORE] Use a shared stopping flag for WorkerWatcher to avoid the duplicate System.exit

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` reuse the `stopping` flag in `CoarseGrainedExecutorBackend` to avoid the duplicate call of `System.exit`.
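
A generic illustration of the shared stopping-flag pattern, in plain Python rather than the actual Scala code in `CoarseGrainedExecutorBackend`:

```
# Generic sketch: a single flag guarded by a lock ensures the exit path runs
# at most once, even if several components detect the disconnect.
import os
import threading

_stop_lock = threading.Lock()
_stopping = False

def exit_once(code: int) -> None:
    global _stopping
    with _stop_lock:
        if _stopping:   # shutdown already initiated elsewhere; do nothing
            return
        _stopping = True
    os._exit(code)      # the first caller wins and exits the process
```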

### Why are the changes needed?

As a followup of https://github.com/apache/spark/pull/32868, this PR tries to give a more robust fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests.

Closes #33028 from Ngone51/spark-35714-followup.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
(commit: 868a594)
The file was modified core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala (diff)
Commit dc85b0b51a02b9d6c52ffb1600f26ccdd7d7829a by gengliang
[SPARK-35950][WEBUI] Failed to toggle Exec Loss Reason in the executors page

### What changes were proposed in this pull request?

Update the executors page so that it can successfully hide the "Exec Loss Reason" column.

### Why are the changes needed?

When the "Exec Loss Reason" checkbox is unselected on the executors page,
the "Active tasks" column disappears instead of the "Exec Loss Reason" column.

Before:
![Screenshot from 2021-06-30 15-55-05](https://user-images.githubusercontent.com/37936015/123930908-bd6f4180-d9c2-11eb-9aba-bbfe0a237776.png)
After:
![Screenshot from 2021-06-30 22-21-38](https://user-images.githubusercontent.com/37936015/123977632-bf042e00-d9f1-11eb-910e-93d615d2db47.png)

### Does this PR introduce _any_ user-facing change?

Yes, The Web UI is updated.

### How was this patch tested?

Pass the CIs.

Closes #33155 from pingsutw/SPARK-35950.

Lead-authored-by: Kevin Su <pingsutw@gmail.com>
Co-authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: dc85b0b)
The file was modified core/src/main/resources/org/apache/spark/ui/static/executorspage.js (diff)
Commit 34286ae5bf120544524a6cfe934755e20eb9f835 by dongjoon
[SPARK-35960][BUILD][TEST] Bump the scalatest version to 3.2.9

### What changes were proposed in this pull request?

Bump the scalatest version to 3.2.9

### Why are the changes needed?

With scalatestplus updated to 3.2.9.0, recent sbt fails to handle the mismatch between scalatest and scalatestplus during resolution, resulting in test:compile errors where the `org.scalatest` package cannot be found.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

sbt tags/test:compile failed before and passes with this change.

Closes #33163 from holdenk/SPARK-35960-test-compile-sbt-issue.

Authored-by: Holden Karau <hkarau@netflix.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 34286ae)
The file was modified pom.xml (diff)
Commit ba0a479bdafc2b01a8ed966739b0307dee4871e5 by wenchen
[SPARK-35961][SQL] Only use local shuffle reader when REBALANCE_PARTITIONS_BY_NONE without CustomShuffleReaderExec

### What changes were proposed in this pull request?

Remove dead code in `OptimizeLocalShuffleReader`.

### Why are the changes needed?

After [SPARK-35725](https://issues.apache.org/jira/browse/SPARK-35725), we might expand a partition if it is skewed. So the partition number check `bytesByPartitionId.length == partitionSpecs.size` would be wrong if some partitions are coalesced while others are split into smaller ones.
Note that this is unlikely to happen in the real world since RoundRobin partitioning is used.

On the other hand, after [SPARK-34899](https://issues.apache.org/jira/browse/SPARK-34899), we use the original plan if we cannot coalesce partitions. So the assumption that a shuffle stage has a `CustomShuffleReaderExec` with no effect is always false in the `REBALANCE_PARTITIONS_BY_NONE` shuffle origin. That is to say, if no rule was effective, there would be no `CustomShuffleReaderExec`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass CI

Closes #33165 from ulysses-you/SPARK-35961.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: ba0a479)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeLocalShuffleReader.scala (diff)
The file was modified docs/index.md (diff)
Commit 74c4641e78c6ae2df9b347baf444dffac68e4d92 by dongjoon
Revert "fix Spark version"

This reverts commit 6a2f4348ae2b0b1f78924b1af2354ca89c6c974d.
(commit: 74c4641)
The file was modified docs/index.md (diff)
Commit 912d2b983402e9217c081a9bafc08c5720c3ad12 by gurwls223
[SPARK-35962][DOCS] Deprecate old Java 8 versions prior to 8u201

### What changes were proposed in this pull request?

This PR aims to deprecate old Java 8 versions prior to 8u201.

### Why are the changes needed?

This is a preparation for using G1GC during the migration among Java LTS versions (8/11/17).

8u162 has the following fix.
- JDK-8205376: JVM Crash during G1 GC

8u201 has the following fix.
- JDK-8208873: C1: G1 barriers don't preserve FP registers

### Does this PR introduce _any_ user-facing change?

No. Today's Java 8 is usually 1.8.0_292, and this is just a deprecation in the documentation.

### How was this patch tested?

N/A

Closes #33166 from dongjoon-hyun/SPARK-35962.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 912d2b9)
The file was modified docs/index.md (diff)