Changes

Summary

  1. [SPARK-35608][SQL] Support AQE optimizer side transformUpWithPruning (commit: 2c4598d)
  2. [SPARK-35185][SQL] Improve Distinct statistics estimation (commit: 7be8d8a)
  3. [SPARK-35342][PYTHON] Introduce DecimalOps and make `isnull` method (commit: f84a720)
  4. [SPARK-35478][PYTHON] Enable disallow_untyped_defs mypy check for (commit: 3fb044e)
  5. [SPARK-35478][PYTHON][FOLLOWUP] Fix Jenkins' linter (commit: c879510)
  6. [SPARK-35565][SS] Add config for ignoring metadata directory of (commit: 882122d)
  7. [SPARK-35796][TESTS] Fix SparkSubmitSuite failure on MacOS 10.15+ (commit: d015eff)
  8. [SPARK-35708][PYTHON][TEST] Add BaseTest for DataTypeOps (commit: b7df75a)
  9. [SPARK-35593][K8S][TESTS][FOLLOWUP] Increase timeout in (commit: b9d6473)
  10. [SPARK-35818][BUILD] Upgrade SBT to 1.5.4 (commit: 94f7015)
  11. [SPARK-35726][SQL] Truncate java.time.Duration by fields of day-time (commit: 2ebad72)
  12. [SPARK-35825][INFRA] Increase the heap and stack size for Maven build (commit: 74d647d)
  13. [SPARK-35593][K8S][TESTS][FOLLOWUP] Run (commit: aab37ed)
  14. [SPARK-35824][CORE][TESTS] Convert LevelDBSuite.IntKeyType from a nested (commit: a39f1ea)
  15. [SPARK-35819][SQL] Support Cast between different field (commit: 86bcd1f)
  16. [SPARK-35830][TESTS] Upgrade sbt-mima-plugin to 0.9.2 (commit: 9eaf678)
  17. [SPARK-35472][PYTHON] Fix disallow_untyped_defs mypy checks for (commit: 1589d32)
  18. [SPARK-35303][SPARK-35498][PYTHON][FOLLOW-UP] Copy local properties when (commit: 6d30991)
  19. [SPARK-35771][SQL][FOLLOWUP] IntervalUtils.toYearMonthIntervalString (commit: 4758dc7)
  20. [SPARK-35832][CORE][ML][K8S][TESTS] Add LocalRootDirsTest trait (commit: 4f51e00)
  21. [SPARK-35827][SQL] Show proper error message when update column types to (commit: af20474)
  22. [SPARK-35671][SHUFFLE][CORE] Add support in the ESS to serve merged (commit: 8ce1e34)
Commit 2c4598d02e31f470496ef4c2ed03e500f954dc1a by gengliang
[SPARK-35608][SQL] Support AQE optimizer side transformUpWithPruning

### What changes were proposed in this pull request?

Change `AQEPropagateEmptyRelation` from `transformUp` to `transformUpWithPruning`.

### Why are the changes needed?

To avoid unnecessary iterations in the AQE optimizer.
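
As a rough illustration of why pruning saves work, the sketch below walks a toy plan tree bottom-up and skips any subtree that cannot contain the pattern a rule cares about. The `Node` class, the string tags, and `transform_up_with_pruning` are hypothetical stand-ins written for this example; they are not Spark's `TreeNode` API.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)

    def patterns(self) -> FrozenSet[str]:
        # Tags present anywhere in this subtree. A real implementation would
        # cache this per node; recomputing it keeps the sketch short.
        tags = {self.name}
        for child in self.children:
            tags |= child.patterns()
        return frozenset(tags)


def transform_up_with_pruning(node, wanted, rule, visited):
    # Skip the whole subtree when it cannot contain the wanted pattern.
    if wanted not in node.patterns():
        return node
    new_children = [
        transform_up_with_pruning(c, wanted, rule, visited) for c in node.children
    ]
    visited.append(node.name)  # record which nodes the rule actually touched
    return rule(Node(node.name, new_children))


def drop_project_over_empty(node):
    # Toy rule: a Project over an empty relation is itself empty.
    if node.name == "Project" and [c.name for c in node.children] == ["EmptyRelation"]:
        return node.children[0]
    return node


plan = Node("Join", [
    Node("Filter", [Node("Scan")]),
    Node("Project", [Node("EmptyRelation")]),
])
visited = []
result = transform_up_with_pruning(plan, "EmptyRelation", drop_project_over_empty, visited)
```

In this toy plan the `Filter`/`Scan` branch is never visited, because it holds no `EmptyRelation` node.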

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass CI.

Closes #32742 from ulysses-you/aqe-transformUpWithPruning.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 2c4598d)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEPropagateEmptyRelation.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/LogicalQueryStage.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala
Commit 7be8d8a164a2bc12887c83361af401d233de3397 by yumwang
[SPARK-35185][SQL] Improve Distinct statistics estimation

### What changes were proposed in this pull request?

This PR improves `Distinct` statistics estimation by rewriting it to `Aggregate`.

### Why are the changes needed?

1. The current implementation lacks column statistics.
2. Some rules that run before `ReplaceDistinctWithAggregate` may use these statistics. For example: https://github.com/apache/spark/pull/31113/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR1808
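
To show why column statistics matter here, the sketch below uses one common cardinality estimate for a group-by over all output columns: the product of the per-column distinct counts, capped by the input row count. The function name and the dictionary of distinct counts are assumptions made for illustration; this is not claimed to be the exact formula Spark applies.

```python
from math import prod
from typing import Dict


def estimate_distinct_rows(child_rows: int, distinct_counts: Dict[str, int]) -> int:
    # Treat Distinct as an Aggregate grouping by every output column:
    # the number of groups cannot exceed the product of the per-column
    # distinct counts, nor the number of input rows.
    return min(child_rows, prod(distinct_counts.values()))


few_groups = estimate_distinct_rows(1000, {"a": 10, "b": 20})
capped = estimate_distinct_rows(100, {"a": 10, "b": 20})
```

Without per-column distinct counts, an estimator has nothing better than the input size to fall back on.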

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #32291 from wangyum/SPARK-35185.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(commit: 7be8d8a)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala
Commit f84a720fe3bc0b1781d08a448ac2a9c69b6b761b by ueshin
[SPARK-35342][PYTHON] Introduce DecimalOps and make `isnull` method data-type-based

### What changes were proposed in this pull request?
- Introduce a DecimalOps for DecimalType
- Make `isnull` method data-type-based

### Why are the changes needed?
Currently DecimalType, DoubleType, and FloatType data share the FractionalOps class, but DecimalType behaves differently from FloatType and DoubleType (see https://github.com/apache/spark/blob/master/python/pyspark/pandas/base.py#L987-L990), so we propose introducing DecimalOps. The behavior differs because DecimalType cannot hold NaN.

https://issues.apache.org/jira/browse/SPARK-35342
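
A minimal sketch of what "data-type-based" dispatch can look like, assuming a simplified class hierarchy. The class names echo the ones in this PR, but the bodies, the list-based inputs, and the `ops_for` lookup are hypothetical and do not reproduce the pyspark.pandas implementation.

```python
import math
from decimal import Decimal
from typing import Any, List


class DataTypeOps:
    def isnull(self, values: List[Any]) -> List[bool]:
        # Default: only a missing value counts as null.
        return [v is None for v in values]


class FractionalOps(DataTypeOps):
    def isnull(self, values: List[Any]) -> List[bool]:
        # Float and double columns can hold NaN, which is also treated as null.
        return [v is None or math.isnan(v) for v in values]


class DecimalOps(DataTypeOps):
    # Decimal columns cannot hold NaN, so the default check is enough.
    pass


def ops_for(dtype: str) -> DataTypeOps:
    # Hypothetical lookup from a type name to its ops class.
    table = {"float": FractionalOps, "double": FractionalOps, "decimal": DecimalOps}
    return table[dtype]()


double_nulls = ops_for("double").isnull([1.0, float("nan"), None])
decimal_nulls = ops_for("decimal").isnull([Decimal("1.5"), None])
```

Each ops class owns its own null check, so callers no longer need to branch on the data type.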

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Newly added DecimalOpsTest passed
- Existing NumOpsTest passed

Closes #32821 from Yikun/SPARK-35342.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: f84a720)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_udt_ops.py
The file was modified python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py
The file was modified python/pyspark/pandas/tests/data_type_ops/test_date_ops.py
The file was modified python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py
The file was modified python/pyspark/pandas/tests/data_type_ops/test_binary_ops.py
The file was modified python/pyspark/pandas/tests/data_type_ops/test_complex_ops.py
The file was modified python/pyspark/pandas/base.py
The file was modified python/pyspark/pandas/data_type_ops/num_ops.py
The file was modified python/pyspark/pandas/tests/data_type_ops/test_null_ops.py
The file was modified python/pyspark/pandas/tests/data_type_ops/test_string_ops.py
The file was modified python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
The file was modified python/pyspark/pandas/data_type_ops/base.py
The file was added python/pyspark/pandas/tests/data_type_ops/test_decimal_ops.py
The file was modified dev/sparktestsupport/modules.py
The file was modified python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
Commit 3fb044e043a2feab01d79b30c25b93d4fd166b12 by ueshin
[SPARK-35478][PYTHON] Enable disallow_untyped_defs mypy check for pyspark.pandas.window

### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/window.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on the Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32886 from pingsutw/SPARK-35478.

Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 3fb044e)
The file was modified python/mypy.ini
The file was modified python/pyspark/pandas/groupby.py
The file was modified python/pyspark/pandas/window.py
The file was modified python/pyspark/pandas/generic.py
Commit c879510d2f66697d2e8b28554e4a6dc0782952b7 by dongjoon
[SPARK-35478][PYTHON][FOLLOWUP] Fix Jenkins' linter

### What changes were proposed in this pull request?

This is a follow-up of #32886 to fix Jenkins' linter.

### Why are the changes needed?

PR #32886 was mistakenly merged before Jenkins' linter passed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Closes #32965 from ueshin/issues/SPARK-35478/fup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: c879510)
The file was modified python/pyspark/pandas/window.py
Commit 882122d6b727463a5d971080fe68831b9f2ecd64 by kabhwan.opensource
[SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

### What changes were proposed in this pull request?

This patch proposes to add an internal config for ignoring metadata of `FileStreamSink` when reading the output path.

### Why are the changes needed?

`FileStreamSink` produces a metadata directory which logs output files per micro-batch. When we read from the output path, Spark will look at the metadata and ignore other files not in the log.

Normally this works well. But for some use cases, we may need to ignore the metadata when reading the output path. For example, when we change the streaming query and must run it with a new checkpoint directory, we cannot use the previous metadata. If we create new metadata too, then when we read the output path later, Spark only reads the files listed in the new metadata. The files written before we switched to the new checkpoint and metadata are ignored by Spark.

Although it seems we could write to a different output directory every time, that is a bad idea because it produces many directories unnecessarily.

We need a config for ignoring the metadata of `FileStreamSink` when reading the output path.
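
The behaviour described above can be sketched as a small file-selection function. The function name, the `ignore_metadata` flag, and the file names are hypothetical; the actual config key is not named in this message, so none is assumed here.

```python
from typing import List, Optional, Set


def files_to_read(
    all_files: List[str],
    metadata_log: Optional[Set[str]],
    ignore_metadata: bool,
) -> List[str]:
    # Without a metadata log, or when told to ignore it, read every file
    # found in the output path.
    if metadata_log is None or ignore_metadata:
        return sorted(all_files)
    # Otherwise only files recorded in the sink's metadata log are visible.
    return sorted(f for f in all_files if f in metadata_log)


# part-0 and part-1 were written under an old checkpoint; after restarting
# with a new checkpoint, the new metadata log only knows about part-2.
output_path = ["part-0", "part-1", "part-2"]
new_log = {"part-2"}

default_read = files_to_read(output_path, new_log, ignore_metadata=False)
full_read = files_to_read(output_path, new_log, ignore_metadata=True)
```

With the flag off, the two older files silently drop out of the read; with it on, all three are returned.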

### Does this PR introduce _any_ user-facing change?

Added a config for ignoring metadata of FileStreamSink when reading the output.

### How was this patch tested?

Unit tests.

Closes #32702 from viirya/ignore-metadata.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(commit: 882122d)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSinkSuite.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Commit d015eff16dbd56a7a9104e2d66885e2bb21eb111 by dongjoon
[SPARK-35796][TESTS] Fix SparkSubmitSuite failure on MacOS 10.15+

### What changes were proposed in this pull request?
Change the primaryResource assertion from an exact match to a suffix match in the SparkSubmitSuite test case `handles k8s cluster mode`.

### Why are the changes needed?
When I ran SparkSubmitSuite on macOS 10.15.7, I got an AssertionError for the `handles k8s cluster mode` test after PR [SPARK-35691](https://issues.apache.org/jira/browse/SPARK-35691), because `File(path).getCanonicalFile().toURI()` called with an absolute path returns a path beginning with `/System/Volumes/Data` on macOS 10.15+.
E.g. `/home/testjars.jar` becomes `file:/System/Volumes/Data/home/testjars.jar`.

In order to pass the UT on macOS 10.15+, we change the assertion to a suffix match.
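
A small sketch of the difference between the two assertion styles, using the URIs from the failure report in this message. The helper name is made up for illustration, and the real test is written in Scala.

```python
def matches_by_suffix(actual_uri: str, expected_path: str) -> bool:
    # Accept any canonical prefix the OS may prepend, as long as the URI
    # still ends with the path the test asked for.
    return actual_uri.endswith(expected_path)


expected_path = "/home/thejar.jar"
plain_uri = "file:/home/thejar.jar"
canonical_uri = "file:/System/Volumes/Data/home/thejar.jar"

exact_ok = canonical_uri == plain_uri  # the old exact comparison fails
suffix_ok = matches_by_suffix(canonical_uri, expected_path)
suffix_ok_plain = matches_by_suffix(plain_uri, expected_path)
```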

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. Pass the GitHub Action
2. Manually test
    - environment: macOS > 10.15
    - command: `build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -pl core -am -DwildcardSuites=org.apache.spark.deploy.SparkSubmitSuite -Dtest=none`
    - Test result:
        - before this PR, the case failed with the following exception:
        `- handles k8s cluster mode *** FAILED ***
  Some("file:/System/Volumes/Data/home/thejar.jar") was not equal to Some("file:/home/thejar.jar") (SparkSubmitSuite.scala:485)
  Analysis:
  Some(value: "file:/[System/Volumes/Data/]home/thejar.jar" -> "file:/[]home/thejar.jar")`
        - after this PR, all tests run successfully

Closes #32948 from toujours33/SPARK-35796.

Authored-by: toujours33 <wangyazhi@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: d015eff)
The file was modified core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala
Commit b7df75a777e5567ce3537dafadcebf591debbaa3 by ueshin
[SPARK-35708][PYTHON][TEST] Add BaseTest for DataTypeOps

### What changes were proposed in this pull request?
This patch adds a DataTypeOps test to check that the ops are loaded as expected.

### Why are the changes needed?
While completing https://github.com/apache/spark/pull/32821, I found there was no test for DataTypeOps. There is a lot of logic involved when DataTypeOps is loaded, so it is better to add a test to make sure the interface is stable.

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
Tests passed.

Closes #32859 from Yikun/SPARK-XXXXX1.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: b7df75a)
The file was added python/pyspark/pandas/tests/data_type_ops/test_base.py
The file was modified dev/sparktestsupport/modules.py
Commit b9d6473e898cea255bbbc27f657e2958fd4c011b by sarutak
[SPARK-35593][K8S][TESTS][FOLLOWUP] Increase timeout in KubernetesLocalDiskShuffleDataIOSuite

### What changes were proposed in this pull request?

This increases the timeout from 10 seconds to 60 seconds in KubernetesLocalDiskShuffleDataIOSuite to reduce the flakiness.

### Why are the changes needed?

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/140003/testReport/

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs

Closes #32967 from dongjoon-hyun/SPARK-35593-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: b9d6473)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/shuffle/KubernetesLocalDiskShuffleDataIOSuite.scala
Commit 94f701587de7cd2ebc8c9b6de56e669345538112 by dongjoon
[SPARK-35818][BUILD] Upgrade SBT to 1.5.4

### What changes were proposed in this pull request?

This PR aims to upgrade SBT to 1.5.4.

### Why are the changes needed?

SBT 1.5.4 was released 5 days ago.
- https://github.com/sbt/sbt/releases/tag/v1.5.4

This will bring the latest bug fixes like the following.

- Fixes BSP on ARM Macs by keeping JNI server socket to keep using JNI
- Fixes compiler ClassLoader list to use compilerJars.toList (For Scala 3, this drops support for 3.0.0-M2)
- Fixes undercompilation of package object causing "Symbol 'type X' is missing from the classpath"
- Fixes overcompilation with scalac -release flag
- Fixes build/exit notification not closing BSP channel
- Fixes POM file's Maven repository ID character restriction to match that of Maven

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #32966 from dongjoon-hyun/SPARK-35818.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 94f7015)
The file was modified project/build.properties
Commit 2ebad727587e25b8bf4a8439593e7402ea4e2827 by max.gekk
[SPARK-35726][SQL] Truncate java.time.Duration by fields of day-time interval type

### What changes were proposed in this pull request?
Support truncating `java.time.Duration` by the fields of the day-time interval type.

### Why are the changes needed?
To respect fields of the target day-time interval types.
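
The sketch below illustrates one plausible reading of "truncate by fields": units finer than the interval's end field are dropped. This is an assumption about the semantics, not a transcription of `IntervalUtils`; the field table, the function name, and the choice to truncate toward zero for negative durations are all hypothetical.

```python
from datetime import timedelta

# Microseconds per end field. SECOND keeps full microsecond precision.
MICROS_PER_FIELD = {
    "DAY": 86_400_000_000,
    "HOUR": 3_600_000_000,
    "MINUTE": 60_000_000,
    "SECOND": 1,
}


def truncate_duration(duration: timedelta, end_field: str) -> timedelta:
    # Assumed behaviour: drop everything finer than end_field,
    # truncating toward zero so negative durations stay symmetric.
    micros = duration // timedelta(microseconds=1)
    unit = MICROS_PER_FIELD[end_field]
    sign = -1 if micros < 0 else 1
    truncated = (abs(micros) // unit) * unit
    return timedelta(microseconds=sign * truncated)


d = timedelta(days=1, hours=2, minutes=3, seconds=4)
to_hour = truncate_duration(d, "HOUR")
to_day = truncate_duration(d, "DAY")
negative_to_day = truncate_duration(-timedelta(hours=25), "DAY")
```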

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32950 from AngersZhuuuu/SPARK-35726.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 2ebad72)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/CatalystTypeConvertersSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala
Commit 74d647d2ca6b0471f0eb90a59bccb1ecc0a9cc8f by dongjoon
[SPARK-35825][INFRA] Increase the heap and stack size for Maven build

### What changes were proposed in this pull request?

Increase memory configuration for Maven build.
Stack size: 64MB => 128MB
Initial heap size: 1024MB => 2048MB
Maximum heap size: 1024MB => 2048MB

The SBT builds are ok so let's keep the current configuration.

### Why are the changes needed?

The Jenkins jobs are unstable due to stack overflow errors:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/2274/

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Jenkins test

Closes #32961 from gengliangwang/increaseXss.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 74d647d)
The file was modified pom.xml
Commit aab37edefca36ccb86b5bac92ab1b47f427405cf by dongjoon
[SPARK-35593][K8S][TESTS][FOLLOWUP] Run KubernetesLocalDiskShuffleDataIOSuite on a dedicated JVM

### What changes were proposed in this pull request?

This PR aims to run `KubernetesLocalDiskShuffleDataIOSuite` on a dedicated JVM.

### Why are the changes needed?

In Jenkins environment, `KubernetesLocalDiskShuffleDataIOSuite` and `ExternalShuffleServiceSuite` currently hit issues.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/140019/
![Screen Shot 2021-06-19 at 10 33 20 AM](https://user-images.githubusercontent.com/9700541/122650832-d9810200-d0e9-11eb-9f2a-4fb44bb874f3.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins.

Closes #32976 from dongjoon-hyun/SPARK-35593-3.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: aab37ed)
The file was modified project/SparkBuild.scala
Commit a39f1eadb7e568eb48a5c108680631b1df3fd1ec by dongjoon
[SPARK-35824][CORE][TESTS] Convert LevelDBSuite.IntKeyType from a nested class to a normal class

### What changes were proposed in this pull request?

This PR aims to promote `LevelDBSuite.IntKeyType` class to a normal class to isolate `InMemoryIteratorSuite` from `LevelDBSuite`.

### Why are the changes needed?

We have the following test suite hierarchy.
```
DBIteratorSuite
- InMemoryIteratorSuite
- LevelDBIteratorSuite
```

`DBIteratorSuite.testRefWithIntNaturalKey` depends on `LevelDBSuite`, and `InMemoryIteratorSuite` inherits it. `InMemoryIteratorSuite` should not depend on `LevelDB`-specific code. This PR ensures that.
```
public void testRefWithIntNaturalKey() throws Exception {
  LevelDBSuite.IntKeyType i = new LevelDBSuite.IntKeyType();
...
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

```
$ build/sbt "kvstore/test"
```

Closes #32971 from dongjoon-hyun/SPARK-35824.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: a39f1ea)
The file was added common/kvstore/src/test/java/org/apache/spark/util/kvstore/IntKeyType.java
The file was modified common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java
The file was modified common/kvstore/src/test/java/org/apache/spark/util/kvstore/DBIteratorSuite.java
Commit 86bcd1fba09d9b5e4d36a48824354aaae769fa21 by max.gekk
[SPARK-35819][SQL] Support Cast between different field YearMonthIntervalType

### What changes were proposed in this pull request?
Support casting between YearMonthIntervalType values with different fields.

### Why are the changes needed?
Make it convenient for users to obtain a YearMonthIntervalType with different fields.

### Does this PR introduce _any_ user-facing change?
Users can cast YearMonthIntervalType(YEAR, MONTH) to YearMonthIntervalType(YEAR, YEAR), etc.

### How was this patch tested?
Added UT

Closes #32974 from AngersZhuuuu/SPARK-35819.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 86bcd1f)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
Commit 9eaf678099b3b52b707393a1d1cb8ac33385ad3c by gurwls223
[SPARK-35830][TESTS] Upgrade sbt-mima-plugin to 0.9.2

### What changes were proposed in this pull request?

This PR aims to upgrade `sbt-mima-plugin` to 0.9.2 for Apache Spark 3.2.0.

### Why are the changes needed?

`sbt-mima-plugin` 0.9.2 has the following updates including `Scala 3 initial support`.
- https://github.com/lightbend/mima/releases/tag/0.9.2
- https://github.com/lightbend/mima/releases/tag/0.9.1
- https://github.com/lightbend/mima/releases/tag/0.9.0

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs. Also, I manually deleted some lines from `MimaExcludes` and verified that the removal is detected correctly.

Closes #32981 from dongjoon-hyun/SPARK-35830.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 9eaf678)
The file was modified project/plugins.sbt
Commit 1589d3273236ef20fb8d00c0f805b0d4a0a8f85a by gurwls223
[SPARK-35472][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.generic

### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/generic.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32957 from ueshin/issues/SPARK-35472/disallow_untyped_defs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 1589d32)
The file was modified python/pyspark/pandas/generic.py
The file was modified python/pyspark/pandas/series.py
The file was modified python/mypy.ini
The file was modified python/pyspark/pandas/frame.py
Commit 6d309914df422d9f0c96edfd37924ecb8f29e3a9 by gurwls223
[SPARK-35303][SPARK-35498][PYTHON][FOLLOW-UP] Copy local properties when starting the thread, and use inheritable thread in the current codebase

### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/32429 and https://github.com/apache/spark/pull/32644.
I was thinking about creating separate PRs but decided to include all in this PR because it shares the same context, and should be easier to review together.

This PR includes:
- Use `InheritableThread` and `inheritable_thread_target` in the current code base to prevent potential resource leak (since we enabled pinned thread mode by default now at https://github.com/apache/spark/pull/32429)
- Copy local properties when `start` at `InheritableThread` is called to mimic JVM behaviour. Previously it was copied when `InheritableThread` instance was created (related to #32644).
- https://github.com/apache/spark/pull/32429 missed one place at `inheritable_thread_target` (https://github.com/apache/spark/blob/master/python/pyspark/util.py#L308). More specifically, I missed one place that should enable pinned thread mode by default.
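
The timing difference in the second bullet can be sketched with plain `threading`. The class below is a hypothetical stand-in, not `pyspark.InheritableThread`: it keeps "local properties" in a `threading.local` and snapshots them in `start`, so a change made between construction and start is visible to the child thread.

```python
import threading

_local = threading.local()


def set_local_property(key, value):
    # Store a property in the calling thread's private dictionary.
    props = getattr(_local, "props", None)
    if props is None:
        props = _local.props = {}
    props[key] = value


def get_local_properties():
    # Copy of the calling thread's properties (empty if none were set).
    return dict(getattr(_local, "props", {}))


class SketchInheritableThread(threading.Thread):
    def __init__(self, target):
        super().__init__()
        self._user_target = target
        self._inherited = {}

    def start(self):
        # Snapshot the parent's properties when the thread is started,
        # not when the object was constructed.
        self._inherited = get_local_properties()
        super().start()

    def run(self):
        # Install the snapshot as the child's own properties.
        _local.props = dict(self._inherited)
        self._user_target()


seen = {}


def work():
    seen.update(get_local_properties())


set_local_property("pool", "a")
thread = SketchInheritableThread(target=work)
set_local_property("pool", "b")  # changed after construction, before start
thread.start()
thread.join()
```

Had the snapshot been taken in `__init__`, the child would have seen `"a"` instead of `"b"`.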

### Why are the changes needed?

To mimic the JVM's thread lifecycle behaviour.

### Does this PR introduce _any_ user-facing change?

Ideally no. One possible case is that users use `InheritableThread` with pinned thread mode enabled.
In this case, the local properties will be copied when the thread is started rather than when the `InheritableThread` object is created.
This is a small difference that is unlikely to affect end users.

### How was this patch tested?

Existing tests should cover this.

Closes #32962 from HyukjinKwon/SPARK-35498-SPARK-35303.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 6d30991)
The file was modified python/pyspark/util.py
The file was modified python/pyspark/ml/tuning.py
The file was modified python/pyspark/ml/classification.py
The file was modified python/pyspark/context.py
Commit 4758dc78a2de1afa31b108bd3ec5c9eec63a5b94 by max.gekk
[SPARK-35771][SQL][FOLLOWUP] IntervalUtils.toYearMonthIntervalString should consider the case year-month type is casted as month type

### What changes were proposed in this pull request?

This PR fixes an issue where `IntervalUtils.toYearMonthIntervalString` doesn't consider the case in which a year-month interval type is cast to a month interval type.
If year-month interval data is cast to a month interval, the value of the year is multiplied by `12` and added to the value of the month. For example, `INTERVAL '1-2' YEAR TO MONTH` becomes `INTERVAL '14' MONTH` when it is cast.
If this behavior is intended, it should be stringified as `INTERVAL '14' MONTH`, but currently it is `INTERVAL '2' MONTH`.
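
The arithmetic in the example can be checked with a short sketch. The helper names are hypothetical and the string formats simply follow the literals quoted in this message; the sketch does not reproduce `IntervalUtils`.

```python
def to_total_months(years: int, months: int) -> int:
    # A year-month interval cast to months: years * 12 + months.
    return years * 12 + months


def month_interval_string(total_months: int) -> str:
    # Expected rendering: the full month count.
    return f"INTERVAL '{total_months}' MONTH"


def buggy_month_interval_string(total_months: int) -> str:
    # The reported bug: only the remainder after whole years was printed.
    return f"INTERVAL '{total_months % 12}' MONTH"


total = to_total_months(1, 2)
expected = month_interval_string(total)
buggy = buggy_month_interval_string(total)
```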

### Why are the changes needed?

It's a bug if the behavior of cast is intended.

### Does this PR introduce _any_ user-facing change?

No, because this feature is not released yet.

### How was this patch tested?

Modified the tests added in SPARK-35771 (#32924).

Closes #32982 from sarutak/fix-toYearMonthIntervalString.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 4758dc7)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/IntervalUtilsSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Commit 4f51e0045e2754035e304979182c0f583c65f174 by dongjoon
[SPARK-35832][CORE][ML][K8S][TESTS] Add LocalRootDirsTest trait

### What changes were proposed in this pull request?

To make the test suite more robust, this PR aims to add a new trait, `LocalRootDirsTest`, by refactoring `SortShuffleSuite`'s helper functions and applying it to the following:
- ShuffleNettySuite
- ShuffleOldFetchProtocolSuite
- ExternalShuffleServiceSuite
- KubernetesLocalDiskShuffleDataIOSuite
- LocalDirsSuite
- RDDCleanerSuite
- ALSCleanerSuite

In addition, this fixes a UT in `KubernetesLocalDiskShuffleDataIOSuite`.

### Why are the changes needed?

`ShuffleSuite` is extended by four classes but only `SortShuffleSuite` does the clean-up correctly.
```
ShuffleSuite
- SortShuffleSuite
- ShuffleNettySuite
- ShuffleOldFetchProtocolSuite
- ExternalShuffleServiceSuite
```

Since `KubernetesLocalDiskShuffleDataIOSuite` is looking for the other storage directory, the leftover of `ShuffleSuite` causes flakiness.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/2649/testReport/junit/org.apache.spark.shuffle/KubernetesLocalDiskShuffleDataIOSuite/recompute_is_not_blocked_by_the_recovery/
```
org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 in stage 1.0 (TID 3) had a not serializable result: org.apache.spark.ShuffleSuite$NonJavaSerializableClass
...
org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIOSuite.$anonfun$new$2(KubernetesLocalDiskShuffleDataIOSuite.scala:52)
```

For the other suites, the clean-up implementation is used but not complete. So, they are refactored to use the new trait.

### Does this PR introduce _any_ user-facing change?

No, this is a test-only change.

### How was this patch tested?

Pass the CIs.

Closes #32986 from dongjoon-hyun/SPARK-35832.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 4f51e00)
The file was modified core/src/test/scala/org/apache/spark/rdd/RDDCleanerSuite.scala
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/shuffle/KubernetesLocalDiskShuffleDataIOSuite.scala
The file was modified core/src/test/scala/org/apache/spark/ShuffleSuite.scala
The file was modified core/src/test/scala/org/apache/spark/SortShuffleSuite.scala
The file was added core/src/test/scala/org/apache/spark/LocalRootDirsTest.scala
The file was modified core/src/test/scala/org/apache/spark/storage/LocalDirsSuite.scala
The file was modified mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala
Commit af20474c67a61190829fa100a17ded58cb9a2102 by max.gekk
[SPARK-35827][SQL] Show proper error message when update column types to year-month/day-time interval

### What changes were proposed in this pull request?

This PR fixes the error message shown when attempting to change a column type to a year-month/day-time interval type.

### Why are the changes needed?

It's for consistent behavior.
Updating column types to interval types is prohibited for V2 source tables.
So, if we attempt to update the type of a column to the conventional interval type, an error message like `Error in query: Cannot update <table> field <column> to interval type;` is shown.

But for year-month/day-time interval types, a different error message like `Error in query: Cannot update <table> field <column>:<type> cannot be cast to interval year;` is shown.

You can reproduce with the following procedure.
```
$ bin/spark-sql
spark-sql> SET spark.sql.catalog.mycatalog=<a catalog implementation class>;
spark-sql> CREATE TABLE mycatalog.t1(c1 int) USING <V2 datasource implementation class>;
spark-sql> ALTER TABLE mycatalog.t1 ALTER COLUMN c1 TYPE interval year to month;
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified an existing test.

Closes #32978 from sarutak/err-msg-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: af20474)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala
Commit 8ce1e344e58dbfbddecd9e9fd9f0a5a6f15dbea9 by mridulatgmail.com
[SPARK-35671][SHUFFLE][CORE] Add support in the ESS to serve merged shuffle block meta and data to executors

### What changes were proposed in this pull request?
This adds support in the ESS to serve merged shuffle block meta and data requests to executors.
This change is needed for fetching remote merged shuffle data from the remote shuffle services. This is part of push-based shuffle SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).

This change introduces new messages between clients and the external shuffle service:

1. `MergedBlockMetaRequest`: The client sends this to the external shuffle service to get the meta information for a merged block. The response is one of the following:
  - `MergedBlockMetaSuccess`: contains the request id, the number of chunks, and a `ManagedBuffer`, which is a `FileSegmentBuffer` backed by the merged block meta file.
  - `RpcFailure`: sent back to the client in case of failure. This is an existing message.

2. `FetchShuffleBlockChunks`: This is similar to the `FetchShuffleBlocks` message, but it fetches merged shuffle chunks instead of blocks.
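
The request/response pair in item 1 can be sketched as plain data classes with a toy handler. The class names follow the message names above, but every field other than the request id and chunk count, the in-memory `store`, and the `handle_meta_request` function are assumptions for illustration; the real messages are Java classes in the network modules.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class MergedBlockMetaRequest:
    request_id: int
    app_id: str      # hypothetical field
    shuffle_id: int  # hypothetical field
    reduce_id: int   # hypothetical field


@dataclass(frozen=True)
class MergedBlockMetaSuccess:
    request_id: int
    num_chunks: int
    meta: bytes  # stands in for the buffer backed by the meta file


@dataclass(frozen=True)
class RpcFailure:
    request_id: int
    error: str


# Toy stand-in for the shuffle service's merged-block metadata.
MetaStore = Dict[Tuple[str, int, int], Tuple[int, bytes]]


def handle_meta_request(
    request: MergedBlockMetaRequest, store: MetaStore
) -> Union[MergedBlockMetaSuccess, RpcFailure]:
    key = (request.app_id, request.shuffle_id, request.reduce_id)
    if key not in store:
        # Failures reuse the generic RPC failure message.
        return RpcFailure(request.request_id, f"no merged block for {key}")
    num_chunks, meta = store[key]
    return MergedBlockMetaSuccess(request.request_id, num_chunks, meta)


store: MetaStore = {("app-1", 0, 3): (2, b"\x00\x01")}
ok = handle_meta_request(MergedBlockMetaRequest(7, "app-1", 0, 3), store)
missing = handle_meta_request(MergedBlockMetaRequest(8, "app-1", 0, 9), store)
```

Both responses echo the request id, so a client can match a reply to its outstanding request.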

### Why are the changes needed?
These changes are needed for push-based shuffle. Refer to the SPIP in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Lead-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>

Closes #32811 from otterc/SPARK-35671.

Lead-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: 8ce1e34)
The file was modified common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalBlockHandlerSuite.java (diff)
The file was added common/network-common/src/main/java/org/apache/spark/network/protocol/MergedBlockMetaRequest.java
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java (diff)
The file was added common/network-common/src/main/java/org/apache/spark/network/client/MergedBlockMetaResponseCallback.java
The file was modified common/network-common/src/main/java/org/apache/spark/network/protocol/MessageDecoder.java (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java (diff)
The file was added common/network-common/src/main/java/org/apache/spark/network/client/BaseResponseCallback.java
The file was added common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/AbstractFetchShuffleBlocks.java
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/FetchShuffleBlocks.java (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/ExternalShuffleServiceMetricsSuite.scala (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/server/RpcHandler.java (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/protocol/Message.java (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/server/AbstractAuthRpcHandler.java (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockStoreClient.java (diff)
The file was modified common/network-common/src/test/java/org/apache/spark/network/TransportRequestHandlerSuite.java (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java (diff)
The file was added common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/protocol/FetchShuffleBlockChunksSuite.java
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/BlockTransferMessage.java (diff)
The file was modified resource-managers/yarn/src/test/scala/org/apache/spark/network/yarn/YarnShuffleServiceMetricsSuite.scala (diff)
The file was added common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/MergedBlocksMetaListener.java
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockStoreClient.java (diff)
The file was added common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/FetchShuffleBlockChunks.java
The file was added common/network-common/src/test/java/org/apache/spark/network/protocol/MergedBlockMetaSuccessSuite.java
The file was modified common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/OneForOneBlockFetcherSuite.java (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/crypto/AuthRpcHandler.java (diff)
The file was added common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/protocol/FetchShuffleBlocksSuite.java
The file was modified common/network-common/src/main/java/org/apache/spark/network/client/RpcResponseCallback.java (diff)
The file was modified resource-managers/yarn/src/test/scala/org/apache/spark/network/yarn/YarnShuffleServiceSuite.scala (diff)
The file was modified common/network-common/src/test/java/org/apache/spark/network/TransportResponseHandlerSuite.java (diff)
The file was added common/network-common/src/main/java/org/apache/spark/network/protocol/MergedBlockMetaSuccess.java