Changes

Summary

  1. [SPARK-37026][ML][BUILD] Ensure the element type of `ResolvedRFormula.terms` is scala.Seq for Scala 2.13 (commit: 0782024)
  2. [SPARK-36151][INFRA] MiMa updates after 3.2.0 release: bump previousSparkVersion and re-enable tests for Scala 2.13 artifacts (commit: 47111af)
  3. [SPARK-37032][SQL] Fix broken SQL syntax link in SQL Reference page (commit: e7815b1)
  4. [SPARK-36978][SQL] InferConstraints rule should create IsNotNull constraints on the accessed nested field instead of the root nested type (commit: 0bba90b)
  5. [SPARK-36965][PYTHON] Extend python test runner by logging out the temp output files (commit: c29bb02)
  6. [SPARK-35925][SQL] Support DayTimeIntervalType in width-bucket function (commit: 21fa3ce)
  7. [SPARK-36886][PYTHON] Inline type hints for python/pyspark/sql/context.py (commit: 25fc495)
  8. [SPARK-36933][CORE] Clean up TaskMemoryManager.acquireExecutionMemory() (commit: 1ef6c13)
  9. [SPARK-36945][PYTHON] Inline type hints for python/pyspark/sql/udf.py (commit: c2ba498)
  10. [SPARK-36834][SHUFFLE] Add support for namespacing log lines emitted by external shuffle service (commit: 4072a22)
  11. [SPARK-36871][SQL][FOLLOWUP] Move error checking from create cmd to parser (commit: ebca523)
  12. [SPARK-37052][CORE] Spark should only pass --verbose argument to main class when is sql shell (commit: a6d3a2c)
  13. [SPARK-37017][SQL] Reduce the scope of synchronized to prevent potential deadlock (commit: 875963a)
  14. [SPARK-37057][INFRA] Fix wrong DocSearch facet filter in release-tag.sh (commit: db89320)
  15. [SPARK-36796][BUILD][CORE][SQL] Pass all `sql/core` and dependent modules UTs with JDK 17 except one case in `postgreSQL/text.sql` (commit: 3849340)
Commit 0782024045a3f024168686fff2fa8d04a399de6d by gurwls223
[SPARK-37026][ML][BUILD] Ensure the element type of `ResolvedRFormula.terms` is scala.Seq for Scala 2.13

### What changes were proposed in this pull request?

This PR fixes an issue where a `scala.Seq[scala.collection.mutable.ArraySeq$ofRef]` is passed to `ResolvedRFormula.terms` even though it expects a `scala.Seq[scala.Seq[String]]` under Scala 2.13.
As of Scala 2.13, `scala.Seq` is an alias for `scala.collection.immutable.Seq`, which is why this issue occurs.
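
A minimal sketch of the 2.13 mismatch and the usual fix, with illustrative values (the actual change is in RFormula.scala):

```
import scala.collection.mutable

val wrapped: mutable.ArraySeq[String] = mutable.ArraySeq("a", "b")
// On Scala 2.12 the next line compiles, because scala.Seq is scala.collection.Seq;
// on 2.13 it does not, because scala.Seq is scala.collection.immutable.Seq:
//   val bad: Seq[String] = wrapped
// A value that instead slips through via erasure fails only at runtime, with the
// ClassCastException shown below. The usual fix is an explicit conversion:
val ok: Seq[String] = wrapped.toSeq // builds an immutable Seq on 2.13
```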

### Why are the changes needed?

Bug fix.
Due to this issue, `ResolvedRFormula.toString` throws `ClassCastException`.
```
java.lang.ClassCastException: scala.collection.mutable.ArraySeq$ofRef cannot be cast to scala.collection.immutable.Seq
        at scala.collection.immutable.List.map(List.scala:246)
        at scala.collection.immutable.List.map(List.scala:79)
        at org.apache.spark.ml.feature.ResolvedRFormula.toString(RFormulaParser.scala:143)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.lang.Thread.run(Thread.java:748)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test was added, and `build/sbt -Pscala-2.13 "testOnly org.apache.spark.ml.feature.RFormulaSuite"` passes.
CIs should ensure that this change works with Scala 2.12 too.

Closes #34301 from sarutak/fix-rformula-scala-2.13.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 0782024)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala (diff)
Commit 47111af57908a50919f644c8827310d5c086763e by gurwls223
[SPARK-36151][INFRA] MiMa updates after 3.2.0 release: bump previousSparkVersion and re-enable tests for Scala 2.13 artifacts

### What changes were proposed in this pull request?

This PR updates MiMa checks following Spark 3.2.0's release:

- Bump `previousSparkVersion` to `3.2.0`
- Re-enable MiMa checks for Scala 2.13 artifacts (see #33355)

### Why are the changes needed?

To ensure that MiMa checks cover new APIs added in Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
$ dev/mima -Pscala-2.12
$ dev/mima -Pscala-2.13
```

Closes #34306 from JoshRosen/update-mima-after-3.2.0-release.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 47111af)
The file was modified project/MimaBuild.scala (diff)
The file was modified dev/mima (diff)
The file was modified project/MimaExcludes.scala (diff)
Commit e7815b1b34bbf47656423d2dc1af82f7c5c5ffcb by wenchen
[SPARK-37032][SQL] Fix broken SQL syntax link in SQL Reference page

### What changes were proposed in this pull request?
Four links to SQL syntax sections on the SQL Reference page are currently broken. This PR fixes the broken links and re-links them to the corresponding subsections of the SQL Syntax doc.

![image](https://user-images.githubusercontent.com/46485123/137661935-d303ef9f-3596-4cac-896d-a53d19c1ca97.png)

### Why are the changes needed?
Fix the SQL Reference doc's broken links.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed; this is a documentation-only change.

Closes #34307 from AngersZhuuuu/SPARK-37032.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: e7815b1)
The file was modified docs/sql-ref.md (diff)
Commit 0bba90b8f3276c8a8fedc7ef5d523eb1ce2246a7 by wenchen
[SPARK-36978][SQL] InferConstraints rule should create IsNotNull constraints on the accessed nested field instead of the root nested type

### What changes were proposed in this pull request?
The PR modifies `IsNotNull` constraint generation to generate constraints on the referenced nested field instead of generating a constraint on the top level nested type. See the following section for an example.

### Why are the changes needed?
The [InferFiltersFromConstraints](https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206) optimization rule generates `IsNotNull` constraints corresponding to null-intolerant predicates. The `IsNotNull` constraints are generated on the attribute inside the corresponding predicate.
For example, a predicate `a > 0` on an integer column `a` results in a constraint `IsNotNull(a)`. On the other hand, a predicate on a nested int column `structCol.b`, where `structCol` is a struct column, results in a constraint `IsNotNull(structCol)`.

Generating constraints on the root-level nested type is extremely conservative, as it could lead to materialization of the entire struct. The constraint should instead be generated on the nested field referenced by the predicate. In the above example, the constraint should be `IsNotNull(structCol.b)` instead of `IsNotNull(structCol)`.

The new constraints also create opportunities for nested-column pruning. Currently, an `IsNotNull(structCol)` constraint precludes pruning of `structCol`, whereas the constraint `IsNotNull(structCol.b)` leaves the unused fields of `structCol` free to be pruned.
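
As a hypothetical illustration of the example above (the schema and names are made up for this sketch; runnable in spark-shell):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

val spark = SparkSession.builder().appName("demo").master("local[1]").getOrCreate()
import spark.implicits._

val df = Seq((1, 2), (3, 4)).toDF("a", "b")
  .select(struct($"a", $"b").as("structCol"))

// With this change, the null-intolerant predicate below produces the constraint
// IsNotNull(structCol.b) rather than IsNotNull(structCol), leaving the optimizer
// free to prune the unused fields of structCol.
df.filter($"structCol.b" > 0).explain(true)
```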

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added test to `InferFiltersFromConstraintsSuite`.

Closes #34263 from utkarsh39/infer-nested-constraints.

Authored-by: Utkarsh <utkarsh.agarwal@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 0bba90b)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/InferFiltersFromConstraintsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala (diff)
Commit c29bb0207754c2018856bda842c6bf7a34d4e93b by piros.attila.zsolt
[SPARK-36965][PYTHON] Extend python test runner by logging out the temp output files

### What changes were proposed in this pull request?

Extend the Python test runner to log the paths of the temporary output files.

### Why are the changes needed?

I was running a Python test that was extremely slow, and I was surprised that unit-tests.log had not even been created. Looking into the code, I found that tests can be executed in parallel and each one has its own temporary output file, which is appended to unit-tests.log only when a test finishes with a failure (after acquiring a lock to avoid parallel writes to unit-tests.log).

To avoid such confusion, it makes sense to log the paths of those temporary output files; this way, when a test gets stuck, we can peek into its log file.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I was running the python tests:
```
./python/run-tests
Running PySpark tests. Output is in /Users/attilazsoltpiros/git/attilapiros/spark/python/unit-tests.log
Will test against the following Python executables: ['/usr/local/bin/python3']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas', 'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming']
/usr/local/bin/python3 python_implementation is CPython
/usr/local/bin/python3 version is: Python 3.9.7
Starting test(/usr/local/bin/python3): pyspark.ml.tests.test_feature (temp output: /tmp/usr_local_bin_python3__pyspark.ml.tests.test_feature__yc5_5mjk.log)
Starting test(/usr/local/bin/python3): pyspark.ml.tests.test_algorithms (temp output: /tmp/usr_local_bin_python3__pyspark.ml.tests.test_algorithms__icc6xxai.log)
Starting test(/usr/local/bin/python3): pyspark.ml.tests.test_base (temp output: /tmp/usr_local_bin_python3__pyspark.ml.tests.test_base__4m6xyiv5.log)
Starting test(/usr/local/bin/python3): pyspark.ml.tests.test_evaluation (temp output: /tmp/usr_local_bin_python3__pyspark.ml.tests.test_evaluation__fkzjlfmm.log)
Finished test(/usr/local/bin/python3): pyspark.ml.tests.test_base (16s)
Starting test(/usr/local/bin/python3): pyspark.ml.tests.test_image (temp output: /tmp/usr_local_bin_python3__pyspark.ml.tests.test_image__iuckk_c0.log)
Finished test(/usr/local/bin/python3): pyspark.ml.tests.test_evaluation (20s)
Starting test(/usr/local/bin/python3): pyspark.ml.tests.test_linalg (temp output: /tmp/usr_local_bin_python3__pyspark.ml.tests.test_linalg__3tncana4.log)
...
```

Closes #34233 from attilapiros/temp_output_at_py_tests.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
(commit: c29bb02)
The file was modified python/run-tests.py (diff)
Commit 21fa3ce1650543d5a087266be1925eb495bf2ad7 by max.gekk
[SPARK-35925][SQL] Support DayTimeIntervalType in width-bucket function

### What changes were proposed in this pull request?
Add support for DayTimeIntervalType in the width_bucket function.

### Why are the changes needed?
[SPARK-35925](https://issues.apache.org/jira/browse/SPARK-35925)
1. The `WIDTH_BUCKET` function assigns values to buckets (individual segments) in an equiwidth histogram.
2. DayTimeIntervalType is necessary as an input data type for `WIDTH_BUCKET` (see the sketch below).
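
A hypothetical usage sketch (assuming an active `spark` session with this change applied):

```
// width_bucket(value, min, max, numBuckets): [0 days, 10 days) split into
// 5 buckets of 2 days each, so a 3-day interval lands in bucket 2.
spark.sql("""
  SELECT width_bucket(
    INTERVAL '3' DAY,
    INTERVAL '0' DAY,
    INTERVAL '10' DAY,
    5) AS bucket
""").show()
```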

### Does this PR introduce _any_ user-facing change?
Yes. The user can use `width_bucket` with DayTimeIntervalType.

### How was this patch tested?
Added unit tests.

Closes #34309 from Peng-Lei/SPARK-35925.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 21fa3ce)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/interval.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/interval.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/interval.sql (diff)
Commit 25fc49571c824218017e2e8eff3e010e1e7c4451 by ueshin
[SPARK-36886][PYTHON] Inline type hints for python/pyspark/sql/context.py

### What changes were proposed in this pull request?
Inline type hints for python/pyspark/sql/context.py, migrated from the stub file python/pyspark/sql/context.pyi.

### Why are the changes needed?
Currently, the type hint stub file python/pyspark/sql/context.pyi shows the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test.

Closes #34185 from dchvn/SPARK-36886.

Authored-by: dch nguyen <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 25fc495)
The file was modified python/pyspark/sql/context.py (diff)
The file was modified python/pyspark/sql/dataframe.py (diff)
The file was removed python/pyspark/sql/context.pyi
The file was modified python/pyspark/sql/observation.py (diff)
Commit 1ef6c13e37bfb64b0f9dd9b624b436064ea86593 by joshrosen
[SPARK-36933][CORE] Clean up TaskMemoryManager.acquireExecutionMemory()

### What changes were proposed in this pull request?
* Factor out a method `trySpillAndAcquire()` from `acquireExecutionMemory()` that handles the details of how to spill a `MemoryConsumer` and acquire the spilled memory. This logic was previously duplicated in two places.
* Combine the two loops (spill other consumers, then self-spill) into a single loop with equivalent logic: self-spill becomes the lowest-priority candidate, which is exactly equivalent to the old behavior (see the sketch below).
* Consolidate the comments a little to explain, at a high level, what the policy is trying to achieve and how.
* Add a couple more debug log messages to make the code easier to follow.
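
A rough Scala sketch of the consolidated loop (a hypothetical simplification for illustration; the real logic lives in TaskMemoryManager.java):

```
object SpillPolicySketch {
  trait MemoryConsumer {
    def memoryUsed: Long
    def spill(required: Long): Long // returns bytes actually released
  }

  // Factored-out helper: ask one consumer to spill, then try to acquire
  // however much it actually freed (capped at what is still needed).
  def trySpillAndAcquire(
      c: MemoryConsumer,
      needed: Long,
      acquireFromPool: Long => Long): Long = {
    val released = c.spill(needed)
    if (released > 0) acquireFromPool(math.min(released, needed)) else 0L
  }

  def acquireExecutionMemory(
      required: Long,
      requester: MemoryConsumer,
      allConsumers: Seq[MemoryConsumer],
      acquireFromPool: Long => Long): Long = {
    var got = acquireFromPool(required)
    // One loop instead of two: the requester itself goes last, i.e. self-spill
    // is the lowest-priority candidate, which is equivalent to the old
    // "spill others first, then spill self" structure.
    val candidates =
      allConsumers.filter(_ ne requester).sortBy(-_.memoryUsed) :+ requester
    val it = candidates.iterator
    while (got < required && it.hasNext) {
      got += trySpillAndAcquire(it.next(), required - got, acquireFromPool)
    }
    got
  }
}
```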

### Why are the changes needed?
Reduce code duplication and better separate the policy decision of which MemoryConsumer to spill from the mechanism of requesting it to spill.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added some unit tests to verify the details of the spilling decisions in some scenarios that are not covered by current unit tests. Ran these on Spark master without the TaskMemoryManager changes to confirm that the behaviour is the same before and after my refactoring.

The SPARK-35486 test also provides some coverage for the retry loop.

Closes #34186 from timarmstrong/cleanup-task-memory-manager.

Authored-by: Tim Armstrong <tim.armstrong@databricks.com>
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
(commit: 1ef6c13)
The file was modified core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java (diff)
The file was modified core/src/test/java/org/apache/spark/memory/TaskMemoryManagerSuite.java (diff)
Commit c2ba498ff678ddda034cedf45cc17fbeefe922fd by ueshin
[SPARK-36945][PYTHON] Inline type hints for python/pyspark/sql/udf.py

### What changes were proposed in this pull request?
Inline type hints for python/pyspark/sql/udf.py, migrated from the stub file python/pyspark/sql/udf.pyi.

### Why are the changes needed?
Currently, the type hint stub file python/pyspark/sql/udf.pyi shows the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test.

Closes #34289 from dchvn/SPARK-36945.

Authored-by: dch nguyen <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: c2ba498)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified python/pyspark/sql/udf.py (diff)
The file was removed python/pyspark/sql/udf.pyi
Commit 4072a22aa2bf15e95d3043f937a3468057f4fd36 by mridulatgmail.com
[SPARK-36834][SHUFFLE] Add support for namespacing log lines emitted by external shuffle service

### What changes were proposed in this pull request?
Added a config `spark.yarn.shuffle.service.logs.namespace` which can be used to add a namespace suffix to log lines emitted by the External Shuffle Service.

### Why are the changes needed?
Since many instances of the ESS can be running on the same NodeManager, a namespace makes it easier to distinguish between them in the logs.
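
One way to picture the idea, as a hypothetical helper (not the actual YarnShuffleService code): appending the configured namespace to the logger name lets log patterns that print the logger name tell the instances apart.

```
import org.slf4j.{Logger, LoggerFactory}

// Hypothetical helper: derive a per-instance logger name from the configured suffix.
def namespacedLogger(base: Class[_], namespace: String): Logger = {
  val name =
    if (namespace == null || namespace.isEmpty) base.getName
    else s"${base.getName}.$namespace"
  LoggerFactory.getLogger(name)
}
```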

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #34079 from thejdeep/SPARK-36834.

Authored-by: Thejdeep Gudivada <tgudivada@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: 4072a22)
The file was modified docs/running-on-yarn.md (diff)
The file was modified common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java (diff)
Commit ebca5232811cb0701d4062ac7ddc21fccc936490 by wenchen
[SPARK-36871][SQL][FOLLOWUP] Move error checking from create cmd to parser

### What changes were proposed in this pull request?
Move error checking from create cmd to parser

### Why are the changes needed?
Catch errors earlier, and make the code consistent between parsing CreateFunction and CreateView.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #34283 from huaxingao/create_view_followup.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: ebca523)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
Commit a6d3a2c84e5cdc642ed57602612f0303585c4b6e by wenchen
[SPARK-37052][CORE] Spark should only pass --verbose argument to main class when is sql shell

### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/32163, Spark started passing `--verbose` to the main class so that the spark-sql shell could use the verbose argument too.
But the main classes of other shells, such as spark-shell, have interpreters that don't support `--verbose`, so we should pass `--verbose` only for the SQL shell.
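
A rough sketch of the intended gating (illustrative names; `SparkSQLCLIDriver` is the SQL shell's main class):

```
object VerboseGating {
  // Only the SQL shell's main class understands --verbose.
  val SqlShellMainClass = "org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver"

  // Forward --verbose to the child only when launching the SQL shell.
  def childArgs(mainClass: String, verbose: Boolean, args: Seq[String]): Seq[String] =
    if (verbose && mainClass == SqlShellMainClass) args :+ "--verbose" else args
}
```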

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #34322 from AngersZhuuuu/SPARK-37052.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: a6d3a2c)
The file was modified core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (diff)
Commit 875963a28a75532871010fcdcdb916bf093dab34 by wenchen
[SPARK-37017][SQL] Reduce the scope of synchronized to prevent potential deadlock

### What changes were proposed in this pull request?

There is a `synchronized` block in the `CatalogManager.currentNamespace` function.
This PR pulls the `SessionCatalog.getCurrentDatabase` call out of this `synchronized` block to prevent a potential deadlock.

### Why are the changes needed?

In our case, we have implemented an external catalog, and there is a thread that directly calls SessionCatalog.getTempViewOrPermanentTableMetadata and holds the lock of SessionCatalog. It eventually goes into our external catalog; unfortunately, we then call some functions of SparkSession, e.g. sql. When it calls CatalogManager.currentNamespace, it tries to take the lock of CatalogManager.

In the meantime, there are some query threads that execute sqls via DataFrame interface.

This is how deadlock occurs.
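
A minimal sketch of the lock cycle and the fix, with hypothetical names (the real change is in CatalogManager.scala):

```
trait SessionCatalog { def getCurrentDatabase: String } // its methods take lock B

class CatalogManagerSketch(sessionCatalog: SessionCatalog) {
  private var cached: Option[Array[String]] = None

  // Before: lock A (this) is held while getCurrentDatabase acquires lock B.
  // A thread already holding B that calls back into here waits for A,
  // completing the cycle: deadlock.
  def currentNamespaceBefore: Array[String] = synchronized {
    cached.getOrElse(Array(sessionCatalog.getCurrentDatabase))
  }

  // After: acquire B on its own first, then take A only for the cached state.
  def currentNamespaceAfter: Array[String] = {
    val db = sessionCatalog.getCurrentDatabase
    synchronized { cached.getOrElse(Array(db)) }
  }
}
```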

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Uses existing tests.

Closes #34292 from chenzhx/bug-fix.

Authored-by: chenzhx <chen@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 875963a)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala (diff)
Commit db893207ba444a303b1915afeb90b82ef3808cf8 by gengliang
[SPARK-37057][INFRA] Fix wrong DocSearch facet filter in release-tag.sh

### What changes were proposed in this pull request?

In release-tag.sh, the DocSearch facet filter should be updated to the release version before creating the git tag.
Otherwise, the facet filter will be wrong in the new release's docs: https://github.com/apache/spark/blame/v3.2.0/docs/_config.yml#L42

### Why are the changes needed?

Fix a bug in release-tag.sh

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

Closes #34328 from gengliangwang/fixFacetFilters.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: db89320)
The file was modified dev/create-release/release-tag.sh (diff)
Commit 38493401d18d42a6cb176bf515536af97ba1338b by srowen
[SPARK-36796][BUILD][CORE][SQL] Pass all `sql/core` and dependent modules UTs with JDK 17 except one case in `postgreSQL/text.sql`

### What changes were proposed in this pull request?
In order to pass the UTs related to sql/core and dependent modules, this PR mainly makes the following changes:

- Add a new property named `extraJavaTestArgs` to `pom.xml`. It includes all `--add-opens` configurations required to run UTs with Java 17, and it also includes `-XX:+IgnoreUnrecognizedVMOptions` for compatibility with Java 8.
- Add a new helper class named `JavaModuleOptions` to hold the constants and methods related to `--add-opens`.
- The `--add-opens` options and `-XX:+IgnoreUnrecognizedVMOptions` are added by default when `SparkSubmitCommandBuilder` builds the submit command; this helps `local-cluster`-related UTs pass and reduces the cost of using Java 17 for users.
- The `--add-opens` options and `-XX:+IgnoreUnrecognizedVMOptions` are appended to `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` when initializing the SparkContext; this also helps UTs pass and reduces the cost of using Java 17 for users (see the sketch below).
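
A sketch of what the supplementing amounts to (the two `--add-opens` flags shown are examples of real JVM flags Spark needs; the helper and its names are illustrative):

```
object ModuleOptionsSketch {
  // -XX:+IgnoreUnrecognizedVMOptions makes Java 8 silently skip the
  // --add-opens flags it does not understand; Java 9+ honors them.
  val DefaultModuleOptions: Seq[String] = Seq(
    "-XX:+IgnoreUnrecognizedVMOptions",
    "--add-opens=java.base/java.lang=ALL-UNNAMED",
    "--add-opens=java.base/java.nio=ALL-UNNAMED")

  // Prepend the defaults to whatever the user already configured in
  // spark.driver.extraJavaOptions / spark.executor.extraJavaOptions.
  def supplement(existing: Option[String]): String =
    (DefaultModuleOptions ++ existing).mkString(" ")
}
```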

### Why are the changes needed?
Pass Spark UTs with JDK 17

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manual test with `mvn clean install -pl sql/core -am` on Java 17; only `select format_string('%0$s', 'Hello')` in `postgreSQL/text.sql` failed, due to differing behavior between Java 8 and Java 17

Closes #34153 from LuciferYang/SPARK-36796.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 3849340)
The file was added launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java
The file was modified sql/core/pom.xml (diff)
The file was modified launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java (diff)
The file was modified project/SparkBuild.scala (diff)
The file was modified sql/catalyst/pom.xml (diff)
The file was modified core/src/main/scala/org/apache/spark/SparkContext.scala (diff)
The file was modified pom.xml (diff)