Failed

Changes

Summary

  1. [SPARK-34036][DOCS] Update ORC data source documentation (commit: 9b5df2a) (details)
  2. [SPARK-34021][R] Fix hyper links in SparkR documentation for CRAN (commit: 0ba3ab4) (details)
  3. [SPARK-34028][SQL] Cleanup "unreachable code" compilation warning (commit: 26b6039) (details)
  4. [SPARK-33861][SQL][FOLLOWUP] Simplify conditional in predicate should (commit: 3aa4e11) (details)
  5. [SPARK-34031][SQL] Union operator missing rowCount when CBO enabled (commit: aa509c1) (details)
  6. Revert "[SPARK-34029][SQL][TESTS] Add OrcEncryptionSuite and (commit: 194edc8) (details)
  7. [SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid (commit: d36cdd5) (details)
  8. [SPARK-33100][SQL][FOLLOWUP] Find correct bound of bracketed comment in (commit: 7b06acc) (details)
  9. [SPARK-34041][PYTHON][DOCS] Miscellaneous cleanup for new PySpark (commit: aa388cf) (details)
  10. [SPARK-34044][DOCS] Add spark.sql.hive.metastore.jars.path to (commit: 5b16d70) (details)
  11. [SPARK-34018][K8S] NPE in ExecutorPodsSnapshot (commit: 8e11ce5) (details)
  12. [SPARK-33818][SQL][DOC] Add descriptions about (commit: 9b54da4) (details)
  13. [SPARK-34039][SQL] ReplaceTable should invalidate cache (commit: 0de7f2f) (details)
  14. [SPARK-34005][CORE] Update peak memory metrics for each Executor on task (commit: cc20154) (details)
  15. [SPARK-34046][SQL][TESTS] Use join hint for constructing joins in (commit: b95a847) (details)
  16. [SPARK-34003][SQL] Fix Rule conflicts between (commit: 0f8e5dd) (details)
  17. [SPARK-34032][SS] Add truststore and keystore type config possibility (commit: 71d261a) (details)
  18. [SPARK-33591][SQL] Recognize `null` in partition spec values (commit: 157b72a) (details)
  19. [SPARK-33796][DOCS][FOLLOWUP] Tweak the width of left-menu of Spark SQL (commit: 023eba2) (details)
  20. [MINOR][SQL][TESTS] Fix the incorrect unicode escape test in (commit: 0781ed4) (details)
  21. [SPARK-32917][SHUFFLE][CORE] Adds support for executors to push shuffle (commit: d00f069) (details)
  22. [SPARK-34049][SS] DataSource V2: Use Write abstraction in (commit: 6b34745) (details)
  23. [SPARK-34048][SQL][TESTS] Check the amount of calls to Hive external (commit: 0af3874) (details)
  24. [SPARK-34030][SQL] Fold RepartitionByExpression num partition should at (commit: 48cd11c) (details)
  25. Revert "[SPARK-33933][SQL] Materialize BroadcastQueryStage first to (commit: 105ba6e) (details)
  26. [SPARK-34055][SQL] Refresh cache in `ALTER TABLE .. ADD PARTITION` (commit: e0e06c1) (details)
  27. [SPARK-32668][SQL] HiveGenericUDTF initialize UDTF should use (commit: 48b9611) (details)
  28. [SPARK-34055][SQL][TESTS][FOLLOWUP] Increase the expected number of (commit: 9a8d275) (details)
  29. [SPARK-34059][SQL][CORE] Use for/foreach rather than map to make sure (commit: 8302492) (details)
  30. [SPARK-34053][INFRA] Cancel the previous build (commit: 3e5e086) (details)
Commit 9b5df2afaa5df85f149ccf73b7a6b78ab0f393bc by dhyun
[SPARK-34036][DOCS] Update ORC data source documentation

### What changes were proposed in this pull request?

This PR aims to update SQL documentation about ORC data sources.

New structure looks like the following.
- ORC Implementation
- Vectorized Reader
- Schema Merging
- Zstandard
- Bloom Filters
- Columnar Encryption
- Hive metastore ORC table conversion
- Configuration

### Why are the changes needed?

This document is not up-to-date. Apache Spark 3.2.0 can utilize new improvements from Apache ORC 1.6.6.

### Does this PR introduce _any_ user-facing change?

No, this is a documentation-only change.

### How was this patch tested?

Manual.
```
SKIP_API=1 jekyll build
```

---
**BEFORE**
![Screen Shot 2021-01-06 at 5 08 19 PM](https://user-images.githubusercontent.com/9700541/103838399-d0bbd880-5041-11eb-8757-297728d2793f.png)

---
**AFTER**
![Screen Shot 2021-01-06 at 7 03 38 PM](https://user-images.githubusercontent.com/9700541/103845972-0963ae00-5052-11eb-905e-8e8b335c760a.png)
![Screen Shot 2021-01-06 at 7 03 49 PM](https://user-images.githubusercontent.com/9700541/103845971-08cb1780-5052-11eb-9b2a-d3acfa4b9278.png)
![Screen Shot 2021-01-06 at 7 03 59 PM](https://user-images.githubusercontent.com/9700541/103845970-08328100-5052-11eb-8982-7079fd7b0efc.png)
![Screen Shot 2021-01-06 at 7 04 10 PM](https://user-images.githubusercontent.com/9700541/103845968-08328100-5052-11eb-9ef5-db99c7cc64d3.png)
![Screen Shot 2021-01-06 at 7 04 16 PM](https://user-images.githubusercontent.com/9700541/103845963-07015400-5052-11eb-955f-8126d417e8aa.png)

Closes #31075 from dongjoon-hyun/SPARK-34036.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 9b5df2a)
The file was modified docs/sql-data-sources-orc.md (diff)
Commit 0ba3ab4c23ee1cd3785caa0fde76862dce478530 by gurwls223
[SPARK-34021][R] Fix hyper links in SparkR documentation for CRAN submission

### What changes were proposed in this pull request?

The 3.0.1 CRAN submission failed for the reason below:

```
   Found the following (possibly) invalid URLs:
     URL: http://jsonlines.org/ (moved to https://jsonlines.org/)
       From: man/read.json.Rd
             man/write.json.Rd
       Status: 200
       Message: OK
     URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to
https://dl.acm.org/doi/10.1109/MC.2009.263)
       From: inst/doc/sparkr-vignettes.html
       Status: 200
       Message: OK
```

The links are now being redirected. This PR checks all hyperlinks in the docs, such as `href{...}` and `url{...}`, and fixes them all in SparkR:

- Fix two problems above.
- Fix http to https
- Fix `https://www.apache.org/ https://spark.apache.org/` -> `https://www.apache.org https://spark.apache.org`.

### Why are the changes needed?

For CRAN submission.

### Does this PR introduce _any_ user-facing change?

Virtually none, because it's just cleanup that CRAN requires.

### How was this patch tested?

Manually tested by clicking the links

Closes #31058 from HyukjinKwon/SPARK-34021.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 0ba3ab4)
The file was modified R/pkg/R/mllib_clustering.R (diff)
The file was modified R/pkg/R/mllib_stat.R (diff)
The file was modified R/pkg/R/mllib_recommendation.R (diff)
The file was modified R/pkg/R/mllib_tree.R (diff)
The file was modified R/pkg/R/stats.R (diff)
The file was modified R/pkg/R/mllib_classification.R (diff)
The file was modified R/pkg/R/mllib_regression.R (diff)
The file was modified R/pkg/vignettes/sparkr-vignettes.Rmd (diff)
The file was modified R/pkg/DESCRIPTION (diff)
The file was modified R/pkg/R/SQLContext.R (diff)
The file was modified R/pkg/R/DataFrame.R (diff)
The file was modified R/pkg/R/install.R (diff)
Commit 26b603992c4b9b5a58e46e0566c1547b86249709 by gurwls223
[SPARK-34028][SQL] Cleanup "unreachable code" compilation warning

### What changes were proposed in this pull request?
There is one compilation warning as follows:

```
[WARNING] [Warn] /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1555: [other-match-analysis  org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction.catalogFunction] unreachable code
```

This compilation warning occurs because `NoSuchPermanentFunctionException` is a subclass of `AnalysisException`, so any `NoSuchPermanentFunctionException` that is thrown will already be caught by `case _: AnalysisException => failFunctionLookup(name)`. As a result, `case _: NoSuchPermanentFunctionException => failFunctionLookup(name)` is unreachable code.

This PR removes `case _: NoSuchPermanentFunctionException => failFunctionLookup(name)` directly, because both branches handle the exception in the same way: `failFunctionLookup(name)`.
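
A minimal sketch of the pattern, with simplified stand-in classes rather than Spark's actual `SessionCatalog` code: because the more specific exception extends the broader one, the broader `case` catches it first and the specific `case` can never match.

```scala
// Simplified stand-ins for the real Spark exception classes.
class AnalysisException(msg: String) extends Exception(msg)
class NoSuchPermanentFunctionException(db: String, func: String)
  extends AnalysisException(s"Function '$db.$func' not found")

def lookup(body: => String): String =
  try body catch {
    // This broader case already catches NoSuchPermanentFunctionException ...
    case _: AnalysisException => "failFunctionLookup"
    // ... so this more specific case is unreachable (the warning being cleaned up).
    case _: NoSuchPermanentFunctionException => "failFunctionLookup"
  }
```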

### Why are the changes needed?
Cleanup "unreachable code" compilation warnings.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Actions builds.

Closes #31064 from LuciferYang/SPARK-34028.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 26b6039)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala (diff)
Commit 3aa4e113c5162f5de12c2aa43b6af65a7f2110af by gurwls223
[SPARK-33861][SQL][FOLLOWUP] Simplify conditional in predicate should consider deterministic

### What changes were proposed in this pull request?

This PR addresses https://github.com/apache/spark/pull/30865#pullrequestreview-562344089: simplifying conditionals in a predicate should take determinism into account.

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31067 from wangyum/SPARK-33861-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 3aa4e11)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalsInPredicate.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalsInPredicateSuite.scala (diff)
Commit aa509c1eeed688ddf21553aefe7b48cdf072fc5b by gurwls223
[SPARK-34031][SQL] Union operator missing rowCount when CBO enabled

### What changes were proposed in this pull request?

This PR adds the row count to the `Union` operator when CBO is enabled.
```scala
spark.sql("CREATE TABLE t1 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("CREATE TABLE t2 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").explain("cost")
```

Before this pr:
```
== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B)
:- Relation[id#5880L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#5881L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
```

After this pr:
```
== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B, rowCount=20)
:- Relation[id#2138L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#2139L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
```
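
For illustration, a minimal sketch of the estimation idea with simplified types (not Spark's actual stats visitor code): a `Union`'s row count can be estimated as the sum of its children's row counts, and is only propagated when every child reports one.

```scala
// Simplified statistics holder; Spark's real Statistics class has more fields.
case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

def unionStats(children: Seq[Stats]): Stats = {
  val size = children.map(_.sizeInBytes).sum
  // Only propagate rowCount when every child has one; otherwise leave it undefined.
  val rows =
    if (children.nonEmpty && children.forall(_.rowCount.isDefined))
      Some(children.flatMap(_.rowCount).sum)
    else None
  Stats(size, rows)
}

// e.g. unionStats(Seq(Stats(160, Some(10)), Stats(160, Some(10)))) == Stats(320, Some(20))
```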

### Why are the changes needed?

Improve query performance; [`JoinEstimation.estimateInnerOuterJoin`](https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala#L55-L156) needs the row count.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31068 from wangyum/SPARK-34031.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: aa509c1)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5.sf100/explain.txt (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q5a.sf100/simplified.txt (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q2.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q54.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q2.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q54.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q5a.sf100/explain.txt (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala (diff)
Commit 194edc86a2959f912b4e4d0bb4867b5cb2fd0813 by dhyun
Revert "[SPARK-34029][SQL][TESTS] Add OrcEncryptionSuite and FakeKeyProvider"

This reverts commit 8bb70bf0d646f6d54d17690d23ee935e452e747e.
(commit: 194edc8)
The file was removed sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/orc/FakeKeyProvider.java
The file was modified project/SparkBuild.scala (diff)
The file was removed sql/core/src/test/resources/META-INF/services/org.apache.hadoop.crypto.key.KeyProviderFactory
The file was removed sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcEncryptionSuite.scala
Commit d36cdd55419c104134f88930206bedccdbe4f3c0 by wenchen
[SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE

### What changes were proposed in this pull request?
In `AdaptiveSparkPlanExec.getFinalPhysicalPlan`, when new stages are generated, sort them by class type to make sure `BroadcastQueryStage`s precede the others.
This ensures the broadcast jobs are submitted before the map jobs, so they do not have to wait on job scheduling and hit the broadcast timeout.

### Why are the changes needed?
When AQE is enabled, `getFinalPhysicalPlan` traverses the physical plan bottom-up, creates query stages for the materialized parts via `createQueryStages`, and materializes those newly created query stages to submit map stages or broadcasts. When a `ShuffleQueryStage` is materialized before a `BroadcastQueryStage`, the map job and the broadcast job are submitted at almost the same time, but the map job will hold all the computing resources. If the map job runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot be started (and finished) before `spark.sql.broadcastTimeout`, causing the whole job to fail (introduced in SPARK-31475).
The workaround of increasing `spark.sql.broadcastTimeout` is neither reasonable nor graceful, because the data to broadcast is very small.
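
A minimal sketch of the ordering idea with assumed class names (not the actual `AdaptiveSparkPlanExec` code): materialize broadcast stages before shuffle stages so the small broadcast job is submitted before the long-running map jobs.

```scala
// Simplified stand-ins for AQE query stages.
sealed trait QueryStage { def materialize(): Unit }
case class BroadcastStage(id: Int) extends QueryStage {
  def materialize(): Unit = println(s"submit broadcast job for stage $id")
}
case class ShuffleStage(id: Int) extends QueryStage {
  def materialize(): Unit = println(s"submit map job for stage $id")
}

def materializeNewStages(newStages: Seq[QueryStage]): Unit =
  newStages
    .sortBy {
      case _: BroadcastStage => 0 // broadcast stages first
      case _                 => 1
    }
    .foreach(_.materialize())
```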

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
1. Add UT
2. Test the code using dev environment in https://issues.apache.org/jira/browse/SPARK-33933

Closes #30998 from zhongyu09/aqe-broadcast.

Authored-by: Yu Zhong <yzhong@freewheel.tv>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: d36cdd5)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
Commit 7b06acc28b5c37da6c48bc44c3d921309d4ad3a8 by yamamuro
[SPARK-33100][SQL][FOLLOWUP] Find correct bound of bracketed comment in spark-sql

### What changes were proposed in this pull request?

This PR help find correct bound of bracketed comment in spark-sql.

Here is the log for UT of SPARK-33100 in CliSuite before:
```
2021-01-05 13:22:34.768 - stdout> spark-sql> /* SELECT 'test';*/ SELECT 'test';
2021-01-05 13:22:41.523 - stderr> Time taken: 6.716 seconds, Fetched 1 row(s)
2021-01-05 13:22:41.599 - stdout> test
2021-01-05 13:22:41.6 - stdout> spark-sql> ;;/* SELECT 'test';*/ SELECT 'test';
2021-01-05 13:22:41.709 - stdout> test
2021-01-05 13:22:41.709 - stdout> spark-sql> /* SELECT 'test';*/;; SELECT 'test';
2021-01-05 13:22:41.902 - stdout> spark-sql> SELECT 'test'; -- SELECT 'test';
2021-01-05 13:22:41.902 - stderr> Time taken: 0.129 seconds, Fetched 1 row(s)
2021-01-05 13:22:41.902 - stderr> Error in query:
2021-01-05 13:22:41.902 - stderr> mismatched input '<EOF>' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 19)
2021-01-05 13:22:42.006 - stderr>
2021-01-05 13:22:42.006 - stderr> == SQL ==
2021-01-05 13:22:42.006 - stderr> /* SELECT 'test';*/
2021-01-05 13:22:42.006 - stderr> -------------------^^^
2021-01-05 13:22:42.006 - stderr>
2021-01-05 13:22:42.006 - stderr> Time taken: 0.226 seconds, Fetched 1 row(s)
2021-01-05 13:22:42.006 - stdout> test
```
The root cause is that `insideBracketedComment` is not accurate.

For `/* comment */`, the last character `/` is not considered to be inside the bracketed comment, so it is treated as the beginning of a new statement.

In this PR, this issue is fixed.
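
A minimal sketch of the intent, simplified rather than `SparkSQLCLIDriver`'s actual code: when splitting a line on `;`, both characters of the closing `*/` should still count as part of the bracketed comment, so the trailing `/` cannot start a new statement.

```scala
def splitStatements(line: String): Seq[String] = {
  val statements = scala.collection.mutable.ArrayBuffer.empty[String]
  val current = new StringBuilder
  var insideBracketedComment = false
  var i = 0
  while (i < line.length) {
    if (!insideBracketedComment && line.startsWith("/*", i)) {
      insideBracketedComment = true; current.append("/*"); i += 2
    } else if (insideBracketedComment && line.startsWith("*/", i)) {
      // Both characters of "*/" belong to the comment; only afterwards are we outside it.
      insideBracketedComment = false; current.append("*/"); i += 2
    } else if (line(i) == ';' && !insideBracketedComment) {
      statements += current.toString.trim; current.clear(); i += 1
    } else {
      current.append(line(i)); i += 1
    }
  }
  if (current.toString.trim.nonEmpty) statements += current.toString.trim
  statements.toSeq
}

// splitStatements("/* SELECT 'test';*/ SELECT 'test';")
//   == Seq("/* SELECT 'test';*/ SELECT 'test'")
```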

### Why are the changes needed?
To fix the issue described above.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes #31054 from turboFei/SPARK-33100-followup.

Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 7b06acc)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala (diff)
Commit aa388cf3d0ff230eb0397876fe2db03bbe51658e by gurwls223
[SPARK-34041][PYTHON][DOCS] Miscellaneous cleanup for new PySpark documentation

### What changes were proposed in this pull request?

This PR proposes to:
- Add a link of quick start in PySpark docs into "Programming Guides" in Spark main docs
- `ML` / `MLlib` -> `MLlib (DataFrame-based)` / `MLlib (RDD-based)` in API reference page
- Mention other user guides as well, such as the [ML](http://spark.apache.org/docs/latest/ml-guide.html) and [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html) guides.
- Mention other migration guides as well because PySpark can be affected by them.

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

It fixes user-facing docs. However, they have not been released yet.

### How was this patch tested?

Manually tested by running:

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

Closes #31082 from HyukjinKwon/SPARK-34041.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: aa388cf)
The file was modified docs/index.md (diff)
The file was modified python/docs/source/getting_started/index.rst (diff)
The file was modified docs/_layouts/global.html (diff)
The file was modified python/docs/source/user_guide/index.rst (diff)
The file was modified python/docs/source/reference/pyspark.ml.rst (diff)
The file was modified python/docs/source/reference/pyspark.mllib.rst (diff)
The file was modified python/docs/source/migration_guide/index.rst (diff)
Commit 5b16d70d6a51720660e7607c859fae4f28691952 by gurwls223
[SPARK-34044][DOCS] Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md

### What changes were proposed in this pull request?

This PR adds new configuration to `sql-data-sources-hive-tables`.

### Why are the changes needed?

SPARK-32852 added a new configuration, `spark.sql.hive.metastore.jars.path`.

### Does this PR introduce _any_ user-facing change?

Yes, but it is a documentation-only change.

### How was this patch tested?

**BEFORE**
![Screen Shot 2021-01-07 at 2 57 57 PM](https://user-images.githubusercontent.com/9700541/103954318-cc9ec200-50f8-11eb-86d3-cd89b07fcd21.png)

**AFTER**
![Screen Shot 2021-01-07 at 2 56 34 PM](https://user-images.githubusercontent.com/9700541/103954221-9d885080-50f8-11eb-8938-fb91394a33cb.png)

Closes #31085 from dongjoon-hyun/SPARK-34044.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 5b16d70)
The file was modified docs/sql-data-sources-hive-tables.md (diff)
Commit 8e11ce5378a2cf69ec87501e86f7ed5963649cbf by dhyun
[SPARK-34018][K8S] NPE in ExecutorPodsSnapshot

### What changes were proposed in this pull request?

Label both the statuses and ensure the ExecutorPodSnapshot starts with the default config to match.

### Why are the changes needed?

The current test depends on the order rather than testing the desired property.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Labeled the container statuses, observed the failures, added the default label as the initialization point, and the tests passed again.

Built Spark, ran it on a K8s cluster, and verified there is no NPE in the driver log.

Closes #31071 from holdenk/SPARK-34018-finishedExecutorWithRunningSidecar-doesnt-correctly-constructt-the-test-case.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 8e11ce5)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshot.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorLifecycleTestUtils.scala (diff)
Commit 9b54da490d55d8c12e0a6b2b4b6e3a2d5b6bed86 by dhyun
[SPARK-33818][SQL][DOC] Add descriptions about `spark.sql.parser.quotedRegexColumnNames` in the SQL documents

### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/30805#issuecomment-747179899,
document `spark.sql.parser.quotedRegexColumnNames` in the SQL guide, since users need to know about it and it's useful.

![image](https://user-images.githubusercontent.com/46485123/103656543-afa4aa80-4fa3-11eb-8cd3-a9d1b87a3489.png)
![image](https://user-images.githubusercontent.com/46485123/103656551-b2070480-4fa3-11eb-9ce7-95cc424242a6.png)
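
A small usage illustration of the documented config, run in `spark-shell` (hedged; see the docs for the exact semantics): when `spark.sql.parser.quotedRegexColumnNames` is `true`, quoted identifiers in `SELECT` are interpreted as Hive-style column-name regexes.

```scala
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
spark.sql("SELECT 1 AS id, 'a' AS name, 2.0 AS score").createOrReplaceTempView("t")
// The quoted identifier is treated as a regex: select every column except `id`.
spark.sql("SELECT `(id)?+.+` FROM t").show()
```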

### Why are the changes needed?
Completes the docs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #30816 from AngersZhuuuu/SPARK-33818.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 9b54da4)
The file was modified docs/sql-ref-syntax-qry-select.md (diff)
Commit 0de7f2ff1ebb9b3339ecf30074a0d7ffc1ff6325 by dhyun
[SPARK-34039][SQL] ReplaceTable should invalidate cache

### What changes were proposed in this pull request?

This changes `ReplaceTableExec`/`AtomicReplaceTableExec`, and uncaches the target table before it is dropped. In addition, this includes some refactoring by moving the `uncacheTable` method to `DataSourceV2Strategy` so that we don't need to pass a Spark session to the v2 exec.

### Why are the changes needed?

Similar to SPARK-33492 (#30429). When a table is replaced, the associated cache should be invalidated to avoid potentially incorrect results.

### Does this PR introduce _any_ user-facing change?

Yes. Now, when a data source v2 table is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced.
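
A hedged illustration of that behavior with hypothetical catalog and table names (assuming `testcat` is a configured v2 catalog); this is not a test from the PR, just a sketch of the scenario:

```scala
spark.sql("CREATE TABLE testcat.ns.t USING foo AS SELECT 1 AS id")
spark.sql("CACHE TABLE testcat.ns.t")
spark.sql("REPLACE TABLE testcat.ns.t USING foo AS SELECT 2 AS id")
// With this change, cache entries that refer to testcat.ns.t are invalidated when the
// table is replaced, so the query below reads the new data rather than a stale cache.
spark.sql("SELECT * FROM testcat.ns.t").show()
```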

### How was this patch tested?

Added a new unit test.

Closes #31081 from sunchao/SPARK-34039.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 0de7f2f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ReplaceTableExec.scala (diff)
Commit cc201545626ffe556682f45edc370ac6fe29e9df by dhyun
[SPARK-34005][CORE] Update peak memory metrics for each Executor on task end

### What changes were proposed in this pull request?

This PR makes `AppStatusListener` update the peak memory metrics for each Executor on task end like other peak memory metrics (e.g, stage, executors in a stage).

### Why are the changes needed?

When `AppStatusListener#onExecutorMetricsUpdate` is called, the peak memory metrics for Executors, stages, and executors within a stage are updated, but currently the Executor-level metrics are not updated on task end.
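
A minimal sketch of the peak-tracking idea with simplified types (not `AppStatusListener`'s actual code): keep a component-wise maximum per executor and fold in the metrics reported at task end as well, not only those from heartbeat metric updates.

```scala
// Simplified peak-metrics holder; Spark's ExecutorMetrics tracks named metrics.
final case class PeakMetrics(values: Vector[Long]) {
  def merge(latest: Vector[Long]): PeakMetrics =
    PeakMetrics(values.zip(latest).map { case (peak, cur) => math.max(peak, cur) })
}

def onTaskEnd(executorPeaks: Map[String, PeakMetrics],
              executorId: String,
              taskMetrics: Vector[Long]): Map[String, PeakMetrics] = {
  val updated = executorPeaks.get(executorId) match {
    case Some(peaks) => peaks.merge(taskMetrics)
    case None        => PeakMetrics(taskMetrics)
  }
  executorPeaks + (executorId -> updated)
}
```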

### Does this PR introduce _any_ user-facing change?

Yes. Executor peak memory metrics are updated more accurately.

### How was this patch tested?

After running a job with `local-cluster[1,1,1024]` and visiting `/api/v1/<appid>/executors`, I confirmed the `peakExecutorMemory` metrics are shown for an Executor even though the lifetime of each job is very short.
I also modified the JSON files for `HistoryServerSuite`.

Closes #31029 from sarutak/update-executor-metrics-on-taskend.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: cc20154)
The file was modified core/src/main/scala/org/apache/spark/status/AppStatusListener.scala (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/executor_memory_usage_expectation.json (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/executor_list_json_expectation.json (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/executor_node_excludeOnFailure_unexcluding_expectation.json (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/executor_node_excludeOnFailure_expectation.json (diff)
Commit b95a847ce1686dd1e1c6555afe2436caec6130e6 by wenchen
[SPARK-34046][SQL][TESTS] Use join hint for constructing joins in JoinSuite and WholeStageCodegenSuite

### What changes were proposed in this pull request?

There are some existing test cases that construct various joins by tuning SQL configurations such as AUTO_BROADCASTJOIN_THRESHOLD, PREFER_SORTMERGEJOIN, SHUFFLE_PARTITIONS, etc.

This can be tricky and not straightforward, and in future development we might have to tweak the configurations again.
This PR constructs specific joins by using join hints in the test cases, as sketched below.
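
A short illustration with hypothetical DataFrames `df1` and `df2` that share an `id` column: pinning the join strategy with hints instead of tuning thresholds.

```scala
import org.apache.spark.sql.functions.broadcast

val sortMergeJoin = df1.hint("SHUFFLE_MERGE").join(df2, "id") // force a sort-merge join
val broadcastJoin = df1.join(broadcast(df2), "id")            // force a broadcast hash join
val shuffleHash   = df1.hint("SHUFFLE_HASH").join(df2, "id")  // force a shuffle hash join
```
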
### Why are the changes needed?

Make test cases for join simpler and more robust.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #31087 from gengliangwang/joinhintInTest.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: b95a847)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala (diff)
Commit 0f8e5dd445b03161a27893ba714db57919d8bcab by wenchen
[SPARK-34003][SQL] Fix Rule conflicts between PaddingAndLengthCheckForCharVarchar and ResolveAggregateFunctions

### What changes were proposed in this pull request?

ResolveAggregateFunctions is a hacky rule: it calls `executeSameContext` to generate a `resolved agg` in order to determine which unresolved sort attributes should be pushed into the agg. However, after adding the PaddingAndLengthCheckForCharVarchar rule, which rewrites the query output, the `resolved agg` can no longer match the original attributes.

This causes a dissociated sort attribute to be pushed in, which fails the query:

```
[info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 'testcat.t1.`v`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
[info]   Project [v#14, sum(i)#11L]
[info]   +- Sort [aggOrder#12 ASC NULLS FIRST], true
[info]      +- !Aggregate [v#14], [v#14, sum(cast(i#7 as bigint)) AS sum(i)#11L, v#13 AS aggOrder#12]
[info]         +- SubqueryAlias testcat.t1
[info]            +- Project [if ((length(v#6) <= 3)) v#6 else if ((length(rtrim(v#6, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#6) as string),  exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#6, None), 3,  ) AS v#14, i#7]
[info]               +- RelationV2[v#6, i#7, index#15, _partition#16] testcat.t1
[info]
[info]   Project [v#14, sum(i)#11L]
[info]   +- Sort [aggOrder#12 ASC NULLS FIRST], true
[info]      +- !Aggregate [v#14], [v#14, sum(cast(i#7 as bigint)) AS sum(i)#11L, v#13 AS aggOrder#12]
[info]         +- SubqueryAlias testcat.t1
[info]            +- Project [if ((length(v#6) <= 3)) v#6 else if ((length(rtrim(v#6, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#6) as string),  exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#6, None), 3,  ) AS v#14, i#7]
[info]               +- RelationV2[v#6, i#7, index#15, _partition#16] testcat.t1
```

### Why are the changes needed?

bugfix
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

Closes #31027 from yaooqinn/SPARK-34003.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 0f8e5dd)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit 71d261ab8fb3e7fb22d2687b8e038129ca766a65 by kabhwan.opensource
[SPARK-34032][SS] Add truststore and keystore type config possibility for Kafka delegation token

### What changes were proposed in this pull request?
The Kafka delegation token is obtained with an `AdminClient`, where security settings can be set. The keystore and truststore types, however, can't be set. In this PR I've added these new configurations. This can be useful when the type is different; a good example is making Spark FIPS compliant, where the default JKS type is not accepted.

### Why are the changes needed?
Missing configurations.

### Does this PR introduce _any_ user-facing change?
Yes, adding 2 additional config parameters.

### How was this patch tested?
Existing + modified unit tests + simple Kafka to Kafka app on cluster.

Closes #31070 from gaborgsomogyi/SPARK-34032.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 71d261a)
The file was modified external/kafka-0-10-token-provider/src/main/scala/org/apache/spark/kafka010/KafkaTokenUtil.scala (diff)
The file was modified docs/structured-streaming-kafka-integration.md (diff)
The file was modified external/kafka-0-10-token-provider/src/main/scala/org/apache/spark/kafka010/KafkaTokenSparkConf.scala (diff)
The file was modified external/kafka-0-10-token-provider/src/test/scala/org/apache/spark/kafka010/KafkaTokenSparkConfSuite.scala (diff)
The file was modified external/kafka-0-10-token-provider/src/test/scala/org/apache/spark/kafka010/KafkaTokenUtilSuite.scala (diff)
The file was modified external/kafka-0-10-token-provider/src/test/scala/org/apache/spark/kafka010/KafkaDelegationTokenTest.scala (diff)
Commit 157b72ac9fa0057d5fd6d7ed52a6c4b22ebd1dfc by wenchen
[SPARK-33591][SQL] Recognize `null` in partition spec values

### What changes were proposed in this pull request?
1. Recognize `null` while parsing partition specs, and put `null` instead of `"null"` as partition values.
2. For the V1 catalog: replace `null` with `__HIVE_DEFAULT_PARTITION__` (see the sketch after this list).
3. For V2 catalogs: pass `null` as is, and let catalog implementations decide how to handle `null`s as partition values in the spec.
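
A minimal sketch of the V1 behavior from item 2, using a simplified helper rather than Spark's `ExternalCatalogUtils` code: a `null` partition value is stored under Hive's default partition name instead of the literal string `"null"`.

```scala
val DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"

// Convert a parsed partition value into the string stored by the V1 (Hive) catalog.
def toV1PartitionValue(value: String): String =
  if (value == null) DEFAULT_PARTITION_NAME else value
```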

### Why are the changes needed?
Currently, `null` in partition specs is recognized as the `"null"` string which could lead to incorrect results, for example:
```sql
spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false
```
Even though we inserted a row into the partition with a `null` value, **the resulting table doesn't contain `null`**.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works as expected:
```sql
spark-sql> SELECT isnull(p1) FROM tbl5;
true
```

### How was this patch tested?
1. By running the affected test suites `SQLQuerySuite`, `AlterTablePartitionV2SQLSuite` and `v1/ShowPartitionsSuite`.
2. Compiling by Scala 2.13:
```
$  ./dev/change-scala-version.sh 2.13
$ ./build/sbt -Pscala-2.13 compile
```

Closes #30538 from MaxGekk/partition-spec-value-null.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 157b72a)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableDropPartitionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableDropPartitionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowPartitionsSuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala (diff)
Commit 023eba2ad72f5119350c6c797808dadcfd1eaa19 by srowen
[SPARK-33796][DOCS][FOLLOWUP] Tweak the width of left-menu of Spark SQL Guide

### What changes were proposed in this pull request?

This PR tweaks the width of left-menu of Spark SQL Guide.
When I view the Spark SQL Guide with browsers on macOS, the title `Spark SQL Guide` renders nicely.
But I often use Pop!_OS, an Ubuntu variant, and the title is overlapped in browsers there.
![spark-sql-guide-layout-before](https://user-images.githubusercontent.com/4736016/104002743-d56cc200-51e4-11eb-9e3a-28abcd46e0bf.png)

After this change, the title is no longer overlapped.
![spark-sql-guide-layout-after](https://user-images.githubusercontent.com/4736016/104002847-f9c89e80-51e4-11eb-85c0-01d69cee46b7.png)

### Why are the changes needed?

For the pretty layout.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the document with `cd docs && SKIP_API=1 jekyll build` and confirmed the layout.

Closes #31091 from sarutak/modify-layout-sparksql-guide.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 023eba2)
The file was modified docs/css/main.css (diff)
Commit 0781ed4f5b7f692656651f9bb51f823c82e24e2d by srowen
[MINOR][SQL][TESTS] Fix the incorrect unicode escape test in ParserUtilsSuite

### What changes were proposed in this pull request?

This PR fixes an incorrect unicode literal test in `ParserUtilsSuite`.
In that suite, string literals in queries contain unicode escape characters like `\u7328`, but the backslash should be escaped because the query strings are given as Java strings.
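
An illustration of the escaping issue: in Scala or Java source, `"\u7328"` is decoded by the compiler into a single character, so the SQL parser never sees a unicode escape; escaping the backslash keeps the literal `\u7328` text for the parser itself to decode.

```scala
val decodedByCompiler = "\u7328"  // 1 character: the compiler already decoded the escape
val seenByParser      = "\\u7328" // 6 characters: '\\', 'u', '7', '3', '2', '8'
assert(decodedByCompiler.length == 1)
assert(seenByParser.length == 6)
```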

### Why are the changes needed?

Correct the test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Run `ParserUtilsSuite` and it passed.

Closes #31088 from sarutak/fix-incorrect-unicode-test.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 0781ed4)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ParserUtilsSuite.scala (diff)
Commit d00f0695b7513046e42e47f35b280d7aa494de5b by mridulatgmail.com
[SPARK-32917][SHUFFLE][CORE] Adds support for executors to push shuffle blocks after successful map task completion

### What changes were proposed in this pull request?
This is the shuffle writer side change where executors can push data to remote shuffle services. This is needed for push-based shuffle - SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
Summary of changes:
- This adds support for executors to push shuffle blocks after map tasks complete writing shuffle data.
- This also introduces a timeout specifically for creating connection to remote shuffle services.

### Why are the changes needed?
- These changes are needed for push-based shuffle. Refer to the SPIP in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
- The main reason to create a separate connection creation timeout is that the existing `connectionTimeoutMs` is overloaded: it is used both for connection creation timeouts and for the connection idle timeout. The connection creation timeout should be much lower than the idle timeout. The default for `connectionTimeoutMs` is 120s, which is quite high for just establishing connections. If a shuffle server node is bad, connection creation will fail within a few seconds; however, an overloaded shuffle server may take much longer to respond to a request, and the channel can stay idle for a much longer time, which is expected. Another reason is that with push-based shuffle, an executor may be fetching shuffle data and pushing shuffle data (for the next stage) simultaneously. Both these tasks will share the same connections with the shuffle service. If there is a bad shuffle server node and the connection creation timeout is very high, then both these tasks end up waiting a long time, eventually impacting performance.

### Does this PR introduce _any_ user-facing change?
Yes. This PR introduces client-side configs for push-based shuffle. If push-based shuffle is turned-off then the users will not see any change.

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Ye Zhou <yezhou@linkedin.com>

Closes #30312 from otterc/SPARK-32917.

Lead-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: d00f069)
The file was modified core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java (diff)
The file was modified core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java (diff)
The file was added core/src/test/scala/org/apache/spark/shuffle/ShuffleBlockPusherSuite.scala
The file was modified common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java (diff)
The file was modified core/src/main/scala/org/apache/spark/shuffle/ShuffleWriteProcessor.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/executor/Executor.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala (diff)
The file was added core/src/main/scala/org/apache/spark/shuffle/ShuffleBlockPusher.scala
The file was modified core/src/main/scala/org/apache/spark/shuffle/ShuffleWriter.scala (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockId.scala (diff)
Commit 6b34745cb9b294c91cd126c2ea44c039ee83cb84 by dhyun
[SPARK-34049][SS] DataSource V2: Use Write abstraction in StreamExecution

### What changes were proposed in this pull request?

This PR makes `StreamExecution` use the `Write` abstraction introduced in SPARK-33779.

Note: we will need separate plans for streaming writes in order to support the required distribution and ordering in SS. This change only migrates to the `Write` abstraction.

### Why are the changes needed?

These changes prevent exceptions from data sources that implement only the `build` method in `WriteBuilder`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31093 from aokolnychyi/spark-34049.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 6b34745)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala (diff)
Commit 0af387480cbb29a8582cfdba33c38fa64c2b3f04 by dhyun
[SPARK-34048][SQL][TESTS] Check the amount of calls to Hive external catalog

### What changes were proposed in this pull request?
Add new tests to the unified test suites to check the total number of calls made via the Hive client.

### Why are the changes needed?
1. To improve test coverage
2. To lay a foundation for future optimizations

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites like:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31092 from MaxGekk/access-to-catalog-refreshTable.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 0af3874)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/AlterTableRenamePartitionSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/ShowTablesSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/AlterTableAddPartitionSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/ShowPartitionsSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/ShowNamespacesSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/CommandSuiteBase.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/DropTableSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/AlterTableDropPartitionSuite.scala (diff)
Commit 48cd11c4839b2563cc7f7dcdabc6a2727052e4c9 by gurwls223
[SPARK-34030][SQL] Fold RepartitionByExpression num partition should at Optimizer

### What changes were proposed in this pull request?

Move the `RepartitionByExpression` partition-number folding code to a new rule in the `Optimizer`, as sketched below.
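
A minimal sketch of the folding idea with simplified stand-in classes (not Catalyst's actual rule): when every partition expression is foldable (constant), all rows end up in a single partition anyway, so the partition number can be folded to 1 at optimization time.

```scala
final case class Expr(foldable: Boolean)
final case class RepartitionByExpr(partitionExprs: Seq[Expr], numPartitions: Option[Int])

def foldNumPartitions(plan: RepartitionByExpr): RepartitionByExpr = plan match {
  case r @ RepartitionByExpr(exprs, None) if exprs.nonEmpty && exprs.forall(_.foldable) =>
    r.copy(numPartitions = Some(1)) // constant partition keys => a single partition suffices
  case other => other
}
```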

### Why are the changes needed?

We met some problems when backporting SPARK-33806, because `UnresolvedFunction.foldable` throws an exception. It is fine on the master branch, but it's better to do this in the Optimizer, for a few reasons:

1. It's not always safe to call Expression.foldable before analysis.
2. Folding the partition number to 1 is more of an optimization behavior.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31077 from ulysses-you/SPARK-34030.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 48cd11c)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
Commit 105ba6e5f0cc792bd561f016e97decb12c6a0099 by gurwls223
Revert "[SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE"

This reverts commit d36cdd55419c104134f88930206bedccdbe4f3c0.
(commit: 105ba6e)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
Commit e0e06c18fd479ac86963ec26ab82b0dc5536bc83 by gurwls223
[SPARK-34055][SQL] Refresh cache in `ALTER TABLE .. ADD PARTITION`

### What changes were proposed in this pull request?
Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. ADD PARTITION`.

### Why are the changes needed?
This fixes the issue demonstrated by the following example:
```sql
spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0 0
spark-sql> show table extended like 'tbl' partition(part=0);
default tbl false Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
...
```
Create new partition by copying the existing one:
```
$ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1
```
```sql
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0 0
```

The last query should also return the row `0 1`, since that partition was added by `ALTER TABLE .. ADD PARTITION`.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0 0
0 1
```

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```

Closes #31101 from MaxGekk/add-partition-refresh-cache-2.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: e0e06c1)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableAddPartitionSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala (diff)
Commit 48b9611ba3bc3380a6a44fe355a2c785507f400d by yumwang
[SPARK-32668][SQL] HiveGenericUDTF initialize UDTF should use StructObjectInspector method

### What changes were proposed in this pull request?

Use `initialize(StructObjectInspector argOIs)` instead `initialize(ObjectInspector[] args)` in `HiveGenericUDTF`.

### Why are the changes needed?

In our case, we implement a Hive `GenericUDTF` and override `initialize(StructObjectInspector argOIs)`. It executes fine with Hive, but fails with Spark SQL. Here is the Spark SQL error message:
```
No handler for UDF/UDAF/UDTF 'com.xxxx.xxxUDTF': java.lang.IllegalStateException:
Should not be called directly Please make sure your function overrides
`public StructObjectInspector initialize(ObjectInspector[] args)`.
```

The reason is that Spark's `HiveGenericUDTF` calls `initialize(ObjectInspector[] argOIs)` to initialize a UDTF, but that is a deprecated method.
```
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();
        ObjectInspector[] udtfInputOIs = new ObjectInspector[inputFields.size()];

        for(int i = 0; i < inputFields.size(); ++i) {
            udtfInputOIs[i] = ((StructField)inputFields.get(i)).getFieldObjectInspector();
        }

        return this.initialize(udtfInputOIs);
    }

    @Deprecated
    public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
        throw new IllegalStateException("Should not be called directly");
    }
```

We should use `initialize(StructObjectInspector argOIs)` instead, so that we are compatible with both methods, the same as Hive.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the UDTF initialize method.

### How was this patch tested?

Manual test, and `HiveUDFDynamicLoadSuite` passed.

Closes #29490 from ulysses-you/SPARK-32668.

Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(commit: 48b9611)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala (diff)
Commit 9a8d2752262543c7f5c6b787a73748e9c70ab6c9 by gurwls223
[SPARK-34055][SQL][TESTS][FOLLOWUP] Increase the expected number of calls to Hive external catalog in partition adding

### What changes were proposed in this pull request?
Increase the number of calls to Hive external catalog in the test for `ALTER TABLE .. ADD PARTITION`.

### Why are the changes needed?
There is a logical conflict between https://github.com/apache/spark/pull/31101 and https://github.com/apache/spark/pull/31092: the first one fixes a caching issue and increases the number of calls to the Hive external catalog.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31111 from MaxGekk/add-partition-refresh-cache-2-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 9a8d275)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/AlterTableAddPartitionSuite.scala (diff)
Commit 830249284df4f5574aba7762cd981244d7b2dfaa by viirya
[SPARK-34059][SQL][CORE] Use for/foreach rather than map to make sure execute it eagerly

### What changes were proposed in this pull request?

This PR is basically a followup of https://github.com/apache/spark/pull/14332.
Calling `map` alone might leave the side effects unexecuted due to lazy evaluation, e.g.:

```
scala> val foo = Seq(1,2,3)
foo: Seq[Int] = List(1, 2, 3)

scala> foo.map(println)
1
2
3
res0: Seq[Unit] = List((), (), ())

scala> foo.view.map(println)
res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...)

scala> foo.view.foreach(println)
1
2
3
```

We should use `foreach` instead to make sure it is executed where the output is unused or `Unit`.

### Why are the changes needed?

To prevent potential issues caused by `map` not being executed.

### Does this PR introduce _any_ user-facing change?

No, the current code does not appear to cause any problems for now.

### How was this patch tested?

I found these items by running IntelliJ inspections, double-checked them one by one, and fixed them. These should ideally be all instances across the codebase.

Closes #31110 from HyukjinKwon/SPARK-34059.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 8302492)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/compression/PassThroughEncodingSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/SetCatalogAndNamespaceExec.scala (diff)
The file was modified resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala (diff)
The file was modified resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala (diff)
Commit 3e5e08640e5cfb0d4b1bd98277e17c8c09d877b1 by dhyun
[SPARK-34053][INFRA] Cancel the previous build

Similar to: https://github.com/apache/spark/pull/31098 https://github.com/apache/calcite/pull/2318 (solution suggested by vlsi - https://github.com/apache/pulsar/issues/9154#issuecomment-756984731)

I used the action, which was maintained by potiuk instead of the original author, for two reasons:
- the original action was abandoned and is not supported (Proof: https://github.com/n1hility/cancel-previous-runs/issues/7)
- this action works with forks. The original action only worked when the contribution was run in the same repository and the action had a token with full access.

> If you use forks, you should create a separate "Cancelling" workflow_run triggered workflow. The workflow_run should be responsible for all canceling actions. The examples below show the possible ways the action can be utilized.

### What changes were proposed in this pull request?
This PR aims to reduce the GitHub Action usage by cancelling the previous build.

### Why are the changes needed?
In most cases, only the last commit is meaningful.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Due to the nature of the change, testing of this change is difficult.

> Note: This event will only trigger a workflow run if the workflow file is on the default branch.

https://docs.github.com/en/free-pro-team@latest/actions/reference/events-that-trigger-workflows#workflow_run

However, you can see on my fork that this action is triggered.
https://github.com/mik-laj/spark/actions?query=workflow%3A%22Cancelling+Duplicates%22

I also asked the author of this action, potiuk (a PMC member of Apache Airflow), to review this change, and received a positive review.

Closes #31104 from mik-laj/patch-1.

Lead-authored-by: Kamil Breguła <kamil.bregula@polidea.com>
Co-authored-by: Kamil Breguła <mik-laj@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 3e5e086)
The file was added .github/workflows/cancel_duplicate_workflow_runs.yml