Changes

Summary

  1. [SPARK-36127][PYTHON] Support comparison between a Categorical and a scalar (commit: 8dd4335) (details)
  2. [SPARK-36167][PYTHON][FOLLOWUP] Fix test failures with older versions of pandas (commit: c459c70) (details)
  3. [SPARK-36176][PYTHON] Expose tableExists in pyspark.sql.catalog (commit: d7d961f) (details)
  4. [SPARK-36179][SQL] Support TimestampNTZType in SparkGetColumnsOperation (commit: 0c76fb9) (details)
  5. [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence (commit: d6b974f) (details)
  6. [SPARK-35546][SHUFFLE] Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in a better way (commit: c77acf0) (details)
  7. [SPARK-34051][DOCS][FOLLOWUP] Document about unicode literals (commit: ba1294e) (details)
  8. [SPARK-36183][SQL] Push down limit 1 through Aggregate if it is group only (commit: af978c8) (details)
  9. [SPARK-36221][SQL] Make sure CustomShuffleReaderExec has at least one partition (commit: b70c258) (details)
  10. [SPARK-31907][DOCS][SQL] Adding location of SQL API documentation (commit: e9b18b0) (details)
  11. [SPARK-36201][SQL][FOLLOWUP] Schema check should check inner field too (commit: 2518857) (details)
  12. [SPARK-36207][PYTHON] Expose databaseExists in pyspark.sql.catalog (commit: 463fcb3) (details)
  13. [SPARK-36204][INFRA][BUILD] Deduplicate Scala 2.13 daily build (commit: 801b369) (details)
  14. [SPARK-36046][SQL][FOLLOWUP] Implement prettyName for MakeTimestampNTZ and MakeTimestampLTZ (commit: 033a573) (details)
  15. [SPARK-36079][SQL] Null-based filter estimate should always be in the range [0, 1] (commit: ddc61e6) (details)
  16. [SPARK-36210][SQL] Preserve column insertion order in Dataset.withColumns (commit: bf680bf) (details)
  17. [SPARK-36222][SQL] Step by days in the Sequence expression for dates (commit: c0d84e6) (details)
  18. [SPARK-36186][PYTHON] Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex (commit: 376fadc) (details)
  19. [SPARK-36172][SS] Document session window into Structured Streaming guide doc (commit: 0eb31a0) (details)
  20. [SPARK-35027][CORE] Close the inputStream in FileAppender when writin… (commit: 1a8c675) (details)
  21. [SPARK-35658][DOCS] Document Parquet encryption feature in Spark SQL (commit: 7ceefca) (details)
  22. [SPARK-36153][SQL][DOCS] Update transform doc to match the current code (commit: 305d563) (details)
  23. [SPARK-36030][SQL] Support DS v2 metrics at writing path (commit: 2653201) (details)
  24. [SPARK-36030][SQL][FOLLOW-UP] Avoid procedure syntax deprecated in Scala 2.13 (commit: 99006e5) (details)
  25. [SPARK-36030][SQL][FOLLOW-UP] Remove duplicated test suite (commit: df798ed) (details)
  26. [SPARK-36132][SS][SQL] Support initial state for batch mode of flatMapGroupsWithState (commit: efcce23) (details)
  27. [SPARK-36020][SQL][FOLLOWUP] RemoveRedundantProjects should retain the LOGICAL_PLAN_TAG tag (commit: 94aece4) (details)
  28. [SPARK-36208][SQL] SparkScriptTransformation should support ANSI interval types (commit: f56c7b7) (details)
  29. [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed (commit: 9c8a3d3) (details)
  30. [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables (commit: 685c3fd) (details)
  31. [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec (commit: 4cd6cfc) (details)
  32. [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex (commit: d506815) (details)
  33. [SPARK-32797][SPARK-32391][SPARK-33242][SPARK-32666][ANSIBLE] updating a bunch of python packages (commit: ad528a0) (details)
  34. [SPARK-35912][SQL] Fix nullability of `spark.read.json/spark.read.csv` (commit: 09bebc8) (details)
  35. [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation (commit: dcb7db5) (details)
  36. [SPARK-36063][SQL] Optimize OneRowRelation subqueries (commit: de8e4be) (details)
Commit 8dd43351d5d9b21b11a805b12749b099263a71d2 by ueshin
[SPARK-36127][PYTHON] Support comparison between a Categorical and a scalar

### What changes were proposed in this pull request?
Support comparison between a Categorical and a scalar.
There are 3 main changes:
- Change `==` and `!=` to compare the **actual values** of the Categorical with the scalar, rather than its **codes**.
- Support `<`, `<=`, `>`, `>=` between a Categorical and a scalar.
- Fix the TypeError message.

### Why are the changes needed?
pandas supports comparison between a Categorical and a scalar, so we should follow pandas' behavior.

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```py
>>> import pyspark.pandas as ps
>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> pser = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
>>> psser = ps.from_pandas(pser)
>>> psser == 2
0     True
1    False
2    False
dtype: bool
>>> psser <= 1
Traceback (most recent call last):
...
NotImplementedError: <= can not be applied to categoricals.
```

After:
```py
>>> import pyspark.pandas as ps
>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> pser = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
>>> psser = ps.from_pandas(pser)
>>> psser == 2
0    False
1     True
2    False
dtype: bool
>>> psser <= 1
0    True
1    True
2    True
dtype: bool

```

### How was this patch tested?
Unit tests.

Closes #33373 from xinrong-databricks/categorical_eq.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 8dd4335)
The file was modified python/pyspark/pandas/data_type_ops/num_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_num_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_udt_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/categorical_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/base.py (diff)
Commit c459c707c5850f89322dcc242e4181a2f3280426 by gurwls223
[SPARK-36167][PYTHON][FOLLOWUP] Fix test failures with older versions of pandas

### What changes were proposed in this pull request?

Fix test failures with `pandas < 1.2`.

### Why are the changes needed?

There are some test failures with `pandas < 1.2`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Fixed tests.

Closes #33398 from ueshin/issues/SPARK-36167/test.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: c459c70)
The file was modified python/pyspark/pandas/tests/indexes/test_category.py (diff)
Commit d7d961fabe59382ee8fb31b8d88e586ec4ff378f by gurwls223
[SPARK-36176][PYTHON] Expose tableExists in pyspark.sql.catalog

### What changes were proposed in this pull request?
Expose `tableExists` in `pyspark.sql.catalog`.

### Why are the changes needed?
Avoids PySpark users having to go through `listTables`.

### Does this PR introduce _any_ user-facing change?
Yes, an additional `tableExists` method is available in PySpark.
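
A minimal usage sketch of the new method (the table name is illustrative):

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(1).write.saveAsTable("my_table")  # illustrative table

spark.catalog.tableExists("my_table")     # True
spark.catalog.tableExists("no_such_one")  # False
```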

### How was this patch tested?
test added

Closes #33388 from dominikgehl/feature/SPARK-36176.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: d7d961f)
The file was modified python/pyspark/sql/tests/test_catalog.py (diff)
The file was modified python/docs/source/reference/pyspark.sql.rst (diff)
The file was modified python/pyspark/sql/catalog.pyi (diff)
The file was modified python/pyspark/sql/catalog.py (diff)
Commit 0c76fb9c0101ff22f8ead4b94e33bd98f570f353 by gurwls223
[SPARK-36179][SQL] Support TimestampNTZType in SparkGetColumnsOperation

### What changes were proposed in this pull request?

Support TimestampNTZType in SparkGetColumnsOperation

### Why are the changes needed?

TimestampNTZType coverage

### Does this PR introduce _any_ user-facing change?

Yes, JDBC end-users will be aware of TimestampNTZType.

### How was this patch tested?

add new test

Closes #33393 from yaooqinn/SPARK-36179.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 0c76fb9)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/SparkMetadataOperationSuite.scala (diff)
Commit d6b974f8ceb5383e5f01cf87a267b7580f992ac1 by gurwls223
[SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence

### What changes were proposed in this pull request?

Test is flaky (https://github.com/apache/spark/runs/3109815586):

```
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 391, in test_parameter_convergence
    eventually(condition, catch_assertions=True)
  File "/__w/spark/spark/python/pyspark/testing/utils.py", line 91, in eventually
    raise lastValue
  File "/__w/spark/spark/python/pyspark/testing/utils.py", line 82, in eventually
    lastValue = condition()
  File "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 387, in condition
    self.assertEqual(len(model_weights), len(batches))
AssertionError: 9 != 10
```

We should probably increase the timeout.

### Why are the changes needed?

To avoid flakiness in the test.

### Does this PR introduce _any_ user-facing change?

Nope, dev-only.

### How was this patch tested?

CI should test it out.

Closes #33427 from HyukjinKwon/SPARK-36216.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: d6b974f)
The file was modified python/pyspark/mllib/tests/test_streaming_algorithms.py (diff)
Commit c77acf0bbc25341de2636649fdd76f9bb4bdf4ed by mridulatgmail.com
[SPARK-35546][SHUFFLE] Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in a better way

### What changes were proposed in this pull request?
This is one of the patches for SPIP SPARK-30602 which is needed for push-based shuffle.

### Summary of the change:
When an executor registers with the shuffle service, it encodes the merged shuffle dirs it created, together with the application attemptId, into the `ShuffleManagerMeta` as JSON. The shuffle service then decodes the JSON string to get the correct merged shuffle dirs and the attemptId. If the registration comes from a newer attempt, the merged shuffle information is updated to store the information from that newer attempt.
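
A purely illustrative sketch of that round trip (the field names here are hypothetical; the real JSON layout is defined by `ShuffleManagerMeta`):

```py
import json

# State kept by the shuffle service (hypothetical variable names).
current_attempt_id = 1
merged_shuffle_dir = "merge_manager_1"

# Executor side: encode the merged dir and the attempt id on registration.
meta = json.dumps({"mergeDir": "merge_manager_2", "attemptId": 2})

# Shuffle-service side: decode, and only keep info from a newer attempt.
decoded = json.loads(meta)
if decoded["attemptId"] > current_attempt_id:
    merged_shuffle_dir = decoded["mergeDir"]
    current_attempt_id = decoded["attemptId"]
```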

This PR also refactors the management of the merged shuffle information to avoid concurrency issues.

### Why are the changes needed?
Refer to the SPIP in SPARK-30602.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in SPARK-30602.
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Closes #33078 from zhouyejoe/SPARK-35546.

Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: c77acf0)
The file was modified common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java (diff)
The file was modified core/src/test/scala/org/apache/spark/storage/DiskBlockManagerSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/util/UtilsSuite.scala (diff)
The file was modified common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/OneForOneBlockPusherSuite.java (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/FinalizeShuffleMerge.java (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/PushBlockStream.java (diff)
The file was modified common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalBlockHandlerSuite.java (diff)
The file was modified common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/RemoteBlockPushResolverSuite.java (diff)
The file was modified core/src/main/scala/org/apache/spark/util/Utils.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/SparkContext.scala (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockPusher.java (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockStoreClient.java (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/ExecutorShuffleInfo.java (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManager.scala (diff)
Commit ba1294ea5a25e908c19e3d4e73c6b7a420bb006e by wenchen
[SPARK-34051][DOCS][FOLLOWUP] Document about unicode literals

### What changes were proposed in this pull request?

This PR documents the unicode literals added in SPARK-34051 (#31096) and a past PR in `sql-ref-literals.md`.

### Why are the changes needed?

To make users aware of these literals.
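
For example (a hedged sketch; `\uXXXX` escapes in Spark SQL string literals resolve to the corresponding characters):

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# '\u0041' and '\u0042' are the unicode escapes for 'A' and 'B'.
spark.sql(r"SELECT '\u0041\u0042' AS s").collect()
# [Row(s='AB')]
```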

### Does this PR introduce _any_ user-facing change?

Yes, but just add a sentence.

### How was this patch tested?

Built the document and confirmed the result.
```
SKIP_API=1 bundle exec jekyll build
```
![unicode-literals](https://user-images.githubusercontent.com/4736016/126283923-944dc162-1817-47bc-a7e8-c3145225586b.png)

Closes #33434 from sarutak/unicode-literal-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: ba1294e)
The file was modified docs/sql-ref-literals.md (diff)
Commit af978c87f10c89ee3a7c927ab9b039a2b84a492a by yumwang
[SPARK-36183][SQL] Push down limit 1 through Aggregate if it is group only

### What changes were proposed in this pull request?

Push down `LIMIT 1` through `Aggregate` and turn the `Aggregate` into a `Project` if the aggregate is group-only. For example:
```sql
create table t1 using parquet as select id from range(100000000L);
create table t2 using parquet as select id from range(100000000L);
create view v1 as select * from t1 union select * from t2;
select * from v1 limit 1;
```

Before this PR | After this PR
-- | --
![image](https://user-images.githubusercontent.com/5399861/125975690-55663515-c4c5-4a04-aedf-f8ba37581ba7.png) | ![image](https://user-images.githubusercontent.com/5399861/126168972-b2675e09-4f93-4026-b1be-af317205e57f.png)

### Why are the changes needed?

Improve query performance. This is a real case from the cluster:
![image](https://user-images.githubusercontent.com/5399861/125976597-18cb68d6-b22a-4d80-b270-01b2b13d1ef5.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33397 from wangyum/SPARK-36183.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(commit: af978c8)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LimitPushdownSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
Commit b70c25881c8cbac299991a62457ad4373a11cfe4 by wenchen
[SPARK-36221][SQL] Make sure CustomShuffleReaderExec has at least one partition

### What changes were proposed in this pull request?

* Add non-empty partition check in `CustomShuffleReaderExec`
* Make sure `OptimizeLocalShuffleReader` doesn't return empty partition

### Why are the changes needed?

Since SPARK-32083, AQE coalescing always returns at least one partition, so it is more robust to also add a non-empty check in `CustomShuffleReaderExec`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not needed.

Closes #33431 from ulysses-you/non-empty-partition.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: b70c258)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CustomShuffleReaderExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeLocalShuffleReader.scala (diff)
Commit e9b18b07992e7871fdd2ea75fd5fd9a1c04f6e07 by srowen
[SPARK-31907][DOCS][SQL] Adding location of SQL API documentation

### What changes were proposed in this pull request?
Linking to the location of the SQL API documentation, making it easier and quicker to find.

### Why are the changes needed?
documentation clarity

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only documentation change

Closes #33435 from dominikgehl/feature/SPARK-31907.

Lead-authored-by: Dominik Gehl <dog@open.ch>
Co-authored-by: Dominik Gehl <gehl@fastmail.fm>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: e9b18b0)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
Commit 251885772d41a572655e950a8e298315f222a803 by wenchen
[SPARK-36201][SQL][FOLLOWUP] Schema check should check inner field too

### What changes were proposed in this pull request?
When an inner field has an invalid name in its schema, the check should validate that inner field name too.
![image](https://user-images.githubusercontent.com/46485123/126101009-c192d87f-1e18-4355-ad53-1419dacdeb76.png)

### Why are the changes needed?
Check early, fail early.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33409 from AngersZhuuuu/SPARK-36201.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 2518857)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala (diff)
Commit 463fcb3723d4c5cffd4b787e2d7254ceaf2bca98 by gurwls223
[SPARK-36207][PYTHON] Expose databaseExists in pyspark.sql.catalog

### What changes were proposed in this pull request?
Expose databaseExists in pyspark.sql.catalog

### Why are the changes needed?
It was available in Scala, but not in PySpark.

### Does this PR introduce _any_ user-facing change?
Yes, a new `databaseExists` method.
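
A minimal usage sketch of the new method:

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.catalog.databaseExists("default")     # True
spark.catalog.databaseExists("no_such_db")  # False
```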

### How was this patch tested?
Unit tests in codebase

Closes #33416 from dominikgehl/feature/SPARK-36207.

Lead-authored-by: Dominik Gehl <dog@open.ch>
Co-authored-by: Dominik Gehl <gehl@fastmail.fm>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 463fcb3)
The file was modified python/pyspark/sql/tests/test_catalog.py (diff)
The file was modified python/pyspark/sql/catalog.pyi (diff)
The file was modified python/docs/source/reference/pyspark.sql.rst (diff)
The file was modified python/pyspark/sql/catalog.py (diff)
Commit 801b369bd03cd4a09637afbc899dadbbed609887 by gurwls223
[SPARK-36204][INFRA][BUILD] Deduplicate Scala 2.13 daily build

### What changes were proposed in this pull request?

A Scala 2.13 daily job was added, but ideally we should deduplicate it. This PR deduplicates it by creating one more job (`configure-jobs`) that the main job depends on.

`configure-jobs` will properly set the branch, envs, etc. to run the main build properly.

### Why are the changes needed?

To make the maintenance easier

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

See
- https://github.com/HyukjinKwon/spark/actions/runs/1044636792 for a PR
- https://github.com/HyukjinKwon/spark/actions/runs/1048542984 for a cron job

Closes #33410 from HyukjinKwon/SPARK-36204.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 801b369)
The file was removed .github/workflows/build_and_test_scala213_daily.yml
The file was modified dev/run-tests.py (diff)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit 033a5731b44723fd7434c5ee0a021d3787a621ef by gengliang
[SPARK-36046][SQL][FOLLOWUP] Implement prettyName for MakeTimestampNTZ and MakeTimestampLTZ

### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/33299 and implements `prettyName` for `MakeTimestampNTZ` and `MakeTimestampLTZ` based on the discussion shown below:
https://github.com/apache/spark/pull/33299/files#r668423810

### Why are the changes needed?
This PR fixes an incorrect alias use case.

### Does this PR introduce _any_ user-facing change?
'No'.
Modifications are transparent to users.

### How was this patch tested?
Jenkins test.

Closes #33430 from beliefer/SPARK-36046-followup.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 033a573)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
Commit ddc61e62b9af5deff1b93e22f466f2a13f281155 by wenchen
[SPARK-36079][SQL] Null-based filter estimate should always be in the range [0, 1]

### What changes were proposed in this pull request?

Forces the selectivity estimate for null-based filters to be in the range `[0,1]`.

### Why are the changes needed?

I noticed in a few TPC-DS query tests that the column statistic null count can be higher than the table statistic row count. In the current implementation, the selectivity estimate for `IsNotNull` then becomes negative.
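
A minimal sketch of the intended clamping (illustrative only, not the actual `FilterEstimation` code; assumes `row_count > 0`):

```py
def is_not_null_selectivity(null_count: int, row_count: int) -> float:
    # Clamp into [0, 1] so corrupt stats (null_count > row_count)
    # can no longer produce a negative selectivity estimate.
    return min(1.0, max(0.0, 1.0 - null_count / row_count))
```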

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #33286 from karenfeng/bound-selectivity-est.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: ddc61e6)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-modified/q19.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-modified/q19.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-modified/q68.sf100/simplified.txt (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-modified/q68.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-modified/q73.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-modified/q73.sf100/explain.txt (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala (diff)
Commit bf680bf25aae9619d462caee05c41cc33909338a by viirya
[SPARK-36210][SQL] Preserve column insertion order in Dataset.withColumns

### What changes were proposed in this pull request?
Preserve the insertion order of columns in Dataset.withColumns

### Why are the changes needed?
It is the expected behavior. We preserve insertion order in all other places.

### Does this PR introduce _any_ user-facing change?
No. Currently Dataset.withColumns is not actually used anywhere to insert more than one column. This change makes sure it behaves as expected when it is used for that purpose in the future.

### How was this patch tested?
Added test in DatasetSuite

Closes #33423 from koertkuipers/feat-withcolumns-preserve-order.

Authored-by: Koert Kuipers <koert@tresata.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: bf680bf)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala (diff)
Commit c0d84e6cf1046b7944796038414ef21fe9c7e3b5 by max.gekk
[SPARK-36222][SQL] Step by days in the Sequence expression for dates

### What changes were proposed in this pull request?
The current implementation of the `Sequence` expression does not support stepping by days for dates.
```
spark-sql> select sequence(date'2021-07-01', date'2021-07-10', interval '3' day);
Error in query: cannot resolve 'sequence(DATE '2021-07-01', DATE '2021-07-10', INTERVAL '3' DAY)' due to data type mismatch:
sequence uses the wrong parameter type. The parameter type must conform to:
1. The start and stop expressions must resolve to the same type.
2. If start and stop expressions resolve to the 'date' or 'timestamp' type
then the step expression must resolve to the 'interval' or
'interval year to month' or 'interval day to second' type,
otherwise to the same type as the start and stop expressions.
         ; line 1 pos 7;
'Project [unresolvedalias(sequence(2021-07-01, 2021-07-10, Some(INTERVAL '3' DAY), Some(Europe/Moscow)), None)]
+- OneRowRelation
```

### Why are the changes needed?
A `DayTimeInterval` with day granularity should be usable as a step for dates.

### Does this PR introduce _any_ user-facing change?
'Yes'.
The `Sequence` expression now supports stepping by a `DayTimeInterval` with day granularity for dates.
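
A sketch of the expected behavior after this change (the failing query above should now work):

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql(
    "SELECT sequence(DATE'2021-07-01', DATE'2021-07-10', INTERVAL '3' DAY) AS s"
).show(truncate=False)
# Expected: [2021-07-01, 2021-07-04, 2021-07-07, 2021-07-10]
```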

### How was this patch tested?
New tests.

Closes #33439 from beliefer/SPARK-36222.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: c0d84e6)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
Commit 376fadc89cffac97aebe49a7cf4a4bc978b1d09e by ueshin
[SPARK-36186][PYTHON] Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex

### What changes were proposed in this pull request?

Add `as_ordered`/`as_unordered` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `as_ordered`/`as_unordered` in `CategoricalAccessor` and `CategoricalIndex` as well.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `as_ordered`/`as_unordered`.
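
A minimal usage sketch:

```py
import pandas as pd
import pyspark.pandas as ps

psser = ps.from_pandas(pd.Series(pd.Categorical([1, 2, 3])))
psser.cat.ordered                    # False
psser.cat.as_ordered().cat.ordered   # True
```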

### How was this patch tested?

Added some tests.

Closes #33400 from ueshin/issues/SPARK-36186/as_ordered_unordered.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 376fadc)
The file was modified python/docs/source/reference/pyspark.pandas/indexing.rst (diff)
The file was modified python/pyspark/pandas/missing/indexes.py (diff)
The file was modified python/pyspark/pandas/categorical.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_category.py (diff)
The file was modified python/docs/source/reference/pyspark.pandas/series.rst (diff)
The file was modified python/pyspark/pandas/tests/test_categorical.py (diff)
The file was modified python/pyspark/pandas/indexes/category.py (diff)
Commit 0eb31a06d6f2b7583b6a9c646baeff58094f8d6c by kabhwan.opensource
[SPARK-36172][SS] Document session window into Structured Streaming guide doc

### What changes were proposed in this pull request?

This PR documents a new feature "native support of session window" into Structured Streaming guide doc.

Screenshots are following:

![Screenshot 2021-07-20 5:04:20 PM](https://user-images.githubusercontent.com/1317309/126284848-526ec056-1028-4a70-a1f4-ae275d4b5437.png)

![Screenshot 2021-07-20 3:34:38 PM](https://user-images.githubusercontent.com/1317309/126276763-763cf841-aef7-412a-aa03-d93273f0c850.png)

### Why are the changes needed?

This change is needed to explain a new feature to the end users.
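
A hedged sketch of the feature being documented (column names are illustrative; `session_window` groups events whose gaps stay under the given timeout):

```py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.createDataFrame(
    [("u1", "2021-07-20 10:00:00"), ("u1", "2021-07-20 10:02:00")],
    ["userId", "ts"],
).withColumn("ts", F.col("ts").cast("timestamp"))

# Both events fall into one session because their gap is under 5 minutes.
events.groupBy(F.session_window("ts", "5 minutes"), "userId").count().show(truncate=False)
```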

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Documentation changes.

Closes #33433 from HeartSaVioR/SPARK-36172.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(commit: 0eb31a0)
The file was modified docs/img/structured-streaming.pptx (diff)
The file was added docs/img/structured-streaming-time-window-types.jpg
The file was modified docs/structured-streaming-programming-guide.md (diff)
Commit 1a8c6755a1802afdb9a73793e9348d322176125a by srowen
[SPARK-35027][CORE] Close the inputStream in FileAppender when writin…

### What changes were proposed in this pull request?

1. add "closeStreams" to FileAppender and RollingFileAppender
2. set "closeStreams" to "true" in ExecutorRunner

### Why are the changes needed?

The executor can hang when the disk is full or another exception happens while writing to the outputStream; the root cause is that the inputStream is not closed after the error happens:
1. ExecutorRunner creates two file appenders for the pipe: one for stdout, one for stderr
2. FileAppender.appendStreamToFile exits its loop when writing to the outputStream fails
3. FileAppender closes the outputStream, but leaves open the inputStream, which refers to the pipe's stdout and stderr
4. The executor will hang when printing a log message if the pipe is full (no one consumes the output)
5. From the driver side, you can see the task can't ever be completed

With this fix, step 4 will throw an exception; the driver can catch the exception and reschedule the failed task on other executors.
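
An illustrative Python analogue of the mechanism (Spark's actual change is in the Scala `FileAppender`): if the reading side stops without closing the pipe, the child blocks once the pipe buffer fills; closing the stream makes the child fail fast with a broken pipe instead.

```py
import subprocess
import threading

proc = subprocess.Popen(["yes"], stdout=subprocess.PIPE)  # child that writes endlessly

def append_stream_to_file(stream, path, close_stream=True):
    try:
        with open(path, "ab") as out:
            for chunk in iter(lambda: stream.read(8192), b""):
                out.write(chunk)  # may raise, e.g. when the disk is full
    finally:
        if close_stream:
            stream.close()  # analogue of the new closeStreams=true behavior

threading.Thread(target=append_stream_to_file,
                 args=(proc.stdout, "/tmp/stdout.log"), daemon=True).start()
```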

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add new tests for the "closeStreams" in FileAppenderSuite

Closes #33263 from jhu-chang/SPARK-35027.

Authored-by: Jie <gt.hu.chang@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 1a8c675)
The file was modified core/src/main/scala/org/apache/spark/util/logging/RollingFileAppender.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/logging/FileAppender.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala (diff)
Commit 7ceefcace99915b83340dd4dc8ee22b7c3830491 by srowen
[SPARK-35658][DOCS] Document Parquet encryption feature in Spark SQL

### What changes were proposed in this pull request?

Spark 3.2.0 will use parquet-mr 1.12.0 (or higher), which contains the column encryption feature that can be called from Spark SQL. The aim of this PR is to document the use of Parquet encryption in Spark.

### Why are the changes needed?

- To provide information on how to use Parquet column encryption
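
A hedged PySpark sketch of the kind of usage the new docs describe (the key material and column names are illustrative, and the mock `InMemoryKMS` is for testing only):

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
hc = spark.sparkContext._jsc.hadoopConfiguration()  # _jsc is internal API
hc.set("parquet.crypto.factory.class",
       "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hc.set("parquet.encryption.kms.client.class",
       "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")  # mock KMS
hc.set("parquet.encryption.key.list",
       "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")

df = spark.range(10).selectExpr("id", "id * id AS square")
(df.write
   .option("parquet.encryption.column.keys", "keyA:square")  # encrypt 'square' with keyA
   .option("parquet.encryption.footer.key", "keyB")
   .parquet("/tmp/squares.parquet.encrypted"))
```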

### Does this PR introduce _any_ user-facing change?

Yes, documents a new feature.

### How was this patch tested?

bundle exec jekyll build

Closes #32895 from ggershinsky/parquet-encryption-doc.

Authored-by: Gidon Gershinsky <ggershinsky@apple.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 7ceefca)
The file was modified docs/sql-data-sources-parquet.md (diff)
Commit 305d563329bfb7a4ef582655b88a72d826a4e8aa by srowen
[SPARK-36153][SQL][DOCS] Update transform doc to match the current code

### What changes were proposed in this pull request?
Update the transform doc to match the latest code.
![image](https://user-images.githubusercontent.com/46485123/126175747-672cccbc-4e42-440f-8f1e-f00b6dc1be5f.png)

### Why are the changes needed?
Keep the documentation consistent with the code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #33362 from AngersZhuuuu/SPARK-36153.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 305d563)
The file was modified docs/sql-ref-syntax-qry-select-transform.md (diff)
Commit 2653201b0a50a651ebc0c4e1fabb47d32dee77c4 by viirya
[SPARK-36030][SQL] Support DS v2 metrics at writing path

### What changes were proposed in this pull request?

We added the interface for DS v2 metrics in SPARK-34366, but only for the reading path. This patch extends the metrics interface to the writing path.

### Why are the changes needed?

Completes DS v2 metrics interface support on the writing path.

### Does this PR introduce _any_ user-facing change?

No. For developers, yes, as this adds metrics support to the DS v2 writing path.

### How was this patch tested?

Added test.

Closes #33239 from viirya/v2-write-metrics.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 2653201)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/WriteToMicroBatchDataSource.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/WriteToContinuousDataSourceExec.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriterMetricSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/WriteToContinuousDataSource.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DataWriter.java (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/Write.java (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousWriteRDD.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/InMemoryTableMetricSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/metric/CustomMetricsSuite.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Scan.java (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/SimpleWritableDataSource.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala (diff)
Commit 99006e515ba736fcf23ddda7f015d0f8e8da4cf5 by gurwls223
[SPARK-36030][SQL][FOLLOW-UP] Avoid procedure syntax deprecated in Scala 2.13

### What changes were proposed in this pull request?

This PR avoids the procedure syntax deprecated in Scala 2.13.

https://github.com/apache/spark/runs/3120481756?check_suite_focus=true

```
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriterMetricSuite.scala:44:90: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `testMetricOnDSv2`'s return type
[error]   private def testMetricOnDSv2(func: String => Unit, checker: Map[Long, String] => Unit) {
[error]                                                                                          ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/InMemoryTableMetricSuite.scala:44:90: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `testMetricOnDSv2`'s return type
[error]   private def testMetricOnDSv2(func: String => Unit, checker: Map[Long, String] => Unit) {
[error]                                                                                          ^
[warn] 100 warnings found
[error] two errors found
[error] (sql / Test / compileIncremental) Compilation failed
[error] Total time: 579 s (09:39), completed Jul 21, 2021 4:14:26 AM
```

### Why are the changes needed?

To make the build compatible with Scala 2.13 in Spark.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested:

```bash
./dev/change-scala-version.sh 2.13
./build/mvn -DskipTests -Phive-2.3 -Phive clean package -Pscala-2.13
```

Closes #33452 from HyukjinKwon/SPARK-36030.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 99006e5)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriterMetricSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/InMemoryTableMetricSuite.scala (diff)
Commit df798ed301d954b72a72d7eebd0b731e8b4fdb24 by viirya
[SPARK-36030][SQL][FOLLOW-UP] Remove duplicated test suite

### What changes were proposed in this pull request?

Removes the duplicated `FileFormatDataWriterMetricSuite`.

### Why are the changes needed?

`FileFormatDataWriterMetricSuite` should have been renamed to `InMemoryTableMetricSuite`, but it was wrongly copied instead.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33453 from viirya/SPARK-36030-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: df798ed)
The file was removed sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriterMetricSuite.scala
Commit efcce23b913ce0de961ac261050e3d6dbf261f6e by tathagata.das1565
[SPARK-36132][SS][SQL] Support initial state for batch mode of flatMapGroupsWithState

### What changes were proposed in this pull request?
Adding support for accepting an initial state with flatMapGroupsWithState in batch mode.

### Why are the changes needed?
SPARK-35897 added support for accepting an initial state for streaming queries using flatMapGroupsWithState. The code flow is separate for batch and streaming, so batch mode required a separate PR.

### Does this PR introduce _any_ user-facing change?

Yes. As discussed above, flatMapGroupsWithState in batch mode can now accept an initial state; previously this would throw an UnsupportedOperationException.

### How was this patch tested?

Added relevant unit tests in FlatMapGroupsWithStateSuite and modified the tests in `JavaDatasetSuite`.

Closes #33336 from rahulsmahadev/flatMapGroupsWithStateBatch.

Authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
(commit: efcce23)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsWithStateSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala (diff)
The file was modified sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java (diff)
Commit 94aece4325af8c07160f124997d51b79a4abd242 by wenchen
[SPARK-36020][SQL][FOLLOWUP] RemoveRedundantProjects should retain the LOGICAL_PLAN_TAG tag

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/33222 .

https://github.com/apache/spark/pull/33222 made a mistake: `RemoveRedundantProjects` may lose the `LOGICAL_PLAN_TAG` tag even though the logical plan link is retained. This was actually caught by the test `LogicalPlanTagInSparkPlanSuite`, but was not taken care of.

There is no problem so far, but losing information can always lead to potential bugs.

### Why are the changes needed?

fix a mistake

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test

Closes #33442 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 94aece4)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/LogicalPlanTagInSparkPlanSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/RemoveRedundantProjects.scala (diff)
Commit f56c7b71ff27e6f5379f3699c2dcb5f79a0ae791 by max.gekk
[SPARK-36208][SQL] SparkScriptTransformation should support ANSI interval types

### What changes were proposed in this pull request?

This PR changes `BaseScriptTransformationExec` for `SparkScriptTransformationExec` to support ANSI interval types.

### Why are the changes needed?

`SparkScriptTransformationExec` supports `CalendarIntervalType`, so it's better to support ANSI interval types as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33419 from sarutak/script-transformation-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: f56c7b7)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala (diff)
Commit 9c8a3d3975fab1e21d9482ed327919f9904e25df by wenchen
[SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed

### What changes were proposed in this pull request?

Sometimes, AQE skew join optimization can fail with an NPE. This is because AQE tries to get the shuffle block sizes, but some map outputs are missing due to executor loss or similar failures.

This PR fixes this bug by skipping skew join handling if some map outputs are missing in the `MapOutputTracker`.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a new UT

Closes #33445 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 9c8a3d3)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala (diff)
Commit 685c3fd05bf8e9d85ea9b33d4e28807d436cd5ca by wenchen
[SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables

### What changes were proposed in this pull request?

For non-datasource Hive tables, e.g. tables written outside of Spark (through Hive or Trino), we have certain optimizations in Spark where we use Spark ORC and Parquet datasources to read these tables ([Ref](https://github.com/apache/spark/blob/fbf53dee37129a493a4e5d5a007625b35f44fbda/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L128)) rather than using the Hive serde.
If such a table contains a `path` property, Spark will try to list this path property in addition to the table location when creating an `InMemoryFileIndex`. ([Ref](https://github.com/apache/spark/blob/fbf53dee37129a493a4e5d5a007625b35f44fbda/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L575)) This can lead to wrong data if `path` property points to a directory location or an error if `path` is not a location. A concrete example is provided in [SPARK-28266 (comment)](https://issues.apache.org/jira/browse/SPARK-28266?focusedCommentId=17380170&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17380170).

Since these tables were not written through Spark, Spark should not interpret this `path` property as it can be set by an external system with a different meaning.

### Why are the changes needed?

For better compatibility with Hive tables generated by other platforms (non-Spark)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit test

Closes #33328 from shardulm94/spark-28266.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 685c3fd)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala (diff)
Commit 4cd6cfc773da726a90d41bfc590ea9188c17d5ae by yao
[SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec

### What changes were proposed in this pull request?

This fixes a case sensitivity issue for desc table commands with partition specified.

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

yes, but it's a bugfix

### How was this patch tested?

new tests

#### before
```
+-- !query
+DESC EXTENDED t PARTITION (C='Us', D=1)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+Partition spec is invalid. The spec (C, D) must match the partition spec (c, d) defined in table '`default`.`t`'
+
```

#### after

https://github.com/apache/spark/pull/33424/files#diff-554189c49950974a948f99fa9b7436f615052511660c6a0ae3062fa8ca0a327cR328

Closes #33424 from yaooqinn/SPARK-36213.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(commit: 4cd6cfc)
The file was modified sql/core/src/test/resources/sql-tests/inputs/describe.sql (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/describe.sql.out (diff)
Commit d506815a92510ff4cbd2c14cf17d41202f3ed609 by ueshin
[SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex

### What changes were proposed in this pull request?

Add categories setter to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement categories setter in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use the categories setter.
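
A minimal usage sketch (following pandas semantics of the time, where assignment renames the existing categories in place):

```py
import pandas as pd
import pyspark.pandas as ps

psser = ps.from_pandas(pd.Series(pd.Categorical(["a", "b", "a"])))
psser.cat.categories = ["x", "y"]  # rename categories 'a' -> 'x', 'b' -> 'y'
psser.cat.categories               # Index(['x', 'y'], dtype='object')
```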

### How was this patch tested?

Added some tests.

Closes #33448 from ueshin/issues/SPARK-36188/categories_setter.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: d506815)
The file was modified python/pyspark/pandas/categorical.py (diff)
The file was modified python/pyspark/pandas/tests/test_categorical.py (diff)
The file was modified python/pyspark/pandas/indexes/category.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_category.py (diff)
Commit ad528a007a57bdcb92559530d0ebc90e9fec47ca by incomplete
[SPARK-32797][SPARK-32391][SPARK-33242][SPARK-32666][ANSIBLE] updating a bunch of python packages

### What changes were proposed in this pull request?
updating the anaconda py36 environment file

### Why are the changes needed?
see:
https://issues.apache.org/jira/browse/SPARK-32666
https://issues.apache.org/jira/browse/SPARK-33242
https://issues.apache.org/jira/browse/SPARK-32391
https://issues.apache.org/jira/browse/SPARK-32797

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
jenkins will test this

Closes #33469 from shaneknapp/updating-python-paks.

Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
(commit: ad528a0)
The file was modified dev/ansible-for-test-node/roles/jenkins-worker/files/python_environments/spark-py36-spec.txt (diff)
Commit 09bebc8bdef49a9c35bd5e5edf9d513dfa6c9557 by gurwls223
[SPARK-35912][SQL] Fix nullability of `spark.read.json/spark.read.csv`

### What changes were proposed in this pull request?

Rework [PR](https://github.com/apache/spark/pull/33212) with suggestions.

This PR makes `spark.read.json()` behave the same as the Datasource API `spark.read.format("json").load("path")`: Spark should turn a non-nullable schema into a nullable one by default when using the `spark.read.json()` API.

Here is an example:

```scala
  val schema = StructType(Seq(StructField("value",
    StructType(Seq(
      StructField("x", IntegerType, nullable = false),
      StructField("y", IntegerType, nullable = false)
    )),
    nullable = true
  )))

  val testDS = Seq("""{"value":{"x":1}}""").toDS
  spark.read
    .schema(schema)
    .json(testDS)
    .printSchema()

  spark.read
    .schema(schema)
    .format("json")
    .load("/tmp/json/t1")
    .printSchema()
  // root
  //  |-- value: struct (nullable = true)
  //  |    |-- x: integer (nullable = true)
  //  |    |-- y: integer (nullable = true)
```

Before this pr:
```
// output of spark.read.json()
root
|-- value: struct (nullable = true)
|    |-- x: integer (nullable = false)
|    |-- y: integer (nullable = false)
```

After this pr:
```
// output of spark.read.json()
root
|-- value: struct (nullable = true)
|    |-- x: integer (nullable = true)
|    |-- y: integer (nullable = true)
```

- `spark.read.csv()` also has the same problem.
- The Datasource API `spark.read.format("json").load("path")` applies this logic when resolving the relation:

https://github.com/apache/spark/blob/c77acf0bbc25341de2636649fdd76f9bb4bdf4ed/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L415-L421

### Does this PR introduce _any_ user-facing change?

Yes, `spark.read.json()` and `spark.read.csv()` no longer respect the non-nullability of a user-given schema and always turn it into a nullable schema by default.

### How was this patch tested?

New test.

Closes #33436 from cfmcgrady/SPARK-35912-v3.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 09bebc8)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala (diff)
The file was modified docs/sql-migration-guide.md (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (diff)
Commit dcb7db5370cffc1c309671195a325bef7829ec10 by dongjoon
[SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation

### What changes were proposed in this pull request?

This PR upgrades `zstd-jni` from `1.5.0-2` to `1.5.0-3`.
`1.5.0-3` was released a few days ago.
This release resolves an issue with buffer size calculation, which can affect usage in Spark.
https://github.com/luben/zstd-jni/releases/tag/v1.5.0-3

### Why are the changes needed?

It might be a corner case that the skipping length is greater than `2^31 - 1`, but it could still affect Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33464 from sarutak/upgrade-zstd-jni-1.5.0-3.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: dcb7db5)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified pom.xml (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
Commit de8e4be92c2275d2dd465be435ec31b8acbee9cb by wenchen
[SPARK-36063][SQL] Optimize OneRowRelation subqueries

### What changes were proposed in this pull request?
This PR adds optimization for scalar and lateral subqueries with OneRowRelation as leaf nodes. It inlines such subqueries before decorrelation to avoid rewriting them as left outer joins. It also introduces a flag to turn on/off this optimization: `spark.sql.optimizer.optimizeOneRowRelationSubquery` (default: True).

For example:
```sql
select (select c1) from t
```
Analyzed plan:
```
Project [scalar-subquery#17 [c1#18] AS scalarsubquery(c1)#22]
:  +- Project [outer(c1#18)]
:     +- OneRowRelation
+- LocalRelation [c1#18, c2#19]
```

Optimized plan before this PR:
```
Project [c1#18#25 AS scalarsubquery(c1)#22]
+- Join LeftOuter, (c1#24 <=> c1#18)
   :- LocalRelation [c1#18]
   +- Aggregate [c1#18], [c1#18 AS c1#18#25, c1#18 AS c1#24]
      +- LocalRelation [c1#18]
```

Optimized plan after this PR:
```
LocalRelation [scalarsubquery(c1)#22]
```

### Why are the changes needed?
To optimize query plans.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new unit tests.

Closes #33284 from allisonwang-db/spark-36063-optimize-subquery-one-row-relation.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: de8e4be)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala (diff)
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeOneRowRelationSubquerySuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuerySuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala (diff)