Changes

Summary

  1. [SPARK-36131][SQL][TEST] Refactor ParquetColumnIndexSuite (commit: 7a7b086) (details)
  2. [SPARK-36129][BUILD] Upgrade commons-compress to 1.21 to deal with CVEs (commit: fd06cc2) (details)
  3. [SPARK-36037][SQL] Support ANSI SQL LOCALTIMESTAMP datetime value (commit: b4f7758) (details)
  4. [SPARK-35639][SQL] Make hasCoalescedPartition return true if something (commit: 4033b2a) (details)
  5. [SPARK-36130][SQL] UnwrapCastInBinaryComparison should skip In (commit: 103d16e) (details)
  6. [SPARK-34892][SS] Introduce MergingSortWithSessionWindowStateIterator (commit: 12a576f) (details)
  7. [SPARK-35780][SQL] Support DATE/TIMESTAMP literals across the full range (commit: b866457) (details)
  8. [SPARK-36123][SQL] Parquet vectorized reader doesn't skip null values (commit: e980c7a) (details)
  9. [SPARK-35639][SQL][FOLLOWUP] Make hasCoalescedPartition return true if (commit: 3819641) (details)
  10. [SPARK-36069][SQL] Add field info to `from_json`'s exception in the (commit: 1e86345) (details)
  11. [SPARK-36125][PYTHON] Implement non-equality comparison operators (commit: 0cb120f) (details)
  12. [SPARK-36106][SQL][CORE] Label error classes for subset of (commit: e92b8ea) (details)
  13. [SPARK-36144][INFRA][TESTS] Use Python 3.9 in `run-pip-tests` conda (commit: 5202944) (details)
  14. [SPARK-36037][SQL][FOLLOWUP] Fix flaky test for datetime function (commit: 0973397) (details)
  15. [SPARK-36139][INFRA][TESTS] Remove Python 3.6 from `pyspark` GitHub (commit: 416a7fd) (details)
  16. [SPARK-36150][INFRA][TESTS] Disable MiMa for Scala 2.13 artifacts (commit: 5acfecb) (details)
  17. [SPARK-36037][TESTS][FOLLOWUP] Avoid wrong test results on daylight (commit: 564d3de) (details)
  18. [SPARK-36149][PYTHON] Clarify documentation for dayofweek and weekofyear (commit: b5ee6d0) (details)
  19. [SPARK-36148][SQL] Fix input data types check for regexp_replace (commit: 4dfd266) (details)
  20. [SPARK-33898][SQL][FOLLOWUP] Fix the behavior of `SHOW CREATE TABLE` to (commit: f95ca31) (details)
  21. [SPARK-29519][SQL][FOLLOWUP] Keep output is deterministic for show (commit: e05441c) (details)
  22. [SPARK-36154][DOCS] Documenting week and quarter as valid formats in (commit: 802f632) (details)
  23. [SPARK-36159][BUILD] Replace 'python' to 'python3' in (commit: 6bd385f) (details)
  24. [SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to (commit: a71dd6a) (details)
  25. [SPARK-32792][SQL][FOLLOWUP] Fix Parquet filter pushdown NOT IN (commit: 0062c03) (details)
  26. [SPARK-36158][PYTHON][DOCS] Improving pyspark sql/functions (commit: 2db7ed7) (details)
  27. [SPARK-36157][SQL][SS] TimeWindow expression: apply filter before (commit: 1ceb753) (details)
  28. [SPARK-36135][SQL] Support TimestampNTZ type in file partitioning (commit: 96c2919) (details)
  29. [SPARK-36165][INFRA] Fix SQL doc generation in GitHub Action (commit: d69f981) (details)
  30. [SPARK-36164][INFRA] run-test.py should not fail when APACHE_SPARK_REF (commit: c8a3c22) (details)
  31. [SPARK-36034][SQL] Rebase datetime in pushed down filters to parquet (commit: b09b7f7) (details)
  32. [SPARK-36164][INFRA][FOLLOWUP] Add empty string check back (commit: 5f41a27) (details)
  33. [SPARK-36167][PYTHON] Revisit more InternalField managements (commit: c22f7a4) (details)
  34. [SPARK-36166][TESTS] Support Scala 2.13 test in `dev/run-tests.py` (commit: f66153d) (details)
  35. [SPARK-36169][SQL] Make 'spark.sql.sources.disabledJdbcConnProviderList' (commit: fba61ad) (details)
  36. [SPARK-36171][BUILD] Upgrade GenJavadoc to 0.18 (commit: ad744fb) (details)
  37. [SPARK-35985][SQL] push partitionFilters for empty readDataSchema (commit: f06aa4a) (details)
  38. [SPARK-35710][SQL] Support DPP + AQE when there is no reused broadcast (commit: c1b3f86) (details)
  39. [SPARK-34893][SS] Support session window natively (commit: f2bf8b0) (details)
  40. [SPARK-36160][PYTHON][DOCS] Clarifying documentation for pyspark (commit: 2d8d7b4) (details)
  41. [SPARK-36152][INFRA][TESTS] Add Scala 2.13 daily build and test GitHub (commit: 3218e4e) (details)
  42. [SPARK-36128][SQL] Apply spark.sql.hive.metastorePartitionPruning for (commit: 37dc3f9) (details)
  43. [SPARK-35276][CORE] Calculate checksum for shuffle data and write as (commit: 4783fb7) (details)
  44. [SPARK-32922][SHUFFLE][CORE][FOLLOWUP] Fixes few issues when the (commit: 6d2cbad) (details)
  45. [SPARK-35785][SS][FOLLOWUP] Remove ignored test from RocksDBSuite (commit: 8009f0d) (details)
  46. [SPARK-36170][SQL] Change quoted interval literal (interval constructor) (commit: 71ea25d) (details)
Commit 7a7b086534ebb6d2d0f077897f43018f65334c6c by dongjoon
[SPARK-36131][SQL][TEST] Refactor ParquetColumnIndexSuite

### What changes were proposed in this pull request?

Refactor `ParquetColumnIndexSuite` and allow better code reuse.

### Why are the changes needed?

A few methods in the test suite can share the same utility method `checkUnalignedPages` so it's better to do that and remove code duplication.

Additionally, `parquet.enable.dictionary` is tested for both `true` and `false` combination.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33334 from sunchao/SPARK-35743-test-refactoring.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 7a7b086)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnIndexSuite.scala (diff)
Commit fd06cc211d7d1579067ad717da9976aabd71b70d by dongjoon
[SPARK-36129][BUILD] Upgrade commons-compress to 1.21 to deal with CVEs

### What changes were proposed in this pull request?

This PR upgrades `commons-compress` from `1.20` to `1.21` to deal with CVEs.

### Why are the changes needed?

Some CVEs which affect `commons-compress 1.20` are reported and fixed in `1.21`.
https://commons.apache.org/proper/commons-compress/security-reports.html

* CVE-2021-35515
* CVE-2021-35516
* CVE-2021-35517
* CVE-2021-36090

The severities are reported as low for all the CVEs but it would be better to deal with them just in case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33333 from sarutak/upgrade-commons-compress-1.21.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: fd06cc2)
The file was modifiedpom.xml (diff)
The file was modifieddev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modifieddev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
Commit b4f7758944505c7c957e2e0c2c70da5ea746099b by wenchen
[SPARK-36037][SQL] Support ANSI SQL LOCALTIMESTAMP datetime value function

### What changes were proposed in this pull request?
`LOCALTIMESTAMP()` is a datetime value function from ANSI SQL.
The syntax show below:
```
<datetime value function> ::=
    <current date value function>
  | <current time value function>
  | <current timestamp value function>
  | <current local time value function>
  | <current local timestamp value function>
<current date value function> ::=
CURRENT_DATE
<current time value function> ::=
CURRENT_TIME [ <left paren> <time precision> <right paren> ]
<current local time value function> ::=
LOCALTIME [ <left paren> <time precision> <right paren> ]
<current timestamp value function> ::=
CURRENT_TIMESTAMP [ <left paren> <timestamp precision> <right paren> ]
<current local timestamp value function> ::=
LOCALTIMESTAMP [ <left paren> <timestamp precision> <right paren> ]
```

`LOCALTIMESTAMP()` returns the current timestamp at the start of query evaluation as TIMESTAMP WITH OUT TIME ZONE. This is similar to `CURRENT_TIMESTAMP()`.
Note we need to update the optimization rule `ComputeCurrentTime` so that Spark returns the same result in a single query if the function is called multiple times.

### Why are the changes needed?
`CURRENT_TIMESTAMP()` returns the current timestamp at the start of query evaluation.
`LOCALTIMESTAMP()` returns the current timestamp without time zone at the start of query evaluation.
The `LOCALTIMESTAMP` function is an ANSI SQL.
The `LOCALTIMESTAMP` function is very useful.

### Does this PR introduce _any_ user-facing change?
'Yes'. Support new function `LOCALTIMESTAMP()`.

### How was this patch tested?
New tests.

Closes #33258 from beliefer/SPARK-36037.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: b4f7758)
The file was modifiedsql/core/src/test/resources/sql-tests/inputs/datetime.sql (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/datetime.sql.out (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/expressions/ExpressionInfoSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ComputeCurrentTimeSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/timestampNTZ/datetime.sql.out (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingAggregationSuite.scala (diff)
Commit 4033b2a3f4faa48bfdc0802d5637b7ec726b06e7 by wenchen
[SPARK-35639][SQL] Make hasCoalescedPartition return true if something was actually coalesced

### What changes were proposed in this pull request?
Fix `CustomShuffleReaderExec.hasCoalescedPartition` so that it returns true only if some original partitions got combined

### Why are the changes needed?
W/o this change `CustomShuffleReaderExec` description can report `coalesced` even though partitions are unchanged

### Does this PR introduce _any_ user-facing change?
Yes, the `Arguments` in the node description is now accurate:
```
(16) CustomShuffleReader
Input [3]: [registration#4, sum#85, count#86L]
Arguments: coalesced
```

### How was this patch tested?
Existing tests

Closes #32872 from ekoifman/PRISM-77023-fix-hasCoalescedPartition.

Authored-by: Eugene Koifman <eugene.koifman@workday.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 4033b2a)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/CoalesceShufflePartitionsSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CustomShuffleReaderExec.scala (diff)
Commit 103d16e868e3caaa08401e0398c20b4a4574c6b7 by wenchen
[SPARK-36130][SQL] UnwrapCastInBinaryComparison should skip In expression when in.list contains an expression that is not literal

### What changes were proposed in this pull request?

Fix [comment](https://github.com/apache/spark/pull/32488#issuecomment-879315179)
This PR fix rule `UnwrapCastInBinaryComparison` bug. Rule UnwrapCastInBinaryComparison should skip In expression when in.list contains an expression that is not literal.

- In

Before this pr, the following example will throw an exception.
```scala
  withTable("tbl") {
    sql("CREATE TABLE tbl (d decimal(33, 27)) USING PARQUET")
    sql("SELECT d FROM tbl WHERE d NOT IN (d + 1)")
  }
```
- InSet

As the analyzer guarantee that all the elements in the `inSet.hset` are literal, so this is not an issue for `InSet`.

https://github.com/apache/spark/blob/fbf53dee37129a493a4e5d5a007625b35f44fbda/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L264-L279

### Does this PR introduce _any_ user-facing change?

No, only bug fix.

### How was this patch tested?

New test.

Closes #33335 from cfmcgrady/SPARK-36130.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 103d16e)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparisonSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala (diff)
Commit 12a576f1759aabdcf37c5c841638428e1e33a591 by kabhwan.opensource
[SPARK-34892][SS] Introduce MergingSortWithSessionWindowStateIterator sorting input rows and rows in state efficiently

Introduction: this PR is a part of SPARK-10816 (EventTime based sessionization (session window)). Please refer #31937 to see the overall view of the code change. (Note that code diff could be diverged a bit.)

### What changes were proposed in this pull request?

This PR introduces MergingSortWithSessionWindowStateIterator, which does "merge sort" between input rows and sessions in state based on group key and session's start time.

Note that the iterator does merge sort among input rows and sessions grouped by grouping key. The iterator doesn't provide sessions in state which keys don't exist in input rows. For input rows, the iterator will provide all rows regardless of the existence of matching sessions in state.

MergingSortWithSessionWindowStateIterator works on the precondition that given iterator is sorted by "group keys + start time of session window", and the iterator still retains the characteristic of the sort.

### Why are the changes needed?

This part is a one of required on implementing SPARK-10816.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New UT added.

Closes #33077 from HeartSaVioR/SPARK-34892-SPARK-10816-PR-31570-part-4.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(commit: 12a576f)
The file was addedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MergingSortWithSessionWindowStateIteratorSuite.scala
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala (diff)
The file was addedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MergingSortWithSessionWindowStateIterator.scala
Commit b86645776b96b28bf34f5c208dbea1e2019f5f74 by wenchen
[SPARK-35780][SQL] Support DATE/TIMESTAMP literals across the full range

### What changes were proposed in this pull request?
DATE/TIMESTAMP literals support years 0000 to 9999. However, internally we support a range that is much larger.
We can add or subtract large intervals from a date/timestamp and the system will happily process and display large negative and positive dates.

Since we obviously cannot put this genie back into the bottle the only thing we can do is allow matching DATE/TIMESTAMP literals.

### Why are the changes needed?
make spark more usable and bug fix

### Does this PR introduce _any_ user-facing change?
Yes, after this PR, below SQL will have different results
```sql
select cast('-10000-1-2' as date) as date_col
-- before PR: NULL
-- after PR: -10000-1-2
```

```sql
select cast('2021-4294967297-11' as date) as date_col
-- before PR: 2021-01-11
-- after PR: NULL
```

### How was this patch tested?
newly added test cases

Closes #32959 from linhongliu-db/SPARK-35780-full-range-datetime.

Lead-authored-by: Linhong Liu <linhong.liu@databricks.com>
Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: b866457)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/AnsiCastSuiteBase.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (diff)
The file was modifiedsql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/timestampNTZ/datetime.sql.out (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/datetime.sql.out (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/inputs/datetime.sql (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/postgreSQL/date.sql.out (diff)
Commit e980c7a8404ab290998c2dec0e2e2437d675d67c by wenchen
[SPARK-36123][SQL] Parquet vectorized reader doesn't skip null values correctly

### What changes were proposed in this pull request?

Fix the skipping values logic in Parquet vectorized reader when column index is effective, by considering nulls and only call `ParquetVectorUpdater.skipValues` when the values are non-null.

### Why are the changes needed?

Currently, the Parquet vectorized reader may not work correctly if column index filtering is effective, and the data page contains null values. For instance, let's say we have two columns `c1: BIGINT` and `c2: STRING`, and the following pages:
```
   * c1        500       500       500       500
   *  |---------|---------|---------|---------|
   *  |-------|-----|-----|---|---|---|---|---|
   * c2     400   300   300 200 200 200 200 200
```

and suppose we have a query like the following:
```sql
SELECT * FROM t WHERE c1 = 500
```

this will create a Parquet row range `[500, 1000)` which, when applied to `c2`, will require us to skip all the rows in `[400,500)`. However the current logic for skipping rows is via `updater.skipValues(n, valueReader)` which is incorrect since this skips the next `n` non-null values. In the case when nulls are present, this will not work correctly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a new test in `ParquetColumnIndexSuite`.

Closes #33330 from sunchao/SPARK-36123-skip-nulls.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: e980c7a)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnIndexSuite.scala (diff)
Commit 3819641201fedcbf5d6dedd93a066784aca960e6 by wenchen
[SPARK-35639][SQL][FOLLOWUP] Make hasCoalescedPartition return true if something was actually coalesced

### What changes were proposed in this pull request?

Add `CoalescedPartitionSpec(0, 0, _)` check if a `CoalescedPartitionSpec` is coalesced.

### Why are the changes needed?

Fix corner case.

### Does this PR introduce _any_ user-facing change?

yes, UI may be changed

### How was this patch tested?

Add test

Closes #33342 from ulysses-you/SPARK-35639-FOLLOW.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 3819641)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CustomShuffleReaderExec.scala (diff)
Commit 1e86345ae37042e334484dc41e4d21722e6b22ac by max.gekk
[SPARK-36069][SQL] Add field info to `from_json`'s exception in the FAILFAST mode

### What changes were proposed in this pull request?

spark function from_json output field name, field type and field value when FAILFAST mode throw exception.

### Why are the changes needed?

This infoormation is very important for devlops to find where error input data is located.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

org/apache/spark/sql/JsonFunctionsSuite.scala:598
test("[SPARK-36069] from_json invalid json schema - check field name and field value")

Closes #33297 from geekyouth/feature/FAILFAST_output_fidelaName_fieldValue_dataType.

Lead-authored-by: Geek <forsupergeeker@gmail.com>
Co-authored-by: 极客青年 <forsupergeeker@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 1e86345)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (diff)
Commit 0cb120f390cac96c09ae99c8fbaec2ac06cd2848 by ueshin
[SPARK-36125][PYTHON] Implement non-equality comparison operators between two Categoricals

### What changes were proposed in this pull request?
Implement non-equality comparison operators between two Categoricals.
Non-goal: supporting Scalar input will be a follow-up task.

### Why are the changes needed?
pandas supports non-equality comparisons between two Categoricals. We should follow that.

### Does this PR introduce _any_ user-facing change?
Yes. No `NotImplementedError` for `<`, `<=`, `>`, `>=` operators between two Categoricals. An example is shown as below:

From:
```py
>>> import pyspark.pandas as ps
>>> from pandas.api.types import CategoricalDtype
>>> psser = ps.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> other_psser = ps.Series([2, 1, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     psser <= other_psser
...
Traceback (most recent call last):
...
NotImplementedError: <= can not be applied to categoricals.
```

To:
```py
>>> import pyspark.pandas as ps
>>> from pandas.api.types import CategoricalDtype
>>> psser = ps.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> other_psser = ps.Series([2, 1, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     psser <= other_psser
...
0    False
1     True
2     True
dtype: bool
```
### How was this patch tested?
Unit tests.

Closes #33331 from xinrong-databricks/categorical_compare.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 0cb120f)
The file was modifiedpython/pyspark/pandas/data_type_ops/categorical_ops.py (diff)
The file was modifiedpython/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py (diff)
Commit e92b8ea6f881db4be6e1f2c10588cd2eb5055db8 by gurwls223
[SPARK-36106][SQL][CORE] Label error classes for subset of QueryCompilationErrors

### What changes were proposed in this pull request?

Adds error classes to some of the exceptions in QueryCompilationErrors.

### Why are the changes needed?

Improves auditing for developers and adds useful fields for users (error class and SQLSTATE).

### Does this PR introduce _any_ user-facing change?

Yes, fills in missing error class and SQLSTATE fields.

### How was this patch tested?

Existing tests and new unit tests.

Closes #33309 from karenfeng/group-compilation-errors-1.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: e92b8ea)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/connector/InsertIntoTests.scala (diff)
The file was modifiedcore/src/main/resources/error/error-classes.json (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala (diff)
Commit 5202944b5095b6755a98c4c1fa386fb50c71c43d by gurwls223
[SPARK-36144][INFRA][TESTS] Use Python 3.9 in `run-pip-tests` conda environment

### What changes were proposed in this pull request?

This PR aims to use Python 3.9 instead of 3.6 during `run-pip-tests` conda environment for Apache Spark 3.3.

### Why are the changes needed?

Python 3.6 is deprecated via SPARK-35938 at Apache Spark 3.2. We had better have Python 3.9 test coverage in Apache Spark 3.3.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

Pass the CIs.

Closes #33351 from dongjoon-hyun/SPARK-36144.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 5202944)
The file was modifieddev/run-pip-tests (diff)
Commit 0973397721a1172710c28376a7957d30ed8cbcad by gengliang
[SPARK-36037][SQL][FOLLOWUP] Fix flaky test for datetime function localtimestamp

### What changes were proposed in this pull request?

The threshold of the test case "datetime function localtimestamp" is small, which leads to flaky test results
https://github.com/gengliangwang/spark/runs/3067396143?check_suite_focus=true

This PR is to increase the threshold for checking two the different current local datetimes from 5ms to 1 second. (The test case of current_timestamp uses 5 seconds)
### Why are the changes needed?

Fix flaky test
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #33346 from gengliangwang/fixFlaky.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 0973397)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala (diff)
Commit 416a7fd49002e207dd188bd1547b8b52de5baf06 by dongjoon
[SPARK-36139][INFRA][TESTS] Remove Python 3.6 from `pyspark` GitHub Action job

### What changes were proposed in this pull request?

This PR aims to remove Python 3.6 installation from `pyspark` job in `build and test` GitHub Action Workflow for Apache Spark 3.3.

### Why are the changes needed?

Python 3.6 is deprecated via SPARK-35938. This will save the GitHub Action resource by removing python3.6 testing.

**BEFORE**
```
Will test against the following Python executables: ['python3.6', 'python3.9', 'pypy3']
```

**AFTER**
```
Will test against the following Python executables: ['python3.9', 'pypy3']
```

Note that Python 3.6 is still used in the following cases.
- In another jobs like `Linter`
- In `dev/run-pip-tests` script, pip packaing testing via `conda`.
  - This is handled via https://github.com/apache/spark/pull/33351

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #33349 from dongjoon-hyun/SPARK-36139.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 416a7fd)
The file was modifieddev/run-tests.py (diff)
The file was modified.github/workflows/build_and_test.yml (diff)
Commit 5acfecbf972f506c2436dc4c5f0692a95b9bded1 by dongjoon
[SPARK-36150][INFRA][TESTS] Disable MiMa for Scala 2.13 artifacts

### What changes were proposed in this pull request?

This PR aims to disable MiMa check for Scala 2.13 artifacts.

### Why are the changes needed?

Apache Spark doesn't have Scala 2.13 Maven artifacts yet.
SPARK-36151 will enable this after Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual. The following should succeed without real testing.
```
$ dev/mima -Pscala-2.13
```

Closes #33355 from dongjoon-hyun/SPARK-36150.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 5acfecb)
The file was modifieddev/mima (diff)
Commit 564d3de7c62fa89c6db1e08e809400b339a704cd by max.gekk
[SPARK-36037][TESTS][FOLLOWUP] Avoid wrong test results on daylight saving time

### What changes were proposed in this pull request?

Only use the zone ids that has no daylight saving for testing `localtimestamp`

### Why are the changes needed?

https://github.com/apache/spark/pull/33346#discussion_r670135296 MaxGekk suggests that we should avoid wrong results if possible.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Unit test

Closes #33354 from gengliangwang/FIxDST.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 564d3de)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala (diff)
Commit b5ee6d008d9cc3ad5061ad7415f7099e27769853 by max.gekk
[SPARK-36149][PYTHON] Clarify documentation for dayofweek and weekofyear

### What changes were proposed in this pull request?
Clearly state which weekday corresponds to which integer

### Why are the changes needed?
Documentation clarity

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
only documentation change

Closes #33345 from dominikgehl/doc/pyspark-dayofweek.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: b5ee6d0)
The file was modifiedpython/pyspark/sql/functions.py (diff)
The file was modifiedR/pkg/R/functions.R (diff)
Commit 4dfd266b27fea6954593c6b9e3a2819b290f0aec by max.gekk
[SPARK-36148][SQL] Fix input data types check for regexp_replace

### What changes were proposed in this pull request?
`RegExpReplace` overrides `checkInputDataTypes` but doesn't do the basic type check.
This PR adds the type check so that the error message is more readable.

### Why are the changes needed?
bugfix

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
newly added test case

Closes #33357 from linhongliu-db/SPARK-36148-regexp-replace-check.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 4dfd266)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (diff)
Commit f95ca31c0f841975ac57e1d4ea2fc6ea4f622bab by gurwls223
[SPARK-33898][SQL][FOLLOWUP] Fix the behavior of `SHOW CREATE TABLE` to output deterministic results

### What changes were proposed in this pull request?

This PR fixes a behavior of `SHOW CREATE TABLE` added in `SPARK-33898` (#32931) to output deterministic result.
A test `SPARK-33898: SHOW CREATE TABLE` in `DataSourceV2SQLSuite` compares two `CREATE TABLE` statements. One is generated by `SHOW CREATE TABLE` against a created table and the other is expected `CREATE TABLE` statement.

The created table has options `from` and `to`, and they are declared in this order.
```
CREATE TABLE $t (
  a bigint NOT NULL,
  b bigint,
  c bigint,
  `extra col` ARRAY<INT>,
  `<another>` STRUCT<x: INT, y: ARRAY<BOOLEAN>>
)
USING foo
OPTIONS (
  from = 0,
  to = 1)
COMMENT 'This is a comment'
TBLPROPERTIES ('prop1' = '1')
PARTITIONED BY (a)
LOCATION '/tmp'
```

And the expected `CREATE TABLE` in the test code is like as follows.
```
"CREATE TABLE testcat.ns1.ns2.tbl (",
"`a` BIGINT NOT NULL,",
"`b` BIGINT,",
"`c` BIGINT,",
"`extra col` ARRAY<INT>,",
"`<another>` STRUCT<`x`: INT, `y`: ARRAY<BOOLEAN>>)",
"USING foo",
"OPTIONS(",
"'from' = '0',",
"'to' = '1')",
"PARTITIONED BY (a)",
"COMMENT 'This is a comment'",
"LOCATION '/tmp'",
"TBLPROPERTIES(",
"'prop1' = '1')"
```
As you can see, the order of `from` and `to` is expected.
But options are implemented as `Map` so the order of key cannot be kept.

In fact, this test fails with Scala 2.13.
```
[info] - SPARK-33898: SHOW CREATE TABLE *** FAILED *** (515 milliseconds)
[info]   Array("CREATE TABLE testcat.ns1.ns2.tbl (", "`a` BIGINT NOT NULL,", "`b` BIGINT,", "`c` BIGINT,", "`extra col` ARRAY<INT>,", "`<another>` STRUCT<`x`: INT, `y`: ARRAY<BOOLEAN>>)", "USING foo", "OPTIONS(", "'to' = '1',", "'from' = '0')", "PARTITIONED BY (a)", "COMMENT 'This is a comment'", "LOCATION '/tmp'", "TBLPROPERTIES(", "'prop1' = '1')") did not equal Array("CREATE TABLE testcat.ns1.ns2.tbl (", "`a` BIGINT NOT NULL,", "`b` BIGINT,", "`c` BIGINT,", "`extra col` ARRAY<INT>,", "`<another>` STRUCT<`x`: INT, `y`: ARRAY<BOOLEAN>>)", "USING foo", "OPTIONS(", "'from' = '0',", "'to' = '1')", "PARTITIONED BY (a)", "COMMENT 'This is a comment'", "LOCATION '/tmp'", "TBLPROPERTIES(", "'prop1' = '1')") (DataSourceV2SQLSuite.scala:1997)
```
In the current master, the test doesn't fail with Scala 2.12 but it's still non-deterministic.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that the modified test passed with both Scala 2.12 and Scala 2.13 with this change.

Closes #33343 from sarutak/fix-show-create-table-test.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: f95ca31)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala (diff)
Commit e05441c223bc3a0479bdae7969b4827c85fdf0ed by wenchen
[SPARK-29519][SQL][FOLLOWUP] Keep output is deterministic for show tblproperties

### What changes were proposed in this pull request?
Keep the output order is deterministic for `SHOW TBLPROPERTIES`

### Why are the changes needed?
[#33343](https://github.com/apache/spark/pull/33343#issue-689828187).
Keep the output order deterministic meaningful.

Since the properties are sorted and then compare result in the testcase for `SHOW TBLPROPERTIES`,  it does not fail, but ideally, the output is ordered and deterministic.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existed ut test

Closes #33353 from Peng-Lei/order-ouput-properties.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: e05441c)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablePropertiesExec.scala (diff)
Commit 802f632a28e46538053c056d1ce43374f80454ae by max.gekk
[SPARK-36154][DOCS] Documenting week and quarter as valid formats in pyspark sql/functions trunc

### What changes were proposed in this pull request?
Added missing documentation of week and quarter as valid formats to pyspark sql/functions trunc

### Why are the changes needed?
Pyspark documentation and scala documentation didn't mentioned the same supported formats

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only documentation change

Closes #33359 from dominikgehl/feature/SPARK-36154.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 802f632)
The file was modifiedR/pkg/R/functions.R (diff)
The file was modifiedpython/pyspark/sql/functions.py (diff)
Commit 6bd385f1e34a03507dd59a6fa02aab976547614b by dongjoon
[SPARK-36159][BUILD] Replace 'python' to 'python3' in dev/test-dependencies.sh

### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/26330. There is the last place to fix in `dev/test-dependencies.sh`

### Why are the changes needed?

To stick to Python 3 instead of using Python 2 mistakenly.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested.

Closes #33368 from HyukjinKwon/change-python-3.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 6bd385f)
The file was modifieddev/test-dependencies.sh (diff)
Commit a71dd6af2f4fb82c1b7ef978be27ecb4e86429cd by dongjoon
[SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs

### What changes were proposed in this pull request?

This PR proposes to use Python 3.9 in documentation and linter at GitHub Actions. This PR also contains the fixes for mypy check (introduced by Python 3.9 upgrade)

```
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:64: error: Name "np.ndarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:91: error: Name "np.recarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:165: error: Name "np.ndarray" is not defined
python/pyspark/pandas/categorical.py:82: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/categorical.py:109: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/ml/linalg/__init__.pyi:184: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/ml/linalg/__init__.pyi:217: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/pandas/typedef/typehints.py:163: error: Module has no attribute "bool"; maybe "bool_" or "bool8"?
python/pyspark/pandas/typedef/typehints.py:174: error: Module has no attribute "float"; maybe "float_", "cfloat", or "float96"?
python/pyspark/pandas/typedef/typehints.py:180: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/ml.py:81: error: Value of type variable "_DTypeScalar_co" of "dtype" cannot be "object"
python/pyspark/pandas/indexing.py:1649: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/indexing.py:1656: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/frame.py:4969: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4969: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:4970: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4970: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:7402: error: "List[Any]" has no attribute "tolist"
python/pyspark/pandas/series.py:1030: error: Module has no attribute "_NoValue"
python/pyspark/pandas/series.py:1031: error: Module has no attribute "_NoValue"
python/pyspark/pandas/indexes/category.py:159: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/indexes/category.py:180: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/namespace.py:2036: error: Argument 1 to "column_name" has incompatible type "float"; expected "str"
python/pyspark/pandas/mlflow.py:59: error: Incompatible types in assignment (expression has type "Type[floating[Any]]", variable has type "str")
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/data_type_ops/categorical_ops.py:56: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/tests/test_typedef.py:70: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:77: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:85: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:100: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:108: error: Name "np.float" is not defined
python/pyspark/mllib/clustering.pyi:152: error: Incompatible types in assignment (expression has type "ndarray[Any, Any]", base class "KMeansModel" defined the type as "List[ndarray[Any, Any]]")
python/pyspark/mllib/classification.pyi:93: error: Signature of "predict" incompatible with supertype "LinearClassificationModel"
Found 32 errors in 15 files (checked 315 source files)
1
```

### Why are the changes needed?

Python 3.6 is deprecated at SPARK-35938

### Does this PR introduce _any_ user-facing change?

No. Maybe static analysis, etc. by some type hints but they are really non-breaking..

### How was this patch tested?

I manually checked by GitHub Actions build in forked repository.

Closes #33356 from HyukjinKwon/SPARK-36146.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: a71dd6a)
The file was modifiedpython/pyspark/mllib/classification.pyi (diff)
The file was modifiedpython/pyspark/ml/linalg/__init__.pyi (diff)
The file was modifiedpython/pyspark/mllib/clustering.pyi (diff)
The file was modifiedpython/pyspark/pandas/typedef/typehints.py (diff)
The file was modifiedpython/pyspark/pandas/frame.py (diff)
The file was modifiedpython/pyspark/pandas/tests/test_typedef.py (diff)
The file was modifiedpython/pyspark/pandas/categorical.py (diff)
The file was modifiedpython/pyspark/pandas/mlflow.py (diff)
The file was modifiedpython/pyspark/mllib/_typing.pyi (diff)
The file was modifiedpython/pyspark/pandas/indexes/category.py (diff)
The file was modified.github/workflows/build_and_test.yml (diff)
The file was modifiedpython/pyspark/sql/pandas/_typing/protocols/frame.pyi (diff)
The file was modifiedpython/pyspark/pandas/indexing.py (diff)
The file was modifiedpython/pyspark/pandas/data_type_ops/categorical_ops.py (diff)
The file was modifiedpython/pyspark/pandas/ml.py (diff)
The file was modifiedpython/pyspark/pandas/series.py (diff)
The file was modifiedpython/pyspark/pandas/namespace.py (diff)
Commit 0062c03c1584b0ab1443fa24f288e611bfbb5115 by max.gekk
[SPARK-32792][SQL][FOLLOWUP] Fix Parquet filter pushdown NOT IN predicate

### What changes were proposed in this pull request?

This pr fix Parquet filter pushdown `NOT` `IN` predicate if its values exceeds `spark.sql.parquet.pushdown.inFilterThreshold`. For example: `Not(In(a, Array(2, 3, 7))`. We can not push down `not(and(gteq(a, 2), lteq(a, 7)))`.

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33365 from wangyum/SPARK-32792-3.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 0062c03)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala (diff)
Commit 2db7ed7964cc121fea525a14b2c146ae371237ea by max.gekk
[SPARK-36158][PYTHON][DOCS] Improving pyspark sql/functions months_between documentation

### What changes were proposed in this pull request?
Updating pyspark months_between documentation to the precision in the scala documentation

### Why are the changes needed?
Documentation clarity

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only documentation change

Closes #33366 from dominikgehl/feature/SPARK-36158.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 2db7ed7)
The file was modifiedpython/pyspark/sql/functions.py (diff)
Commit 1ceb753ef56aab1574251363e17a1bd5f1b67314 by viirya
[SPARK-36157][SQL][SS] TimeWindow expression: apply filter before project

### What changes were proposed in this pull request?

This PR proposes to change the application of the operators for TimeWindow, from project -> filter, to filter -> project.

Currently Spark applies project, and filter, while filter is not dependent on project. That said, if the input rows are going to be filtered out via filter predicate, applying projection on these input rows are simply waste of time.

### Why are the changes needed?

This is a simple improvement requiring changes from a couple of lines.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33367 from HeartSaVioR/SPARK-36157.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 1ceb753)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit 96c2919988ddf78d104103876d8d8221e8145baa by gengliang
[SPARK-36135][SQL] Support TimestampNTZ type in file partitioning

### What changes were proposed in this pull request?

Support TimestampNTZ type in file partitioning
* When there is no provided schema and the default Timestamp type is TimestampNTZ , Spark should infer and parse the timestamp value partitions as TimestampNTZ.
* When the provided Partition schema is TimestampNTZ, Spark should be able to parse the TimestampNTZ type partition column.

### Why are the changes needed?

File partitioning is an important feature and Spark should support TimestampNTZ type in it.

### Does this PR introduce _any_ user-facing change?

Yes, Spark supports TimestampNTZ type in file partitioning

### How was this patch tested?

Unit tests

Closes #33344 from gengliangwang/partition.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 96c2919)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala (diff)
Commit d69f9818692c09e39246b875c9cbb2fc8954cf24 by dongjoon
[SPARK-36165][INFRA] Fix SQL doc generation in GitHub Action

### What changes were proposed in this pull request?

This PR aims to fix SQL doc generation in GitHub Action by specifying the mkdocs-installed python version explicitly.

### Why are the changes needed?

Currently, the SQL doc generation is using `spark-submit` and picked up another `Python 3` binaries.
```
Generating SQL configuration table HTML file.
Traceback (most recent call last):
  File "/__w/spark/spark/sql/gen-sql-config-docs.py", line 25, in <module>
    from mkdocs.structure.pages import markdown
ModuleNotFoundError: No module named 'mkdocs'
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action linter job.

Closes #33372 from dongjoon-hyun/fix_mkdocs.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: d69f981)
The file was modified.github/workflows/build_and_test.yml (diff)
Commit c8a3c226283ae6e4ece0520f0a996c4a8685381b by dongjoon
[SPARK-36164][INFRA] run-test.py should not fail when APACHE_SPARK_REF is not defined

### What changes were proposed in this pull request?
This PR aims to change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

### Why are the changes needed?
Currently, the run-test.py ends with an error.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #33371 from williamhyun/SPARK-36164.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: c8a3c22)
The file was modifieddev/run-tests.py (diff)
Commit b09b7f7cc024d3054debd7bdb51caec3b53764d7 by max.gekk
[SPARK-36034][SQL] Rebase datetime in pushed down filters to parquet

### What changes were proposed in this pull request?
In the PR, I propose to propagate either the SQL config `spark.sql.parquet.datetimeRebaseModeInRead` or/and Parquet option `datetimeRebaseMode` to `ParquetFilters`. The `ParquetFilters` class uses the settings in conversions of dates/timestamps instances from datasource filters to values pushed via `FilterApi` to the `parquet-column` lib.

Before the changes, date/timestamp values expressed as days/microseconds/milliseconds are interpreted as offsets in Proleptic Gregorian calendar, and pushed to the parquet library as is. That works fine if timestamp/dates values in parquet files were saved in the `CORRECTED` mode but in the `LEGACY` mode, filter's values could not match to actual values.

After the changes, timestamp/dates values of filters pushed down to parquet libs such as `FilterApi.eq(col1, -719162)` are rebased according the rebase settings. For the example, if the rebase mode is `CORRECTED`, **-719162** is pushed down as is but if the current rebase mode is `LEGACY`, the number of days is rebased to **-719164**. For more context, the PR description https://github.com/apache/spark/pull/28067 shows the diffs between two calendars.

### Why are the changes needed?
The changes fix the bug portrayed by the following example from SPARK-36034:
```scala
In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
>>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
>>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
+----+
|date|
+----+
+----+
```
The result must have the date value `0001-01-01`.

### Does this PR introduce _any_ user-facing change?
In some sense, yes. Query results can be different in some cases. For the example above:
```scala
scala> spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
scala> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
scala> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show(false)
+----------+
|date      |
+----------+
|0001-01-01|
+----------+
```

### How was this patch tested?
By running the modified test suite `ParquetFilterSuite`:
```
$ build/sbt "test:testOnly *ParquetV1FilterSuite"
$ build/sbt "test:testOnly *ParquetV2FilterSuite"
```

Closes #33347 from MaxGekk/fix-parquet-ts-filter-pushdown.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: b09b7f7)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScanBuilder.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala (diff)
Commit 5f41a2752fba617524c072ee4f0e3a4e5368be13 by dongjoon
[SPARK-36164][INFRA][FOLLOWUP] Add empty string check back

### What changes were proposed in this pull request?

This is a follow-up of #33371.
At the branch commit GitHub run, we have an empty environment variable.
This PR adds back the empty string check logic.

### Why are the changes needed?

Currently, the failure happens when we use `--modules` in GitHub Action.
```
$ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions
fatal: ambiguous argument '': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module>
    main()
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 663, in main
    changed_files = identify_changed_files_from_git_commits(
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 91, in identify_changed_files_from_git_commits
    raw_output = subprocess.check_output(['git', 'diff', '--name-only', patch_sha, diff_target],
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['git', 'diff', '--name-only', 'HEAD', '']' returned non-zero exit status 128.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually. The following failure is correct in local environment because it passed `identify_changed_files_from_git_commits` already.
```
$ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module>
    main()
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 668, in main
    os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"])
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'GITHUB_SHA'
```

Closes #33374 from dongjoon-hyun/SPARK-36164.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 5f41a27)
The file was modifieddev/run-tests.py (diff)
Commit c22f7a4834e6fb7b69c4cc4af87c61c2fbbe0786 by ueshin
[SPARK-36167][PYTHON] Revisit more InternalField managements

### What changes were proposed in this pull request?

Revisit and manage `InternalField` in more places.

### Why are the changes needed?

There are other places we can manage `InternalField`, and we can keep extension dtypes or `CategoricalDtype`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added some tests.

Closes #33377 from ueshin/issues/SPARK-36167/internal_field.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: c22f7a4)
The file was modifiedpython/pyspark/pandas/tests/indexes/test_category.py (diff)
The file was modifiedpython/pyspark/pandas/data_type_ops/categorical_ops.py (diff)
The file was modifiedpython/pyspark/pandas/indexes/multi.py (diff)
The file was modifiedpython/pyspark/pandas/internal.py (diff)
The file was modifiedpython/pyspark/pandas/categorical.py (diff)
The file was modifiedpython/pyspark/pandas/indexes/base.py (diff)
The file was modifiedpython/pyspark/pandas/series.py (diff)
The file was modifiedpython/pyspark/pandas/tests/test_dataframe.py (diff)
The file was modifiedpython/pyspark/pandas/frame.py (diff)
The file was modifiedpython/pyspark/pandas/tests/indexes/test_base.py (diff)
The file was modifiedpython/pyspark/pandas/indexes/category.py (diff)
The file was modifiedpython/pyspark/pandas/tests/test_categorical.py (diff)
Commit f66153de787cb3c51c40032c3a5aba3a2eb84680 by dongjoon
[SPARK-36166][TESTS] Support Scala 2.13 test in `dev/run-tests.py`

### What changes were proposed in this pull request?

For Apache Spark 3.2, this PR aims to support Scala 2.13 test in `dev/run-tests.py` by adding `SCALA_PROFILE` and in `dev/run-tests-jenkins.py` by adding `AMPLAB_JENKINS_BUILD_SCALA_PROFILE`.

In addition, `test-dependencies.sh` is skipped for Scala 2.13 because we currently don't maintain the dependency manifests yet. This will be handled after Apache Spark 3.2.0 release.

### Why are the changes needed?

To test Scala 2.13 with `dev/run-tests.py`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual. The following is the result. Note that this PR aims to **run** Scala 2.13 tests instead of **passing** them. We will have daily GitHub Action job via #33358 and will fix UT failures if exists.
```
$ dev/change-scala-version.sh 2.13

$ SCALA_PROFILE=scala2.13 dev/run-tests.py
...
========================================================================
Running Scala style checks
========================================================================
[info] Checking Scala style using SBT with these profiles:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pkubernetes -Phadoop-cloud -Phive -Phive-thriftserver -Pyarn -Pmesos -Pdocker-integration-tests -Pkinesis-asl -Pspark-ganglia-lgpl
...
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test:package streaming-kinesis-asl-assembly/assembly
...

[info] Building Spark assembly using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud assembly/package
...

========================================================================
Running Java style checks
========================================================================
[info] Checking Java style using SBT with these profiles:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud
...

========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud unidoc
...

========================================================================
Running Spark unit tests
========================================================================
[info] Running Spark tests using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test
...
```

Closes #33376 from dongjoon-hyun/SPARK-36166.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: f66153d)
The file was modifieddev/run-tests-jenkins.py (diff)
The file was modifieddev/run-tests.py (diff)
The file was modifieddev/test-dependencies.sh (diff)
Commit fba61ad68bb12b22055d5d475e95e2681685eed7 by gurwls223
[SPARK-36169][SQL] Make 'spark.sql.sources.disabledJdbcConnProviderList' as a static conf (as documneted)

### What changes were proposed in this pull request?

This PR proposes to move `spark.sql.sources.disabledJdbcConnProviderList` from SQLConf to StaticSQLConf which disallows to set in runtime.

### Why are the changes needed?

It's documented as a static configuration. we should make it as a static configuration properly.

### Does this PR introduce _any_ user-facing change?

Previously, the configuration can be set to different value but not effective.
Now it throws an exception if users try to set in runtime.

### How was this patch tested?

Existing unittest was fixed. That should verify the change.

Closes #33381 from HyukjinKwon/SPARK-36169.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: fba61ad)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/ConnectionProviderSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit ad744fb4bf4790db2c500bf905f431ccf274055a by dongjoon
[SPARK-36171][BUILD] Upgrade GenJavadoc to 0.18

### What changes were proposed in this pull request?

This PR upgrades `GenJavadoc` plugin from `0.17` to `0.18`.

### Why are the changes needed?

`0.18` includes a bug fix for `Scala 2.13`.
```
This release fixes a bug (#286) with Scala 2.13.6 in relation with deprecated annotations in Scala sources leading to a NoSuchElementException in some cases.
```
https://github.com/lightbend/genjavadoc/releases/tag/v0.18

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the doc for Scala 2.13.
```
build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
```

Closes #33383 from sarutak/upgrade-genjavadoc-0.18.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: ad744fb)
The file was modifiedproject/SparkBuild.scala (diff)
Commit f06aa4a3f34d89e7821ffee19a43a02773dc52e1 by wenchen
[SPARK-35985][SQL] push partitionFilters for empty readDataSchema

this commit makes sure that for File Source V2 partition filters are
also taken into account when the readDataSchema is empty.
This is the case for queries like:

    SELECT count(*) FROM tbl WHERE partition=foo
    SELECT input_file_name() FROM tbl WHERE partition=foo

### What changes were proposed in this pull request?

As described in SPARK-35985 there is bug in the File Datasource V2 which prevents it to push down to the FileScanner for queries like the ones listed above.

### Why are the changes needed?

If partitions filters are not pushed down, the whole dataset will be scanned while only one partition is interesting.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

An extra test was added which relies on the output of explain, as is done in other places.

Closes #33191 from steven-aerts/SPARK-35985.

Authored-by: Steven Aerts <steven.aerts@airties.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: f06aa4a)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PrunePartitionSuiteBase.scala (diff)
Commit c1b3f86c58ff9a54d2d422a8332ed3e1080cc86c by wenchen
[SPARK-35710][SQL] Support DPP + AQE when there is no reused broadcast exchange

### What changes were proposed in this pull request?
This PR add the DPP + AQE support when spark can't reuse the broadcast but executing the DPP subquery is cheaper.

### Why are the changes needed?
Improve AQE + DPP

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Adding new ut

Closes #32861 from JkSelf/supportDPP3.

Lead-authored-by: Ke Jia <ke.a.jia@intel.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: c1b3f86)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/InsertAdaptiveSparkPlan.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveDynamicPruningFilters.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryAdaptiveBroadcastExec.scala (diff)
Commit f2bf8b051beb6a3f4b714c4fd6d1a5c5ac942d8e by kabhwan.opensource
[SPARK-34893][SS] Support session window natively

Introduction: this PR is the last part of SPARK-10816 (EventTime based sessionization (session window)). Please refer #31937 to see the overall view of the code change. (Note that code diff could be diverged a bit.)

### What changes were proposed in this pull request?

This PR proposes to support native session window. Please refer the comments/design doc in SPARK-10816 for more details on the rationalization and design (could be outdated a bit compared to the PR).

The definition of the boundary of "session window" is [the timestamp of start event ~ the timestamp of last event + gap duration). That said, unlike time window, session window is a dynamic window which can expand if new input row is added to the session. To handle expansion of session window, Spark defines session window per input row, and "merge" windows if they can be merged (boundaries are overlapped).

This PR leverages two different approaches on merging session windows:

1. merging session windows with Spark's aggregation logic (a variant of sort aggregation)
2. updating session window for all rows bound to the same session, and applying aggregation logic afterwards

First one is preferable as it outperforms compared to the second one, though it can be only used if merging session window can be applied altogether with aggregation. It is not applicable on all the cases, so second one is used to cover the remaining cases.

This PR also applies the optimization on merging input rows and existing sessions with retaining the order (group keys + start timestamp of session window), leveraging the fact the number of existing sessions per group key won't be huge.

The state format is versioned, so that we can bring a new state format if we find a better one.

### Why are the changes needed?

For now, to deal with sessionization, Spark requires end users to play with (flat)MapGroupsWithState directly which has a couple of major drawbacks:

1. (flat)MapGroupsWithState is lower level API and end users have to code everything in details for defining session window and merging windows
2. built-in aggregate functions cannot be used and end users have to deal with aggregation by themselves
3. (flat)MapGroupsWithState is only available in Scala/Java.

With native support of session window, end users simply use "session_window" like they use "window" for tumbling/sliding window, and leverage built-in aggregate functions as well as UDAFs to simply define aggregations.

Quoting the query example from test suite:

```
    val inputData = MemoryStream[(String, Long)]

    // Split the lines into words, treat words as sessionId of events
    val events = inputData.toDF()
      .select($"_1".as("value"), $"_2".as("timestamp"))
      .withColumn("eventTime", $"timestamp".cast("timestamp"))
      .selectExpr("explode(split(value, ' ')) AS sessionId", "eventTime")
      .withWatermark("eventTime", "30 seconds")

    val sessionUpdates = events
      .groupBy(session_window($"eventTime", "10 seconds") as 'session, 'sessionId)
      .agg(count("*").as("numEvents"))
      .selectExpr("sessionId", "CAST(session.start AS LONG)", "CAST(session.end AS LONG)",
        "CAST(session.end AS LONG) - CAST(session.start AS LONG) AS durationMs",
        "numEvents")
```

which is same as StructuredSessionization (native session window is shorter and clearer even ignoring model classes).

https://github.com/apache/spark/blob/39542bb81f8570219770bb6533c077f44f6cbd2a/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala#L66-L105

(Worth noting that the code in StructuredSessionization only works with processing time. The code doesn't consider old event can update the start time of old session.)

### Does this PR introduce _any_ user-facing change?

Yes. This PR brings the new feature to support session window on both batch and streaming query, which adds a new function "session_window" which usage is similar with "window".

### How was this patch tested?

New test suites. Also tested with benchmark code.

Closes #33081 from HeartSaVioR/SPARK-34893-SPARK-10816-PR-31570-part-5.

Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(commit: f2bf8b0)
The file was addedsql/core/src/test/scala/org/apache/spark/sql/DataFrameSessionWindowingSuite.scala
The file was addedsql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingSessionWindowSuite.scala
The file was modifiedpython/pyspark/sql/functions.pyi (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/expressions/ExpressionInfoSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StreamingSessionWindowStateManager.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/UpdatingSessionsIterator.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/UpdatingSessionsIteratorSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala (diff)
The file was modifiedpython/pyspark/sql/functions.py (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was addedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SessionWindow.scala
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
Commit 2d8d7b4aae224ee8f8cd3cea10ab5ee509b5ed2b by gurwls223
[SPARK-36160][PYTHON][DOCS] Clarifying documentation for pyspark sql/column

### What changes were proposed in this pull request?
Adapting documentation of `between`, `getField`, `dropFields` and `cast` to the corresponding scala doc

### Why are the changes needed?
Documentation clarity

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Only documentation change

Closes #33369 from dominikgehl/feature/SPARK-36160.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 2d8d7b4)
The file was modifiedpython/pyspark/sql/column.py (diff)
Commit 3218e4e14b6822e53073dbeff85c02df957957ed by dongjoon
[SPARK-36152][INFRA][TESTS] Add Scala 2.13 daily build and test GitHub Action job

### What changes were proposed in this pull request?

This PR aims to add a new GitHub Action daily workflow for Scala 2.13 build and test.

### Why are the changes needed?

Apache Spark 3.2.0 aims to support Scala 2.13 officially. We need a test coverage for master/3.2.

The following is the test result on my repository. The daily schedule triggered correctly for both master/3.2 branches.

- https://github.com/dongjoon-hyun/spark/actions/runs/1036083268
![Screen Shot 2021-07-15 at 10 09 22 PM](https://user-images.githubusercontent.com/9700541/125894950-a5a2ff9c-48f9-4184-913c-5422da5305a3.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is a daily job. Since there is no way to see this, I tested this in my repository first as described in the above.

Closes #33358 from dongjoon-hyun/SPARK-36152.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 3218e4e)
The file was added.github/workflows/build_and_test_scala213_daily.yml
Commit 37dc3f9ea724379d7d183caddc19b47bf9c7f4c2 by viirya
[SPARK-36128][SQL] Apply spark.sql.hive.metastorePartitionPruning for non-Hive tables that uses Hive metastore for partition management

### What changes were proposed in this pull request?

In `CatalogFileIndex.filterPartitions`, check the config `spark.sql.hive.metastorePartitionPruning` and don't pushdown predicates to remote HMS if it is false. Instead, fallback to the `listPartitions` API and do the filtering on the client side.

### Why are the changes needed?

Currently the config `spark.sql.hive.metastorePartitionPruning` is only effective for Hive tables, and for non-Hive tables we'd always use the `listPartitionsByFilter` API from HMS client. On the other hand, by default all data source tables also manage their partitions through HMS, when the config `spark.sql.hive.manageFilesourcePartitions` is turned on. Therefore, it seems reasonable to extend the above config for non-Hive tables as well.

In certain cases the remote HMS service could throw exceptions when using the `listPartitionsByFilter` API, which, on the Spark side, is unrecoverable at the current state. Therefore it would be better to allow users to disable the API by using the above config.

For instance, HMS only allow pushdown date column when direct SQL is used instead of JDO for interacting with the underlying RDBMS, and will throw exception otherwise. Even though the Spark Hive client will attempt to recover itself when the exception happens, it only does so when the config `hive.metastore.try.direct.sql` from remote HMS is `false`. There could be cases where the value of `hive.metastore.try.direct.sql` is true but remote HMS still throws exception.

### Does this PR introduce _any_ user-facing change?

Yes now the config `spark.sql.hive.metastorePartitionPruning` is extended for non-Hive tables which use HMS to manage their partition metadata.

### How was this patch tested?

Added a new unit test:
```
build/sbt "hive/testOnly *PruneFileSourcePartitionsSuite -- -z SPARK-36128"
```

Closes #33348 from sunchao/SPARK-36128-by-filter.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 37dc3f9)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala (diff)
Commit 4783fb72aff4a550ca0cc680dcc0d730e3e36dac by mridulatgmail.com
[SPARK-35276][CORE] Calculate checksum for shuffle data and write as checksum file

### What changes were proposed in this pull request?

This is the initial work of add checksum support of shuffle. This is a piece of https://github.com/apache/spark/pull/32385. And this PR only adds checksum functionality at the shuffle writer side.

Basically, the idea is to wrap a `MutableCheckedOutputStream`* upon the `FileOutputStream` while the shuffle writer generating the shuffle data. But the specific wrapping places are a bit different among the shuffle writers due to their different implementation:

* `BypassMergeSortShuffleWriter` -  wrap on each partition file
* `UnsafeShuffleWriter` - wrap on each spill files directly since they doesn't require aggregation, sorting
* `SortShuffleWriter` - wrap on the `ShufflePartitionPairsWriter` after merged spill files since they might require aggregation, sorting

\* `MutableCheckedOutputStream` is a variant of `java.util.zip.CheckedOutputStream` which can change the checksum calculator at runtime.

And we use the `Adler32`, which uses the CRC-32 algorithm but much faster, to calculate the checksum as the same as `Broadcast`'s checksum.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes, added a new conf: `spark.shuffle.checksum`.

### How was this patch tested?

Added unit tests.

Closes #32401 from Ngone51/add-checksum-files.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: 4783fb7)
The file was modifiedcore/src/test/scala/org/apache/spark/shuffle/sort/IndexShuffleBlockResolverSuite.scala (diff)
The file was modifiedcore/src/main/java/org/apache/spark/shuffle/api/ShuffleMapOutputWriter.java (diff)
The file was modifiedcore/src/test/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriterSuite.java (diff)
The file was modifiedcore/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/shuffle/sort/SortShuffleWriterSuite.scala (diff)
The file was addedcore/src/main/scala/org/apache/spark/io/MutableCheckedOutputStream.scala
The file was modifiedcore/src/main/java/org/apache/spark/shuffle/api/ShufflePartitionWriter.java (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/storage/BlockId.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/shuffle/ShufflePartitionPairsWriter.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was addedcore/src/main/java/org/apache/spark/shuffle/checksum/ShuffleChecksumHelper.java
The file was addedcore/src/test/scala/org/apache/spark/shuffle/ShuffleChecksumTestHelper.scala
The file was modifiedcore/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java (diff)
The file was modifiedcore/src/main/java/org/apache/spark/shuffle/sort/io/LocalDiskSingleSpillMapOutputWriter.java (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/shuffle/sort/io/LocalDiskShuffleMapOutputWriterSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriterSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala (diff)
The file was modifiedcore/src/main/java/org/apache/spark/shuffle/api/SingleSpillShuffleMapOutputWriter.java (diff)
The file was modifiedcore/src/main/java/org/apache/spark/shuffle/sort/io/LocalDiskShuffleMapOutputWriter.java (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/util/collection/ExternalSorterSuite.scala (diff)
The file was modifiedcore/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala (diff)
The file was modifiedproject/MimaExcludes.scala (diff)
Commit 6d2cbadcfe3b4badad9400c3bedf697ea6196ffa by mridulatgmail.com
[SPARK-32922][SHUFFLE][CORE][FOLLOWUP] Fixes few issues when the executor tries to fetch push-merged blocks

### What changes were proposed in this pull request?
Below 2 bugs were introduced with https://github.com/apache/spark/pull/32140
1. Instead of requesting the local-dirs for push-merged-local blocks from the ESS, `PushBasedFetchHelper` requests it from other executors. Push-based shuffle is only enabled when the ESS is enabled so it should always fetch the dirs from the ESS and not from other executors which is not yet supported.
2. The size of the push-merged blocks is logged incorrectly.

### Why are the changes needed?
This fixes the above mentioned bugs and is needed for push-based shuffle to work properly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested this by running an application on the cluster. The UTs mock the call `hostLocalDirManager.getHostLocalDirs` which is why didn't catch (1) with the UT. However, the fix is trivial and checking this in the UT will require a lot more effort so I haven't modified it in the UT.
Logs of the executor with the bug
```
21/07/15 15:42:46 WARN ExternalBlockStoreClient: Error while trying to get the host local dirs for [shuffle-push-merger]
21/07/15 15:42:46 WARN PushBasedFetchHelper: Error while fetching the merged dirs for push-merged-local blocks: shuffle_0_-1_13. Fetch the original blocks instead
java.lang.RuntimeException: java.lang.IllegalStateException: Invalid executor id: shuffle-push-merger, expected 92.
at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:130)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
```
After the fix, the executors were able to fetch the local push-merged blocks.

Closes #33378 from otterc/SPARK-32922-followup.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: 6d2cbad)
The file was modifiedcore/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/storage/PushBasedFetchHelper.scala (diff)
Commit 8009f0dd92adc62916c1b89d995a7d6b0c65e9e7 by dongjoon
[SPARK-35785][SS][FOLLOWUP] Remove ignored test from RocksDBSuite

### What changes were proposed in this pull request?

This patch removes an ignored test from `RocksDBSuite`.

### Why are the changes needed?

The removed test is now ignored. The test itself doesn't look making sense. For example, the condition for capturing exception is never matched. The test runs updates to RocksDB instances at same remote dir with same versions. This doesn't look like a case it will run through in practice.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33401 from viirya/remove-ignore-test.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 8009f0d)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala (diff)
Commit 71ea25d4f567c7ef598aa87b6e4a091a3119fead by max.gekk
[SPARK-36170][SQL] Change quoted interval literal (interval constructor) to be converted to ANSI interval types

### What changes were proposed in this pull request?

This PR changes the behavior of the quoted interval literals like `SELECT INTERVAL '1 year 2 month'` to be converted to ANSI interval types.

### Why are the changes needed?

The tnit-to-unit interval literals and the unit list interval literals are converted to ANSI interval types but quoted interval literals are still converted to CalendarIntervalType.

```
-- Unit list interval literals
spark-sql> select interval 1 year 2 month;
1-2
-- Quoted interval literals
spark-sql> select interval '1 year 2 month';
1 years 2 months
```

### Does this PR introduce _any_ user-facing change?

Yes but the following sentence in `sql-migration-guide.md` seems to cover this change.
```
  - In Spark 3.2, the unit list interval literals can not mix year-month fields (YEAR and MONTH) and day-time fields (WEEK, DAY, ..., MICROSECOND).
For example, `INTERVAL 1 day 1 hour` is invalid in Spark 3.2. In Spark 3.1 and earlier,
there is no such limitation and the literal returns value of `CalendarIntervalType`.
To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
```

### How was this patch tested?

Modified existing tests and add new tests.

Closes #33380 from sarutak/fix-interval-constructor.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 71ea25d)
The file was modifiedsql/core/src/test/resources/sql-tests/results/literals.sql.out (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/inputs/interval.sql (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/interval.sql.out (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/misc-functions.sql.out (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ExpressionParserSuite.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/ansi/interval.sql.out (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala (diff)