SuccessChanges

Summary

  1. [SPARK-34737][SQL] Cast input float to double in `TIMESTAMP_SECONDS` (commit: 7aaed76) (details)
  2. [SPARK-34722][CORE][SQL][TEST] Clean up deprecated API usage related to JUnit4 (commit: e757091) (details)
  3. [SPARK-34729][SQL] Faster execution for broadcast nested loop join (left semi/anti with no condition) (commit: a0f3b72) (details)
  4. [SPARK-34639][SQL] Always remove unnecessary Alias in Analyzer.resolveExpression (commit: be888b2) (details)
  5. [SPARK-34743][SQL][TESTS] ExpressionEncoderSuite should use deepEquals when we expect `array of array` (commit: 363a7f0) (details)
  6. [SPARK-34739][SQL] Support add/subtract of a year-month interval to/from a timestamp (commit: 9809a2f) (details)
  7. [MINOR][SQL] Remove unused variable in NewInstance.constructor (commit: 0a70dff) (details)
  8. [SPARK-21449][SPARK-23745][SQL] Add ShutdownHook to close HiveClient's SessionState to delete residual dirs (commit: 202529e) (details)
  9. [SPARK-34729][SQL][FOLLOWUP] Broadcast nested loop join to use executeTake instead of execute (commit: bb05dc9) (details)
  10. [SPARK-34752][BUILD] Bump Jetty to 9.4.37 to address CVE-2020-27223 (commit: 4a6f534) (details)
  11. Revert "[SPARK-33428][SQL] Conv UDF use BigInt to avoid Long value overflow" (commit: cef6650) (details)
  12. [SPARK-34504][SQL] Avoid unnecessary resolving of SQL temp views for DDL commands (commit: af55373) (details)
Commit 7aaed76125c82aff8683fe319f8047c2cb87afdd by gurwls223
[SPARK-34737][SQL] Cast input float to double in `TIMESTAMP_SECONDS`

### What changes were proposed in this pull request?
In the PR, I propose to cast the input float to double in the `SecondsToTimestamp` expression in the same way as in the `Cast` expression.

### Why are the changes needed?
To have the same results from `CAST(<float> AS TIMESTAMP)` and from `TIMESTAMP_SECONDS`:
```sql
spark-sql> SELECT CAST(16777215.0f AS TIMESTAMP);
1970-07-14 07:20:15
spark-sql> SELECT TIMESTAMP_SECONDS(16777215.0f);
1970-07-14 07:20:14.951424
```
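The discrepancy comes from doing the seconds-to-microseconds arithmetic in float precision. A minimal Scala sketch of the effect (illustrative only, not Spark's implementation):
```scala
// A float has a 24-bit mantissa, so multiplying the seconds value by 1e6 in
// float precision rounds away the low microsecond digits; widening to double
// first keeps the value exact.
val seconds = 16777215.0f
val microsFromFloat  = (seconds * 1e6f).toLong           // 16777214951424 -> ...07:20:14.951424
val microsFromDouble = (seconds.toDouble * 1e6d).toLong  // 16777215000000 -> ...07:20:15
```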

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```sql
spark-sql> SELECT TIMESTAMP_SECONDS(16777215.0f);
1970-07-14 07:20:15
```

### How was this patch tested?
By running new test:
```
$ build/sbt "test:testOnly *DateExpressionsSuite"
```

Closes #31831 from MaxGekk/adjust-SecondsToTimestamp.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 7aaed76)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
Commit e7570918207ffad0964a15133fe625ae1e2247b1 by dhyun
[SPARK-34722][CORE][SQL][TEST] Clean up deprecated API usage related to JUnit4

### What changes were proposed in this pull request?
The main change of this pr as follows:

- Use the `org.junit.Assert.assertThrows(String, Class, ThrowingRunnable)` method instead of `ExpectedException.none()`
- Use the `org.hamcrest.MatcherAssert.assertThat()` method instead of `org.junit.Assert.assertThat(T, org.hamcrest.Matcher<? super T>)` (see the sketch after this list)
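
For illustration, a minimal Scala sketch of the two replacements (the assertion bodies are hypothetical, not taken from the changed tests):
```scala
import org.junit.Assert.assertThrows
import org.hamcrest.MatcherAssert.assertThat
import org.hamcrest.CoreMatchers.equalTo

// Old style: an ExpectedException.none() rule plus expect(); new style: assertThrows.
assertThrows(classOf[IllegalArgumentException], () => require(false, "boom"))

// Old style: the deprecated org.junit.Assert.assertThat; new style: MatcherAssert.assertThat.
assertThat("spark", equalTo("spark"))
```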

### Why are the changes needed?
Clean up deprecated API usage

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins or GitHub Actions builds.

Closes #31815 from LuciferYang/SPARK-34722.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: e757091)
The file was modified core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java (diff)
The file was modified common/network-common/src/test/java/org/apache/spark/network/client/TransportClientFactorySuite.java (diff)
The file was modified sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java (diff)
The file was modified core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java (diff)
The file was modified common/network-common/src/test/java/org/apache/spark/network/crypto/TransportCipherSuite.java (diff)
The file was modified launcher/src/test/java/org/apache/spark/launcher/SparkSubmitCommandBuilderSuite.java (diff)
Commit a0f3b72e1c58fe66c122d7108fb8a076f068e171 by dhyun
[SPARK-34729][SQL] Faster execution for broadcast nested loop join (left semi/anti with no condition)

### What changes were proposed in this pull request?

For `BroadcastNestedLoopJoinExec` left semi and left anti joins without a condition, if we broadcast the left side, we currently check whether every row from the broadcast side has a match by [iterating the broadcast side many times](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L256-L275). This is unnecessary and very inefficient when there is no condition, as we only need to check whether the stream side is empty. This PR adds that optimization, which can greatly improve the execution performance of the affected queries.

In addition, create a common method `getMatchedBroadcastRowsBitSet()` shared by several methods.
Refactor `defaultJoin()` to move
* left semi and left anti join related logic to `leftExistenceJoin`
* existence join related logic to `existenceJoin`.

After this, `defaultJoin()` holds logic only for outer joins (left outer, right outer and full outer), which in my opinion is much easier to read.
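
A minimal sketch of the no-condition fast path (names are illustrative, not Spark's internals): when the left side is broadcast, every broadcast row matches exactly when the stream side is non-empty.
```scala
// Left semi keeps all broadcast rows iff the stream side has at least one row;
// left anti keeps them iff the stream side is empty. No per-row iteration needed.
def leftExistenceJoinNoCondition[T](
    broadcastRows: Seq[T],
    streamSideIsEmpty: Boolean,
    isSemiJoin: Boolean): Seq[T] = {
  if (isSemiJoin != streamSideIsEmpty) broadcastRows else Seq.empty
}
```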

### Why are the changes needed?

Improve the affected query performance a lot.
Test with a simple query by modifying `JoinBenchmark.scala` locally:

```
val N = 20 << 20
val M = 1 << 4
val dim = broadcast(spark.range(M).selectExpr("id as k"))
val df = dim.join(spark.range(N), Seq.empty, "left_semi")
df.noop()
```

This shows a >30x run-time improvement. Note the stream side is only `spark.range(N)`; for a complicated query with a non-trivial stream side, the savings would be much greater.

```
Running benchmark: broadcast nested loop left semi join
  Running case: broadcast nested loop left semi join optimization off
  Stopped after 2 iterations, 3163 ms
  Running case: broadcast nested loop left semi join optimization on
  Stopped after 5 iterations, 366 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
broadcast nested loop left semi join:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------------
broadcast nested loop left semi join optimization off           1568           1582          19         13.4          74.8       1.0X
broadcast nested loop left semi join optimization on              46             73          18        456.0           2.2      34.1X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `ExistenceJoinSuite.scala`.

Closes #31821 from c21/nested-join.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: a0f3b72)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/ExistenceJoinSuite.scala (diff)
Commit be888b27edfbb0d7ebb2265de1bf74acb8d3d09a by wenchen
[SPARK-34639][SQL] Always remove unnecessary Alias in Analyzer.resolveExpression

### What changes were proposed in this pull request?

In `Analyzer.resolveExpression`, we have a parameter that decides whether we should remove unnecessary `Alias` or not. This is overcomplicated, and we can always remove unnecessary `Alias`.

This PR simplifies this part and removes the parameter.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #31758 from cloud-fan/resolve.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: be888b2)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/struct.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala (diff)
Commit 363a7f0722522898337b926224bf46bb9c32176c by dhyun
[SPARK-34743][SQL][TESTS] ExpressionEncoderSuite should use deepEquals when we expect `array of array`

### What changes were proposed in this pull request?

This PR aims to make `ExpressionEncoderSuite` to use `deepEquals` instead of `equals` when `input` is `array of array`.

This comparison code itself was added by SPARK-11727 at Apache Spark 1.6.0.

### Why are the changes needed?

Currently, the interpreted mode fails for `array of array` because the following line is used.
```
Arrays.equals(b1.asInstanceOf[Array[AnyRef]], b2.asInstanceOf[Array[AnyRef]])
```
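
A minimal Scala illustration of the difference (hypothetical values, not from the suite): `Arrays.equals` compares nested arrays by reference, while `Arrays.deepEquals` compares their contents.
```scala
import java.util.Arrays

val a = Array(Array(1, 2), Array(3))
val b = Array(Array(1, 2), Array(3))

// equals uses each element's own equals(), which is reference equality for arrays.
Arrays.equals(a.asInstanceOf[Array[AnyRef]], b.asInstanceOf[Array[AnyRef]])      // false
// deepEquals recurses into nested arrays and compares them element-wise.
Arrays.deepEquals(a.asInstanceOf[Array[AnyRef]], b.asInstanceOf[Array[AnyRef]])  // true
```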

### Does this PR introduce _any_ user-facing change?

No. This is a test-only PR.

### How was this patch tested?

Pass the existing CIs.

Closes #31837 from dongjoon-hyun/SPARK-34743.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 363a7f0)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala (diff)
Commit 9809a2f1c5187205c81542dbdc84b71db535f6e1 by max.gekk
[SPARK-34739][SQL] Support add/subtract of a year-month interval to/from a timestamp

### What changes were proposed in this pull request?
Support `timestamp +/- year-month interval`. In the PR, I propose to introduce a new binary expression `TimestampAddYMInterval`, similar to `DateAddYMInterval`. It invokes the new method `timestampAddMonths` from `DateTimeUtils`, passing a timestamp as an offset in microseconds since the epoch, the number of months from the given year-month interval, and the time zone ID in which the operation is performed. The `timestampAddMonths()` method converts the input microseconds to a local timestamp, adds the months to it, and converts the result back to an instant in microseconds at the given time zone.
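
A hedged sketch of that conversion round-trip in plain Scala (illustrative only; Spark's `DateTimeUtils` implementation may differ in details):
```scala
import java.time.{Instant, ZoneId}
import java.time.temporal.ChronoUnit

// Epoch micros -> local timestamp at the zone -> add months -> back to epoch micros.
def timestampAddMonths(micros: Long, months: Int, zoneId: ZoneId): Long = {
  val zoned = Instant.EPOCH.plus(micros, ChronoUnit.MICROS).atZone(zoneId)
  ChronoUnit.MICROS.between(Instant.EPOCH, zoned.plusMonths(months).toInstant)
}
```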

### Why are the changes needed?
To conform to the ANSI SQL standard, which requires supporting such operations over timestamps and intervals:
<img width="811" alt="Screenshot 2021-03-12 at 11 36 14" src="https://user-images.githubusercontent.com/1580697/111081674-865d4900-8515-11eb-86c8-3538ecaf4804.png">

### Does this PR introduce _any_ user-facing change?
It should not, since the new interval types have not been released yet.

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
$ build/sbt "test:testOnly *DateExpressionsSuite"
$ build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #31832 from MaxGekk/timestamp-add-year-month-interval.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 9809a2f)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala (diff)
Commit 0a70dff0663320cabdbf4e4ecb771071582f9417 by dhyun
[MINOR][SQL] Remove unused variable in NewInstance.constructor

### What changes were proposed in this pull request?

This PR removes one unused variable in `NewInstance.constructor`.

### Why are the changes needed?

This looks like a leftover debugging variable from the initial commit of SPARK-23584:
- https://github.com/apache/spark/commit/1b08c4393cf48e21fea9914d130d8d3bf544061d#diff-2a36e31684505fd22e2d12a864ce89fd350656d716a3f2d7789d2cdbe38e15fbR461

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31838 from dongjoon-hyun/minor-object.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 0a70dff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala (diff)
Commit 202529ef23eb39d09f624f6ca9177535e296c58d by yumwang
[SPARK-21449][SPARK-23745][SQL] Add ShutdownHook to close HiveClient's SessionState to delete residual dirs

### What changes were proposed in this pull request?

We initialize a Hive `SessionState` to interact with the external Hive metastore server but leave it behind after we finish.

We should close the metastore client explicitly to avoid connection leaks with the HMS, and we should trigger the `SessionState` to close itself to clean up the residual dirs, fixing the issues reported by SPARK-21449 and SPARK-23745.

`hive.downloaded.resources.dir` contains transient files, such as UDF jars, that will not be used anymore after the Spark application exits.
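
A minimal sketch of the approach (illustrative; the real change likely registers the hook through Spark's `ShutdownHookManager` utility rather than a raw JVM hook):
```scala
import org.apache.hadoop.hive.ql.session.SessionState

// Register a shutdown hook that closes the SessionState, which releases the
// metastore connection and deletes the session's downloaded-resources dir.
def registerSessionStateCloser(state: SessionState): Unit = {
  sys.addShutdownHook {
    try state.close()
    catch { case e: Exception => System.err.println(s"Failed to close SessionState: $e") }
  }
}
```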

### Why are the changes needed?

1. prevent potential metastore client leak

2. clean `hive.downloaded.resources.dir`

```
    DOWNLOADED_RESOURCES_DIR("hive.downloaded.resources.dir", "${system:java.io.tmpdir}" + File.separator + "${hive.session.id}_resources", "Temporary local directory for added resources in the remote file system."),

```
### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passing Jenkins and verified locally.

Closes #31833 from yaooqinn/SPARK-21449-2.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(commit: 202529e)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (diff)
Commit bb05dc91f0b161140d0a4b17935eaad30382a7b3 by dhyun
[SPARK-34729][SQL][FOLLOWUP] Broadcast nested loop join to use executeTake instead of execute

### What changes were proposed in this pull request?

This is a minor follow-up change from https://github.com/apache/spark/pull/31821#discussion_r594110622, where we switch from `execute()` to `executeTake()`. Performance-wise there is no difference; we are just using a different API to align with the code path of `Dataset`.
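
For illustration, a one-liner sketch of the emptiness check (the method and parameter names are assumed, not from the PR):
```scala
import org.apache.spark.sql.execution.SparkPlan

// Fetch at most one row instead of executing the whole stream-side plan.
def streamSideIsEmpty(streamed: SparkPlan): Boolean =
  streamed.executeTake(1).isEmpty
```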

### Why are the changes needed?

To align with other code paths in SQL/Dataset.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests, the same as in https://github.com/apache/spark/pull/31821.

Closes #31845 from c21/join-followup.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: bb05dc9)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala (diff)
Commit 4a6f5340ae4ff680da57b8c3410cd0d9b6c7f933 by gurwls223
[SPARK-34752][BUILD] Bump Jetty to 9.4.37 to address CVE-2020-27223

### What changes were proposed in this pull request?
Upgrade Jetty version from `9.4.36.v20210114` to `9.4.37.v20210219`.

### Why are the changes needed?
Current Jetty version is vulnerable to [CVE-2020-27223](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27223), see [Veracode](https://www.sourceclear.com/vulnerability-database/security/denial-of-servicedos/java/sid-29523) for more details.

### Does this PR introduce _any_ user-facing change?
No, minor Jetty version change. Release notes can be found [here](https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.37.v20210219).

### How was this patch tested?
Will let GitHub run the unit tests.

Closes #31846 from xkrogen/xkrogen-SPARK-34752-jetty-upgrade-cve.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 4a6f534)
The file was modified pom.xml (diff)
Commit cef665004847c4cc2c5b0be9ef29ea5510c0922e by wenchen
Revert "[SPARK-33428][SQL] Conv UDF use BigInt to avoid  Long value overflow"

This reverts commit 5f9a7fea06cbbb6bf2b40cc9b3aa4d539c996301.
(commit: cef6650)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/NumberConverterSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala (diff)
The file was modified sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/NumberConverter.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/MathFunctionsSuite.scala (diff)
Commit af553735b10812d303de411034540e3d90199a26 by wenchen
[SPARK-34504][SQL] Avoid unnecessary resolving of SQL temp views for DDL commands

### What changes were proposed in this pull request?

DDL commands like DROP VIEW don't really need to resolve the view (parse and analyze the view SQL text); they just need to get the view metadata.

This PR fixes the rule `ResolveTempViews` to only resolve the temp view for `UnresolvedRelation`. This also fixes a bug for DROP VIEW: previously it tried to resolve the view and thus failed to drop invalid views.
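
A hedged illustration of the user-visible fix (table and view names are hypothetical):
```scala
// With an active SparkSession `spark`:
spark.sql("CREATE TABLE t(i INT) USING parquet")
spark.sql("CREATE TEMPORARY VIEW v AS SELECT * FROM t")
spark.sql("DROP TABLE t")  // v's SQL text no longer analyzes, so v is now invalid
spark.sql("DROP VIEW v")   // previously failed while resolving v; now only reads metadata
```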

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

Closes #31853 from cloud-fan/view-resolve.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: af55373)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewTestSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)