Success
Changes

Summary

  1. [SPARK-23489][SQL][TEST] HiveExternalCatalogVersionsSuite should verify (commit: 0fe53b64e9ee68956e5f5cd454942af432f58fc1)
  2. [SPARK-24166][SQL] InMemoryTableScanExec should not access SQLConf at (commit: 10e2f1fc02e1b5f794f7721466a1f755c8979e53)
  3. [SPARK-24133][SQL] Backport [] Check for integer overflows when resizing (commit: bfe50b6843938100e7bad59071b027689a22ab83)
  4. [SPARK-24169][SQL] JsonToStructs should not access SQLConf at executor (commit: 61e7bc0c145b0da3129e1dac46d72cf0db5e1d94)
  5. [SPARK-23433][CORE] Late zombie task completions update all tasksets (commit: 8509284e1ec048d5afa87d41071c0429924e45c9)
Commit 0fe53b64e9ee68956e5f5cd454942af432f58fc1 by wenchen
[SPARK-23489][SQL][TEST] HiveExternalCatalogVersionsSuite should verify
the downloaded file
## What changes were proposed in this pull request?
Although
[SPARK-22654](https://issues.apache.org/jira/browse/SPARK-22654) made
`HiveExternalCatalogVersionsSuite` download from Apache mirrors three
times, it has been flaky because it didn't verify the downloaded file.
Some Apache mirrors terminate the download abnormally, and the resulting
*corrupted* file produces the following errors.
```
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
22:46:32.700 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite:
===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.hive.HiveExternalCatalogVersionsSuite, thread names: Keep-Alive-Timer =====
*** RUN ABORTED ***
java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.0"): error=2, No such file or directory
```
Jenkins has reported this failure inconsistently, in two ways. For
example, the case above is reported as Case 2, `no failures`.
- Case 1. [Test Result (1 failure /
+1)](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389/)
- Case 2. [Test Result (no
failures)](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/)
This PR aims to make `HiveExternalCatalogVersionsSuite` more robust by
verifying the downloaded `tgz` file by extracting and checking the
existence of `bin/spark-submit`. If the file turns out to be empty or
corrupted, `HiveExternalCatalogVersionsSuite` retries the download, just
as it does after a download failure.
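
The verify-then-retry idea described above can be sketched in plain Java. This is only an illustration under stated assumptions: `DownloadVerifier`, `looksLikeSparkHome`, and `tryDownload` are hypothetical names, not code from the suite, and the actual download and `tar` extraction are abstracted behind a `Supplier` that returns the extracted directory (or `null` on failure).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

// Hypothetical helper; names are illustrative and not taken from the Spark test suite.
class DownloadVerifier {

    // An extracted distribution is accepted only if bin/spark-submit exists as a regular file.
    static boolean looksLikeSparkHome(Path sparkHome) {
        return Files.isRegularFile(sparkHome.resolve("bin").resolve("spark-submit"));
    }

    // Calls downloadAndExtract up to maxAttempts times. A null result (download failure)
    // and a directory that fails verification (corrupted archive) are handled the same
    // way: try again.
    static boolean tryDownload(Supplier<Path> downloadAndExtract, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Path home = downloadAndExtract.get();
            if (home != null && looksLikeSparkHome(home)) {
                return true;
            }
        }
        return false;
    }
}
```

Treating a failed verification exactly like a failed download keeps the retry loop in one place, which matches the behavior the description asks for.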
## How was this patch tested?
Pass the Jenkins.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #21210 from dongjoon-hyun/SPARK-23489.
(cherry picked from commit c9bfd1c6f8d16890ea1e5bc2bcb654a3afb32591)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 0fe53b64e9ee68956e5f5cd454942af432f58fc1)
The file was modified: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
Commit 10e2f1fc02e1b5f794f7721466a1f755c8979e53 by wenchen
[SPARK-24166][SQL] InMemoryTableScanExec should not access SQLConf at
executor side
## What changes were proposed in this pull request?
This PR is extracted from https://github.com/apache/spark/pull/21190 ,
to make it easier to backport.
`InMemoryTableScanExec#createAndDecompressColumn` is executed inside
`rdd.map`, so we can't access `conf.offHeapColumnVectorEnabled` there.
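
The general pattern behind this fix (read the setting once on the driver, then let the task closure capture the plain value) can be pictured with a plain-Java analogy. This is not Spark code: `ConfCaptureSketch` and `OFF_HEAP_ENABLED` are hypothetical, and the sketch assumes, for illustration only, that the session setting is visible solely on the thread that set it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;

// Plain-Java analogy; OFF_HEAP_ENABLED is a hypothetical stand-in, not Spark's SQLConf.
class ConfCaptureSketch {

    // Assumption for this sketch: only the thread that sets the value (the "driver")
    // sees it; every other thread sees the default, false.
    static final ThreadLocal<Boolean> OFF_HEAP_ENABLED = ThreadLocal.withInitial(() -> false);

    // Problematic shape: the task reads the setting on the worker thread,
    // so it observes the default instead of the driver's value.
    static boolean readInsideTask(ExecutorService pool) throws Exception {
        Callable<Boolean> task = () -> OFF_HEAP_ENABLED.get();
        return pool.submit(task).get();
    }

    // Fixed shape: read once on the calling ("driver") thread and let the
    // task capture the resulting plain boolean.
    static boolean readBeforeTask(ExecutorService pool) throws Exception {
        final boolean offHeap = OFF_HEAP_ENABLED.get();
        Callable<Boolean> task = () -> offHeap;
        return pool.submit(task).get();
    }
}
```

In the Spark case the analogous move would be to evaluate the configuration value outside the function passed to `rdd.map`, so that only the resulting value is shipped with the closure.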
## How was this patch tested?
It's tested in #21190.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #21223 from cloud-fan/minor1.
(cherry picked from commit 991b526992bcf1dc1268578b650916569b12f583)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 10e2f1fc02e1b5f794f7721466a1f755c8979e53)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
Commit bfe50b6843938100e7bad59071b027689a22ab83 by hvanhovell
[SPARK-24133][SQL] Backport [] Check for integer overflows when resizing
WritableColumnVectors
`ColumnVector`s store string data in one big byte array. Since the array
size is capped at just under Integer.MAX_VALUE, a single `ColumnVector`
cannot store more than 2GB of string data. But since the Parquet files
commonly contain large blobs stored as strings, and `ColumnVector`s by
default carry 4096 values, it's entirely possible to go past that limit.
In such cases a negative capacity is requested from
`WritableColumnVector.reserve()`. The call succeeds (requested capacity
is smaller than already allocated capacity), and consequently
`java.lang.ArrayIndexOutOfBoundsException` is thrown when the reader
actually attempts to put the data into the array.
This change introduces a simple check for integer overflow to
`WritableColumnVector.reserve()` which should help catch the error
earlier and provide more informative exception. Additionally, the error
message in `WritableColumnVector.throwUnsupportedException()` was
corrected.
New unit tests were added.
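
The failure mode and the guard can be illustrated with a simplified growable byte buffer. This is a sketch under assumptions, not the actual `WritableColumnVector` code: `GrowableBytes`, `requiredAfterAppend`, and the exception type and message are hypothetical.

```java
import java.util.Arrays;

// Hypothetical simplification of a growable byte buffer; not Spark's implementation.
class GrowableBytes {
    private byte[] data = new byte[16];

    // Plain int arithmetic: the sum wraps to a negative number once it exceeds Integer.MAX_VALUE.
    static int requiredAfterAppend(int bytesUsed, int bytesToAppend) {
        return bytesUsed + bytesToAppend;
    }

    // Without the first check, a negative request is smaller than the current capacity,
    // so reserve() would return quietly and the later write would fail with an
    // ArrayIndexOutOfBoundsException instead.
    void reserve(int requiredCapacity) {
        if (requiredCapacity < 0) {
            throw new IllegalStateException(
                "Requested capacity overflowed int: " + requiredCapacity);
        }
        if (requiredCapacity > data.length) {
            data = Arrays.copyOf(data, requiredCapacity);
        }
    }

    int capacity() {
        return data.length;
    }
}
```

For example, `requiredAfterAppend(Integer.MAX_VALUE - 10, 100)` is negative, so passing it to `reserve` fails immediately at the point where the bad size is computed rather than at a later array write.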
Author: Ala Luszczak <ala@databricks.com>
Closes #21206 from ala/overflow-reserve.
(cherry picked from commit 8bd27025b7cf0b44726b6f4020d294ef14dbbb7e)
Signed-off-by: Ala Luszczak <ala@databricks.com>
Author: Ala Luszczak <ala@databricks.com>
Closes #21227 from ala/cherry-pick-overflow-reserve.
(commit: bfe50b6843938100e7bad59071b027689a22ab83)
The file was modified: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
Commit 61e7bc0c145b0da3129e1dac46d72cf0db5e1d94 by wenchen
[SPARK-24169][SQL] JsonToStructs should not access SQLConf at executor
side
## What changes were proposed in this pull request?
This PR is extracted from #21190 , to make it easier to backport.
`JsonToStructs` can be serialized to executors and evaluated there, so
we should not call
`SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)` in the
body.
## How was this patch tested?
Tested in #21190.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #21226 from cloud-fan/minor4.
(cherry picked from commit 96a50016bb0fb1cc57823a6706bff2467d671efd)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 61e7bc0c145b0da3129e1dac46d72cf0db5e1d94)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
The file was modified: sql/core/src/test/resources/sql-tests/results/json-functions.sql.out
The file was modified: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
Commit 8509284e1ec048d5afa87d41071c0429924e45c9 by irashid
[SPARK-23433][CORE] Late zombie task completions update all tasksets
Fetch failures can lead to multiple tasksets being active for a given
stage.  While there is only one "active" version of the taskset, the
earlier attempts can still have running tasks, which can complete
successfully.  So a task completion needs to update every taskset for
the stage so that each one knows the partition is completed.  That way
the final active taskset does not try to submit another task for the
same partition, and each taskset knows when it is completed and when it
should be marked as a "zombie".
Added a regression test.
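
The bookkeeping described above can be modelled in a few lines of Java. This is a simplified sketch, not the scheduler's real API: `StageAttemptsSketch`, `TaskSetAttempt`, and their methods are hypothetical, and every attempt is assumed to cover the same set of partitions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one stage with several taskset attempts; not Spark's scheduler API.
class StageAttemptsSketch {

    static class TaskSetAttempt {
        final boolean[] partitionDone;
        boolean zombie = false; // a zombie launches no new tasks

        TaskSetAttempt(int numPartitions) {
            partitionDone = new boolean[numPartitions];
        }

        void markPartitionCompleted(int partition) {
            partitionDone[partition] = true;
            boolean allDone = true;
            for (boolean done : partitionDone) {
                allDone &= done;
            }
            if (allDone) {
                zombie = true; // nothing left to launch
            }
        }

        boolean needsTaskFor(int partition) {
            return !zombie && !partitionDone[partition];
        }
    }

    final List<TaskSetAttempt> attempts = new ArrayList<>();

    // A new attempt turns all earlier attempts into zombies; their running tasks may still finish.
    TaskSetAttempt newAttempt(int numPartitions) {
        for (TaskSetAttempt earlier : attempts) {
            earlier.zombie = true;
        }
        TaskSetAttempt attempt = new TaskSetAttempt(numPartitions);
        attempts.add(attempt);
        return attempt;
    }

    // A successful task, whichever attempt it ran in, updates every attempt of the stage.
    void onTaskSuccess(int partition) {
        for (TaskSetAttempt attempt : attempts) {
            attempt.markPartitionCompleted(partition);
        }
    }
}
```

With two attempts over two partitions, a late success for partition 0 from the first attempt means the second attempt no longer needs a task for partition 0, while partition 1 is still pending.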
Author: Imran Rashid <irashid@cloudera.com>
Closes #21131 from squito/SPARK-23433.
(cherry picked from commit 94641fe6cc68e5977dd8663b8f232a287a783acb)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
(commit: 8509284e1ec048d5afa87d41071c0429924e45c9)
The file was modified: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
The file was modified: core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
The file was modified: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala