SuccessChanges

Summary

  1. [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to Text datasource on schema inferring (commit: eab10f9945c1d01daa45c233a39dedfd184f543c) (details)
  2. [PYSPARK] Update py4j to version 0.10.7. (commit: 323dc3ad02e63a7c99b5bd6da618d6020657ecba) (details)
  3. [SPARKR] Match pyspark features in SparkR communication protocol. (commit: 16cd9ac5264831e061c033b26fe1173ebc88e5d1) (details)
  4. [SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics" (commit: 4c49b12da512ae29e2e4b773a334abbf6a4f08f1) (details)
  5. [SPARK-10878][CORE] Fix race condition when multiple clients resolve artifacts at the same time (commit: 414e4e3d70caa950a63fab1c8cac3314fb961b0c) (details)
Commit eab10f9945c1d01daa45c233a39dedfd184f543c by hyukjinkwon
[SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to
Text datasource on schema inferring
## What changes were proposed in this pull request?
While reading CSV or JSON files, DataFrameReader's options are converted
to Hadoop parameters, for example here:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L302
but the options are not propagated to the Text datasource on schema
inferring, for instance:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L184-L188
The PR propagates the user's options to the Text datasource on schema
inferring in the same way the options are converted to Hadoop parameters
when a schema is specified.
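The gist of the fix, as a minimal sketch (the helper name is hypothetical, not the actual Spark internals): fold the reader options into the Hadoop `Configuration` that the Text datasource uses while sampling files for schema inference, mirroring what already happens when an explicit schema is given.
```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper: copy the base Hadoop configuration and overlay the
// user's DataFrameReader options, so settings such as
// "io.compression.codecs" also take effect during schema inference.
def hadoopConfWithOptions(base: Configuration,
                          options: Map[String, String]): Configuration = {
  val conf = new Configuration(base)
  options.foreach { case (k, v) => conf.set(k, v) }
  conf
}
```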
## How was this patch tested?
The changes were tested manually by using https://github.com/twitter/hadoop-lzo:
```
hadoop-lzo> mvn clean package
hadoop-lzo> ln -s ./target/hadoop-lzo-0.4.21-SNAPSHOT.jar ./hadoop-lzo.jar
```
Create two test files, in CSV and JSON format, and compress them:
```shell
$ cat test.csv
col1|col2
a|1
$ lzop test.csv
$ cat test.json
{"col1":"a","col2":1}
$ lzop test.json
```
Run `spark-shell` with hadoop-lzo:
```
bin/spark-shell --jars ~/hadoop-lzo/hadoop-lzo.jar
```
Reading compressed CSV and JSON without a schema:
```scala
spark.read
  .option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
  .option("inferSchema", true)
  .option("header", true)
  .option("sep", "|")
  .csv("test.csv.lzo")
  .show()
+----+----+
|col1|col2|
+----+----+
|   a|   1|
+----+----+
```
```scala
spark.read
  .option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
  .option("multiLine", true)
  .json("test.json.lzo")
  .printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: long (nullable = true)
```
Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>
Closes #21292 from MaxGekk/text-options-backport-v2.3.
(commit: eab10f9945c1d01daa45c233a39dedfd184f543c)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala (diff)
Commit 323dc3ad02e63a7c99b5bd6da618d6020657ecba by vanzin
[PYSPARK] Update py4j to version 0.10.7.
(cherry picked from commit cc613b552e753d03cb62661591de59e1c8d82c74)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: 323dc3ad02e63a7c99b5bd6da618d6020657ecba)
The file was modified sbin/spark-config.sh (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala (diff)
The file was added python/lib/py4j-0.10.7-src.zip
The file was modified dev/deps/spark-deps-hadoop-2.6 (diff)
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7 (diff)
The file was modified resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala (diff)
The file was modified core/pom.xml (diff)
The file was modified bin/pyspark2.cmd (diff)
The file was modified dev/run-pip-tests (diff)
The file was modified python/pyspark/rdd.py (diff)
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala (diff)
The file was modified python/docs/Makefile (diff)
The file was modified python/pyspark/sql/dataframe.py (diff)
The file was modified LICENSE (diff)
The file was modified python/README.md (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala (diff)
The file was removed python/lib/py4j-0.10.6-src.zip
The file was modified python/pyspark/worker.py (diff)
The file was modified python/pyspark/daemon.py (diff)
The file was modified core/src/main/scala/org/apache/spark/SecurityManager.scala (diff)
The file was modified python/setup.py (diff)
The file was added core/src/test/scala/org/apache/spark/security/SocketAuthHelperSuite.scala
The file was modified resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonGatewayServer.scala (diff)
The file was modified python/pyspark/java_gateway.py (diff)
The file was modified bin/pyspark (diff)
The file was modified core/src/main/scala/org/apache/spark/util/Utils.scala (diff)
The file was added core/src/main/scala/org/apache/spark/security/SocketAuthHelper.scala
The file was modified python/pyspark/context.py (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
Commit 16cd9ac5264831e061c033b26fe1173ebc88e5d1 by vanzin
[SPARKR] Match pyspark features in SparkR communication protocol.
(cherry picked from commit 628c7b517969c4a7ccb26ea67ab3dd61266073ca)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: 16cd9ac5264831e061c033b26fe1173ebc88e5d1)
The file was modified R/pkg/inst/worker/worker.R (diff)
The file was added core/src/main/scala/org/apache/spark/api/r/RBackendAuthHandler.scala
The file was modified R/pkg/R/client.R (diff)
The file was added core/src/main/scala/org/apache/spark/api/r/RAuthHelper.scala
The file was modified R/pkg/inst/worker/daemon.R (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/RRunner.scala (diff)
The file was modified R/pkg/R/deserialize.R (diff)
The file was modified core/src/main/scala/org/apache/spark/api/r/RRunner.scala (diff)
The file was modified R/pkg/R/sparkR.R (diff)
The file was modified core/src/main/scala/org/apache/spark/api/r/RBackend.scala (diff)
Commit 4c49b12da512ae29e2e4b773a334abbf6a4f08f1 by vanzin
[SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics"
## What changes were proposed in this pull request?
Sometimes "SparkListenerSuite.local metrics" test fails because the
average of executorDeserializeTime is too short. As squito suggested to
avoid these situations in one of the task a reference introduced to an
object implementing a custom Externalizable.readExternal which sleeps
1ms before returning.
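A minimal sketch of the shape of such an object (assumed form, not the exact test code): deserializing it always takes at least 1 ms, so the measured executorDeserializeTime can no longer round down to zero.
```scala
import java.io.{Externalizable, ObjectInput, ObjectOutput}

// Deserialization of this class is guaranteed to take a measurable amount
// of time, which keeps the average executorDeserializeTime above zero.
class SlowDeserializable extends Externalizable {
  def writeExternal(out: ObjectOutput): Unit = out.writeInt(1)
  def readExternal(in: ObjectInput): Unit = {
    in.readInt()
    Thread.sleep(1) // at least 1 ms per deserialization
  }
}
```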
## How was this patch tested?
With unit tests (and by checking the effect of this change on the
average with a much larger sleep time).
Author: “attilapiros” <piros.attila.zsolt@gmail.com>
Author: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Closes #21280 from attilapiros/SPARK-19181.
(cherry picked from commit 3e2600538ee477ffe3f23fba57719e035219550b)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: 4c49b12da512ae29e2e4b773a334abbf6a4f08f1)
The file was modified core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala (diff)
Commit 414e4e3d70caa950a63fab1c8cac3314fb961b0c by vanzin
[SPARK-10878][CORE] Fix race condition when multiple clients resolve
artifacts at the same time
## What changes were proposed in this pull request?
When multiple clients attempt to resolve artifacts via the `--packages`
parameter, they can run into a race condition as they each attempt to
modify the dummy `org.apache.spark-spark-submit-parent-default.xml` file
created in the default ivy cache dir. This PR changes the behavior to
encode a UUID in the dummy module descriptor, so each client operates on
a different resolution file in the ivy cache dir. In addition, this
patch changes when and which resolution files are cleaned up, to prevent
accumulation of resolution files in the default ivy cache dir.
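A hedged sketch of the approach (the exact coordinates used by SparkSubmit may differ): embed a fresh UUID in the dummy Ivy module's name, so concurrent clients never contend on the same resolution file.
```scala
import java.util.UUID
import org.apache.ivy.core.module.id.ModuleRevisionId

// Each spark-submit invocation gets its own dummy module descriptor,
// e.g. "spark-submit-parent-<uuid>", instead of the shared "-default" one.
def uniqueDummyModule(): ModuleRevisionId =
  ModuleRevisionId.newInstance(
    "org.apache.spark",
    s"spark-submit-parent-${UUID.randomUUID()}",
    "1.0")
```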
Since this PR is a successor of #18801, it closes #18801. Much of the
code was ported from #18801. **A lot of effort went into that work; this
PR should be credited to Victsm.**
## How was this patch tested?
Added a unit test to `SparkSubmitUtilsSuite`.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #21251 from kiszk/SPARK-10878.
(cherry picked from commit d3c426a5b02abdec49ff45df12a8f11f9e473a88)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: 414e4e3d70caa950a63fab1c8cac3314fb961b0c)
The file was modified core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/SparkSubmitUtilsSuite.scala (diff)