SuccessChanges

Summary

  1. [SPARK-24014][PYSPARK] Add onStreamingStarted method to (commit: 32bec6ca3d9e47587c84f928d4166475fe29f596) (details)
  2. [SPARK-24021][CORE] fix bug in BlacklistTracker's (commit: 7fb11176f285b4de47e61511c09acbbb79e5c44c) (details)
  3. [SPARK-23989][SQL] exchange should copy data before non-serialized (commit: fb968215ca014c5cf40097a3c4588bbee11e2c02) (details)
  4. [SPARK-23340][SQL][BRANCH-2.3] Upgrade Apache ORC to 1.4.3 (commit: be184d16e86f96a748d6bf1642c1c319d2a09f5c) (details)
  5. [SPARK-24022][TEST] Make SparkContextSuite not flaky (commit: 9b562d6fef765cb8357dbc31390e60b5947a9069) (details)
Commit 32bec6ca3d9e47587c84f928d4166475fe29f596 by jerryshao
[SPARK-24014][PYSPARK] Add onStreamingStarted method to
StreamingListener
## What changes were proposed in this pull request?
The `StreamingListener` on the PySpark side lacks an
`onStreamingStarted` method. This patch adds it, along with a test for
it. This patch also includes a trivial doc improvement for
`createDirectStream`.
Original PR is #21057.
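The listener hook works like the minimal pure-Python sketch below. PySpark itself is not imported here; `StartupLogger` and the manual dispatch at the end are illustrative, and only the `onStreamingStarted` method name comes from the patch:

```python
# Minimal sketch of the listener pattern SPARK-24014 fills in: before the
# patch, PySpark's StreamingListener had no onStreamingStarted hook, so the
# corresponding JVM-side event was silently dropped.

class StreamingListener:
    """No-op base class; subclasses override only the hooks they care about."""

    def onStreamingStarted(self, streamingStarted):
        pass

    def onBatchCompleted(self, batchCompleted):
        pass


class StartupLogger(StreamingListener):
    """Hypothetical listener that records whether streaming has started."""

    def __init__(self):
        self.started = False

    def onStreamingStarted(self, streamingStarted):
        self.started = True


listener = StartupLogger()
listener.onStreamingStarted(None)  # simulate the event dispatch
```

With the real API, such a listener would be registered via `StreamingContext.addStreamingListener`.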
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #21098 from viirya/SPARK-24014.
(cherry picked from commit 8bb0df2c65355dfdcd28e362ff661c6c7ebc99c0)
Signed-off-by: jerryshao <sshao@hortonworks.com>
(commit: 32bec6ca3d9e47587c84f928d4166475fe29f596)
The file was modified: python/pyspark/streaming/listener.py (diff)
The file was modified: python/pyspark/streaming/tests.py (diff)
The file was modified: python/pyspark/streaming/kafka.py (diff)
Commit 7fb11176f285b4de47e61511c09acbbb79e5c44c by irashid
[SPARK-24021][CORE] fix bug in BlacklistTracker's
updateBlacklistForFetchFailure
## What changes were proposed in this pull request?
There's a mistake in BlacklistTracker's `updateBlacklistForFetchFailure`:
```
val blacklistedExecsOnNode =
  nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
blacklistedExecsOnNode += exec
```
where the first **exec** should be **host**.
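A hedged Python analogue of the fix (the variable names mirror the Scala ones; `defaultdict` stands in for `getOrElseUpdate`):

```python
# Sketch of the SPARK-24021 fix: the per-node blacklist map must be keyed by
# the executor's host, not by the executor id itself.
from collections import defaultdict

node_to_blacklisted_execs = defaultdict(set)


def update_blacklist_for_fetch_failure(host, exec_id):
    # Before the fix the lookup key was exec_id, so executors sharing a host
    # were never grouped under that host's entry.
    node_to_blacklisted_execs[host].add(exec_id)


update_blacklist_for_fetch_failure("host1", "exec-1")
update_blacklist_for_fetch_failure("host1", "exec-2")
```

With the corrected key, both executors end up in the same per-host set, which is what the node-blacklisting logic iterates over.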
## How was this patch tested?
Adjusted an existing test.
Author: wuyi <ngone_5451@163.com>
Closes #21104 from Ngone51/SPARK-24021.
(cherry picked from commit 0deaa5251326a32a3d2d2b8851193ca926303972)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
(commit: 7fb11176f285b4de47e61511c09acbbb79e5c44c)
The file was modified: core/src/test/scala/org/apache/spark/scheduler/BlacklistTrackerSuite.scala (diff)
The file was modified: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala (diff)
Commit fb968215ca014c5cf40097a3c4588bbee11e2c02 by hvanhovell
[SPARK-23989][SQL] exchange should copy data before non-serialized
shuffle
## What changes were proposed in this pull request?
In Spark SQL, we usually reuse `UnsafeRow` instances, so the data must be
copied whenever something downstream buffers non-serialized objects.
Shuffle may buffer objects if we don't end up in the bypass-merge shuffle
or the unsafe (serialized) shuffle.
`ShuffleExchangeExec.needToCopyObjectsBeforeShuffle` misses the case
that, if `spark.sql.shuffle.partitions` is large enough, we could fail
to run unsafe shuffle and go with the non-serialized shuffle.
This bug is very hard to hit since users wouldn't set such a large
number of partitions (16 million) for a Spark SQL exchange.
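A much-simplified Python sketch of the corrected decision. The function name, the bypass-merge threshold default, and the structure are assumptions for illustration; only the 2^24 serialized-shuffle partition limit (the "16 million" above) comes from Spark's shuffle internals:

```python
# Rough sketch of the condition SPARK-23989 corrects. The serialized (unsafe)
# shuffle packs the partition id into a record pointer, which caps it at 2**24
# partitions; beyond that Spark falls back to a shuffle that buffers
# deserialized rows.
SERIALIZED_SHUFFLE_MAX_PARTITIONS = 1 << 24  # 16,777,216


def need_to_copy_objects_before_shuffle(num_partitions,
                                        bypass_merge_threshold=200):
    if num_partitions <= bypass_merge_threshold:
        # Bypass-merge shuffle serializes each row immediately; no buffering.
        return False
    if num_partitions <= SERIALIZED_SHUFFLE_MAX_PARTITIONS:
        # Unsafe shuffle also works on serialized rows; no buffering.
        return False
    # Fallback shuffle buffers deserialized rows, so reused UnsafeRow
    # instances must be copied first -- the case the old code missed.
    return True
```

The old logic effectively assumed the second branch always applied once the bypass-merge path was ruled out.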
TODO: test
## How was this patch tested?
todo.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #21101 from cloud-fan/shuffle.
(cherry picked from commit 6e19f7683fc73fabe7cdaac4eb1982d2e3e607b7)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
(commit: fb968215ca014c5cf40097a3c4588bbee11e2c02)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (diff)
Commit be184d16e86f96a748d6bf1642c1c319d2a09f5c by gatorsmile
[SPARK-23340][SQL][BRANCH-2.3] Upgrade Apache ORC to 1.4.3
## What changes were proposed in this pull request?
This PR updates Apache ORC dependencies to 1.4.3 released on February
9th. Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3
has 5 more patches (https://s.apache.org/Fll8).
Especially, the following ORC-285 is fixed at 1.4.3.
```scala
scala> val df = Seq(Array.empty[Float]).toDF()

scala> df.write.format("orc").save("/tmp/floatarray")

scala> spark.read.orc("/tmp/floatarray")
res1: org.apache.spark.sql.DataFrame = [value: array<float>]

scala> spark.read.orc("/tmp/floatarray").show()
18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.IOException: Error reading file: file:/tmp/floatarray/part-00000-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
  at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
  at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past EOF for compressed stream Stream for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
```
## How was this patch tested?
Pass the Jenkins test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #21093 from dongjoon-hyun/SPARK-23340-2.
(commit: be184d16e86f96a748d6bf1642c1c319d2a09f5c)
The file was modified: dev/deps/spark-deps-hadoop-2.7 (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala (diff)
The file was modified: pom.xml (diff)
The file was modified: dev/deps/spark-deps-hadoop-2.6 (diff)
The file was modified: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala (diff)
Commit 9b562d6fef765cb8357dbc31390e60b5947a9069 by vanzin
[SPARK-24022][TEST] Make SparkContextSuite not flaky
## What changes were proposed in this pull request?
SparkContextSuite.test("Cancelling stages/jobs with custom reasons.")
could stay in an infinite loop because of the problem found and fixed in
[SPARK-23775](https://issues.apache.org/jira/browse/SPARK-23775).
This PR fixes the flakiness by removing shared-variable usage when
cancellation happens in a loop, using `wait` and a `CountDownLatch` for
synchronization.
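Illustratively (in Python rather than the Scala test itself), replacing a polled shared variable with an explicit latch makes the wait deterministic; here `threading.Event` stands in for Java's `CountDownLatch(1)`, and the job body is hypothetical:

```python
# Sketch of the synchronization pattern: the worker signals completion via a
# latch instead of mutating a flag that the main thread busy-polls.
import threading

latch = threading.Event()  # stands in for CountDownLatch(1)
results = []


def job():
    results.append("cancelled")
    latch.set()  # signal completion; no shared flag to poll


t = threading.Thread(target=job)
t.start()
finished = latch.wait(timeout=5)  # bounded, deterministic wait
t.join()
```

A busy loop reading `results` could observe a stale value or spin forever under the race SPARK-23775 describes; the latch makes the handoff explicit.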
## How was this patch tested?
Existing unit test.
Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Closes #21105 from gaborgsomogyi/SPARK-24022.
(cherry picked from commit e55953b0bf2a80b34127ba123417ee54955a6064)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: 9b562d6fef765cb8357dbc31390e60b5947a9069)
The file was modified: core/src/test/scala/org/apache/spark/SparkContextSuite.scala (diff)