1. [SPARK-23243][CORE][2.3] Fix RDD.repartition() data correctness issue (commit: d22379ec2a01fb3aa2121c312a863f057a5761ed)
2. [SPARK-25330][BUILD][BRANCH-2.3] Revert Hadoop 2.7 to 2.7.3 (commit: 84922e506e57413a83cea4460a2a1649f2700293)
3. [SPARK-24415][CORE] Fixed the aggregated stage metrics by retaining (commit: 5b8b6b4e9e36228e993a15cab19c80e7fad43786)
Commit d22379ec2a01fb3aa2121c312a863f057a5761ed by wenchen
[SPARK-23243][CORE][2.3] Fix RDD.repartition() data correctness issue
backport to 2.3
An alternative fix for
When Spark reruns tasks for an RDD, there are 3 different behaviors:
1. determinate. Always returns the same result in the same order when rerun.
2. unordered. Returns the same data set in random order when rerun.
3. indeterminate. Returns a different result when rerun.
Normally Spark doesn't need to care about this. Spark runs stages one by
one; when a task fails, Spark just reruns it. Although the rerun task may
return a different result, users will not be surprised.
However, Spark may rerun a finished stage when seeing fetch failures.
When this happens, Spark needs to rerun all the tasks of all the
succeeding stages if the RDD output is indeterminate, because the input
of the succeeding stages has been changed.
If the RDD output is determinate, we only need to rerun the failed tasks
of the succeeding stages, because the input doesn't change.
If the RDD output is unordered, it is handled the same as determinate,
because the shuffle partitioner is always deterministic (the round-robin
partitioner is not a shuffle partitioner that extends
`org.apache.spark.Partitioner`), so the reducers will still get the same
input data set.
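The rerun rule described above can be sketched as follows. This is a minimal illustration in Python, not Spark's scheduler code: `Determinism`, `tasks_to_rerun`, and their parameters are hypothetical names introduced only for this example.

```python
from enum import Enum


class Determinism(Enum):
    """Hypothetical labels for the three rerun behaviors described above."""
    DETERMINATE = "determinate"      # same result, same order
    UNORDERED = "unordered"          # same data set, arbitrary order
    INDETERMINATE = "indeterminate"  # result may differ between runs


def tasks_to_rerun(upstream_level, all_tasks, failed_tasks):
    """Pick which tasks of a succeeding stage must rerun after its
    upstream stage has been rerun because of a fetch failure.

    upstream_level: Determinism of the rerun stage's RDD output.
    all_tasks:      every task id of the succeeding stage.
    failed_tasks:   the task ids that actually failed.
    """
    if upstream_level is Determinism.INDETERMINATE:
        # The input of the succeeding stage has changed, so results that
        # were already computed from the old input cannot be trusted.
        return set(all_tasks)
    # DETERMINATE and UNORDERED are treated alike: reducers still see the
    # same input data set, so only the failed tasks need to rerun.
    return set(failed_tasks)
```

For example, with tasks `[0, 1, 2]` and a single failed task `1`, an indeterminate upstream forces all three tasks to rerun, while a determinate or unordered upstream reruns only task `1`.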
This PR fixed the failure handling for `repartition`, to avoid
correctness issues.
`repartition` applies a stateful map function to generate a
round-robin id, which is order sensitive and makes the RDD's output
indeterminate. When a stage that contains `repartition` reruns, we must
also rerun all the tasks of all the succeeding stages.
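To see why a round-robin id is order sensitive, consider this small sketch. It is plain Python rather than Spark code, and `round_robin_partition` and `key_partition` are hypothetical helpers written only for this illustration.

```python
def round_robin_partition(records, num_partitions):
    """Assign partitions from a running counter, so the result depends on
    the order in which records arrive."""
    partitions = [[] for _ in range(num_partitions)]
    position = 0
    for record in records:
        partitions[position % num_partitions].append(record)
        position += 1
    return partitions


def key_partition(records, num_partitions):
    """Assign partitions from the record value itself (integer records
    assumed), so arrival order does not matter."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[record % num_partitions].append(record)
    return partitions


first_run = [1, 2, 3, 4]
rerun = [2, 1, 4, 3]  # same data set, different order (an "unordered" input)

rr_first = [sorted(p) for p in round_robin_partition(first_run, 2)]
rr_rerun = [sorted(p) for p in round_robin_partition(rerun, 2)]
key_first = [sorted(p) for p in key_partition(first_run, 2)]
key_rerun = [sorted(p) for p in key_partition(rerun, 2)]

print(rr_first, rr_rerun)    # partition contents differ between runs
print(key_first, key_rerun)  # partition contents are identical
```

In this example, a reducer that already read partition 0 from the first run holds records 1 and 3, and a reducer that reads partition 1 from the rerun also receives 1 and 3. Records 2 and 4 would be lost and 1 and 3 duplicated unless every succeeding task reruns.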
**future improvement:**
1. Currently we can't rollback and rerun a shuffle map stage, and just fail. We should fix it later.
2. Currently we can't rollback and rerun a result stage, and just fail. We should fix it later.
3. We should provide public API to allow users to tag the random level of the RDD's computing function.
a new test case
Closes #22354 from cloud-fan/repartition.
Authored-by: Wenchen Fan
Signed-off-by: Wenchen Fan
(commit: d22379ec2a01fb3aa2121c312a863f057a5761ed)
The file was modified core/src/main/scala/org/apache/spark/Partitioner.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/RDD.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (diff)
Commit 84922e506e57413a83cea4460a2a1649f2700293 by sean.owen
[SPARK-25330][BUILD][BRANCH-2.3] Revert Hadoop 2.7 to 2.7.3
## What changes were proposed in this pull request?
How to reproduce the permission issue:
# build spark
./dev/ --name SPARK-25330 --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tar && cd spark-2.4.0-SNAPSHOT-bin-SPARK-25330
export HADOOP_PROXY_USER=user_a
export HADOOP_PROXY_USER=user_b
bin/spark-sql
```java
Exception in thread "main" java.lang.RuntimeException: Permission denied: user=user_b, access=EXECUTE,
```
The issue occurred in this commit:
This PR reverts Hadoop 2.7 to 2.7.3 to avoid this issue.
## How was this patch tested?
Unit tests and manual tests.
Closes #22327 from wangyum/SPARK-25330.
Authored-by: Yuming Wang
Signed-off-by: Sean Owen
(cherry picked from commit b0ada7dce02d101b6a04323d8185394e997caca4)
Signed-off-by: Sean Owen
(commit: 84922e506e57413a83cea4460a2a1649f2700293)
The file was modified docs/ (diff)
The file was modified pom.xml (diff)
The file was modified assembly/README (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7 (diff)
Commit 5b8b6b4e9e36228e993a15cab19c80e7fad43786 by tgraves
[SPARK-24415][CORE] Fixed the aggregated stage metrics by retaining
stage objects in liveStages until all tasks are complete
The problem occurs because the stage object is removed from liveStages in
AppStatusListener onStageCompletion. Because of this, any onTaskEnd event
received after the onStageCompletion event does not update stage metrics.
The fix is to retain stage objects in liveStages until all tasks are
complete.
1. Fixed the reproducible example posted in the JIRA
2. Added unit test
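The idea behind the fix can be sketched in a few lines. This is a simplified Python model, not the actual `AppStatusListener` code: `StageStatus`, `StatusListener`, and the callback names and fields are assumptions made for illustration.

```python
class StageStatus:
    """Hypothetical per-stage bookkeeping kept while the stage is live."""

    def __init__(self, stage_id):
        self.stage_id = stage_id
        self.active_tasks = 0
        self.completed_tasks = 0
        self.stage_completed = False


class StatusListener:
    """Keeps a stage in live_stages until it has completed AND no task is
    still running, so late task-end events still update its metrics."""

    def __init__(self):
        self.live_stages = {}

    def on_stage_submitted(self, stage_id):
        self.live_stages[stage_id] = StageStatus(stage_id)

    def on_task_start(self, stage_id):
        self.live_stages[stage_id].active_tasks += 1

    def on_task_end(self, stage_id):
        stage = self.live_stages.get(stage_id)
        if stage is None:
            return  # stage already removed; nothing left to update
        stage.active_tasks -= 1
        stage.completed_tasks += 1
        # Remove the stage only once it is done and its last task has ended.
        if stage.stage_completed and stage.active_tasks == 0:
            del self.live_stages[stage_id]

    def on_stage_completed(self, stage_id):
        stage = self.live_stages[stage_id]
        stage.stage_completed = True
        # Removing unconditionally here would drop metrics from tasks that
        # end later; remove only if no task is still active.
        if stage.active_tasks == 0:
            del self.live_stages[stage_id]
```

A short usage example: if a stage starts two tasks, one task ends, and then the stage-completed event arrives, the stage stays in `live_stages`. When the second task ends, its metrics are still counted and only then is the stage removed.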
Closes #22209 from ankuriitg/ankurgupta/SPARK-24415.
Authored-by: ankurgupta
Signed-off-by: Marcelo Vanzin
(cherry picked from commit 39a02d8f75def7191c66d388729ba1721c92188d)
Signed-off-by: Thomas Graves
(commit: 5b8b6b4e9e36228e993a15cab19c80e7fad43786)
The file was modified streaming/src/test/scala/org/apache/spark/streaming/UISeleniumSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/status/AppStatusListener.scala (diff)