1. [SPARK-25231] Fix synchronization of executor heartbeat receiver in (commit: 31e46ec60849d315a4e83e0a332606a4405907ad) (details)
Commit 31e46ec60849d315a4e83e0a332606a4405907ad by tgraves
[SPARK-25231] Fix synchronization of executor heartbeat receiver in
Running a large Spark job with speculation turned on was causing
executor heartbeats to time out on the driver end after sometime and
eventually, after hitting the max number of executor failures, the job
would fail.
## What changes were proposed in this pull request?
The main reason for the heartbeat timeouts was that the
heartbeat-receiver-event-loop-thread was blocked waiting on the
TaskSchedulerImpl object which was being held by one of the
dispatcher-event-loop threads executing the method
dequeueSpeculativeTasks() in TaskSetManager.scala. On further analysis
of the heartbeat receiver method executorHeartbeatReceived() in
TaskSchedulerImpl class, we found out that instead of waiting to acquire
the lock on the TaskSchedulerImpl object, we can remove that lock and
make the operations to the global variables inside the code block to be
atomic. The block of code in that method only uses  one global HashMap
taskIdToTaskSetManager. Making that map a ConcurrentHashMap, we are
ensuring atomicity of operations and speeding up the heartbeat receiver
thread operation.
## How was this patch tested?
Screenshots of the thread dump have been attached below:
<img width="1409" alt="screen shot 2018-08-24 at 9 19 57 am"
<img width="1409" alt="screen shot 2018-08-24 at 9 21 56 am"
Closes #22221 from pgandhi999/SPARK-25231.
Authored-by: pgandhi <> Signed-off-by: Thomas Graves
(cherry picked from commit 559b899aceb160fcec3a57109c0b60a0ae40daeb)
Signed-off-by: Thomas Graves <>
(commit: 31e46ec60849d315a4e83e0a332606a4405907ad)
The file was modifiedcore/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala (diff)