FailedChanges

Summary

  1. [SPARK-30313][CORE] Ensure EndpointRef is available (details)
Commit 895e572b73ca2796cbc3c468bb2c21abed5b22f1 by vanzin
[SPARK-30313][CORE] Ensure EndpointRef is available
MasterWebUI/WorkerPage
### What changes were proposed in this pull request?
This patch fixes flaky tests "master/worker web ui available" &
"master/worker web ui available with reverseProxy" in MasterSuite.
Tracking back from stack trace below,
``` 19/12/19 13:48:39.160 dispatcher-event-loop-4 INFO Worker:
WorkerWebUI is available at http://localhost:8080/proxy/worker-20191219
134839-localhost-36054 19/12/19 13:48:39.296 WorkerUI-52072 WARN
JettyUtils: GET /json/ failed: java.lang.NullPointerException
java.lang.NullPointerException
       at
org.apache.spark.deploy.worker.ui.WorkerPage.renderJson(WorkerPage.scala:39)
       at
org.apache.spark.ui.WebUI.$anonfun$attachPage$2(WebUI.scala:91)
       at
org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80)
       at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
       at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
       at
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
       at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
       at
org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
       at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
       at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
```
there's possible race condition in `Dispatcher.registerRpcEndpoint()`:
https://github.com/apache/spark/blob/481fb63f97d87d5b2e9e1f9b30bee466605b5a72/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala#L64-L77
`getMessageLoop()` initializes a new Inbox for this endpoint for both
DedicatedMessageLoop
and SharedMessageLoop, which calls `onStart()`  "asynchronously" and
"eventually" via posting `OnStart` message. `onStart()` will initialize
UI page instance(s), so the execution of `endpointRefs.put()` and
initializing UI page instance(s) are "concurrent".
MasterPage and WorkerPage retrieve endpoint ref and store it as "val"
assuming endpoint ref is valid when they're initialized - so in bad case
they could store "null" as endpoint ref, and don't change.
https://github.com/apache/spark/blob/481fb63f97d87d5b2e9e1f9b30bee466605b5a72/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala#L33-L38
https://github.com/apache/spark/blob/481fb63f97d87d5b2e9e1f9b30bee466605b5a72/core/src/main/scala/org/apache/spark/deploy/worker/ui/WorkerPage.scala#L35-L41
This patch breaks down the step to `find the right message loop` and
`register endpoint to message loop`, and ensure endpoint ref is set
"before" registering endpoint to message loop.
### Why are the changes needed?
We observed the test failures from Jenkins; below are the links:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115583/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115700/testReport/
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
You can also reproduce the bug consistently via adding
`Thread.sleep(1000)` just before `endpointRefs.put(endpoint,
endpointRef)` in `Dispatcher.registerRpcEndpoint(...)`.
Closes #27010 from HeartSaVioR/SPARK-30313.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
The file was modifiedcore/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala (diff)