Success
Changes

Summary

  1. [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump more (commit: 9ac9f36c48391e4f2c1c32747bd2ad94a1b21c08)
  2. [SPARK-25253][PYSPARK] Refactor local connection & auth code (commit: a2a54a5f49364a1825932c9f04eb0ff82dd7d465)
  3. [PYSPARK] Updates to pyspark broadcast (commit: 09dd34cb1706f2477a89174d6a1a0f17ed5b0a65)
  4. [PYSPARK][SQL] Updates to RowQueue (commit: 6d742d1bd71aa3803dce91a830b37284cb18cf70)
  5. [CORE] Updates to remote cache reads (commit: 575fea120e25249716e3f680396580c5f9e26b5b)
  6. [HOTFIX] fix lint-java (commit: f3bbb7ceb9ae8038c3612f1fe5b8b44f0652711a)
  7. [SPARK-25400][CORE][TEST] Increase test timeouts (commit: 0c1e3d109735b802172a2e5e79015597b02ff663)
Commit 9ac9f36c48391e4f2c1c32747bd2ad94a1b21c08 by wenchen
[SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump more
information like file path to event log
## What changes were proposed in this pull request?
The metadata field was removed from SparkPlanInfo in #18600. Correspondingly,
much metadata was also dropped from the SparkListenerSQLExecutionStart event in
the Spark event log. If we want to analyze the event log to get all input
paths, we can no longer recover them; the simpleString of SparkPlanInfo in the
JSON is truncated to 100 characters, so it does not help.
Before 2.3, the SparkListenerSQLExecutionStart fragment in the event log looked
like the example below (it contains the metadata field with the complete
information):
>{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart",
Location:
InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4...,
"metadata": {"Location":
"InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4/test5/snapshot/dt=20180904]","ReadSchema":"struct<snpsht_start_dt:date,snpsht_end_dt:date,am_ntlogin_name:string,am_first_name:string,am_last_name:string,isg_name:string,CRE_DATE:date,CRE_USER:string,UPD_DATE:timestamp,UPD_USER:string>"}
After #18600, the metadata field is gone:
>{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart",
Location:
InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4...,
This change adds the field back to the SparkPlanInfo class so that the metadata
is written to the event log again. Complete information in the event log is
very useful for offline job analysis.
## How was this patch tested?
Unit test.
Closes #22353 from LantaoJin/SPARK-25357.
Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(cherry picked from commit 6dc5921e66d56885b95c07e56e687f9f6c1eaca7)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 9ac9f36c48391e4f2c1c32747bd2ad94a1b21c08)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanInfo.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SparkPlanSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SQLJsonProtocolSuite.scala (diff)
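For readers mining event logs for input paths, the following standalone Scala sketch (not part of the patch; the file layout and the json4s dependency are assumptions, though the "Location" field name comes from the fragment quoted above) shows one way the restored metadata could be consumed:

  import scala.io.Source
  import org.json4s._
  import org.json4s.jackson.JsonMethods.parse

  // Hypothetical sketch: scan an event log for SparkListenerSQLExecutionStart events
  // and print every "Location" entry found in the metadata maps of the plan nodes.
  // Assumes json4s (bundled with Spark) is on the classpath.
  object EventLogLocations {
    def main(args: Array[String]): Unit = {
      val source = Source.fromFile(args(0)) // event log file: one JSON event per line
      try {
        source.getLines()
          .filter(_.contains("SparkListenerSQLExecutionStart"))
          .foreach { line =>
            // filterField walks the whole JSON tree, so nested plan nodes are covered.
            val locations = parse(line).filterField {
              case ("Location", JString(_)) => true
              case _                        => false
            }.collect { case (_, JString(loc)) => loc }
            locations.foreach(println)
          }
      } finally {
        source.close()
      }
    }
  }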
Commit a2a54a5f49364a1825932c9f04eb0ff82dd7d465 by irashid
[SPARK-25253][PYSPARK] Refactor local connection & auth code
This eliminates some duplication in the code that connects to a server on
localhost to talk directly to the JVM, and it makes IPv6 and error handling
consistent. Two other incidental changes that shouldn't matter: 1) Python
barrier tasks perform authentication immediately (rather than waiting for the
BARRIER_FUNCTION indicator); 2) for `rdd._load_from_socket`, the timeout is
only increased after authentication.
Closes #22247 from squito/py_connection_refactor.
Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by:
hyukjinkwon <gurwls223@apache.org>
(cherry picked from commit 38391c9aa8a88fcebb337934f30298a32d91596b)
(commit: a2a54a5f49364a1825932c9f04eb0ff82dd7d465)
The file was modified python/pyspark/rdd.py (diff)
The file was modified python/pyspark/worker.py (diff)
The file was modified python/pyspark/java_gateway.py (diff)
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala (diff)
The file was modified python/pyspark/broadcast.py (diff)
The file was modified core/src/test/scala/org/apache/spark/api/python/PythonRDDSuite.scala (diff)
The file was modified python/pyspark/worker.py (diff)
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala (diff)
The file was modified python/pyspark/context.py (diff)
The file was modified python/pyspark/serializers.py (diff)
The file was added python/pyspark/test_serializers.py
The file was modified dev/sparktestsupport/modules.py (diff)
The file was added python/pyspark/test_broadcast.py
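The refactor itself lives in the Python and JVM files listed above. As a rough, hypothetical illustration of the "consistent IPv6 and error handling" idea, a Scala sketch of connecting to a local server by trying every resolved loopback address in turn might look like the following (names, timeout, and structure are made up, not taken from the change):

  import java.net.{InetAddress, InetSocketAddress, Socket}
  import scala.util.control.NonFatal

  // Hypothetical helper: connect to a server on localhost, trying each resolved
  // loopback address (IPv4 and IPv6) and raising one clear error if all attempts fail.
  object LocalConnect {
    def connect(port: Int, timeoutMs: Int = 15000): Socket = {
      val addresses = InetAddress.getAllByName("localhost") // may yield 127.0.0.1 and ::1
      var lastError: Throwable = null
      for (addr <- addresses) {
        val socket = new Socket()
        try {
          socket.connect(new InetSocketAddress(addr, port), timeoutMs)
          return socket
        } catch {
          case NonFatal(e) =>
            socket.close()
            lastError = e
        }
      }
      throw new java.io.IOException(s"Could not connect to localhost:$port", lastError)
    }
  }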
Commit 6d742d1bd71aa3803dce91a830b37284cb18cf70 by irashid
[PYSPARK][SQL] Updates to RowQueue
Tested with updates to RowQueueSuite
(commit: 6d742d1bd71aa3803dce91a830b37284cb18cf70)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/python/RowQueueSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/RowQueue.scala (diff)
Commit 575fea120e25249716e3f680396580c5f9e26b5b by irashid
[CORE] Updates to remote cache reads
Covered by tests in DistributedSuite
(commit: 575fea120e25249716e3f680396580c5f9e26b5b)
The file was removed common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/TempFileManager.java
The file was modified core/src/main/scala/org/apache/spark/network/BlockTransferService.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleClient.java (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/buffer/ManagedBuffer.java (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManager.scala (diff)
The file was added common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/SimpleDownloadFile.java
The file was added common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/DownloadFileManager.java
The file was modified core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala (diff)
The file was added common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/DownloadFile.java
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/DiskStore.scala (diff)
The file was added common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/DownloadFileWritableChannel.java
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/DownloadFile.java (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/DownloadFileWritableChannel.java (diff)
Commit 0c1e3d109735b802172a2e5e79015597b02ff663 by sean.owen
[SPARK-25400][CORE][TEST] Increase test timeouts
We've seen some flakiness in Jenkins in SchedulerIntegrationSuite, which looks
like it just needs a longer timeout.
Closes #22385 from squito/SPARK-25400.
Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Sean
Owen <sean.owen@databricks.com>
(cherry picked from commit 9deddbb13edebfefb3fd03f063679ed12e73c575)
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(commit: 0c1e3d109735b802172a2e5e79015597b02ff663)
The file was modified core/src/test/scala/org/apache/spark/scheduler/BlacklistIntegrationSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala (diff)
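As a generic illustration of this kind of fix (not the actual diff, which simply raises existing timeout values in the suites listed above), a polling assertion in a recent ScalaTest can be given a larger timeout so that a slow Jenkins executor does not turn it into a flaky failure:

  import org.scalatest.concurrent.Eventually._
  import org.scalatest.funsuite.AnyFunSuite
  import org.scalatest.time.{Millis, Seconds, Span}

  // Illustrative only: the condition becomes true after ~2 seconds; a generous
  // timeout keeps the check from failing on an overloaded CI machine.
  class TimeoutExampleSuite extends AnyFunSuite {
    test("backend eventually reaches the expected state") {
      @volatile var done = false
      new Thread(() => { Thread.sleep(2000); done = true }).start()
      // Raising the timeout (say 10s -> 60s) changes only how long we wait, not what we check.
      eventually(timeout(Span(60, Seconds)), interval(Span(100, Millis))) {
        assert(done)
      }
    }
  }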