SuccessChanges

Summary

  1. [SPARK-23121][CORE] Fix for ui becoming unaccessible for long running (commit: 4e75b0cb4b575d4799c02455eed286fe971c6c50) (details)
  2. Preparing Spark release v2.3.0-rc2 (commit: 489ecb0ef23e5d9b705e5e5bae4fa3d871bdac91) (details)
  3. Preparing development version 2.3.1-SNAPSHOT (commit: 6facc7fb2333cc61409149e2f896bf84dd085fa3) (details)
  4. [MINOR] Typo fixes (commit: 566ef93a672aea1803d6977883204780c2f6982d) (details)
  5. [SPARK-22389][SQL] data source v2 partitioning reporting interface (commit: 7241556d8b550e22eed2341287812ea373dc1cb2) (details)
  6. [SPARK-22465][FOLLOWUP] Update the number of partitions of default (commit: 832d69817c29e8c44fcab0d6a476f36d4ee0c837) (details)
  7. [SPARK-20749][SQL][FOLLOW-UP] Override prettyName for bit_length and (commit: 29ed718732de40e956e37f0673743ae375cd30c5) (details)
  8. [SPARK-22735][ML][DOC] Added VectorSizeHint docs and examples. (commit: f8f522c01025e78eca1724c909c749374f855039) (details)
  9. [SPARK-23192][SQL] Keep the Hint after Using Cached Data (commit: 851c303867eb54405f6508919619debe84708933) (details)
  10. [SPARK-23195][SQL] Keep the Hint of Cached Data (commit: a23f6b13e8a4f0471ee33879a14746786bbf0435) (details)
  11. [SPARK-23197][DSTREAMS] Increased timeouts to resolve flakiness (commit: 3316a9d7104aece977384974cf61e5ec635ad350) (details)
  12. [SPARK-21727][R] Allow multi-element atomic vector as column type in (commit: 9cfe90e5abb21086711e0efb7ed08026dba96ffc) (details)
  13. Revert "[SPARK-23195][SQL] Keep the Hint of Cached Data" (commit: d656be74b87746efc020d5cae3bfa294f8f98594) (details)
  14. [SPARK-23177][SQL][PYSPARK][BACKPORT-2.3] Extract zero-parameter UDFs (commit: 84a189a3429c64eeabc7b4e8fc0488ec16742002) (details)
  15. [SPARK-23148][SQL] Allow pathnames with special characters for CSV / (commit: 17317c8fb99715836fcebc39ffb04648ab7fb762) (details)
  16. [SPARK-20906][SPARKR] Add API doc example for Constrained Logistic (commit: 4336e67f41344fd587808182741ae4ef9fb2b76a) (details)
  17. [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issues. (commit: 2221a30352f1c0f5483c91301f32e66672a43644) (details)
  18. [SPARK-22837][SQL] Session timeout checker does not work in (commit: 30272c668b2cd8c0b0ee78c600bc3feb17bd6647) (details)
  19. [SPARK-23198][SS][TEST] Fix (commit: 500c94434d8f5267b1488accd176cf54b69e6ba4) (details)
  20. [MINOR][SQL] add new unit test to LimitPushdown (commit: a857ad56621f644a26b9d27079b76ab21f3726ae) (details)
  21. [SPARK-23129][CORE] Make deserializeStream of DiskMapIterator init (commit: 012695256a61f1830ff02780611d4aada00a88a0) (details)
  22. [SPARK-23208][SQL] Fix code generation for complex create array (commit: abd3e1bb5fc11a3074ee64a5ed0dc97020dca61c) (details)
  23. [SPARK-23163][DOC][PYTHON] Sync ML Python API with Scala (commit: e66c66cd2d4f0cd67cbc2aa6f95135176f1165e4) (details)
  24. [SPARK-21717][SQL] Decouple consume functions of physical operators in (commit: c79e771f8952e6773c3a84cc617145216feddbcf) (details)
  25. [SPARK-23112][DOC] Add highlights and migration guide for 2.3 (commit: 8866f9c24673e739ee87c7341d75dd5c133a744f) (details)
  26. [SPARK-23081][PYTHON] Add colRegex API to PySpark (commit: 2f65c20ea74a87729eaf3c9b2aebcfb10c0ecf4b) (details)
  27. [SPARK-23032][SQL] Add a per-query codegenStageId to (commit: 26a8b4e398ee6d1de06a5f3ac1d6d342c9b67d78) (details)
  28. [SPARK-23205][ML] Update ImageSchema.readImages to correctly set alpha (commit: 87d128ffd4a56fe3995f041bf4a8c4cba6c20092) (details)
  29. [SPARK-23020][CORE] Fix race in SparkAppHandle cleanup, again. (commit: fdf140e25998ad45ce0f931e2e1e36bf4b382dca) (details)
  30. [SPARK-22799][ML] Bucketizer should throw exception if single- and (commit: d6cdc699e350adc6b5ed938192e0e72cff5f52d8) (details)
  31. [SPARK-22797][PYSPARK] Bucketizer support multi-column (commit: ab1b5d921b395cb7df3a3a2c4a7e5778d98e6f01) (details)
  32. [SPARK-23218][SQL] simplify ColumnVector.getArray (commit: ca3613be20ff4dc546c43322eeabf591ab8ad97f) (details)
  33. Revert "[SPARK-22797][PYSPARK] Bucketizer support multi-column" (commit: f5911d4894700eb48f794133cbd363bf3b7c8c8e) (details)
  34. [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame could lead to (commit: 30d16e116b0ff044ca03974de0f1faf17e497903) (details)
  35. [SPARK-23242][SS][TESTS] Don't run tests in KafkaSourceSuiteBase twice (commit: 7aaf23cf8ab871a8e8877ec82183656ae5f4be7b) (details)
  36. [SPARK-23214][SQL] cached data should not carry extra hint info (commit: 20c0efe48df9ce622d97cf3d1274d877c0e3095c) (details)
  37. [MINOR][SS][DOC] Fix `Trigger` Scala/Java doc examples (commit: 234c854bd203e9ba32be50b1f33cc118d0dbd9e8) (details)
  38. [SPARK-23245][SS][TESTS] Don't access `lastExecution.executedPlan` in (commit: 65600bfdb9417e5f2bd2e40312e139f592f238e9) (details)
  39. [SPARK-23233][PYTHON] Reset the cache in asNondeterministic to set (commit: 3b6fc286d105ae7de737c46e50cf941e6831ab98) (details)
  40. [SPARK-23248][PYTHON][EXAMPLES] Relocate module docstrings to the top in (commit: 8ff0cc48b1b45ed41914822ffaaf8de8dff87b72) (details)
  41. [SPARK-23250][DOCS] Typo in JavaDoc/ScalaDoc for DataFrameWriter (commit: 7ca2cd463db90fc166a10de1ebe58ccc795fbbe9) (details)
  42. [SPARK-23196] Unify continuous and microbatch V2 sinks (commit: 588b9694c1967ff45774431441e84081ee6eb515) (details)
  43. [SPARK-23020] Ignore Flaky Test: (commit: 5dda5db1229a20d7e3b0caab144af16da0787d56) (details)
  44. [SPARK-23238][SQL] Externalize SQLConf configurations exposed in (commit: 8229e155d84cf02479c5dd0df6d577aff5075c00) (details)
  45. [SPARK-23219][SQL] Rename ReadTask to DataReaderFactory in data source (commit: de66abafcf081c5cc4d1556d21c8ec21e1fefdf5) (details)
  46. [SPARK-23199][SQL] improved Removes repetition from group expressions in (commit: 4059454f979874caa9745861a2bcc60cac0bbffd) (details)
  47. [SPARK-23223][SQL] Make stacking dataset transforms more performant (commit: d68198d26e32ce98cbf0d3f8755d21dc72b3756d) (details)
  48. [SPARK-22221][DOCS] Adding User Documentation for Arrow (commit: 6588e007e8e10da7cc9771451eeb4d3a2bdc6e0e) (details)
  49. [SPARK-22916][SQL][FOLLOW-UP] Update the Description of Join Selection (commit: 438631031b2a7d79f8c639ef8ef0de1303bb9f2b) (details)
  50. [SPARK-23209][core] Allow credential manager to work when Hive not (commit: 75131ee867bca17ca3aade5f832f1d49b7cfcff5) (details)
  51. [SPARK-22221][SQL][FOLLOWUP] Externalize (commit: 2858eaafaf06d3b8c55a8a5ed7831260244932cd) (details)
  52. [SPARK-23207][SQL][FOLLOW-UP] Don't perform local sort for (commit: a81ace196ec41b1b3f739294c965df270ee2ddd2) (details)
  53. [SPARK-23157][SQL] Explain restriction on column expression in (commit: bb7502f9a506d52365d7532b3b0281098dd85763) (details)
  54. [SPARK-23138][ML][DOC] Multiclass logistic regression summary example (commit: 107d4e293867af42c87cbc1a93d14c5492c2ba84) (details)
  55. [SPARK-23260][SPARK-23262][SQL] several data source v2 naming cleanup (commit: d3e623b19231e6d59793b86afa01f169fb2dedb2) (details)
  56. [SPARK-23222][SQL] Make DataFrameRangeSuite not flaky (commit: 7d96dc1acf7d7049a6e6c35de726f800c8160422) (details)
  57. [SPARK-23267][SQL] Increase spark.sql.codegen.hugeMethodLimit to 65535 (commit: 2e0c1e5f3e47e4e35c14732b93a29d1a25e15662) (details)
  58. [SPARK-23275][SQL] hive/tests have been failing when run locally on the (commit: f4802dc8866d0316a6b555b0ab58a56d56d8c6fe) (details)
  59. [SPARK-23261][PYSPARK][BACKPORT-2.3] Rename Pandas UDFs (commit: 7b9fe08658b529901a5f22bf81ffdd4410180809) (details)
  60. [SPARK-23276][SQL][TEST] Enable UDT tests in (commit: 6ed0d57f86e76f37c4ca1c6d721fc235dcec520e) (details)
  61. [SPARK-23274][SQL] Fix ReplaceExceptWithFilter when the right's Filter (commit: b8778321bbb443f2d51ab7e2b1aff3d1e4236e35) (details)
  62. [SPARK-23279][SS] Avoid triggering distributed job for Console sink (commit: ab5a5105502c545bed951538f0ce9409cfbde154) (details)
  63. [SPARK-23272][SQL] add calendar interval type support to ColumnVector (commit: 7ec8ad7ba5b45658b55dd278b4c7ca2e35acfdd3) (details)
  64. [SPARK-23112][DOC] Update ML migration guide with breaking and behavior (commit: c83246c9a3fe7557e7e1d226edf18a7e96730d18) (details)
  65. revert [SPARK-22785][SQL] remove ColumnVector.anyNullsSet (commit: 33f17b28b3448a6f0389d2ee93bb5d49a02f288c) (details)
  66. [SPARK-23249][SQL] Improved block merging logic for partitions (commit: f5f21e8c4261c0dfe8e3e788a30b38b188a18f67) (details)
  67. [SPARK-23281][SQL] Query produces results in incorrect order when a (commit: 8ee3a71c9c1b8ed51c5916635d008fdd49cf891a) (details)
  68. [SPARK-23157][SQL][FOLLOW-UP] DataFrame -> SparkDataFrame in R comment (commit: 7ccfc753086c3859abe358c87f2e7b7a30422d5e) (details)
  69. [SPARK-23280][SQL] add map type support to ColumnVector (commit: 0d0f5793686b98305f98a3d9e494bbfcee9cff13) (details)
  70. [SPARK-23268][SQL] Reorganize packages in data source V2 (commit: 59e89a2990e4f66839c91f48f41157dac6e670ad) (details)
  71. [SPARK-21396][SQL] Fixes MatchError when UDTs are passed through Hive (commit: 871fd48dc381d48e67f7efcc45cc534d36e4ee6e) (details)
  72. [SPARK-23107][ML] ML 2.3 QA: New Scala APIs, docs. (commit: 205bce974b86bef9d9d507e1b89549cb01c7b535) (details)
  73. [SPARK-23280][SQL][FOLLOWUP] Fix Java style check issues. (commit: 6b6bc9c4ebeb4c1ebfea3f6ddff0d2f502011e0c) (details)
  74. [SPARK-23280][SQL][FOLLOWUP] Enable `MutableColumnarRow.getMap()`. (commit: 3aa780ef34492ab1746bbcde8a75bfa8c3d929e1) (details)
  75. [SPARK-23289][CORE] OneForOneBlockFetcher.DownloadCallback.onData should (commit: 2549beae20fe8761242f6fb9cda35ff06a652897) (details)
  76. [SPARK-13983][SQL] Fix HiveThriftServer2 can not get "--hiveconf" and (commit: 2db7e49dbb58c5b44e22a6a3aae4fa04cd89e5e8) (details)
  77. [SPARK-23293][SQL] fix data source v2 self join (commit: 07a8f4ddfc2edccde9b1d28b4436a596d2f7db63) (details)
  78. [SPARK-23296][YARN] Include stacktrace in YARN-app diagnostic (commit: ab23785c70229acd6c22218f722337cf0a9cc55b) (details)
  79. [SPARK-23284][SQL] Document the behavior of several ColumnVector's get (commit: 7baae3aef34d16cb0a2d024d96027d8378a03927) (details)
  80. [SPARK-23301][SQL] data source column pruning should work for arbitrary (commit: 2b07452cacb4c226c815a216c4cfea2a04227700) (details)
  81. [SPARK-23312][SQL] add a config to turn off vectorized cache reader (commit: e5e9f9a430c827669ecfe9d5c13cc555fc89c980) (details)
  82. [SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up (commit: 56eb9a310217a5372bdba1e24e4af0d4de1829ca) (details)
  83. [SQL] Minor doc update: Add an example in DataFrameReader.schema (commit: dcd0af4be752ab61b8caf36f70d98e97c6925473) (details)
  84. [SPARK-23317][SQL] rename ContinuousReader.setOffset to setStartOffset (commit: b614c083a4875c874180a93b08ea5031fa90cfec) (details)
  85. [SPARK-23311][SQL][TEST] add FilterFunction test case for test (commit: 1bcb3728db11be6e34060eff670fc8245ad571c6) (details)
  86. [SPARK-23305][SQL][TEST] Test `spark.sql.files.ignoreMissingFiles` for (commit: 4de206182c8a1f76e1e5e6b597c4b3890e2ca255) (details)
  87. [MINOR][DOC] Use raw triple double quotes around docstrings where there (commit: be3de87914f29a56137e391d0cf494c0d1a7ba12) (details)
  88. [SPARK-21658][SQL][PYSPARK] Revert "[] Add default None for value in (commit: 45f0f4ff76accab3387b08b3773a0b127333ea3a) (details)
  89. [SPARK-22036][SQL][FOLLOWUP] Fix decimalArithmeticOperations.sql (commit: 430025cba1ca8cc652fd11f894cef96203921dab) (details)
  90. [SPARK-23307][WEBUI] Sort jobs/stages/tasks/queries with the completed (commit: e688ffee20cf9d7124e03b28521e31e10d0bb7f3) (details)
  91. [SPARK-23310][CORE] Turn off read ahead input stream for unshafe shuffle (commit: 173449c2bd454a87487f8eebf7d74ee6ed505290) (details)
  92. [SPARK-23330][WEBUI] Spark UI SQL executions page throws NPE (commit: 4aa9aafcd542d5f28b7e6bb756c2e965010a757c) (details)
  93. [SPARK-23326][WEBUI] schedulerDelay should return 0 when the task is (commit: 521494d7bdcbb6699e0b12cd3ff60fc27908938f) (details)
  94. [SPARK-23290][SQL][PYTHON][BACKPORT-2.3] Use datetime.date for date type (commit: 44933033e9216ccb2e533b9dc6e6cb03cce39817) (details)
  95. [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() (commit: a511544822be6e3fc9c6bb5080a163b9acbb41f2) (details)
  96. [SPARK-23310][CORE][FOLLOWUP] Fix Java style check issues. (commit: 7782fd03ab95552dff1d1477887632bbc8f6ee51) (details)
  97. [SPARK-23312][SQL][FOLLOWUP] add a config to turn off vectorized cache (commit: 036a04b29c818ddbe695f7833577781e8bb16d3f) (details)
  98. [MINOR][TEST] Fix class name for Pandas UDF tests (commit: 77cccc5e154b14f8a9cad829d7fd476e3b6405ce) (details)
  99. [SPARK-23315][SQL] failed to get output from canonicalized data source (commit: f9c913263219f5e8a375542994142645dd0f6c6a) (details)
  100. [SPARK-23327][SQL] Update the description and tests of three external (commit: 874d3f89fe0f903a6465520c3e6c4788a6865d9a) (details)
  101. [SPARK-23122][PYSPARK][FOLLOWUP] Replace registerTempTable by (commit: cb22e830b0af3f2d760beffea9a79a6d349e4661) (details)
  102. [SPARK-23345][SQL] Remove open stream record even closing it fails (commit: 05239afc9e62ef4c71c9f22a930e73888985510a) (details)
  103. [SPARK-23300][TESTS][BRANCH-2.3] Prints out if Pandas and PyArrow are (commit: 2ba07d5b101c44382e0db6d660da756c2f5ce627) (details)
  104. Revert [SPARK-22279][SQL] Turn on spark.sql.hive.convertMetastoreOrc by (commit: db59e554273fe0a54a3223079ff39106fdd1442e) (details)
  105. [SPARK-23319][TESTS][BRANCH-2.3] Explicitly specify Pandas and PyArrow (commit: 0538302561c4d77b2856b1ce73b3ccbcb6688ac6) (details)
  106. [SPARK-23348][SQL] append data using saveAsTable should adjust the data (commit: 0c2a2100d0116776d2dcb2d48493f77a64aead0c) (details)
  107. [SPARK-23268][SQL][FOLLOWUP] Reorganize packages in data source V2 (commit: 68f3a070c728d0af95e9b5eec2c49be274b67a20) (details)
  108. [SPARK-23186][SQL] Initialize DriverManager first before loading JDBC (commit: dfb16147791ff87342ff852105420a5eac5c553b) (details)
  109. [SPARK-23328][PYTHON] Disallow default value None in na.replace/replace (commit: 196304a3a8ed15fd018e9c7b441954d17bd60124) (details)
  110. [SPARK-23358][CORE] When the number of partitions is greater than 2^28, (commit: 08eb95f609f5d356c89dedcefa768b12a7a8b96c) (details)
  111. [MINOR][HIVE] Typo fixes (commit: 49771ac8da8e68e8412d9f5d181953eaf0de7973) (details)
  112. [SPARK-23275][SQL] fix the thread leaking in hive/tests (commit: f3a9a7f6b6eac4421bd74ff73a74105982604ce6) (details)
  113. [SPARK-23360][SQL][PYTHON] Get local timezone from environment via pytz, (commit: b7571b9bfcced2e08f87e815c2ea9474bfd5fe2a) (details)
  114. [SPARK-23314][PYTHON] Add ambiguous=False when localizing tz-naive (commit: 9fa7b0e107c283557648160195ce179077752e4c) (details)
  115. [SPARK-23387][SQL][PYTHON][TEST][BRANCH-2.3] Backport assertPandasEqual (commit: 8875e47cec01ae8da4ffb855409b54089e1016fb) (details)
  116. [SPARK-23376][SQL] creating UnsafeKVExternalSorter with BytesToBytesMap (commit: 7e2a2b33c0664b3638a1428688b28f68323994c1) (details)
  117. [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSourceSuite in Spark (commit: 79e8650cccb00c7886efba6dd691d9733084cb81) (details)
  118. [SPARK-22977][SQL] fix web UI SQL tab for CTAS (commit: 1e3118c2ee0fe7d2c59cb3e2055709bb2809a6db) (details)
  119. [SPARK-23391][CORE] It may lead to overflow for some integer (commit: d31c4ae7ba734356c849347b9a7b448da9a5a9a1) (details)
  120. Preparing Spark release v2.3.0-rc3 (commit: 89f6fcbafcfb0a7aeb897fba6036cb085bd35121) (details)
  121. Preparing development version 2.3.1-SNAPSHOT (commit: 70be6038df38d5e80af8565120eedd8242c5a7c5) (details)
  122. [SPARK-23388][SQL] Support for Parquet Binary DecimalType in (commit: 4e138207ebb11a08393c15e5e39f46a5dc1e7c66) (details)
  123. [SPARK-22002][SQL][FOLLOWUP][TEST] Add a test to check if the original (commit: 9632c461e6931a1a4d05684d0f62ee36f9e90b77) (details)
  124. [SPARK-23313][DOC] Add a migration guide for ORC (commit: 2b80571e215d56d15c59f0fc5db053569a79efae) (details)
  125. [SPARK-23230][SQL] When hive.default.fileformat is other kinds of file (commit: befb22de81aad41673eec9dba7585b80c6cb2564) (details)
  126. [SPARK-23352][PYTHON][BRANCH-2.3] Explicitly specify supported types in (commit: 43f5e40679f771326b2ee72f14cf1ab0ed2ad692) (details)
  127. [SPARK-20090][FOLLOW-UP] Revert the deprecation of `names` in PySpark (commit: 3737c3d32bb92e73cadaf3b1b9759d9be00b288d) (details)
  128. [SPARK-23384][WEB-UI] When it has no incomplete(completed) applications (commit: 1c81c0c626f115fbfe121ad6f6367b695e9f3b5f) (details)
  129. [SPARK-23053][CORE] taskBinarySerialization and task partitions (commit: dbb1b399b6cf8372a3659c472f380142146b1248) (details)
  130. [SPARK-23316][SQL] AnalysisException after max iteration reached for IN (commit: ab01ba718c7752b564e801a1ea546aedc2055dc0) (details)
  131. [SPARK-23154][ML][DOC] Document backwards compatibility guarantees for (commit: 320ffb1309571faedb271f2c769b4ab1ee1cd267) (details)
  132. [SPARK-23400][SQL] Add a constructors for ScalaUDF (commit: 4f6a457d464096d791e13e57c55bcf23c01c418f) (details)
  133. [SPARK-23399][SQL] Register a task completion listener first for (commit: bb26bdb55fdf84c4e36fd66af9a15e325a3982d6) (details)
Commit 4e75b0cb4b575d4799c02455eed286fe971c6c50 by vanzin
[SPARK-23121][CORE] Fix for UI becoming inaccessible for long-running
streaming apps
## What changes were proposed in this pull request?
The allJobs and job pages attempt to use stage attempt and DAG
visualization data from the store, but for long-running jobs these entries are
not guaranteed to be retained, leading to exceptions when the pages are
rendered.
To fix this, `store.lastStageAttempt(stageId)` and
`store.operationGraphForJob(jobId)` are wrapped in `store.asOption`, and
default values are used if the info is missing.
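The pattern can be illustrated with a small, self-contained sketch; the store type and data below are stand-ins for illustration, not Spark's actual `AppStatusStore` code.
```scala
// Stand-in store whose entries may have been evicted. asOption mirrors the
// helper named above: turn a lookup that throws into an Option so the page
// can fall back to a default instead of failing to render.
object EvictedStoreLookup {
  final case class StageData(stageId: Int, name: String)

  class StoreLike(retained: Map[Int, StageData]) {
    def lastStageAttempt(stageId: Int): StageData =
      retained.getOrElse(stageId, throw new NoSuchElementException(s"stage $stageId"))

    def asOption[T](fn: => T): Option[T] =
      try Some(fn) catch { case _: NoSuchElementException => None }
  }

  def main(args: Array[String]): Unit = {
    val store = new StoreLike(Map(1 -> StageData(1, "count at HdfsWordCount.scala:30")))
    // Stage 42 was evicted: render a placeholder rather than throwing.
    val label = store.asOption(store.lastStageAttempt(42))
      .map(_.name)
      .getOrElse("Unknown (evicted from the store)")
    println(label)
  }
}
```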
## How was this patch tested?
Manual testing of the UI, also using the test command reported in
SPARK-23121:
./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark
Closes #20287
Author: Sandor Murakozi <smurakozi@gmail.com>
Closes #20330 from smurakozi/SPARK-23121.
(cherry picked from commit 446948af1d8dbc080a26a6eec6f743d338f1d12b)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: 4e75b0cb4b575d4799c02455eed286fe971c6c50)
The file was modified core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala (diff)
The file was modified resource-managers/mesos/pom.xml (diff)
The file was modified sql/core/pom.xml (diff)
The file was modified external/kafka-0-8-assembly/pom.xml (diff)
The file was modified tools/pom.xml (diff)
The file was modified core/pom.xml (diff)
The file was modified external/kafka-0-10-sql/pom.xml (diff)
The file was modified external/spark-ganglia-lgpl/pom.xml (diff)
The file was modified external/kafka-0-8/pom.xml (diff)
The file was modified resource-managers/kubernetes/core/pom.xml (diff)
The file was modified external/kafka-0-10-assembly/pom.xml (diff)
The file was modified launcher/pom.xml (diff)
The file was modified external/kafka-0-10/pom.xml (diff)
The file was modified hadoop-cloud/pom.xml (diff)
The file was modified external/flume-sink/pom.xml (diff)
The file was modified external/flume/pom.xml (diff)
The file was modified common/sketch/pom.xml (diff)
The file was modified external/kinesis-asl/pom.xml (diff)
The file was modified sql/catalyst/pom.xml (diff)
The file was modified common/tags/pom.xml (diff)
The file was modified mllib-local/pom.xml (diff)
The file was modified external/docker-integration-tests/pom.xml (diff)
The file was modified pom.xml (diff)
The file was modified sql/hive/pom.xml (diff)
The file was modified common/network-shuffle/pom.xml (diff)
The file was modified graphx/pom.xml (diff)
The file was modified external/flume-assembly/pom.xml (diff)
The file was modified external/kinesis-asl-assembly/pom.xml (diff)
The file was modified R/pkg/DESCRIPTION (diff)
The file was modified sql/hive-thriftserver/pom.xml (diff)
The file was modified assembly/pom.xml (diff)
The file was modified common/network-common/pom.xml (diff)
The file was modified mllib/pom.xml (diff)
The file was modified common/kvstore/pom.xml (diff)
The file was modified repl/pom.xml (diff)
The file was modified docs/_config.yml (diff)
The file was modified common/network-yarn/pom.xml (diff)
The file was modified common/unsafe/pom.xml (diff)
The file was modified streaming/pom.xml (diff)
The file was modified resource-managers/yarn/pom.xml (diff)
The file was modified examples/pom.xml (diff)
The file was modified python/pyspark/version.py (diff)
The file was modified mllib-local/pom.xml (diff)
The file was modified common/kvstore/pom.xml (diff)
The file was modified external/flume-assembly/pom.xml (diff)
The file was modified external/kafka-0-8/pom.xml (diff)
The file was modified hadoop-cloud/pom.xml (diff)
The file was modified repl/pom.xml (diff)
The file was modified assembly/pom.xml (diff)
The file was modified graphx/pom.xml (diff)
The file was modified launcher/pom.xml (diff)
The file was modified common/unsafe/pom.xml (diff)
The file was modified sql/core/pom.xml (diff)
The file was modified common/network-common/pom.xml (diff)
The file was modified external/docker-integration-tests/pom.xml (diff)
The file was modified sql/catalyst/pom.xml (diff)
The file was modified streaming/pom.xml (diff)
The file was modified R/pkg/DESCRIPTION (diff)
The file was modified external/kafka-0-10/pom.xml (diff)
The file was modified external/kinesis-asl-assembly/pom.xml (diff)
The file was modified core/pom.xml (diff)
The file was modified external/flume/pom.xml (diff)
The file was modified external/kafka-0-8-assembly/pom.xml (diff)
The file was modified external/kafka-0-10-sql/pom.xml (diff)
The file was modified examples/pom.xml (diff)
The file was modified sql/hive-thriftserver/pom.xml (diff)
The file was modified external/kafka-0-10-assembly/pom.xml (diff)
The file was modified pom.xml (diff)
The file was modified tools/pom.xml (diff)
The file was modified common/network-yarn/pom.xml (diff)
The file was modified external/kinesis-asl/pom.xml (diff)
The file was modified mllib/pom.xml (diff)
The file was modified python/pyspark/version.py (diff)
The file was modified common/sketch/pom.xml (diff)
The file was modified docs/_config.yml (diff)
The file was modified sql/hive/pom.xml (diff)
The file was modified common/network-shuffle/pom.xml (diff)
The file was modified resource-managers/mesos/pom.xml (diff)
The file was modified external/flume-sink/pom.xml (diff)
The file was modified resource-managers/yarn/pom.xml (diff)
The file was modified common/tags/pom.xml (diff)
The file was modified external/spark-ganglia-lgpl/pom.xml (diff)
The file was modified resource-managers/kubernetes/core/pom.xml (diff)
Commit 566ef93a672aea1803d6977883204780c2f6982d by sowen
[MINOR] Typo fixes
## What changes were proposed in this pull request?
Typo fixes
## How was this patch tested?
Local build / Doc-only changes
Author: Jacek Laskowski <jacek@japila.pl>
Closes #20344 from jaceklaskowski/typo-fixes.
(cherry picked from commit 76b8b840ddc951ee6203f9cccd2c2b9671c1b5e8)
Signed-off-by: Sean Owen <sowen@cloudera.com>
(commit: 566ef93a672aea1803d6977883204780c2f6982d)
The file was modified sql/core/src/test/resources/sql-tests/results/columnresolution-views.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/streaming/OutputMode.java (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala (diff)
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/columnresolution.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/SetCommand.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/SparkContext.scala (diff)
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriteTask.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeqLog.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingQueryWrapper.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (diff)
Commit 7241556d8b550e22eed2341287812ea373dc1cb2 by gatorsmile
[SPARK-22389][SQL] data source v2 partitioning reporting interface
## What changes were proposed in this pull request?
This adds a new interface that allows a data source to report its
partitioning and avoid a shuffle on the Spark side.
The design closely follows the internal distribution/partitioning
framework: Spark defines a `Distribution` interface and several
concrete implementations, and asks the data source to report a
`Partitioning`; the `Partitioning` should tell Spark whether it can satisfy a
given `Distribution` or not.
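The reported contract can be sketched with stand-in traits (not the actual data source v2 Java interfaces added below): the source reports a `Partitioning`, and Spark checks whether it satisfies the required `Distribution` before deciding to insert a shuffle.
```scala
// Self-contained sketch of the reporting contract described above.
object PartitioningReportSketch {
  sealed trait Distribution
  // Rows sharing the same values of these columns must be in one partition.
  final case class ClusteredDistribution(clusteredColumns: Seq[String]) extends Distribution

  trait Partitioning {
    def numPartitions: Int
    def satisfy(d: Distribution): Boolean
  }

  // A source already hash-partitioned on `columns` satisfies a clustered
  // distribution whenever its partitioning columns are a subset of the
  // required clustering columns, so Spark can skip the shuffle.
  final case class HashPartitionedSource(columns: Set[String], numPartitions: Int) extends Partitioning {
    def satisfy(d: Distribution): Boolean = d match {
      case ClusteredDistribution(cols) => columns.subsetOf(cols.toSet)
    }
  }

  def main(args: Array[String]): Unit = {
    val reported = HashPartitionedSource(Set("a"), numPartitions = 10)
    println(reported.satisfy(ClusteredDistribution(Seq("a", "b")))) // true: no shuffle needed
    println(reported.satisfy(ClusteredDistribution(Seq("b"))))      // false: shuffle required
  }
}
```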
## How was this patch tested?
new test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20201 from cloud-fan/partition-reporting.
(cherry picked from commit 51eb750263dd710434ddb60311571fa3dcec66eb)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 7241556d8b550e22eed2341287812ea373dc1cb2)
The file was added sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaPartitionAwareDataSource.java
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala (diff)
The file was added sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ClusteredDistribution.java
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourcePartitioning.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala (diff)
The file was added sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/Partitioning.java
The file was added sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/Distribution.java
The file was added sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportPartitioning.java
Commit 832d69817c29e8c44fcab0d6a476f36d4ee0c837 by mridul
[SPARK-22465][FOLLOWUP] Update the number of partitions of default
partitioner when defaultParallelism is set
## What changes were proposed in this pull request?
#20002 proposed a way to safety-check the default partitioner; however, if
`spark.default.parallelism` is set, the defaultParallelism could still
be smaller than the proper number of partitions for the upstream RDDs. This
PR extends the approach to also handle the case where
`spark.default.parallelism` is set.
The case where the PR helps is when all of the following hold:
- the max partitioner is not eligible because it is at least an order of
magnitude smaller, and
- the user has explicitly set `spark.default.parallelism`, and
- the value of `spark.default.parallelism` is lower than the max
partitioner's partition count.
Since the max partitioner was discarded for being at least an order of
magnitude smaller, the user-specified default parallelism is even worse.
In all other cases, the changes are a no-op.
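A minimal sketch of the resulting rule (illustrative names, not the actual `Partitioner.defaultPartitioner` code): reuse the largest existing partitioner unless it is both an order of magnitude smaller than the largest upstream RDD and smaller than the configured default parallelism.
```scala
// Hedged sketch of the decision described above; the order-of-magnitude
// check is an approximation of the real eligibility test.
object DefaultPartitionerSketch {
  def reuseExistingPartitioner(
      existingMaxPartitions: Option[Int], // numPartitions of the largest existing partitioner
      maxUpstreamPartitions: Int,         // most partitions among the upstream RDDs
      defaultNumPartitions: Int           // spark.default.parallelism if set, else maxUpstreamPartitions
  ): Boolean =
    existingMaxPartitions.exists { n =>
      val withinAnOrderOfMagnitude = maxUpstreamPartitions / n < 10
      withinAnOrderOfMagnitude || n >= defaultNumPartitions
    }

  def main(args: Array[String]): Unit = {
    // Existing partitioner has 20 partitions, upstream has 1000, and the user
    // set spark.default.parallelism = 8: keep the 20-partition one (8 is worse).
    println(reuseExistingPartitioner(Some(20), 1000, 8))   // true
    // With spark.default.parallelism = 500, build a new partitioner instead.
    println(reuseExistingPartitioner(Some(20), 1000, 500)) // false
  }
}
```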
## How was this patch tested?
Add corresponding test cases in `PairRDDFunctionsSuite` and
`PartitioningSuite`.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes #20091 from jiangxb1987/partitioner.
(cherry picked from commit 96cb60bc33936c1aaf728a1738781073891480ff)
Signed-off-by: Mridul Muralidharan <mridul@gmail.com>
(commit: 832d69817c29e8c44fcab0d6a476f36d4ee0c837)
The file was modifiedcore/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/PartitioningSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/Partitioner.scala (diff)
Commit 29ed718732de40e956e37f0673743ae375cd30c5 by hyukjinkwon
[SPARK-20749][SQL][FOLLOW-UP] Override prettyName for bit_length and
octet_length
## What changes were proposed in this pull request?
We need to override `prettyName` for `bit_length` and `octet_length` to get
the expected auto-generated alias name.
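The effect can be illustrated with a stand-in for the Catalyst convention that the default `prettyName` is derived from the class name (this is an illustration, not the actual Catalyst classes):
```scala
object PrettyNameSketch {
  trait ExprLike {
    def nodeName: String                          // stand-in for the class-derived name
    def prettyName: String = nodeName.toLowerCase // default, e.g. "bitlength"
    def childSql: String
    def autoAlias: String = s"$prettyName($childSql)"
  }
  final case class BitLength(childSql: String) extends ExprLike {
    def nodeName: String = "BitLength"
    override def prettyName: String = "bit_length" // the override this PR adds
  }
  def main(args: Array[String]): Unit =
    println(BitLength("col").autoAlias) // bit_length(col), not bitlength(col)
}
```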
## How was this patch tested?
The existing tests
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20358 from gatorsmile/test2.3More.
(cherry picked from commit ee572ba8c1339d21c592001ec4f7f270005ff1cf)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: 29ed718732de40e956e37f0673743ae375cd30c5)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/operators.sql.out (diff)
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/scalar-subquery/scalar-subquery-predicate.sql.out (diff)
Commit f8f522c01025e78eca1724c909c749374f855039 by meng
[SPARK-22735][ML][DOC] Added VectorSizeHint docs and examples.
## What changes were proposed in this pull request?
Added documentation for new transformer.
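A hedged usage sketch of the transformer being documented (parameter names follow the ML feature API; the data is illustrative):
```scala
import org.apache.spark.ml.feature.VectorSizeHint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object VectorSizeHintSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("vector-size-hint").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (0, Vectors.dense(1.0, 2.0, 3.0)),
      (1, Vectors.dense(4.0, 5.0, 6.0))
    ).toDF("id", "userFeatures")

    // Declare the size of the vector column up front so downstream stages
    // (e.g. VectorAssembler) can validate it.
    val sizeHint = new VectorSizeHint()
      .setInputCol("userFeatures")
      .setHandleInvalid("skip") // drop rows whose vectors do not match the declared size
      .setSize(3)

    sizeHint.transform(df).show(false)
    spark.stop()
  }
}
```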
Author: Bago Amirbekian <bago@databricks.com>
Closes #20285 from MrBago/sizeHintDocs.
(cherry picked from commit 05839d164836e544af79c13de25802552eadd636)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
(commit: f8f522c01025e78eca1724c909c749374f855039)
The file was modified docs/ml-features.md (diff)
The file was added examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java
The file was added examples/src/main/scala/org/apache/spark/examples/ml/VectorSizeHintExample.scala
The file was added examples/src/main/python/ml/vector_size_hint_example.py
Commit 851c303867eb54405f6508919619debe84708933 by gatorsmile
[SPARK-23192][SQL] Keep the Hint after Using Cached Data
## What changes were proposed in this pull request?
The hint on a plan segment is lost if the segment is replaced by the
cached data.
```Scala
     val df1 = spark.createDataFrame(Seq((1, "4"), (2,
"2"))).toDF("key", "value")
     val df2 = spark.createDataFrame(Seq((1, "1"), (2,
"2"))).toDF("key", "value")
     df2.cache()
     val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
```
This PR is to fix it.
## How was this patch tested?
Added a test
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20365 from gatorsmile/fixBroadcastHintloss.
(cherry picked from commit 613c290336e3826111164c24319f66774b1f65a3)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 851c303867eb54405f6508919619debe84708933)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala (diff)
Commit a23f6b13e8a4f0471ee33879a14746786bbf0435 by gatorsmile
[SPARK-23195][SQL] Keep the Hint of Cached Data
## What changes were proposed in this pull request?
The broadcast hint on a plan is lost once the plan is cached. This PR
corrects that.
```Scala
val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key",
"value")
val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key",
"value")
broadcast(df2).cache()
df2.collect()
val df3 = df1.join(df2, Seq("key"), "inner")
```
## How was this patch tested?
Added a test.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20368 from gatorsmile/cachedBroadcastHint.
(cherry picked from commit 44cc4daf3a03f1a220eef8ce3c86867745db9ab7)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: a23f6b13e8a4f0471ee33879a14746786bbf0435)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala (diff)
Commit 3316a9d7104aece977384974cf61e5ec635ad350 by tathagata.das1565
[SPARK-23197][DSTREAMS] Increased timeouts to resolve flakiness
## What changes were proposed in this pull request?
Increased timeout from 50 ms to 300 ms (50 ms was really too low).
## How was this patch tested?
Multiple rounds of tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #20371 from tdas/SPARK-23197.
(cherry picked from commit 15adcc8273e73352e5e1c3fc9915c0b004ec4836)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
(commit: 3316a9d7104aece977384974cf61e5ec635ad350)
The file was modified streaming/src/test/scala/org/apache/spark/streaming/ReceiverSuite.scala (diff)
Commit 9cfe90e5abb21086711e0efb7ed08026dba96ffc by felixcheung
[SPARK-21727][R] Allow multi-element atomic vector as column type in
SparkR DataFrame
## What changes were proposed in this pull request?
A fix to https://issues.apache.org/jira/browse/SPARK-21727, "Operating
on an ArrayType in a SparkR DataFrame throws error"
## How was this patch tested?
- Ran tests at R\pkg\tests\run-all.R (see below attached results)
- Tested the following lines in SparkR, which now seem to execute
without error:
```
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
mySparkDf <- as.DataFrame(myDf)
collect(mySparkDf)
```
[2018-01-22 SPARK-21727 Test
Results.txt](https://github.com/apache/spark/files/1653535/2018-01-22.SPARK-21727.Test.Results.txt)
felixcheung yanboliang sun-rui shivaram
_The contribution is my original work and I license the work to the
project under the project’s open source license_
Author: neilalex <neil@neilalex.com>
Closes #20352 from neilalex/neilalex-sparkr-arraytype.
(cherry picked from commit f54b65c15a732540f7a41a9083eeb7a08feca125)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
(commit: 9cfe90e5abb21086711e0efb7ed08026dba96ffc)
The file was modified R/pkg/tests/fulltests/test_Serde.R (diff)
The file was modified R/pkg/R/serialize.R (diff)
Commit d656be74b87746efc020d5cae3bfa294f8f98594 by gatorsmile
Revert "[SPARK-23195][SQL] Keep the Hint of Cached Data"
This reverts commit a23f6b13e8a4f0471ee33879a14746786bbf0435.
(commit: d656be74b87746efc020d5cae3bfa294f8f98594)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala (diff)
Commit 84a189a3429c64eeabc7b4e8fc0488ec16742002 by hyukjinkwon
[SPARK-23177][SQL][PYSPARK][BACKPORT-2.3] Extract zero-parameter UDFs
from aggregate
## What changes were proposed in this pull request?
We extract Python UDFs in logical aggregate which depends on aggregate
expression or grouping key in ExtractPythonUDFFromAggregate rule. But
Python UDFs which don't depend on above expressions should also be
extracted to avoid the issue reported in the JIRA.
A small code snippet to reproduce that issue looks like:
```python
import pyspark.sql.functions as f

df = spark.createDataFrame([(1,2), (3,4)])
f_udf = f.udf(lambda: str("const_str"))
df2 = df.distinct().withColumn("a", f_udf())
df2.show()
```
Error exception is raised as:
```
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#50
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513)
```
This exception is raised because `HashAggregateExec` tries to bind the
aliased Python UDF expression (e.g., `pythonUDF0#50 AS a#44`) to the
grouping key.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #20379 from viirya/SPARK-23177-backport-2.3.
(commit: 84a189a3429c64eeabc7b4e8fc0488ec16742002)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala (diff)
The file was modified python/pyspark/sql/tests.py (diff)
Commit 17317c8fb99715836fcebc39ffb04648ab7fb762 by hyukjinkwon
[SPARK-23148][SQL] Allow pathnames with special characters for CSV /
JSON / text
## What changes were proposed in this pull request?
Fix for JSON and CSV data sources when file names include characters
that would be changed by URL encoding.
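A hedged usage sketch of what the fix enables (the path is illustrative): reading a CSV file whose name contains characters that URL encoding would otherwise alter.
```scala
import org.apache.spark.sql.SparkSession

object SpecialCharPathSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("special-char-paths").getOrCreate()

    // File names containing spaces, '%', etc. previously failed for the
    // CSV/JSON/text sources because the path was round-tripped through URL
    // encoding; after this fix they should load as-is.
    val df = spark.read
      .option("header", "true")
      .csv("/tmp/sales report 100%.csv") // illustrative path with special characters
    df.show()

    spark.stop()
  }
}
```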
## How was this patch tested?
New unit tests for JSON, CSV and text suites
Author: Henry Robinson <henry@cloudera.com>
Closes #20355 from henryr/spark-23148.
(cherry picked from commit de36f65d3a819c00d6bf6979deef46c824203669)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: 17317c8fb99715836fcebc39ffb04648ab7fb762)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala (diff)
Commit 4336e67f41344fd587808182741ae4ef9fb2b76a by felixcheung
[SPARK-20906][SPARKR] Add API doc example for Constrained Logistic
Regression
## What changes were proposed in this pull request?
doc only changes
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #20380 from felixcheung/rclrdoc.
(cherry picked from commit e18d6f5326e0d9ea03d31de5ce04cb84d3b8ab37)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
(commit: 4336e67f41344fd587808182741ae4ef9fb2b76a)
The file was modified R/pkg/R/mllib_classification.R (diff)
The file was modified R/pkg/tests/fulltests/test_mllib_classification.R (diff)
Commit 2221a30352f1c0f5483c91301f32e66672a43644 by vanzin
[SPARK-23020][CORE][FOLLOWUP] Fix Java style check issues.
## What changes were proposed in this pull request?
This is a follow-up of #20297, which broke the lint-java checks. This PR
fixes the lint-java issues.
```
[ERROR] src/test/java/org/apache/spark/launcher/BaseSuite.java:[21,8]
(imports) UnusedImports: Unused import - java.util.concurrent.TimeUnit.
[ERROR]
src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java:[27,8]
(imports) UnusedImports: Unused import - java.util.concurrent.TimeUnit.
```
## How was this patch tested?
Checked manually in my local environment.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes #20376 from ueshin/issues/SPARK-23020/fup1.
(cherry picked from commit 8c273b4162b6138c4abba64f595c2750d1ef8bcb)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: 2221a30352f1c0f5483c91301f32e66672a43644)
The file was modified launcher/src/test/java/org/apache/spark/launcher/BaseSuite.java (diff)
The file was modified core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java (diff)
Commit 30272c668b2cd8c0b0ee78c600bc3feb17bd6647 by gatorsmile
[SPARK-22837][SQL] Session timeout checker does not work in
SessionManager.
## What changes were proposed in this pull request?
Currently we do not call `super.init(hiveConf)` in
`SparkSQLSessionManager.init`, so the configs
`HIVE_SERVER2_SESSION_CHECK_INTERVAL`, `HIVE_SERVER2_IDLE_SESSION_TIMEOUT`, and
`HIVE_SERVER2_IDLE_SESSION_CHECK_OPERATION` are never loaded, which causes the
session timeout checker not to work.
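The shape of the fix can be sketched with stand-in classes (illustrative, not the actual HiveServer2 `SessionManager` API): the subclass must call `super.init(hiveConf)` so the parent reads the idle-session configs and starts its timeout checker.
```scala
// Self-contained sketch of the missing super call described above.
object SessionManagerInitSketch {
  class SessionManagerLike {
    def init(conf: Map[String, String]): Unit = {
      // The parent reads the session-check configs and starts the checker here.
      val interval = conf.getOrElse("hive.server2.session.check.interval", "0")
      println(s"timeout checker started with interval $interval")
    }
  }

  class SparkSQLSessionManagerLike extends SessionManagerLike {
    override def init(conf: Map[String, String]): Unit = {
      super.init(conf) // previously missing, so the checker never started
      // ... Spark-specific session setup continues here ...
    }
  }

  def main(args: Array[String]): Unit =
    new SparkSQLSessionManagerLike().init(Map("hive.server2.session.check.interval" -> "6h"))
}
```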
## How was this patch tested?
manual tests
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Closes #20025 from zuotingbing/SPARK-22837.
(cherry picked from commit bbb87b350d9d0d393db3fb7ca61dcbae538553bb)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 30272c668b2cd8c0b0ee78c600bc3feb17bd6647)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala (diff)
Commit 500c94434d8f5267b1488accd176cf54b69e6ba4 by zsxwing
[SPARK-23198][SS][TEST] Fix
KafkaContinuousSourceStressForDontFailOnDataLossSuite to test
ContinuousExecution
## What changes were proposed in this pull request?
Currently, `KafkaContinuousSourceStressForDontFailOnDataLossSuite` runs
on `MicroBatchExecution`. It should test `ContinuousExecution`.
## How was this patch tested?
Pass the updated test suite.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20374 from dongjoon-hyun/SPARK-23198.
(cherry picked from commit bc9641d9026aeae3571915b003ac971f6245d53c)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
(commit: 500c94434d8f5267b1488accd176cf54b69e6ba4)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala (diff)
Commit a857ad56621f644a26b9d27079b76ab21f3726ae by gatorsmile
[MINOR][SQL] add new unit test to LimitPushdown
## What changes were proposed in this pull request?
This PR makes the following changes (see the sketch below):
1. Updates y -> x in the "left outer join" test case; the y was likely a mistake.
2. Adds a new test case: "left outer join and left sides are limited".
3. Adds a new test case: "left outer join and right sides are limited".
4. Adds a new test case: "right outer join and right sides are limited".
5. Adds a new test case: "right outer join and left sides are limited".
6. Removes annotations that have no code implementation.
## How was this patch tested?
Added new unit test cases.
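A hedged sketch of the behaviour these tests cover (illustrative data; the optimized plan depends on the LimitPushdown rule): under a LIMIT, the limit can be pushed into the preserved side of an outer join, since each row of that side yields at least one output row.
```scala
import org.apache.spark.sql.SparkSession

object LimitPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("limit-pushdown").getOrCreate()
    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "l")
    val right = Seq((1, "x"), (3, "y")).toDF("key", "r")

    // For a left outer join, the optimized plan is expected to show a
    // LocalLimit pushed below the join on the left (preserved) side.
    left.join(right, Seq("key"), "left_outer").limit(2).explain(true)

    spark.stop()
  }
}
```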
Author: caoxuewen <cao.xuewen@zte.com.cn>
Closes #20381 from heary-cao/LimitPushdownSuite.
(cherry picked from commit 6f0ba8472d1128551fa8090deebcecde0daebc53)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: a857ad56621f644a26b9d27079b76ab21f3726ae)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LimitPushdownSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
Commit 012695256a61f1830ff02780611d4aada00a88a0 by wenchen
[SPARK-23129][CORE] Make deserializeStream of DiskMapIterator init
lazily
## What changes were proposed in this pull request?
Currently, the deserializeStream in ExternalAppendOnlyMap#DiskMapIterator is
initialized when the DiskMapIterator instance is created. This causes memory
overhead when ExternalAppendOnlyMap spills many times.
We can avoid this by initializing deserializeStream only when it is used for
the first time. This patch makes deserializeStream initialize lazily.
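The idea can be sketched with illustrative names (the real change lives inside `ExternalAppendOnlyMap#DiskMapIterator`, which also manages batch offsets):
```scala
import java.io.{ByteArrayInputStream, InputStream}

// Minimal sketch of the lazy-initialization idea: the stream over a spill
// file is created only on first read, so an iterator that is never consumed
// does not hold a stream (and its buffers) open.
class DiskIteratorSketch(openSpillFile: () => InputStream) {
  private lazy val deserializeStream: InputStream = openSpillFile()
  def readNextByte(): Int = deserializeStream.read()
}

object DiskIteratorSketch {
  def main(args: Array[String]): Unit = {
    val it = new DiskIteratorSketch(() => new ByteArrayInputStream(Array[Byte](1, 2, 3)))
    // No stream has been opened yet; it is created by the first read below.
    println(it.readNextByte()) // 1
  }
}
```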
## How was this patch tested?
Existing tests
Author: zhoukang <zhoukang199191@gmail.com>
Closes #20292 from caneGuy/zhoukang/lay-diskmapiterator.
(cherry picked from commit 45b4bbfddc18a77011c3bc1bfd71b2cd3466443c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 012695256a61f1830ff02780611d4aada00a88a0)
The file was modified core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala (diff)
Commit abd3e1bb5fc11a3074ee64a5ed0dc97020dca61c by wenchen
[SPARK-23208][SQL] Fix code generation for complex create array
(related) expressions
## What changes were proposed in this pull request?
`GenArrayData.genCodeToCreateArrayData` produces illegal Java code when
code splitting is enabled. This is used in `CreateArray` and `CreateMap`
expressions for complex object arrays.
This issue is caused by a typo.
## How was this patch tested?
Added a regression test in `complexTypesSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes #20391 from hvanhovell/SPARK-23208.
(cherry picked from commit e29b08add92462a6505fef966629e74ba30e994e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: abd3e1bb5fc11a3074ee64a5ed0dc97020dca61c)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala (diff)
Commit e66c66cd2d4f0cd67cbc2aa6f95135176f1165e4 by felixcheung
[SPARK-23163][DOC][PYTHON] Sync ML Python API with Scala
## What changes were proposed in this pull request?
This syncs the ML Python API with Scala for differences found after the
2.3 QA audit.
## How was this patch tested?
NA
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #20354 from BryanCutler/pyspark-ml-doc-sync-23163.
(cherry picked from commit 39ee2acf96f1e1496cff8e4d2614d27fca76d43b)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
(commit: e66c66cd2d4f0cd67cbc2aa6f95135176f1165e4)
The file was modified python/pyspark/ml/fpm.py (diff)
The file was modified python/pyspark/ml/feature.py (diff)
The file was modified python/pyspark/ml/evaluation.py (diff)
Commit c79e771f8952e6773c3a84cc617145216feddbcf by wenchen
[SPARK-21717][SQL] Decouple consume functions of physical operators in
whole-stage codegen
## What changes were proposed in this pull request?
It has been observed in SPARK-21603 that whole-stage codegen suffers
performance degradation if the generated functions are too long to be
optimized by the JIT.
We basically produce a single function that incorporates the generated code
from all physical operators in the whole stage. Thus, the generated function
can grow beyond the threshold above which JIT optimization is no longer
applied.
This patch decouples the row-consuming logic of physical operators to avoid
one giant function that processes all rows.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #18931 from viirya/SPARK-21717.
(cherry picked from commit d20bbc2d87ae6bd56d236a7c3d036b52c5f20ff5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: c79e771f8952e6773c3a84cc617145216feddbcf)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit 8866f9c24673e739ee87c7341d75dd5c133a744f by nickp
[SPARK-23112][DOC] Add highlights and migration guide for 2.3
Update ML user guide with highlights and migration guide for `2.3`.
## How was this patch tested?
Doc only.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #20363 from MLnick/SPARK-23112-ml-guide.
(cherry picked from commit 8532e26f335b67b74c976712ad82c20ea6dbbf80)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
(commit: 8866f9c24673e739ee87c7341d75dd5c133a744f)
The file was modified docs/ml-guide.md (diff)
The file was modified docs/ml-migration-guides.md (diff)
Commit 2f65c20ea74a87729eaf3c9b2aebcfb10c0ecf4b by hyukjinkwon
[SPARK-23081][PYTHON] Add colRegex API to PySpark
## What changes were proposed in this pull request?
Add colRegex API to PySpark
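A hedged usage sketch, shown in Scala via the existing `Dataset.colRegex` counterpart that the new PySpark API mirrors (data and regex are illustrative):
```scala
import org.apache.spark.sql.SparkSession

object ColRegexSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("col-regex").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a", "x"), (2, "b", "y")).toDF("id", "col_a", "col_b")
    // Selects every column whose name matches the regex (here col_a and col_b).
    df.select(df.colRegex("`col_.*`")).show()

    spark.stop()
  }
}
```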
## How was this patch tested?
add a test in sql/tests.py
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #20390 from huaxingao/spark-23081.
(cherry picked from commit 8480c0c57698b7dcccec5483d67b17cf2c7527ed)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: 2f65c20ea74a87729eaf3c9b2aebcfb10c0ecf4b)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was modified python/pyspark/sql/dataframe.py (diff)
Commit 26a8b4e398ee6d1de06a5f3ac1d6d342c9b67d78 by gatorsmile
[SPARK-23032][SQL] Add a per-query codegenStageId to
WholeStageCodegenExec
## What changes were proposed in this pull request?
**Proposal**
Add a per-query ID to the codegen stages as represented by
`WholeStageCodegenExec` operators. This ID will be used in
-  the explain output of the physical plan, and in
- the generated class name.
Specifically, this ID will be stable within a query, counting up from 1
in depth-first post-order for all the `WholeStageCodegenExec` inserted
into a plan. The ID value 0 is reserved for "free-floating"
`WholeStageCodegenExec` objects, which may have been created for one-off
purposes, e.g. for fallback handling of codegen stages that failed to
codegen the whole stage and wishes to codegen a subset of the children
operators (as seen in
`org.apache.spark.sql.execution.FileSourceScanExec#doExecute`).
Example: for the following query:
```scala
scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)

scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 'y).orderBy('x).select('x + 1 as 'z, 'y)
df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint]

scala> val df2 = spark.range(5)
df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val query = df1.join(df2, 'z === 'id)
query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more field]
```
The explain output before the change is:
```scala
scala> query.explain
== Physical Plan ==
*SortMergeJoin [z#9L], [id#13L], Inner
:- *Sort [z#9L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(z#9L, 200)
:     +- *Project [(x#3L + 1) AS z#9L, y#4L]
:        +- *Sort [x#3L ASC NULLS FIRST], true, 0
:           +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
:              +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
:                 +- *Range (0, 10, step=1, splits=8)
+- *Sort [id#13L ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(id#13L, 200)
     +- *Range (0, 5, step=1, splits=8)
```
Note how codegen'd operators are annotated with a prefix `"*"`. See
how the `SortMergeJoin` operator and its direct children `Sort`
operators are adjacent and all annotated with the `"*"`, so it's hard to
tell they're actually in separate codegen stages.
and after this change it'll be:
```scala
scala> query.explain
== Physical Plan ==
*(6) SortMergeJoin [z#9L], [id#13L], Inner
:- *(3) Sort [z#9L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(z#9L, 200)
:     +- *(2) Project [(x#3L + 1) AS z#9L, y#4L]
:        +- *(2) Sort [x#3L ASC NULLS FIRST], true, 0
:           +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
:              +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
:                 +- *(1) Range (0, 10, step=1, splits=8)
+- *(5) Sort [id#13L ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(id#13L, 200)
     +- *(4) Range (0, 5, step=1, splits=8)
```
Note that the annotated prefix becomes `"*(id) "`. See how the
`SortMergeJoin` operator and its direct children `Sort` operators have
different codegen stage IDs.
It'll also show up in the name of the generated class, as a suffix in
the format of `GeneratedClass$GeneratedIterator$id`.
For example, note how `GeneratedClass$GeneratedIteratorForCodegenStage3`
and `GeneratedClass$GeneratedIteratorForCodegenStage6` in the following
stack trace correspond to the IDs shown in the explain output above:
```
"Executor task launch worker for task 42412957" daemon prio=5 tid=0x58
nid=NA runnable
java.lang.Thread.State: RUNNABLE
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.sort_addToSorter$(generated.java:32)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(generated.java:41)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.findNextInnerJoinRows$(generated.java:42)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(generated.java:101)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
```
**Rationale**
Right now, the codegen from Spark SQL lacks the means to differentiate
between a couple of things:
1. It's hard to tell which physical operators are in the same
WholeStageCodegen stage. Note that this "stage" is a separate notion
from Spark's RDD execution stages; this one is only to delineate codegen
units. There can be adjacent physical operators that are both codegen'd
but are in separate codegen stages. Some of this is due to hacky
implementation details, such as the case with `SortMergeJoin` and its
`Sort` inputs -- they're hard coded to be split into separate stages
although both are codegen'd. When printing out the explain output of the
physical plan, you'd only see the codegen'd physical operators annotated
with a preceding star (`'*'`) but would have no way to figure out if
they're in the same stage.
2. Performance/error diagnosis. The generated code has class/method
names that are hard to differentiate between queries or even between
codegen stages within the same query. If we use a Java-level profiler to
collect profiles, or if we encounter a Java-level exception with a stack
trace in it, it's really hard to tell which part of a query it's at. By
introducing a per-query codegen stage ID, we'd at least be able to know
which codegen stage (and in turn, which group of physical operators) a
profile tick or an exception happened in.
This proposal uses a per-query ID because it's stable within a query, so
multiple runs of the same query will see the same resulting IDs. This
both benefits understandability for users and plays well with the
codegen cache in Spark SQL, which uses the generated source code as the
key.
The downside to using per-query IDs as opposed to a per-session or
globally incrementing ID is of course we can't tell apart different
query runs with this ID alone. But for now I believe this is a good
enough tradeoff.
## How was this patch tested?
Existing tests. This PR does not involve any runtime behavior changes
other than some name changes. The SQL query test suites that compare
explain outputs have been updated to ignore the newly added
`codegenStageId`.
Author: Kris Mok <kris.mok@databricks.com>
Closes #20224 from rednaxelafx/wsc-codegenstageid.
(cherry picked from commit e57f394818b0a62f99609e1032fede7e981f306f)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 26a8b4e398ee6d1de06a5f3ac1d6d342c9b67d78)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveExplainSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala (diff)
Commit 87d128ffd4a56fe3995f041bf4a8c4cba6c20092 by sowen
[SPARK-23205][ML] Update ImageSchema.readImages to correctly set alpha
values for four-channel images
## What changes were proposed in this pull request?
When parsing raw image data in ImageSchema.decode(), we use a
[java.awt.Color](https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int))
constructor that sets alpha = 255, even for four-channel images (which
may have different alpha values). This PR fixes this issue & adds a unit
test to verify correctness of reading four-channel images.
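For illustration only (not code from this patch), the JDK behaviour at the
root of the bug: the one-argument `java.awt.Color(int)` constructor forces
alpha to 255, while the two-argument form preserves the alpha channel of a
four-channel pixel.
```scala
import java.awt.Color

val argbPixel = 0x80FF0000                  // semi-transparent red (alpha = 0x80)
val opaque = new Color(argbPixel)           // Color(int) ignores alpha: getAlpha == 255
val withAlpha = new Color(argbPixel, true)  // Color(int, hasalpha = true): getAlpha == 128
println(s"opaque=${opaque.getAlpha} withAlpha=${withAlpha.getAlpha}")
```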
## How was this patch tested?
Updates an existing unit test ("readImages pixel values test" in
`ImageSchemaSuite`) to also verify correctness when reading a
four-channel image.
Author: Sid Murching <sid.murching@databricks.com>
Closes #20389 from smurching/image-schema-bugfix.
(cherry picked from commit 7bd46d9871567597216cc02e1dc72ff5806ecdf8)
Signed-off-by: Sean Owen <sowen@cloudera.com>
(commit: 87d128ffd4a56fe3995f041bf4a8c4cba6c20092)
The file was addeddata/mllib/images/multi-channel/BGRA_alpha_60.png
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala (diff)
Commit fdf140e25998ad45ce0f931e2e1e36bf4b382dca by wenchen
[SPARK-23020][CORE] Fix race in SparkAppHandle cleanup, again.
Third time is the charm?
There was still a race that was left in previous attempts. If the handle
closes the connection, the close() implementation would clean up state
that would prevent the thread from waiting on the connection thread to
finish. That could cause the race causing the test flakiness reported in
the bug.
The fix is to move the "wait for connection thread" code to a separate
close method that is used by the handle; that also simplifies the code a
bit and makes it easier to follow.
I included an unrelated, but correct, change to a YARN test so that it
triggers when the PR is built.
Tested by inserting a sleep in the connection thread to mimic the race;
test failed reliably with the sleep, passes now. (Sleep not included in
the patch.) Also ran YARN tests to make sure.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #20388 from vanzin/SPARK-23020.
(cherry picked from commit 70a68b328b856c17eb22cc86fee0ebe8d64f8825)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: fdf140e25998ad45ce0f931e2e1e36bf4b382dca)
The file was modifiedlauncher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java (diff)
The file was modifiedlauncher/src/main/java/org/apache/spark/launcher/LauncherServer.java (diff)
The file was modifiedlauncher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java (diff)
The file was modifiedlauncher/src/main/java/org/apache/spark/launcher/InProcessAppHandle.java (diff)
The file was modifiedresource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala (diff)
Commit d6cdc699e350adc6b5ed938192e0e72cff5f52d8 by nickp
[SPARK-22799][ML] Bucketizer should throw exception if single- and
multi-column params are both set
## What changes were proposed in this pull request?
Currently there is a mixed situation when both single- and multi-column
are supported. In some cases exceptions are thrown, in others only a
warning log is emitted. In this discussion
https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049,
the decision was to throw an exception.
The PR throws an exception in `Bucketizer`, instead of logging a
warning.
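A hedged sketch of the situation that now fails fast (column names and splits
are made up; after this change, setting both the single- and multi-column
params is expected to raise an exception when the transformer is used):
```scala
import org.apache.spark.ml.feature.Bucketizer
import spark.implicits._

val df = Seq((-0.5, 1.0), (0.5, 12.0)).toDF("f1", "f2")

// Both single-column and multi-column params are set at the same time.
val bucketizer = new Bucketizer()
  .setInputCol("f1").setOutputCol("b1")
  .setSplits(Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("b1", "b2"))
  .setSplitsArray(Array(
    Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity),
    Array(Double.NegativeInfinity, 10.0, Double.PositiveInfinity)))

// Previously this only logged a warning; now it should throw an exception.
bucketizer.transform(df)
```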
## How was this patch tested?
modified UT
Author: Marco Gaido <marcogaido91@gmail.com> Author: Joseph K. Bradley
<joseph@databricks.com>
Closes #19993 from mgaido91/SPARK-22799.
(cherry picked from commit cd3956df0f96dd416b6161bf7ce2962e06d0a62e)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
(commit: d6cdc699e350adc6b5ed938192e0e72cff5f52d8)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/param/params.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala (diff)
Commit ab1b5d921b395cb7df3a3a2c4a7e5778d98e6f01 by nickp
[SPARK-22797][PYSPARK] Bucketizer support multi-column
## What changes were proposed in this pull request? Bucketizer supports
multi-column on the Python side.
## How was this patch tested? existing tests and added tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #19892 from zhengruifeng/20542_py.
(cherry picked from commit c22eaa94e85aaac649566495dcf763a5de3c8d06)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
(commit: ab1b5d921b395cb7df3a3a2c4a7e5778d98e6f01)
The file was modifiedpython/pyspark/ml/tests.py (diff)
The file was modifiedpython/pyspark/ml/param/__init__.py (diff)
The file was modifiedpython/pyspark/ml/feature.py (diff)
Commit ca3613be20ff4dc546c43322eeabf591ab8ad97f by gatorsmile
[SPARK-23218][SQL] simplify ColumnVector.getArray
## What changes were proposed in this pull request?
`ColumnVector` is very flexible about how to implement array type. As a
result `ColumnVector` has 3 abstract methods for array type:
`arrayData`, `getArrayOffset`, `getArrayLength`. For example, in
`WritableColumnVector` we use the first child vector as the array data
vector, and store offsets and lengths in 2 arrays in the parent vector.
`ArrowColumnVector` has a different implementation.
This PR simplifies `ColumnVector` by using only one abstract method for
array type: `getArray`.
## How was this patch tested?
existing tests.
rerun `ColumnarBatchBenchmark`, there is no performance regression.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20395 from cloud-fan/vector.
(cherry picked from commit dd8e257d1ccf20f4383dd7f30d634010b176f0d3)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: ca3613be20ff4dc546c43322eeabf591ab8ad97f)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java (diff)
Commit f5911d4894700eb48f794133cbd363bf3b7c8c8e by nickp
Revert "[SPARK-22797][PYSPARK] Bucketizer support multi-column"
This reverts commit ab1b5d921b395cb7df3a3a2c4a7e5778d98e6f01.
(commit: f5911d4894700eb48f794133cbd363bf3b7c8c8e)
The file was modifiedpython/pyspark/ml/feature.py (diff)
The file was modifiedpython/pyspark/ml/tests.py (diff)
The file was modifiedpython/pyspark/ml/param/__init__.py (diff)
Commit 30d16e116b0ff044ca03974de0f1faf17e497903 by sameerag
[SPARK-23207][SQL] Shuffle+Repartition on a DataFrame could lead to
incorrect answers
## What changes were proposed in this pull request?
Currently shuffle repartition uses RoundRobinPartitioning; the generated
result is nondeterministic since the sequence of input rows is not
determined.
The bug can be triggered when there is a repartition call following a
shuffle (which would lead to non-deterministic row ordering), as the
pattern shows below: upstream stage -> repartition stage -> result stage
(-> indicates a shuffle). When one of the executor processes goes down,
some tasks in the repartition stage will be retried and generate
inconsistent ordering, and some tasks of the result stage will be
retried, generating different data.
The following code returns 931532, instead of 1000000:
```
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
    throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
```
In this PR, we propose the most straightforward way to fix this problem:
perform a local sort before partitioning. Once we make the input row
ordering deterministic, the function from rows to partitions is fully
deterministic too.
The downside of the approach is that with the extra local sort inserted,
the performance of repartition() will go down, so we add a new config
named `spark.sql.execution.sortBeforeRepartition` to control whether
this patch is applied. The patch is enabled by default to be
safe-by-default, but users may choose to turn it off manually to avoid
the performance regression.
This patch also changes the output row ordering of repartition(), which
causes a number of test case failures because they compare the results
directly.
## How was this patch tested?
Add unit test in ExchangeSuite.
With this patch (and `spark.sql.execution.sortBeforeRepartition` set to
true), the following query returns 1000000:
```
import scala.sys.process._
import org.apache.spark.TaskContext

spark.conf.set("spark.sql.execution.sortBeforeRepartition", "true")

val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
    throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()

res7: Long = 1000000
```
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes #20393 from jiangxb1987/shuffle-repartition.
(cherry picked from commit 94c67a76ec1fda908a671a47a2a1fa63b3ab1b06)
Signed-off-by: Sameer Agarwal <sameerag@apache.org>
(commit: 30d16e116b0ff044ca03974de0f1faf17e497903)
The file was modifiedcore/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ForeachSinkSuite.scala (diff)
The file was modifiedcore/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillMerger.java (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala (diff)
The file was addedsql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/WholeTextFileSuite.scala (diff)
The file was modifiedcore/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java (diff)
The file was modifiedsql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java (diff)
The file was modifiedcore/src/main/java/org/apache/spark/util/collection/unsafe/sort/RecordComparator.java (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/rdd/RDD.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/SortExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (diff)
The file was modifiedcore/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java (diff)
Commit 7aaf23cf8ab871a8e8877ec82183656ae5f4be7b by zsxwing
[SPARK-23242][SS][TESTS] Don't run tests in KafkaSourceSuiteBase twice
## What changes were proposed in this pull request?
KafkaSourceSuiteBase should be an abstract class, otherwise the tests in
KafkaSourceSuiteBase will also run on their own, in addition to running
in its subclasses.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <zsxwing@gmail.com>
Closes #20412 from zsxwing/SPARK-23242.
(cherry picked from commit 073744985f439ca90afb9bd0bbc1332c53f7b4bb)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
(commit: 7aaf23cf8ab871a8e8877ec82183656ae5f4be7b)
The file was modifiedexternal/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala (diff)
Commit 20c0efe48df9ce622d97cf3d1274d877c0e3095c by gatorsmile
[SPARK-23214][SQL] cached data should not carry extra hint info
## What changes were proposed in this pull request?
This is a regression introduced by
https://github.com/apache/spark/pull/19864
When we look up the cache, we should not carry over the hint info, as
this cache entry might have been added by a plan that had hint info,
while the input plan for this lookup may have no hint info, or different
hint info.
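A hedged illustration of the scenario described (names and data are assumed,
not taken from the patch): a cache entry created from a hinted plan should not
force that hint onto later queries that merely read the same cached data.
```scala
import org.apache.spark.sql.functions.broadcast

val t = spark.range(10).toDF("id")
broadcast(t).cache()                        // cache entry built from a plan carrying a broadcast hint

t.join(spark.range(10).toDF("id"), "id")    // plain join, no hint requested here
  .explain()                                // should not be forced into a broadcast join by the cached hint
```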
## How was this patch tested?
a new test.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20394 from cloud-fan/cache.
(cherry picked from commit 5b5447c68ac79715e2256e487e1212861cdab1fc)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 20c0efe48df9ce622d97cf3d1274d877c0e3095c)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala (diff)
Commit 234c854bd203e9ba32be50b1f33cc118d0dbd9e8 by sowen
[MINOR][SS][DOC] Fix `Trigger` Scala/Java doc examples
## What changes were proposed in this pull request?
This PR fixes Scala/Java doc examples in `Trigger.java`.
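For context, a typical use of the `Trigger` API whose documentation examples
are fixed here (a sketch; `df` is an assumed streaming DataFrame):
```scala
import org.apache.spark.sql.streaming.Trigger

val query = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))   // fire a micro-batch every 2 seconds
  .start()
```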
## How was this patch tested?
N/A.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20401 from dongjoon-hyun/SPARK-TRIGGER.
(cherry picked from commit e7bc9f0524822a08d857c3a5ba57119644ceae85)
Signed-off-by: Sean Owen <sowen@cloudera.com>
(commit: 234c854bd203e9ba32be50b1f33cc118d0dbd9e8)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java (diff)
Commit 65600bfdb9417e5f2bd2e40312e139f592f238e9 by zsxwing
[SPARK-23245][SS][TESTS] Don't access `lastExecution.executedPlan` in
StreamTest
## What changes were proposed in this pull request?
`lastExecution.executedPlan` is a lazy val, so accessing it in
StreamTest may need to acquire the lock of `lastExecution`. The test may
wait forever when the streaming thread is holding the lock and running a
continuous Spark job.
This PR changes the check to test whether `s.lastExecution` is null, to
avoid accessing `lastExecution.executedPlan`.
## How was this patch tested?
Jenkins
Author: Jose Torres <jose@databricks.com>
Closes #20413 from zsxwing/SPARK-23245.
(cherry picked from commit 6328868e524121bd00595959d6d059f74e038a6b)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
(commit: 65600bfdb9417e5f2bd2e40312e139f592f238e9)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala (diff)
Commit 3b6fc286d105ae7de737c46e50cf941e6831ab98 by gatorsmile
[SPARK-23233][PYTHON] Reset the cache in asNondeterministic to set
deterministic properly
## What changes were proposed in this pull request?
Reproducer:
```python
from pyspark.sql.functions import udf

f = udf(lambda x: x)
spark.range(1).select(f("id"))  # cache JVM UDF instance.
f = f.asNondeterministic()
spark.range(1).select(f("id"))._jdf.logicalPlan().projectList().head().deterministic()
```
It should return `False`, but the current master returns `True`. It
seems this is because we cache the JVM UDF instance and then reuse it,
even after `deterministic` has been disabled, once it has been called.
## How was this patch tested?
Manually tested. I am not sure if I should add a test with a lot of JVM
accesses to the internal stuff. Let me know if anyone feels so; I will
add it.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20409 from HyukjinKwon/SPARK-23233.
(cherry picked from commit 3227d14feb1a65e95a2bf326cff6ac95615cc5ac)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 3b6fc286d105ae7de737c46e50cf941e6831ab98)
The file was modifiedpython/pyspark/sql/udf.py (diff)
The file was modifiedpython/pyspark/sql/tests.py (diff)
Commit 8ff0cc48b1b45ed41914822ffaaf8de8dff87b72 by hyukjinkwon
[SPARK-23248][PYTHON][EXAMPLES] Relocate module docstrings to the top in
PySpark examples
## What changes were proposed in this pull request?
This PR proposes to relocate the docstrings in modules of examples to
the top. Seems these are mistakes. So, for example, the below codes
```python
>>> help(aft_survival_regression)
```
shows the module docstrings for examples as below:
**Before**
```
Help on module aft_survival_regression:
NAME
   aft_survival_regression
...
DESCRIPTION
   # Licensed to the Apache Software Foundation (ASF) under one or more
   # contributor license agreements.  See the NOTICE file distributed
with
   # this work for additional information regarding copyright ownership.
   # The ASF licenses this file to You under the Apache License, Version
2.0
   # (the "License"); you may not use this file except in compliance
with
   # the License.  You may obtain a copy of the License at
   #
   #    http://www.apache.org/licenses/LICENSE-2.0
   #
   # Unless required by applicable law or agreed to in writing, software
   # distributed under the License is distributed on an "AS IS" BASIS,
   # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied.
   # See the License for the specific language governing permissions and
   # limitations under the License.
   #
...
(END)
```
**After**
```
Help on module aft_survival_regression:
NAME
   aft_survival_regression
...
DESCRIPTION
   An example demonstrating aft survival regression.
   Run with:
     bin/spark-submit
examples/src/main/python/ml/aft_survival_regression.py
(END)
```
## How was this patch tested?
Manually checked.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20416 from HyukjinKwon/module-docstring-example.
(cherry picked from commit b8c32dc57368e49baaacf660b7e8836eedab2df7)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: 8ff0cc48b1b45ed41914822ffaaf8de8dff87b72)
The file was modifiedexamples/src/main/python/ml/fpgrowth_example.py (diff)
The file was modifiedexamples/src/main/python/sql/basic.py (diff)
The file was modifiedexamples/src/main/python/ml/one_vs_rest_example.py (diff)
The file was modifiedexamples/src/main/python/ml/min_hash_lsh_example.py (diff)
The file was modifiedexamples/src/main/python/ml/correlation_example.py (diff)
The file was modifiedexamples/src/main/python/parquet_inputformat.py (diff)
The file was modifiedexamples/src/main/python/ml/aft_survival_regression.py (diff)
The file was modifiedexamples/src/main/python/ml/gaussian_mixture_example.py (diff)
The file was modifiedexamples/src/main/python/avro_inputformat.py (diff)
The file was modifiedexamples/src/main/python/sql/datasource.py (diff)
The file was modifiedexamples/src/main/python/ml/chi_square_test_example.py (diff)
The file was modifiedexamples/src/main/python/ml/isotonic_regression_example.py (diff)
The file was modifiedexamples/src/main/python/ml/bucketed_random_projection_lsh_example.py (diff)
The file was modifiedexamples/src/main/python/ml/kmeans_example.py (diff)
The file was modifiedexamples/src/main/python/ml/lda_example.py (diff)
The file was modifiedexamples/src/main/python/ml/bisecting_k_means_example.py (diff)
The file was modifiedexamples/src/main/python/ml/generalized_linear_regression_example.py (diff)
The file was modifiedexamples/src/main/python/sql/hive.py (diff)
The file was modifiedexamples/src/main/python/ml/imputer_example.py (diff)
The file was modifiedexamples/src/main/python/ml/train_validation_split.py (diff)
The file was modifiedexamples/src/main/python/ml/cross_validator.py (diff)
The file was modifiedexamples/src/main/python/ml/logistic_regression_summary_example.py (diff)
Commit 7ca2cd463db90fc166a10de1ebe58ccc795fbbe9 by sowen
[SPARK-23250][DOCS] Typo in JavaDoc/ScalaDoc for DataFrameWriter
## What changes were proposed in this pull request?
Fix typo in ScalaDoc for DataFrameWriter - originally stated "This is
applicable for all file-based data sources (e.g. Parquet, JSON) staring
Spark 2.1.0", should be "starting with Spark 2.1.0".
## How was this patch tested?
Check of correct spelling in ScalaDoc
Author: CCInCharge <charles.l.chen.clc@gmail.com>
Closes #20417 from CCInCharge/master.
(cherry picked from commit 686a622c93207564635569f054e1e6c921624e96)
Signed-off-by: Sean Owen <sowen@cloudera.com>
(commit: 7ca2cd463db90fc166a10de1ebe58ccc795fbbe9)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
Commit 588b9694c1967ff45774431441e84081ee6eb515 by wenchen
[SPARK-23196] Unify continuous and microbatch V2 sinks
## What changes were proposed in this pull request?
Replace streaming V2 sinks with a unified StreamWriteSupport interface,
with a shim to use it with microbatch execution.
Add a new SQL config to use for disabling V2 sinks, falling back to the
V1 sink implementation.
## How was this patch tested?
Existing tests, which in the case of Kafka (the only existing continuous
V2 sink) now use V2 for microbatch.
Author: Jose Torres <jose@databricks.com>
Closes #20369 from jose-torres/streaming-sink.
(cherry picked from commit 49b0207dc9327989c72700b4d04d2a714c92e159)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 588b9694c1967ff45774431441e84081ee6eb515)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousWriteSupport.java
The file was addedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaStreamWriter.scala
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MemorySinkV2Suite.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/StreamWriteSupport.java
The file was modifiedexternal/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochCoordinator.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/MicroBatchWriteSupport.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ConsoleWriter.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/ContinuousWriter.java
The file was removedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousWriter.scala
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/console.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java (diff)
The file was addedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/MicroBatchWriter.scala
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/memoryV2.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala (diff)
The file was modifiedsql/core/src/test/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister (diff)
The file was modifiedexternal/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala (diff)
Commit 5dda5db1229a20d7e3b0caab144af16da0787d56 by sameerag
[SPARK-23020] Ignore Flaky Test:
SparkLauncherSuite.testInProcessLauncher in Spark 2.3
(commit: 5dda5db1229a20d7e3b0caab144af16da0787d56)
The file was modifiedcore/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java (diff)
The file was modifiedsql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java (diff)
Commit 8229e155d84cf02479c5dd0df6d577aff5075c00 by hyukjinkwon
[SPARK-23238][SQL] Externalize SQLConf configurations exposed in
documentation
## What changes were proposed in this pull request?
This PR proposes to expose a few internal configurations found in the
documentation.
Also it fixes the description for `spark.sql.execution.arrow.enabled`.
It's quite self-explanatory.
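Illustrative only: once externalized, such a setting can be toggled at runtime
like any other SQL conf.
```scala
// Enable Arrow-based columnar data transfers (the conf whose description is fixed above).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
```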
## How was this patch tested?
N/A
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20403 from HyukjinKwon/minor-doc-arrow.
(cherry picked from commit 39d2c6b03488895a0acb1dd3c46329db00fdd357)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: 8229e155d84cf02479c5dd0df6d577aff5075c00)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit de66abafcf081c5cc4d1556d21c8ec21e1fefdf5 by wenchen
[SPARK-23219][SQL] Rename ReadTask to DataReaderFactory in data source
v2
## What changes were proposed in this pull request?
Currently we have `ReadTask` in the data source v2 reader, while in the
writer we have `DataWriterFactory`. To make the naming consistent and
better, this PR renames `ReadTask` to `DataReaderFactory`.
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes #20397 from gengliangwang/rename.
(cherry picked from commit badf0d0e0d1d9aa169ed655176ce9ae684d3905d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: de66abafcf081c5cc4d1556d21c8ec21e1fefdf5)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReaderFactory.java
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReader.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaBatchDataSourceV2.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/sources/v2/SimpleWritableDataSource.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/MicroBatchReader.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/Distribution.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ClusteredDistribution.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanUnsafeRow.java (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/Partitioning.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaAdvancedDataSourceV2.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/RateSourceV2Suite.scala (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaPartitionAwareDataSource.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamSourceV2.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/MicroBatchReadSupport.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousDataSourceRDDIter.scala (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java (diff)
Commit 4059454f979874caa9745861a2bcc60cac0bbffd by gatorsmile
[SPARK-23199][SQL] improved Removes repetition from group expressions in
Aggregate
## What changes were proposed in this pull request?
Currently, all Aggregate operations go through
RemoveRepetitionFromGroupExpressions, but when there are no group
expressions, or no duplicate group expressions among them, we do not
need to copy the logical plan.
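An illustrative query (names made up) with a duplicated grouping expression,
which is what this rule de-duplicates; with no duplicates, or no grouping
expressions at all, copying the plan is unnecessary:
```scala
val df = spark.range(10).selectExpr("id % 3 AS key", "id AS value")

// `key` appears twice in the grouping expressions.
df.groupBy("key", "key").count().explain(true)
```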
## How was this patch tested?
the existing test cases.
Author: caoxuewen <cao.xuewen@zte.com.cn>
Closes #20375 from heary-cao/RepetitionGroupExpressions.
(cherry picked from commit 54dd7cf4ef921bc9dc12f99cfb90d1da57939901)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 4059454f979874caa9745861a2bcc60cac0bbffd)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/AggregateOptimizeSuite.scala (diff)
Commit d68198d26e32ce98cbf0d3f8755d21dc72b3756d by hvanhovell
[SPARK-23223][SQL] Make stacking dataset transforms more performant
## What changes were proposed in this pull request?
It is a common pattern to apply multiple transforms to a `Dataset`
(using `Dataset.withColumn`, for example). This is currently quite
expensive because we run `CheckAnalysis` on the full plan and create an
encoder for each intermediate `Dataset`.
This PR extends the usage of the `AnalysisBarrier` to include
`CheckAnalysis`. By doing this we hide the already analyzed plan from
`CheckAnalysis`, because the barrier is a `LeafNode`. The
`AnalysisBarrier` is removed in the `FinishAnalysis` phase of the
optimizer.
We also make binding the `Dataset` encoder lazy. The bound encoder is
only needed when we materialize the dataset.
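A sketch of the "stacked transforms" pattern this change speeds up (sizes and
column names are illustrative):
```scala
val base = spark.range(1000).toDF("id")

// Each withColumn produces a new intermediate Dataset; previously every step
// re-ran CheckAnalysis over the whole plan and built a fresh encoder.
val out = (1 to 50).foldLeft(base) { (df, i) =>
  df.withColumn(s"c$i", df("id") + i)
}
out.explain()
```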
## How was this patch tested? Existing tests should cover this.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes #20402 from hvanhovell/SPARK-23223.
(cherry picked from commit 2d903cf9d3a827e54217dfc9f1e4be99d8204387)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
(commit: d68198d26e32ce98cbf0d3f8755d21dc72b3756d)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisTest.scala (diff)
Commit 6588e007e8e10da7cc9771451eeb4d3a2bdc6e0e by gatorsmile
[SPARK-22221][DOCS] Adding User Documentation for Arrow
## What changes were proposed in this pull request?
Adding user facing documentation for working with Arrow in Spark
Author: Bryan Cutler <cutlerb@gmail.com> Author: Li Jin
<ice.xelloss@gmail.com> Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19575 from BryanCutler/arrow-user-docs-SPARK-2221.
(cherry picked from commit 0d60b3213fe9a7ae5e9b208639f92011fdb2ca32)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 6588e007e8e10da7cc9771451eeb4d3a2bdc6e0e)
The file was modifieddocs/sql-programming-guide.md (diff)
The file was addedexamples/src/main/python/sql/arrow.py
Commit 438631031b2a7d79f8c639ef8ef0de1303bb9f2b by gatorsmile
[SPARK-22916][SQL][FOLLOW-UP] Update the Description of Join Selection
## What changes were proposed in this pull request? This PR is to update
the description of the join algorithm changes.
## How was this patch tested? N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20420 from gatorsmile/followUp22916.
(cherry picked from commit e30b34f7bd9a687eb43d636fffeb98fe235fcbf4)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 438631031b2a7d79f8c639ef8ef0de1303bb9f2b)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)
Commit 75131ee867bca17ca3aade5f832f1d49b7cfcff5 by irashid
[SPARK-23209][core] Allow credential manager to work when Hive not
available.
The JVM seems to be doing early binding of classes that the Hive
provider depends on, causing an error to be thrown before it was caught
by the code in the class.
The fix wraps the creation of the provider in a try..catch so that the
provider can be ignored when dependencies are missing.
Added a unit test (which fails without the fix), and also tested that
getting tokens still works in a real cluster.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #20399 from vanzin/SPARK-23209.
(cherry picked from commit b834446ec1338349f6d974afd96f677db3e8fd1a)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
(commit: 75131ee867bca17ca3aade5f832f1d49b7cfcff5)
The file was modifiedcore/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManagerSuite.scala (diff)
Commit 2858eaafaf06d3b8c55a8a5ed7831260244932cd by gatorsmile
[SPARK-22221][SQL][FOLLOWUP] Externalize
spark.sql.execution.arrow.maxRecordsPerBatch
## What changes were proposed in this pull request?
This is a followup to #19575, which added a section on setting the
maximum number of Arrow records per batch; this PR externalizes the conf
that was referenced in the docs.
## How was this patch tested? NA
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #20423 from
BryanCutler/arrow-user-doc-externalize-maxRecordsPerBatch-SPARK-22221.
(cherry picked from commit f235df66a4754cbb64d5b7b5cfd5a52bdd243b8a)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 2858eaafaf06d3b8c55a8a5ed7831260244932cd)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit a81ace196ec41b1b3f739294c965df270ee2ddd2 by wenchen
[SPARK-23207][SQL][FOLLOW-UP] Don't perform local sort for
DataFrame.repartition(1)
## What changes were proposed in this pull request?
In `ShuffleExchangeExec`, we don't need to insert extra local sort
before round-robin partitioning, if the new partitioning has only 1
partition, because under that case all output rows go to the same
partition.
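Illustrative only: with a single target partition every row lands in the same
place, so the extra local sort buys nothing and is now skipped.
```scala
val rdd = spark.range(100).repartition(1).rdd
println(rdd.getNumPartitions)  // 1 -- no benefit from sorting rows before partitioning
```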
## How was this patch tested?
The existing test cases.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes #20426 from jiangxb1987/repartition1.
(cherry picked from commit b375397b1678b7fe20a0b7f87a7e8b37ae5646ef)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: a81ace196ec41b1b3f739294c965df270ee2ddd2)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ForeachSinkSuite.scala (diff)
Commit bb7502f9a506d52365d7532b3b0281098dd85763 by gatorsmile
[SPARK-23157][SQL] Explain restriction on column expression in
withColumn()
## What changes were proposed in this pull request?
It's not obvious from the comments that any added column must be a
function of the dataset that we are adding it to. Add a comment to that
effect to Scala, Python and R Data* methods.
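A hedged sketch of the restriction being documented (DataFrame names are made
up): the column expression passed to `withColumn` must be derived from the
Dataset it is being added to.
```scala
val df1 = spark.range(3).toDF("id")
val df2 = spark.range(3).toDF("id")

df1.withColumn("plus_one", df1("id") + 1)   // fine: derived from df1 itself
// df1.withColumn("other", df2("id"))       // not supported: refers to a different Dataset,
//                                          // fails analysis with an unresolved attribute error
```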
Author: Henry Robinson <henry@cloudera.com>
Closes #20429 from henryr/SPARK-23157.
(cherry picked from commit 8b983243e45dfe2617c043a3229a7d87f4c4b44b)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: bb7502f9a506d52365d7532b3b0281098dd85763)
The file was modifiedR/pkg/R/DataFrame.R (diff)
The file was modifiedpython/pyspark/sql/dataframe.py (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
Commit 107d4e293867af42c87cbc1a93d14c5492c2ba84 by nickp
[SPARK-23138][ML][DOC] Multiclass logistic regression summary example
and user guide
## What changes were proposed in this pull request?
User guide and examples are updated to reflect multiclass logistic
regression summary which was added in
[SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139).
I did not make a separate summary example, but added the summary code to
the multiclass example that already existed. I don't see the need for a
separate example for the summary.
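A brief sketch of the multiclass summary access that the updated guide and
examples cover (assumes a `training` DataFrame with `label`/`features`
columns):
```scala
import org.apache.spark.ml.classification.LogisticRegression

val model = new LogisticRegression()
  .setFamily("multinomial")
  .fit(training)

val summary = model.summary            // multiclass training summary (SPARK-17139)
println(summary.accuracy)
println(summary.falsePositiveRateByLabel.mkString(", "))
```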
## How was this patch tested?
Docs and examples only. Ran all examples locally using spark-submit.
Author: sethah <shendrickson@cloudera.com>
Closes #20332 from sethah/multiclass_summary_example.
(cherry picked from commit 5056877e8bea56dd0f4dc9e3385669e1e78b2925)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
(commit: 107d4e293867af42c87cbc1a93d14c5492c2ba84)
The file was modifieddocs/ml-classification-regression.md (diff)
The file was modifiedexamples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala (diff)
The file was modifiedexamples/src/main/scala/org/apache/spark/examples/ml/MulticlassLogisticRegressionWithElasticNetExample.scala (diff)
The file was modifiedexamples/src/main/python/ml/multiclass_logistic_regression_with_elastic_net.py (diff)
The file was modifiedexamples/src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionSummaryExample.java (diff)
The file was modifiedexamples/src/main/java/org/apache/spark/examples/ml/JavaMulticlassLogisticRegressionWithElasticNetExample.java (diff)
Commit d3e623b19231e6d59793b86afa01f169fb2dedb2 by wenchen
[SPARK-23260][SPARK-23262][SQL] several data source v2 naming cleanup
## What changes were proposed in this pull request?
All other classes in the reader/writer package don't have `V2` in their
names, and the streaming reader/writer don't have `V2` either. It's more
consistent to remove `V2` from `DataSourceV2Reader` and
`DataSourceV2Writer`.
Also rename `DataSourceV2Options` to remove the `V2`; we should only
have `V2` in the root interface: `DataSourceV2`.
This PR also fixes some places where the mix-in interface doesn't extend
the interface it aims to mix into.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20427 from cloud-fan/ds-v2.
(cherry picked from commit 0a9ac0248b6514a1e83ff7e4c522424f01b8b78d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: d3e623b19231e6d59793b86afa01f169fb2dedb2)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/PackedRowWriterFactory.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/sources/v2/SimpleWritableDataSource.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/RateSourceProvider.scala (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaPartitionAwareDataSource.java (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaAdvancedDataSourceV2.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/ReadSupportWithSchema.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportPartitioning.java (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReaderFactory.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/ReadSupport.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2Options.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala (diff)
The file was addedsql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceOptionsSuite.scala
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownCatalystFilters.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/SessionConfigSupport.java (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/SupportsWriteInternalRow.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/MicroBatchWriter.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/MicroBatchReadSupport.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/WriterCommitMessage.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceReaderHolder.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/console.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/RateSourceV2Suite.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ConsoleWriter.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/MicroBatchReader.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanUnsafeRow.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriter.java (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/StreamWriteSupport.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownRequiredColumns.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportStatistics.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/memoryV2.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamSourceV2.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaBatchDataSourceV2.java (diff)
The file was removedsql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2OptionsSuite.scala
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceReader.java
Commit 7d96dc1acf7d7049a6e6c35de726f800c8160422 by wenchen
[SPARK-23222][SQL] Make DataFrameRangeSuite not flaky
## What changes were proposed in this pull request?
It is reported that the test `Cancelling stage in a query with Range` in
`DataFrameRangeSuite` fails a few times in unrelated PRs. I personally
saw it in my PR too.
This test is not actually very flaky, but it does fail occasionally.
Based on how the test works, I guess that is because `range` finishes
before the listener calls `cancelStage`.
I increased the range number from `1000000000L` to `100000000000L` and
count the range in one partition. I also reduced the `interval` of
checking the stage id. Hopefully this makes the test not flaky anymore.
## How was this patch tested?
The modified tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #20431 from viirya/SPARK-23222.
(cherry picked from commit 84bcf9dc88ffeae6fba4cfad9455ad75bed6e6f6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 7d96dc1acf7d7049a6e6c35de726f800c8160422)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/DataFrameRangeSuite.scala (diff)
Commit 2e0c1e5f3e47e4e35c14732b93a29d1a25e15662 by gatorsmile
[SPARK-23267][SQL] Increase spark.sql.codegen.hugeMethodLimit to 65535
## What changes were proposed in this pull request? Still saw the
performance regression introduced by `spark.sql.codegen.hugeMethodLimit`
in our internal workloads. There are two major issues in the current
solution.
- The size of the compiled bytecode is not identical to the bytecode
size of the method. The detection is still not accurate.
- The bytecode size of a single operator (e.g., `SerializeFromObject`)
could still exceed the 8K limit. We saw the performance regression in
such a scenario.
Since it is close to the release of 2.3, we decided to increase it to
65535 (roughly 64K) to avoid the perf regression.
## How was this patch tested? N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20434 from gatorsmile/revertConf.
(cherry picked from commit 31c00ad8b090d7eddc4622e73dc4440cd32624de)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 2e0c1e5f3e47e4e35c14732b93a29d1a25e15662)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit f4802dc8866d0316a6b555b0ab58a56d56d8c6fe by gatorsmile
[SPARK-23275][SQL] hive/tests have been failing when run locally on the
laptop (Mac) with OOM
## What changes were proposed in this pull request? Hive tests have been
failing when they are run locally (Mac OS) after a recent change in the
trunk. After running the tests for some time, the test fails with an OOM
error: unable to create new native thread.
I noticed the thread count goes all the way up to 2000+ after which we
start getting these OOM errors. Most of the threads seem to be related
to the connection pool in hive metastore (BoneCP-xxxxx-xxxx ). This
behaviour change is happening after we made the following change to
HiveClientImpl.reset()
```scala
def reset(): Unit = withHiveState {
  try {
    // code
  } finally {
    runSqlHive("USE default")  // ===> this is causing the issue
  }
}
```
I am proposing to temporarily back out part of a fix made to address
SPARK-23000 to resolve this issue while we work out the exact reason for
this sudden increase in thread counts.
## How was this patch tested? Ran hive/test multiple times in different
machines.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #20441 from dilipbiswal/hive_tests.
(cherry picked from commit 58fcb5a95ee0b91300138cd23f3ce2165fab597f)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: f4802dc8866d0316a6b555b0ab58a56d56d8c6fe)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (diff)
Commit 7b9fe08658b529901a5f22bf81ffdd4410180809 by gatorsmile
[SPARK-23261][PYSPARK][BACKPORT-2.3] Rename Pandas UDFs
This PR is to backport https://github.com/apache/spark/pull/20428 to
Spark 2.3 without adding the changes regarding `GROUPED AGG PANDAS UDF`
---
## What changes were proposed in this pull request? Rename the public
APIs and names of pandas udfs.
- `PANDAS SCALAR UDF` -> `SCALAR PANDAS UDF`
- `PANDAS GROUP MAP UDF` -> `GROUPED MAP PANDAS UDF`
## How was this patch tested? The existing tests
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20439 from gatorsmile/backport2.3.
(commit: 7b9fe08658b529901a5f22bf81ffdd4410180809)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala (diff)
The file was modifiedpython/pyspark/rdd.py (diff)
The file was modifiedexamples/src/main/python/sql/arrow.py (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala (diff)
The file was modifiedpython/pyspark/sql/tests.py (diff)
The file was modifiedpython/pyspark/worker.py (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala (diff)
The file was modifieddocs/sql-programming-guide.md (diff)
The file was modifiedpython/pyspark/sql/functions.py (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/api/python/PythonRunner.scala (diff)
The file was modifiedpython/pyspark/sql/udf.py (diff)
The file was modifiedpython/pyspark/sql/group.py (diff)
Commit 6ed0d57f86e76f37c4ca1c6d721fc235dcec520e by gatorsmile
[SPARK-23276][SQL][TEST] Enable UDT tests in
(Hive)OrcHadoopFsRelationSuite
## What changes were proposed in this pull request?
Like Parquet, ORC test suites should enable UDT tests.
## How was this patch tested?
Pass the Jenkins with newly enabled test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20440 from dongjoon-hyun/SPARK-23276.
(cherry picked from commit 77866167330a665e174ae08a2f8902ef9dc3438b)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 6ed0d57f86e76f37c4ca1c6d721fc235dcec520e)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala (diff)
Commit b8778321bbb443f2d51ab7e2b1aff3d1e4236e35 by gatorsmile
[SPARK-23274][SQL] Fix ReplaceExceptWithFilter when the right's Filter
contains the references that are not in the left output
## What changes were proposed in this pull request? This PR fixes the
`ReplaceExceptWithFilter` rule when the right side's Filter contains
references that are not in the left side's output.
Before this PR, we got an error like
```
java.util.NoSuchElementException: key not found: a
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
```
After this PR, `ReplaceExceptWithFilter` will not take effect in this
case.
## How was this patch tested? Added tests
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20444 from gatorsmile/fixReplaceExceptWithFilter.
(cherry picked from commit ca04c3ff2387bf0a4308a4b010154e6761827278)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: b8778321bbb443f2d51ab7e2b1aff3d1e4236e35)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala (diff)
Commit ab5a5105502c545bed951538f0ce9409cfbde154 by jerryshao
[SPARK-23279][SS] Avoid triggering distributed job for Console sink
## What changes were proposed in this pull request?
The Console sink redistributes collected local data and triggers a
distributed job in each batch. This is unnecessary, so change it to a
local job.
## How was this patch tested?
Existing UT and manual verification.
Author: jerryshao <sshao@hortonworks.com>
Closes #20447 from jerryshao/console-minor.
(cherry picked from commit 8c6a9c90a36a938372f28ee8be72178192fbc313)
Signed-off-by: jerryshao <sshao@hortonworks.com>
(commit: ab5a5105502c545bed951538f0ce9409cfbde154)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ConsoleWriter.scala (diff)
Commit 7ec8ad7ba5b45658b55dd278b4c7ca2e35acfdd3 by wenchen
[SPARK-23272][SQL] add calendar interval type support to ColumnVector
## What changes were proposed in this pull request?
`ColumnVector` is meant to support all the data types, but
`CalendarIntervalType` is missing. We do support the interval type
for inner fields, e.g. `ColumnarRow` and `ColumnarArray` both support
it, so it is odd not to support the interval type at the top level.
This PR adds the interval type support.
This PR also makes `ColumnVector.getChild` protected. It previously had
to be public because `MutableColumnarRow.getInterval` needed it; now the
interval implementation lives in `ColumnVector.getInterval`.
## How was this patch tested?
a new test.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20438 from cloud-fan/interval.
(cherry picked from commit 695f7146bca342a0ee192d8c7f5ec48d4d8577a8)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 7ec8ad7ba5b45658b55dd278b4c7ca2e35acfdd3)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java (diff)
Commit c83246c9a3fe7557e7e1d226edf18a7e96730d18 by nickp
[SPARK-23112][DOC] Update ML migration guide with breaking and behavior
changes.
Add breaking changes, as well as update behavior changes, to `2.3` ML
migration guide.
## How was this patch tested?
Doc only
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #20421 from MLnick/SPARK-23112-ml-guide.
(cherry picked from commit 161a3f2ae324271a601500e3d2900db9359ee2ef)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
(commit: c83246c9a3fe7557e7e1d226edf18a7e96730d18)
The file was modifieddocs/ml-guide.md (diff)
Commit 33f17b28b3448a6f0389d2ee93bb5d49a02f288c by wenchen
revert [SPARK-22785][SQL] remove ColumnVector.anyNullsSet
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/19980, we thought `anyNullsSet`
could simply be implemented as `numNulls() > 0`. This is logically true,
but may cause performance problems.
`OrcColumnVector` is an example: it doesn't have a `numNulls`
property, only a `noNulls` property, so we lose a lot of
performance if we use `numNulls() > 0` to check for nulls.
This PR simply reverts #19980, renaming the method to `hasNull`.
Better name suggestions are welcome, e.g. `nullable`?
## How was this patch tested?
existing test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20452 from cloud-fan/null.
(cherry picked from commit 48dd6a4c79e33a8f2dba8349b58aa07e4796a925)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 33f17b28b3448a6f0389d2ee93bb5d49a02f288c)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java (diff)
Commit f5f21e8c4261c0dfe8e3e788a30b38b188a18f67 by wenchen
[SPARK-23249][SQL] Improved block merging logic for partitions
## What changes were proposed in this pull request?
Change DataSourceScanExec so that, when grouping blocks together into
partitions, it also checks the end of the sorted list of splits to
fill out partitions more efficiently.
## How was this patch tested? Updated an old test to reflect the new
logic, which causes the number of partitions to drop from 4 to 3. Also,
an existing test covers large non-splittable files at
https://github.com/glentakahashi/spark/blob/c575977a5952bf50b605be8079c9be1e30f3bd36/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala#L346
## Rationale
The current bin-packing method of next-fit descending (NFD) for blocks
into partitions is sub-optimal in a lot of cases and results in extra
partitions, an uneven distribution of block counts across partitions, and
an uneven distribution of partition sizes.
As an example, 128 files ranging from 1MB, 2MB, ..., up to 128MB will
result in 82 partitions with the current algorithm, but only 64 using
this algorithm. Also in this example, the max number of blocks per
partition under NFD is 13, while with this algorithm it is 2.
More generally, in a simulation of 1000 runs using a 128MB block size and
between 1 and 1000 normally distributed file sizes between 1 and 500MB,
you can see an approximately 5% reduction in partition counts and a large
reduction in the standard deviation of blocks per partition.
This algorithm also runs in O(n) time as NFD does, and in every case
produces strictly better results than NFD.
Overall, the more even distribution of blocks across partitions and the
resulting reduced partition counts should yield a small but significant
performance increase across the board.
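As a rough illustration of the idea (a hypothetical standalone sketch, not
the actual `DataSourceScanExec` code), the merge can top up each partition
from the small end of the size-sorted split list before closing it:
```scala
import scala.collection.mutable.ArrayBuffer

// sizes are split sizes in bytes; maxBytes plays the role of maxSplitBytes
def mergeSplits(sizes: Seq[Long], maxBytes: Long): Seq[Seq[Long]] = {
  val sorted = sizes.sortBy(-_).toVector              // largest first
  val partitions = ArrayBuffer.empty[Seq[Long]]
  var lo = 0
  var hi = sorted.length - 1
  while (lo <= hi) {
    val current = ArrayBuffer(sorted(lo))
    var bytes = sorted(lo)
    lo += 1
    // keep taking the largest remaining splits that still fit
    while (lo <= hi && bytes + sorted(lo) <= maxBytes) {
      bytes += sorted(lo); current += sorted(lo); lo += 1
    }
    // then fill leftover capacity from the small end of the sorted list
    while (lo <= hi && bytes + sorted(hi) <= maxBytes) {
      bytes += sorted(hi); current += sorted(hi); hi -= 1
    }
    partitions += current.toSeq
  }
  partitions.toSeq
}
```
Under this sketch the 1MB-128MB example above tends to pack a large file
together with small ones instead of leaving partitions half full.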
Author: Glen Takahashi <gtakahashi@palantir.com>
Closes #20372 from glentakahashi/feature/improved-block-merging.
(cherry picked from commit 8c21170decfb9ca4d3233e1ea13bd1b6e3199ed9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: f5f21e8c4261c0dfe8e3e788a30b38b188a18f67)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala (diff)
Commit 8ee3a71c9c1b8ed51c5916635d008fdd49cf891a by gatorsmile
[SPARK-23281][SQL] Query produces results in incorrect order when a
composite order by clause refers to both original columns and aliases
## What changes were proposed in this pull request? Here is the test
snippet.
```scala
scala> Seq[(Integer, Integer)](
    |         (1, 1),
    |         (1, 3),
    |         (2, 3),
    |         (3, 3),
    |         (4, null),
    |         (5, null)
    |       ).toDF("key", "value").createOrReplaceTempView("src")
scala> sql(
    |         """
    |           |SELECT MAX(value) as value, key as col2
    |           |FROM src
    |           |GROUP BY key
    |           |ORDER BY value desc, key
    |         """.stripMargin).show
+-----+----+
|value|col2|
+-----+----+
|    3|   3|
|    3|   2|
|    3|   1|
| null|   5|
| null|   4|
+-----+----+
```
Here is the explain output:
```
== Parsed Logical Plan ==
'Sort ['value DESC NULLS LAST, 'key ASC NULLS FIRST], true
+- 'Aggregate ['key], ['MAX('value) AS value#9, 'key AS col2#10]
  +- 'UnresolvedRelation `src`
== Analyzed Logical Plan ==
value: int, col2: int
Project [value#9, col2#10]
+- Sort [value#9 DESC NULLS LAST, col2#10 DESC NULLS LAST], true
  +- Aggregate [key#5], [max(value#6) AS value#9, key#5 AS col2#10]
     +- SubqueryAlias src
        +- Project [_1#2 AS key#5, _2#3 AS value#6]
           +- LocalRelation [_1#2, _2#3]
```
The sort direction is being wrongly changed from ASC to DESC while
resolving `Sort` in `resolveAggregateFunctions`.
The above test case models TPC-DS Q71, and thus we have the same issue in
Q71 as well.
## How was this patch tested? A few tests are added in SQLQuerySuite.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #20453 from dilipbiswal/local_spark.
(commit: 8ee3a71c9c1b8ed51c5916635d008fdd49cf891a)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit 7ccfc753086c3859abe358c87f2e7b7a30422d5e by hyukjinkwon
[SPARK-23157][SQL][FOLLOW-UP] DataFrame -> SparkDataFrame in R comment
Author: Henry Robinson <henry@cloudera.com>
Closes #20443 from henryr/SPARK-23157.
(cherry picked from commit f470df2fcf14e6234c577dc1bdfac27d49b441f5)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: 7ccfc753086c3859abe358c87f2e7b7a30422d5e)
The file was modifiedpython/pyspark/sql/dataframe.py (diff)
The file was modifiedR/pkg/R/DataFrame.R (diff)
Commit 0d0f5793686b98305f98a3d9e494bbfcee9cff13 by wenchen
[SPARK-23280][SQL] add map type support to ColumnVector
## What changes were proposed in this pull request?
Fill the last missing piece of `ColumnVector`: the map type support.
The idea is similar to the array type support. A map is basically 2
arrays: keys and values. We ask the implementations to provide a key
array, a value array, and an offset and length to specify the range of
this map in the key/value array.
In `WritableColumnVector`, we put the key array in the first child vector,
the value array in the second child vector, and the offsets and lengths in
the current vector, which is very similar to how the array type is
implemented.
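Conceptually (a hypothetical standalone sketch, not the Spark API itself),
reading a map row back from a flat key array, value array, and per-row
offset/length looks like this:
```scala
// flat encoding of two map rows: {1 -> "a", 2 -> "b"} and {3 -> "c"}
val keys    = Array(1, 2, 3)
val values  = Array("a", "b", "c")
val offsets = Array(0, 2)   // where each row's entries start in keys/values
val lengths = Array(2, 1)   // how many entries each row has

def mapAt(row: Int): Map[Int, String] = {
  val start = offsets(row)
  (start until start + lengths(row)).map(i => keys(i) -> values(i)).toMap
}
// mapAt(0) == Map(1 -> "a", 2 -> "b");  mapAt(1) == Map(3 -> "c")
```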
## How was this patch tested?
a new test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20450 from cloud-fan/map.
(cherry picked from commit 52e00f70663a87b5837235bdf72a3e6f84e11411)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 0d0f5793686b98305f98a3d9e494bbfcee9cff13)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarMap.java
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java (diff)
Commit 59e89a2990e4f66839c91f48f41157dac6e670ad by gatorsmile
[SPARK-23268][SQL] Reorganize packages in data source V2
## What changes were proposed in this pull request? 1. Create a new
package for partitioning/distribution related classes.
   As Spark will add new concrete implementations of `Distribution` in
new releases, it is good to
   have a new package for partitioning/distribution related classes.
2. Move streaming related classes to the package
`org.apache.spark.sql.sources.v2.reader/writer.streaming`, instead of
`org.apache.spark.sql.sources.v2.streaming.reader/writer`, so that
there won't be a reader/writer package inside the streaming package, which
is quite confusing. Before the change:
```
v2
├── reader
├── streaming
│   ├── reader
│   └── writer
└── writer
```
After the change:
```
v2
├── reader
│   └── streaming
└── writer
   └── streaming
```
## How was this patch tested? Unit test.
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes #20435 from gengliangwang/new_pkg.
(cherry picked from commit 56ae32657e9e5d1e30b62afe77d9e14eb07cf4fb)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 59e89a2990e4f66839c91f48f41157dac6e670ad)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamSourceV2.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/MicroBatchReadSupport.java
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ClusteredDistribution.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/streaming/StreamWriter.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/RateStreamOffset.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousDataSourceRDDIter.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportPartitioning.java (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/PartitionOffset.java
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/PartitionOffset.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourcePartitioning.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/Distribution.java
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/MicroBatchReader.java
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ContinuousReadSupport.java
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/MicroBatchReader.java
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousDataReader.java
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaStreamWriter.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/memoryV2.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaPartitionAwareDataSource.java (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/partitioning/Partitioning.java
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/partitioning/Distribution.java
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/StreamWriteSupport.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/console.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/Offset.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/RateSourceV2Suite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ConsoleWriter.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/Offset.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/Partitioning.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/partitioning/ClusteredDistribution.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochCoordinator.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/RateSourceProvider.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/MicroBatchWriter.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousDataReader.java
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/MicroBatchReadSupport.java
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceOffset.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/StreamWriteSupport.java
Commit 871fd48dc381d48e67f7efcc45cc534d36e4ee6e by gatorsmile
[SPARK-21396][SQL] Fixes MatchError when UDTs are passed through Hive
Thriftserver
Signed-off-by: Atallah Hezbor <atallahhezbor@gmail.com>
## What changes were proposed in this pull request?
This PR proposes modifying the match statement that gets the columns of
a row in HiveThriftServer. There was previously no case for
`UserDefinedType`, so querying a table that contained them would throw a
match error. The changes catch that case and return the string
representation.
## How was this patch tested?
While I would have liked to add a unit test, I couldn't easily
incorporate UDTs into the ``HiveThriftServer2Suites`` pipeline. With
some guidance I would be happy to push a commit with tests.
Instead I did a manual test by loading a `DataFrame` with a Point UDT in a
Spark shell with a HiveThriftServer2 running, then connecting to the
server in beeline and querying that table.
Here is the result before the change
```
0: jdbc:hive2://localhost:10000> select * from chicago;
Error: scala.MatchError: org.apache.spark.sql.PointUDT@2d980dc3 (of class
org.apache.spark.sql.PointUDT) (state=,code=0)
```
And after the change:
```
0: jdbc:hive2://localhost:10000> select * from chicago;
+---------------------------------------+--------------+------------------------+---------------------+--+
|                __fid__                | case_number  |          dtg           |        geom         |
+---------------------------------------+--------------+------------------------+---------------------+--+
| 109602f9-54f8-414b-8c6f-42b1a337643e  | 2            | 2016-01-01 19:00:00.0  | POINT (-77 38)      |
| 709602f9-fcff-4429-8027-55649b6fd7ed  | 1            | 2015-12-31 19:00:00.0  | POINT (-76.5 38.5)  |
| 009602f9-fcb5-45b1-a867-eb8ba10cab40  | 3            | 2016-01-02 19:00:00.0  | POINT (-78 39)      |
+---------------------------------------+--------------+------------------------+---------------------+--+
```
Author: Atallah Hezbor <atallahhezbor@gmail.com>
Closes #20385 from atallahhezbor/udts_over_hive.
(cherry picked from commit b2e7677f4d3d8f47f5f148680af39d38f2b558f0)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 871fd48dc381d48e67f7efcc45cc534d36e4ee6e)
The file was modifiedsql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/HiveUtilsSuite.scala (diff)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala (diff)
Commit 205bce974b86bef9d9d507e1b89549cb01c7b535 by nickp
[SPARK-23107][ML] ML 2.3 QA: New Scala APIs, docs.
## What changes were proposed in this pull request? Audit new APIs and
docs in 2.3.0.
## How was this patch tested? No test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #20459 from yanboliang/SPARK-23107.
(cherry picked from commit e15da5b14c8d845028365a609c0c66731d024ee7)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
(commit: 205bce974b86bef9d9d507e1b89549cb01c7b535)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala (diff)
Commit 6b6bc9c4ebeb4c1ebfea3f6ddff0d2f502011e0c by ueshin
[SPARK-23280][SQL][FOLLOWUP] Fix Java style check issues.
## What changes were proposed in this pull request?
This is a follow-up of #20450, which broke lint-java checks. This PR
fixes the lint-java issues.
```
[ERROR]
src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java:[20,8]
(imports) UnusedImports: Unused import -
org.apache.spark.sql.catalyst.util.MapData.
[ERROR]
src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java:[21,8]
(imports) UnusedImports: Unused import -
org.apache.spark.sql.catalyst.util.MapData.
[ERROR]
src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java:[22,8]
(imports) UnusedImports: Unused import -
org.apache.spark.sql.catalyst.util.MapData.
```
## How was this patch tested?
Checked manually in my local environment.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes #20468 from ueshin/issues/SPARK-23280/fup1.
(cherry picked from commit 8bb70b068ea782e799e45238fcb093a6acb0fc9f)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 6b6bc9c4ebeb4c1ebfea3f6ddff0d2f502011e0c)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java (diff)
Commit 3aa780ef34492ab1746bbcde8a75bfa8c3d929e1 by ueshin
[SPARK-23280][SQL][FOLLOWUP] Enable `MutableColumnarRow.getMap()`.
## What changes were proposed in this pull request?
This is a follow-up PR of #20450. We should've enabled
`MutableColumnarRow.getMap()` as well.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes #20471 from ueshin/issues/SPARK-23280/fup2.
(cherry picked from commit 89e8d556b93d1bf1b28fe153fd284f154045b0ee)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 3aa780ef34492ab1746bbcde8a75bfa8c3d929e1)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java (diff)
Commit 2549beae20fe8761242f6fb9cda35ff06a652897 by wenchen
[SPARK-23289][CORE] OneForOneBlockFetcher.DownloadCallback.onData should
write the buffer fully
## What changes were proposed in this pull request?
`channel.write(buf)` may not write the whole buffer, since the underlying
channel is a FileChannel; we should retry until the whole buffer is
written.
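For reference, the retry loop described above looks roughly like this (a
minimal sketch with assumed names, not the exact `DownloadCallback` code):
```scala
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// a single write() on a FileChannel may consume only part of the buffer,
// so keep writing until nothing remains
def writeFully(channel: WritableByteChannel, buf: ByteBuffer): Unit = {
  while (buf.hasRemaining) {
    channel.write(buf)
  }
}
```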
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <zsxwing@gmail.com>
Closes #20461 from zsxwing/SPARK-23289.
(cherry picked from commit ec63e2d0743a4f75e1cce21d0fe2b54407a86a4a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 2549beae20fe8761242f6fb9cda35ff06a652897)
The file was modifiedcore/src/test/scala/org/apache/spark/FileSuite.scala (diff)
The file was modifiedcommon/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java (diff)
Commit 2db7e49dbb58c5b44e22a6a3aae4fa04cd89e5e8 by gatorsmile
[SPARK-13983][SQL] Fix HiveThriftServer2 can not get "--hiveconf" and
''--hivevar" variables since 2.0
## What changes were proposed in this pull request?
`--hiveconf` and `--hivevar` variables no longer work since Spark 2.0.
The `spark-sql` client has been fixed by
[SPARK-15730](https://issues.apache.org/jira/browse/SPARK-15730) and
[SPARK-18086](https://issues.apache.org/jira/browse/SPARK-18086), but
`beeline`/[`Spark SQL
HiveThriftServer2`](https://github.com/apache/spark/blob/v2.1.1/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala)
is still broken. This pull request fixes it.
This pull request works for both the `JDBC client` and `beeline`.
## How was this patch tested?
Unit tests for the `JDBC client`; manual tests for `beeline`:
```
git checkout origin/pr/17886
dev/make-distribution.sh --mvn mvn  --tgz -Phive -Phive-thriftserver
-Phadoop-2.6 -DskipTests
tar -zxf spark-2.3.0-SNAPSHOT-bin-2.6.5.tgz && cd
spark-2.3.0-SNAPSHOT-bin-2.6.5
sbin/start-thriftserver.sh
```
```
cat <<EOF > test.sql
select '\${a}', '\${b}';
EOF
beeline -u jdbc:hive2://localhost:10000 --hiveconf a=avalue --hivevar b=bvalue -f test.sql
```
Author: Yuming Wang <wgyumg@gmail.com>
Closes #17886 from wangyum/SPARK-13983-dev.
(cherry picked from commit f051f834036e63d5e480d86440ce39924f979e82)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 2db7e49dbb58c5b44e22a6a3aae4fa04cd89e5e8)
The file was modifiedsql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala (diff)
The file was modifiedsql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java (diff)
The file was modifiedsql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala (diff)
Commit 07a8f4ddfc2edccde9b1d28b4436a596d2f7db63 by gatorsmile
[SPARK-23293][SQL] fix data source v2 self join
`DataSourceV2Relation` should extend `MultiInstanceRelation`, to take
care of self-join.
a new test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20466 from cloud-fan/dsv2-selfjoin.
(cherry picked from commit 73da3b6968630d9e2cafc742ccb6d4eb54957df4)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 07a8f4ddfc2edccde9b1d28b4436a596d2f7db63)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala (diff)
Commit ab23785c70229acd6c22218f722337cf0a9cc55b by vanzin
[SPARK-23296][YARN] Include stacktrace in YARN-app diagnostic
## What changes were proposed in this pull request?
Include stacktrace in the diagnostics message upon abnormal unregister
from RM
## How was this patch tested? Tested with a failing job, and confirmed a
stacktrace in the client output and YARN webUI.
Author: Gera Shegalov <gera@apache.org>
Closes #20470 from gerashegalov/gera/stacktrace-diagnostics.
(cherry picked from commit 032c11b83f0d276bf8085992229b8c598f02798a)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: ab23785c70229acd6c22218f722337cf0a9cc55b)
The file was modifiedresource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (diff)
Commit 7baae3aef34d16cb0a2d024d96027d8378a03927 by wenchen
[SPARK-23284][SQL] Document the behavior of several ColumnVector's get
APIs when accessing null slot
## What changes were proposed in this pull request?
For some ColumnVector get APIs such as getDecimal, getBinary, getStruct,
getArray, getInterval, and getUTF8String, we should clearly document their
behavior when accessing a null slot: they should return null in this
case. Then we can remove null checks from the places using the above APIs.
For the APIs returning primitive values, like getInt, getInts, etc., this
also documents their behavior when accessing null slots: their return
values are undefined and can be anything.
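In practice this means only the primitive getters need an explicit null
check; for example (an illustrative sketch, not code from this patch):
```scala
import org.apache.spark.sql.vectorized.ColumnVector

// object getters such as getUTF8String or getDecimal return null for null
// slots, while primitive getters such as getInt are undefined there, so
// guard the primitive read with isNullAt
def intOrDefault(col: ColumnVector, rowId: Int, default: Int): Int =
  if (col.isNullAt(rowId)) default else col.getInt(rowId)
```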
## How was this patch tested?
Added tests into `ColumnarBatchSuite`.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #20455 from viirya/SPARK-23272-followup.
(cherry picked from commit 90848d507457d30abb36e3ba07618dfc87c34cd6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 7baae3aef34d16cb0a2d024d96027d8378a03927)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java (diff)
Commit 2b07452cacb4c226c815a216c4cfea2a04227700 by gatorsmile
[SPARK-23301][SQL] data source column pruning should work for arbitrary
expressions
This PR fixes a mistake in the `PushDownOperatorsToDataSource` rule: the
column pruning logic was incorrect for `Project`.
Tested with a new test case for column pruning with arbitrary expressions,
and by improving the existing tests to make sure
`PushDownOperatorsToDataSource` really works.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20476 from cloud-fan/push-down.
(cherry picked from commit 19c7c7ebdef6c1c7a02ebac9af6a24f521b52c37)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 2b07452cacb4c226c815a216c4cfea2a04227700)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaAdvancedDataSourceV2.java (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala (diff)
Commit e5e9f9a430c827669ecfe9d5c13cc555fc89c980 by wenchen
[SPARK-23312][SQL] add a config to turn off vectorized cache reader
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-23309 reported a performance
regression for cached tables in Spark 2.3. While the investigation is
still ongoing, this PR adds a conf to turn off the vectorized cache
reader, to unblock the 2.3 release.
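For illustration, turning the reader off would look roughly like this; the
exact config key is an assumption based on this PR's description:
```scala
// assumed config key added by this PR; when false, cached table scans fall
// back to the non-vectorized (row-based) reader as in 2.2
spark.conf.set("spark.sql.inMemoryColumnarStorage.enableVectorizedReader", "false")
spark.range(10).toDF("id").cache().count()
```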
## How was this patch tested?
a new test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20483 from cloud-fan/cache.
(cherry picked from commit b9503fcbb3f4a3ce263164d1f11a8e99b9ca5710)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: e5e9f9a430c827669ecfe9d5c13cc555fc89c980)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala (diff)
Commit 56eb9a310217a5372bdba1e24e4af0d4de1829ca by tathagata.das1565
[SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up
## What changes were proposed in this pull request? Further
clarification of caveats in using stream-stream outer joins.
## How was this patch tested? N/A
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #20494 from tdas/SPARK-23064-2.
(cherry picked from commit eaf35de2471fac4337dd2920026836d52b1ec847)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
(commit: 56eb9a310217a5372bdba1e24e4af0d4de1829ca)
The file was modifieddocs/structured-streaming-programming-guide.md (diff)
Commit dcd0af4be752ab61b8caf36f70d98e97c6925473 by gatorsmile
[SQL] Minor doc update: Add an example in DataFrameReader.schema
## What changes were proposed in this pull request? This patch adds a
small example to the string-based overload of the schema function. It
isn't obvious how to use it, so an example is useful.
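For instance, the string form of `schema` can be used roughly like this
(the path and column names below are illustrative):
```scala
// schema supplied as a DDL-formatted string instead of a StructType
val people = spark.read
  .schema("name STRING, age INT")
  .csv("/tmp/people.csv")
```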
## How was this patch tested? N/A - doc only.
Author: Reynold Xin <rxin@databricks.com>
Closes #20491 from rxin/schema-doc.
(cherry picked from commit 3ff83ad43a704cc3354ef9783e711c065e2a1a22)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: dcd0af4be752ab61b8caf36f70d98e97c6925473)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
Commit b614c083a4875c874180a93b08ea5031fa90cfec by gatorsmile
[SPARK-23317][SQL] rename ContinuousReader.setOffset to setStartOffset
## What changes were proposed in this pull request?
In the document of `ContinuousReader.setOffset`, we say this method is
used to specify the start offset. We also have a
`ContinuousReader.getStartOffset` to get the value back. I think it
makes more sense to rename `ContinuousReader.setOffset` to
`setStartOffset`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20486 from cloud-fan/rename.
(cherry picked from commit fe73cb4b439169f16cc24cd851a11fd398ce7edf)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: b614c083a4875c874180a93b08ea5031fa90cfec)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/RateSourceV2Suite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala (diff)
Commit 1bcb3728db11be6e34060eff670fc8245ad571c6 by gatorsmile
[SPARK-23311][SQL][TEST] add FilterFunction test case for test
CombineTypedFilters
## What changes were proposed in this pull request?
The current test cases for CombineTypedFilters lack a test for
FilterFunction, so let's add one. In addition, let's extract a
common LocalRelation in TypedFilterOptimizationSuite's existing test
cases.
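For reference, a `FilterFunction`-based typed filter (the kind the new test
exercises) looks roughly like this; the dataset and predicates are
illustrative, and a SparkSession named `spark` is assumed to be in scope:
```scala
import org.apache.spark.api.java.function.FilterFunction
import spark.implicits._

val ds = Seq(1, 2, 3, 4, 5).toDS()
// two adjacent typed filters that CombineTypedFilters can merge into one
val filtered = ds
  .filter(new FilterFunction[Int] { override def call(v: Int): Boolean = v > 1 })
  .filter(new FilterFunction[Int] { override def call(v: Int): Boolean = v < 5 })
```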
## How was this patch tested?
add new test cases.
Author: caoxuewen <cao.xuewen@zte.com.cn>
Closes #20482 from heary-cao/TypedFilterOptimizationSuite.
(cherry picked from commit 63b49fa2e599080c2ba7d5189f9dde20a2e01fb4)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 1bcb3728db11be6e34060eff670fc8245ad571c6)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala (diff)
Commit 4de206182c8a1f76e1e5e6b597c4b3890e2ca255 by gatorsmile
[SPARK-23305][SQL][TEST] Test `spark.sql.files.ignoreMissingFiles` for
all file-based data sources
## What changes were proposed in this pull request?
Like Parquet, all file-based data sources handle
`spark.sql.files.ignoreMissingFiles` correctly. We should have
test coverage for feature parity and to prevent future
accidental regressions across all data sources.
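For context, the behavior being tested can be exercised roughly like this
(illustrative path; the config key is the one named above):
```scala
// with the flag on, files deleted after query planning are skipped
// instead of failing the read
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
spark.read.parquet("/tmp/some_table").count()
```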
## How was this patch tested?
Pass Jenkins with a newly added test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20479 from dongjoon-hyun/SPARK-23305.
(cherry picked from commit 522e0b1866a0298669c83de5a47ba380dc0b7c84)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 4de206182c8a1f76e1e5e6b597c4b3890e2ca255)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala (diff)
Commit be3de87914f29a56137e391d0cf494c0d1a7ba12 by gatorsmile
[MINOR][DOC] Use raw triple double quotes around docstrings where there
are occurrences of backslashes.
From [PEP 257](https://www.python.org/dev/peps/pep-0257/):
> For consistency, always use """triple double quotes""" around
docstrings. Use r"""raw triple double quotes""" if you use any
backslashes in your docstrings. For Unicode docstrings, use u"""Unicode
triple-quoted strings""".
For example, this is what `help(kafka_wordcount)` shows:
```
DESCRIPTION
   Counts words in UTF8 encoded, '
   ' delimited text received from the network every second.
    Usage: kafka_wordcount.py <zk> <topic>
     To run this on your local machine, you need to setup Kafka and
create a producer first, see
    http://kafka.apache.org/documentation.html#quickstart
     and then run the example
       `$ bin/spark-submit --jars     
external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar
      examples/src/main/python/streaming/kafka_wordcount.py     
localhost:2181 test`
```
This is what it shows after the fix:
```
DESCRIPTION
   Counts words in UTF8 encoded, '\n' delimited text received from the
network every second.
   Usage: kafka_wordcount.py <zk> <topic>
    To run this on your local machine, you need to setup Kafka and
create a producer first, see
   http://kafka.apache.org/documentation.html#quickstart
    and then run the example
      `$ bin/spark-submit --jars \
      
external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar
\
        examples/src/main/python/streaming/kafka_wordcount.py \
        localhost:2181 test`
```
The thing worth noticing is that there is no line break here in the help.
## What changes were proposed in this pull request?
Change triple double quotes to raw triple double quotes when there are
occurrences of backslashes in docstrings.
## How was this patch tested?
Manually as this is a doc fix.
Author: Shashwat Anand <me@shashwat.me>
Closes #20497 from ashashwat/docstring-fixes.
(cherry picked from commit 4aaa7d40bf495317e740b6d6f9c2a55dfd03521b)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: be3de87914f29a56137e391d0cf494c0d1a7ba12)
The file was modifiedexamples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py (diff)
The file was modifiedexamples/src/main/python/streaming/kafka_wordcount.py (diff)
The file was modifiedexamples/src/main/python/sql/streaming/structured_network_wordcount.py (diff)
The file was modifiedexamples/src/main/python/streaming/stateful_network_wordcount.py (diff)
The file was modifiedexamples/src/main/python/streaming/direct_kafka_wordcount.py (diff)
The file was modifiedexamples/src/main/python/streaming/flume_wordcount.py (diff)
The file was modifiedexamples/src/main/python/streaming/network_wordcount.py (diff)
The file was modifiedexamples/src/main/python/streaming/sql_network_wordcount.py (diff)
The file was modifiedexamples/src/main/python/streaming/network_wordjoinsentiments.py (diff)
Commit 45f0f4ff76accab3387b08b3773a0b127333ea3a by gatorsmile
[SPARK-21658][SQL][PYSPARK] Revert "[] Add default None for value in
na.replace in PySpark"
This reverts commit 0fcde87aadc9a92e138f11583119465ca4b5c518.
See the discussion in
[SPARK-21658](https://issues.apache.org/jira/browse/SPARK-21658),
[SPARK-19454](https://issues.apache.org/jira/browse/SPARK-19454) and
https://github.com/apache/spark/pull/16793
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20496 from HyukjinKwon/revert-SPARK-21658.
(cherry picked from commit 551dff2bccb65e9b3f77b986f167aec90d9a6016)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 45f0f4ff76accab3387b08b3773a0b127333ea3a)
The file was modifiedpython/pyspark/sql/dataframe.py (diff)
Commit 430025cba1ca8cc652fd11f894cef96203921dab by gatorsmile
[SPARK-22036][SQL][FOLLOWUP] Fix decimalArithmeticOperations.sql
## What changes were proposed in this pull request?
Fix decimalArithmeticOperations.sql test
## How was this patch tested?
N/A
Author: Yuming Wang <wgyumg@gmail.com> Author: wangyum
<wgyumg@gmail.com> Author: Yuming Wang <yumwang@ebay.com>
Closes #20498 from wangyum/SPARK-22036.
(cherry picked from commit 6fb3fd15365d43733aefdb396db205d7ccf57f75)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 430025cba1ca8cc652fd11f894cef96203921dab)
The file was modifiedsql/core/src/test/resources/sql-tests/inputs/typeCoercion/native/decimalArithmeticOperations.sql (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/typeCoercion/native/decimalArithmeticOperations.sql.out (diff)
Commit e688ffee20cf9d7124e03b28521e31e10d0bb7f3 by wenchen
[SPARK-23307][WEBUI] Sort jobs/stages/tasks/queries with the completed
timestamp before cleaning up them
## What changes were proposed in this pull request?
Sort jobs/stages/tasks/queries by the completed timestamp before
cleaning them up, to make the behavior consistent with 2.2.
## How was this patch tested?
- Jenkins.
- Manually ran the following code and checked the UI for
jobs/stages/tasks/queries.
```
spark.ui.retainedJobs 10
spark.ui.retainedStages 10
spark.sql.ui.retainedExecutions 10
spark.ui.retainedTasks 10
```
```scala
new Thread() {
override def run() {
   spark.range(1, 2).foreach { i =>
       Thread.sleep(10000)
   }
}
}.start()
Thread.sleep(5000)
for (_ <- 1 to 20) {
   new Thread() {
     override def run() {
       spark.range(1, 2).foreach { i =>
       }
     }
   }.start()
}
Thread.sleep(15000)
spark.range(1, 2).foreach { i =>
}
sc.makeRDD(1 to 100, 100).foreach { i =>
}
```
Author: Shixiong Zhu <zsxwing@gmail.com>
Closes #20481 from zsxwing/SPARK-23307.
(cherry picked from commit a6bf3db20773ba65cbc4f2775db7bd215e78829a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: e688ffee20cf9d7124e03b28521e31e10d0bb7f3)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/status/storeTypes.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/status/AppStatusListener.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusStore.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala (diff)
Commit 173449c2bd454a87487f8eebf7d74ee6ed505290 by sameerag
[SPARK-23310][CORE] Turn off read ahead input stream for unsafe shuffle
reader
To fix a regression for TPC-DS queries.
Author: Sital Kedia <skedia@fb.com>
Closes #20492 from sitalkedia/turn_off_async_inputstream.
(cherry picked from commit 03b7e120dd7ff7848c936c7a23644da5bd7219ab)
Signed-off-by: Sameer Agarwal <sameerag@apache.org>
(commit: 173449c2bd454a87487f8eebf7d74ee6ed505290)
The file was modifiedcore/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java (diff)
Commit 4aa9aafcd542d5f28b7e6bb756c2e965010a757c by vanzin
[SPARK-23330][WEBUI] Spark UI SQL executions page throws NPE
## What changes were proposed in this pull request?
Spark SQL executions page throws the following error and the page
crashes:
```
HTTP ERROR 500
Problem accessing /SQL/. Reason:
    Server Error
Caused by: java.lang.NullPointerException at
scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47)
at scala.collection.immutable.StringOps.length(StringOps.scala:47) at
scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27)
at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29) at
scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111)
at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29) at
org.apache.spark.sql.execution.ui.ExecutionTable.descriptionCell(AllExecutionsPage.scala:182)
at
org.apache.spark.sql.execution.ui.ExecutionTable.row(AllExecutionsPage.scala:155)
at
org.apache.spark.sql.execution.ui.ExecutionTable$$anonfun$8.apply(AllExecutionsPage.scala:204)
at
org.apache.spark.sql.execution.ui.ExecutionTable$$anonfun$8.apply(AllExecutionsPage.scala:204)
at
org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:339)
at
org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:339)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at
scala.collection.AbstractTraversable.map(Traversable.scala:104) at
org.apache.spark.ui.UIUtils$.listingTable(UIUtils.scala:339) at
org.apache.spark.sql.execution.ui.ExecutionTable.toNodeSeq(AllExecutionsPage.scala:203)
at
org.apache.spark.sql.execution.ui.AllExecutionsPage.render(AllExecutionsPage.scala:67)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at
org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at
javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at
javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534) at
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
```
One possible reason this page fails is that the
`SparkListenerSQLExecutionStart` event gets dropped before it is
processed, so the execution description and details never get updated.
This was not an issue in 2.2 because 2.2 would ignore any job start event
that arrives before the corresponding execution start event, which doesn't
sound like a good decision.
We shall try to handle the null values on the page side, that is,
give a default value when `execution.details` or
`execution.description` is null. Another possible approach is not to
spill the `LiveExecutionData` in `SQLAppStatusListener.update(exec:
LiveExecutionData)` if `exec.details` is null. This is not ideal, because
then you will not see the execution at all if the
`SparkListenerSQLExecutionStart` event is lost, since
`AllExecutionsPage` only reads executions from the KVStore.
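A minimal sketch of the defensive handling described above (names assumed,
not the actual `AllExecutionsPage` code):
```scala
// fall back to an empty string when the start event was dropped and the field is null
def descriptionOrEmpty(description: String): String = Option(description).getOrElse("")
```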
## How was this patch tested?
After the change, the page shows the following:
![image](https://user-images.githubusercontent.com/4784782/35775480-28cc5fde-093e-11e8-8ccc-f58c2ef4a514.png)
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes #20502 from jiangxb1987/executionPage.
(cherry picked from commit c2766b07b4b9ed976931966a79c65043e81cf694)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: 4aa9aafcd542d5f28b7e6bb756c2e965010a757c)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala (diff)
Commit 521494d7bdcbb6699e0b12cd3ff60fc27908938f by wenchen
[SPARK-23326][WEBUI] schedulerDelay should return 0 when the task is
running
## What changes were proposed in this pull request?
When a task is still running, metrics like executorRunTime are not
available, so `schedulerDelay` will be almost the same as `duration`,
which is confusing.
This PR makes `schedulerDelay` return 0 when the task is running, which
is the same behavior as in 2.2.
## How was this patch tested?
`AppStatusUtilsSuite.schedulerDelay`
Author: Shixiong Zhu <zsxwing@gmail.com>
Closes #20493 from zsxwing/SPARK-23326.
(cherry picked from commit f3f1e14bb73dfdd2927d95b12d7d61d22de8a0ac)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 521494d7bdcbb6699e0b12cd3ff60fc27908938f)
The file was modifiedcore/src/main/scala/org/apache/spark/status/AppStatusUtils.scala (diff)
The file was addedcore/src/test/scala/org/apache/spark/status/AppStatusUtilsSuite.scala
Commit 44933033e9216ccb2e533b9dc6e6cb03cce39817 by hyukjinkwon
[SPARK-23290][SQL][PYTHON][BACKPORT-2.3] Use datetime.date for date type
when converting Spark DataFrame to Pandas DataFrame.
## What changes were proposed in this pull request?
This is a backport of #20506.
In #18664, there was a change in how `DateType` is returned to
users ([line 1968 in
dataframe.py](https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968)).
This can cause client code that works in Spark 2.2 to fail. See
[SPARK-23290](https://issues.apache.org/jira/browse/SPARK-23290?focusedCommentId=16350917&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16350917)
for an example.
This PR modifies the conversion to use `datetime.date` for the date type,
as Spark 2.2 does.
## How was this patch tested?
Tests modified to fit the new behavior and existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes #20515 from ueshin/issues/SPARK-23290_2.3.
(commit: 44933033e9216ccb2e533b9dc6e6cb03cce39817)
The file was modifiedpython/pyspark/serializers.py (diff)
The file was modifiedpython/pyspark/sql/types.py (diff)
The file was modifiedpython/pyspark/sql/dataframe.py (diff)
The file was modifiedpython/pyspark/sql/tests.py (diff)
Commit a511544822be6e3fc9c6bb5080a163b9acbb41f2 by hyukjinkwon
[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType()
to handle str type properly in Python 2.
## What changes were proposed in this pull request?
In Python 2, when a `pandas_udf` tries to return a string type value
created in the UDF with `".."`, the execution fails. E.g.,
```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
```
raises the following exception:
```
...
java.lang.AssertionError: assertion failed: Invalid schema from
pandas_udf: expected StringType, got BinaryType
at scala.Predef$.assert(Predef.scala:170)
at
org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)
...
```
It seems pyarrow ignores the `type` parameter for `pa.Array.from_pandas()`
and considers the values to be binary type when the type is string type
but the string values are `str` instead of `unicode` in Python 2.
This PR adds a workaround for this case.
## How was this patch tested?
Added a test and existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes #20507 from ueshin/issues/SPARK-23334.
(cherry picked from commit 63c5bf13ce5cd3b8d7e7fb88de881ed207fde720)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: a511544822be6e3fc9c6bb5080a163b9acbb41f2)
The file was modifiedpython/pyspark/sql/tests.py (diff)
The file was modifiedpython/pyspark/serializers.py (diff)
Commit 7782fd03ab95552dff1d1477887632bbc8f6ee51 by vanzin
[SPARK-23310][CORE][FOLLOWUP] Fix Java style check issues.
## What changes were proposed in this pull request?
This is a follow-up of #20492, which broke lint-java checks. This PR
fixes the lint-java issues.
```
[ERROR]
src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java:[79]
(sizes) LineLength: Line is longer than 100 characters (found 114).
```
## How was this patch tested?
Checked manually in my local environment.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes #20514 from ueshin/issues/SPARK-23310/fup1.
(cherry picked from commit 7db9979babe52d15828967c86eb77e3fb2791579)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: 7782fd03ab95552dff1d1477887632bbc8f6ee51)
The file was modifiedcore/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java (diff)
Commit 036a04b29c818ddbe695f7833577781e8bb16d3f by gatorsmile
[SPARK-23312][SQL][FOLLOWUP] add a config to turn off vectorized cache
reader
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/20483 tried to provide a way to
turn off the new columnar cache reader, to restore the behavior in 2.2.
However, even if we turn off that config, the behavior is still different
from 2.2: if the output data are rows, we still enable whole-stage codegen
for the scan node, which differs from 2.2. We should fix that as well.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20513 from cloud-fan/cache.
(cherry picked from commit ac7454cac04a1d9252b3856360eda5c3e8bcb8da)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 036a04b29c818ddbe695f7833577781e8bb16d3f)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala (diff)
Commit 77cccc5e154b14f8a9cad829d7fd476e3b6405ce by gatorsmile
[MINOR][TEST] Fix class name for Pandas UDF tests
In
https://github.com/apache/spark/commit/b2ce17b4c9fea58140a57ca1846b2689b15c0d61,
I mistakenly renamed `VectorizedUDFTests` to `ScalarPandasUDF`. This PR
fixes the mistake.
Existing tests.
Author: Li Jin <ice.xelloss@gmail.com>
Closes #20489 from icexelloss/fix-scalar-udf-tests.
(cherry picked from commit caf30445632de6aec810309293499199e7a20892)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 77cccc5e154b14f8a9cad829d7fd476e3b6405ce)
The file was modifiedpython/pyspark/sql/tests.py (diff)
Commit f9c913263219f5e8a375542994142645dd0f6c6a by gatorsmile
[SPARK-23315][SQL] failed to get output from canonicalized data source
v2 related plans
## What changes were proposed in this pull request?
`DataSourceV2Relation`  keeps a `fullOutput` and resolves the real
output on demand by column name lookup. i.e.
```
lazy val output: Seq[Attribute] = reader.readSchema().map(_.name).map { name =>
  fullOutput.find(_.name == name).get
}
```
This will be broken after we canonicalize the plan, because all
attribute names become "None", see
https://github.com/apache/spark/blob/v2.3.0-rc1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L42
To fix this, `DataSourceV2Relation` should just keep `output`, and
update the `output` when doing column pruning.
## How was this patch tested?
a new test case
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20485 from cloud-fan/canonicalize.
(cherry picked from commit b96a083b1c6ff0d2c588be9499b456e1adce97dc)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: f9c913263219f5e8a375542994142645dd0f6c6a)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceReaderHolder.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala (diff)
Commit 874d3f89fe0f903a6465520c3e6c4788a6865d9a by gatorsmile
[SPARK-23327][SQL] Update the description and tests of three external
API or functions
## What changes were proposed in this pull request? Update the
description and tests of three external APIs/functions: `createFunction`,
`length` and `repartitionByRange`.
## How was this patch tested? N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20495 from gatorsmile/updateFunc.
(cherry picked from commit c36fecc3b416c38002779c3cf40b6a665ac4bf13)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 874d3f89fe0f903a6465520c3e6c4788a6865d9a)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was modifiedpython/pyspark/sql/functions.py (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala (diff)
The file was modifiedR/pkg/R/functions.R (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala (diff)
Commit cb22e830b0af3f2d760beffea9a79a6d349e4661 by hyukjinkwon
[SPARK-23122][PYSPARK][FOLLOWUP] Replace registerTempTable by
createOrReplaceTempView
## What changes were proposed in this pull request? Replace
`registerTempTable` by `createOrReplaceTempView`.
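A minimal sketch of the replacement, assuming an existing SparkSession `spark` and DataFrame `df`:
```scala
// Old, deprecated call that the examples used to show:
// df.registerTempTable("people")

// Preferred replacement:
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people").show()
```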
## How was this patch tested? N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20523 from gatorsmile/updateExamples.
(cherry picked from commit 9775df67f924663598d51723a878557ddafb8cfd)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: cb22e830b0af3f2d760beffea9a79a6d349e4661)
The file was modifiedpython/pyspark/sql/udf.py (diff)
The file was modifiedsql/core/src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java (diff)
Commit 05239afc9e62ef4c71c9f22a930e73888985510a by gatorsmile
[SPARK-23345][SQL] Remove open stream record even closing it fails
## What changes were proposed in this pull request?
When `DebugFilesystem` closes an opened stream and an exception occurs, we
still need to remove the open-stream record from `DebugFilesystem`.
Otherwise, it will report a leaked filesystem connection.
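A minimal sketch of the pattern described above, with hypothetical names (the real change is inside `DebugFilesystem`):
```scala
import java.io.InputStream
import scala.collection.mutable

// Hypothetical helper: remove the bookkeeping entry even if close() throws,
// so a failed close doesn't later show up as a leaked filesystem connection.
def closeAndUntrack(stream: InputStream, openStreams: mutable.Set[InputStream]): Unit = {
  try {
    stream.close()
  } finally {
    openStreams.remove(stream)
  }
}
```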
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #20524 from viirya/SPARK-23345.
(cherry picked from commit 9841ae0313cbee1f083f131f9446808c90ed5a7b)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 05239afc9e62ef4c71c9f22a930e73888985510a)
The file was modifiedcore/src/test/scala/org/apache/spark/DebugFilesystem.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/test/SharedSparkSession.scala (diff)
Commit 2ba07d5b101c44382e0db6d660da756c2f5ce627 by hyukjinkwon
[SPARK-23300][TESTS][BRANCH-2.3] Prints out if Pandas and PyArrow are
installed or not in PySpark SQL tests
This PR backports https://github.com/apache/spark/pull/20473 to
branch-2.3.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20533 from HyukjinKwon/backport-20473.
(commit: 2ba07d5b101c44382e0db6d660da756c2f5ce627)
The file was modifiedpython/run-tests.py (diff)
Commit db59e554273fe0a54a3223079ff39106fdd1442e by wenchen
Revert [SPARK-22279][SQL] Turn on spark.sql.hive.convertMetastoreOrc by
default
## What changes were proposed in this pull request?
This is to revert the changes made in
https://github.com/apache/spark/pull/19499, because they cause a
regression: we should not ignore the table-specific compression conf
when Hive serde tables are converted to data source tables.
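After the revert, users who still want the conversion can opt in explicitly with the config named in the title; a quick sketch:
```scala
// The revert restores the old default; setting it to true re-enables the
// conversion of Hive ORC serde tables to Spark's native data source tables.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
```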
## How was this patch tested?
The existing tests.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20536 from gatorsmile/revert22279.
(cherry picked from commit 3473fda6dc77bdfd84b3de95d2082856ad4f8626)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: db59e554273fe0a54a3223079ff39106fdd1442e)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala (diff)
Commit 0538302561c4d77b2856b1ce73b3ccbcb6688ac6 by hyukjinkwon
[SPARK-23319][TESTS][BRANCH-2.3] Explicitly specify Pandas and PyArrow
versions in PySpark tests (to skip or test)
This PR backports https://github.com/apache/spark/pull/20487 to
branch-2.3.
Author: hyukjinkwon <gurwls223@gmail.com> Author: Takuya UESHIN
<ueshin@databricks.com>
Closes #20534 from HyukjinKwon/PR_TOOL_PICK_PR_20487_BRANCH-2.3.
(commit: 0538302561c4d77b2856b1ce73b3ccbcb6688ac6)
The file was modifiedpython/pyspark/sql/dataframe.py (diff)
The file was modifiedpython/setup.py (diff)
The file was modifiedpom.xml (diff)
The file was modifiedpython/pyspark/sql/session.py (diff)
The file was modifiedpython/pyspark/sql/tests.py (diff)
The file was modifiedpython/pyspark/sql/utils.py (diff)
Commit 0c2a2100d0116776d2dcb2d48493f77a64aead0c by gatorsmile
[SPARK-23348][SQL] append data using saveAsTable should adjust the data
types
## What changes were proposed in this pull request?
For inserting/appending data to an existing table, Spark should adjust
the data types of the input query according to the table schema, or fail
fast if it's uncastable.
There are several ways to insert/append data: the SQL API,
`DataFrameWriter.insertInto`, and `DataFrameWriter.saveAsTable`. The first
two create an `InsertIntoTable` plan, and the last creates a `CreateTable`
plan. However, we only adjust the input query's data types for
`InsertIntoTable`, so users may hit confusing errors when appending data
using `saveAsTable`. See the JIRA for the error case.
This PR fixes this bug by adjusting data types for `CreateTable` too.
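A hedged repro sketch of the `saveAsTable` append path described above (table and column names are made up, and a SparkSession `spark` is assumed):
```scala
import spark.implicits._

// Create a table with an int column.
Seq(1, 2, 3).toDF("i").write.saveAsTable("t")

// Appending through saveAsTable goes through the CreateTable plan; with this
// fix the input query's types are adjusted to the table schema (or the write
// fails fast if the cast is not allowed), instead of producing a weird error.
Seq(4L, 5L).toDF("i").write.mode("append").saveAsTable("t")
```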
## How was this patch tested?
new test.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20527 from cloud-fan/saveAsTable.
(cherry picked from commit 7f5f5fb1296275a38da0adfa05125dd8ebf729ff)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 0c2a2100d0116776d2dcb2d48493f77a64aead0c)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala (diff)
Commit 68f3a070c728d0af95e9b5eec2c49be274b67a20 by wenchen
[SPARK-23268][SQL][FOLLOWUP] Reorganize packages in data source V2
## What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/20435.
While reorganizing the packages for streaming data source v2, the top
level stream read/write support interfaces should not be in the
reader/writer package, but should be in the `sources.v2` package, to
follow the `ReadSupport`, `WriteSupport`, etc.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20509 from cloud-fan/followup.
(cherry picked from commit a75f927173632eee1316879447cb62c8cf30ae37)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 68f3a070c728d0af95e9b5eec2c49be274b67a20)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/ContinuousReadSupport.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/console.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/MicroBatchReadSupport.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamSourceV2.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/memoryV2.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/RateSourceProvider.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/MicroBatchReadSupport.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/sources/v2/StreamWriteSupport.java
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/RateSourceV2Suite.scala (diff)
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ContinuousReadSupport.java
The file was removedsql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/StreamWriteSupport.java
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
Commit dfb16147791ff87342ff852105420a5eac5c553b by wenchen
[SPARK-23186][SQL] Initialize DriverManager first before loading JDBC
Drivers
## What changes were proposed in this pull request?
Since some JDBC Drivers have class initialization code to call
`DriverManager`, we need to initialize `DriverManager` first in order to
avoid potential executor-side **deadlock** situations like the following
(or [STORM-2527](https://issues.apache.org/jira/browse/STORM-2527)).
```
Thread 9587: (state = BLOCKED)
-
sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
java.lang.Object[]) bci=0 (Compiled frame; information may be imprecise)
-
sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[])
bci=85, line=62 (Compiled frame)
-
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[])
bci=5, line=45 (Compiled frame)
- java.lang.reflect.Constructor.newInstance(java.lang.Object[]) bci=79,
line=423 (Compiled frame)
- java.lang.Class.newInstance() bci=138, line=442 (Compiled frame)
- java.util.ServiceLoader$LazyIterator.nextService() bci=119, line=380
(Interpreted frame)
- java.util.ServiceLoader$LazyIterator.next() bci=11, line=404
(Interpreted frame)
- java.util.ServiceLoader$1.next() bci=37, line=480 (Interpreted frame)
- java.sql.DriverManager$2.run() bci=21, line=603 (Interpreted frame)
- java.sql.DriverManager$2.run() bci=1, line=583 (Interpreted frame)
-
java.security.AccessController.doPrivileged(java.security.PrivilegedAction)
bci=0 (Compiled frame)
- java.sql.DriverManager.loadInitialDrivers() bci=27, line=583
(Interpreted frame)
- java.sql.DriverManager.<clinit>() bci=32, line=101 (Interpreted frame)
-
org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String,
java.lang.Integer, java.lang.String, java.util.Properties) bci=12,
line=98 (Interpreted frame)
-
org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration,
java.util.Properties) bci=22, line=57 (Interpreted frame)
-
org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext,
org.apache.hadoop.conf.Configuration) bci=61, line=116 (Interpreted
frame)
-
org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit,
org.apache.hadoop.mapreduce.TaskAttemptContext) bci=10, line=71
(Interpreted frame)
-
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD,
org.apache.spark.Partition, org.apache.spark.TaskContext) bci=233,
line=156 (Interpreted frame)
Thread 9170: (state = BLOCKED)
- org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() bci=35, line=125
(Interpreted frame)
-
sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
java.lang.Object[]) bci=0 (Compiled frame)
-
sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[])
bci=85, line=62 (Compiled frame)
-
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[])
bci=5, line=45 (Compiled frame)
- java.lang.reflect.Constructor.newInstance(java.lang.Object[]) bci=79,
line=423 (Compiled frame)
- java.lang.Class.newInstance() bci=138, line=442 (Compiled frame)
-
org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String)
bci=89, line=46 (Interpreted frame)
-
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
bci=7, line=53 (Interpreted frame)
-
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
bci=1, line=52 (Interpreted frame)
-
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD,
org.apache.spark.Partition, org.apache.spark.TaskContext) bci=81,
line=347 (Interpreted frame)
-
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition,
org.apache.spark.TaskContext) bci=7, line=339 (Interpreted frame)
```
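A minimal sketch of the kind of fix described (forcing `DriverManager`'s static initialization before any JDBC driver class is loaded); the object and method names here are illustrative, not the actual `DriverRegistry` code:
```scala
import java.sql.DriverManager

object DriverRegistrySketch {
  // Touch DriverManager first so its static initializer (which scans for
  // drivers via ServiceLoader) runs before any driver's own <clinit> can
  // call back into DriverManager and deadlock, as in the thread dump above.
  DriverManager.getDrivers()

  def register(className: String): Unit = {
    // Loading the driver class afterwards is then safe.
    Class.forName(className)
  }
}
```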
## How was this patch tested?
N/A
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20359 from dongjoon-hyun/SPARK-23186.
(cherry picked from commit 8cbcc33876c773722163b2259644037bbb259bd1)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: dfb16147791ff87342ff852105420a5eac5c553b)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistry.scala (diff)
Commit 196304a3a8ed15fd018e9c7b441954d17bd60124 by wenchen
[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace
when 'to_replace' is not a dictionary
## What changes were proposed in this pull request?
This PR proposes to disallow default value None when 'to_replace' is not
a dictionary.
It seems weird that we set the default value of `value` to `None`, which
ended up allowing the case below:
```python
>>> df.show()
```
```
+----+------+-----+
| age|height| name|
+----+------+-----+
|  10|    80|Alice|
...
```
```python
>>> df.na.replace('Alice').show()
```
```
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...
```
**After**
This PR targets to disallow the case above:
```python
>>> df.na.replace('Alice').show()
```
```
... TypeError: value is required when to_replace is not a dictionary.
```
while we still allow when `to_replace` is a dictionary:
```python
>>> df.na.replace({'Alice': None}).show()
```
```
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...
```
## How was this patch tested?
Manually tested, tests were added in `python/pyspark/sql/tests.py` and
doctests were fixed.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20499 from HyukjinKwon/SPARK-19454-followup.
(cherry picked from commit 4b4ee2601079f12f8f410a38d2081793cbdedc14)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 196304a3a8ed15fd018e9c7b441954d17bd60124)
The file was modifiedpython/pyspark/sql/tests.py (diff)
The file was modifiedpython/pyspark/__init__.py (diff)
The file was modifiedpython/pyspark/sql/dataframe.py (diff)
The file was addedpython/pyspark/_globals.py
The file was modifieddocs/sql-programming-guide.md (diff)
Commit 08eb95f609f5d356c89dedcefa768b12a7a8b96c by sowen
[SPARK-23358][CORE] When the number of partitions is greater than 2^28,
it will result in an error result
## What changes were proposed in this pull request? In
`checkIndexAndDataFile`, `blocks` is an `Int`; when it is greater than
2^28, `blocks * 8` overflows, which leads to a wrong result. In fact,
`blocks` is actually the number of partitions.
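A quick illustration of the overflow in plain Scala (not the actual Spark code):
```scala
val blocks: Int = (1 << 28) + 1     // just over 2^28 partitions
val wrong = blocks * 8              // Int overflow: -2147483640
val right = blocks.toLong * 8       // widen before multiplying: 2147483656
```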
## How was this patch tested? Manual test
Author: liuxian <liu.xian3@zte.com.cn>
Closes #20544 from 10110346/overflow.
(cherry picked from commit f77270b8811bbd8956d0c08fa556265d2c5ee20e)
Signed-off-by: Sean Owen <sowen@cloudera.com>
(commit: 08eb95f609f5d356c89dedcefa768b12a7a8b96c)
The file was modifiedcore/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala (diff)
Commit 49771ac8da8e68e8412d9f5d181953eaf0de7973 by sowen
[MINOR][HIVE] Typo fixes
## What changes were proposed in this pull request?
Typo fixes (with expanding a Hive property)
## How was this patch tested?
local build. Awaiting Jenkins
Author: Jacek Laskowski <jacek@japila.pl>
Closes #20550 from jaceklaskowski/hiveutils-typos.
(cherry picked from commit 557938e2839afce26a10a849a2a4be8fc4580427)
Signed-off-by: Sean Owen <sowen@cloudera.com>
(commit: 49771ac8da8e68e8412d9f5d181953eaf0de7973)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala (diff)
Commit f3a9a7f6b6eac4421bd74ff73a74105982604ce6 by gatorsmile
[SPARK-23275][SQL] fix the thread leaking in hive/tests
## What changes were proposed in this pull request?
This is a follow up of https://github.com/apache/spark/pull/20441.
The two lines actually can trigger the hive metastore bug:
https://issues.apache.org/jira/browse/HIVE-16844
The two configs are not in the default `ObjectStore` properties, so any
Hive command run after these two lines will set the `propsChanged` flag
in `ObjectStore.setConf` and then cause thread leaks.
I don't think the two lines are very useful; they can be removed safely.
## How was this patch tested?
Author: Feng Liu <fengliu@databricks.com>
Closes #20562 from liufengdb/fix-omm.
(cherry picked from commit 6d7c38330e68c7beb10f54eee8b4f607ee3c4136)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: f3a9a7f6b6eac4421bd74ff73a74105982604ce6)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala (diff)
Commit b7571b9bfcced2e08f87e815c2ea9474bfd5fe2a by hyukjinkwon
[SPARK-23360][SQL][PYTHON] Get local timezone from environment via pytz,
or dateutil.
## What changes were proposed in this pull request?
Currently we use `tzlocal()` to get the Python local timezone, but it
sometimes causes unexpected behavior. This changes the way the Python
local timezone is obtained: use pytz if the timezone is specified in an
environment variable, or the timezone file via dateutil otherwise.
## How was this patch tested?
Added a test and existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes #20559 from ueshin/issues/SPARK-23360/master.
(cherry picked from commit 97a224a855c4410b2dfb9c0bcc6aae583bd28e92)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: b7571b9bfcced2e08f87e815c2ea9474bfd5fe2a)
The file was modifiedpython/pyspark/sql/tests.py (diff)
The file was modifiedpython/pyspark/sql/types.py (diff)
Commit 9fa7b0e107c283557648160195ce179077752e4c by hyukjinkwon
[SPARK-23314][PYTHON] Add ambiguous=False when localizing tz-naive
timestamps in Arrow codepath to deal with dst
## What changes were proposed in this pull request? When tz_localize-ing a
tz-naive timestamp, pandas will throw an exception if the timestamp falls
in the daylight saving time period, e.g., `2015-11-01 01:30:00`. This PR
fixes the issue by setting `ambiguous=False` when calling tz_localize,
which matches the default behavior of pytz.
## How was this patch tested? Add `test_timestamp_dst`
Author: Li Jin <ice.xelloss@gmail.com>
Closes #20537 from icexelloss/SPARK-23314.
(cherry picked from commit a34fce19bc0ee5a7e36c6ecba75d2aeb70fdcbc7)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: 9fa7b0e107c283557648160195ce179077752e4c)
The file was modifiedpython/pyspark/sql/types.py (diff)
The file was modifiedpython/pyspark/sql/tests.py (diff)
Commit 8875e47cec01ae8da4ffb855409b54089e1016fb by hyukjinkwon
[SPARK-23387][SQL][PYTHON][TEST][BRANCH-2.3] Backport assertPandasEqual
to branch-2.3.
## What changes were proposed in this pull request?
When backporting a pr with tests using `assertPandasEqual` from master
to branch-2.3, the tests fail because `assertPandasEqual` doesn't exist
in branch-2.3. We should backport `assertPandasEqual` to branch-2.3 to
avoid the failures.
## How was this patch tested?
Modified tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes #20577 from ueshin/issues/SPARK-23387/branch-2.3.
(commit: 8875e47cec01ae8da4ffb855409b54089e1016fb)
The file was modifiedpython/pyspark/sql/tests.py (diff)
Commit 7e2a2b33c0664b3638a1428688b28f68323994c1 by wenchen
[SPARK-23376][SQL] creating UnsafeKVExternalSorter with BytesToBytesMap
may fail
## What changes were proposed in this pull request?
This is a long-standing bug in `UnsafeKVExternalSorter` and was reported
in the dev list multiple times.
When creating `UnsafeKVExternalSorter` with `BytesToBytesMap`, we need
to create a `UnsafeInMemorySorter` to sort the data in
`BytesToBytesMap`. The data format of the sorter and the map is the same,
so no data movement is required. However, both the sorter and the map need
a point array for some bookkeeping work.
There is an optimization in `UnsafeKVExternalSorter`: reuse the point
array between the sorter and the map, to avoid an extra memory
allocation. This sounds like a reasonable optimization: the length of
the `BytesToBytesMap` point array is at least 4 times the number of keys
(to avoid hash collisions, the hash table size should be at least 2 times
the number of keys, and each key occupies 2 slots), and
`UnsafeInMemorySorter` needs the pointer array size to be 4 times the
number of entries, so it seems safe to reuse the point array.
However, the number of keys in the map doesn't equal the number of
entries in the map, because `BytesToBytesMap` supports duplicated keys.
This breaks the assumption of the above optimization, so we may run out
of space when inserting data into the sorter and hit the error
```
java.lang.IllegalStateException: There is no space for new record
  at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:239)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:149)
...
```
This PR fixes this bug by creating a new point array if the existing one
is not big enough.
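A rough sketch of the sizing argument above; the helper and its names are hypothetical, not Spark's actual internals:
```scala
// The map reserves its point array based on the number of distinct keys,
// while the sorter needs 4 slots per inserted record.
def reuseIsGuaranteedSafe(numDistinctKeys: Long, numRecords: Long): Boolean = {
  val minReservedByMap = 4L * numDistinctKeys  // hash table >= 2x keys, 2 slots per key
  val neededBySorter   = 4L * numRecords       // 4 slots per record
  minReservedByMap >= neededBySorter           // can fail once keys are duplicated
}

// e.g. 100 records but only 10 distinct keys: at least 40 slots reserved, 400 needed.
reuseIsGuaranteedSafe(numDistinctKeys = 10, numRecords = 100)  // false
```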
## How was this patch tested?
a new test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20561 from cloud-fan/bug.
(cherry picked from commit 4bbd7443ebb005f81ed6bc39849940ac8db3b3cc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 7e2a2b33c0664b3638a1428688b28f68323994c1)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeKVExternalSorterSuite.scala (diff)
Commit 79e8650cccb00c7886efba6dd691d9733084cb81 by sameerag
[SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSourceSuite in Spark
2.3/hadoop 2.7
## What changes were proposed in this pull request?
This test only fails with sbt on Hadoop 2.7; I can't reproduce it
locally, but here is my speculation from looking at the code:
1. `FileSystem.delete` doesn't delete the directory entirely, so somehow we
can still open the file as a 0-length empty file (just speculation).
2. ORC intentionally allows empty files, and the reader fails during
reading without closing the file stream.
This PR improves the test to make sure all files are deleted and can't
be opened.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20584 from cloud-fan/flaky-test.
(cherry picked from commit 6efd5d117e98074d1b16a5c991fbd38df9aa196e)
Signed-off-by: Sameer Agarwal <sameerag@apache.org>
(commit: 79e8650cccb00c7886efba6dd691d9733084cb81)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala (diff)
Commit 1e3118c2ee0fe7d2c59cb3e2055709bb2809a6db by wenchen
[SPARK-22977][SQL] fix web UI SQL tab for CTAS
## What changes were proposed in this pull request?
This is a regression in Spark 2.3.
In Spark 2.2, we have fragile UI support for SQL data writing
commands: we only track the input query plan of `FileFormatWriter` and
display its metrics. This is not ideal because we don't know what
triggered the writing (it can be a table insertion, CTAS, etc.), but it's
still useful to see the metrics of the input query.
In Spark 2.3, we introduced a new mechanism, `DataWritingCommand`, to fix
the UI issue entirely. Now these writing commands have real children,
and we don't need to hack into `FileFormatWriter` for the UI. This
also helps with `explain`: now `explain` can show the physical plan of
the input query, while in 2.2 the physical writing plan was simply
`ExecutedCommandExec` with no child.
However, there is a regression for CTAS. CTAS commands don't extend
`DataWritingCommand`, and we no longer have the UI hack in
`FileFormatWriter`, so the UI for CTAS is just an empty node. See
https://issues.apache.org/jira/browse/SPARK-22977 for more information
about this UI issue.
To fix it, we should apply the `DataWritingCommand` mechanism to CTAS
commands.
TODO: In the future, we should refactor this part and create some
physical-layer code pieces for data writing, and reuse them in different
writing commands. We should have different logical nodes for different
operators, even if some of them share the same logic, e.g. CTAS, CREATE
TABLE, INSERT TABLE; internally we can share the same physical logic.
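For reference, these are the kinds of CTAS statements whose input query should now show up in the SQL tab (a source table `src` is assumed to exist):
```scala
// CTAS creating a data source table
spark.sql("CREATE TABLE ctas_datasource USING parquet AS SELECT * FROM src")

// CTAS creating a Hive serde table
spark.sql("CREATE TABLE ctas_hive STORED AS parquet AS SELECT * FROM src")
```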
## How was this patch tested?
manually tested. For data source table
<img width="644" alt="1"
src="https://user-images.githubusercontent.com/3182036/35874155-bdffab28-0ba6-11e8-94a8-e32e106ba069.png">
For hive table
<img width="666" alt="2"
src="https://user-images.githubusercontent.com/3182036/35874161-c437e2a8-0ba6-11e8-98ed-7930f01432c5.png">
Author: Wenchen Fan <wenchen@databricks.com>
Closes #20521 from cloud-fan/UI.
(cherry picked from commit 0e2c266de7189473177f45aa68ea6a45c7e47ec3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 1e3118c2ee0fe7d2c59cb3e2055709bb2809a6db)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveExplainSuite.scala (diff)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala (diff)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala (diff)
Commit d31c4ae7ba734356c849347b9a7b448da9a5a9a1 by sowen
[SPARK-23391][CORE] It may lead to overflow for some integer
multiplication
## What changes were proposed in this pull request? In
`getBlockData`, `blockId.reduceId` is an `Int`; when it is greater
than 2^28, `blockId.reduceId * 8` overflows. In `decompress0`,
`len` and `unitSize` are `Int`, so `len * unitSize` may also overflow.
## How was this patch tested? N/A
Author: liuxian <liu.xian3@zte.com.cn>
Closes #20581 from 10110346/overflow2.
(cherry picked from commit 4a4dd4f36f65410ef5c87f7b61a960373f044e61)
Signed-off-by: Sean Owen <sowen@cloudera.com>
(commit: d31c4ae7ba734356c849347b9a7b448da9a5a9a1)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala (diff)
The file was modifiedcommon/unsafe/pom.xml (diff)
The file was modifiedtools/pom.xml (diff)
The file was modifiedsql/core/pom.xml (diff)
The file was modifiedstreaming/pom.xml (diff)
The file was modifiedexternal/docker-integration-tests/pom.xml (diff)
The file was modifiedexternal/flume-assembly/pom.xml (diff)
The file was modifiedexternal/kafka-0-8-assembly/pom.xml (diff)
The file was modifiedsql/catalyst/pom.xml (diff)
The file was modifiedcommon/tags/pom.xml (diff)
The file was modifiedrepl/pom.xml (diff)
The file was modifieddocs/_config.yml (diff)
The file was modifiedcommon/network-yarn/pom.xml (diff)
The file was modifiedresource-managers/yarn/pom.xml (diff)
The file was modifiedcommon/kvstore/pom.xml (diff)
The file was modifiedresource-managers/mesos/pom.xml (diff)
The file was modifiedexternal/kafka-0-10-sql/pom.xml (diff)
The file was modifiedpython/pyspark/version.py (diff)
The file was modifiedlauncher/pom.xml (diff)
The file was modifiedexternal/kafka-0-10/pom.xml (diff)
The file was modifiedsql/hive-thriftserver/pom.xml (diff)
The file was modifiedhadoop-cloud/pom.xml (diff)
The file was modifiedassembly/pom.xml (diff)
The file was modifiedmllib-local/pom.xml (diff)
The file was modifiedpom.xml (diff)
The file was modifiedexternal/kinesis-asl-assembly/pom.xml (diff)
The file was modifiedmllib/pom.xml (diff)
The file was modifiedcommon/sketch/pom.xml (diff)
The file was modifiedcore/pom.xml (diff)
The file was modifiedcommon/network-common/pom.xml (diff)
The file was modifiedresource-managers/kubernetes/core/pom.xml (diff)
The file was modifiedexternal/kafka-0-10-assembly/pom.xml (diff)
The file was modifiedexternal/kinesis-asl/pom.xml (diff)
The file was modifiedR/pkg/DESCRIPTION (diff)
The file was modifiedexternal/flume/pom.xml (diff)
The file was modifiedexternal/spark-ganglia-lgpl/pom.xml (diff)
The file was modifiedexamples/pom.xml (diff)
The file was modifiedexternal/flume-sink/pom.xml (diff)
The file was modifiedcommon/network-shuffle/pom.xml (diff)
The file was modifiedexternal/kafka-0-8/pom.xml (diff)
The file was modifiedgraphx/pom.xml (diff)
The file was modifiedsql/hive/pom.xml (diff)
The file was modifiedassembly/pom.xml (diff)
The file was modifiedexternal/docker-integration-tests/pom.xml (diff)
The file was modifiedpom.xml (diff)
The file was modifiedstreaming/pom.xml (diff)
The file was modifiedexternal/kafka-0-10-sql/pom.xml (diff)
The file was modifiedcommon/tags/pom.xml (diff)
The file was modifiedcommon/kvstore/pom.xml (diff)
The file was modifiedexternal/spark-ganglia-lgpl/pom.xml (diff)
The file was modifiedsql/core/pom.xml (diff)
The file was modifiedR/pkg/DESCRIPTION (diff)
The file was modifiedgraphx/pom.xml (diff)
The file was modifiedexternal/flume-assembly/pom.xml (diff)
The file was modifiedexamples/pom.xml (diff)
The file was modifiedexternal/kafka-0-10-assembly/pom.xml (diff)
The file was modifiedexternal/kafka-0-8-assembly/pom.xml (diff)
The file was modifiedsql/hive/pom.xml (diff)
The file was modifiedrepl/pom.xml (diff)
The file was modifiedsql/hive-thriftserver/pom.xml (diff)
The file was modifiedmllib-local/pom.xml (diff)
The file was modifiedresource-managers/yarn/pom.xml (diff)
The file was modifiedexternal/kinesis-asl/pom.xml (diff)
The file was modifiedexternal/kafka-0-8/pom.xml (diff)
The file was modifiedresource-managers/mesos/pom.xml (diff)
The file was modifiedcommon/network-shuffle/pom.xml (diff)
The file was modifiedpython/pyspark/version.py (diff)
The file was modifiedcommon/network-common/pom.xml (diff)
The file was modifiedcommon/network-yarn/pom.xml (diff)
The file was modifiedresource-managers/kubernetes/core/pom.xml (diff)
The file was modifiedcommon/sketch/pom.xml (diff)
The file was modifiedmllib/pom.xml (diff)
The file was modifieddocs/_config.yml (diff)
The file was modifiedexternal/flume-sink/pom.xml (diff)
The file was modifiedsql/catalyst/pom.xml (diff)
The file was modifiedlauncher/pom.xml (diff)
The file was modifiedcommon/unsafe/pom.xml (diff)
The file was modifiedtools/pom.xml (diff)
The file was modifiedexternal/kinesis-asl-assembly/pom.xml (diff)
The file was modifiedexternal/kafka-0-10/pom.xml (diff)
The file was modifiedcore/pom.xml (diff)
The file was modifiedhadoop-cloud/pom.xml (diff)
The file was modifiedexternal/flume/pom.xml (diff)
Commit 4e138207ebb11a08393c15e5e39f46a5dc1e7c66 by gatorsmile
[SPARK-23388][SQL] Support for Parquet Binary DecimalType in
VectorizedColumnReader
## What changes were proposed in this pull request?
Re-add support for parquet binary DecimalType in VectorizedColumnReader
## How was this patch tested?
Existing test suite
Author: James Thompson <jamesthomp@users.noreply.github.com>
Closes #20580 from jamesthomp/jt/add-back-binary-decimal.
(cherry picked from commit 5bb11411aec18b8d623e54caba5397d7cb8e89f0)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 4e138207ebb11a08393c15e5e39f46a5dc1e7c66)
The file was modifiedsql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java (diff)
Commit 9632c461e6931a1a4d05684d0f62ee36f9e90b77 by gatorsmile
[SPARK-22002][SQL][FOLLOWUP][TEST] Add a test to check if the original
schema doesn't have metadata.
## What changes were proposed in this pull request?
This is a follow-up PR of #19231, which modified the behavior to remove
metadata from the JDBC table schema. This PR adds a test to check that the
schema doesn't have metadata.
## How was this patch tested?
Added a test and existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes #20585 from ueshin/issues/SPARK-22002/fup1.
(cherry picked from commit 0c66fe4f22f8af4932893134bb0fd56f00fabeae)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 9632c461e6931a1a4d05684d0f62ee36f9e90b77)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala (diff)
Commit 2b80571e215d56d15c59f0fc5db053569a79efae by gatorsmile
[SPARK-23313][DOC] Add a migration guide for ORC
## What changes were proposed in this pull request?
This PR adds a migration guide documentation for ORC.
![orc-guide](https://user-images.githubusercontent.com/9700541/36123859-ec165cae-1002-11e8-90b7-7313be7a81a5.png)
## How was this patch tested?
N/A.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20484 from dongjoon-hyun/SPARK-23313.
(cherry picked from commit 6cb59708c70c03696c772fbb5d158eed57fe67d4)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 2b80571e215d56d15c59f0fc5db053569a79efae)
The file was modifieddocs/sql-programming-guide.md (diff)
Commit befb22de81aad41673eec9dba7585b80c6cb2564 by gatorsmile
[SPARK-23230][SQL] When hive.default.fileformat is other kinds of file
types, create textfile table cause a serde error
When hive.default.fileformat is set to another file format, creating a
textfile table causes a serde error. We should use
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe as the default serde
for both textfile and sequencefile tables.
```
set hive.default.fileformat=orc;
create table tbl( i string ) stored as textfile;
desc formatted tbl;

Serde Library   org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat     org.apache.hadoop.mapred.TextInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
```
Author: sychen <sychen@ctrip.com>
Closes #20406 from cxzl25/default_serde.
(cherry picked from commit 4104b68e958cd13975567a96541dac7cccd8195c)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: befb22de81aad41673eec9dba7585b80c6cb2564)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSerDeSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala (diff)
Commit 43f5e40679f771326b2ee72f14cf1ab0ed2ad692 by gatorsmile
[SPARK-23352][PYTHON][BRANCH-2.3] Explicitly specify supported types in
Pandas UDFs
## What changes were proposed in this pull request?
This PR backports https://github.com/apache/spark/pull/20531:
It explicitly specifies the supported types in Pandas UDFs. The main change
here is to add deduplicated and explicit type checking of `returnType`
ahead of time, and to document this; it also happened to fix multiple
things.
1. Currently, we don't support `BinaryType` in Pandas UDFs, for example,
see:
    ```python
   from pyspark.sql.functions import pandas_udf
   pudf = pandas_udf(lambda x: x, "binary")
   df = spark.createDataFrame([[bytearray(1)]])
   df.select(pudf("_1")).show()
   ```
   ```
   ...
   TypeError: Unsupported type in conversion to Arrow: BinaryType
   ```
    We can document this behaviour for its guide.
2. Since we can check the return type ahead, we can fail fast before
actual execution.
    ```python
   # we can fail fast at this stage because we know the schema ahead
   pandas_udf(lambda x: x, BinaryType())
   ```
## How was this patch tested?
Manually tested and unit tests for `BinaryType` and `ArrayType(...)`
were added.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20588 from HyukjinKwon/PR_TOOL_PICK_PR_20531_BRANCH-2.3.
(commit: 43f5e40679f771326b2ee72f14cf1ab0ed2ad692)
The file was modifieddocs/sql-programming-guide.md (diff)
The file was modifiedpython/pyspark/sql/tests.py (diff)
The file was modifiedpython/pyspark/sql/udf.py (diff)
The file was modifiedpython/pyspark/sql/types.py (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit 3737c3d32bb92e73cadaf3b1b9759d9be00b288d by hyukjinkwon
[SPARK-20090][FOLLOW-UP] Revert the deprecation of `names` in PySpark
## What changes were proposed in this pull request? Deprecating the
field `name` in PySpark is not expected. This PR is to revert the
change.
## How was this patch tested? N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20595 from gatorsmile/removeDeprecate.
(cherry picked from commit 407f67249639709c40c46917700ed6dd736daa7d)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: 3737c3d32bb92e73cadaf3b1b9759d9be00b288d)
The file was modifiedpython/pyspark/sql/types.py (diff)
Commit 1c81c0c626f115fbfe121ad6f6367b695e9f3b5f by sowen
[SPARK-23384][WEB-UI] When it has no incomplete(completed) applications
found, the last updated time is not formatted and client local time zone
is not show in history server web ui.
## What changes were proposed in this pull request?
When no incomplete (or completed) applications are found, the last
updated time is not formatted and the client's local time zone is not shown
in the history server web UI. This is a bug.
fix before:
![1](https://user-images.githubusercontent.com/26266482/36070635-264d7cf0-0f3a-11e8-8426-14135ffedb16.png)
fix after:
![2](https://user-images.githubusercontent.com/26266482/36070651-8ec3800e-0f3a-11e8-991c-6122cc9539fe.png)
## How was this patch tested?
Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
Closes #20573 from guoxiaolongzte/SPARK-23384.
(cherry picked from commit 300c40f50ab4258d697f06a814d1491dc875c847)
Signed-off-by: Sean Owen <sowen@cloudera.com>
(commit: 1c81c0c626f115fbfe121ad6f6367b695e9f3b5f)
The file was modifiedcore/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala (diff)
Commit dbb1b399b6cf8372a3659c472f380142146b1248 by irashid
[SPARK-23053][CORE] taskBinarySerialization and task partitions
calculate in DagScheduler.submitMissingTasks should keep the same RDD
checkpoint status
## What changes were proposed in this pull request?
When we run concurrent jobs using the same RDD that is marked for
checkpointing, one job may finish and start `RDD.doCheckpoint` while
another job is being submitted, in which case `submitStage` and
`submitMissingTasks` will be called. In
[submitMissingTasks](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961),
we serialize `taskBinaryBytes` and calculate the task partitions, both of
which are affected by the checkpoint status. If the former is calculated
before `doCheckpoint` finishes while the latter is calculated after it
finishes, then when the task runs and `rdd.compute` is called, RDDs with a
particular partition type, such as
[UnionRDD](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala),
which cast the partition type, will get a `ClassCastException` because the
partition parameter is actually a `CheckpointRDDPartition`. This error
occurs because `rdd.doCheckpoint` happens in the same thread that called
`sc.runJob`, while the task serialization happens in the DAGScheduler's
event loop.
## How was this patch tested?
Existing unit tests, plus a new test case in `DAGSchedulerSuite` that
demonstrates the exception case.
Author: huangtengfei <huangtengfei@huangtengfeideMacBook-Pro.local>
Closes #20244 from ivoson/branch-taskpart-mistype.
(cherry picked from commit 091a000d27f324de8c5c527880854ecfcf5de9a4)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
(commit: dbb1b399b6cf8372a3659c472f380142146b1248)
The file was modifiedcore/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala (diff)
Commit ab01ba718c7752b564e801a1ea546aedc2055dc0 by gatorsmile
[SPARK-23316][SQL] AnalysisException after max iteration reached for IN
query
## What changes were proposed in this pull request? Added flag
ignoreNullability to DataType.equalsStructurally. The previous semantic
is for ignoreNullability=false. When ignoreNullability=true
equalsStructurally ignores nullability of contained types (map key
types, value types, array element types, structure field types).
In.checkInputTypes calls equalsStructurally to check if the children
types match. They should match regardless of nullability (which is just
a hint), so it is now called with ignoreNullability=true.
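A small sketch of the new flag, assuming the default-`false` signature described above (types chosen arbitrarily):
```scala
import org.apache.spark.sql.types._

val a = ArrayType(IntegerType, containsNull = true)
val b = ArrayType(IntegerType, containsNull = false)

DataType.equalsStructurally(a, b)                            // false: nullability differs
DataType.equalsStructurally(a, b, ignoreNullability = true)  // true: nullability ignored
```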
## How was this patch tested? New test in SubquerySuite
Author: Bogdan Raducanu <bogdan@databricks.com>
Closes #20548 from bogdanrdc/SPARK-23316.
(cherry picked from commit 05d051293fe46938e9cb012342fea6e8a3715cd4)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: ab01ba718c7752b564e801a1ea546aedc2055dc0)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala (diff)
Commit 320ffb1309571faedb271f2c769b4ab1ee1cd267 by joseph
[SPARK-23154][ML][DOC] Document backwards compatibility guarantees for
ML persistence
## What changes were proposed in this pull request?
Added documentation about what MLlib guarantees in terms of loading ML
models and Pipelines from old Spark versions.  Discussed & confirmed on
linked JIRA.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #20592 from jkbradley/SPARK-23154-backwards-compat-doc.
(cherry picked from commit d58fe28836639e68e262812d911f167cb071007b)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
(commit: 320ffb1309571faedb271f2c769b4ab1ee1cd267)
The file was modifieddocs/ml-pipeline.md (diff)
Commit 4f6a457d464096d791e13e57c55bcf23c01c418f by zsxwing
[SPARK-23400][SQL] Add a constructors for ScalaUDF
## What changes were proposed in this pull request?
In the upcoming 2.3 release, we changed the interface of `ScalaUDF`.
Unfortunately, some Spark packages (e.g., spark-deep-learning) are using
our internal class `ScalaUDF`. In the 2.3 release, we added new
parameters to this class, so those users hit binary compatibility
issues and got the exception:
```
java.lang.NoSuchMethodError:
org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V
```
This PR improves backward compatibility. However, we definitely
should not encourage external packages to use our internal classes;
that makes the Spark code base harder to maintain and develop.
## How was this patch tested? N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20591 from gatorsmile/scalaUDF.
(cherry picked from commit 2ee76c22b6e48e643694c9475e5f0d37124215e7)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
(commit: 4f6a457d464096d791e13e57c55bcf23c01c418f)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala (diff)
Commit bb26bdb55fdf84c4e36fd66af9a15e325a3982d6 by wenchen
[SPARK-23399][SQL] Register a task completion listener first for
OrcColumnarBatchReader
This PR aims to resolve an open file leakage issue reported at
[SPARK-23390](https://issues.apache.org/jira/browse/SPARK-23390) by
moving the listener registration position. Currently, the sequence is
like the following.
1. Create `batchReader`.
2. `batchReader.initialize` opens an ORC file.
3. `batchReader.initBatch` may take a long time to allocate memory in some
environments and cause errors.
4. `Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => iter.close()))`
This PR moves step 4 before steps 2 and 3. To sum up, the new sequence is
1 -> 4 -> 2 -> 3.
Manual. The following test case intentionally triggers OOM to cause a
leaked filesystem connection in the current code base. With this patch, the
leakage doesn't occur.
```scala
// This should be tested manually because it raises OOM intentionally
// in order to cause `Leaked filesystem connection`.
test("SPARK-23399 Register a task completion listener first for
OrcColumnarBatchReader") {
   withSQLConf(SQLConf.ORC_VECTORIZED_READER_BATCH_SIZE.key ->
s"${Int.MaxValue}") {
     withTempDir { dir =>
       val basePath = dir.getCanonicalPath
       Seq(0).toDF("a").write.format("orc").save(new Path(basePath,
"first").toString)
       Seq(1).toDF("a").write.format("orc").save(new Path(basePath,
"second").toString)
       val df = spark.read.orc(
         new Path(basePath, "first").toString,
         new Path(basePath, "second").toString)
       val e = intercept[SparkException] {
         df.collect()
       }
       assert(e.getCause.isInstanceOf[OutOfMemoryError])
     }
   }
}
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20590 from dongjoon-hyun/SPARK-23399.
(cherry picked from commit 357babde5a8eb9710de7016d7ae82dee21fa4ef3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: bb26bdb55fdf84c4e36fd66af9a15e325a3982d6)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (diff)