SuccessChanges

Summary

  1. [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from (commit: 3e5b4ae63a468858ff8b9f7f3231cc877846a0af) (details)
  2. [SPARK-19826][ML][PYTHON] add spark.ml Python API for PIC (commit: a99d284c16cc4e00ce7c83ecdc3db6facd467552) (details)
  3. [MINOR][CORE] Log committer class used by HadoopMapRedCommitProtocol (commit: 9b6f24202f6f8d9d76bbe53f379743318acb19f9) (details)
  4. [SPARK-24520] Double braces in documentations (commit: 2dc047a3189290411def92f6d7e9a4e01bdb2c30) (details)
  5. [SPARK-24134][DOCS] A missing full-stop in doc "Tuning Spark". (commit: f5af86ea753c446df59a0a8c16c685224690d633) (details)
  6. [SPARK-22144][SQL] ExchangeCoordinator combine the partitions of an 0 (commit: 048197749ef990e4def1fcbf488f3ded38d95cae) (details)
  7. [SPARK-23732][DOCS] Fix source links in generated scaladoc. (commit: dc22465f3e1ef5ad59306b1f591d6fd16d674eb7) (details)
  8. [SPARK-24502][SQL] flaky test: UnsafeRowSerializerSuite (commit: 01452ea9c75ff027ceeb8314368c6bbedefdb2bf) (details)
  9. docs: fix typo (commit: 1d7db65e968de1c601e7f8b1ec9bc783ef2dbd01) (details)
  10. [SPARK-15064][ML] Locale support in StopWordsRemover (commit: 5d6a53d9831cc1e2115560db5cebe0eea2565dcd) (details)
  11. [SPARK-24531][TESTS] Remove version 2.2.0 from testing versions in (commit: 2824f1436bb0371b7216730455f02456ef8479ce) (details)
  12. [SPARK-24416] Fix configuration specification for killBlacklisted (commit: 3af1d3e6d95719e15a997877d5ecd3bb40c08b9c) (details)
  13. [SPARK-23931][SQL] Adds arrays_zip function to sparksql (commit: f0ef1b311dd5399290ad6abe4ca491bdb13478f0) (details)
  14. [SPARK-24216][SQL] Spark TypedAggregateExpression uses getSimpleName (commit: cc88d7fad16e8b5cbf7b6b9bfe412908782b4a45) (details)
  15. [SPARK-23933][SQL] Add map_from_arrays function (commit: ada28f25955a9e8ddd182ad41b2a4ef278f3d809) (details)
  16. [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failure of (commit: 0d3714d221460a2a1141134c3d451f18c4e0d46f) (details)
  17. [SPARK-24506][UI] Add UI filters to tabs added after binding (commit: f53818d35bdef5d20a2718b14a2fed4c468545c6) (details)
  18. [SPARK-22239][SQL][PYTHON] Enable grouped aggregate pandas UDFs as (commit: 9786ce66c52d41b1d58ddedb3a984f561fd09ff3) (details)
  19. [SPARK-24466][SS] Fix TextSocketMicroBatchReader to be compatible with (commit: 3352d6fe9a1efb6dee18e40bdf584930b10d1d3e) (details)
  20. [SPARK-24485][SS] Measure and log elapsed time for filesystem operations (commit: 4c388bccf1bcac8f833fd9214096dd164c3ea065) (details)
  21. [SPARK-24479][SS] Added config for registering streamingQueryListeners (commit: 7703b46d2843db99e28110c4c7ccf60934412504) (details)
  22. [SPARK-24500][SQL] Make sure streams are materialized during Tree (commit: 299d297e250ca3d46616a97e4256aa9ad6a135e5) (details)
  23. [SPARK-24235][SS] Implement continuous shuffle writer for single reader (commit: 1b46f41c55f5cd29956e17d7da95a95580cf273f) (details)
  24. [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1 (commit: 3bf76918fb67fb3ee9aed254d4fb3b87a7e66117) (details)
  25. [MINOR][CORE][TEST] Remove unnecessary sort in UnsafeInMemorySorterSuite (commit: 534065efeb51ff0d308fa6cc9dea0715f8ce25ad) (details)
  26. [SPARK-24495][SQL] EnsureRequirement returns wrong plan when reordering (commit: fdadc4be08dcf1a06383bbb05e53540da2092c63) (details)
  27. [SPARK-24563][PYTHON] Catch TypeError when testing existence of HiveConf (commit: d3eed8fd6d65d95306abfb513a9e0fde05b703ac) (details)
  28. [SPARK-24543][SQL] Support any type as DDL string for from_json's schema (commit: b8f27ae3b34134a01998b77db4b7935e7f82a4fe) (details)
  29. [SPARK-24319][SPARK SUBMIT] Fix spark-submit execution where no main (commit: 18cb0c07988578156c869682d8a2c4151e8d35e5) (details)
  30. [SPARK-24248][K8S] Use level triggering and state reconciliation in (commit: 270a9a3cac25f3e799460320d0fc94ccd7ecfaea) (details)
  31. [SPARK-24478][SQL] Move projection and filter push down to physical (commit: 22daeba59b3ffaccafc9ff4b521abc265d0e58dd) (details)
  32. [PYTHON] Fix typo in serializer exception (commit: 6567fc43aca75b41900cde976594e21c8b0ca98a) (details)
  33. [SPARK-24490][WEBUI] Use WebUI.addStaticHandler in web UIs (commit: 495d8cf09ae7134aa6d2feb058612980e02955fa) (details)
  34. [SPARK-24396][SS][PYSPARK] Add Structured Streaming ForeachWriter for (commit: b5ccf0d3957a444db93893c0ce4417bfbbb11822) (details)
  35. [SPARK-24452][SQL][CORE] Avoid possible overflow in int add or multiple (commit: 90da7dc241f8eec2348c0434312c97c116330bc4) (details)
  36. [SPARK-24525][SS] Provide an option to limit number of rows in a (commit: e4fee395ecd93ad4579d9afbf0861f82a303e563) (details)
  37. add one supported type missing from the javadoc (commit: c7c0b086a0b18424725433ade840d5121ac2b86e) (details)
  38. initial commit (commit: 40ce9662006d6131230899f2a8d3da86325c7cf2) (details)
  39. update description (commit: ac0b68d0e70d28acace3cad45bbe718d3aa82e93) (details)
  40. fix test failure (commit: c30f9f090ca84b23412d6d95354ef6fbc9b091cc) (details)
  41. address review comments (commit: b1d4f711419716c79b2488181a791353ed93430e) (details)
  42. introduce ArraySetUtils to reuse code among (commit: 5a47f64c33ddf5ec66ac94430b5ee58d40e2991e) (details)
  43. fix python test failure (commit: b6319ca2caa9383b6ea990840737bdb2af5ad3b9) (details)
  44. fix python test failure (commit: 1e4519be304f34d0c5f783f63bd8730af0e92949) (details)
  45. simplification (commit: 97cd96648d66875fd622e379f4dba475d19da577) (details)
  46. fix pyspark test failure (commit: affd34b9553969c23302d6cd88a4ff842206dfd3) (details)
  47. address review comments (commit: 275d636195526f2ba8bcebdb68be16540c78867f) (details)
  48. add new tests based on review comment (commit: 9283fece6ee3820b09cd49c111a3df1417eb0c48) (details)
  49. fix mistakes in rebase (commit: 1b30010b55c5224a3ae9a47ae6c2192635c0475c) (details)
  50. fix unexpected changes (commit: 87794af152f033d4b4f8bef7b44ff72baf2567ba) (details)
  51. merge changes in #21103 (commit: bc0906c25f0c7b6f0db22064716f1eaf3bbfc867) (details)
  52. use GenericArrayData if UnsafeArrayData cannot be used (commit: 0d1d855d3406d8da5adabec92871b2dd9fdb4aa0) (details)
  53. use BinaryArrayExpressionWithImplicitCast (commit: a8e9032de4a7e2f7297ad9a14f107e447094353e) (details)
  54. update test cases (commit: c9b66f84866ea74bd872b6e862434d177d7299cc) (details)
  55. rebase with master (commit: ac9ec03deb96ce9a82da55c7d0e45d1d75bd0bed) (details)
  56. support complex types (commit: a65fb7acdf3b0b9ac679ae142f2b164cd892828a) (details)
  57. add test cases with duplication in an array (commit: f853cbd5f41f2ebfdd58f2720b6552e042f3c2f5) (details)
  58. rebase with master (commit: 2a9898e72f589de8069de7dd130f729507e20b28) (details)
  59. address review comments (commit: 2edee3681328288667fa1ca815ef7a1672f8586e) (details)
  60. address review comment (commit: c98d9907d48dd77d09d063acf4dc8f98d9e0d270) (details)
  61. keep the order of input array elements (commit: 83f502e3d10281d03a47b7442866eb7f7e68c4e5) (details)
  62. [SPARK-24573][INFRA] Runs SBT checkstyle after the build to work around (commit: b0a935255951280b49c39968f6234163e2f0e379) (details)
  63. [SPARK-23772][SQL] Provide an option to ignore column of all null values (commit: e219e692ef70c161f37a48bfdec2a94b29260004) (details)
  64. [SPARK-24526][BUILD][TEST-MAVEN] Spaces in the build dir causes failures (commit: bce177552564a4862bc979d39790cf553a477d74) (details)
  65. [SPARK-24548][SQL] Fix incorrect schema of Dataset with tuple encoders (commit: 8f225e055c2031ca85d61721ab712170ab4e50c1) (details)
  66. [SPARK-24478][SQL][FOLLOWUP] Move projection and filter push down to (commit: 1737d45e08a5f1fb78515b14321721d7197b443a) (details)
  67. [SPARK-24542][SQL] UDF series UDFXPathXXXX allow users to pass carefully (commit: 9a75c18290fff7d116cf88a44f9120bf67d8bd27) (details)
  68. [SPARK-24521][SQL][TEST] Fix ineffective test in CachedTableSuite (commit: a78a9046413255756653f70165520efd486fb493) (details)
  69. [SPARK-24556][SQL] Always rewrite output partitioning in (commit: 9dbe53eb6bb5916d28000f2c0d646cf23094ac11) (details)
  70. [SPARK-24534][K8S] Bypass non spark-on-k8s commands (commit: 13092d733791b19cd7994084178306e0c449f2ed) (details)
  71. [SPARK-24565][SS] Add API for in Structured Streaming for exposing (commit: 2cb976355c615eee4ebd0a86f3911fa9284fccf6) (details)
  72. [SPARK-24583][SQL] Wrong schema type in InsertIntoDataSourceCommand (commit: bc0498d5820ded2b428277e396502e74ef0ce36d) (details)
  73. address review comments (commit: 6acb5a414108c1039169c61d88540ce7fdc6be18) (details)
Commit 3e5b4ae63a468858ff8b9f7f3231cc877846a0af by hyukjinkwon
[SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from
driver to executor
## What changes were proposed in this pull request? SPARK-23754 was
fixed in #21383 by changing the UDF code to wrap the user function, but
this required a hack to save its argspec. This PR reverts this change
and fixes the `StopIteration` bug in the worker
## How does this work?
The root of the problem is that when a user-supplied function raises a
`StopIteration`, PySpark may stop processing data if that function is
used in a for-loop. The solution is to catch `StopIteration` exceptions
and re-raise them as `RuntimeError`s, so that the execution fails and
the error is reported to the user. This is done using the
`fail_on_stopiteration` wrapper (sketched below), applied in different
ways depending on where the function is used:
- In RDDs, the user function is wrapped in the driver, because this
function is also called in the driver itself.
- In SQL UDFs, the function is wrapped in the worker, since all
processing happens there. Moreover, the worker needs the signature of
the user function, which is lost when wrapping it, and passing this
signature to the worker would require a rather ugly hack.
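For reference, a minimal sketch of such a wrapper (the actual implementation lives in the modified `python/pyspark/util.py`; the example UDF below is hypothetical):
```python
import functools

def fail_on_stopiteration(f):
    """Wrap f so a StopIteration raised inside it surfaces as a RuntimeError
    instead of silently ending the caller's for-loop over the data."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except StopIteration as exc:
            raise RuntimeError(
                "Caught StopIteration thrown from user's code; failing the task", exc)
    return wrapper

# A UDF that leaks StopIteration now fails loudly instead of truncating results.
@fail_on_stopiteration
def bad_udf(x):
    raise StopIteration()
```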
## How was this patch tested?
Same tests, plus tests for pandas UDFs
Author: edorigatti <emilio.dorigatti@gmail.com>
Closes #21467 from e-dorigatti/fix_udf_hack.
(commit: 3e5b4ae63a468858ff8b9f7f3231cc877846a0af)
The file was modified python/pyspark/tests.py (diff)
The file was modified python/pyspark/util.py (diff)
The file was modified python/pyspark/sql/udf.py (diff)
The file was modified python/pyspark/sql/tests.py (diff)
The file was modified python/pyspark/worker.py (diff)
Commit a99d284c16cc4e00ce7c83ecdc3db6facd467552 by meng
[SPARK-19826][ML][PYTHON] add spark.ml Python API for PIC
## What changes were proposed in this pull request?
add spark.ml Python API for PIC
## How was this patch tested?
add doctest
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #21513 from huaxingao/spark--19826.
(commit: a99d284c16cc4e00ce7c83ecdc3db6facd467552)
The file was modified python/pyspark/ml/clustering.py (diff)
Commit 9b6f24202f6f8d9d76bbe53f379743318acb19f9 by srowen
[MINOR][CORE] Log committer class used by HadoopMapRedCommitProtocol
## What changes were proposed in this pull request?
When HadoopMapRedCommitProtocol is used (e.g., when using
saveAsTextFile() or saveAsHadoopFile() with RDDs), it's not easy to
determine which output committer class was used, so this PR simply logs
the class that was used, similarly to what is done in
SQLHadoopMapReduceCommitProtocol.
## How was this patch tested?
Built Spark then manually inspected logging when calling
saveAsTextFile():
```scala
scala> sc.setLogLevel("INFO")
scala> sc.textFile("README.md").saveAsTextFile("/tmp/out")
...
18/05/29 10:06:20 INFO HadoopMapRedCommitProtocol: Using output committer class org.apache.hadoop.mapred.FileOutputCommitter
```
Author: Jonathan Kelly <jonathak@amazon.com>
Closes #21452 from ejono/master.
(commit: 9b6f24202f6f8d9d76bbe53f379743318acb19f9)
The file was modified core/src/main/scala/org/apache/spark/internal/io/HadoopMapRedCommitProtocol.scala (diff)
Commit 2dc047a3189290411def92f6d7e9a4e01bdb2c30 by srowen
[SPARK-24520] Double braces in documentations
## What changes were proposed in this pull request?
There are double braces in the markdown, which break the link; this change fixes them.
Author: Fokko Driesprong <fokkodriesprong@godatadriven.com>
Closes #21528 from Fokko/patch-1.
(commit: 2dc047a3189290411def92f6d7e9a4e01bdb2c30)
The file was modified docs/structured-streaming-programming-guide.md (diff)
Commit f5af86ea753c446df59a0a8c16c685224690d633 by srowen
[SPARK-24134][DOCS] A missing full-stop in doc "Tuning Spark".
## What changes were proposed in this pull request?
In the document [Tuning Spark -> Determining Memory
Consumption](https://spark.apache.org/docs/latest/tuning.html#determining-memory-consumption),
a full stop was missing in the second paragraph.
It's `...use SizeEstimator’s estimate method This is useful for
experimenting...`, while there is supposed to be a full stop before
`This`.
A screenshot showing the text before the change:
https://user-images.githubusercontent.com/11539188/39468206-778e3d8a-4d64-11e8-8a92-38464952b54b.png
## How was this patch tested?
This is a simple doc change; only one full stop was added in plain text.
Author: Xiaodong <11539188+XD-DENG@users.noreply.github.com>
Closes #21205 from XD-DENG/patch-1.
(commit: f5af86ea753c446df59a0a8c16c685224690d633)
The file was modified docs/tuning.md (diff)
Commit 048197749ef990e4def1fcbf488f3ded38d95cae by wenchen
[SPARK-22144][SQL] ExchangeCoordinator combine the partitions of an 0
sized pre-shuffle to 0
## What changes were proposed in this pull request? When the pre-shuffle
has 0 partitions, the number of post-shuffle partitions should be 0
instead of spark.sql.shuffle.partitions.
## How was this patch tested? Verified that ExchangeCoordinator converts
a pre-shuffle with 0 partitions into a post-shuffle with 0 partitions,
rather than one with spark.sql.shuffle.partitions partitions.
Author: liutang123 <liutang123@yeah.net>
Closes #19364 from liutang123/SPARK-22144.
(commit: 048197749ef990e4def1fcbf488f3ded38d95cae)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ExchangeCoordinator.scala (diff)
Commit dc22465f3e1ef5ad59306b1f591d6fd16d674eb7 by hyukjinkwon
[SPARK-23732][DOCS] Fix source links in generated scaladoc.
Apply the suggestion on the bug to fix source links. Tested with the
2.3.1 release docs.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #21521 from vanzin/SPARK-23732.
(commit: dc22465f3e1ef5ad59306b1f591d6fd16d674eb7)
The file was modified project/SparkBuild.scala (diff)
Commit 01452ea9c75ff027ceeb8314368c6bbedefdb2bf by wenchen
[SPARK-24502][SQL] flaky test: UnsafeRowSerializerSuite
## What changes were proposed in this pull request?
`UnsafeRowSerializerSuite` calls `UnsafeProjection.create`, which
accesses `SQLConf.get` while the currently active SparkSession may
already be stopped, and we may hit an exception like this:
```
sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is stopped.
at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
...
```
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #21518 from cloud-fan/test.
(commit: 01452ea9c75ff027ceeb8314368c6bbedefdb2bf)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/LocalSparkSession.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeRowSerializerSuite.scala (diff)
Commit 1d7db65e968de1c601e7f8b1ec9bc783ef2dbd01 by srowen
docs: fix typo
no => no[t]
## What changes were proposed in this pull request?
Fixing a typo.
## How was this patch tested?
Visual check of the docs.
Author: Tom Saleeba <tom.saleeba@gmail.com>
Closes #21496 from tomsaleeba/patch-1.
(commit: 1d7db65e968de1c601e7f8b1ec9bc783ef2dbd01)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/Column.scala (diff)
Commit 5d6a53d9831cc1e2115560db5cebe0eea2565dcd by meng
[SPARK-15064][ML] Locale support in StopWordsRemover
## What changes were proposed in this pull request?
Add locale support for `StopWordsRemover`.
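A short PySpark sketch of the new parameter, assuming an active SparkSession bound to `spark`:
```python
from pyspark.ml.feature import StopWordsRemover

# `locale` is the new Param added here; it controls the locale used for
# case-insensitive stop-word matching.
remover = StopWordsRemover(inputCol="raw", outputCol="filtered", locale="en_US")
df = spark.createDataFrame([(["The", "quick", "brown", "fox"],)], ["raw"])
remover.transform(df).show(truncate=False)  # filtered -> [quick, brown, fox]
```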
## How was this patch tested?
[Scala|Python] unit tests.
Author: Lee Dongjin <dongjin@apache.org>
Closes #21501 from dongjinleekr/feature/SPARK-15064.
(commit: 5d6a53d9831cc1e2115560db5cebe0eea2565dcd)
The file was modified python/pyspark/ml/tests.py (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala (diff)
The file was modified python/pyspark/ml/feature.py (diff)
Commit 2824f1436bb0371b7216730455f02456ef8479ce by gatorsmile
[SPARK-24531][TESTS] Remove version 2.2.0 from testing versions in
HiveExternalCatalogVersionsSuite
## What changes were proposed in this pull request?
Removing version 2.2.0 from testing versions in
HiveExternalCatalogVersionsSuite as it is not present anymore in the
mirrors and this is blocking all the open PRs.
## How was this patch tested?
running UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21540 from mgaido91/SPARK-24531.
(commit: 2824f1436bb0371b7216730455f02456ef8479ce)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala (diff)
Commit 3af1d3e6d95719e15a997877d5ecd3bb40c08b9c by irashid
[SPARK-24416] Fix configuration specification for killBlacklisted
executors
## What changes were proposed in this pull request?
spark.blacklist.killBlacklistedExecutors is defined as
(Experimental) If set to "true", allow Spark to automatically kill, and
attempt to re-create, executors when they are blacklisted. Note that,
when an entire node is added to the blacklist, all of the executors on
that node will be killed.
I presume the killing of blacklisted executors only happens after the
stage completes successfully and all tasks have completed, or on fetch
failures
(updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). The
current wording is confusing because it states that the executor will be
recreated as soon as it is blacklisted, which is not true: while a stage
is in progress and an executor is blacklisted, Spark will not attempt to
clean it up until the stage finishes.
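For illustration, the related settings are supplied like any other Spark configuration (killing only applies when blacklisting is enabled); a minimal PySpark sketch:
```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("blacklist-kill-example")
         # Blacklisting must be enabled for the kill behavior to take effect.
         .config("spark.blacklist.enabled", "true")
         .config("spark.blacklist.killBlacklistedExecutors", "true")
         .getOrCreate())
```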
Author: Sanket Chintapalli <schintap@yahoo-inc.com>
Closes #21475 from redsanket/SPARK-24416.
(commit: 3af1d3e6d95719e15a997877d5ecd3bb40c08b9c)
The file was modified docs/configuration.md (diff)
Commit f0ef1b311dd5399290ad6abe4ca491bdb13478f0 by ueshin
[SPARK-23931][SQL] Adds arrays_zip function to sparksql
Signed-off-by: DylanGuedes <djmgguedes@gmail.com>
## What changes were proposed in this pull request?
Adds the arrays_zip function to Spark SQL functions.
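A quick PySpark usage sketch (the Scala and SQL APIs are analogous); `spark` is an active SparkSession:
```python
from pyspark.sql.functions import arrays_zip

df = spark.createDataFrame([([1, 2, 3], ["a", "b", "c"])], ["nums", "letters"])
# arrays_zip merges the input arrays element-wise into an array of structs:
# [[1, a], [2, b], [3, c]]
df.select(arrays_zip("nums", "letters").alias("zipped")).show(truncate=False)
```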
## How was this patch tested?
Unit tests that check whether the results are correct.
Author: DylanGuedes <djmgguedes@gmail.com>
Closes #21045 from DylanGuedes/SPARK-23931.
(commit: f0ef1b311dd5399290ad6abe4ca491bdb13478f0)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
Commit cc88d7fad16e8b5cbf7b6b9bfe412908782b4a45 by wenchen
[SPARK-24216][SQL] Spark TypedAggregateExpression uses getSimpleName
that is not safe in scala
## What changes were proposed in this pull request?
When a user creates an aggregator object in Scala and passes it to
Spark Dataset's agg() method, Spark will initialize
TypedAggregateExpression with the nodeName field set to
aggregator.getClass.getSimpleName. However, getSimpleName is not safe in
a Scala environment, depending on how the user creates the aggregator
object. For example, if the aggregator class's fully qualified name is
"com.my.company.MyUtils$myAgg$2$", getSimpleName will throw
java.lang.InternalError "Malformed class name". This has been reported
in scalatest https://github.com/scalatest/scalatest/pull/1044 and
discussed in many Scala upstream JIRAs such as SI-8110 and SI-5425.
To fix this issue, we follow the solution in
https://github.com/scalatest/scalatest/pull/1044 to add a safer version
of getSimpleName as a util method, and TypedAggregateExpression invokes
this util method rather than getClass.getSimpleName.
## How was this patch tested? Added a unit test.
Author: Fangshi Li <fli@linkedin.com>
Closes #21276 from fangshil/SPARK-24216.
(commit: cc88d7fad16e8b5cbf7b6b9bfe412908782b4a45)
The file was modified core/src/test/scala/org/apache/spark/util/UtilsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2StringFormat.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/Utils.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala (diff)
Commit ada28f25955a9e8ddd182ad41b2a4ef278f3d809 by ueshin
[SPARK-23933][SQL] Add map_from_arrays function
## What changes were proposed in this pull request?
The PR adds the SQL function `map_from_arrays`. The behavior of the
function is based on Presto's `map`. Since Spark SQL already has a `map`
function, a different name was chosen for this behavior.
This function returns a map built from a pair of arrays holding the keys
and values.
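A brief PySpark sketch of the new function, assuming an active SparkSession bound to `spark`:
```python
from pyspark.sql.functions import array, lit, map_from_arrays

# Builds a map column from parallel key and value arrays: {a -> 1, b -> 2}
spark.range(1).select(
    map_from_arrays(array(lit("a"), lit("b")),
                    array(lit(1), lit(2))).alias("m")
).show(truncate=False)
```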
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #21258 from kiszk/SPARK-23933.
(commit: ada28f25955a9e8ddd182ad41b2a4ef278f3d809)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala (diff)
Commit 0d3714d221460a2a1141134c3d451f18c4e0d46f by vanzin
[SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failure of
kubernetes-integration-tests
## What changes were proposed in this pull request?
Fix java checkstyle failure of kubernetes-integration-tests
## How was this patch tested?
Checked manually on my local environment.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes #21545 from jiangxb1987/k8s-checkstyle.
(commit: 0d3714d221460a2a1141134c3d451f18c4e0d46f)
The file was modified project/SparkBuild.scala (diff)
Commit f53818d35bdef5d20a2718b14a2fed4c468545c6 by vanzin
[SPARK-24506][UI] Add UI filters to tabs added after binding
## What changes were proposed in this pull request?
Currently, `spark.ui.filters` are not applied to the handlers added
after binding the server. This means that every page which is added
after starting the UI will not have the filters configured on it. This
can allow unauthorized access to the pages.
The PR adds the filters also to the handlers added after the UI starts.
## How was this patch tested?
manual tests (without the patch, starting the thriftserver with `--conf
spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
--conf
spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=simple"`
you can access `http://localhost:4040/sqlserver`; with the patch, 401 is
the response as for the other pages).
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21523 from mgaido91/SPARK-24506.
(commit: f53818d35bdef5d20a2718b14a2fed4c468545c6)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/JettyUtils.scala (diff)
Commit 9786ce66c52d41b1d58ddedb3a984f561fd09ff3 by hyukjinkwon
[SPARK-22239][SQL][PYTHON] Enable grouped aggregate pandas UDFs as
window functions with unbounded window frames
## What changes were proposed in this pull request? This PR enables
using grouped aggregate pandas UDFs as window functions. The semantics
are the same as using SQL aggregation functions as window functions.
```
      >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
      >>> from pyspark.sql import Window
      >>> df = spark.createDataFrame(
      ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
      ...     ("id", "v"))
      >>> pandas_udf("double", PandasUDFType.GROUPED_AGG)
      ... def mean_udf(v):
      ...     return v.mean()
      >>> w = Window.partitionBy('id')
      >>> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
      +---+----+------+
      | id|   v|mean_v|
      +---+----+------+
      |  1| 1.0|   1.5|
      |  1| 2.0|   1.5|
      |  2| 3.0|   6.0|
      |  2| 5.0|   6.0|
      |  2|10.0|   6.0|
      +---+----+------+
```
The scope of this PR is somewhat limited in terms of:
(1) Only supports unbounded window, which acts essentially as group by.
(2) Only supports aggregation functions, not "transform" like window
functions (n -> n mapping)
Both of these are left as future work. Especially, (1) needs careful
thinking w.r.t. how to pass rolling window data to python efficiently.
(2) is a bit easier but does require more changes therefore I think it's
better to leave it as a separate PR.
## How was this patch tested?
WindowPandasUDFTests
Author: Li Jin <ice.xelloss@gmail.com>
Closes #21082 from icexelloss/SPARK-22239-window-udf.
(commit: 9786ce66c52d41b1d58ddedb3a984f561fd09ff3)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)
The file was modified python/pyspark/worker.py (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala (diff)
The file was modified python/pyspark/rdd.py (diff)
The file was modified python/pyspark/sql/tests.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala (diff)
Commit 3352d6fe9a1efb6dee18e40bdf584930b10d1d3e by hyukjinkwon
[SPARK-24466][SS] Fix TextSocketMicroBatchReader to be compatible with
netcat again
## What changes were proposed in this pull request?
TextSocketMicroBatchReader was no longer compatible with netcat because
it launched a temporary reader to read the schema, closed it, and then
re-opened a reader. While a reliable socket server should handle this
without any issue, the nc command normally can't handle multiple
connections and simply exits when the temporary reader is closed.
This patch fixes TextSocketMicroBatchReader to be compatible with netcat
again by deferring the socket opening to the first call of
planInputPartitions() instead of the constructor.
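The netcat workflow this restores looks roughly like the following PySpark sketch (run `nc -lk 9999` in a separate terminal first; `spark` is an active SparkSession):
```python
# Terminal 1: nc -lk 9999
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Lines typed into the nc session appear in the console sink; with this fix,
# nc no longer exits when the schema-inference reader is opened and closed.
query = lines.writeStream.format("console").start()
query.awaitTermination()
```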
## How was this patch tested?
Added a unit test which fails on the current code and succeeds with the
patch. Also manually tested.
Author: Jungtaek Lim <kabhwan@gmail.com>
Closes #21497 from HeartSaVioR/SPARK-24466.
(commit: 3352d6fe9a1efb6dee18e40bdf584930b10d1d3e)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/socket.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/TextSocketStreamSuite.scala (diff)
Commit 4c388bccf1bcac8f833fd9214096dd164c3ea065 by hyukjinkwon
[SPARK-24485][SS] Measure and log elapsed time for filesystem operations
in HDFSBackedStateStoreProvider
## What changes were proposed in this pull request?
This patch measures and logs the elapsed time for each operation that
communicates with the file system (mostly remote HDFS in production) in
HDFSBackedStateStoreProvider, to help investigate any latency issues.
## How was this patch tested?
Manually tested.
Author: Jungtaek Lim <kabhwan@gmail.com>
Closes #21506 from HeartSaVioR/SPARK-24485.
(commit: 4c388bccf1bcac8f833fd9214096dd164c3ea065)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/Utils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala (diff)
Commit 7703b46d2843db99e28110c4c7ccf60934412504 by hyukjinkwon
[SPARK-24479][SS] Added config for registering streamingQueryListeners
## What changes were proposed in this pull request?
Currently a "StreamingQueryListener" can only be registered
programmatically. This adds a new config
"spark.sql.streamingQueryListeners", similar to
"spark.sql.queryExecutionListeners" and "spark.extraListeners", so that
users can register custom streaming listeners.
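A sketch of how a listener could then be registered purely through configuration, using the config name as written above (the final key may differ; `my.pkg.MyListener` is a hypothetical listener class on the classpath):
```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # my.pkg.MyListener stands in for a user-provided StreamingQueryListener
         # implementation; the key name follows the description above.
         .config("spark.sql.streamingQueryListeners", "my.pkg.MyListener")
         .getOrCreate())
```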
## How was this patch tested?
New unit test and running example programs.
Author: Arun Mahadevan <arunm@apache.org>
Closes #21504 from arunmahadevan/SPARK-24480.
(commit: 7703b46d2843db99e28110c4c7ccf60934412504)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala (diff)
Commit 299d297e250ca3d46616a97e4256aa9ad6a135e5 by wenchen
[SPARK-24500][SQL] Make sure streams are materialized during Tree
transforms.
## What changes were proposed in this pull request? If you construct
catalyst trees using `scala.collection.immutable.Stream` you can run
into situations where valid transformations do not seem to have any
effect. There are two causes for this behavior:
- `Stream` is evaluated lazily. Note that the default implementation will
generally only evaluate the function for the first element (this makes
testing a bit tricky).
- `TreeNode` and `QueryPlan` use side effects to detect if a tree has
changed. Mapping over a stream is lazy and does not need to trigger this
side effect. If this happens, the node will incorrectly assume that it
did not change and return itself instead of the newly created node (this
is done for GC reasons).
This PR fixes this issue by forcing materialization on streams in
`TreeNode` and `QueryPlan`.
## How was this patch tested? Unit tests were added to `TreeNodeSuite`
and `LogicalPlanSuite`. An integration test was added to the
`PlannerSuite`
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes #21539 from hvanhovell/SPARK-24500.
(commit: 299d297e250ca3d46616a97e4256aa9ad6a135e5)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/LogicalPlanSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/trees/TreeNodeSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala (diff)
Commit 1b46f41c55f5cd29956e17d7da95a95580cf273f by zsxwing
[SPARK-24235][SS] Implement continuous shuffle writer for single reader
partition.
## What changes were proposed in this pull request?
https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit
Implement continuous shuffle write RDD for a single reader partition. (I
don't believe any implementation changes are actually required for
multiple reader partitions, but this PR is already very large, so I want
to exclude those for now to keep the size down.)
## How was this patch tested?
new unit tests
Author: Jose Torres <torres.joseph.f+github@gmail.com>
Closes #21428 from jose-torres/writerTask.
(commit: 1b46f41c55f5cd29956e17d7da95a95580cf273f)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/RPCContinuousShuffleReader.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/shuffle/ContinuousShuffleSuite.scala
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleWriter.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleReadRDD.scala (diff)
The file was removed sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/shuffle/ContinuousShuffleReadSuite.scala
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/RPCContinuousShuffleWriter.scala
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/UnsafeRowReceiver.scala
Commit 3bf76918fb67fb3ee9aed254d4fb3b87a7e66117 by gatorsmile
[SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
## What changes were proposed in this pull request?
The PR updates the 2.3 version tested to the new release 2.3.1.
## How was this patch tested?
existing UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21543 from mgaido91/patch-1.
(commit: 3bf76918fb67fb3ee9aed254d4fb3b87a7e66117)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala (diff)
Commit 534065efeb51ff0d308fa6cc9dea0715f8ce25ad by hyukjinkwon
[MINOR][CORE][TEST] Remove unnecessary sort in UnsafeInMemorySorterSuite
## What changes were proposed in this pull request?
We don't require a specific ordering of the input data, so the sort is
unnecessary and misleading.
## How was this patch tested?
Existing test suite.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes #21536 from jiangxb1987/sorterSuite.
(commit: 534065efeb51ff0d308fa6cc9dea0715f8ce25ad)
The file was modified core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java (diff)
Commit fdadc4be08dcf1a06383bbb05e53540da2092c63 by gatorsmile
[SPARK-24495][SQL] EnsureRequirement returns wrong plan when reordering
equal keys
## What changes were proposed in this pull request?
`EnsureRequirement` in its `reorder` method currently assumes that the
same key appears only once in the join condition. This of course might
not be the case, and when the assumption does not hold, it returns a
wrong plan which produces an incorrect query result.
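An illustrative (hypothetical) query shape in PySpark where the same key appears twice in the join condition:
```python
df1 = spark.createDataFrame([(1, 1)], ["a", "b"])
df2 = spark.createDataFrame([(1, 1)], ["x", "y"])
# df1.a participates in two equality predicates; before this fix, reordering
# the join keys could produce a plan that returns wrong results.
joined = df1.join(df2, (df1.a == df2.x) & (df1.a == df2.y))
joined.show()
```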
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21529 from mgaido91/SPARK-24495.
(commit: fdadc4be08dcf1a06383bbb05e53540da2092c63)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (diff)
Commit d3eed8fd6d65d95306abfb513a9e0fde05b703ac by vanzin
[SPARK-24563][PYTHON] Catch TypeError when testing existence of HiveConf
when creating pyspark shell
## What changes were proposed in this pull request?
This PR catches TypeError when testing existence of HiveConf when
creating pyspark shell
## How was this patch tested?
Manually tested. Here are the manual test cases:
Build with hive:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark Python 3.6.5
| packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin Type
"help", "copyright", "credits" or "license" for more information.
18/06/14 14:55:41 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN". To adjust logging level use
sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome
to
     ____              __
    / __/__  ___ _____/ /__
   _\ \/ _ \/ _ `/ __/  '_/
  /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
     /_/
Using Python version 3.6.5 (default, Apr  6 2018 13:44:09) SparkSession
available as 'spark'.
>>> spark.conf.get('spark.sql.catalogImplementation')
'hive'
```
Build without hive:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark Python 3.6.5
| packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin Type
"help", "copyright", "credits" or "license" for more information.
18/06/14 15:04:52 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN". To adjust logging level use
sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome
to
     ____              __
    / __/__  ___ _____/ /__
   _\ \/ _ \/ _ `/ __/  '_/
  /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
     /_/
Using Python version 3.6.5 (default, Apr  6 2018 13:44:09) SparkSession
available as 'spark'.
>>> spark.conf.get('spark.sql.catalogImplementation')
'in-memory'
```
Failed to start shell:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark Python 3.6.5
| packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin Type
"help", "copyright", "credits" or "license" for more information.
18/06/14 15:07:53 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN". To adjust logging level use
sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/icexelloss/workspace/spark/python/pyspark/shell.py:45:
UserWarning: Failed to initialize Spark session.
warnings.warn("Failed to initialize Spark session.") Traceback (most
recent call last):
File "/Users/icexelloss/workspace/spark/python/pyspark/shell.py", line
41, in <module>
   spark = SparkSession._create_shell_session()
File "/Users/icexelloss/workspace/spark/python/pyspark/sql/session.py",
line 581, in _create_shell_session
   return SparkSession.builder.getOrCreate()
File "/Users/icexelloss/workspace/spark/python/pyspark/sql/session.py",
line 168, in getOrCreate
   raise py4j.protocol.Py4JError("Fake Py4JError")
py4j.protocol.Py4JError: Fake Py4JError
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$
```
Author: Li Jin <ice.xelloss@gmail.com>
Closes #21569 from
icexelloss/SPARK-24563-fix-pyspark-shell-without-hive.
(commit: d3eed8fd6d65d95306abfb513a9e0fde05b703ac)
The file was modified python/pyspark/sql/session.py (diff)
Commit b8f27ae3b34134a01998b77db4b7935e7f82a4fe by wenchen
[SPARK-24543][SQL] Support any type as DDL string for from_json's schema
## What changes were proposed in this pull request?
In the PR, I propose to support any DataType represented as DDL string
for the from_json function. After the changes, it will be possible to
specify `MapType` in SQL like:
```sql
select from_json('{"a":1, "b":2}', 'map<string, int>')
```
and in Scala (similar in other languages)
```scala
val in = Seq("""{"a": {"b": 1}}""").toDS()
val schema = "map<string, map<string, int>>"
val out = in.select(from_json($"value", schema, Map.empty[String, String]))
```
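The PySpark equivalent, passing the schema as a DDL-formatted string (assuming an active SparkSession bound to `spark`):
```python
from pyspark.sql.functions import from_json

df = spark.createDataFrame([('{"a": 1, "b": 2}',)], ["value"])
# The schema is given as a DDL string rather than a DataType object.
df.select(from_json("value", "map<string, int>").alias("parsed")).show()
```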
## How was this patch tested?
Added a couple of SQL tests and modified existing tests for Python and
Scala. The existing tests were modified because it is not important for
them in which format the schema for `from_json` is provided.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes #21550 from MaxGekk/from_json-ddl-schema.
(commit: b8f27ae3b34134a01998b77db4b7935e7f82a4fe)
The file was modified sql/core/src/test/resources/sql-tests/inputs/json-functions.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/json-functions.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
Commit 18cb0c07988578156c869682d8a2c4151e8d35e5 by vanzin
[SPARK-24319][SPARK SUBMIT] Fix spark-submit execution where no main
class is required.
## What changes were proposed in this pull request?
With [PR 20925](https://github.com/apache/spark/pull/20925) it is no
longer possible to execute the following commands:
* run-example
* run-example --help
* run-example --version
* run-example --usage-error
* run-example --status ...
* run-example --kill ...
This PR allows the mentioned commands to be executed again.
## How was this patch tested?
Existing unit tests extended + additional written.
Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Closes #21450 from gaborgsomogyi/SPARK-24319.
(commit: 18cb0c07988578156c869682d8a2c4151e8d35e5)
The file was modified launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java (diff)
The file was modified launcher/src/test/java/org/apache/spark/launcher/SparkSubmitCommandBuilderSuite.java (diff)
The file was modified launcher/src/main/java/org/apache/spark/launcher/Main.java (diff)
Commit 270a9a3cac25f3e799460320d0fc94ccd7ecfaea by mcheah
[SPARK-24248][K8S] Use level triggering and state reconciliation in
scheduling and lifecycle
## What changes were proposed in this pull request?
Previously, the scheduler backend was maintaining state in many places,
not only for reading state but also writing to it. For example, state
had to be managed in both the watch and in the executor allocator
runnable. Furthermore, one had to keep track of multiple hash tables.
We can do better here by:
1. Consolidating the places where we manage state. Here, we take
inspiration from traditional Kubernetes controllers. These controllers
tend to follow a level-triggered mechanism. This means that the
controller will continuously monitor the API server via watches and
polling, and on periodic passes, the controller will reconcile the
current state of the cluster with the desired state. We implement this
by introducing the concept of a pod snapshot, which is a given state of
the executors in the Kubernetes cluster. We operate periodically on
snapshots. To prevent overloading the API server with polling requests
to get the state of the cluster (particularly for executor allocation
where we want to be checking frequently to get executors to launch
without unbearably bad latency), we use watches to populate snapshots by
applying observed events to a previous snapshot to get a new snapshot.
Whenever we do poll the cluster, the polled state replaces any existing
snapshot - this ensures eventual consistency and mirroring of the
cluster, as is desired in a level triggered architecture.
2. Storing less specialized in-memory state in general. Previously we
were creating hash tables to represent the state of executors. Instead,
it's easier to represent state solely by the snapshots.
## How was this patch tested?
Integration tests should verify there are no regressions end to end.
Unit tests are to be updated, focusing in particular on different
orderings of events and accounting for events arriving in unexpected
order.
Author: mcheah <mcheah@palantir.com>
Closes #21366 from mccheah/event-queue-driven-scheduling.
(commit: 270a9a3cac25f3e799460320d0fc94ccd7ecfaea)
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotSuite.scala
The file was modified core/src/main/scala/org/apache/spark/util/ThreadUtils.scala (diff)
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala (diff)
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManagerSuite.scala
The file was added licenses/LICENSE-jmock.txt
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/DeterministicExecutorPodsSnapshotsStore.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorLifecycleTestUtils.scala
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackendSuite.scala (diff)
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStoreSuite.scala
The file was modified pom.xml (diff)
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshot.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodStates.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/Fabric8Aliases.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStore.scala
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala (diff)
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSourceSuite.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSourceSuite.scala
The file was modified resource-managers/kubernetes/core/pom.xml (diff)
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStoreImpl.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala (diff)
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala
The file was modified LICENSE (diff)
Commit 22daeba59b3ffaccafc9ff4b521abc265d0e58dd by wenchen
[SPARK-24478][SQL] Move projection and filter push down to physical
conversion
## What changes were proposed in this pull request?
This removes the v2 optimizer rule for push-down and instead pushes
filters and required columns when converting to a physical plan, as
suggested by marmbrus. This makes the v2 relation cleaner because the
output and filters do not change in the logical plan.
A side-effect of this change is that the stats from the logical
(optimized) plan no longer reflect pushed filters and projection. This
is a temporary state, until the planner gathers stats from the physical
plan instead. An alternative to this approach is
https://github.com/rdblue/spark/commit/9d3a11e68bca6c5a56a2be47fb09395350362ac5.
The first commit was proposed in #21262. This PR replaces #21262.
## How was this patch tested?
Existing tests.
Author: Ryan Blue <blue@apache.org>
Closes #21503 from
rdblue/SPARK-24478-move-push-down-to-physical-conversion.
(commit: 22daeba59b3ffaccafc9ff4b521abc265d0e58dd)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportStatistics.java (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala (diff)
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala
Commit 6567fc43aca75b41900cde976594e21c8b0ca98a by hyukjinkwon
[PYTHON] Fix typo in serializer exception
## What changes were proposed in this pull request?
Fix typo in exception raised in Python serializer
## How was this patch tested?
No code changes
Author: Ruben Berenguel Montoro <ruben@mostlymaths.net>
Closes #21566 from rberenguel/fix_typo_pyspark_serializers.
(commit: 6567fc43aca75b41900cde976594e21c8b0ca98a)
The file was modified python/pyspark/serializers.py (diff)
Commit 495d8cf09ae7134aa6d2feb058612980e02955fa by vanzin
[SPARK-24490][WEBUI] Use WebUI.addStaticHandler in web UIs
`WebUI` defines `addStaticHandler`, which the web UIs don't use (they
simply introduce duplication). Let's clean them up and remove the
duplication.
Local build and waiting for Jenkins
Author: Jacek Laskowski <jacek@japila.pl>
Closes #21510 from
jaceklaskowski/SPARK-24490-Use-WebUI.addStaticHandler.
(commit: 495d8cf09ae7134aa6d2feb058612980e02955fa)
The file was modified core/src/main/scala/org/apache/spark/deploy/worker/ui/WorkerWebUI.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala (diff)
The file was modified resource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/ui/MesosClusterUI.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/SparkUI.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/WebUI.scala (diff)
The file was modified streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingTab.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala (diff)
Commit b5ccf0d3957a444db93893c0ce4417bfbbb11822 by tathagata.das1565
[SPARK-24396][SS][PYSPARK] Add Structured Streaming ForeachWriter for
python
## What changes were proposed in this pull request?
This PR adds `foreach` for streaming queries in Python. Users will be
able to specify their processing logic in two different ways.
- As a function that takes a row as input.
- As an object that has `open`, `process`, and `close` methods.
See the python docs in this PR for more details; a condensed sketch of
both styles follows.
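A condensed sketch of the two styles on a streaming DataFrame `df` (see the Python docs added in this PR for the authoritative examples):
```python
# Style 1: a plain function that receives each row.
def process_row(row):
    print(row)

query1 = df.writeStream.foreach(process_row).start()

# Style 2: an object with open/process/close methods, allowing per-partition
# setup and teardown (e.g. opening a connection).
class RowWriter:
    def open(self, partition_id, epoch_id):
        return True          # return False to skip processing this partition
    def process(self, row):
        print(row)
    def close(self, error):
        pass

query2 = df.writeStream.foreach(RowWriter()).start()
```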
## How was this patch tested? Added java and python unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #21477 from tdas/SPARK-24396.
(commit: b5ccf0d3957a444db93893c0ce4417bfbbb11822)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonForeachWriterSuite.scala
The file was modified python/pyspark/sql/tests.py (diff)
The file was modified python/pyspark/sql/streaming.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ForeachWriterProvider.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonForeachWriter.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was modified python/pyspark/tests.py (diff)
Commit 90da7dc241f8eec2348c0434312c97c116330bc4 by wenchen
[SPARK-24452][SQL][CORE] Avoid possible overflow in int add or multiple
## What changes were proposed in this pull request?
This PR fixes possible overflow in int addition and multiplication. In
particular, the multiplication overflows are detected by
[Spotbugs](https://spotbugs.github.io/).
The following assignments may overflow on the right-hand side, so the
result may be negative.
```
long = int * int
long = int + int
```
To avoid this problem, this PR casts from int to long on the right-hand
side.
## How was this patch tested?
Existing UTs.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #21481 from kiszk/SPARK-24452.
(commit: 90da7dc241f8eec2348c0434312c97c116330bc4)
The file was modified core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManager.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamMicroBatchReader.scala (diff)
The file was modified streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala (diff)
Commit e4fee395ecd93ad4579d9afbf0861f82a303e563 by brkyvz
[SPARK-24525][SS] Provide an option to limit number of rows in a
MemorySink
## What changes were proposed in this pull request?
Provide an option to limit the number of rows in a MemorySink.
Currently, MemorySink and MemorySinkV2 have unbounded size, meaning that
if they're used on big data, they can cause the stream to run out of
memory. This change adds a maxMemorySinkRows option to limit how many
rows MemorySink and MemorySinkV2 can hold. By default, they are still
unbounded.
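A sketch of how such a limit could be passed to the memory sink; the exact option key is an assumption taken from the name in this description and is not verified against the merged code:
```scala
// streamingDF is assumed to be an existing streaming DataFrame.
val query = streamingDF.writeStream
  .format("memory")
  .queryName("results")
  .option("maxMemorySinkRows", "1000") // assumed option key, from the description above
  .start()
```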
## How was this patch tested?
Added new unit tests.
Author: Mukul Murthy <mukul.murthy@databricks.com>
Closes #21559 from mukulmurthy/SPARK-24525.
(commit: e4fee395ecd93ad4579d9afbf0861f82a303e563)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MemorySinkV2Suite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/memoryV2.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MemorySinkSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala (diff)
Commit c7c0b086a0b18424725433ade840d5121ac2b86e by rxin
add one supported type missing from the javadoc
## What changes were proposed in this pull request?
The supported java.math.BigInteger type is not mentioned in the Javadoc
of Encoders.bean().
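A small Scala sketch of the documented case, using a hypothetical bean class with a `java.math.BigInteger` property:
```scala
import java.math.BigInteger
import scala.beans.BeanProperty
import org.apache.spark.sql.Encoders

// Hypothetical Java-style bean with a BigInteger property.
class Account extends Serializable {
  @BeanProperty var id: BigInteger = _
  @BeanProperty var name: String = _
}

val encoder = Encoders.bean(classOf[Account])
// The BigInteger property appears in the encoder's schema (as a decimal type).
encoder.schema.printTreeString()
```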
## How was this patch tested?
Only a Javadoc fix.
Author: James Yu <james@ispot.tv>
Closes #21544 from yuj/master.
(commit: c7c0b086a0b18424725433ade840d5121ac2b86e)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
Commit 5a47f64c33ddf5ec66ac94430b5ee58d40e2991e by ishizaki
introduce ArraySetUtils to reuse code among
array_union/array_intersect/array_except
(commit: 5a47f64c33ddf5ec66ac94430b5ee58d40e2991e)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
Commit 0d1d855d3406d8da5adabec92871b2dd9fdb4aa0 by ishizaki
use GenericArrayData if UnsafeArrayData cannot be used
use ctx.addReferenceObj for DataType
(commit: 0d1d855d3406d8da5adabec92871b2dd9fdb4aa0)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
Commit a8e9032de4a7e2f7297ad9a14f107e447094353e by ishizaki
use BinaryArrayExpressionWithImplicitCast
rename ArraySetUtils to ArraySetLike
update a condition to use GenericArrayData
(commit: a8e9032de4a7e2f7297ad9a14f107e447094353e)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
Commit b0a935255951280b49c39968f6234163e2f0e379 by hyukjinkwon
[SPARK-24573][INFRA] Runs SBT checkstyle after the build to work around
a side-effect
## What changes were proposed in this pull request?
Checkstyle seems to affect the build in the Jenkins PR builder. I can't
reproduce this locally; it seems reproducible only in the PR builder.
I checked the steps the build goes through, and my speculation is that
checkstyle's compilation in SBT has a side effect on the assembly
build.
This PR proposes to run the SBT checkstyle after the build.
## How was this patch tested?
Jenkins tests.
Author: hyukjinkwon <gurwls223@apache.org>
Closes #21579 from HyukjinKwon/investigate-javastyle.
(commit: b0a935255951280b49c39968f6234163e2f0e379)
The file was modified dev/run-tests.py (diff)
Commit e219e692ef70c161f37a48bfdec2a94b29260004 by hyukjinkwon
[SPARK-23772][SQL] Provide an option to ignore column of all null values
or empty array during JSON schema inference
## What changes were proposed in this pull request? This PR adds a new
JSON option `dropFieldIfAllNull` to ignore columns of all-null values or
empty arrays/structs during JSON schema inference.
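A minimal usage sketch (the input path is hypothetical and `spark` is an existing SparkSession):
```scala
// With the option enabled, fields that are null or an empty array/struct in
// every record are dropped from the inferred schema.
val df = spark.read
  .option("dropFieldIfAllNull", "true")
  .json("/path/to/events.json") // hypothetical path
df.printSchema()
```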
## How was this patch tested? Added tests in `JsonSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Author: Xiangrui Meng <meng@databricks.com>
Closes #20929 from maropu/SPARK-23772.
(commit: e219e692ef70c161f37a48bfdec2a94b29260004)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala (diff)
The file was modified python/pyspark/sql/readwriter.py (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala (diff)
Commit bce177552564a4862bc979d39790cf553a477d74 by hyukjinkwon
[SPARK-24526][BUILD][TEST-MAVEN] Spaces in the build dir cause failures
in the build/mvn script
## What changes were proposed in this pull request?
Wrap the call to ${MVN_BIN} in quotes so it handles spaces in the path.
## How was this patch tested?
Ran the following to confirm that using the build/mvn tool with a space
in the build dir now works without error:
```
mkdir /tmp/test\ spaces
cd /tmp/test\ spaces
git clone https://github.com/apache/spark.git
cd spark
# Remove all mvn references in PATH so the script will download mvn to the local dir
./build/mvn -DskipTests clean package
```
Author: trystanleftwich <trystan@atscale.com>
Closes #21534 from trystanleftwich/SPARK-24526.
(commit: bce177552564a4862bc979d39790cf553a477d74)
The file was modified build/mvn (diff)
Commit 8f225e055c2031ca85d61721ab712170ab4e50c1 by wenchen
[SPARK-24548][SQL] Fix incorrect schema of Dataset with tuple encoders
## What changes were proposed in this pull request?
When creating tuple expression encoders, we should give the serializer
expressions of the tuple items correct names, so that the output schema
is correct when such tuple encoders are used.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #21576 from viirya/SPARK-24548.
(commit: 8f225e055c2031ca85d61721ab712170ab4e50c1)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala (diff)
The file was modified sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java (diff)
Commit 1737d45e08a5f1fb78515b14321721d7197b443a by wenchen
[SPARK-24478][SQL][FOLLOWUP] Move projection and filter push down to
physical conversion
## What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/21503, to
completely move operator pushdown to the planner rule.
The code is mostly from https://github.com/apache/spark/pull/21319.
## How was this patch tested?
Existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #21574 from cloud-fan/followup.
(commit: 1737d45e08a5f1fb78515b14321721d7197b443a)
The file was modified sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportStatistics.java (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
Commit 9a75c18290fff7d116cf88a44f9120bf67d8bd27 by wenchen
[SPARK-24542][SQL] UDF series UDFXPathXXXX allow users to pass carefully
crafted XML to access arbitrary files
## What changes were proposed in this pull request?
The UDFXPathXXXX series of UDFs allows users to pass carefully crafted
XML to access arbitrary files. Spark does not have built-in access
control, and when an external access control library is used, users
might bypass it and access the file contents.
This PR basically ports the Hive fix to Apache Spark.
https://issues.apache.org/jira/browse/HIVE-18879
## How was this patch tested?
A unit test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes #21549 from gatorsmile/xpathSecurity.
(commit: 9a75c18290fff7d116cf88a44f9120bf67d8bd27)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/xml/XPathExpressionSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/xml/UDFXPathUtilSuite.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/xml/UDFXPathUtil.java (diff)
Commit a78a9046413255756653f70165520efd486fb493 by gatorsmile
[SPARK-24521][SQL][TEST] Fix ineffective test in CachedTableSuite
## What changes were proposed in this pull request?
test("withColumn doesn't invalidate cached dataframe") in
CachedTableSuite doesn't not work because:
The UDF is executed and test count incremented when "df.cache()" is
called and the subsequent "df.collect()" has no effect on the test
result.
This PR fixed this test and add another test for caching UDF.
## How was this patch tested?
Add new tests.
Author: Li Jin <ice.xelloss@gmail.com>
Closes #21531 from icexelloss/fix-cache-test.
(commit: a78a9046413255756653f70165520efd486fb493)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala (diff)
Commit 9dbe53eb6bb5916d28000f2c0d646cf23094ac11 by wenchen
[SPARK-24556][SQL] Always rewrite output partitioning in
ReusedExchangeExec and InMemoryTableScanExec
## What changes were proposed in this pull request?
Currently, ReusedExchange and InMemoryTableScanExec only rewrite output
partitioning if the child's partitioning is HashPartitioning, and do
nothing for other partitionings, e.g., RangePartitioning. We should
always rewrite it; otherwise, an unnecessary shuffle can be introduced,
as described in https://issues.apache.org/jira/browse/SPARK-24556.
## How was this patch tested?
Add new tests.
Author: yucai <yyu1@ebay.com>
Closes #21564 from yucai/SPARK-24556.
(commit: 9dbe53eb6bb5916d28000f2c0d646cf23094ac11)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/Exchange.scala (diff)
Commit 13092d733791b19cd7994084178306e0c449f2ed by eerlands
[SPARK-24534][K8S] Bypass non spark-on-k8s commands
## What changes were proposed in this pull request? This PR changes
entrypoint.sh to provide an option to run commands other than the
spark-on-k8s ones (init, driver, executor), so that users can keep their
normal workflow without hacking the image to bypass the entrypoint.
## How was this patch tested? This patch was built manually on my local
machine and I ran some tests with a combination of `docker run`
commands.
Author: rimolive <ricardo.martinelli.oliveira@gmail.com>
Closes #21572 from rimolive/rimolive-spark-24534.
(commit: 13092d733791b19cd7994084178306e0c449f2ed)
The file was modified resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh (diff)
Commit 2cb976355c615eee4ebd0a86f3911fa9284fccf6 by zsxwing
[SPARK-24565][SS] Add API for in Structured Streaming for exposing
output rows of each microbatch as a DataFrame
## What changes were proposed in this pull request?
Currently, the micro-batches in the MicroBatchExecution are not exposed
to the user through any public API. This was because we did not want to
expose the micro-batches, so that every API we expose could eventually
be supported in the Continuous engine. But now that we have a better
sense of building a ContinuousExecution, I am considering adding APIs
that will run only in the MicroBatchExecution. I have quite a few use
cases where exposing the micro-batch output as a dataframe is useful.
- Pass the output rows of each batch to a library that is designed only
for batch jobs (for example, many ML libraries need to collect() while
learning).
- Reuse batch data sources for output whose streaming version does not
exist (e.g. the redshift data source).
- Write the output rows to multiple places by writing twice for each
batch. This is not the most elegant thing to do for multiple-output
streaming queries, but it is likely better than running two streaming
queries that process the same data twice.
The proposal is to add a method `foreachBatch(f: Dataset[T] => Unit)` to
Scala/Java/Python `DataStreamWriter`.
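A minimal Scala sketch of the intended usage, assuming the callback also receives the batch id as in the eventual public API (`streamingDF` and the output path are hypothetical):
```scala
import org.apache.spark.sql.DataFrame

val query = streamingDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Reuse an existing batch sink for each micro-batch's output.
    batchDF.write
      .mode("append")
      .parquet(s"/tmp/stream-output/batch-$batchId") // hypothetical path
  }
  .start()
```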
## How was this patch tested? New unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #21571 from tdas/foreachBatch.
(commit: 2cb976355c615eee4ebd0a86f3911fa9284fccf6)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ForeachBatchSink.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was modified python/pyspark/java_gateway.py (diff)
The file was modified python/pyspark/sql/streaming.py (diff)
The file was modified python/pyspark/sql/tests.py (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/ForeachBatchSinkSuite.scala
The file was modified python/pyspark/sql/utils.py (diff)
The file was modified python/pyspark/streaming/context.py (diff)
Commit bc0498d5820ded2b428277e396502e74ef0ce36d by gatorsmile
[SPARK-24583][SQL] Wrong schema type in InsertIntoDataSourceCommand
## What changes were proposed in this pull request?
Change the insert input schema type from "insertRelationType" to
"insertRelationType.asNullable", in order to avoid the nullability being
overridden.
## How was this patch tested?
Added one test in InsertSuite.
Author: Maryann Xue <maryannxue@apache.org>
Closes #21585 from maryannxue/spark-24583.
(commit: bc0498d5820ded2b428277e396502e74ef0ce36d)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoDataSourceCommand.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)