SuccessChanges

Summary

  1. [SPARK-24416] Fix configuration specification for killBlacklisted (commit: 3af1d3e6d95719e15a997877d5ecd3bb40c08b9c) (details)
  2. [SPARK-23931][SQL] Adds arrays_zip function to sparksql (commit: f0ef1b311dd5399290ad6abe4ca491bdb13478f0) (details)
  3. [SPARK-24216][SQL] Spark TypedAggregateExpression uses getSimpleName (commit: cc88d7fad16e8b5cbf7b6b9bfe412908782b4a45) (details)
  4. [SPARK-23933][SQL] Add map_from_arrays function (commit: ada28f25955a9e8ddd182ad41b2a4ef278f3d809) (details)
  5. [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failure of (commit: 0d3714d221460a2a1141134c3d451f18c4e0d46f) (details)
  6. [SPARK-24506][UI] Add UI filters to tabs added after binding (commit: f53818d35bdef5d20a2718b14a2fed4c468545c6) (details)
  7. [SPARK-22239][SQL][PYTHON] Enable grouped aggregate pandas UDFs as (commit: 9786ce66c52d41b1d58ddedb3a984f561fd09ff3) (details)
  8. [SPARK-24466][SS] Fix TextSocketMicroBatchReader to be compatible with (commit: 3352d6fe9a1efb6dee18e40bdf584930b10d1d3e) (details)
  9. [SPARK-24485][SS] Measure and log elapsed time for filesystem operations (commit: 4c388bccf1bcac8f833fd9214096dd164c3ea065) (details)
  10. [SPARK-24479][SS] Added config for registering streamingQueryListeners (commit: 7703b46d2843db99e28110c4c7ccf60934412504) (details)
  11. [SPARK-24500][SQL] Make sure streams are materialized during Tree (commit: 299d297e250ca3d46616a97e4256aa9ad6a135e5) (details)
  12. [SPARK-24235][SS] Implement continuous shuffle writer for single reader (commit: 1b46f41c55f5cd29956e17d7da95a95580cf273f) (details)
  13. [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1 (commit: 3bf76918fb67fb3ee9aed254d4fb3b87a7e66117) (details)
  14. [MINOR][CORE][TEST] Remove unnecessary sort in UnsafeInMemorySorterSuite (commit: 534065efeb51ff0d308fa6cc9dea0715f8ce25ad) (details)
  15. [SPARK-24495][SQL] EnsureRequirement returns wrong plan when reordering (commit: fdadc4be08dcf1a06383bbb05e53540da2092c63) (details)
  16. [SPARK-24563][PYTHON] Catch TypeError when testing existence of HiveConf (commit: d3eed8fd6d65d95306abfb513a9e0fde05b703ac) (details)
  17. [SPARK-24543][SQL] Support any type as DDL string for from_json's schema (commit: b8f27ae3b34134a01998b77db4b7935e7f82a4fe) (details)
  18. [SPARK-24319][SPARK SUBMIT] Fix spark-submit execution where no main (commit: 18cb0c07988578156c869682d8a2c4151e8d35e5) (details)
  19. [SPARK-24248][K8S] Use level triggering and state reconciliation in (commit: 270a9a3cac25f3e799460320d0fc94ccd7ecfaea) (details)
  20. [SPARK-24478][SQL] Move projection and filter push down to physical (commit: 22daeba59b3ffaccafc9ff4b521abc265d0e58dd) (details)
  21. [PYTHON] Fix typo in serializer exception (commit: 6567fc43aca75b41900cde976594e21c8b0ca98a) (details)
  22. [SPARK-24490][WEBUI] Use WebUI.addStaticHandler in web UIs (commit: 495d8cf09ae7134aa6d2feb058612980e02955fa) (details)
  23. [SPARK-24396][SS][PYSPARK] Add Structured Streaming ForeachWriter for (commit: b5ccf0d3957a444db93893c0ce4417bfbbb11822) (details)
  24. [SPARK-24452][SQL][CORE] Avoid possible overflow in int add or multiple (commit: 90da7dc241f8eec2348c0434312c97c116330bc4) (details)
  25. [SPARK-24525][SS] Provide an option to limit number of rows in a (commit: e4fee395ecd93ad4579d9afbf0861f82a303e563) (details)
Commit 3af1d3e6d95719e15a997877d5ecd3bb40c08b9c by irashid
[SPARK-24416] Fix configuration specification for killBlacklisted
executors
## What changes were proposed in this pull request?
spark.blacklist.killBlacklistedExecutors is defined as
(Experimental) If set to "true", allow Spark to automatically kill, and
attempt to re-create, executors when they are blacklisted. Note that,
when an entire node is added to the blacklist, all of the executors on
that node will be killed.
I presume the killing of blacklisted executors only happens after the
stage completes successfully and all tasks have completed, or on fetch
failures (updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet).
This is confusing because the definition states that the executor will be
recreated as soon as it is blacklisted, which is not true: while the stage
is in progress and an executor is blacklisted, Spark will not attempt any
cleanup until the stage finishes.
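As an aside, a minimal sketch of turning this behavior on (assumes an application built via SparkSession; both config keys are the ones discussed above and documented in docs/configuration.md):
```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable blacklisting and let Spark kill (and re-request) blacklisted
// executors once the conditions described above are met.
val spark = SparkSession.builder()
  .appName("blacklist-kill-example")
  .config("spark.blacklist.enabled", "true")
  .config("spark.blacklist.killBlacklistedExecutors", "true")
  .getOrCreate()
```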
Author: Sanket Chintapalli <schintap@yahoo-inc.com>
Closes #21475 from redsanket/SPARK-24416.
(commit: 3af1d3e6d95719e15a997877d5ecd3bb40c08b9c)
The file was modified docs/configuration.md (diff)
Commit f0ef1b311dd5399290ad6abe4ca491bdb13478f0 by ueshin
[SPARK-23931][SQL] Adds arrays_zip function to sparksql
Signed-off-by: DylanGuedes <djmgguedes@gmail.com>
## What changes were proposed in this pull request?
Addition of the arrays_zip function to Spark SQL functions.
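A minimal usage sketch in Scala (assumes a SparkSession named `spark` is in scope):
```scala
import org.apache.spark.sql.functions.arrays_zip
import spark.implicits._  // assumes a SparkSession named `spark`

val df = Seq((Seq(1, 2, 3), Seq("a", "b", "c"))).toDF("nums", "letters")

// arrays_zip merges the i-th elements of its input arrays into structs.
df.select(arrays_zip($"nums", $"letters").as("zipped")).show(false)
// zipped: [[1, a], [2, b], [3, c]]
```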
## How was this patch tested?
Unit tests that check if the results are correct.
Author: DylanGuedes <djmgguedes@gmail.com>
Closes #21045 from DylanGuedes/SPARK-23931.
(commit: f0ef1b311dd5399290ad6abe4ca491bdb13478f0)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
Commit cc88d7fad16e8b5cbf7b6b9bfe412908782b4a45 by wenchen
[SPARK-24216][SQL] Spark TypedAggregateExpression uses getSimpleName
that is not safe in scala
## What changes were proposed in this pull request?
When a user creates an aggregator object in Scala and passes it to Spark
Dataset's agg() method, Spark initializes TypedAggregateExpression with
the nodeName field set to aggregator.getClass.getSimpleName. However,
getSimpleName is not safe in a Scala environment, depending on how the
user creates the aggregator object.
For example, if the aggregator class full qualified name is
"com.my.company.MyUtils$myAgg$2$", the getSimpleName will throw
java.lang.InternalError "Malformed class name". This has been reported
in scalatest https://github.com/scalatest/scalatest/pull/1044 and
discussed in many scala upstream jiras such as SI-8110, SI-5425.
To fix this issue, we follow the solution in
https://github.com/scalatest/scalatest/pull/1044 to add a safer version of
getSimpleName as a util method, and TypedAggregateExpression invokes this
util method rather than getClass.getSimpleName.
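A rough sketch of that idea (not the exact util method added by this commit): fall back to parsing the full class name when `getSimpleName` throws.
```scala
// Sketch only: handle the java.lang.InternalError ("Malformed class name")
// that getSimpleName can throw for some Scala-generated class names.
def safeSimpleName(cls: Class[_]): String =
  try cls.getSimpleName catch {
    case _: InternalError =>
      // e.g. "com.my.company.MyUtils$myAgg$2$" -> "MyUtils$myAgg$2"
      cls.getName.split('.').last.stripSuffix("$")
  }
```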
## How was this patch tested?
Added unit test.
Author: Fangshi Li <fli@linkedin.com>
Closes #21276 from fangshil/SPARK-24216.
(commit: cc88d7fad16e8b5cbf7b6b9bfe412908782b4a45)
The file was modified core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/Utils.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/util/UtilsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2StringFormat.scala (diff)
Commit ada28f25955a9e8ddd182ad41b2a4ef278f3d809 by ueshin
[SPARK-23933][SQL] Add map_from_arrays function
## What changes were proposed in this pull request?
The PR adds the SQL function `map_from_arrays`. The behavior of the
function is based on Presto's `map`. Since Spark SQL already has a `map`
function, we chose a different name for this behavior.
This function returns a map built from a pair of arrays for keys and
values.
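A short usage sketch in Scala (assumes a SparkSession named `spark` is in scope):
```scala
import org.apache.spark.sql.functions.map_from_arrays
import spark.implicits._  // assumes a SparkSession named `spark`

val df = Seq((Seq("a", "b"), Seq(1, 2))).toDF("keys", "values")

// Builds a map column by pairing keys(i) with values(i).
df.select(map_from_arrays($"keys", $"values").as("m")).show(false)
// m: [a -> 1, b -> 2]
```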
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #21258 from kiszk/SPARK-23933.
(commit: ada28f25955a9e8ddd182ad41b2a4ef278f3d809)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
Commit 0d3714d221460a2a1141134c3d451f18c4e0d46f by vanzin
[SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failure of
kubernetes-integration-tests
## What changes were proposed in this pull request?
Fix java checkstyle failure of kubernetes-integration-tests
## How was this patch tested?
Checked manually on my local environment.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes #21545 from jiangxb1987/k8s-checkstyle.
(commit: 0d3714d221460a2a1141134c3d451f18c4e0d46f)
The file was modified project/SparkBuild.scala (diff)
Commit f53818d35bdef5d20a2718b14a2fed4c468545c6 by vanzin
[SPARK-24506][UI] Add UI filters to tabs added after binding
## What changes were proposed in this pull request?
Currently, `spark.ui.filters` are not applied to the handlers added
after binding the server. This means that every page which is added
after starting the UI will not have the filters configured on it. This
can allow unauthorized access to the pages.
The PR adds the filters also to the handlers added after the UI starts.
## How was this patch tested?
manual tests (without the patch, starting the thriftserver with `--conf
spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
--conf
spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=simple"`
you can access `http://localhost:4040/sqlserver`; with the patch, 401 is
the response as for the other pages).
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21523 from mgaido91/SPARK-24506.
(commit: f53818d35bdef5d20a2718b14a2fed4c468545c6)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/JettyUtils.scala (diff)
Commit 9786ce66c52d41b1d58ddedb3a984f561fd09ff3 by hyukjinkwon
[SPARK-22239][SQL][PYTHON] Enable grouped aggregate pandas UDFs as
window functions with unbounded window frames
## What changes were proposed in this pull request?
This PR enables using grouped aggregate pandas UDFs as window functions.
The semantics are the same as using SQL aggregation functions as window
functions.
```
      >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
      >>> from pyspark.sql import Window
      >>> df = spark.createDataFrame(
      ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
      ...     ("id", "v"))
      >>> @pandas_udf("double", PandasUDFType.GROUPED_AGG)
      ... def mean_udf(v):
      ...     return v.mean()
      >>> w = Window.partitionBy('id')
      >>> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
      +---+----+------+
      | id|   v|mean_v|
      +---+----+------+
      |  1| 1.0|   1.5|
      |  1| 2.0|   1.5|
      |  2| 3.0|   6.0|
      |  2| 5.0|   6.0|
      |  2|10.0|   6.0|
      +---+----+------+
```
The scope of this PR is somewhat limited in terms of:
(1) Only supports unbounded window, which acts essentially as group by.
(2) Only supports aggregation functions, not "transform" like window
functions (n -> n mapping)
Both of these are left as future work. In particular, (1) needs careful
thinking w.r.t. how to pass rolling window data to Python efficiently. (2)
is a bit easier but requires more changes, so I think it's better to leave
it as a separate PR.
## How was this patch tested?
WindowPandasUDFTests
Author: Li Jin <ice.xelloss@gmail.com>
Closes #21082 from icexelloss/SPARK-22239-window-udf.
(commit: 9786ce66c52d41b1d58ddedb3a984f561fd09ff3)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala (diff)
The file was modified python/pyspark/worker.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala (diff)
The file was modified python/pyspark/sql/tests.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala (diff)
The file was modified python/pyspark/rdd.py (diff)
Commit 3352d6fe9a1efb6dee18e40bdf584930b10d1d3e by hyukjinkwon
[SPARK-24466][SS] Fix TextSocketMicroBatchReader to be compatible with
netcat again
## What changes were proposed in this pull request?
TextSocketMicroBatchReader was no longer compatible with netcat because it
launched a temporary reader to read the schema, closed it, and then
re-opened the reader. While a reliable socket server should be able to
handle this without any issue, the nc command normally can't handle
multiple connections and simply exits when the temporary reader is closed.
This patch makes TextSocketMicroBatchReader compatible with netcat again
by deferring the opening of the socket to the first call of
planInputPartitions() instead of the constructor.
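For reference, the usual netcat workflow that this fix restores (start `nc -lk 9999` in a terminal first; assumes a SparkSession named `spark`):
```scala
// Read lines typed into the nc session as a streaming DataFrame.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Echo the stream to the console. With this patch the socket is only opened
// on the first planInputPartitions() call, so a single nc connection suffices.
val query = lines.writeStream
  .format("console")
  .start()

query.awaitTermination()
```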
## How was this patch tested?
Added unit test which fails on current and succeeds with the patch. And
also manually tested.
Author: Jungtaek Lim <kabhwan@gmail.com>
Closes #21497 from HeartSaVioR/SPARK-24466.
(commit: 3352d6fe9a1efb6dee18e40bdf584930b10d1d3e)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/socket.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/TextSocketStreamSuite.scala (diff)
Commit 4c388bccf1bcac8f833fd9214096dd164c3ea065 by hyukjinkwon
[SPARK-24485][SS] Measure and log elapsed time for filesystem operations
in HDFSBackedStateStoreProvider
## What changes were proposed in this pull request?
This patch measures and logs the elapsed time for each operation that
communicates with the file system (mostly remote HDFS in production) in
HDFSBackedStateStoreProvider, to help investigate any latency issues.
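The pattern is essentially "wrap each filesystem call and log how long it took"; a generic sketch of that idea (not the exact helper this commit adds to Utils.scala):
```scala
// Sketch: run a block, log the elapsed time, and return the block's result.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  try body finally {
    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"$label took $elapsedMs ms")  // the real code logs rather than prints
  }
}

// e.g. wrapping a (hypothetical) state store file operation:
val committed = timed("commit delta file") { /* fileManager work here */ true }
```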
## How was this patch tested?
Manually tested.
Author: Jungtaek Lim <kabhwan@gmail.com>
Closes #21506 from HeartSaVioR/SPARK-24485.
(commit: 4c388bccf1bcac8f833fd9214096dd164c3ea065)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/Utils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala (diff)
Commit 7703b46d2843db99e28110c4c7ccf60934412504 by hyukjinkwon
[SPARK-24479][SS] Added config for registering streamingQueryListeners
## What changes were proposed in this pull request?
Currently a "StreamingQueryListener" can only be registered
programmatically. This PR adds a new config
"spark.sql.streamingQueryListeners", similar to
"spark.sql.queryExecutionListeners" and "spark.extraListeners", for users
to register custom streaming listeners.
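A sketch of how a listener could be registered through the new config; the listener class and package below are hypothetical placeholders (the class must be on the classpath and have a no-arg constructor):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Hypothetical listener implementation.
class MyQueryListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

// Registered declaratively instead of calling spark.streams.addListener(...).
val spark = SparkSession.builder()
  .appName("listener-conf-example")
  .config("spark.sql.streamingQueryListeners", "com.example.MyQueryListener")
  .getOrCreate()
```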
## How was this patch tested?
New unit test and running example programs.
Author: Arun Mahadevan <arunm@apache.org>
Closes #21504 from arunmahadevan/SPARK-24480.
(commit: 7703b46d2843db99e28110c4c7ccf60934412504)
The file was added sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala (diff)
Commit 299d297e250ca3d46616a97e4256aa9ad6a135e5 by wenchen
[SPARK-24500][SQL] Make sure streams are materialized during Tree
transforms.
## What changes were proposed in this pull request?
If you construct catalyst trees using `scala.collection.immutable.Stream`
you can run into situations where valid transformations do not seem to
have any effect. There are two causes for this behavior:
- `Stream` is evaluated lazily. Note that the default implementation will
generally only evaluate a function for the first element (this makes
testing a bit tricky).
- `TreeNode` and `QueryPlan` use side effects to detect if a tree has
changed. Mapping over a stream is lazy and does not need to trigger this
side effect. If this happens the node will invalidly assume that it did
not change and return itself instead of the newly created node (this is
done for GC reasons).
This PR fixes this issue by forcing materialization on streams in
`TreeNode` and `QueryPlan`.
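A tiny illustration of the laziness described above (plain Scala, no Spark required):
```scala
var calls = 0
val s = Stream(1, 2, 3).map { x => calls += 1; x + 1 }

println(calls)    // 1 -- only the head of the mapped Stream has been evaluated
println(s.toList) // List(2, 3, 4) -- materializing forces the remaining elements
println(calls)    // 3
```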
## How was this patch tested?
Unit tests were added to `TreeNodeSuite` and `LogicalPlanSuite`. An
integration test was added to `PlannerSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes #21539 from hvanhovell/SPARK-24500.
(commit: 299d297e250ca3d46616a97e4256aa9ad6a135e5)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/LogicalPlanSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/trees/TreeNodeSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
Commit 1b46f41c55f5cd29956e17d7da95a95580cf273f by zsxwing
[SPARK-24235][SS] Implement continuous shuffle writer for single reader
partition.
## What changes were proposed in this pull request?
https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit
Implement continuous shuffle write RDD for a single reader partition. (I
don't believe any implementation changes are actually required for
multiple reader partitions, but this PR is already very large, so I want
to exclude those for now to keep the size down.)
## How was this patch tested?
new unit tests
Author: Jose Torres <torres.joseph.f+github@gmail.com>
Closes #21428 from jose-torres/writerTask.
(commit: 1b46f41c55f5cd29956e17d7da95a95580cf273f)
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/UnsafeRowReceiver.scala
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleWriter.scala
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/RPCContinuousShuffleWriter.scala
The file was removed sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/shuffle/ContinuousShuffleReadSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleReadRDD.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/RPCContinuousShuffleReader.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/shuffle/ContinuousShuffleSuite.scala
Commit 3bf76918fb67fb3ee9aed254d4fb3b87a7e66117 by gatorsmile
[SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
## What changes were proposed in this pull request?
The PR updates the 2.3 version tested to the new release 2.3.1.
## How was this patch tested?
existing UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21543 from mgaido91/patch-1.
(commit: 3bf76918fb67fb3ee9aed254d4fb3b87a7e66117)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala (diff)
Commit 534065efeb51ff0d308fa6cc9dea0715f8ce25ad by hyukjinkwon
[MINOR][CORE][TEST] Remove unnecessary sort in UnsafeInMemorySorterSuite
## What changes were proposed in this pull request?
We don't require a specific ordering of the input data, so the sort action
is unnecessary and misleading.
## How was this patch tested?
Existing test suite.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes #21536 from jiangxb1987/sorterSuite.
(commit: 534065efeb51ff0d308fa6cc9dea0715f8ce25ad)
The file was modified core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java (diff)
Commit fdadc4be08dcf1a06383bbb05e53540da2092c63 by gatorsmile
[SPARK-24495][SQL] EnsureRequirement returns wrong plan when reordering
equal keys
## What changes were proposed in this pull request?
`EnsureRequirement` in its `reorder` method currently assumes that the
same key appears only once in the join condition. This of course might
not be the case, and when the assumption does not hold, it returns a wrong
plan which produces a wrong query result.
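A hypothetical query shape that exercises this path, where one key participates in the join condition more than once (a sketch; assumes a SparkSession named `spark`):
```scala
import spark.implicits._  // assumes a SparkSession named `spark`

val t1 = Seq((1, 1), (2, 3)).toDF("a", "b")
val t2 = Seq((1, 1), (2, 2)).toDF("x", "y")

// "a" appears twice among the left-side join keys; before this fix, reordering
// the keys to match the children's partitioning could produce a wrong plan.
val joined = t1.join(t2,
  t1("a") === t2("x") && t1("a") === t2("y") && t1("b") === t2("y"))
joined.show()
```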
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21529 from mgaido91/SPARK-24495.
(commit: fdadc4be08dcf1a06383bbb05e53540da2092c63)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala (diff)
Commit d3eed8fd6d65d95306abfb513a9e0fde05b703ac by vanzin
[SPARK-24563][PYTHON] Catch TypeError when testing existence of HiveConf
when creating pyspark shell
## What changes were proposed in this pull request?
This PR catches TypeError when testing existence of HiveConf when
creating pyspark shell
## How was this patch tested?
Manually tested. Here are the manual test cases:
Build with hive:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark Python 3.6.5
| packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin Type
"help", "copyright", "credits" or "license" for more information.
18/06/14 14:55:41 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN". To adjust logging level use
sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome
to
     ____              __
    / __/__  ___ _____/ /__
   _\ \/ _ \/ _ `/ __/  '_/
  /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
     /_/
Using Python version 3.6.5 (default, Apr  6 2018 13:44:09) SparkSession
available as 'spark'.
>>> spark.conf.get('spark.sql.catalogImplementation')
'hive'
```
Build without hive:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark Python 3.6.5
| packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin Type
"help", "copyright", "credits" or "license" for more information.
18/06/14 15:04:52 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN". To adjust logging level use
sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome
to
     ____              __
    / __/__  ___ _____/ /__
   _\ \/ _ \/ _ `/ __/  '_/
  /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
     /_/
Using Python version 3.6.5 (default, Apr  6 2018 13:44:09) SparkSession
available as 'spark'.
>>> spark.conf.get('spark.sql.catalogImplementation')
'in-memory'
```
Failed to start shell:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark Python 3.6.5
| packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin Type
"help", "copyright", "credits" or "license" for more information.
18/06/14 15:07:53 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN". To adjust logging level use
sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/icexelloss/workspace/spark/python/pyspark/shell.py:45:
UserWarning: Failed to initialize Spark session.
warnings.warn("Failed to initialize Spark session.") Traceback (most
recent call last):
File "/Users/icexelloss/workspace/spark/python/pyspark/shell.py", line
41, in <module>
   spark = SparkSession._create_shell_session()
File "/Users/icexelloss/workspace/spark/python/pyspark/sql/session.py",
line 581, in _create_shell_session
   return SparkSession.builder.getOrCreate()
File "/Users/icexelloss/workspace/spark/python/pyspark/sql/session.py",
line 168, in getOrCreate
   raise py4j.protocol.Py4JError("Fake Py4JError")
py4j.protocol.Py4JError: Fake Py4JError
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$
```
Author: Li Jin <ice.xelloss@gmail.com>
Closes #21569 from
icexelloss/SPARK-24563-fix-pyspark-shell-without-hive.
(commit: d3eed8fd6d65d95306abfb513a9e0fde05b703ac)
The file was modified python/pyspark/sql/session.py (diff)
Commit b8f27ae3b34134a01998b77db4b7935e7f82a4fe by wenchen
[SPARK-24543][SQL] Support any type as DDL string for from_json's schema
## What changes were proposed in this pull request?
In the PR, I propose to support any DataType represented as DDL string
for the from_json function. After the changes, it will be possible to
specify `MapType` in SQL like:
```sql
select from_json('{"a":1, "b":2}', 'map<string, int>')
```
and in Scala (similar in other languages)
```scala
val in = Seq("""{"a": {"b": 1}}""").toDS()
val schema = "map<string, map<string, int>>"
val out = in.select(from_json($"value", schema, Map.empty[String, String]))
```
## How was this patch tested?
Added a couple of SQL tests and modified existing tests for Python and
Scala. The existing tests were modified because it is not important for
them in which format the schema for `from_json` is provided.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes #21550 from MaxGekk/from_json-ddl-schema.
(commit: b8f27ae3b34134a01998b77db4b7935e7f82a4fe)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/json-functions.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/json-functions.sql (diff)
Commit 18cb0c07988578156c869682d8a2c4151e8d35e5 by vanzin
[SPARK-24319][SPARK SUBMIT] Fix spark-submit execution where no main
class is required.
## What changes were proposed in this pull request?
After [PR 20925](https://github.com/apache/spark/pull/20925), it is no
longer possible to execute the following commands:
* run-example
* run-example --help
* run-example --version
* run-example --usage-error
* run-example --status ...
* run-example --kill ...
This PR allows execution of the mentioned commands again.
## How was this patch tested?
Existing unit tests extended + additional written.
Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Closes #21450 from gaborgsomogyi/SPARK-24319.
(commit: 18cb0c07988578156c869682d8a2c4151e8d35e5)
The file was modified launcher/src/main/java/org/apache/spark/launcher/Main.java (diff)
The file was modified launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java (diff)
The file was modified launcher/src/test/java/org/apache/spark/launcher/SparkSubmitCommandBuilderSuite.java (diff)
Commit 270a9a3cac25f3e799460320d0fc94ccd7ecfaea by mcheah
[SPARK-24248][K8S] Use level triggering and state reconciliation in
scheduling and lifecycle
## What changes were proposed in this pull request?
Previously, the scheduler backend was maintaining state in many places,
not only for reading state but also writing to it. For example, state had
to be managed in both the watch and the executor allocator runnable.
Furthermore, one had to keep track of multiple hash tables.
We can do better here by:
1. Consolidating the places where we manage state. Here, we take
inspiration from traditional Kubernetes controllers. These controllers
tend to follow a level-triggered mechanism. This means that the
controller will continuously monitor the API server via watches and
polling, and on periodic passes, the controller will reconcile the
current state of the cluster with the desired state. We implement this
by introducing the concept of a pod snapshot, which is a given state of
the executors in the Kubernetes cluster. We operate periodically on
snapshots. To prevent overloading the API server with polling requests
to get the state of the cluster (particularly for executor allocation
where we want to be checking frequently to get executors to launch
without unbearably bad latency), we use watches to populate snapshots by
applying observed events to a previous snapshot to get a new snapshot.
Whenever we do poll the cluster, the polled state replaces any existing
snapshot - this ensures eventual consistency and mirroring of the
cluster, as is desired in a level triggered architecture.
2. Storing less specialized in-memory state in general. Previously we
were creating hash tables to represent the state of executors. Instead,
it's easier to represent state solely by the snapshots.
## How was this patch tested?
Integration tests should verify there are no regressions end to end. Unit
tests were updated, focusing in particular on different orderings of
events, especially accounting for when events arrive in an unexpected
order.
Author: mcheah <mcheah@palantir.com>
Closes #21366 from mccheah/event-queue-driven-scheduling.
(commit: 270a9a3cac25f3e799460320d0fc94ccd7ecfaea)
The file was modified pom.xml (diff)
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/DeterministicExecutorPodsSnapshotsStore.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/Fabric8Aliases.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManagerSuite.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorLifecycleTestUtils.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStoreImpl.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSourceSuite.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodStates.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStore.scala
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackendSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala (diff)
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotSuite.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala (diff)
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala
The file was modified LICENSE (diff)
The file was modified core/src/main/scala/org/apache/spark/util/ThreadUtils.scala (diff)
The file was modified resource-managers/kubernetes/core/pom.xml (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala (diff)
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStoreSuite.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSourceSuite.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala (diff)
The file was added licenses/LICENSE-jmock.txt
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshot.scala
Commit 22daeba59b3ffaccafc9ff4b521abc265d0e58dd by wenchen
[SPARK-24478][SQL] Move projection and filter push down to physical
conversion
## What changes were proposed in this pull request?
This removes the v2 optimizer rule for push-down and instead pushes
filters and required columns when converting to a physical plan, as
suggested by marmbrus. This makes the v2 relation cleaner because the
output and filters do not change in the logical plan.
A side-effect of this change is that the stats from the logical
(optimized) plan no longer reflect pushed filters and projection. This
is a temporary state, until the planner gathers stats from the physical
plan instead. An alternative to this approach is
https://github.com/rdblue/spark/commit/9d3a11e68bca6c5a56a2be47fb09395350362ac5.
The first commit was proposed in #21262. This PR replaces #21262.
## How was this patch tested?
Existing tests.
Author: Ryan Blue <blue@apache.org>
Closes #21503 from
rdblue/SPARK-24478-move-push-down-to-physical-conversion.
(commit: 22daeba59b3ffaccafc9ff4b521abc265d0e58dd)
The file was modified sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportStatistics.java (diff)
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala (diff)
Commit 6567fc43aca75b41900cde976594e21c8b0ca98a by hyukjinkwon
[PYTHON] Fix typo in serializer exception
## What changes were proposed in this pull request?
Fix typo in exception raised in Python serializer
## How was this patch tested?
No code changes
Author: Ruben Berenguel Montoro <ruben@mostlymaths.net>
Closes #21566 from rberenguel/fix_typo_pyspark_serializers.
(commit: 6567fc43aca75b41900cde976594e21c8b0ca98a)
The file was modified python/pyspark/serializers.py (diff)
Commit 495d8cf09ae7134aa6d2feb058612980e02955fa by vanzin
[SPARK-24490][WEBUI] Use WebUI.addStaticHandler in web UIs
`WebUI` defines `addStaticHandler`, but the web UIs don't use it and
simply duplicate it. Let's clean them up and remove the duplication.
Local build and waiting for Jenkins.
Author: Jacek Laskowski <jacek@japila.pl>
Closes #21510 from
jaceklaskowski/SPARK-24490-Use-WebUI.addStaticHandler.
(commit: 495d8cf09ae7134aa6d2feb058612980e02955fa)
The file was modified core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala (diff)
The file was modified streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingTab.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/worker/ui/WorkerWebUI.scala (diff)
The file was modified resource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/ui/MesosClusterUI.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/WebUI.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/SparkUI.scala (diff)
Commit b5ccf0d3957a444db93893c0ce4417bfbbb11822 by tathagata.das1565
[SPARK-24396][SS][PYSPARK] Add Structured Streaming ForeachWriter for
python
## What changes were proposed in this pull request?
This PR adds `foreach` for streaming queries in Python. Users will be
able to specify their processing logic in two different ways.
- As a function that takes a row as input.
- As an object that has `open`, `process`, and `close` methods.
See the python docs in this PR for more details.
## How was this patch tested?
Added Java and Python unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #21477 from tdas/SPARK-24396.
(commit: b5ccf0d3957a444db93893c0ce4417bfbbb11822)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was modified python/pyspark/sql/tests.py (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonForeachWriterSuite.scala
The file was modified python/pyspark/tests.py (diff)
The file was modified python/pyspark/sql/streaming.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ForeachWriterProvider.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonForeachWriter.scala
Commit 90da7dc241f8eec2348c0434312c97c116330bc4 by wenchen
[SPARK-24452][SQL][CORE] Avoid possible overflow in int add or multiple
## What changes were proposed in this pull request?
This PR fixes possible overflow in int add or multiply. In particular,
the overflows in multiplication are detected by
[Spotbugs](https://spotbugs.github.io/).
The following assignments may cause overflow on the right-hand side; as a
result, the value assigned may be negative.
```
long = int * int
long = int + int
```
To avoid this problem, this PR casts from int to long on the right-hand
side.
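A concrete instance of the pattern above, in plain Scala:
```scala
val numRecords: Int = 100000
val recordSize: Int = 65536

// The multiplication happens in Int and overflows before being widened to Long.
val wrong: Long = numRecords * recordSize          // -2036334592
// Widening one operand first keeps the arithmetic in Long, as this PR does.
val right: Long = numRecords.toLong * recordSize   // 6553600000

println(s"wrong = $wrong, right = $right")
```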
## How was this patch tested?
Existing UTs.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #21481 from kiszk/SPARK-24452.
(commit: 90da7dc241f8eec2348c0434312c97c116330bc4)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala (diff)
The file was modified streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManager.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamMicroBatchReader.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java (diff)
The file was modified core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala (diff)
Commit e4fee395ecd93ad4579d9afbf0861f82a303e563 by brkyvz
[SPARK-24525][SS] Provide an option to limit number of rows in a
MemorySink
## What changes were proposed in this pull request?
Provide an option to limit the number of rows in a MemorySink. Currently,
MemorySink and MemorySinkV2 have unbounded size, meaning that if they're
used on big data, they can OOM the stream. This change adds a
maxMemorySinkRows option to limit how many rows MemorySink and
MemorySinkV2 can hold. By default, they are still unbounded.
## How was this patch tested?
Added new unit tests.
Author: Mukul Murthy <mukul.murthy@databricks.com>
Closes #21559 from mukulmurthy/SPARK-24525.
(commit: e4fee395ecd93ad4579d9afbf0861f82a303e563)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/memoryV2.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MemorySinkV2Suite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MemorySinkSuite.scala (diff)