Changes (Success)

Summary

  1. [SPARK-24235][SS] Implement continuous shuffle writer for single reader (commit: 1b46f41c55f5cd29956e17d7da95a95580cf273f) (details)
  2. [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1 (commit: 3bf76918fb67fb3ee9aed254d4fb3b87a7e66117) (details)
  3. [MINOR][CORE][TEST] Remove unnecessary sort in UnsafeInMemorySorterSuite (commit: 534065efeb51ff0d308fa6cc9dea0715f8ce25ad) (details)
  4. [SPARK-24495][SQL] EnsureRequirement returns wrong plan when reordering (commit: fdadc4be08dcf1a06383bbb05e53540da2092c63) (details)
  5. [SPARK-24563][PYTHON] Catch TypeError when testing existence of HiveConf (commit: d3eed8fd6d65d95306abfb513a9e0fde05b703ac) (details)
  6. [SPARK-24543][SQL] Support any type as DDL string for from_json's schema (commit: b8f27ae3b34134a01998b77db4b7935e7f82a4fe) (details)
  7. [SPARK-24319][SPARK SUBMIT] Fix spark-submit execution where no main (commit: 18cb0c07988578156c869682d8a2c4151e8d35e5) (details)
  8. [SPARK-24248][K8S] Use level triggering and state reconciliation in (commit: 270a9a3cac25f3e799460320d0fc94ccd7ecfaea) (details)
  9. [SPARK-24478][SQL] Move projection and filter push down to physical (commit: 22daeba59b3ffaccafc9ff4b521abc265d0e58dd) (details)
  10. [PYTHON] Fix typo in serializer exception (commit: 6567fc43aca75b41900cde976594e21c8b0ca98a) (details)
  11. Fix indentation for class parameters (commit: 7433244b0e9e0f3f269e8d8dc135862d78c1105d) (details)
Commit 1b46f41c55f5cd29956e17d7da95a95580cf273f by zsxwing
[SPARK-24235][SS] Implement continuous shuffle writer for single reader
partition.
## What changes were proposed in this pull request?
https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit
Implement continuous shuffle write RDD for a single reader partition. (I
don't believe any implementation changes are actually required for
multiple reader partitions, but this PR is already very large, so I want
to exclude those for now to keep the size down.)
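For orientation, here is a minimal sketch of the writer-side interface this change introduces; the trait name matches the added file below, but the exact signature is an assumption, not a quote of the committed code:
```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// Hedged sketch: a continuous shuffle writer pushes one epoch's rows to the
// reader side. With a single reader partition, no partitioning of the rows
// is needed before sending them.
trait ContinuousShuffleWriter {
  def write(epochRows: Iterator[UnsafeRow]): Unit
}
```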
## How was this patch tested?
new unit tests
Author: Jose Torres <torres.joseph.f+github@gmail.com>
Closes #21428 from jose-torres/writerTask.
(commit: 1b46f41c55f5cd29956e17d7da95a95580cf273f)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/RPCContinuousShuffleWriter.scala
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleWriter.scala
The file was removed sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/shuffle/ContinuousShuffleReadSuite.scala
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/RPCContinuousShuffleReader.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/shuffle/ContinuousShuffleSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleReadRDD.scala (diff)
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/UnsafeRowReceiver.scala
Commit 3bf76918fb67fb3ee9aed254d4fb3b87a7e66117 by gatorsmile
[SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
## What changes were proposed in this pull request?
The PR updates the 2.3 version tested to the new release 2.3.1.
## How was this patch tested?
existing UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21543 from mgaido91/patch-1.
(commit: 3bf76918fb67fb3ee9aed254d4fb3b87a7e66117)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala (diff)
Commit 534065efeb51ff0d308fa6cc9dea0715f8ce25ad by hyukjinkwon
[MINOR][CORE][TEST] Remove unnecessary sort in UnsafeInMemorySorterSuite
## What changes were proposed in this pull request?
We don't require a specific ordering of the input data, so the sort action is
unnecessary and misleading.
## How was this patch tested?
Existing test suite.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes #21536 from jiangxb1987/sorterSuite.
(commit: 534065efeb51ff0d308fa6cc9dea0715f8ce25ad)
The file was modified core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java (diff)
Commit fdadc4be08dcf1a06383bbb05e53540da2092c63 by gatorsmile
[SPARK-24495][SQL] EnsureRequirement returns wrong plan when reordering
equal keys
## What changes were proposed in this pull request?
`EnsureRequirements` in its `reorder` method currently assumes that the
same key appears only once in the join condition. That is of course not
always the case, and when the assumption is violated, it returns a wrong
plan that produces wrong query results.
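As an illustration, a hypothetical repro (not taken from the PR) in which the same key `a` appears in two equality predicates of one join condition:
```scala
// Assumes a spark-shell session, where `spark` and its implicits are in scope.
import spark.implicits._

val df1 = Seq((1, 1), (2, 4)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2)).toDF("x", "y")

// The key df1("a") appears twice in the join condition. Before the fix,
// reordering the keys to match the required child distribution could pair
// the wrong columns and silently return wrong rows.
df1.join(df2, df1("a") === df2("x") && df1("a") === df2("y")).show()
```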
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21529 from mgaido91/SPARK-24495.
(commit: fdadc4be08dcf1a06383bbb05e53540da2092c63)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala (diff)
Commit d3eed8fd6d65d95306abfb513a9e0fde05b703ac by vanzin
[SPARK-24563][PYTHON] Catch TypeError when testing existence of HiveConf
when creating pyspark shell
## What changes were proposed in this pull request?
This PR catches TypeError when testing for the existence of HiveConf while
creating the pyspark shell.
## How was this patch tested?
Manually tested. Here are the manual test cases:
Build with hive:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
18/06/14 14:55:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
      /_/

Using Python version 3.6.5 (default, Apr  6 2018 13:44:09)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.sql.catalogImplementation')
'hive'
```
Build without hive:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
18/06/14 15:04:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
      /_/

Using Python version 3.6.5 (default, Apr  6 2018 13:44:09)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.sql.catalogImplementation')
'in-memory'
```
Failed to start shell:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
18/06/14 15:07:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/icexelloss/workspace/spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/Users/icexelloss/workspace/spark/python/pyspark/shell.py", line 41, in <module>
    spark = SparkSession._create_shell_session()
  File "/Users/icexelloss/workspace/spark/python/pyspark/sql/session.py", line 581, in _create_shell_session
    return SparkSession.builder.getOrCreate()
  File "/Users/icexelloss/workspace/spark/python/pyspark/sql/session.py", line 168, in getOrCreate
    raise py4j.protocol.Py4JError("Fake Py4JError")
py4j.protocol.Py4JError: Fake Py4JError
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$
```
Author: Li Jin <ice.xelloss@gmail.com>
Closes #21569 from
icexelloss/SPARK-24563-fix-pyspark-shell-without-hive.
(commit: d3eed8fd6d65d95306abfb513a9e0fde05b703ac)
The file was modified python/pyspark/sql/session.py (diff)
Commit b8f27ae3b34134a01998b77db4b7935e7f82a4fe by wenchen
[SPARK-24543][SQL] Support any type as DDL string for from_json's schema
## What changes were proposed in this pull request?
In the PR, I propose to support any DataType represented as a DDL string
for the from_json function. After the changes, it will be possible to
specify `MapType` in SQL like:
```sql
select from_json('{"a":1, "b":2}', 'map<string, int>')
```
and in Scala (similarly in other languages):
```scala
val in = Seq("""{"a": {"b": 1}}""").toDS()
val schema = "map<string, map<string, int>>"
val out = in.select(from_json($"value", schema, Map.empty[String, String]))
```
## How was this patch tested?
Added a couple of SQL tests and modified existing tests for Python and
Scala. The latter were modified because the format in which the schema for
`from_json` is provided does not matter to them.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes #21550 from MaxGekk/from_json-ddl-schema.
(commit: b8f27ae3b34134a01998b77db4b7935e7f82a4fe)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/json-functions.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/json-functions.sql (diff)
Commit 18cb0c07988578156c869682d8a2c4151e8d35e5 by vanzin
[SPARK-24319][SPARK SUBMIT] Fix spark-submit execution where no main
class is required.
## What changes were proposed in this pull request?
After [PR 20925](https://github.com/apache/spark/pull/20925), it is no
longer possible to execute the following commands:
* run-example
* run-example --help
* run-example --version
* run-example --usage-error
* run-example --status ...
* run-example --kill ...
This PR restores execution of the commands listed above.
## How was this patch tested?
Existing unit tests were extended, and additional ones were written.
Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Closes #21450 from gaborgsomogyi/SPARK-24319.
(commit: 18cb0c07988578156c869682d8a2c4151e8d35e5)
The file was modified launcher/src/main/java/org/apache/spark/launcher/Main.java (diff)
The file was modified launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java (diff)
The file was modified launcher/src/test/java/org/apache/spark/launcher/SparkSubmitCommandBuilderSuite.java (diff)
Commit 270a9a3cac25f3e799460320d0fc94ccd7ecfaea by mcheah
[SPARK-24248][K8S] Use level triggering and state reconciliation in
scheduling and lifecycle
## What changes were proposed in this pull request?
Previously, the scheduler backend maintained state in many places, not
only reading it but also writing to it. For example, state had to be
managed in both the watch and the executor allocator runnable.
Furthermore, one had to keep track of multiple hash tables. We can do
better here by:
We can do better here by:
1. Consolidating the places where we manage state. Here, we take
inspiration from traditional Kubernetes controllers. These controllers
tend to follow a level-triggered mechanism. This means that the
controller will continuously monitor the API server via watches and
polling, and on periodic passes, the controller will reconcile the
current state of the cluster with the desired state. We implement this
by introducing the concept of a pod snapshot, which is a given state of
the executors in the Kubernetes cluster. We operate periodically on
snapshots. To prevent overloading the API server with polling requests
to get the state of the cluster (particularly for executor allocation
where we want to be checking frequently to get executors to launch
without unbearably bad latency), we use watches to populate snapshots by
applying observed events to a previous snapshot to get a new snapshot.
Whenever we do poll the cluster, the polled state replaces any existing
snapshot; this ensures eventual consistency and mirroring of the
cluster, as is desired in a level-triggered architecture.
2. Storing less specialized in-memory state in general. Previously we
were creating hash tables to represent the state of executors. Instead,
it's easier to represent state solely by the snapshots; a simplified
sketch of the snapshot store follows below.
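The following is that simplified sketch: hypothetical types illustrating the level-triggered flow, not the actual Spark classes added in this commit.
```scala
// Hedged sketch: watch events update the current snapshot incrementally,
// full polls replace it wholesale, and subscribers (e.g. the allocator and
// the lifecycle manager) reconcile against the latest snapshots.
case class ExecutorPodsSnapshot(executorPods: Map[Long, String])

trait ExecutorPodsSnapshotsStore {
  // Applied for each event observed by the Kubernetes watch.
  def updatePod(executorId: Long, state: String): Unit

  // Applied on each periodic poll; replacing the snapshot wholesale gives
  // eventual consistency with the cluster's true state.
  def replaceSnapshot(allPods: Map[Long, String]): Unit

  // Subscribers are invoked periodically with the snapshots seen since
  // their last pass and reconcile desired vs. observed state.
  def addSubscriber(onNewSnapshots: Seq[ExecutorPodsSnapshot] => Unit): Unit
}
```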
## How was this patch tested?
Integration tests should verify that there are no end-to-end regressions.
Unit tests are to be updated, focusing in particular on different
orderings of events, especially events arriving in an unexpected order.
Author: mcheah <mcheah@palantir.com>
Closes #21366 from mccheah/event-queue-driven-scheduling.
(commit: 270a9a3cac25f3e799460320d0fc94ccd7ecfaea)
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala (diff)
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStore.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/DeterministicExecutorPodsSnapshotsStore.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStoreSuite.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodStates.scala
The file was modified LICENSE (diff)
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala (diff)
The file was added licenses/LICENSE-jmock.txt
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManagerSuite.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSourceSuite.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala (diff)
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotSuite.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSourceSuite.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStoreImpl.scala
The file was modified core/src/main/scala/org/apache/spark/util/ThreadUtils.scala (diff)
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshot.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/Fabric8Aliases.scala
The file was modified pom.xml (diff)
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala
The file was modified resource-managers/kubernetes/core/pom.xml (diff)
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorLifecycleTestUtils.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackendSuite.scala (diff)
Commit 22daeba59b3ffaccafc9ff4b521abc265d0e58dd by wenchen
[SPARK-24478][SQL] Move projection and filter push down to physical
conversion
## What changes were proposed in this pull request?
This removes the v2 optimizer rule for push-down and instead pushes
filters and required columns when converting to a physical plan, as
suggested by marmbrus. This makes the v2 relation cleaner because the
output and filters do not change in the logical plan.
A side-effect of this change is that the stats from the logical
(optimized) plan no longer reflect pushed filters and projection. This
is a temporary state, until the planner gathers stats from the physical
plan instead. An alternative to this approach is
https://github.com/rdblue/spark/commit/9d3a11e68bca6c5a56a2be47fb09395350362ac5.
The first commit was proposed in #21262. This PR replaces #21262.
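To make the planning-time push-down concrete, here is a hedged sketch using the DataSourceV2 reader mix-ins of this era (`SupportsPushDownFilters`, `SupportsPushDownRequiredColumns`); the helper below is illustrative, not the actual strategy code:
```scala
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.{
  DataSourceReader, SupportsPushDownFilters, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// During conversion to a physical plan, offer the filters and the required
// schema to the reader, and keep whatever it could not push down.
def pushDown(
    reader: DataSourceReader,
    filters: Seq[Filter],
    requiredSchema: StructType): Seq[Filter] = {
  val postScanFilters = reader match {
    // pushFilters returns the filters Spark must still evaluate after the scan.
    case r: SupportsPushDownFilters => r.pushFilters(filters.toArray).toSeq
    case _ => filters
  }
  reader match {
    case r: SupportsPushDownRequiredColumns => r.pruneColumns(requiredSchema)
    case _ =>
  }
  postScanFilters
}
```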
## How was this patch tested?
Existing tests.
Author: Ryan Blue <blue@apache.org>
Closes #21503 from
rdblue/SPARK-24478-move-push-down-to-physical-conversion.
(commit: 22daeba59b3ffaccafc9ff4b521abc265d0e58dd)
The file was modified sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportStatistics.java (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala (diff)
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala (diff)
Commit 6567fc43aca75b41900cde976594e21c8b0ca98a by hyukjinkwon
[PYTHON] Fix typo in serializer exception
## What changes were proposed in this pull request?
Fix typo in exception raised in Python serializer
## How was this patch tested?
No code changes
Author: Ruben Berenguel Montoro <ruben@mostlymaths.net>
Closes #21566 from rberenguel/fix_typo_pyspark_serializers.
(commit: 6567fc43aca75b41900cde976594e21c8b0ca98a)
The file was modified python/pyspark/serializers.py (diff)
Commit 7433244b0e9e0f3f269e8d8dc135862d78c1105d
Fix indentation for class parameters
(commit: 7433244b0e9e0f3f269e8d8dc135862d78c1105d)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesVolumeSpec.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/MountVolumesFeatureStep.scala (diff)