SuccessChanges

Summary

  1. [SPARK-24290][ML] add support for Array input for (commit: b24d3dba6571fd3c9e2649aceeaadc3f9c6cc90f) (details)
  2. [SPARK-24300][ML] change the way to set seed in (commit: ff0501b0c27dc8149bd5fb38a19d9b0056698766) (details)
  3. [SPARK-24215][PYSPARK] Implement _repr_html_ for dataframes in PySpark (commit: dbb4d83829ec4b51d6e6d3a96f7a4e611d8827bc) (details)
  4. [SPARK-16451][REPL] Fail shell if SparkSession fails to start. (commit: b3417b731d4e323398a0d7ec6e86405f4464f4f9) (details)
  5. [SPARK-15784] Add Power Iteration Clustering to spark.ml (commit: e8c1a0c2fdb09a628d9cc925676af870d5a7a946) (details)
  6. [SPARK-24453][SS] Fix error recovering from the failure in a no-data (commit: 2c2a86b5d5be6f77ee72d16f990b39ae59f479b9) (details)
  7. [SPARK-22384][SQL] Refine partition pruning when attribute is wrapped in (commit: 93df3cd03503fca7745141fbd2676b8bf70fe92f) (details)
  8. [SPARK-24187][R][SQL] Add array_join function to SparkR (commit: e9efb62e0795c8d5233b7e5bfc276d74953942b8) (details)
  9. [SPARK-23803][SQL] Support bucket pruning (commit: e76b0124fbe463def00b1dffcfd8fd47e04772fe) (details)
  10. Process cluster snapshots instead of deltas. (commit: 8615c067328c0c64d0d048922b221477580acdb4) (details)
  11. Remove hanging comment (commit: 3b85ab52b523dc182227b1ece517765903e64109) (details)
  12. Remove incorrect comment (commit: edc982bb68d8da697a46def612fa270aa702001f) (details)
  13. Fix log message (commit: e077c7e5d96016d39d3133431525c60054ea2374) (details)
  14. Whitespace (commit: a97fc5d5b87a8caf14ae923f89a7bc106c48d411) (details)
Commit b24d3dba6571fd3c9e2649aceeaadc3f9c6cc90f by meng
[SPARK-24290][ML] add support for Array input for
instrumentation.logNamedValue
## What changes were proposed in this pull request?
Extend instrumentation.logNamedValue to support Array input, and change
the logging for "clusterSizes" to use the new method.
## How was this patch tested?
N/A
Please review http://spark.apache.org/contributing.html before opening a
pull request.
Author: Lu WANG <lu.wang@databricks.com>
Closes #21347 from ludatabricks/SPARK-24290.
(commit: b24d3dba6571fd3c9e2649aceeaadc3f9c6cc90f)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala (diff)
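The logNamedValue extension described above can be sketched as follows. This is a conceptual Python sketch, not Spark's Scala Instrumentation API; the function name and the JSON formatting are assumptions for illustration.

```python
import json

def log_named_value(name, value):
    # Illustrative helper (not Spark's API): format a named value for an
    # instrumentation log line, accepting arrays as well as scalars.
    if isinstance(value, (list, tuple)):
        payload = json.dumps(list(value))  # arrays serialized as JSON lists
    else:
        payload = json.dumps(value)
    return f"{name}={payload}"

# e.g. logging cluster sizes as a single named array value
line = log_named_value("clusterSizes", [3, 5, 2])
```

Accepting arrays directly avoids ad-hoc string concatenation at each call site, which is the gist of the change.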
Commit ff0501b0c27dc8149bd5fb38a19d9b0056698766 by meng
[SPARK-24300][ML] change the way to set seed in
ml.cluster.LDASuite.generateLDAData
## What changes were proposed in this pull request?
Use a different RNG in each partition.
## How was this patch tested?
manually
Please review http://spark.apache.org/contributing.html before opening a
pull request.
Author: Lu WANG <lu.wang@databricks.com>
Closes #21492 from ludatabricks/SPARK-24300.
(commit: ff0501b0c27dc8149bd5fb38a19d9b0056698766)
The file was modified mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala (diff)
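The per-partition seeding idea can be illustrated with a small Python sketch; the names and structure are assumptions for illustration, not the LDASuite code.

```python
import random

def generate_partition_data(base_seed, num_partitions, rows_per_partition):
    # Derive a distinct, deterministic RNG per partition from the base seed,
    # so partitions do not all produce the identical random stream.
    data = []
    for pid in range(num_partitions):
        rng = random.Random(base_seed + pid)  # one RNG per partition
        data.append([rng.random() for _ in range(rows_per_partition)])
    return data

parts = generate_partition_data(base_seed=42, num_partitions=3, rows_per_partition=4)
```

Seeding from `base_seed + pid` keeps test data reproducible while ensuring partitions differ.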
Commit dbb4d83829ec4b51d6e6d3a96f7a4e611d8827bc by hyukjinkwon
[SPARK-24215][PYSPARK] Implement _repr_html_ for dataframes in PySpark
## What changes were proposed in this pull request?
Implement `_repr_html_` for PySpark while in notebook and add config
named "spark.sql.repl.eagerEval.enabled" to control this.
The dev list thread for context:
http://apache-spark-developers-list.1001551.n3.nabble.com/eager-execution-and-debuggability-td23928.html
## How was this patch tested?
New unit test in DataFrameSuite and manual tests in Jupyter. Screenshots
below.
**After:**
![image](https://user-images.githubusercontent.com/4833765/40268422-8db5bef0-5b9f-11e8-80f1-04bc654a4f2c.png)
**Before:**
![image](https://user-images.githubusercontent.com/4833765/40268431-9f92c1b8-5b9f-11e8-9db9-0611f0940b26.png)
Author: Yuanjian Li <xyliyuanjian@gmail.com>
Closes #21370 from xuanyuanking/SPARK-24215.
(commit: dbb4d83829ec4b51d6e6d3a96f7a4e611d8827bc)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was modified python/pyspark/sql/dataframe.py (diff)
The file was modified docs/configuration.md (diff)
The file was modified python/pyspark/sql/tests.py (diff)
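The mechanism behind this change is Jupyter's `_repr_html_` hook: when eager evaluation is enabled, the DataFrame renders itself as an HTML table inline in the notebook. A minimal toy class (not PySpark's DataFrame) showing the hook:

```python
class EagerFrame:
    # Toy stand-in for a DataFrame: Jupyter calls _repr_html_ (if present)
    # to render an object as HTML instead of plain repr text.
    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows

    def _repr_html_(self):
        head = "".join(f"<th>{c}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{v}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"

html = EagerFrame(["id", "name"], [(1, "a"), (2, "b")])._repr_html_()
```

In the actual PR this rendering is gated behind the `spark.sql.repl.eagerEval.enabled` config so that plain REPL use keeps lazy evaluation.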
Commit b3417b731d4e323398a0d7ec6e86405f4464f4f9 by hyukjinkwon
[SPARK-16451][REPL] Fail shell if SparkSession fails to start.
Currently, in spark-shell, if the session fails to start, the user sees
a bunch of unrelated errors which are caused by code in the shell
initialization that references the "spark" variable, which does not
exist in that case. Things like:
```
<console>:14: error: not found: value spark
      import spark.sql
```
The user is also left with a non-working shell (unless they want to just
write non-Spark Scala or Python code, that is).
This change fails the whole shell session at the point where the failure
occurs, so that the last error message is the one with the actual
information about the failure.
For the python error handling, I moved the session initialization code
to session.py, so that traceback.print_exc() only shows the last error.
Otherwise, the printed exception would contain all previous exceptions
with a message "During handling of the above exception, another
exception occurred", making the actual error kinda hard to parse.
Tested with spark-shell, pyspark (with 2.7 and 3.5), by forcing an error
during SparkContext initialization.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #21368 from vanzin/SPARK-16451.
(commit: b3417b731d4e323398a0d7ec6e86405f4464f4f9)
The file was modified python/pyspark/sql/session.py (diff)
The file was modified repl/src/main/scala/org/apache/spark/repl/Main.scala (diff)
The file was modified python/pyspark/shell.py (diff)
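The fail-fast behavior can be sketched in Python. Here `create_session` is a stand-in for SparkSession construction, and the function name is an assumption for illustration:

```python
import sys

def start_shell(create_session):
    # If session creation fails, abort the shell with the real error
    # instead of letting later shell code trip over a missing `spark`
    # variable and bury the cause under unrelated errors.
    try:
        return create_session()
    except Exception as e:
        print(f"Failed to initialize Spark session: {e}", file=sys.stderr)
        raise SystemExit(1)
```

The point is that the last message the user sees is the initialization failure itself, not the cascade of `not found: value spark` errors it used to cause.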
Commit e8c1a0c2fdb09a628d9cc925676af870d5a7a946 by meng
[SPARK-15784] Add Power Iteration Clustering to spark.ml
## What changes were proposed in this pull request?
Following the discussion on JIRA, I rewrote the Power Iteration
Clustering API in `spark.ml`.
## How was this patch tested?
Unit test.
Please review http://spark.apache.org/contributing.html before opening a
pull request.
Author: WeichenXu <weichen.xu@databricks.com>
Closes #21493 from WeichenXu123/pic_api.
(commit: e8c1a0c2fdb09a628d9cc925676af870d5a7a946)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/clustering/PowerIterationClusteringSuite.scala (diff)
Commit 2c2a86b5d5be6f77ee72d16f990b39ae59f479b9 by tathagata.das1565
[SPARK-24453][SS] Fix error recovering from the failure in a no-data
batch
## What changes were proposed in this pull request?
The error occurs when we are recovering from a failure in a no-data
batch (say X) that has been planned (i.e. written to offset log) but not
executed (i.e. not written to commit log). Upon recovery the following
sequence of events happens:
1. `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to
X. Since there was no data in the batch, `availableOffsets` is the same
as `committedOffsets`, so `isNewDataAvailable` is `false`.
2. When `MicroBatchExecution.constructNextBatch` is called, ideally it
should immediately return true because the next batch has already been
constructed. However, the check for whether the batch has been
constructed was `if (isNewDataAvailable) return true`. Since the planned
batch is a no-data batch, it escaped this check and proceeded to plan
the same batch X *once again*.
The solution is to have an explicit flag that signifies whether a batch
has already been constructed or not. `populateStartOffsets` is going to
set the flag appropriately.
## How was this patch tested?
new unit test
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #21491 from tdas/SPARK-24453.
(commit: 2c2a86b5d5be6f77ee72d16f990b39ae59f479b9)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecutionSuite.scala
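The fix can be modeled with a toy loop that tracks batch construction explicitly; the class, field, and method names here are illustrative, not MicroBatchExecution's actual members:

```python
class MicroBatchLoop:
    # Toy model of the fix: an explicit flag records whether the current
    # batch has already been constructed (planned), so recovering a
    # planned-but-uncommitted no-data batch does not plan it twice.
    def __init__(self, planned_batches, committed_batches):
        self.current_batch_id = max(planned_batches, default=-1)
        # Constructed iff it is in the offset log but not the commit log.
        self.batch_constructed = (
            self.current_batch_id >= 0
            and self.current_batch_id not in committed_batches
        )
        self.plans = []  # record of batches planned by this run

    def construct_next_batch(self):
        if self.batch_constructed:
            return True  # already planned before the failure; don't replan
        self.current_batch_id += 1
        self.plans.append(self.current_batch_id)
        self.batch_constructed = True
        return True
```

Relying on "is new data available" alone cannot distinguish this case, because a no-data batch looks identical to no pending batch at all; the explicit flag removes that ambiguity.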
Commit 93df3cd03503fca7745141fbd2676b8bf70fe92f by wenchen
[SPARK-22384][SQL] Refine partition pruning when attribute is wrapped in
Cast
## What changes were proposed in this pull request?
The SQL below fetches all partitions from the metastore, which puts a
heavy burden on the metastore:
```
CREATE TABLE `partition_test`(`col` int) PARTITIONED BY (`pt` byte)
SELECT * FROM partition_test WHERE CAST(pt AS INT)=1
```
The reason is that the analyzed attribute `pt` is wrapped in `Cast` and
`HiveShim` fails to generate a proper partition filter. This PR proposes
to take `Cast` into consideration when generating the partition filter.
## How was this patch tested?
Test added. This PR also proposes to use analyzed expressions in
`HiveClientSuite`.
Author: jinxing <jinxing6042@126.com>
Closes #19602 from jinxing64/SPARK-22384.
(commit: 93df3cd03503fca7745141fbd2676b8bf70fe92f)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala (diff)
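The idea of looking through a Cast when generating a metastore partition filter can be sketched with a toy expression representation; the tuple encoding here is illustrative, not HiveShim's real expression types.

```python
def to_partition_filter(expr):
    # expr is a toy tree, e.g. ("eq", ("cast", ("attr", "pt"), "int"), 1).
    # Looking through the cast lets us still emit a pushdown filter
    # instead of falling back to fetching every partition.
    op, left, right = expr
    if left[0] == "cast":
        left = left[1]  # strip the cast to reach the underlying attribute
    if op == "eq" and left[0] == "attr":
        return f"{left[1]} = {right}"
    return None  # cannot push down; metastore returns all partitions
```

Note that stripping a cast is only safe when the cast cannot change which values match, which is why the real change is more careful than this sketch.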
Commit e9efb62e0795c8d5233b7e5bfc276d74953942b8 by hyukjinkwon
[SPARK-24187][R][SQL] Add array_join function to SparkR
## What changes were proposed in this pull request?
This PR adds array_join function to SparkR
## How was this patch tested?
Add unit test in test_sparkSQL.R
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #21313 from huaxingao/spark-24187.
(commit: e9efb62e0795c8d5233b7e5bfc276d74953942b8)
The file was modified R/pkg/R/functions.R (diff)
The file was modified R/pkg/R/generics.R (diff)
The file was modified R/pkg/NAMESPACE (diff)
The file was modified R/pkg/tests/fulltests/test_sparkSQL.R (diff)
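The semantics of array_join (concatenate array elements with a delimiter, dropping nulls unless a replacement string is supplied) can be sketched in Python; this mirrors the documented behavior of the SQL function rather than Spark's implementation.

```python
def array_join(arr, delimiter, null_replacement=None):
    # Nulls (None) are skipped by default; if null_replacement is given,
    # they are substituted instead of dropped.
    parts = []
    for x in arr:
        if x is None:
            if null_replacement is not None:
                parts.append(null_replacement)
        else:
            parts.append(str(x))
    return delimiter.join(parts)
```

The SparkR change exposes this existing SQL function through the R API.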
Commit e76b0124fbe463def00b1dffcfd8fd47e04772fe by wenchen
[SPARK-23803][SQL] Support bucket pruning
## What changes were proposed in this pull request?
Support bucket pruning when filtering on a single bucketed column with
the following predicates: EqualTo, EqualNullSafe, In, and And/Or.
## How was this patch tested?
Refactored unit tests to cover the above.
Based on gatorsmile's work in
https://github.com/apache/spark/commit/e3c75c6398b1241500343ff237e9bcf78b5396f9
Author: Asher Saban <asaban@palantir.com>
Author: asaban <asaban@palantir.com>
Closes #20915 from sabanas/filter-prune-buckets.
(commit: e76b0124fbe463def00b1dffcfd8fd47e04772fe)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BucketingUtils.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala (diff)
The file was removed resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsEventQueueImpl.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingEventSourceSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackendSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala (diff)
The file was removed resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/DeterministicExecutorPodsEventQueue.scala
The file was removed resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsEventQueue.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStore.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/DeterministicExecutorPodsSnapshotsStore.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchEventSource.scala (diff)
The file was removed resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodBatchSubscriber.scala
The file was removed resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleEventHandler.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStoreImpl.scala
The file was removed resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodBatchSubscriberSuite.scala
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotSuite.scala
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorLifecycleTestUtils.scala (diff)
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStoreSuite.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshot.scala
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala
The file was removed resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleEventHandlerSuite.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala (diff)
The file was added resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManagerSuite.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodStates.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchEventSourceSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingEventSource.scala (diff)
The file was removed resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsEventQueueSuite.scala
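The bucket pruning change above (SPARK-23803) rests on the fact that a bucketed table routes each row to bucket `hash(value) % numBuckets`, so an equality or In filter needs to scan only the matching buckets' files. A sketch using Python's built-in hash as a stand-in for Spark's Murmur3-based bucket hashing:

```python
def prune_buckets(value, num_buckets):
    # EqualTo case: a single value maps to exactly one bucket, so only
    # that bucket's files need scanning.
    return {hash(value) % num_buckets}

def prune_buckets_in(values, num_buckets):
    # In case: prune to the union of the buckets matched by each value.
    return set().union(*(prune_buckets(v, num_buckets) for v in values))
```

And/Or predicates combine these sets with intersection and union respectively, which is how the listed predicate types compose.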