Changes

Summary

  1. [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 (commit: 4ad9bfd)
  2. [SPARK-32241][SQL] Remove empty children of union (commit: 24be816)
  3. [SPARK-32258][SQL] Not duplicate normalization on children for float/double If/CaseWhen/Coalesce (commit: cc9371d)
  4. [SPARK-29292][STREAMING][SQL][BUILD] Get streaming, catalyst, sql compiling for Scala 2.13 (commit: d6a68e0)
  5. [SPARK-32307][SQL] ScalaUDF's canonicalized expression should exclude inputEncoders (commit: a47b69a)
  6. [SPARK-32309][PYSPARK] Import missing sys import (commit: 2a0faca)
  7. [SPARK-32305][BUILD] Make `mvn clean` remove `metastore_db` and `spark-warehouse` (commit: 5e0cb3e)
  8. [SPARK-32311][PYSPARK][TESTS] Remove duplicate import (commit: c602d79)
  9. [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster (commit: 90b0c26)
  10. [SPARK-32303][PYTHON][TESTS] Remove leftover from editable mode installation in PIP test (commit: 902e134)
  11. [SPARK-32301][PYTHON][TESTS] Add a test case for toPandas to work with empty partitioned Spark DataFrame (commit: 676d92e)
  12. [MINOR][R] Match collectAsArrowToR with non-streaming collectAsArrowToPython (commit: 03b5707)
  13. [SPARK-32316][TESTS][INFRA] Test PySpark with Python 3.8 in Github Actions (commit: 6bdd710)
  14. [SPARK-32276][SQL] Remove redundant sorts before repartition nodes (commit: af8e65f)
  15. [SPARK-31985][SS] Remove incomplete/undocumented stateful aggregation in continuous mode (commit: 542aefb)
  16. Revert "[SPARK-32276][SQL] Remove redundant sorts before repartition nodes" (commit: 2527fbc)
  17. [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node (commit: e449993)
  18. [SPARK-32318][SQL][TESTS] Add a test case to EliminateSortsSuite for ORDER BY in DISTRIBUTE BY (commit: 8950dcb)
  19. [SPARK-32036] Replace references to blacklist/whitelist language with (commit: cf22d94)
  20. [SPARK-32140][ML][PYSPARK] Add training summary to FMClassificationModel (commit: b05f309)
  21. [SPARK-29292][SQL][ML] Update rest of default modules (Hive, ML, etc) (commit: c28a6fa)
  22. [SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest (commit: db47c6e)
  23. [SPARK-32272][SQL] Add SQL standard command SET TIME ZONE (commit: bdeb626)
  24. [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc (commit: 6be8b93)
  25. [SPARK-30648][SQL] Support filters pushdown in JSON datasource (commit: c1f160e)
  26. [SPARK-32315][ML] Provide an explanation error message when calling (commit: d5c672a)
  27. [SPARK-32310][ML][PYSPARK] ML params default value parity in (commit: 383f5e9)
  28. [SPARK-32335][K8S][TESTS] Remove Python2 test from K8s IT (commit: fb51925)
Commit 4ad9bfd53b84a6d2497668c73af6899bae14c187 by gurwls223
[SPARK-32138] Drop Python 2.7, 3.4 and 3.5
### What changes were proposed in this pull request?
This PR aims to drop Python 2.7, 3.4 and 3.5.
Roughly speaking, it removes all the widely known Python 2 compatibility
workarounds, such as `sys.version` comparisons and `__future__` imports.
It also removes Python 2-dedicated code such as `ArrayConstructor` in
Spark.
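For illustration, the removed workarounds typically had this shape (a minimal sketch of the pattern, not an exact diff from this PR):
```python
from __future__ import print_function  # Python 2/3 compatibility import, now removable

import sys

if sys.version >= '3':
    basestring = str    # re-create Python 2 names so shared code keeps working
    xrange = range

# After this PR, blocks like the above disappear and plain Python 3 names are used.
```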
### Why are the changes needed?
1. Drop support for EOL Python versions.
2. Reduce maintenance overhead and remove a bit of legacy code and hacks
for Python 2.
3. PyPy2 has a critical bug that causes a flaky test (SPARK-28358),
given my testing and investigation.
4. Users can use Python type hints with Pandas UDFs without worrying
about the Python version (see the sketch after this list).
5. Users can leverage the latest cloudpickle
(https://github.com/apache/spark/pull/28950). With Python 3.8+ it can
also leverage the C pickle implementation.
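To make point 4 concrete, here is a minimal sketch of the type-hint style of Pandas UDF (illustrative names; this is the documented Spark 3.0 style, not code from this PR):
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# With Python 3 only, the type hints alone declare this as a
# Series-to-Series Pandas UDF; no Python-2 fallback path is needed.
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1
```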
### Does this PR introduce _any_ user-facing change?
Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark
version.
### How was this patch tested?
Manually tested and also tested in Jenkins.
Closes #28957 from HyukjinKwon/SPARK-32138.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 4ad9bfd)
The file was modified examples/src/main/python/ml/sql_transformer.py
The file was modified examples/src/main/python/ml/rformula_example.py
The file was modified examples/src/main/python/mllib/kernel_density_estimation_example.py
The file was modified python/pyspark/mllib/linalg/distributed.py
The file was modified examples/src/main/python/ml/fvalue_selector_example.py
The file was modified examples/src/main/python/mllib/naive_bayes_example.py
The file was modified python/pyspark/sql/pandas/functions.py
The file was modified examples/src/main/python/pi.py
The file was modified python/pyspark/sql/context.py
The file was modified examples/src/main/python/ml/logistic_regression_summary_example.py
The file was modified python/pyspark/ml/feature.py
The file was modified python/pyspark/sql/tests/test_arrow.py
The file was modified examples/src/main/python/mllib/summary_statistics_example.py
The file was modified python/pyspark/mllib/tests/test_linalg.py
The file was modified examples/src/main/python/ml/naive_bayes_example.py
The file was modified examples/src/main/python/ml/stopwords_remover_example.py
The file was modified python/pyspark/sql/dataframe.py
The file was modified examples/src/main/python/mllib/word2vec_example.py
The file was modified python/pyspark/mllib/linalg/__init__.py
The file was modified examples/src/main/python/ml/decision_tree_classification_example.py
The file was modified python/pyspark/ml/image.py
The file was modified examples/src/main/python/ml/lda_example.py
The file was modified examples/src/main/python/ml/feature_hasher_example.py
The file was modified examples/src/main/python/ml/bucketed_random_projection_lsh_example.py
The file was modified examples/src/main/python/ml/linearsvc.py
The file was modified python/pyspark/ml/linalg/__init__.py
The file was modified python/pyspark/ml/pipeline.py
The file was modified python/run-tests.py
The file was modified python/pyspark/sql/tests/test_context.py
The file was modified examples/src/main/python/ml/bucketizer_example.py
The file was modified python/pyspark/rdd.py
The file was modified python/pyspark/tests/test_taskcontext.py
The file was modified python/setup.py
The file was modified examples/src/main/python/mllib/correlations.py
The file was modified python/pyspark/sql/tests/test_pandas_grouped_map.py
The file was modified examples/src/main/python/mllib/decision_tree_classification_example.py
The file was modified examples/src/main/python/ml/aft_survival_regression.py
The file was modified examples/src/main/python/avro_inputformat.py
The file was modified examples/src/main/python/ml/word2vec_example.py
The file was modified examples/src/main/python/sql/streaming/structured_kafka_wordcount.py
The file was modified examples/src/main/python/ml/random_forest_classifier_example.py
The file was modified examples/src/main/python/mllib/bisecting_k_means_example.py
The file was modified examples/src/main/python/ml/tokenizer_example.py
The file was modified examples/src/main/python/mllib/standard_scaler_example.py
The file was modified python/pyspark/ml/wrapper.py
The file was modified examples/src/main/python/mllib/random_rdd_generation.py
The file was modified python/pyspark/mllib/fpm.py
The file was modified python/pyspark/sql/conf.py
The file was modified python/pyspark/accumulators.py
The file was modified examples/src/main/python/ml/interaction_example.py
The file was modified examples/src/main/python/ml/n_gram_example.py
The file was modified examples/src/main/python/ml/multilayer_perceptron_classification.py
The file was modified dev/create-release/releaseutils.py
The file was modified python/pyspark/find_spark_home.py
The file was modified examples/src/main/python/wordcount.py
The file was modified examples/src/main/python/ml/dct_example.py
The file was modified examples/src/main/python/ml/random_forest_regressor_example.py
The file was modified examples/src/main/python/mllib/linear_regression_with_sgd_example.py
The file was modified python/pyspark/ml/fpm.py
The file was modified examples/src/main/python/mllib/hypothesis_testing_example.py
The file was modified examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
The file was modified docs/index.md
The file was modified python/pyspark/sql/functions.py
The file was modified examples/src/main/python/mllib/gaussian_mixture_model.py
The file was modified python/pyspark/mllib/stat/_statistics.py
The file was modified examples/src/main/python/ml/fm_regressor_example.py
The file was modified examples/src/main/python/mllib/logistic_regression.py
The file was modified python/pyspark/context.py
The file was modified examples/src/main/python/ml/decision_tree_regression_example.py
The file was modified examples/src/main/python/streaming/recoverable_network_wordcount.py
The file was modified python/pyspark/sql/pandas/serializers.py
The file was modified python/pyspark/sql/tests/test_pandas_udf_typehints.py
The file was modified python/pyspark/sql/__init__.py
The file was modified examples/src/main/python/als.py
The file was modified examples/src/main/python/status_api_demo.py
The file was modified examples/src/main/python/pagerank.py
The file was modified dev/merge_spark_pr.py
The file was modified examples/src/main/python/mllib/stratified_sampling_example.py
The file was modified examples/src/main/python/ml/estimator_transformer_param_example.py
The file was modified examples/src/main/python/ml/dataframe_example.py
The file was modified examples/src/main/python/mllib/tf_idf_example.py
The file was modified examples/src/main/python/parquet_inputformat.py
The file was modified examples/src/main/python/ml/vector_assembler_example.py
The file was modified examples/src/main/python/ml/elementwise_product_example.py
The file was modified examples/src/main/python/sql/streaming/structured_network_wordcount.py
The file was modified examples/src/main/python/ml/gaussian_mixture_example.py
The file was modified python/pyspark/streaming/context.py
The file was modified examples/src/main/python/transitive_closure.py
The file was modified examples/src/main/python/ml/summarizer_example.py
The file was modified examples/src/main/python/ml/als_example.py
The file was modified python/pyspark/ml/tree.py
The file was modified python/pyspark/conf.py
The file was modified examples/src/main/python/mllib/streaming_k_means_example.py
The file was modified examples/src/main/python/sql/datasource.py
The file was modified dev/github_jira_sync.py
The file was modified python/pyspark/ml/common.py
The file was modified python/pyspark/shell.py
The file was modified python/pyspark/mllib/stat/KernelDensity.py
The file was modified python/pyspark/mllib/feature.py
The file was modified python/pyspark/sql/utils.py
The file was modified .github/workflows/master.yml
The file was modified examples/src/main/python/ml/normalizer_example.py
The file was modified python/pyspark/serializers.py
The file was modified python/pyspark/sql/types.py
The file was modified python/pyspark/ml/param/__init__.py
The file was modified resource-managers/kubernetes/integration-tests/tests/pyfiles.py
The file was modified resource-managers/kubernetes/integration-tests/tests/worker_memory_check.py
The file was modified examples/src/main/python/mllib/random_forest_classification_example.py
The file was modified python/pyspark/mllib/tree.py
The file was modified python/pyspark/streaming/dstream.py
The file was modified examples/src/main/python/mllib/word2vec.py
The file was modified python/pyspark/tests/test_shuffle.py
The file was modified examples/src/main/python/streaming/network_wordcount.py
The file was modified examples/src/main/python/ml/logistic_regression_with_elastic_net.py
The file was modified examples/src/main/python/ml/variance_threshold_selector_example.py
The file was modified python/pyspark/ml/tests/test_param.py
The file was modified dev/run-tests-jenkins.py
The file was modified examples/src/main/python/ml/tf_idf_example.py
The file was modified docs/configuration.md
The file was modified python/pyspark/java_gateway.py
The file was modified python/pyspark/sql/pandas/conversion.py
The file was modified examples/src/main/python/mllib/streaming_linear_regression_example.py
The file was modified examples/src/main/python/mllib/random_forest_regression_example.py
The file was modified examples/src/main/python/kmeans.py
The file was modified python/pyspark/sql/column.py
The file was modified python/pyspark/sql/udf.py
The file was modified examples/src/main/python/mllib/isotonic_regression_example.py
The file was modified examples/src/main/python/streaming/sql_network_wordcount.py
The file was modified python/pyspark/mllib/__init__.py
The file was modified examples/src/main/python/mllib/k_means_example.py
The file was modified examples/src/main/python/sql/basic.py
The file was modified examples/src/main/python/ml/cross_validator.py
The file was modified examples/src/main/python/ml/linear_regression_with_elastic_net.py
The file was modified examples/src/main/python/mllib/power_iteration_clustering_example.py
The file was modified examples/src/main/python/streaming/hdfs_wordcount.py
The file was modified examples/src/main/python/ml/generalized_linear_regression_example.py
The file was modified examples/src/main/python/ml/gradient_boosted_tree_classifier_example.py
The file was modified examples/src/main/python/mllib/decision_tree_regression_example.py
The file was modified python/pyspark/sql/group.py
The file was modified examples/src/main/python/mllib/gaussian_mixture_example.py
The file was modified examples/src/main/python/mllib/gradient_boosting_regression_example.py
The file was modified python/pyspark/sql/catalog.py
The file was modified examples/src/main/python/mllib/binary_classification_metrics_example.py
The file was modified python/pyspark/ml/classification.py
The file was modified python/pyspark/mllib/clustering.py
The file was modified examples/src/main/python/ml/chisq_selector_example.py
The file was modified examples/src/main/python/ml/quantile_discretizer_example.py
The file was modified python/pyspark/sql/session.py
The file was modified examples/src/main/python/mllib/kmeans.py
The file was modified python/pyspark/mllib/util.py
The file was modified examples/src/main/python/ml/one_vs_rest_example.py
The file was modified examples/src/main/python/mllib/elementwise_product_example.py
The file was modified python/pyspark/sql/tests/test_column.py
The file was modified examples/src/main/python/ml/gradient_boosted_tree_regressor_example.py
The file was modified python/pyspark/sql/tests/test_pandas_cogrouped_map.py
The file was modified python/pyspark/util.py
The file was modified examples/src/main/python/ml/chi_square_test_example.py
The file was modified examples/src/main/python/sort.py
The file was modified examples/src/main/python/ml/kmeans_example.py
The file was modified examples/src/main/python/ml/onehot_encoder_example.py
The file was modified examples/src/main/python/ml/count_vectorizer_example.py
The file was modified examples/src/main/python/sql/arrow.py
The file was modified examples/src/main/python/streaming/network_wordjoinsentiments.py
The file was modified external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py
The file was modified python/pyspark/resultiterable.py
The file was modified python/pyspark/tests/test_readwrite.py
The file was modified python/pyspark/ml/tests/test_training_summary.py
The file was modified python/pyspark/taskcontext.py
The file was modified python/pyspark/tests/test_worker.py
The file was modified examples/src/main/python/ml/index_to_string_example.py
The file was modified python/pyspark/sql/tests/test_functions.py
The file was modified examples/src/main/python/ml/max_abs_scaler_example.py
The file was modified examples/src/main/python/ml/robust_scaler_example.py
The file was modified examples/src/main/python/ml/isotonic_regression_example.py
The file was modified python/pyspark/ml/param/_shared_params_code_gen.py
The file was modified docs/rdd-programming-guide.md
The file was modified examples/src/main/python/mllib/gradient_boosting_classification_example.py
The file was modified python/pyspark/sql/avro/functions.py
The file was modified examples/src/main/python/ml/correlation_example.py
The file was modified examples/src/main/python/sql/hive.py
The file was modified examples/src/main/python/ml/vector_indexer_example.py
The file was modified core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
The file was modified examples/src/main/python/ml/fvalue_test_example.py
The file was modified python/pyspark/sql/streaming.py
The file was modified python/pyspark/testing/sqlutils.py
The file was modified examples/src/main/python/mllib/normalizer_example.py
The file was modified python/pyspark/ml/util.py
The file was modified examples/src/main/python/logistic_regression.py
The file was modified examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py
The file was modified examples/src/main/python/ml/pca_example.py
The file was modified examples/src/main/python/ml/vector_slicer_example.py
The file was modified python/pyspark/ml/tuning.py
The file was modified python/pyspark/tests/test_profiler.py
The file was modified python/pyspark/sql/tests/test_types.py
The file was modified sql/hive/src/test/resources/data/scripts/dumpdata_script.py
The file was modified examples/src/main/python/ml/standard_scaler_example.py
The file was modified examples/src/main/python/ml/min_hash_lsh_example.py
The file was modified examples/src/main/python/ml/bisecting_k_means_example.py
The file was modified python/pyspark/broadcast.py
The file was modified dev/lint-python
The file was modified examples/src/main/python/ml/anova_test_example.py
The file was modified python/pyspark/tests/test_rdd.py
The file was modified examples/src/main/python/ml/vector_size_hint_example.py
The file was modified examples/src/main/python/ml/fm_classifier_example.py
The file was modified python/pyspark/mllib/common.py
The file was modified examples/src/main/python/ml/polynomial_expansion_example.py
The file was modified examples/src/main/python/mllib/recommendation_example.py
The file was modified python/pyspark/sql/tests/test_pandas_udf_scalar.py
The file was modified sql/hive/src/test/resources/data/scripts/cat.py
The file was modified examples/src/main/python/ml/string_indexer_example.py
The file was modified examples/src/main/python/mllib/hypothesis_testing_kolmogorov_smirnov_test_example.py
The file was modified examples/src/main/python/ml/min_max_scaler_example.py
The file was modified python/pyspark/tests/test_util.py
The file was modified examples/src/main/python/streaming/stateful_network_wordcount.py
The file was modified python/pyspark/sql/readwriter.py
The file was modified python/pyspark/ml/tests/test_feature.py
The file was modified python/pyspark/sql/tests/test_pandas_map.py
The file was modified python/pyspark/worker.py
The file was modified examples/src/main/python/mllib/sampled_rdds.py
The file was modified dev/sparktestsupport/toposort.py
The file was modified examples/src/main/python/ml/binarizer_example.py
The file was modified examples/src/main/python/ml/multiclass_logistic_regression_with_elastic_net.py
The file was modified examples/src/main/python/ml/anova_selector_example.py
The file was modified examples/src/main/python/mllib/correlations_example.py
The file was modified examples/src/main/python/mllib/latent_dirichlet_allocation_example.py
Commit 24be81689cee76e03cd5136dfd089123bbff4595 by wenchen
[SPARK-32241][SQL] Remove empty children of union
### What changes were proposed in this pull request?
This PR removes the empty child relations of a `Union`.
E.g. the query `SELECT c FROM t UNION ALL SELECT c FROM t WHERE false`
has the following plan before this PR:
```
== Physical Plan ==
Union
:- *(1) Project [value#219 AS c#222]
:  +- *(1) LocalTableScan [value#219]
+- LocalTableScan <empty>, [c#224]
```
and after this PR:
```
== Physical Plan ==
*(1) Project [value#219 AS c#222]
+- *(1) LocalTableScan [value#219]
```
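For reference, a hedged PySpark reproduction of the plans above (the expression IDs such as `value#219` vary by session):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1,), (2,)], ["value"]) \
    .selectExpr("value AS c") \
    .createOrReplaceTempView("t")

# The "WHERE false" branch is an empty LocalRelation; with this PR the
# optimizer drops it and collapses the now single-child Union.
spark.sql("SELECT c FROM t UNION ALL SELECT c FROM t WHERE false").explain()
```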
### Why are the changes needed?
To have a simpler plan.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added new UTs.
Closes #29053 from
peter-toth/SPARK-32241-remove-empty-children-of-union.
Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: 24be816)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Commit cc9371d885867b2cbc5d61d90083d23ce017f7a7 by wenchen
[SPARK-32258][SQL] Not duplicate normalization on children for
float/double If/CaseWhen/Coalesce
### What changes were proposed in this pull request?
This is a follow-up to #29061. See
https://github.com/apache/spark/pull/29061#discussion_r453458611.
Basically this moves the If/CaseWhen/Coalesce case patterns after the
float/double case so we don't duplicate normalization on the children of
float/double If/CaseWhen/Coalesce.
### Why are the changes needed?
Simplify expression tree.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Modify unit tests.
Closes #29091 from viirya/SPARK-32258-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: cc9371d)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingPointNumbersSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala
Commit d6a68e0b67ff7de58073c176dd097070e88ac831 by dongjoon
[SPARK-29292][STREAMING][SQL][BUILD] Get streaming, catalyst, sql
compiling for Scala 2.13
### What changes were proposed in this pull request?
Continuation of https://github.com/apache/spark/pull/28971 which lets
streaming, catalyst and sql compile for 2.13. Same idea.
### Why are the changes needed?
Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests. (2.13 was not tested; this is about getting it to
compile without breaking 2.12)
Closes #29078 from srowen/SPARK-29292.2.
Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun
<dongjoon@apache.org>
(commit: d6a68e0)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeNamespaceExec.scala
The file was modified streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ContinuousRecordEndpoint.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala
The file was modified external/kafka-0-10/src/test/java/org/apache/spark/streaming/kafka010/JavaConsumerStrategySuite.java
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/ForeachWriterSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
The file was modified streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExec.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala
The file was modified streaming/src/test/scala/org/apache/spark/streaming/DStreamScopeSuite.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DatasetPrimitiveSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala
The file was modified streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ComplexDataSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
The file was modified external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
The file was modified streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
The file was modified streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala
The file was modified streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
The file was modified streaming/src/main/scala/org/apache/spark/streaming/util/BatchedWriteAheadLog.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeWriteCompatibilitySuite.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryStatusAndProgressSuite.scala
The file was modified streaming/src/test/scala/org/apache/spark/streaming/MasterFailureTest.scala
The file was modified streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/PandasGroupUtils.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
The file was modified external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/ConsumerStrategy.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenerSuite.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsTestUtils.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/memory.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/window/AggregateProcessor.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/IntervalBenchmark.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SparkPlanGraph.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala
The file was modified streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingListenerWrapper.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/AggregatingAccumulator.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala
The file was modified streaming/src/main/scala/org/apache/spark/streaming/dstream/UnionDStream.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
The file was modified streaming/src/test/scala/org/apache/spark/streaming/JavaTestUtils.scala
The file was modified streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverSchedulingPolicy.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ShowCreateTableSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvalPythonExec.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CustomShuffleReaderExec.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala
Commit a47b69a88a271e423271709ee491e2de57c5581b by dongjoon
[SPARK-32307][SQL] ScalaUDF's canonicalized expression should exclude
inputEncoders
### What changes were proposed in this pull request?
Override `canonicalized` to empty the `inputEncoders` for the
canonicalized `ScalaUDF`.
### Why are the changes needed?
The following currently fails on `branch-3.0`, but not on the Apache
Spark 3.0.0 release.
```scala
spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")
checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil)

[info]   org.apache.spark.sql.AnalysisException: expression 't.`a`' is
neither present in the group by, nor is it an aggregate function. Add to
group by or wrap in first() (or first_value) if you don't care which
value you get.;;
[info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8]
[info] +- SubqueryAlias t
[info]    +- Project [value#3 AS a#6]
[info]       +- LocalRelation [value#3]
[info]   org.apache.spark.sql.AnalysisException: expression 't.`a`' is
neither present in the group by, nor is it an aggregate function. Add to
group by or wrap in first() (or first_value) if you don't care which
value you get.;;
[info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8]
[info] +- SubqueryAlias t
[info]    +- Project [value#3 AS a#6]
[info]       +- LocalRelation [value#3]
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
[info]   at
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130)
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:257)
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
[info]   at
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[info]   at
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[info]   at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
...
```
We use the rule `ResolveEncodersInUDF` to resolve `inputEncoders`, and
the original `ScalaUDF` instance is updated to a new `ScalaUDF` instance
with the resolved encoders at the end. Note that during encoder
resolution, types like `map` and `array` are resolved to new
expressions (e.g. `MapObjects`, `CatalystToExternalMap`).
However, `ExpressionEncoder` can't be canonicalized. Thus, the
canonicalized `ScalaUDF`s become different even if their original
`ScalaUDF`s are the same. Finally, this fails
`checkValidAggregateExpression` when such a `ScalaUDF` is used as a
group expression.
### Does this PR introduce _any_ user-facing change?
Yes, users will not hit the exception after this fix.
### How was this patch tested?
Added tests.
Closes #29106 from Ngone51/spark-32307.
Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun
<dongjoon@apache.org>
(commit: a47b69a)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
Commit 2a0faca830a418ffed8da0c1962defc081a26aa2 by dongjoon
[SPARK-32309][PYSPARK] Import missing sys import
### What changes were proposed in this pull request?
While seeing if we can use mypy for checking the Python types, I've
stumbled across this missing import:
https://github.com/apache/spark/blob/34fa913311bc1730015f1af111ff4a03c2bad9f6/python/pyspark/ml/feature.py#L5773-L5774
### Why are the changes needed?
The `import` is required because it's used.
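A minimal sketch of the failure mode (a hypothetical module, not the actual `feature.py` code); the missing import only surfaces at call time, which is why a static check spots it earlier than tests do:
```python
# Before: the module referenced sys with no "import sys" in scope, so the
# affected code path raised NameError: name 'sys' is not defined.
# After: the missing module-level import is added.
import sys

def major_version_tag():
    return "py%d" % sys.version_info.major  # resolves now that sys is imported
```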
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual.
Closes #29108 from Fokko/SPARK-32309.
Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 2a0faca)
The file was modified python/pyspark/ml/feature.py
Commit 5e0cb3ee16dde3666af3bf5f2b152c7d0dfe9d7b by dongjoon
[SPARK-32305][BUILD] Make `mvn clean` remove `metastore_db` and
`spark-warehouse`
### What changes were proposed in this pull request?
Add additional configuration to `maven-clean-plugin` to ensure that the
`metastore_db` and `spark-warehouse` directories are cleaned up when the
`mvn clean` command is executed.
### Why are the changes needed?
Spark now supports two versions of the built-in Hive, and some tests
generate metadata outside the target dir, such as `metastore_db`; these
are not cleaned up automatically when we run the `mvn clean` command.
So if we run `mvn clean test -pl sql/hive -am -Phadoop-2.7 -Phive
-Phive-1.2`, the `metastore_db` dir is created and its metadata remains
after the tests complete.
We then need to manually clean up the `metastore_db` directory to ensure
that `mvn clean test -pl sql/hive -am -Phadoop-2.7 -Phive` (which uses
the hive-2.3 profile) can succeed, because the residual metastore data
is not compatible.
`spark-warehouse` can also cause test failures in some residual-data
scenarios because test cases assume that the metadata does not exist.
This PR simplifies the manual cleanup of the `metastore_db` and
`spark-warehouse` directories.
### How was this patch tested?
Manually executed `mvn clean test -pl sql/hive -am -Phadoop-2.7 -Phive
-Phive-1.2`, then executed `mvn clean test -pl sql/hive -am -Phadoop-2.7
-Phive`; both commands should succeed.
Closes #29103 from LuciferYang/add-clean-directory.
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 5e0cb3e)
The file was modified pom.xml
Commit c602d79f89a133d6cbc9cc8d95cb09510cbd9c30 by dongjoon
[SPARK-32311][PYSPARK][TESTS] Remove duplicate import
### What changes were proposed in this pull request?
`datetime` is already imported a few lines below :)
https://github.com/apache/spark/blob/ce27cc54c1b2e533cd91e31f2414f3e0a172c328/python/pyspark/sql/tests/test_pandas_udf_scalar.py#L24
### Why are the changes needed?
This is the last instance of the duplicate import.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual.
Closes #29109 from Fokko/SPARK-32311.
Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: c602d79)
The file was modified python/pyspark/sql/tests/test_pandas_udf_scalar.py
Commit 90b0c26b222dcb8f207f152494604aac090eb940 by kabhwan.opensource
[SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI
faster
### What changes were proposed in this pull request?
Add a new class HybridStore to make the history server faster when
loading event files. When rebuilding the application state from event
logs, HybridStore will write data to InMemoryStore at first and use a
background thread to dump data to LevelDB once the writing to
InMemoryStore is completed.
HybridStore makes content serving faster by using more memory. It's
only safe to enable it when the cluster is not under heavy load.
### Why are the changes needed?
HybridStore can greatly reduce the event log loading time, especially
for large log files. In general, it brings a 4x-6x UI loading speed
improvement for large log files. Detailed results are shown in the PR
comments.
### Does this PR introduce any user-facing change?
This PR adds new configs `spark.history.store.hybridStore.enabled` and
`spark.history.store.hybridStore.maxMemoryUsage`.
### How was this patch tested?
A test suite for HybridStore is added. I also manually tested it on
3.1.0 on macOS.
This is a follow-up to the work done by Hieu Huynh in 2019.
Closes #28412 from baohe-zhang/SPARK-31608.
Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com> Signed-off-by:
Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 90b0c26)
The file was modified docs/monitoring.md
The file was added core/src/main/scala/org/apache/spark/deploy/history/HistoryServerMemoryManager.scala
The file was added core/src/main/scala/org/apache/spark/deploy/history/HybridStore.scala
The file was modified core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
The file was modified core/src/main/scala/org/apache/spark/internal/config/History.scala
Commit 902e1342a324c9e1e01dc68817850d9241a58227 by dongjoon
[SPARK-32303][PYTHON][TESTS] Remove leftover from editable mode
installation in PIP test
### What changes were proposed in this pull request?
Currently the Jenkins PIP packaging test fails intermittently as below:
```
Installing dist into virtual env
Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
Collecting py4j==0.10.9 (from pyspark==3.1.0.dev0)
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Installing collected packages: py4j, pyspark
  Found existing installation: py4j 0.10.9
    Uninstalling py4j-0.10.9:
      Successfully uninstalled py4j-0.10.9
  Found existing installation: pyspark 3.1.0.dev0
Exception:
Traceback (most recent call last):
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 179, in main
    status = self.run(options, args)
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 393, in run
    use_user_site=options.use_user_site,
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 50, in install_given_reqs
    auto_confirm=True
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 816, in uninstall
    uninstalled_pathset = UninstallPathSet.from_dist(dist)
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py", line 505, in from_dist
    '(at %s)' % (link_pointer, dist.project_name, dist.location)
AssertionError: Egg-link /home/jenkins/workspace/SparkPullRequestBuilder3/python does not match installed
```
- https://github.com/apache/spark/pull/29099#issuecomment-658073453 (amp-jenkins-worker-04)
- https://github.com/apache/spark/pull/29090#issuecomment-657819973 (amp-jenkins-worker-03)
It seems the previous editable-mode installation affects other PRs.
This PR simply works around the problem by removing the symbolic link
left by the previous editable installation. This is a common
workaround, to my knowledge.
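In Python terms, the workaround amounts to deleting the stale egg-link left by a previous `pip install -e` (a rough sketch; the actual change is in the `dev/run-pip-tests` shell script, and the path below is hypothetical):
```python
import os

# Hypothetical site-packages location; the CI script resolves the real one.
egg_link = "/home/anaconda/envs/py36/lib/python3.6/site-packages/pyspark.egg-link"
if os.path.exists(egg_link):
    os.remove(egg_link)  # drop the leftover from the previous editable install
```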
### Why are the changes needed?
To recover the Jenkins build.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Jenkins build will test it out.
Closes #29102 from HyukjinKwon/SPARK-32303.
Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by:
Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun
<dongjoon@apache.org>
(commit: 902e134)
The file was modified dev/run-pip-tests
Commit 676d92ecceb3d46baa524c725b9f9a14450f1e9d by gurwls223
[SPARK-32301][PYTHON][TESTS] Add a test case for toPandas to work with
empty partitioned Spark DataFrame
### What changes were proposed in this pull request?
This PR proposes to port the test case from
https://github.com/apache/spark/pull/29098 to branch-3.0 and master. In
master and branch-3.0, this was fixed together at
https://github.com/apache/spark/commit/ecaa495b1fe532c36e952ccac42f4715809476af,
but the empty-partition case is not being tested.
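A sketch of the scenario the ported test exercises (assumed shape, not the exact test code): `toPandas()` with Arrow enabled on a DataFrame containing empty partitions:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# 3 rows spread over 10 partitions leaves most partitions empty.
df = spark.range(3).repartition(10)
print(df.toPandas())  # must return all 3 rows despite the empty partitions
```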
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
Unit test was forward-ported.
Closes #29099 from HyukjinKwon/SPARK-32300-1.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 676d92e)
The file was modified python/pyspark/sql/tests/test_arrow.py
Commit 03b5707b516187aaa8012049fce8b1cd0ac0fddd by gurwls223
[MINOR][R] Match collectAsArrowToR with non-streaming
collectAsArrowToPython
### What changes were proposed in this pull request?
This PR proposes to port forward #29098 to `collectAsArrowToR`.
`collectAsArrowToR` follows `collectAsArrowToPython` in branch-2.4 due
to the limitation of ARROW-4512. SparkR vectorization currently cannot
use the streaming format.
### Why are the changes needed?
For simplicity and consistency.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The same code is being tested in `collectAsArrowToPython` of branch-2.4.
Closes #29100 from HyukjinKwon/minor-parts.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 03b5707)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
Commit 6bdd710c4d4125b0801a93d57f53e05e301ebebd by dongjoon
[SPARK-32316][TESTS][INFRA] Test PySpark with Python 3.8 in Github
Actions
### What changes were proposed in this pull request?
This PR aims to test PySpark with Python 3.8 in Github Actions. On the
script side, it is already ready:
https://github.com/apache/spark/blob/4ad9bfd53b84a6d2497668c73af6899bae14c187/python/run-tests.py#L161
This PR includes small related fixes together:
1. Install Python 3.8.
2. Only install one Python implementation instead of many for the SQL
and Yarn test cases, because they need one Python executable that is
higher than Python 2.
3. Do not install Python 2, which is not needed anymore after we dropped
Python 2 in SPARK-32138.
4. Remove a comment about installing PyPy3 on Jenkins (SPARK-32278); it
is already installed.
### Why are the changes needed?
Currently, only PyPy3 and Python 3.6 are being tested with PySpark in
Github Actions. We should test the latest version of Python as well
because some optimizations can only be enabled with Python 3.8+. See
also https://github.com/apache/spark/pull/29114
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Was not tested. Github Actions build in this PR will test it out.
Closes #29116 from HyukjinKwon/test-python3.8-togehter.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 6bdd710)
The file was modified python/run-tests.py
The file was modified .github/workflows/master.yml
Commit af8e65fca989518cf65ec47f77eea2ce649bd6bb by dongjoon
[SPARK-32276][SQL] Remove redundant sorts before repartition nodes
### What changes were proposed in this pull request?
This PR removes redundant sorts before repartition nodes that shuffle
data and before repartitionByExpression with deterministic expressions.
### Why are the changes needed?
It looks like our `EliminateSorts` rule can be extended further to
remove sorts before repartition nodes that shuffle data, as such
repartition operations change the ordering and distribution of data.
That's why it seems safe to perform the following rewrites:
- `Repartition -> Sort -> Scan` as `Repartition -> Scan`
- `Repartition -> Project -> Sort -> Scan` as `Repartition -> Project ->
Scan`
We don't apply this optimization to coalesce, as it uses
`DefaultPartitionCoalescer`, which may preserve the ordering of data if
there is no locality info in the parent RDD. At the same time, there is
no guarantee that will happen.
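A hedged PySpark illustration of the pattern the extended `EliminateSorts` rule targets:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# repartition(4) shuffles the data and destroys any ordering, so the sort
# beneath it does no useful work and can be removed from the optimized plan.
spark.range(100).orderBy(col("id").desc()).repartition(4).explain(True)
```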
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
More test cases.
Closes #29089 from aokolnychyi/spark-32276.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by:
Dongjoon Hyun <dongjoon@apache.org>
(commit: af8e65f)
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Commit 542aefb4c4dd5ca2734773ffe983ba740729d074 by kabhwan.opensource
[SPARK-31985][SS] Remove incomplete/undocumented stateful aggregation in
continuous mode
### What changes were proposed in this pull request?
This removes the undocumented and incomplete "stateful aggregation"
feature in continuous mode, deleting 1,100+ lines of code.
### Why are the changes needed?
Work on the feature had been stopped for over a year, and no one has
asked for it in the community. The current state of the feature is that
it only works with `coalesce(1)`, which forces the query to read,
process, and write in a single task; that doesn't make sense in
production.
The remaining code also increases the work needed for DSv2 changes;
that's why I don't simply propose reverting the relevant commits - the
code path has changed due to DSv2 evolution.
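For reference, a sketch of the kind of query affected (assumed behavior after this change: analysis rejects stateful aggregation under the continuous trigger):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rate = spark.readStream.format("rate").load()

writer = (rate.groupBy("value").count()
              .writeStream.format("console")
              .outputMode("update")
              .trigger(continuous="1 second"))
# writer.start() now fails analysis; previously the aggregation "worked"
# only when forced through coalesce(1).
```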
### Does this PR introduce _any_ user-facing change?
Technically no, because it was never documented and can't be used in
production in its current shape.
### How was this patch tested?
Existing tests.
Closes #29077 from HeartSaVioR/SPARK-31985.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 542aefb)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/RPCContinuousShuffleReader.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousCoalesceRDD.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
The file was removed sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleSuite.scala
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleReadRDD.scala
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousCoalesceExec.scala
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleReader.scala
The file was removed sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousAggregationSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDD.scala
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/RPCContinuousShuffleWriter.scala
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleWriter.scala
Commit 2527fbc896dc8a26f5a281ed719fb59b5df8cd2f by dongjoon
Revert "[SPARK-32276][SQL] Remove redundant sorts before repartition
nodes"
This reverts commit af8e65fca989518cf65ec47f77eea2ce649bd6bb.
(commit: 2527fbc)
The file was removed sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Commit e4499932da03743cb05c6bcc5d0149728380383a by dkbiswal
[SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's
Scan Node
### What changes were proposed in this pull request?
Improve the EXPLAIN FORMATTED output of DSV2 Scan nodes (file based
ones).
**Before**
```
== Physical Plan ==
* Project (4)
+- * Filter (3)
  +- * ColumnarToRow (2)
     +- BatchScan (1)
(1) BatchScan
Output [2]: [value#7, id#8]
Arguments: [value#7, id#8], ParquetScan(org.apache.spark.sql.test.TestSparkSession17477bbb,Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml,org.apache.spark.sql.execution.datasources.InMemoryFileIndexa6c363ce,StructType(StructField(value,IntegerType,true)),StructType(StructField(value,IntegerType,true)),StructType(StructField(id,IntegerType,true)),[Lorg.apache.spark.sql.sources.Filter;40fee459,org.apache.spark.sql.util.CaseInsensitiveStringMapfeca1ec6,Vector(isnotnull(id#8), (id#8 > 1)),List(isnotnull(value#7), (value#7 > 2)))
(2) ...
(3) ...
(4) ...
```
**After**
```
== Physical Plan ==
* Project (4)
+- * Filter (3)
  +- * ColumnarToRow (2)
     +- BatchScan (1)
(1) BatchScan
Output [2]: [value#7, id#8]
DataFilters: [isnotnull(value#7), (value#7 > 2)]
Format: parquet
Location: InMemoryFileIndex[....]
PartitionFilters: [isnotnull(id#8), (id#8 > 1)]
PushedFilers: [IsNotNull(id), IsNotNull(value), GreaterThan(id,1), GreaterThan(value,2)]
ReadSchema: struct<value:int>
(2) ...
(3) ...
(4) ...
```
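To produce the FORMATTED output above from PySpark (a sketch: the path is illustrative, and the emptied `spark.sql.sources.useV1SourceList` forces the DSv2 `BatchScan` path for file sources):
```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.sources.useV1SourceList", "")  # use DSv2 file scans
         .getOrCreate())

df = spark.read.parquet("/tmp/example")  # hypothetical partitioned dataset
df.filter("value > 2").explain(mode="formatted")
```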
### Why are the changes needed?
The old format is not very readable. This improves the readability of
the plan.
### Does this PR introduce any user-facing change?
Yes, the explain output will be different.
### How was this patch tested?
Added a test case in ExplainSuite.
Closes #28425 from dilipbiswal/dkb_dsv2_explain.
Lead-authored-by: Dilip Biswal <dkbiswal@gmail.com> Co-authored-by:
Dilip Biswal <dkbiswal@apache.org> Signed-off-by: Dilip Biswal
<dkbiswal@apache.org>
(commit: e449993)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVScan.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScan.scala (diff)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScan.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/internal/connector/SupportsMetadata.scala
Commit 8950dcbb1cafccc2ba8bbf030ab7ac86cfe203a4 by dongjoon
[SPARK-32318][SQL][TESTS] Add a test case to EliminateSortsSuite for
ORDER BY in DISTRIBUTE BY
### What changes were proposed in this pull request?
This PR aims to add a test case to EliminateSortsSuite to protect a
valid use case: using ORDER BY inside a DISTRIBUTE BY statement.
### Why are the changes needed?
```scala
scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")

scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/master")

$ ls -al /tmp/master/
total 56
drwxr-xr-x  10 dongjoon  wheel  320 Jul 14 22:12 ./
drwxrwxrwt  15 root      wheel  480 Jul 14 22:12 ../
-rw-r--r--   1 dongjoon  wheel    8 Jul 14 22:12 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Jul 14 22:12 .part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 .part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 .part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    0 Jul 14 22:12 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  119 Jul 14 22:12 part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  932 Jul 14 22:12 part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
```
The following was found during SPARK-32276: if the Spark optimizer removes
the inner `ORDER BY`, the file size increases.
```scala
scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")

scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/SPARK-32276")

$ ls -al /tmp/SPARK-32276/
total 632
drwxr-xr-x  10 dongjoon  wheel     320 Jul 14 22:08 ./
drwxrwxrwt  14 root      wheel     448 Jul 14 22:08 ../
-rw-r--r--   1 dongjoon  wheel       8 Jul 14 22:08 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel      12 Jul 14 22:08 .part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    1188 Jul 14 22:08 .part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    1188 Jul 14 22:08 .part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel       0 Jul 14 22:08 _SUCCESS
-rw-r--r--   1 dongjoon  wheel     119 Jul 14 22:08 part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  150735 Jul 14 22:08 part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
```
### Does this PR introduce _any_ user-facing change?
No. This only improves the test coverage.
### How was this patch tested?
Pass the GitHub Action or Jenkins.
Closes #29118 from dongjoon-hyun/SPARK-32318.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 8950dcb)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala (diff)
Commit cf22d947fb8f37aa4d394b6633d6f08dbbf6dc1c by tgraves
[SPARK-32036] Replace references to blacklist/whitelist language with
more appropriate terminology, excluding the blacklisting feature
### What changes were proposed in this pull request?
This PR will remove references to these "blacklist" and "whitelist"
terms besides the blacklisting feature as a whole, which can be handled
in a separate JIRA/PR.
This touches quite a few files, but the changes are straightforward
(variable/method/etc. name changes) and most are quite self-contained.
### Why are the changes needed?
As per discussion on the Spark dev list, it will be beneficial to remove
references to problematic language that can alienate potential community
members. One such reference is "blacklist" and "whitelist". While it
seems to me that there is some valid debate as to whether these terms
have racist origins, the cultural connotations are inescapable in
today's world.
### Does this PR introduce _any_ user-facing change?
In the test file `HiveQueryFileTest`, a developer has the ability to
specify the system property `spark.hive.whitelist` to specify a list of
Hive query files that should be tested. This system property has been
renamed to `spark.hive.includelist`. The old property has been kept for
compatibility, but will log a warning if used. I am open to feedback
from others on whether keeping a deprecated property here is unnecessary
given that this is just for developers running tests.
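For illustration, a fallback of that shape might look like the sketch below (variable names are mine, not the patch's code):
```scala
// Prefer the new property; fall back to the deprecated one with a warning.
val includeList: Seq[String] =
  sys.props.get("spark.hive.includelist")
    .orElse {
      val old = sys.props.get("spark.hive.whitelist")
      if (old.isDefined) {
        Console.err.println(
          "Property spark.hive.whitelist is deprecated; use spark.hive.includelist")
      }
      old
    }
    .map(_.split(",").toSeq)
    .getOrElse(Seq.empty)
```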
### How was this patch tested?
Existing tests should be suitable since no behavior changes are expected
as a result of this PR.
Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.
Authored-by: Erik Krogen <ekrogen@linkedin.com> Signed-off-by: Thomas
Graves <tgraves@apache.org>
(commit: cf22d94)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.java (diff)
The file was removed sql/hive/src/test/resources/ql/src/test/queries/clientpositive/add_partition_with_whitelist.q
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala (diff)
The file was modified python/run-tests.py (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQueryFileTest.scala (diff)
The file was modified docs/streaming-programming-guide.md (diff)
The file was added sql/hive/src/test/resources/ql/src/test/queries/clientpositive/add_partition_with_includelist.q
The file was modified streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/security/HiveHadoopDelegationTokenManagerSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala (diff)
The file was removed sql/core/src/test/resources/sql-tests/inputs/blacklist.sql
The file was modified common/network-common/src/main/java/org/apache/spark/network/crypto/README.md (diff)
The file was added sql/core/src/test/resources/sql-tests/inputs/ignored.sql
The file was modified core/src/test/scala/org/apache/spark/ThreadAudit.scala (diff)
The file was modified project/SparkBuild.scala (diff)
The file was modified dev/sparktestsupport/modules.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonOutputWriter.scala (diff)
The file was modified sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (diff)
The file was modified resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala (diff)
The file was modified python/pyspark/sql/pandas/typehints.py (diff)
The file was modified python/pyspark/cloudpickle.py (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala (diff)
The file was added sql/hive/src/test/resources/ql/src/test/queries/clientpositive/alter_partition_with_includelist.q
The file was modified resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala (diff)
The file was added sql/hive/src/test/resources/ql/src/test/queries/clientpositive/add_partition_no_includelist.q
The file was removed sql/hive/src/test/resources/ql/src/test/queries/clientpositive/add_partition_no_whitelist.q
The file was modified core/src/main/scala/org/apache/spark/util/JsonProtocol.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala (diff)
The file was modified python/pylintrc (diff)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/util/DockerUtils.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala (diff)
The file was modified sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala (diff)
The file was removed sql/hive/src/test/resources/ql/src/test/queries/clientpositive/alter_partition_with_whitelist.q
The file was modified examples/src/main/python/streaming/recoverable_network_wordcount.py (diff)
The file was modified examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PullupCorrelatedPredicatesSuite.scala (diff)
The file was modified R/pkg/tests/fulltests/test_sparkSQL.R (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionClient.scala (diff)
The file was modified R/pkg/tests/fulltests/test_context.R (diff)
The file was modified R/pkg/tests/run-all.R (diff)
The file was modified core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala (diff)
Commit b05f309bc9e51e8f7b480b5d176589773b5d59f7 by huaxing
[SPARK-32140][ML][PYSPARK] Add training summary to FMClassificationModel
### What changes were proposed in this pull request?
Add training summary for FMClassificationModel...
### Why are the changes needed?
So that users can inspect the training process, such as the loss value of each
iteration and the total number of iterations.
### Does this PR introduce _any_ user-facing change?
Yes: FMClassificationModel.summary and FMClassificationModel.evaluate.
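A hedged usage sketch of the new API (`training` is an assumed DataFrame with "label" and "features" columns; the summary accessors follow the existing ML training-summary convention):
```scala
import org.apache.spark.ml.classification.FMClassifier

val model = new FMClassifier().setMaxIter(10).fit(training)
val summary = model.summary                       // training summary added by this PR
println(s"iterations: ${summary.totalIterations}")
println(summary.objectiveHistory.mkString(", ")) // loss per iteration
```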
### How was this patch tested?
New tests.
Closes #28960 from huaxingao/fm_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Huaxin Gao
<huaxing@us.ibm.com>
(commit: b05f309)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/FMClassifier.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/FMClassifierSuite.scala (diff)
The file was modified python/pyspark/ml/classification.py (diff)
The file was modified python/pyspark/ml/tests/test_training_summary.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/FMRegressor.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala (diff)
Commit c28a6fa5112c9ba3839f52b737266f24fdfcf75b by dongjoon
[SPARK-29292][SQL][ML] Update rest of default modules (Hive, ML, etc)
for Scala 2.13 compilation
### What changes were proposed in this pull request?
Same as https://github.com/apache/spark/pull/29078 and
https://github.com/apache/spark/pull/28971. This makes the rest of the default
modules (i.e. those you get without specifying `-Pyarn` etc.) compile under
Scala 2.13. As a result, it does not close the JIRA; it also, of course, does
not demonstrate that tests pass under 2.13 yet.
Note that this does not fix the `repl` module; that's separate.
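As an illustration of the kind of source change such a cross-build typically needs (an assumed example, not code from this patch): in 2.13, `Map#mapValues` returns a lazy `MapView`, so call sites that need a strict `Map` are rewritten in a form that compiles the same way on both versions:
```scala
val counts: Map[String, Int] = Map("a" -> 1, "b" -> 2)

// 2.12-only style that warns/changes type on 2.13:
// val doubled: Map[String, Int] = counts.mapValues(_ * 2)

// Cross-building form, strict on both 2.12 and 2.13:
val doubled: Map[String, Int] = counts.map { case (k, v) => k -> v * 2 }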
### Why are the changes needed?
Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests. (2.13 was not tested; this is about getting it to
compile without breaking 2.12)
Closes #29111 from srowen/SPARK-29292.3.
Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun
<dongjoon@apache.org>
(commit: c28a6fa)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (diff)
The file was modified examples/src/main/scala/org/apache/spark/examples/SparkKMeans.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Variance.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/RobustScaler.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Entropy.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/param/params.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveShowCreateTableSuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala (diff)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala (diff)
The file was modified examples/src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/Estimator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/util/NumericParser.scala (diff)
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala (diff)
Commit db47c6e340a63100d7c0e85abf237adc4e2174cc by gengliang.wang
[SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest
API
### What changes were proposed in this pull request?
Support fetching taskList by status, as below:
```
/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed
```
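For example, the endpoint can be queried against a local history server like this (host, port, and application/stage IDs are placeholders, not from this patch):
```scala
// Fetch failed tasks of attempt 0 of stage 0; prints the raw JSON response.
val url = "http://localhost:18080/api/v1/applications/app-20200716120000-0001/stages/0/0/taskList?status=failed"
val json = scala.io.Source.fromURL(url).mkString
println(json)
```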
### Why are the changes needed?
When there are a large number of tasks in one stage, the current API makes it
hard to fetch the task list by status.
### Does this PR introduce _any_ user-facing change?
Yes. Updated the monitoring doc.
### How was this patch tested?
Added tests in `HistoryServerSuite`.
Closes #28942 from warrenzhu25/SPARK-32125.
Authored-by: Warren Zhu <zhonzh@microsoft.com> Signed-off-by: Gengliang
Wang <gengliang.wang@databricks.com>
(commit: db47c6e)
The file was added core/src/test/resources/HistoryServerExpectations/stage_task_list_w__status___offset___length_expectation.json
The file was modified docs/monitoring.md (diff)
The file was added core/src/main/java/org/apache/spark/status/api/v1/TaskStatus.java
The file was added core/src/test/resources/HistoryServerExpectations/stage_task_list_w__status_expectation.json
The file was modified core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/status/AppStatusStore.scala (diff)
The file was added core/src/test/resources/HistoryServerExpectations/stage_task_list_w__status___sortBy_short_names__runtime_expectation.json
The file was modified core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala (diff)
Commit bdeb626c5a6f16101917140b1d30296e7d35b2ce by wenchen
[SPARK-32272][SQL] Add  SQL standard command SET TIME ZONE
### What changes were proposed in this pull request?
This PR adds the SQL standard command `SET TIME ZONE` to set the default
time zone displacement for the current SQL session, which is the same as
the existing `SET spark.sql.session.timeZone=xxx`.
All in all, this PR adds syntax as follows:
```
SET TIME ZONE LOCAL;
SET TIME ZONE 'valid time zone';  -- zone offset or region
SET TIME ZONE INTERVAL XXXX;      -- xxx must be in [-18, +18] hours; this range is wider than the ANSI [-14, +14]
```
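For illustration, the same commands issued from a Scala session (a hedged sketch; the zone values and the exact interval literal form are illustrative):
```scala
spark.sql("SET TIME ZONE 'America/Los_Angeles'") // region form
spark.sql("SET TIME ZONE INTERVAL 10 HOURS")     // offset form, within [-18, +18] hours
spark.sql("SET TIME ZONE LOCAL")                 // back to the JVM default zone
println(spark.conf.get("spark.sql.session.timeZone"))
```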
### Why are the changes needed?
ANSI compliance, and to supply pure SQL users a way to work with all
supported time zones.
### Does this PR introduce _any_ user-facing change?
Yes, it adds new syntax.
### How was this patch tested?
Added unit tests, and locally verified the reference doc:
![image](https://user-images.githubusercontent.com/8326978/87510244-c8dc3680-c6a5-11ea-954c-b098be84afee.png)
Closes #29064 from yaooqinn/SPARK-32272.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: bdeb626)
The file was modified docs/sql-ref-ansi-compliance.md (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
The file was added sql/core/src/test/resources/sql-tests/results/timezone.sql.out
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala (diff)
The file was modified docs/_data/menu-sql.yaml (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was added sql/core/src/test/resources/sql-tests/inputs/timezone.sql
The file was added docs/sql-ref-syntax-aux-conf-mgmt-set-timezone.md
The file was modified docs/sql-ref-syntax-aux-conf-mgmt.md (diff)
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
Commit 6be8b935a4f7ce0dea2d7aaaf747c2e8e1a9f47a by wenchen
[SPARK-32234][SQL] Spark sql commands are failing on selecting the orc
tables
### What changes were proposed in this pull request?
Spark SQL commands fail when selecting from ORC tables. Steps to reproduce:
Example 1 - Prerequisite: /Users/test/tpcds_scale5data/date_dim is the
location of the ORC data, which was generated by Hive.
```
val table = """CREATE TABLE `date_dim` (
`d_date_sk` INT,
`d_date_id` STRING,
`d_date` TIMESTAMP,
`d_month_seq` INT,
`d_week_seq` INT,
`d_quarter_seq` INT,
`d_year` INT,
`d_dow` INT,
`d_moy` INT,
`d_dom` INT,
`d_qoy` INT,
`d_fy_year` INT,
`d_fy_quarter_seq` INT,
`d_fy_week_seq` INT,
`d_day_name` STRING,
`d_quarter_name` STRING,
`d_holiday` STRING,
`d_weekend` STRING,
`d_following_holiday` STRING,
`d_first_dom` INT,
`d_last_dom` INT,
`d_same_day_ly` INT,
`d_same_day_lq` INT,
`d_current_day` STRING,
`d_current_week` STRING,
`d_current_month` STRING,
`d_current_quarter` STRING,
`d_current_year` STRING) USING orc LOCATION
'/Users/test/tpcds_scale5data/date_dim'"""
spark.sql(table).collect
val u = """select date_dim.d_date_id from date_dim limit 5"""
spark.sql(u).collect
```
Example 2
```
val table = """CREATE TABLE `test_orc_data` (
`_col1` INT,
`_col2` STRING,
`_col3` INT)
USING orc"""
spark.sql(table).collect
spark.sql("insert into test_orc_data values(13, '155', 2020)").collect
val df = """select _col2 from test_orc_data limit 5"""
spark.sql(df).collect
```
It fails with the error below:
```
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0
in stage 2.0 (TID 2, 192.168.0.103, executor driver):
java.lang.ArrayIndexOutOfBoundsException: 1
   at
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
   at
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
   at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
   at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
   at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
   at
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
   at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
Source)
   at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source)
   at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
   at
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
   at
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
   at
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
   at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   at org.apache.spark.scheduler.Task.run(Task.scala:133)
   at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
   at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
   at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
```
The cause is that `initBatch` does not receive the schema needed to resolve
the column values in OrcFileFormat.scala:
```
batchReader.initBatch(
  TypeDescription.fromString(resultSchemaString)
```
### Why are the changes needed?
Spark SQL queries on ORC tables are failing.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
A unit test is added for this. The failing queries were also tested through
spark-shell and spark-submit.
Closes #29045 from SaurabhChawla100/SPARK-32234.
Lead-authored-by: SaurabhChawla <saurabhc@qubole.com> Co-authored-by:
SaurabhChawla <s.saurabhtim@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 6be8b93)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (diff)
Commit c1f160e0972858d727d58ceab5dce0f6b48425dd by gurwls223
[SPARK-30648][SQL] Support filters pushdown in JSON datasource
### What changes were proposed in this pull request?
In the PR, I propose to support pushed down filters in the JSON datasource.
The reason for pushing a filter down to `JacksonParser` is to apply the
filter as soon as all its attributes become available, i.e. converted from
JSON field values to the desired values according to the schema. This allows
skipping the parsing of the rest of the JSON record and the conversions of
other values if the filter returns `false`. This can improve performance
when pushed filters are highly selective and conversion of JSON string
fields to desired values is comparatively expensive (for example, conversion
to `TIMESTAMP` values).
The main idea behind `JsonFilters` is to group pushdown filters by their
references, convert the grouped filters to expressions, and then compile
them to predicates. The predicates are indexed by schema field positions.
Each predicate has a state with a reference counter of not-yet-set row
fields. As soon as the counter reaches `0`, the predicate can be applied to
the row because all of its dependencies have been set. Before processing a
new row, each predicate's reference counter is reset to the total number of
its references (dependencies in a row).
The common code shared between `CSVFilters` and `JsonFilters` is moved
to the `StructFilters` class and its companion object.
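A simplified sketch of that reference-counting scheme (an illustration of the description above, not the actual `JsonFilters` code):
```scala
// One compiled predicate plus a counter of the row fields it still waits for.
final class PredicateWithRefs(totalRefs: Int, eval: () => Boolean) {
  private var refsLeft = totalRefs

  // Called before parsing each new row.
  def reset(): Unit = refsLeft = totalRefs

  // Called whenever one referenced field of the current row has been converted;
  // returns Some(result) once all dependencies are set, None otherwise.
  def onFieldSet(): Option[Boolean] = {
    refsLeft -= 1
    if (refsLeft == 0) Some(eval()) else None
  }
}
```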
### Why are the changes needed?
The changes improve performance on synthetic benchmarks up to **27 times**
on JDK 8 and **25 times** on JDK 11:
```
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
Filters pushdown:        Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------
w/o filters                      25230          25255          22         0.0      252299.6       1.0X
pushdown disabled                25248          25282          33         0.0      252475.6       1.0X
w/ filters                         905            911           8         0.1        9047.9      27.9X
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- Added new test suites `JsonFiltersSuite` and `JacksonParserSuite`.
- By new end-to-end and case sensitivity tests in `JsonSuite`.
- By `CSVFiltersSuite`, `UnivocityParserSuite` and `CSVSuite`.
- Re-running `CSVBenchmark` and `JsonBenchmark` using Amazon EC2:
| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk` |
and `./dev/run-benchmarks`:
```python
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test',
     'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test',
     'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
Closes #27366 from MaxGekk/json-filters-pushdown.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Max
Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: c1f160e)
The file was modified sql/core/benchmarks/CSVBenchmark-jdk11-results.txt (diff)
The file was modified sql/core/benchmarks/CSVBenchmark-results.txt (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVFilters.scala (diff)
The file was modified sql/core/benchmarks/JsonBenchmark-results.txt (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala (diff)
The file was modified sql/core/benchmarks/JsonBenchmark-jdk11-results.txt (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVFiltersSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonPartitionReaderFactory.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/StructFilters.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVScanBuilder.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (diff)
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JacksonParserSuite.scala
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonFilters.scala
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonFiltersSuite.scala
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/StructFiltersSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonScanBuilder.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonScan.scala (diff)
Commit d5c672af589d55318d311593a9add24e02c215f5 by huaxing
[SPARK-32315][ML] Provide an explanation error message when calling
require
### What changes were proposed in this pull request?
Small improvement in the error message shown to the user:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L537-L538
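A sketch of the kind of change described (the values and message are illustrative, not the exact patch code):
```scala
// Illustrative only: attach a message so the failure explains itself.
val norm1 = 1.0
val norm2 = -0.5
require(norm1 >= 0.0 && norm2 >= 0.0,
  s"The norms should be non-negative but got norm1=$norm1, norm2=$norm2")
```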
### Why are the changes needed?
Currently an exception is thrown, but without any specific details on the cause:
```
Caused by: java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:212)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:508)
at org.apache.spark.mllib.clustering.EuclideanDistanceMeasure$.fastSquaredDistance(DistanceMeasure.scala:232)
at org.apache.spark.mllib.clustering.EuclideanDistanceMeasure.isCenterConverged(DistanceMeasure.scala:190)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$runAlgorithm$4.apply(KMeans.scala:336)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$runAlgorithm$4.apply(KMeans.scala:334)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
at org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:334)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:251)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:233)
```
### Does this PR introduce _any_ user-facing change?
Yes, this PR adds an explanatory message to be shown to the user when the
requirement check is not met.
### How was this patch tested?
Manually.
Closes #29115 from dzlab/patch/SPARK-32315.
Authored-by: dzlab <dzlabs@outlook.com> Signed-off-by: Huaxin Gao
<huaxing@us.ibm.com>
(commit: d5c672a)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala (diff)
Commit 383f5e9cbed636ae91b84252a2c249120c698588 by huaxing
[SPARK-32310][ML][PYSPARK] ML params default value parity in
classification, regression, clustering and fpm
### What changes were proposed in this pull request?
Set default param values in the `...Params` traits in both Scala and Python.
I will do this in two PRs: this PR changes classification, regression,
clustering and fpm; the rest will be changed in another PR. A sketch of the
pattern is shown below.
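A hedged sketch of that pattern (the trait and param here are illustrative, not taken from the patch):
```scala
import org.apache.spark.ml.param.{IntParam, Params}

// The default is declared once in the shared Params trait, so the estimator
// and its corresponding model inherit the same value.
trait MyClassifierParams extends Params {
  final val maxIter = new IntParam(this, "maxIter", "maximum number of iterations")
  setDefault(maxIter -> 100)
}
```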
### Why are the changes needed?
Make ML use the same default param values between an estimator and its
corresponding transformer, and also between Scala and Python.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #29112 from huaxingao/set_default.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Huaxin Gao
<huaxing@us.ibm.com>
(commit: 383f5e9)
The file was modified mllib/src/main/scala/org/apache/spark/ml/evaluation/RankingEvaluator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala (diff)
The file was modified python/pyspark/ml/clustering.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala (diff)
The file was modified python/pyspark/ml/regression.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (diff)
The file was modified python/pyspark/ml/classification.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (diff)
The file was modified python/pyspark/ml/fpm.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/FMClassifier.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/evaluation/MultilabelClassificationEvaluator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala (diff)
Commit fb519251237892ad474592d19bf4f193e2a9e2b6 by dongjoon
[SPARK-32335][K8S][TESTS] Remove Python2 test from K8s IT
### What changes were proposed in this pull request?
This PR aims to remove Python 2 test case from K8s IT.
### Why are the changes needed?
Since Apache Spark 3.1.0 dropped Python 2.7, 3.4 and 3.5 support
officially via SPARK-32138, K8s IT fails.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment
variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example *** FAILED ***
The code passed to eventually never returned normally. Attempted 113
times over 2.0014854648999996 minutes. Last failure message: false was
not true. (KubernetesSuite.scala:370)
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
- Run SparkR on simple dataframe.R example
Run completed in 11 minutes, 15 seconds.
Total number of tests run: 20
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass Jenkins K8s IT.
Closes #29136 from dongjoon-hyun/SPARK-32335.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: fb51925)
The file was modified resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PythonTestsSuite.scala (diff)