Changes

Summary

  1. [SPARK-29922][SQL] SHOW FUNCTIONS should do multi-catalog resolution (commit: bca9de66847dab562d44d65a284bf75e7ede6421) (details)
  2. [SPARK-30164][TESTS][DOCS] Exclude Hive domain in Unidoc build (commit: a57bbf2ee02e30053e67f62a0afd6f525bba5c66) (details)
  3. [SPARK-29883][SQL] Implement a helper method for aliasing bool_and() and (commit: dcea7a4c9a04190dffec184eb286e9709faf3272) (details)
  4. [SPARK-30138][SQL] Separate configuration key of max iterations for (commit: c2f29d5ea58eb4565cc5602937d6d0bb75558513) (details)
  5. [SPARK-30159][SQL][TESTS] Fix the method calls of (commit: a717d219a66d0e7b18b8ff392e1e03cd2781c457) (details)
  6. [SPARK-27189][CORE] Add Executor metrics and memory usage (commit: 729f43f499f3dd2718c0b28d73f2ca29cc811eac) (details)
  7. [SPARK-30159][SQL][FOLLOWUP] Fix lint-java via removing unnecessary (commit: 538b8d101cf06b059288f013579dafaafa388bdc) (details)
  8. [SPARK-30146][ML][PYSPARK] Add setWeightCol to GBTs in PySpark (commit: 8a9cccf1f3f4365e40f682bb111ec6c15cbc9be4) (details)
  9. [SPARK-30158][SQL][CORE] Seq -> Array for sc.parallelize for 2.13 (commit: 36fa1980c24c5c697982b107c8f9714f3eb57f36) (details)
  10. [SPARK-30179][SQL][TESTS] Improve test in SingleSessionSuite (commit: 3d98c9f9854c6078d0784d3aa5cc1bb4b5e6a8e8) (details)
  11. [SPARK-30196][BUILD] Bump lz4-java version to 1.7.0 (commit: be867e8a9ee8fc5e4831521770f51793e9265550) (details)
  12. [SPARK-30151][SQL] Issue better error message when user-specified schema (commit: aa9da9365ff31948e42ab4c6dcc6cb4cec5fd852) (details)
  13. [SPARK-29967][ML][PYTHON] KMeans support instance weighting (commit: 1cac9b2cc669b9cc20a07a97f3caba48a3b30f01) (details)
  14. [SPARK-30206][SQL] Rename normalizeFilters in DataSourceStrategy to be (commit: a9f1809a2a1ea84b5c96bc7fd22cda052a270b41) (details)
  15. [SPARK-30125][SQL] Remove PostgreSQL dialect (commit: d9b30694122f8716d3acb448638ef1e2b96ebc7a) (details)
  16. [SPARK-30200][SQL] Add ExplainMode for Dataset.explain (commit: 6103cf196081ab3e63713b623fe2ca3704420616) (details)
  17. [SPARK-28351][SQL][FOLLOWUP] Remove 'DELETE FROM' from (commit: 24c4ce1e6497a7ad80803babd9f11ee54607f7d1) (details)
  18. [SPARK-29587][SQL] Support SQL Standard type real as float(4) numeric as (commit: 8f0eb7dc868f59db6bee4f009bc148c09cf0df57) (details)
  19. [SPARK-30205][PYSPARK] Import ABCs from collections.abc to remove (commit: aec1d95f3b43a9bf349006ea5655d61fad740dd0) (details)
  20. [SPARK-29976][CORE] Trigger speculation for stages with too few tasks (commit: ad238a2238a9d0da89be4424574436cbfaee579d) (details)
Commit bca9de66847dab562d44d65a284bf75e7ede6421 by liangchi
[SPARK-29922][SQL] SHOW FUNCTIONS should do multi-catalog resolution
### What changes were proposed in this pull request?
Add ShowFunctionsStatement and make SHOW FUNCTIONS go through the same
catalog/table resolution framework as v2 commands.
We don't yet have the method needed in the catalog to implement this as
a v2 command:
* catalog.listFunctions
### Why are the changes needed?
It's important for all the commands to have the same table resolution
behavior, to avoid confusion over
`SHOW FUNCTIONS LIKE namespace.function`.
### Does this PR introduce any user-facing change?
Yes. When running `SHOW FUNCTIONS LIKE namespace.function`, Spark fails
the command if the current catalog is set to a v2 catalog.
### How was this patch tested?
Unit tests.
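As a hedged illustration of the new behavior (not from the PR itself;
`testcat` is an assumed, pre-registered v2 catalog name):
```
// Make a v2 catalog the current catalog, then pattern-match functions
// against a namespace; per the change above, this now fails fast.
spark.sql("USE testcat")
spark.sql("SHOW FUNCTIONS LIKE ns1.fun") // => AnalysisException
```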
Closes #26667 from planga82/feature/SPARK-29922_ShowFunctions_V2Catalog.
Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Liang-Chi
Hsieh <liangchi@uber.com>
(commit: bca9de66847dab562d44d65a284bf75e7ede6421)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
Commit a57bbf2ee02e30053e67f62a0afd6f525bba5c66 by gurwls223
[SPARK-30164][TESTS][DOCS] Exclude Hive domain in Unidoc build
explicitly
### What changes were proposed in this pull request?
This PR proposes to exclude the Hive domain from the Unidoc check. We
don't publish it as a part of the Spark documentation (see also
https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30),
and most of it is a copy of the Hive thrift server kept so that we can
officially use the Hive 2.3 release.
It doesn't make much sense to check documentation generation against
another domain that we don't use when publishing documentation.
### Why are the changes needed?
To avoid unnecessary computation.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
By Jenkins:
```
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments:  -Phadoop-2.7
-Phive-2.3 -Phive -Pmesos -Pkubernetes -Phive-thriftserver
-Phadoop-cloud -Pkinesis-asl -Pspark-ganglia-lgpl -Pyarn test:package
streaming-kinesis-asl-assembly/assembly
...
========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:
-Phadoop-2.7 -Phive-2.3 -Phive -Pmesos -Pkubernetes -Phive-thriftserver
-Phadoop-cloud -Pkinesis-asl -Pspark-ganglia-lgpl -Pyarn unidoc
...
[info] Main Java API documentation successful.
...
[info] Main Scala API documentation successful.
```
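A rough sketch of the kind of change this implies in
project/SparkBuild.scala, assuming the usual sbt-unidoc source-filtering
pattern (the key and package path here are illustrative, not the actual
diff):
```
// Illustrative only: drop Hive-domain sources from sbt-unidoc's inputs
// so documentation generation is never checked against them.
unidocAllSources in (JavaUnidoc, unidoc) :=
  (unidocAllSources in (JavaUnidoc, unidoc)).value
    .map(_.filterNot(_.getCanonicalPath.contains("org/apache/hive")))
```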
Closes #26800 from HyukjinKwon/do-not-merge.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: a57bbf2ee02e30053e67f62a0afd6f525bba5c66)
The file was modified project/SparkBuild.scala (diff)
Commit dcea7a4c9a04190dffec184eb286e9709faf3272 by wenchen
[SPARK-29883][SQL] Implement a helper method for aliasing bool_and() and
bool_or()
### What changes were proposed in this pull request? This PR introduces
a method `expressionWithAlias` in class `FunctionRegistry` which is used
to register a function's constructor. Currently, `expressionWithAlias`
is used to register `BoolAnd` & `BoolOr`.
### Why are the changes needed? The error message is wrong when an alias
is used for `BoolAnd` & `BoolOr`.
### Does this PR introduce any user-facing change? No
### How was this patch tested? Tested manually.
For the query
`select every('true');`
Output before this PR,
> Error in query: cannot resolve 'bool_and('true')' due to data type
mismatch: Input to function 'bool_and' should have been boolean, but
it's [string].; line 1 pos 7;
After this PR,
> Error in query: cannot resolve 'every('true')' due to data type
mismatch: Input to function 'every' should have been boolean, but it's
[string].; line 1 pos 7;
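A hedged repro of the example above (assumes a running SparkSession
`spark`):
```
// After this change, the analysis error names the alias the user typed
// ('every'), not the canonical function it maps to ('bool_and').
try {
  spark.sql("select every('true')").collect()
} catch {
  case e: org.apache.spark.sql.AnalysisException => println(e.getMessage)
}
```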
Closes #26712 from amanomer/29883.
Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: dcea7a4c9a04190dffec184eb286e9709faf3272)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ExpressionTypeCheckingSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/group-by.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/UnevaluableAggs.scala (diff)
Commit c2f29d5ea58eb4565cc5602937d6d0bb75558513 by yamamuro
[SPARK-30138][SQL] Separate configuration key of max iterations for
analyzer and optimizer
### What changes were proposed in this pull request? Separate the
configuration keys "spark.sql.optimizer.maxIterations" and
"spark.sql.analyzer.maxIterations".
### Why are the changes needed? Currently, both Analyzer and Optimizer
use conf "spark.sql.optimizer.maxIterations" to set the max iterations
to run, which is a little confusing. It is clearer to add a new conf
"spark.sql.analyzer.maxIterations" for analyzer max iterations.
### Does this PR introduce any user-facing change? no
### How was this patch tested? Existing unit tests.
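As a hedged example of the result, the two fixed-point loops can now be
tuned independently (whether each key is session-settable is not covered
here):
```
// Separate knobs for the analyzer and optimizer iteration caps.
spark.sql("SET spark.sql.analyzer.maxIterations=200")
spark.sql("SET spark.sql.optimizer.maxIterations=100")
```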
Closes #26766 from fuwhu/SPARK-30138.
Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Takeshi Yamamuro
<yamamuro@apache.org>
(commit: c2f29d5ea58eb4565cc5602937d6d0bb75558513)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit a717d219a66d0e7b18b8ff392e1e03cd2781c457 by gurwls223
[SPARK-30159][SQL][TESTS] Fix the method calls of
`QueryTest.checkAnswer`
### What changes were proposed in this pull request?
Before this PR, the method `checkAnswer` in object `QueryTest` returns
an optional string. It doesn't throw exceptions when errors happen. The
actual exceptions are thrown in the trait `QueryTest`.
However, there are some test suites (`StreamSuite`, `SessionStateSuite`,
`BinaryFileFormatSuite`, etc.) that use the no-op method
`QueryTest.checkAnswer` and expect it to fail test cases when the
execution results don't match the expected answers.
After this PR:
1. The method `checkAnswer` in object `QueryTest` fails tests on errors
or unexpected results.
2. A new method `getErrorMessageInCheckAnswer` is added, which is
exactly the same as the previous version of `checkAnswer`; some test
suites use it to customize the test failure message.
3. Test suites that extend the trait `QueryTest` should use the method
`checkAnswer` directly, instead of calling the method on object
`QueryTest`.
### Why are the changes needed?
We should fix these method calls to perform actual validations in test
suites.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing unit tests.
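A hedged sketch of the three call patterns described above (exact
signatures may differ):
```
import org.apache.spark.sql.Row

val df = spark.range(3).toDF("id")
val expectedRows = Seq(Row(0L), Row(1L), Row(2L))

// 1. The object method now fails the test itself on a mismatch:
QueryTest.checkAnswer(df, expectedRows)

// 2. The old Option[String]-returning behavior, for custom messages:
QueryTest.getErrorMessageInCheckAnswer(df, expectedRows).foreach { err =>
  fail(s"custom context: $err")
}

// 3. Inside a suite extending the QueryTest trait, call it directly:
checkAnswer(df, expectedRows)
```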
Closes #26788 from gengliangwang/fixCheckAnswer.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: a717d219a66d0e7b18b8ff392e1e03cd2781c457)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormatSuite.scala (diff)
The file was modified sql/hive/src/test/java/org/apache/spark/sql/hive/JavaMetastoreDataSourcesSuite.java (diff)
The file was modified sql/core/src/test/java/test/org/apache/spark/sql/JavaSaveLoadSuite.java (diff)
The file was modified external/avro/src/test/java/org/apache/spark/sql/avro/JavaAvroFunctionsSuite.java (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/ReduceNumShufflePartitionsSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala (diff)
The file was modified sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala (diff)
Commit 729f43f499f3dd2718c0b28d73f2ca29cc811eac by irashid
[SPARK-27189][CORE] Add Executor metrics and memory usage
instrumentation to the metrics system
### What changes were proposed in this pull request?
This PR proposes to add instrumentation of memory usage via the Spark
Dropwizard/Codahale metrics system. Memory usage metrics are available
via the Executor metrics, recently implemented as detailed in
https://issues.apache.org/jira/browse/SPARK-23206. Additional notes:
This takes advantage of the metrics poller introduced in #23767.
### Why are the changes needed? Executor metrics provide many useful
insights on memory usage, in particular on the usage of storage memory
and executor memory. This is useful for troubleshooting. Having the
information in the metrics system allows these metrics to be added to
Spark performance dashboards and memory usage to be studied as a
function of time, as in the example graph
https://issues.apache.org/jira/secure/attachment/12962810/Example_dashboard_Spark_Memory_Metrics.PNG
### Does this PR introduce any user-facing change? Adds an
`ExecutorMetrics` source to publish executor metrics via the Dropwizard
metrics system. Details of the available metrics are in
docs/monitoring.md. Adds the configuration parameter
`spark.metrics.executormetrics.source.enabled`.
### How was this patch tested?
Tested on YARN cluster and with an existing setup for a Spark dashboard
based on InfluxDB and Grafana.
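For illustration, a hedged sketch of turning the new source on (the key
is named above; any sink wiring is assumed, see docs/monitoring.md):
```
import org.apache.spark.SparkConf

// Enable the executor-metrics Dropwizard source; values then flow to
// whatever metrics sinks are configured.
val conf = new SparkConf()
  .set("spark.metrics.executormetrics.source.enabled", "true")
```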
Closes #24132 from LucaCanali/memoryMetricsSource.
Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Imran
Rashid <irashid@cloudera.com>
(commit: 729f43f499f3dd2718c0b28d73f2ca29cc811eac)
The file was modified core/src/main/scala/org/apache/spark/executor/ExecutorMetricsPoller.scala (diff)
The file was added core/src/main/scala/org/apache/spark/executor/ExecutorMetricsSource.scala
The file was modified docs/monitoring.md (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/metrics/source/SourceConfigSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/SparkContext.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/executor/Executor.scala (diff)
Commit 538b8d101cf06b059288f013579dafaafa388bdc by dhyun
[SPARK-30159][SQL][FOLLOWUP] Fix lint-java via removing unnecessary
imports
### What changes were proposed in this pull request?
This patch fixes the Java code style violations in SPARK-30159 (#26788)
which are caught by lint-java (the GitHub Action caught it and I can
reproduce it locally). It looks like the Jenkins build may have a
different, or less strict, policy for the Java style check.
### Why are the changes needed?
The Java linter started complaining.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
lint-java passed locally
This closes #26819
Closes #26818 from HeartSaVioR/SPARK-30159-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 538b8d101cf06b059288f013579dafaafa388bdc)
The file was modified sql/core/src/test/java/test/org/apache/spark/sql/JavaSaveLoadSuite.java (diff)
The file was modified external/avro/src/test/java/org/apache/spark/sql/avro/JavaAvroFunctionsSuite.java (diff)
The file was modified sql/hive/src/test/java/org/apache/spark/sql/hive/JavaMetastoreDataSourcesSuite.java (diff)
The file was modified sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java (diff)
Commit 8a9cccf1f3f4365e40f682bb111ec6c15cbc9be4 by srowen
[SPARK-30146][ML][PYSPARK] Add setWeightCol to GBTs in PySpark
### What changes were proposed in this pull request? Add
`setWeightCol` and `setMinWeightFractionPerNode` on the Python side of
`GBTClassifier` and `GBTRegressor`.
### Why are the changes needed?
https://github.com/apache/spark/pull/25926 added `setWeightCol` and
`setMinWeightFractionPerNode` in GBTs on the Scala side. This PR adds
`setWeightCol` and `setMinWeightFractionPerNode` in GBTs on the
Python side.
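Since the Python setters mirror the Scala API added in #25926, a hedged
Scala illustration of the same surface:
```
import org.apache.spark.ml.classification.GBTClassifier

// "weight" is an assumed column name; 0.05 is an arbitrary example value.
val gbt = new GBTClassifier()
  .setWeightCol("weight")
  .setMinWeightFractionPerNode(0.05)
```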
### Does this PR introduce any user-facing change? Yes
### How was this patch tested? Doc tests.
Closes #26774 from huaxingao/spark-30146.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
(commit: 8a9cccf1f3f4365e40f682bb111ec6c15cbc9be4)
The file was modified python/pyspark/ml/regression.py (diff)
The file was modified python/pyspark/ml/classification.py (diff)
Commit 36fa1980c24c5c697982b107c8f9714f3eb57f36 by srowen
[SPARK-30158][SQL][CORE] Seq -> Array for sc.parallelize for 2.13
compatibility; remove WrappedArray
### What changes were proposed in this pull request?
Use Seq instead of Array in sc.parallelize, with reference types. Remove
usage of WrappedArray.
### Why are the changes needed?
These both enable building on Scala 2.13.
### Does this PR introduce any user-facing change?
None
### How was this patch tested?
Existing tests
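A hedged before/after sketch of the pattern (assumes a SparkContext
`sc`):
```
// Before (problematic under Scala 2.13 for reference types):
// val rdd = sc.parallelize(Array("a", "b", "c"))

// After: pass a Seq explicitly, avoiding the WrappedArray conversion.
val rdd = sc.parallelize(Seq("a", "b", "c"))
```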
Closes #26787 from srowen/SPARK-30158.
Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean
Owen <srowen@gmail.com>
(commit: 36fa1980c24c5c697982b107c8f9714f3eb57f36)
The file was modified mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala (diff)
The file was removed sql/core/src/test/java/test/org/apache/spark/sql/JavaTestUtils.java
The file was modified examples/src/main/scala/org/apache/spark/examples/mllib/ElementwiseProductExample.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala (diff)
The file was modified sql/core/src/test/java/test/org/apache/spark/sql/JavaHigherOrderFunctionsSuite.java (diff)
The file was modified mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/pmml/PMMLExportable.scala (diff)
Commit 3d98c9f9854c6078d0784d3aa5cc1bb4b5e6a8e8 by gurwls223
[SPARK-30179][SQL][TESTS] Improve test in SingleSessionSuite
### What changes were proposed in this pull request?
Improve the temporary functions test in `SingleSessionSuite` by
verifying the result in a query.
### Why are the changes needed?
### Does this PR introduce any user-facing change?
### How was this patch tested?
Closes #26812 from leoluan2009/SPARK-30179.
Authored-by: Luan <xuluan@ebay.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 3d98c9f9854c6078d0784d3aa5cc1bb4b5e6a8e8)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala (diff)
Commit be867e8a9ee8fc5e4831521770f51793e9265550 by gurwls223
[SPARK-30196][BUILD] Bump lz4-java version to 1.7.0
### What changes were proposed in this pull request?
This PR intends to upgrade lz4-java from 1.6.0 to 1.7.0.
### Why are the changes needed?
This release includes a fix by JoshRosen for a performance bug
(https://github.com/lz4/lz4-java/pull/143) and some improvements (e.g.,
an LZ4 binary update). You can see the changes at the link below:
https://github.com/lz4/lz4-java/blob/master/CHANGES.md#170
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests.
Closes #26823 from maropu/LZ4_1_7_0.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: be867e8a9ee8fc5e4831521770f51793e9265550)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified pom.xml (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-1.2 (diff)
Commit aa9da9365ff31948e42ab4c6dcc6cb4cec5fd852 by wenchen
[SPARK-30151][SQL] Issue better error message when user-specified schema
mismatched
### What changes were proposed in this pull request?
Issue a better error message when the user-specified schema does not
match the relation schema.
### Why are the changes needed?
Inspired by
https://github.com/apache/spark/pull/25248#issuecomment-559594305: a
user could get a weird error message when the type mapping behavior
differs between the Spark schema and the datasource schema (e.g. JDBC).
Instead of saying "SomeProvider does not allow user-specified schemas.",
we'd better tell the user what is really happening, to make the error
clearer.
### Does this PR introduce any user-facing change?
Yes, users will see error message changes.
### How was this patch tested?
Updated existing tests.
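A hedged sketch of the scenario from the description (the JDBC URL and
schema below are placeholders):
```
// A user-specified schema that disagrees with the relation's own schema
// now produces a message describing the mismatch, rather than
// "... does not allow user-specified schemas.".
spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host/db") // placeholder
  .option("dbtable", "t")
  .schema("id DECIMAL(38, 0)") // differs from what the source reports
  .load()
```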
Closes #26781 from Ngone51/dev-mismatch-schema.
Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: aa9da9365ff31948e42ab4c6dcc6cb4cec5fd852)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/TableScanSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala (diff)
Commit 1cac9b2cc669b9cc20a07a97f3caba48a3b30f01 by srowen
[SPARK-29967][ML][PYTHON] KMeans support instance weighting
### What changes were proposed in this pull request? Add weight support
in KMeans.
### Why are the changes needed? KMeans should support weighting
### Does this PR introduce any user-facing change? Yes:
`KMeans.setWeightCol`.
### How was this patch tested? Unit tests.
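A hedged Scala illustration of the new setter ("weight" is an assumed
column name):
```
import org.apache.spark.ml.clustering.KMeans

// Rows with larger values in the weight column pull their cluster
// centers proportionally harder.
val kmeans = new KMeans()
  .setK(3)
  .setWeightCol("weight")
```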
Closes #26739 from huaxingao/spark-29967.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
(commit: 1cac9b2cc669b9cc20a07a97f3caba48a3b30f01)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala (diff)
The file was modified python/pyspark/ml/clustering.py (diff)
Commit a9f1809a2a1ea84b5c96bc7fd22cda052a270b41 by dhyun
[SPARK-30206][SQL] Rename normalizeFilters in DataSourceStrategy to be
generic
### What changes were proposed in this pull request?
This PR renames `normalizeFilters` in `DataSourceStrategy` to be more
generic as the logic is not specific to filters.
### Why are the changes needed?
These changes are needed to support PR #26751.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #26830 from aokolnychyi/rename-normalize-exprs.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by:
Dongjoon Hyun <dhyun@apple.com>
(commit: a9f1809a2a1ea84b5c96bc7fd22cda052a270b41)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
Commit d9b30694122f8716d3acb448638ef1e2b96ebc7a by wenchen
[SPARK-30125][SQL] Remove PostgreSQL dialect
### What changes were proposed in this pull request? Reprocess all
PostgreSQL-dialect-related PRs, listed in order:
- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique
prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date subtraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to
"spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of the PostgreSQL dialect independent of the
ANSI mode config [keep the ANSI-compliant part]
### Why are the changes needed? As per the discussion in
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html,
we need to remove the PostgreSQL dialect from the code base for several
reasons: 1. The current approach makes the codebase complicated and hard
to maintain. 2. Fully migrating PostgreSQL workloads to Spark SQL is not
our focus for now.
### Does this PR introduce any user-facing change? Yes, the config
`spark.sql.dialect` will be removed.
### How was this patch tested? Existing UT.
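One hedged practical consequence of reverting #26444, noted above: the
ANSI flag returns to its original key:
```
// spark.sql.dialect.spark.ansi.enabled is gone; the restored key
// governs ANSI behavior again.
spark.sql("SET spark.sql.ansi.enabled=true")
```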
Closes #26763 from xuanyuanking/SPARK-30125.
Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by:
Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: d9b30694122f8716d3acb448638ef1e2b96ebc7a)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/window_part1.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (diff)
The file was removed sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/PostgreSQLDialect.scala
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/boolean.sql.out (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ExpressionParserSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/udf/postgreSQL/udf-case.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala (diff)
The file was removed sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/postgreSQL/CastSuite.scala
The file was modified sql/core/src/test/resources/sql-tests/inputs/postgreSQL/boolean.sql (diff)
The file was removed sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/postgreSQL/StringUtils.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DecimalExpressionSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/TableIdentifierParserSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/date.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/case.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/udf/postgreSQL/udf-select_implicit.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala (diff)
The file was removed sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/postgreSQL/PostgreCastToBoolean.scala
The file was modified docs/sql-keywords.md (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/datetime.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/RowEncoderSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDFSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/int8.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/select_implicit.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/int4.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (diff)
The file was removed sql/core/src/test/scala/org/apache/spark/sql/PostgreSQLDialectQuerySuite.scala
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/int2.sql.out (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala (diff)
Commit 6103cf196081ab3e63713b623fe2ca3704420616 by dhyun
[SPARK-30200][SQL] Add ExplainMode for Dataset.explain
### What changes were proposed in this pull request?
This PR intends to add `ExplainMode` for explaining a
`Dataset`/`DataFrame` with a given format mode (`ExplainMode`).
`ExplainMode` has five modes, matching the SQL EXPLAIN command:
`Simple`, `Extended`, `Codegen`, `Cost`, and `Formatted`.
For example, this PR enables users to explain a DataFrame/Dataset with
the `FORMATTED` format implemented in #24759:
```
scala> spark.range(10).groupBy("id").count().explain(ExplainMode.Formatted)
== Physical Plan ==
* HashAggregate (3)
+- * HashAggregate (2)
   +- * Range (1)

(1) Range [codegen id : 1]
Output: [id#0L]

(2) HashAggregate [codegen id : 1]
Input: [id#0L]

(3) HashAggregate [codegen id : 1]
Input: [id#0L, count#8L]
```
This comes from [the cloud-fan
suggestion](https://github.com/apache/spark/pull/24759#issuecomment-560211270).
### Why are the changes needed?
To follow the SQL EXPLAIN command.
### Does this PR introduce any user-facing change?
No, this is just for a new API in Dataset.
### How was this patch tested?
Add tests in `ExplainSuite`.
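The other modes follow the same call shape as the `Formatted` example
above (a hedged illustration; the exact constants live in the added
ExplainMode.java):
```
import org.apache.spark.sql.ExplainMode

val q = spark.range(10).groupBy("id").count()
q.explain(ExplainMode.Extended) // logical and physical plans
q.explain(ExplainMode.Codegen)  // generated code
q.explain(ExplainMode.Cost)     // plan statistics, when available
```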
Closes #26829 from maropu/DatasetExplain.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by:
Dongjoon Hyun <dhyun@apple.com>
(commit: 6103cf196081ab3e63713b623fe2ca3704420616)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was added sql/core/src/main/java/org/apache/spark/sql/ExplainMode.java
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala (diff)
Commit 24c4ce1e6497a7ad80803babd9f11ee54607f7d1 by dhyun
[SPARK-28351][SQL][FOLLOWUP] Remove 'DELETE FROM' from
unsupportedHiveNativeCommands
### What changes were proposed in this pull request?
Minor change: remove `DELETE FROM` from the unsupported Hive native
operations, because it is supported by the parser.
### Why are the changes needed? To clear up an ambiguity.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
No.
Closes #26836 from yaooqinn/SPARK-28351.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 24c4ce1e6497a7ad80803babd9f11ee54607f7d1)
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
Commit 8f0eb7dc868f59db6bee4f009bc148c09cf0df57 by wenchen
[SPARK-29587][SQL] Support SQL Standard type real as float(4) numeric as
decimal
### What changes were proposed in this pull request? The types decimal
and numeric are equivalent; both are part of the SQL standard.
The real type is 4 bytes, variable-precision, inexact, with 6 decimal
digits of precision, the same as our float, and also part of the SQL
standard.
### Why are the changes needed?
Improve SQL standard support; for reference, other databases:
https://www.postgresql.org/docs/9.3/datatype-numeric.html
https://prestodb.io/docs/current/language/types.html#floating-point
http://www.sqlservertutorial.net/sql-server-basics/sql-server-data-types/
MySQL treats REAL as a synonym for DOUBLE PRECISION (a nonstandard
variation), unless the REAL_AS_FLOAT SQL mode is enabled. In MySQL,
NUMERIC is implemented as DECIMAL, so the following remarks about
DECIMAL apply equally to NUMERIC.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added unit tests.
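A hedged SQL-level illustration of the aliases (table name and provider
are assumptions):
```
// REAL parses as FLOAT and NUMERIC(10, 2) as DECIMAL(10, 2), per the
// standard-type mapping described above.
spark.sql("CREATE TABLE t (a REAL, b NUMERIC(10, 2)) USING parquet")
spark.sql("DESCRIBE t").show()
```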
Closes #26537 from yaooqinn/SPARK-29587.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 8f0eb7dc868f59db6bee4f009bc148c09cf0df57)
The file was modified sql/core/src/test/resources/sql-tests/results/show-create-table.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/show-create-table.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
Commit aec1d95f3b43a9bf349006ea5655d61fad740dd0 by dhyun
[SPARK-30205][PYSPARK] Import ABCs from collections.abc to remove
deprecation warnings
### What changes were proposed in this pull request?
This PR aims to remove deprecation warnings by importing ABCs from
`collections.abc` instead of `collections`.
- https://github.com/python/cpython/pull/10596
### Why are the changes needed?
This will remove deprecation warnings in Python 3.7 and 3.8.
```
$ python -V
Python 3.7.5
$ python python/pyspark/resultiterable.py
python/pyspark/resultiterable.py:23: DeprecationWarning: Using or
importing the ABCs from 'collections' instead of from 'collections.abc'
is deprecated since Python 3.3,and in 3.9 it will stop working
class ResultIterable(collections.Iterable):
```
### Does this PR introduce any user-facing change?
No, this doesn't introduce a user-facing change.
### How was this patch tested?
Manually because this is about deprecation warning messages.
Closes #26835 from tirkarthi/spark-30205-fix-abc-warnings.
Authored-by: Karthikeyan Singaravelan <tir.karthi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: aec1d95f3b43a9bf349006ea5655d61fad740dd0)
The file was modified python/pyspark/resultiterable.py (diff)
Commit ad238a2238a9d0da89be4424574436cbfaee579d by tgraves
[SPARK-29976][CORE] Trigger speculation for stages with too few tasks
### What changes were proposed in this pull request? This PR adds an
optional Spark conf for speculation to allow speculative runs for stages
with only a few tasks:
```
spark.speculation.task.duration.threshold
```
If provided, tasks are run speculatively if the TaskSet contains fewer
tasks than the number of slots on a single executor and a task is taking
longer than the threshold.
### Why are the changes needed? This change helps avoid scenarios where
a single executor hangs forever due to a disk issue; if the single task
in a TaskSet is unfortunately assigned to that executor, the whole job
hangs forever.
### Does this PR introduce any user-facing change? Yes. If the new
config `spark.speculation.task.duration.threshold` is provided, the
TaskSet contains fewer tasks than the number of slots on a single
executor, and a task is taking longer than the threshold, then
speculative tasks are submitted for the running tasks in the TaskSet.
### How was this patch tested? Unit tests are added to
TaskSetManagerSuite.
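A hedged configuration sketch (speculation itself must also be enabled;
the threshold value is an arbitrary example):
```
import org.apache.spark.SparkConf

// Any task running past the threshold in a TaskSet smaller than one
// executor's slot count gets a speculative copy.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.task.duration.threshold", "30min")
```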
Closes #26614 from yuchenhuo/SPARK-29976.
Authored-by: Yuchen Huo <yuchen.huo@databricks.com> Signed-off-by:
Thomas Graves <tgraves@apache.org>
(commit: ad238a2238a9d0da89be4424574436cbfaee579d)
The file was modified core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala (diff)
The file was modified docs/configuration.md (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala (diff)