SuccessChanges

Summary

  1. [SPARK-32490][BUILD] Upgrade netty-all to 4.1.51.Final (commit: 0693d8bbf2942ab96ffe705ef0fc3fe4b0d9ec11) (details)
  2. [SPARK-32274][SQL] Make SQL cache serialization pluggable (commit: 713124d5e32fb5984ef3c15ed655284b58399032) (details)
  3. [SPARK-32510][SQL] Check duplicate nested columns in read from JDBC (commit: fda397d9c8ff52f9e0785da8f470c37ed40af616) (details)
  4. [SPARK-32509][SQL] Ignore unused DPP True Filter in Canonicalization (commit: 7a09e71198a094250f04e0f82f0c7c9860169540) (details)
  5. [SPARK-24884][SQL] Support regexp function regexp_extract_all (commit: 42f9ee4c7deb83a915fad070af5eedab56399382) (details)
  6. [SPARK-31709][SQL] Proper base path for database/table location when it (commit: 3deb59d5c29fada8f08b0068fbaf3ee706cea312) (details)
  7. [SPARK-32492][SQL] Fulfill missing column meta information COLUMN_SIZE (commit: 7f5326c082081af465deb81c368306e569325b53) (details)
  8. [SPARK-32257][SQL] Reports explicit errors for invalid usage of (commit: c6109ba9181520359222fb032d989f266d3221d8) (details)
  9. [SPARK-32310][ML][PYSPARK] ML params default value parity in feature and (commit: bc7885901dd99de21ecbf269d72fa37a393b2ffc) (details)
  10. [SPARK-32290][SQL][FOLLOWUP] Add version for the SQL config (commit: f3b10f526b2e700680b7c56bfdaffbcf0eb0f269) (details)
  11. [MINOR][SQL] Fix versions in the SQL migration guide for Spark 3.1 (commit: 9bbe8c7418137e03df8bfbd80e7c569192d3711a) (details)
  12. [SPARK-32160][CORE][PYSPARK][FOLLOWUP] Change the config name to switch (commit: 7deb67c28f948cca4e768317ade6d68d2534408f) (details)
  13. [SPARK-30276][SQL] Support Filter expression allows simultaneous use of (commit: 1597d8fcd4c68e723eb3152335298c7d05155643) (details)
  14. [SPARK-32468][SS][TESTS][FOLLOWUP] Provide "default.api.timeout.ms" as (commit: 005ef3a5b8715b38874888f2768c463b60c704f8) (details)
  15. [SPARK-32524][SQL][TESTS] CachedBatchSerializerSuite should clean up (commit: 7fec6e0c16409235a40ee7bb1cc7e0eae7751d69) (details)
  16. [SPARK-32521][SQL] Bug-fix: WithFields Expression should not be foldable (commit: 6d690680576ba58c35e6fbc86d37b45fef1c50d9) (details)
  17. [SPARK-23431][CORE] Expose stage level peak executor metrics via REST (commit: 171b7d5d7114c516be722274dbc433a0897b62c0) (details)
  18. [SPARK-32499][SQL] Use `{}` in conversions maps and structs to strings (commit: 7eb6f45688a05cb426edba889f220b3ffc5d946d) (details)
  19. [SPARK-32525][DOCS] The layout of monitoring.html is broken (commit: 0660a0501d28c9a24cb537ebaee2d8f0a78fea17) (details)
Commit 0693d8bbf2942ab96ffe705ef0fc3fe4b0d9ec11 by dongjoon
[SPARK-32490][BUILD] Upgrade netty-all to 4.1.51.Final
### What changes were proposed in this pull request? This PR aims to
bring the bug fixes from the latest netty version.
### Why are the changes needed?
- 4.1.48.Final: [https://github.com/netty/netty/milestone/223?closed=1](https://github.com/netty/netty/milestone/223?closed=1) (14 patches or issues)
- 4.1.49.Final: [https://github.com/netty/netty/milestone/224?closed=1](https://github.com/netty/netty/milestone/224?closed=1) (48 patches or issues)
- 4.1.50.Final: [https://github.com/netty/netty/milestone/225?closed=1](https://github.com/netty/netty/milestone/225?closed=1) (38 patches or issues)
- 4.1.51.Final: [https://github.com/netty/netty/milestone/226?closed=1](https://github.com/netty/netty/milestone/226?closed=1) (53 patches or issues)
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Pass the Jenkins with the existing tests.
Closes #29299 from LuciferYang/upgrade-netty-version.
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 0693d8bbf2942ab96ffe705ef0fc3fe4b0d9ec11)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-1.2 (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified pom.xml (diff)
Commit 713124d5e32fb5984ef3c15ed655284b58399032 by wenchen
[SPARK-32274][SQL] Make SQL cache serialization pluggable
### What changes were proposed in this pull request?
Add a config to let users change how SQL/Dataframe data is compressed
when cached.
This adds a few new classes/APIs for use with this config.
1. `CachedBatch` is a trait used to tag data that is intended to be cached. It has a few APIs that let us keep the compression/serialization of the data separate from the metrics about it.
2. `CachedBatchSerializer` provides the APIs that must be implemented to cache data.
  * `convertForCache` is an API that runs a cached Spark plan and turns its result into an `RDD[CachedBatch]`. The actual caching is done outside of this API.
  * `buildFilter` is an API that takes a set of predicates and builds a filter function that can be used to filter the `RDD[CachedBatch]` returned by `convertForCache`.
  * `decompressColumnar` decompresses an `RDD[CachedBatch]` into an `RDD[ColumnarBatch]`. This is only used for a limited set of data types. These data types may expand in the future; if they do, we can add a new API with a default value that says which data types this serializer supports.
  * `decompressToRows` decompresses an `RDD[CachedBatch]` into an `RDD[InternalRow]`. This API, like `decompressColumnar`, decompresses the data in `CachedBatch` but turns it into `InternalRow`s, typically using code generation for performance reasons.
There are also APIs that let you reuse the current filtering based on min/max values: `SimpleMetricsCachedBatch` and `SimpleMetricsCachedBatchSerializer`.
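For illustration, a minimal sketch of how a user might plug in a custom serializer. The config key and the `com.example.MyCachedBatchSerializer` class are assumptions (the real key is the new static SQL config added in `StaticSQLConf`), not the exact shipped API.
```scala
import org.apache.spark.sql.SparkSession

// Hypothetical user implementation of the new CachedBatchSerializer trait is registered
// via a static SQL config before the session starts; the key below is an assumption.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pluggable-cache-serializer-sketch")
  .config("spark.sql.cache.serializer", "com.example.MyCachedBatchSerializer")
  .getOrCreate()

// Subsequent .cache()/.persist() calls now go through the plugged-in serializer
// instead of the built-in columnar one.
spark.range(0, 100).toDF("id").cache().count()
```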
### Why are the changes needed?
This lets users explore different types of compression and compression
ratios.
### Does this PR introduce _any_ user-facing change?
This adds in a single config, and exposes some developer API classes
described above.
### How was this patch tested?
I ran the unit tests around this and I also did some manual performance tests. I could not find any performance difference between the old and new code, and if there is any it is within error.
Closes #29067 from revans2/pluggable_cache_serializer.
Authored-by: Robert (Bobby) Evans <bobby@apache.org> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: 713124d5e32fb5984ef3c15ed655284b58399032)
The file was added sql/core/src/main/scala/org/apache/spark/sql/columnar/CachedBatchSerializer.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/GenerateColumnAccessor.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/CachedBatchSerializerSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala (diff)
Commit fda397d9c8ff52f9e0785da8f470c37ed40af616 by wenchen
[SPARK-32510][SQL] Check duplicate nested columns in read from JDBC
datasource
### What changes were proposed in this pull request? Check that there are no duplicate column names on the same level (top level or nested levels) when reading from a JDBC datasource. If such duplicate columns exist, throw the exception:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value:
```
The check takes into account the SQL config `spark.sql.caseSensitive` (`false` by default).
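For illustration, a hedged sketch of how the new check surfaces through the `customSchema` option; the JDBC URL and table name are assumptions, and the duplicate is deliberately placed inside the nested struct.
```scala
// Hypothetical JDBC read (e.g. from spark-shell) whose customSchema declares two fields
// inside the same struct differing only in case; with spark.sql.caseSensitive=false
// (the default) this should now fail with the AnalysisException quoted above.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:h2:mem:testdb")   // assumed test database
  .option("dbtable", "people")           // assumed table
  .option("customSchema", "id INT, info STRUCT<camelcase: STRING, camelCase: STRING>")
  .load()
```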
### Why are the changes needed? To make handling of duplicate nested columns similar to handling of duplicate top-level columns, i.e. output the same error:
```Scala
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value: `camelcase`
```
Checking of top-level duplicates was introduced by
https://github.com/apache/spark/pull/17758, and duplicates in nested
structures by https://github.com/apache/spark/pull/29234.
### Does this PR introduce _any_ user-facing change? Yes.
### How was this patch tested? Added new test suite
`JdbcNestedDataSourceSuite`.
Closes #29317 from MaxGekk/jdbc-dup-nested-columns.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: fda397d9c8ff52f9e0785da8f470c37ed40af616)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCNestedDataSourceSuite.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/NestedDataSourceSuite.scala (diff)
Commit 7a09e71198a094250f04e0f82f0c7c9860169540 by wenchen
[SPARK-32509][SQL] Ignore unused DPP True Filter in Canonicalization
### What changes were proposed in this pull request? This PR fixes issues related to canonicalization of FileSourceScanExec when it contains an unused DPP filter.
### Why are the changes needed?
As part of the PlanDynamicPruningFilter rule, unused DPP filters are simply replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be avoided. But these unnecessary `DynamicPruningExpression(TrueLiteral)` partition filters inside the FileSourceScanExec affect the canonicalization of the node, and so in many cases this can prevent ReuseExchange from happening. This PR fixes this issue by ignoring the unused DPP filter in the `doCanonicalize` method.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Added UT.
Closes #29318 from prakharjain09/SPARK-32509_df_reuse.
Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: 7a09e71198a094250f04e0f82f0c7c9860169540)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala (diff)
Commit 42f9ee4c7deb83a915fad070af5eedab56399382 by wenchen
[SPARK-24884][SQL] Support regexp function regexp_extract_all
### What changes were proposed in this pull request?
`regexp_extract_all` is a very useful function that expands the capabilities of `regexp_extract`. Some examples of this function:
```
SELECT regexp_extract('1a 2b 14m', '\d+', 0); -- 1
SELECT regexp_extract_all('1a 2b 14m', '\d+', 0); -- [1, 2, 14]
SELECT regexp_extract('1a 2b 14m', '(\d+)([a-z]+)', 2); -- 'a'
SELECT regexp_extract_all('1a 2b 14m', '(\d+)([a-z]+)', 2); -- ['a', 'b', 'm']
```
Some mainstream databases support this syntax.
**Presto:** https://prestodb.io/docs/current/functions/regexp.html
**Pig:** https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html
Note: This PR picks up the work of https://github.com/apache/spark/pull/21985
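For illustration, a short sketch of the same example run through `spark.sql` from Scala; the rendered output in the comment is illustrative.
```scala
// Equivalent of the SQL examples above, e.g. run from spark-shell.
spark.sql("""SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 2) AS letters""")
  .show(false)
// +---------+
// |letters  |
// +---------+
// |[a, b, m]|
// +---------+
```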
### Why are the changes needed?
`regexp_extract_all` is a very useful function and makes work easier.
### Does this PR introduce any user-facing change? No
### How was this patch tested? New UT
Closes #27507 from beliefer/support-regexp_extract_all.
Lead-authored-by: beliefer <beliefer@163.com> Co-authored-by: gengjiaan
<gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 42f9ee4c7deb83a915fad070af5eedab56399382)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/regexp-functions.sql.out (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/regexp-functions.sql (diff)
Commit 3deb59d5c29fada8f08b0068fbaf3ee706cea312 by wenchen
[SPARK-31709][SQL] Proper base path for database/table location when it
is a relative path
### What changes were proposed in this pull request?
Currently, the user home directory is used as the base path for the database and table locations when their locations are specified with relative paths, e.g.
```sql
spark-sql> set spark.sql.warehouse.dir;
spark.sql.warehouse.dir    file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/spark-warehouse/

spark-sql> create database loctest location 'loctestdbdir';
spark-sql> desc database loctest;
Database Name    loctest
Comment
Location    file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
Owner    kentyao

spark-sql> create table loctest(id int) location 'loctestdbdir';
spark-sql> desc formatted loctest;
id    int    NULL

# Detailed Table Information
Database    default
Table    loctest
Owner    kentyao
Created Time    Thu May 14 16:29:05 CST 2020
Last Access    UNKNOWN
Created By    Spark 3.1.0-SNAPSHOT
Type    EXTERNAL
Provider    parquet
Location    file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
Serde Library    org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat    org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```
The user home is not always warehouse-related, cannot be changed at runtime, and is shared by both databases and tables as the parent directory. Meanwhile, we use the table path as the parent directory for relative partition locations.
The config `spark.sql.warehouse.dir` represents `the default location for managed databases and tables`. For databases, the case above does not seem to follow its semantics, because it should use `spark.sql.warehouse.dir` as the base path instead.
For tables, it seems to be right, but here I suggest enriching the meaning so that it also serves as the base path for external tables with relative location paths.
With the changes in this PR, the location of a database will be `warehouseDir/dbpath` when `dbpath` is relative, and the location of a table will be `dbpath/tblpath` when `tblpath` is relative.
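For illustration, a condensed sketch of the new resolution rules; the resolved paths in the comments are illustrative, assuming a local warehouse at `.../spark-warehouse`.
```scala
// New base-path resolution for relative LOCATION clauses (run e.g. in spark-shell).
spark.sql("CREATE DATABASE loctest LOCATION 'loctestdbdir'")
// database location resolves under spark.sql.warehouse.dir:
//   file:/.../spark-warehouse/loctestdbdir
spark.sql("USE loctest")
spark.sql("CREATE TABLE loctest(id INT) USING parquet LOCATION 'loctesttbl'")
// table location resolves under the database path:
//   file:/.../spark-warehouse/loctestdbdir/loctesttbl
```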
### Why are the changes needed?
bugfix and improvement
Firstly, the databases with relative locations should be created under
the default location specified by `spark.sql.warehouse.dir`.
Secondly, the external tables with relative paths may also follow this
behavior for consistency.
Lastly, the behavior of databases, tables and partitions with relative paths when choosing base paths should be the same.
### Does this PR introduce _any_ user-facing change?
Yes, this PR changes the `createDatabase`, `alterDatabase`,
`createTable` and `alterTable` APIs and related DDLs. If the LOCATION
clause is followed by a relative path, the root path will be
`spark.sql.warehouse.dir` for databases, and `spark.sql.warehouse.dir` /
`dbPath` for tables.
e.g.
#### after
```sql
spark-sql> desc database loctest;
Database Name    loctest
Comment
Location    file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest
Owner    kentyao

spark-sql> use loctest;
spark-sql> create table loctest(id int) location 'loctest';
20/05/14 18:18:02 WARN InMemoryFileIndex: The directory file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/loctest was not found. Was it deleted very recently?
20/05/14 18:18:02 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist

spark-sql> desc formatted loctest;
id    int    NULL

# Detailed Table Information
Database    loctest
Table    loctest
Owner    kentyao
Created Time    Thu May 14 18:18:03 CST 2020
Last Access    UNKNOWN
Created By    Spark 3.1.0-SNAPSHOT
Type    EXTERNAL
Provider    parquet
Location    file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest/loctest
Serde Library    org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat    org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

spark-sql> alter table loctest set location 'loctest2'
         > ;
spark-sql> desc formatted loctest;
id    int    NULL

# Detailed Table Information
Database    loctest
Table    loctest
Owner    kentyao
Created Time    Thu May 14 18:18:03 CST 2020
Last Access    UNKNOWN
Created By    Spark 3.1.0-SNAPSHOT
Type    EXTERNAL
Provider    parquet
Location    file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest/loctest2
Serde Library    org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat    org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```
### How was this patch tested?
Add unit tests.
Closes #28527 from yaooqinn/SPARK-31709.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 3deb59d5c29fada8f08b0068fbaf3ee706cea312)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalogSuite.scala (diff)
Commit 7f5326c082081af465deb81c368306e569325b53 by wenchen
[SPARK-32492][SQL] Fulfill missing column meta information COLUMN_SIZE
/DECIMAL_DIGITS/NUM_PREC_RADIX/ORDINAL_POSITION for thriftserver client
tools
### What changes were proposed in this pull request?
This PR fulfills some missing fields for SparkGetColumnsOperation, including COLUMN_SIZE/DECIMAL_DIGITS/NUM_PREC_RADIX/ORDINAL_POSITION, and improves the test coverage.
### Why are the changes needed?
Make JDBC tools happier.
### Does this PR introduce _any_ user-facing change?
yes,
#### before
![image](https://user-images.githubusercontent.com/8326978/88911764-e78b2180-d290-11ea-8abb-96f137f9c3c4.png)
#### after
![image](https://user-images.githubusercontent.com/8326978/88911709-d04c3400-d290-11ea-90ab-02bda3e628e9.png)
![image](https://user-images.githubusercontent.com/8326978/88912007-39cc4280-d291-11ea-96d6-1ef3abbbddec.png)
### How was this patch tested?
add unit tests
Closes #29303 from yaooqinn/SPARK-32492.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 7f5326c082081af465deb81c368306e569325b53)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerWithSparkContextSuite.scala (diff)
Commit c6109ba9181520359222fb032d989f266d3221d8 by wenchen
[SPARK-32257][SQL] Reports explicit errors for invalid usage of
SET/RESET command
### What changes were proposed in this pull request?
This PR modifies the parser code to handle invalid usages of a SET/RESET command. For example:
```
SET spark.sql.ansi.enabled true
```
The above SQL command does not change the configuration value; it just tries to display the value of the configuration `spark.sql.ansi.enabled true`. This PR disallows using special characters, including spaces, in the configuration name and reports a user-friendly error instead. In the error message, it tells users a workaround to use quotes or a string literal if they still need to specify a configuration name with them.
Before this PR:
```
scala> sql("SET spark.sql.ansi.enabled true").show(1, -1)
+---------------------------+-----------+
|key                        |value      |
+---------------------------+-----------+
|spark.sql.ansi.enabled true|<undefined>|
+---------------------------+-----------+
```
After this PR:
```
scala> sql("SET spark.sql.ansi.enabled true")
org.apache.spark.sql.catalyst.parser.ParseException: Expected format is
'SET', 'SET key', or 'SET key=value'. If you want to include special
characters in key, please use quotes, e.g., SET `ke y`=value.(line 1,
pos 0)
== SQL ==
SET spark.sql.ansi.enabled true
^^^
```
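For illustration, a hedged sketch of the workaround the new error message suggests: wrap the key in backquotes (or a string literal) if it genuinely contains spaces or other special characters; the quoted key is then looked up verbatim.
```scala
// Quoted form of the example above (e.g. in spark-shell); such a key would simply be
// reported as <undefined> if it was never set.
sql("SET `spark.sql.ansi.enabled true`").show(1, -1)
```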
### Why are the changes needed?
For better user-friendly errors.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests in `SparkSqlParserSuite`.
Closes #29146 from maropu/SPARK-32257.
Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by:
Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: c6109ba9181520359222fb032d989f266d3221d8)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala (diff)
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala (diff)
Commit bc7885901dd99de21ecbf269d72fa37a393b2ffc by huaxing
[SPARK-32310][ML][PYSPARK] ML params default value parity in feature and
tuning
### What changes were proposed in this pull request? Set params default values in trait Params for feature and tuning, in both Scala and Python.
### Why are the changes needed? Make ML have the same default param values between an estimator and its corresponding transformer, and also between Scala and Python.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Existing and modified tests
Closes #29153 from huaxingao/default2.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Huaxin Gao
<huaxing@us.ibm.com>
(commit: bc7885901dd99de21ecbf269d72fa37a393b2ffc)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala (diff)
The file was modified python/pyspark/ml/feature.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/RobustScaler.scala (diff)
The file was modified python/pyspark/ml/tuning.py (diff)
The file was modified python/pyspark/ml/tests/test_param.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala (diff)
The file was modified python/pyspark/ml/classification.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala (diff)
The file was modified python/pyspark/ml/fpm.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala (diff)
The file was modified python/pyspark/ml/recommendation.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala (diff)
The file was modified python/pyspark/ml/regression.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala (diff)
The file was modified python/pyspark/ml/clustering.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala (diff)
Commit f3b10f526b2e700680b7c56bfdaffbcf0eb0f269 by wenchen
[SPARK-32290][SQL][FOLLOWUP] Add version for the SQL config
`spark.sql.optimizeNullAwareAntiJoin`
### What changes were proposed in this pull request? Add the version
`3.1.0` for the SQL config `spark.sql.optimizeNullAwareAntiJoin`.
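For illustration, a minimal sketch of the pattern this follow-up applies in SQLConf.scala: attach the release that introduced the entry via `.version(...)`. The doc text and default value below are illustrative; only the `.version("3.1.0")` call is the substance of the change.
```scala
// Sketch of a config entry definition inside the SQLConf object.
val OPTIMIZE_NULL_AWARE_ANTI_JOIN =
  buildConf("spark.sql.optimizeNullAwareAntiJoin")
    .doc("Enables the single-column null-aware anti join rewrite (illustrative doc text).")
    .version("3.1.0")   // the version tag added by this follow-up
    .booleanConf
    .createWithDefault(true)
```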
### Why are the changes needed? To inform users when the config was
added, for example on the page
http://spark.apache.org/docs/latest/configuration.html.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? By compiling and running
`./dev/scalastyle`.
Closes #29335 from MaxGekk/leanken-SPARK-32290-followup.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: f3b10f526b2e700680b7c56bfdaffbcf0eb0f269)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit 9bbe8c7418137e03df8bfbd80e7c569192d3711a by gurwls223
[MINOR][SQL] Fix versions in the SQL migration guide for Spark 3.1
### What changes were proposed in this pull request? Change _To restore the behavior before Spark **3.0**_ to _To restore the behavior before Spark **3.1**_ in the SQL migration guide where it describes the behaviour before the new version 3.1.
### Why are the changes needed? To have correct info in the SQL
migration guide.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? N/A
Closes #29336 from MaxGekk/fix-version-in-sql-migration.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 9bbe8c7418137e03df8bfbd80e7c569192d3711a)
The file was modified docs/sql-migration-guide.md (diff)
Commit 7deb67c28f948cca4e768317ade6d68d2534408f by gurwls223
[SPARK-32160][CORE][PYSPARK][FOLLOWUP] Change the config name to switch
allow/disallow SparkContext in executors
### What changes were proposed in this pull request?
This is a follow-up of #29278. This PR changes the config name to switch
allow/disallow `SparkContext` in executors as per the comment
https://github.com/apache/spark/pull/29278#pullrequestreview-460256338.
### Why are the changes needed?
The config name `spark.executor.allowSparkContext` is more reasonable.
### Does this PR introduce _any_ user-facing change?
Yes, the config name is changed.
### How was this patch tested?
Updated tests.
Closes #29340 from ueshin/issues/SPARK-32160/change_config_name.
Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 7deb67c28f948cca4e768317ade6d68d2534408f)
The file was modified core/src/test/scala/org/apache/spark/SparkContextSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modified docs/core-migration-guide.md (diff)
The file was modified python/pyspark/tests/test_context.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
The file was modified python/pyspark/context.py (diff)
The file was modified core/src/main/scala/org/apache/spark/SparkContext.scala (diff)
Commit 1597d8fcd4c68e723eb3152335298c7d05155643 by wenchen
[SPARK-30276][SQL] Support Filter expression allows simultaneous use of
DISTINCT
### What changes were proposed in this pull request? This PR is related to https://github.com/apache/spark/pull/26656, which only supports using the FILTER clause on aggregate expressions without DISTINCT. This PR enhances this feature so that one or more DISTINCT aggregate expressions can also use the FILTER clause. Such as:
```
select sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id) filter (where sex = 'man') from student group by class_id;
select count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student;
select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select sum(distinct id), sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
```
### Why are the changes needed? Spark SQL previously only supported using the FILTER clause on aggregate expressions without DISTINCT. This PR supports using the FILTER clause together with DISTINCT.
### Does this PR introduce _any_ user-facing change? Yes
### How was this patch tested? Existing and new UTs
Closes #29291 from beliefer/support-distinct-with-filter.
Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer
<beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 1597d8fcd4c68e723eb3152335298c7d05155643)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/group-by-filter.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/aggregates_part3.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/postgreSQL/aggregates_part3.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/postgreSQL/groupingsets.sql (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/groupingsets.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/group-by-filter.sql.out (diff)
Commit 005ef3a5b8715b38874888f2768c463b60c704f8 by gurwls223
[SPARK-32468][SS][TESTS][FOLLOWUP] Provide "default.api.timeout.ms" as
well when specifying "request.timeout.ms" on replacing
"default.api.timeout.ms"
### What changes were proposed in this pull request?
This patch is a follow-up to fill the gap in #29272, which missed providing `default.api.timeout.ms` as well. #29272 unintentionally changed the Kafka-side timeout behavior in a way that is incompatible with the test timeout (`default.api.timeout.ms` gets its default value of 60 seconds, which is longer than the test timeout).
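For illustration, a hedged sketch of the test-side intent: when a Kafka source test shortens `request.timeout.ms`, it should also shorten `default.api.timeout.ms`, otherwise the Kafka client falls back to its 60s default and outlives the test timeout. Option keys use Spark's `kafka.` prefix; the broker address, topic and values are illustrative.
```scala
// Kafka source with both client timeouts shortened to stay within the test timeout.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic-1")
  .option("kafka.request.timeout.ms", "3000")
  .option("kafka.default.api.timeout.ms", "3000")
  .load()
```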
### Why are the changes needed?
We realized the PR for SPARK-32468 (#29272) doesn't work as we expect.
See https://github.com/apache/spark/pull/29272#issuecomment-668333483
for more details.
### Does this PR introduce _any_ user-facing change?
No, as it only touches the tests.
### How was this patch tested?
Will trigger builds from Jenkins or GitHub Actions multiple times and confirm.
Closes #29343 from HeartSaVioR/SPARK-32468-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 005ef3a5b8715b38874888f2768c463b60c704f8)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala (diff)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDontFailOnDataLossSuite.scala (diff)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala (diff)
Commit 7fec6e0c16409235a40ee7bb1cc7e0eae7751d69 by gurwls223
[SPARK-32524][SQL][TESTS] CachedBatchSerializerSuite should clean up
InMemoryRelation.ser
### What changes were proposed in this pull request?
This PR aims to clean up `InMemoryRelation.ser` in
`CachedBatchSerializerSuite`.
### Why are the changes needed?
SPARK-32274 makes SQL cache serialization pluggable.
```
[SPARK-32274][SQL] Make SQL cache serialization pluggable
```
This causes UT failures.
```
$ build/sbt "sql/testOnly *.CachedBatchSerializerSuite
*.CachedTableSuite"
...
[info]   Cause: java.lang.IllegalStateException: This does not work.
This is only for testing
[info]   at
org.apache.spark.sql.execution.columnar.TestSingleIntColumnarCachedBatchSerializer.convertInternalRowToCachedBatch(CachedBatchSerializerSuite.scala:49)
...
[info] *** 30 TESTS FAILED ***
[error] Failed: Total 51, Failed 30, Errors 0, Passed 21
[error] Failed tests:
[error] org.apache.spark.sql.CachedTableSuite
[error] (sql/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually.
```
$ build/sbt "sql/testOnly *.CachedBatchSerializerSuite
*.CachedTableSuite"
[info] Tests: succeeded 51, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 51, Failed 0, Errors 0, Passed 51
```
Closes #29346 from dongjoon-hyun/SPARK-32524-3.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 7fec6e0c16409235a40ee7bb1cc7e0eae7751d69)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/CachedBatchSerializerSuite.scala (diff)
Commit 6d690680576ba58c35e6fbc86d37b45fef1c50d9 by wenchen
[SPARK-32521][SQL] Bug-fix: WithFields Expression should not be foldable
### What changes were proposed in this pull request?
Make WithFields Expression not foldable.
### Why are the changes needed?
The following query currently fails on the master branch:
```
sql("SELECT named_struct('a', 1, 'b', 2) a")
  .select($"a".withField("c", lit(3)).as("a"))
  .show(false)
// java.lang.UnsupportedOperationException: Cannot evaluate expression:
// with_fields(named_struct(a, 1, b, 2), c, 3)
```
This happens because the Catalyst optimizer sees that the WithFields Expression is foldable and tries to statically evaluate the WithFields Expression (via the ConstantFolding rule); however, it cannot do so because the WithFields Expression is Unevaluable.
### Does this PR introduce _any_ user-facing change?
Yes, queries like the one shared above will now succeed. That said, this
bug was introduced in Spark 3.1.0 which has yet to be released.
### How was this patch tested?
A new unit test was added.
Closes #29338 from fqaiser94/SPARK-32521.
Lead-authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Co-authored-by: fqaiser94 <fqaiser94@gmail.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: 6d690680576ba58c35e6fbc86d37b45fef1c50d9)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala (diff)
Commit 171b7d5d7114c516be722274dbc433a0897b62c0 by gengliang.wang
[SPARK-23431][CORE] Expose stage level peak executor metrics via REST
API
### What changes were proposed in this pull request?
Note that this PR is forked from #23340 originally written by edwinalu.
This PR proposes to expose the peak executor metrics at the stage level
via the REST APIs:
* `/applications/<application_id>/stages/`: peak values of executor
metrics for **each stage**
* `/applications/<application_id>/stages/<stage_id>/<stage_attempt_id>`: peak values of executor metrics for **each executor** for the stage, followed by peak values of executor metrics for the stage
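For illustration, a hedged sketch of querying the new endpoints from Scala; the host/port, application id, and stage/attempt ids are assumptions (adjust to your live application UI or history server).
```scala
// The application id below matches the spark-events fixture added by this change;
// 18080 is the usual history server port.
val appId = "app-20200706201101-0003"
val base  = s"http://localhost:18080/api/v1/applications/$appId"

// Peak executor metrics per stage
val stageList = scala.io.Source.fromURL(s"$base/stages/").mkString

// Per-executor peaks plus stage-level peaks for stage 0, attempt 0 (assumed ids)
val stageDetail = scala.io.Source.fromURL(s"$base/stages/0/0").mkString

println(stageList.take(300))
```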
### Why are the changes needed?
The stage level peak executor metrics can help better understand your
application's resource utilization.
### Does this PR introduce _any_ user-facing change?
1. For the `/applications/<application_id>/stages/` API, you will see
the following new info for **each stage**:
```JSON
"peakExecutorMetrics" : {
   "JVMHeapMemory" : 213367864,
   "JVMOffHeapMemory" : 189011656,
   "OnHeapExecutionMemory" : 0,
   "OffHeapExecutionMemory" : 0,
   "OnHeapStorageMemory" : 2133349,
   "OffHeapStorageMemory" : 0,
   "OnHeapUnifiedMemory" : 2133349,
   "OffHeapUnifiedMemory" : 0,
   "DirectPoolMemory" : 282024,
   "MappedPoolMemory" : 0,
   "ProcessTreeJVMVMemory" : 0,
   "ProcessTreeJVMRSSMemory" : 0,
   "ProcessTreePythonVMemory" : 0,
   "ProcessTreePythonRSSMemory" : 0,
   "ProcessTreeOtherVMemory" : 0,
   "ProcessTreeOtherRSSMemory" : 0,
   "MinorGCCount" : 13,
   "MinorGCTime" : 115,
   "MajorGCCount" : 4,
   "MajorGCTime" : 339
}
```
2. For the
`/applications/<application_id>/stages/<stage_id>/<stage_attempt_id>`
API, you will see the following new info for **each executor** under
`executorSummary`:
```JSON
"peakMemoryMetrics" : {
   "JVMHeapMemory" : 0,
   "JVMOffHeapMemory" : 0,
   "OnHeapExecutionMemory" : 0,
   "OffHeapExecutionMemory" : 0,
   "OnHeapStorageMemory" : 0,
   "OffHeapStorageMemory" : 0,
   "OnHeapUnifiedMemory" : 0,
   "OffHeapUnifiedMemory" : 0,
   "DirectPoolMemory" : 0,
   "MappedPoolMemory" : 0,
   "ProcessTreeJVMVMemory" : 0,
   "ProcessTreeJVMRSSMemory" : 0,
   "ProcessTreePythonVMemory" : 0,
   "ProcessTreePythonRSSMemory" : 0,
   "ProcessTreeOtherVMemory" : 0,
   "ProcessTreeOtherRSSMemory" : 0,
   "MinorGCCount" : 0,
   "MinorGCTime" : 0,
   "MajorGCCount" : 0,
   "MajorGCTime" : 0
}
```
, and the following at the stage level:
```JSON
"peakExecutorMetrics" : {
   "JVMHeapMemory" : 213367864,
   "JVMOffHeapMemory" : 189011656,
   "OnHeapExecutionMemory" : 0,
   "OffHeapExecutionMemory" : 0,
   "OnHeapStorageMemory" : 2133349,
   "OffHeapStorageMemory" : 0,
   "OnHeapUnifiedMemory" : 2133349,
   "OffHeapUnifiedMemory" : 0,
   "DirectPoolMemory" : 282024,
   "MappedPoolMemory" : 0,
   "ProcessTreeJVMVMemory" : 0,
   "ProcessTreeJVMRSSMemory" : 0,
   "ProcessTreePythonVMemory" : 0,
   "ProcessTreePythonRSSMemory" : 0,
   "ProcessTreeOtherVMemory" : 0,
   "ProcessTreeOtherRSSMemory" : 0,
   "MinorGCCount" : 13,
   "MinorGCTime" : 115,
   "MajorGCCount" : 4,
   "MajorGCTime" : 339
}
```
### How was this patch tested?
Added tests.
Closes #29020 from imback82/metrics.
Lead-authored-by: Terry Kim <yuminkim@gmail.com> Co-authored-by:
edwinalu <edwina.lu@gmail.com> Signed-off-by: Gengliang Wang
<gengliang.wang@databricks.com>
(commit: 171b7d5d7114c516be722274dbc433a0897b62c0)
The file was modified core/src/main/scala/org/apache/spark/status/api/v1/api.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/ui/StagePageSuite.scala (diff)
The file was added core/src/test/resources/HistoryServerExpectations/stage_with_peak_metrics_expectation.json
The file was modified core/src/main/scala/org/apache/spark/status/AppStatusStore.scala (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/completed_app_list_json_expectation.json (diff)
The file was modified core/src/main/scala/org/apache/spark/status/AppStatusListener.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/application_list_json_expectation.json (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/minDate_app_list_json_expectation.json (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/minEndDate_app_list_json_expectation.json (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/limit_app_list_json_expectation.json (diff)
The file was added core/src/test/resources/HistoryServerExpectations/stage_list_with_peak_metrics_expectation.json
The file was modified core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/status/LiveEntity.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala (diff)
The file was added core/src/test/resources/spark-events/app-20200706201101-0003
The file was modified dev/.rat-excludes (diff)
Commit 7eb6f45688a05cb426edba889f220b3ffc5d946d by wenchen
[SPARK-32499][SQL] Use `{}` in conversions maps and structs to strings
### What changes were proposed in this pull request? Change casting of
map and struct values to strings by using the `{}` brackets instead of
`[]`. The behavior is controlled by the SQL config
`spark.sql.legacy.castComplexTypesToString.enabled`. When it is `true`,
`CAST` wraps maps and structs by `[]` in casting to strings. Otherwise,
if this is `false`, which is the default, maps and structs are wrapped
by `{}`.
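For illustration, a minimal sketch of the new rendering under default settings; the output strings in the comments are illustrative of the `{}` wrapping described above.
```scala
// Casting a struct and a map to STRING (e.g. in spark-shell).
spark.sql(
  "SELECT CAST(named_struct('a', 1, 'b', 2) AS STRING) AS s, CAST(map('k', 1) AS STRING) AS m"
).show(false)
// s = {1, 2},  m = {k -> 1}
// With spark.sql.legacy.castComplexTypesToString.enabled=true the old
// [1, 2] / [k -> 1] form is restored.
```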
### Why are the changes needed?
- To distinguish structs/maps from arrays.
- To make `show`'s output consistent with Hive and conversions to Hive
strings.
- To display dataframe content in the same form by `spark-sql` and
`show`
- To be consistent with the `*.sql` tests
### Does this PR introduce _any_ user-facing change? Yes
### How was this patch tested? By existing test suite `CastSuite`.
Closes #29308 from MaxGekk/show-struct-map.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 7eb6f45688a05cb426edba889f220b3ffc5d946d)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala (diff)
The file was modified docs/sql-migration-guide.md (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/pivot.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified python/pyspark/ml/stat.py (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/udf/udf-pivot.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala (diff)
Commit 0660a0501d28c9a24cb537ebaee2d8f0a78fea17 by gengliang.wang
[SPARK-32525][DOCS] The layout of monitoring.html is broken
### What changes were proposed in this pull request?
This PR fixes the layout of monitoring.html, which was broken after SPARK-31566 (#28354). The cause is that there are two `<td>` tags not closed in `monitoring.md`.
### Why are the changes needed?
This is a bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Build docs and the following screenshots are before/after.
* Before fixed
![broken-doc](https://user-images.githubusercontent.com/4736016/89257873-fba09b80-d661-11ea-90da-06cbc0783011.png)
* After fixed.
![fixed-doc2](https://user-images.githubusercontent.com/4736016/89257910-0fe49880-d662-11ea-9a85-7a1ecb1d38d6.png)
Of course, the table is still rendered correctly.
![fixed-doc1](https://user-images.githubusercontent.com/4736016/89257948-225ed200-d662-11ea-80fd-d9254b44d4a0.png)
Closes #29345 from sarutak/fix-monitoring.md.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by:
Gengliang Wang <gengliang.wang@databricks.com>
(commit: 0660a0501d28c9a24cb537ebaee2d8f0a78fea17)
The file was modified docs/monitoring.md (diff)