Failed

Changes

Summary

  1. [SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale (commit: 708ab57f377bfd8e71183cfead918bae5b811946) (details)
  2. [SPARK-29956][SQL] A literal number with an exponent should be parsed to (commit: 87ebfaf003fcd05a7f6d23b3ecd4661409ce5f2f) (details)
  3. [SPARK-30065][SQL] DataFrameNaFunctions.drop should handle duplicate (commit: 5a1896adcb87e1611559c55fc76f32063e1c7c1b) (details)
  4. [SPARK-30074][SQL] The maxNumPostShufflePartitions config should obey (commit: d1465a1b0dea690fcfbf75edb73ff9f8a015c0dd) (details)
  5. [SPARK-29851][SQL][FOLLOW-UP] Use foreach instead of misusing map (commit: 51e69feb495dfc63023ff673da30a3198081cfb6) (details)
  6. [SPARK-30050][SQL] analyze table and rename table should not erase hive (commit: 85cb388ae3f25b0e6a7fc1a2d78fd1c3ec03f341) (details)
  7. [SPARK-29959][ML][PYSPARK] Summarizer support more metrics (commit: 03ac1b799cf1e48489e8246a1b97110c80344160) (details)
  8. [SPARK-30025][CORE] Continuous shuffle block fetching should be disabled (commit: 169415ffac3050a86934011525ea00eef7fca35c) (details)
  9. [SPARK-29839][SQL] Supporting STORED AS in CREATE TABLE LIKE (commit: 04a5b8f5f80ee746bdc16267e44a993a9941d335) (details)
  10. [SPARK-30047][SQL] Support interval types in UnsafeRow (commit: 4e073f3c5093e136518e456d0a3a7437ad9867a3) (details)
  11. [MINOR][SQL] Rename config name to (commit: e271664a01fd7dee63391890514d76262cad1bc1) (details)
  12. [MINOR][SS] Add implementation note on overriding serialize/deserialize (commit: 54edaee58654bdc3c961906a8390088f35460ae9) (details)
  13. [SPARK-27721][BUILD] Switch to use right leveldbjni according to the (commit: e842033accf12190f1bf3962546065613656410f) (details)
  14. [SPARK-30085][SQL][DOC] Standardize sql reference (commit: babefdee1c133c6b35ff026d5deacb292a0b85aa) (details)
  15. [SPARK-30075][CORE][TESTS] Fix the hashCode implementation of (commit: e04a63437b8f31db90ca1669ee98289f4ba633e1) (details)
  16. [SPARK-30072][SQL] Create dedicated planner for subqueries (commit: 68034a805607ced50dbedca73dfc7eaf0102dde8) (details)
Commit 708ab57f377bfd8e71183cfead918bae5b811946 by gurwls223
[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale
of the column
## What changes were proposed in this pull request?
[HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved
Hive to pad decimal numbers with trailing zeros to the scale of the
column. The following description is copied from HIVE-12063.
> HIVE-7373 was to address the problems of trimming tailing zeros by
Hive, which caused many problems including treating 0.0, 0.00 and so on
as 0, which has different precision/scale. Please refer to HIVE-7373
description. However, HIVE-7373 was reverted by HIVE-8745 while the
underlying problems remained. HIVE-11835 was resolved recently to
address one of the problems, where 0.0, 0.00, and so on cannot be read
into decimal(1,1).
However, HIVE-11835 didn't address the problem of showing as 0 in query
result for any decimal values such as 0.0, 0.00, etc. This causes
confusion as 0 and 0.0 have different precision/scale than 0. The
proposal here is to pad zeros for query result to the type's scale. This
not only removes the confusion described above, but also aligns with
many other DBs. Internal decimal number representation doesn't change,
however.
**Spark SQL**:
```sql
// bin/spark-sql
spark-sql> select cast(1 as decimal(38, 18));
1
spark-sql>
// bin/beeline
0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18));
+----------------------------+--+
| CAST(1 AS DECIMAL(38,18))  |
+----------------------------+--+
| 1.000000000000000000       |
+----------------------------+--+
// bin/spark-shell
scala> spark.sql("select cast(1 as decimal(38, 18))").show(false)
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|1.000000000000000000     |
+-------------------------+
// bin/pyspark
>>> spark.sql("select cast(1 as decimal(38, 18))").show()
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+
// bin/sparkR
> showDF(sql("SELECT cast(1 as decimal(38, 18))"))
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+
```
**PostgreSQL**:
```sql
postgres=# select cast(1 as decimal(38, 18));
      numeric
----------------------
1.000000000000000000
(1 row)
```
**Presto**:
```sql
presto> select cast(1 as decimal(38, 18));
       _col0
----------------------
1.000000000000000000
(1 row)
```
## How was this patch tested?
unit tests and manual test:
```sql
spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000
```
Spark SQL Upgrading Guide:
![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png)
Closes #26697 from wangyum/SPARK-28461.
Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 708ab57f377bfd8e71183cfead918bae5b811946)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/union.sql.out (diff)
The file was modified sql/hive/src/test/resources/golden/windowing_navfn.q (deterministic)-2-1e88e0ba414a00195f7ebf6b8600ac04 (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/order-by-nulls-ordering.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/typeCoercion/native/decimalPrecision.sql.out (diff)
The file was modified sql/hive/src/test/resources/golden/windowing_rank.q (deterministic) 4-0-12cc78f3953c3e6b5411ddc729541bf0 (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out (diff)
The file was modified sql/hive/src/test/resources/golden/decimal_1_1-3-ac24b36077314acab595ada14e598e (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/date.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/create_view.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/int8.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/select.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/window_part4.sql.out (diff)
The file was modified sql/hive/src/test/resources/golden/decimal_4-6-693c2e345731f9b2b547c3b75218458e (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/not-in-unit-tests-single-column.sql.out (diff)
The file was modified sql/hive/src/test/resources/golden/decimal_4-7-f1eb45492510cb76cf6b452121af8531 (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/not-in-unit-tests-single-column-literal.sql.out (diff)
The file was modified docs/sql-migration-guide.md (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/literals.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/timestamp.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/not-in-unit-tests-multi-column-literal.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/not-in-unit-tests-multi-column.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/typeCoercion/native/division.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/int4.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/table-aliases.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/decimalArithmeticOperations.sql.out (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/udf/udf-union.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/union.sql.out (diff)
The file was modified sql/hive/src/test/resources/golden/decimal_1_1-4-128804f8dfe7dbb23be0498b91647ba3 (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/int2.sql.out (diff)
The file was modified sql/hive/src/test/resources/golden/serde_regex-10-c5b3ec90419a40660e5f83736241c429 (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/HiveResultSuite.scala (diff)
The file was modified sql/hive/src/test/resources/golden/windowing_rank.q (deterministic) 2-0-81bb7f49a55385878637c8aac4d08e5 (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala (diff)
Commit 87ebfaf003fcd05a7f6d23b3ecd4661409ce5f2f by wenchen
[SPARK-29956][SQL] A literal number with an exponent should be parsed to
Double
### What changes were proposed in this pull request?
For a literal number with an exponent (e.g. 1e-45, 1E2), we now parse it
to Double by default rather than Decimal. Users can still set
`spark.sql.legacy.exponentLiteralToDecimal.enabled=true` to fall back to
the previous behavior.
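A rough sketch of the new default and the legacy fallback (assuming the config can be set at the session level; the comments only paraphrase the behavior described above):
```scala
// After this PR, an exponent literal is parsed as Double by default.
spark.sql("select 1E2").schema   // column of DoubleType

// Hypothetical fallback to the pre-PR behavior via the legacy config.
spark.conf.set("spark.sql.legacy.exponentLiteralToDecimal.enabled", "true")
spark.sql("select 1E2").schema   // parsed as a Decimal literal again
```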
### Why are the changes needed?
According to the ANSI SQL standard, part of the definition of `literal`
is:
```
<approximate numeric literal> ::=
   <mantissa> E <exponent>
```
which indicates that a literal number with an exponent should be an
approximate numeric (e.g. Double) rather than an exact numeric (e.g.
Decimal).
When we tested Presto, we found that it also conforms to this standard:
```
presto:default> select typeof(1E2);
_col0
--------
double
(1 row)
```
```
presto:default> select typeof(1.2);
   _col0
--------------
decimal(2,1)
(1 row)
```
We also found that literals like `1E2` were actually parsed as Double
before Spark 2.1 but changed to Decimal by #14828, because *the
difference between the two confuses most users*, as that PR put it. We
also see support (from a DB2 test) for the original behavior at #14828
(comment).
We also see, though, that PostgreSQL has its own implementation:
```
postgres=# select pg_typeof(1E2);
pg_typeof
-----------
numeric
(1 row)
postgres=# select pg_typeof(1.2);
pg_typeof
-----------
numeric
(1 row)
```
We still think that Spark should conform to this standard, considering
the SQL standard, Spark's own history, the behavior of the majority of
DBMSs, and user experience.
### Does this PR introduce any user-facing change?
Yes.
For `1E2`, before this PR:
```
scala> spark.sql("select 1E2")
res0: org.apache.spark.sql.DataFrame = [1E+2: decimal(1,-2)]
```
After this PR:
```
scala> spark.sql("select 1E2")
res0: org.apache.spark.sql.DataFrame = [100.0: double]
```
And for `1E-45`, before this PR:
```
org.apache.spark.sql.catalyst.parser.ParseException: decimal can only support precision up to 38

== SQL ==
select 1E-45

at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:131)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
... 47 elided
```
after this PR:
```
scala> spark.sql("select 1E-45")
res1: org.apache.spark.sql.DataFrame = [1.0E-45: double]
```
Before this PR, users may have found it strange that `select 1e40`
worked but `select 1e-40` failed. Now both of them work well.
### How was this patch tested?
updated `literals.sql.out` and `ansi/literals.sql.out`
Closes #26595 from Ngone51/SPARK-29956.
Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 87ebfaf003fcd05a7f6d23b3ecd4661409ce5f2f)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala (diff)
The file was modified docs/sql-migration-guide.md (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-order-by.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/subquery/scalar-subquery/scalar-subquery-predicate.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-limit.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/ansi/decimalArithmeticOperations.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-limit.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/operators.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-group-by.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-order-by.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/simple-in.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/literals.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/simple-in.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/decimalArithmeticOperations.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-group-by.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/decimalArithmeticOperations.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-set-operations.sql.out (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ExpressionParserSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/scalar-subquery/scalar-subquery-predicate.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-set-operations.sql (diff)
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
Commit 5a1896adcb87e1611559c55fc76f32063e1c7c1b by wenchen
[SPARK-30065][SQL] DataFrameNaFunctions.drop should handle duplicate
columns
### What changes were proposed in this pull request?
`DataFrameNaFunctions.drop` doesn't handle duplicate columns even when
column names are not specified.
```scala
val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.drop("any").show
```
produces
```
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col2: string (nullable = true)
org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous,
could be: col2, col2.;
at
org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
```
The reason for the above failure is that columns are resolved by name
and, if there are multiple columns with the same name, it will fail due
to ambiguity.
This PR updates `DataFrameNaFunctions.drop` such that if the columns to
drop are not specified, it will resolve ambiguity gracefully by applying
`drop` to all the eligible columns. (Note that if the user specifies the
columns, it will still continue to fail due to ambiguity).
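A small sketch of that caveat, reusing the `df` from the example above (behavior as described in this PR, not an additional test from it):
```scala
// With no columns given, drop now applies to all eligible columns.
df.na.drop("any").show()

// Explicitly naming the duplicated column is still ambiguous and fails.
df.na.drop("any", Seq("col2")).show()   // AnalysisException: Reference 'col2' is ambiguous
```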
### Why are the changes needed?
If column names are not specified, `drop` should not fail due to
ambiguity since it should still be able to apply `drop` to the eligible
columns.
### Does this PR introduce any user-facing change?
Yes, now all the rows with nulls are dropped in the above example:
```
scala> df.na.drop("any").show
+----+----+----+
|col1|col2|col2|
+----+----+----+
+----+----+----+
```
### How was this patch tested?
Added new unit tests.
Closes #26700 from imback82/na_drop.
Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 5a1896adcb87e1611559c55fc76f32063e1c7c1b)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala (diff)
Commit d1465a1b0dea690fcfbf75edb73ff9f8a015c0dd by wenchen
[SPARK-30074][SQL] The maxNumPostShufflePartitions config should obey
reducePostShufflePartitions enabled
### What changes were proposed in this pull request?
1. Make the maxNumPostShufflePartitions config obey the reducePostShufflePartitions config.
2. Update the description of all the SQLConf entries affected by `spark.sql.adaptive.enabled`.
### Why are the changes needed?
Make the relation between these confs clearer.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UT.
Closes #26664 from xuanyuanking/SPARK-9853-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: d1465a1b0dea690fcfbf75edb73ff9f8a015c0dd)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit 51e69feb495dfc63023ff673da30a3198081cfb6 by gurwls223
[SPARK-29851][SQL][FOLLOW-UP] Use foreach instead of misusing map
### What changes were proposed in this pull request?
This PR proposes to use foreach instead of misusing map, as a small
follow-up of #26476. Misusing map could potentially cause weird errors,
and it is not good practice anyway. See also SPARK-16694.
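For context, a generic illustration (not the code touched by this PR) of why `map` used only for its side effects is better written as `foreach`:
```scala
val names = Seq("a", "b", "c")

// Misuse: builds a Seq[Unit] that is immediately thrown away, and on a lazy
// collection (an Iterator or a view) the side effect may never run at all.
names.map(println)

// Intent is explicit: run the side effect for each element, return Unit.
names.foreach(println)
```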
### Why are the changes needed?
To avoid potential issues like SPARK-16694.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests should cover it.
Closes #26729 from HyukjinKwon/SPARK-29851.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 51e69feb495dfc63023ff673da30a3198081cfb6)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTableCatalog.scala (diff)
Commit 85cb388ae3f25b0e6a7fc1a2d78fd1c3ec03f341 by wenchen
[SPARK-30050][SQL] analyze table and rename table should not erase hive
table bucketing info
### What changes were proposed in this pull request?
This patch adds the Hive provider into the table metadata in
`HiveExternalCatalog.alterTableStats`. When we call
`HiveClient.alterTable`, `alterTable` will erase the bucketing info if
it cannot find the Hive provider in the given table metadata.
Rename table also has this issue.
### Why are the changes needed?
Because running `ANALYZE TABLE` on a Hive table that has bucketing info
erases the existing bucketing info.
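A hedged repro sketch of the issue (the table name and layout are ours, and it assumes a Hive-enabled SparkSession):
```scala
// Create a bucketed Hive table, then analyze it.
spark.sql("CREATE TABLE bucketed_t (a INT, b STRING) CLUSTERED BY (a) INTO 4 BUCKETS STORED AS PARQUET")
spark.sql("ANALYZE TABLE bucketed_t COMPUTE STATISTICS")

// Before this fix the bucketing info could be gone from the table metadata
// at this point; after the fix it is preserved.
spark.sql("DESC FORMATTED bucketed_t").show(numRows = 100, truncate = false)
```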
### Does this PR introduce any user-facing change?
Yes. After this PR, running `ANALYZE TABLE` on a Hive table won't erase
the existing bucketing info.
### How was this patch tested?
Unit test.
Closes #26685 from viirya/fix-hive-bucket.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by:
Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 85cb388ae3f25b0e6a7fc1a2d78fd1c3ec03f341)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (diff)
Commit 03ac1b799cf1e48489e8246a1b97110c80344160 by ruifengz
[SPARK-29959][ML][PYSPARK] Summarizer support more metrics
### What changes were proposed in this pull request?
Summarizer supports more metrics: sum and std.
### Why are the changes needed?
Those metrics are widely used, and it is convenient to obtain them
directly rather than via a conversion. In `NaiveBayes` we want the sum
of vectors; currently mean & weightSum need to be computed and then
multiplied. In
`StandardScaler`, `AFTSurvivalRegression`, `LinearRegression`, `LinearSVC`, and `LogisticRegression`
we need to obtain `variance` and then take its square root to get the std.
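A usage sketch of the expanded API (hedged; it assumes a DataFrame `df` with a vector column named "features"):
```scala
import org.apache.spark.ml.stat.Summarizer

// Request the newly supported metrics together with an existing one.
df.select(
    Summarizer.metrics("sum", "std", "mean")
      .summary(df("features"))
      .alias("stats"))
  .show(truncate = false)
```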
### Does this PR introduce any user-facing change?
Yes, new metrics are exposed to end users.
### How was this patch tested?
Added test suites.
Closes #26596 from zhengruifeng/summarizer_add_metrics.
Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by:
zhengruifeng <ruifengz@foxmail.com>
(commit: 03ac1b799cf1e48489e8246a1b97110c80344160)
The file was modified mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala (diff)
The file was modified docs/ml-statistics.md (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (diff)
The file was modified python/pyspark/ml/stat.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala (diff)
Commit 169415ffac3050a86934011525ea00eef7fca35c by wenchen
[SPARK-30025][CORE] Continuous shuffle block fetching should be disabled
by default when the old fetch protocol is used
### What changes were proposed in this pull request?
Disable continuous shuffle block fetching when the old fetch protocol is
in use.
### Why are the changes needed?
The new feature of continuous shuffle block fetching depends on the
latest version of the shuffle fetch protocol. We should keep this
constraint in `BlockStoreShuffleReader.fetchContinuousBlocksInBatch`.
### Does this PR introduce any user-facing change?
Users will no longer get the exception related to continuous shuffle
block fetching when an old version of the external shuffle service is
used.
### How was this patch tested?
Existing UT.
Closes #26663 from xuanyuanking/SPARK-30025.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: 169415ffac3050a86934011525ea00eef7fca35c)
The file was modified core/src/main/scala/org/apache/spark/network/netty/NettyBlockRpcServer.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/shuffle/BlockStoreShuffleReader.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit 04a5b8f5f80ee746bdc16267e44a993a9941d335 by wenchen
[SPARK-29839][SQL] Supporting STORED AS in CREATE TABLE LIKE
### What changes were proposed in this pull request?
In SPARK-29421 (#26097), we can specify a different table provider for
`CREATE TABLE LIKE` via `USING provider`. Hive supports the `STORED AS`
new file format syntax:
```sql
CREATE TABLE tbl(a int) STORED AS TEXTFILE;
CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
```
For Hive compatibility, we should also support `STORED AS` in
`CREATE TABLE LIKE`.
### Why are the changes needed?
See https://github.com/apache/spark/pull/26097#issue-327424759
### Does this PR introduce any user-facing change?
A new syntax is added based on the current CTL:
`CREATE TABLE tbl2 LIKE tbl [STORED AS hiveFormat];`
### How was this patch tested?
Added UTs.
Closes #26466 from LantaoJin/SPARK-29839.
Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 04a5b8f5f80ee746bdc16267e44a993a9941d335)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala (diff)
The file was modified docs/sql-migration-guide.md (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
Commit 4e073f3c5093e136518e456d0a3a7437ad9867a3 by wenchen
[SPARK-30047][SQL] Support interval types in UnsafeRow
### What changes were proposed in this pull request?
Optimize aggregates on interval values from sort-based to hash-based, so
that we can use
`org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch` for
better performance.
### Why are the changes needed?
Improve aggregates on interval values.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a UT; also covered by existing tests.
Closes #26680 from yaooqinn/SPARK-30047.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 4e073f3c5093e136518e456d0a3a7437ad9867a3)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriterSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/UnsafeRowConverterSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/RowBasedHashMapGenerator.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeWriter.java (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java (diff)
Commit e271664a01fd7dee63391890514d76262cad1bc1 by wenchen
[MINOR][SQL] Rename config name to
spark.sql.analyzer.failAmbiguousSelfJoin.enabled
### What changes were proposed in this pull request?
add `.enabled` postfix to `spark.sql.analyzer.failAmbiguousSelfJoin`.
### Why are the changes needed?
to follow the existing naming style
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
not needed
Closes #26694 from cloud-fan/conf.
Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: e271664a01fd7dee63391890514d76262cad1bc1)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified docs/sql-migration-guide.md (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameSelfJoinSuite.scala (diff)
Commit 54edaee58654bdc3c961906a8390088f35460ae9 by sean.owen
[MINOR][SS] Add implementation note on overriding serialize/deserialize
in HDFSMetadataLog methods' scaladoc
### What changes were proposed in this pull request?
The patch adds scaladoc to `HDFSMetadataLog.serialize` and
`HDFSMetadataLog.deserialize` with an implementation note for
overriding: HDFSMetadataLog calls `serialize` and `deserialize` inside
try-finally and the caller does the resource (input stream, output
stream) cleanup, so resource cleanup should not be performed in these
methods. Previously there was no note on this (only a code comment, not
scaladoc), which was easy to miss.
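To make the contract concrete, a simplified sketch of the calling pattern described above (names shortened; this is not the actual HDFSMetadataLog code):
```scala
import java.io.OutputStream

abstract class MetadataLogSketch[T] {
  // Implementation note: write the serialized bytes only; do NOT close `out`.
  protected def serialize(metadata: T, out: OutputStream): Unit

  final def add(metadata: T, out: OutputStream): Unit = {
    try {
      serialize(metadata, out)
    } finally {
      out.close()   // the caller owns the stream and cleans it up here
    }
  }
}
```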
### Why are the changes needed?
Contributors who are unfamiliar with the intention may see it as a bug
when the resource is not cleaned up in the serialize/deserialize of a
subclass of HDFSMetadataLog, and they cannot learn the intention without
reading the code of HDFSMetadataLog. Adding the note as scaladoc
increases its visibility.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Just a doc change.
Closes #26732 from HeartSaVioR/MINOR-SS-HDFSMetadataLog-serde-scaladoc.
Lead-authored-by: Jungtaek Lim (HeartSaVioR)
<kabhwan.opensource@gmail.com> Co-authored-by: dz <953396112@qq.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(commit: 54edaee58654bdc3c961906a8390088f35460ae9)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala (diff)
Commit e842033accf12190f1bf3962546065613656410f by sean.owen
[SPARK-27721][BUILD] Switch to use right leveldbjni according to the
platforms
This change adds a profile to switch to the right leveldbjni package
according to the platform: aarch64 uses
org.openlabtesting.leveldbjni:leveldbjni-all 1.8, and other platforms
keep the old org.fusesource.leveldbjni:leveldbjni-all 1.8. Some Hadoop
dependency packages also depend on
org.fusesource.leveldbjni:leveldbjni-all, and Hadoop has merged a
similar change on trunk (see
https://issues.apache.org/jira/browse/HADOOP-16614), so the
org.fusesource.leveldbjni dependency is excluded for the related Hadoop
packages. With this, Spark can build and test on the aarch64 platform
successfully.
Closes #26636 from huangtianhua/add-aarch64-leveldbjni.
Authored-by: huangtianhua <huangtianhua@huawei.com> Signed-off-by: Sean
Owen <sean.owen@databricks.com>
(commit: e842033accf12190f1bf3962546065613656410f)
The file was modified common/network-common/pom.xml (diff)
The file was modified common/kvstore/pom.xml (diff)
The file was modified pom.xml (diff)
Commit babefdee1c133c6b35ff026d5deacb292a0b85aa by sean.owen
[SPARK-30085][SQL][DOC] Standardize sql reference
### What changes were proposed in this pull request?
Standardize the SQL reference.
### Why are the changes needed?
To have consistent docs.
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
Tested using `jekyll build --serve`.
Closes #26721 from huaxingao/spark-30085.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen
<sean.owen@databricks.com>
(commit: babefdee1c133c6b35ff026d5deacb292a0b85aa)
The file was modified docs/sql-ref-syntax-aux-describe-query.md (diff)
The file was modified docs/sql-ref-syntax-aux-show-columns.md (diff)
The file was modified docs/sql-ref-syntax-ddl-drop-database.md (diff)
The file was modified docs/sql-ref-syntax-dml-load.md (diff)
The file was modified docs/sql-ref-syntax-aux-describe-table.md (diff)
The file was modified docs/sql-ref-syntax-aux-cache.md (diff)
The file was modified docs/sql-ref-syntax-aux-describe-function.md (diff)
The file was modified docs/sql-ref-syntax-ddl-alter-view.md (diff)
The file was modified docs/sql-ref-syntax-ddl-alter-database.md (diff)
The file was modified docs/sql-ref-syntax-aux-show-create-table.md (diff)
The file was modified docs/sql-ref-syntax-aux-cache-clear-cache.md (diff)
The file was modified docs/sql-ref-syntax-dml-insert-into.md (diff)
The file was modified docs/sql-ref-syntax-aux-show-functions.md (diff)
The file was modified docs/sql-ref-syntax-ddl-drop-table.md (diff)
The file was modified docs/sql-ref-syntax-aux-describe-database.md (diff)
The file was modified docs/sql-ref-syntax-aux-show-table.md (diff)
The file was modified docs/sql-ref-syntax-aux-show-databases.md (diff)
The file was modified docs/sql-ref-syntax-aux-cache-cache-table.md (diff)
The file was modified docs/sql-ref-syntax-aux-show-tblproperties.md (diff)
The file was modified docs/sql-ref-syntax-ddl-create-function.md (diff)
The file was modified docs/sql-ref-syntax-aux-show-partitions.md (diff)
The file was modified docs/sql-ref-syntax-ddl-alter-table.md (diff)
The file was modified docs/sql-ref-syntax-ddl-truncate-table.md (diff)
The file was modified docs/sql-ref-syntax-ddl-create-view.md (diff)
The file was modified docs/sql-ref-syntax-ddl-drop-function.md (diff)
The file was modified docs/sql-ref-syntax-aux-show-tables.md (diff)
The file was modified docs/sql-ref-syntax-aux-analyze-table.md (diff)
The file was modified docs/sql-ref-syntax-dml-insert-overwrite-table.md (diff)
The file was modified docs/sql-ref-syntax-aux-cache-uncache-table.md (diff)
The file was modified docs/sql-ref-syntax-ddl-drop-view.md (diff)
The file was modified docs/sql-ref-syntax-ddl-repair-table.md (diff)
The file was modified docs/sql-ref-syntax-aux-refresh-table.md (diff)
The file was modified docs/sql-ref-syntax-ddl-create-database.md (diff)
Commit e04a63437b8f31db90ca1669ee98289f4ba633e1 by sean.owen
[SPARK-30075][CORE][TESTS] Fix the hashCode implementation of
ArrayKeyIndexType correctly
### What changes were proposed in this pull request?
This patch fixes the bug in ArrayKeyIndexType.hashCode(): it simply
calls Array.hashCode(), which in turn calls Object.hashCode(). It should
call Arrays.hashCode() to reflect the elements in the array.
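To see why this matters, a quick Scala illustration of the difference (the actual fix is a one-liner in the Java test class):
```scala
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)

// Arrays inherit Object.hashCode, so equal contents usually hash differently:
a.hashCode == b.hashCode                                        // false

// Arrays.hashCode is computed from the elements:
java.util.Arrays.hashCode(a) == java.util.Arrays.hashCode(b)   // true
```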
### Why are the changes needed?
I encountered the bug while adding test code for #25811, and I've split
the fix into an individual PR to speed up reviewing. Without this patch,
ArrayKeyIndexType would bring various issues when it is used as a key
type in hash-based collections.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
I've skipped adding a UT, as ArrayKeyIndexType is test-only code and the
patch is a pretty simple one-liner.
Closes #26709 from HeartSaVioR/SPARK-30075.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(commit: e04a63437b8f31db90ca1669ee98289f4ba633e1)
The file was modified common/kvstore/src/test/java/org/apache/spark/util/kvstore/ArrayKeyIndexType.java (diff)
Commit 68034a805607ced50dbedca73dfc7eaf0102dde8 by herman
[SPARK-30072][SQL] Create dedicated planner for subqueries
### What changes were proposed in this pull request?
This PR changes subquery planning by calling the planner and plan
preparation rules on the subquery plan directly. Before, we were
creating a `QueryExecution` instance for subqueries to get the
executedPlan. This would re-run analysis and optimization on the
subquery plan. Running the analysis again on an optimized query plan can
have unwanted consequences, as some rules, for example
`DecimalPrecision`, are not idempotent.
As an example, consider the expression `1.7 * avg(a)` which after
applying the `DecimalPrecision` rule becomes:
```
promote_precision(1.7) * promote_precision(avg(a))
```
After the optimization, more specifically the constant folding rule,
this expression becomes:
```
1.7 * promote_precision(avg(a))
```
Now if we run the analyzer on this optimized query again, we will get:
```
promote_precision(1.7) * promote_precision(promote_precision(avg(a)))
```
Which will later be optimized as:
```
1.7 * promote_precision(promote_precision(avg(a)))
```
As can be seen, re-running the analysis and optimization on this
expression results in an expression with extra nested promote_precision
nodes. Adding unneeded nodes to the plan is problematic because it can
eliminate situations where we can reuse the plan.
We opted to introduce dedicated planners for subqueries, instead of
making the DecimalPrecision rule idempotent, because this eliminates
this entire category of problems. Another benefit is that planning time
for subqueries is reduced.
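As a hedged example of the new path (the table `t` is hypothetical), the scalar subquery below is now planned by the dedicated subquery planner instead of going through a fresh `QueryExecution`:
```scala
spark.sql("SELECT a FROM t WHERE a > (SELECT 1.7 * avg(a) FROM t)").explain()
```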
### How was this patch tested?
Unit tests
Closes #26705 from dbaliafroozeh/CreateDedicatedPlannerForSubqueries.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by:
herman <herman@databricks.com>
(commit: 68034a805607ced50dbedca73dfc7eaf0102dde8)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAliasAndProjectSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/InsertAdaptiveSparkPlan.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/dynamicpruning/PartitionPruning.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/dynamicpruning/PlanDynamicPruningFilters.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala (diff)