SuccessChanges

Summary

  1. [SPARK-23045][ML][SPARKR] Update RFormula to use OneHotEncoderEstimator. (commit: 833a584bb675bf7c643bde62eb51963814c12384)
  2. [SPARK-23095][SQL] Decorrelation of scalar subquery fails with java.util.NoSuchElementException (commit: 41d1a323c0e876352060d81f8c281a1565619ca1)
  3. [SPARK-22361][SQL][TEST] Add unit test for Window Frames (commit: 08252bb38e48a940b4d8fb975fe12a020fe36b97)
  4. [SPARK-22908][SS] Roll forward continuous processing Kafka support with fix to continuous Kafka data reader (commit: 0a441d2edb0a3f6c6c7c370db8917e1c07f211e7)
  5. Revert "[SPARK-23020][CORE] Fix races in launcher code, test." (commit: b9339eee1304c0309be4ea74f8cdc3d37a8048d3)
  6. Fix merge between 07ae39d0ec and 1667057851 (commit: 00c744e40be3a96f1fe7c377725703fc7b9ca3e3)
  7. [SPARK-23072][SQL][TEST] Add a Unicode schema test for file-based data sources (commit: 8ef323c572cee181e3bdbddeeb7119eda03d78f4)
  8. [SPARK-23062][SQL] Improve EXCEPT documentation (commit: bfbc2d41b8a9278b347b6df2d516fe4679b41076)
  9. [SPARK-21783][SQL] Turn on ORC filter push-down by default (commit: cbb6bda437b0d2832496b5c45f8264e5527f1cce)
  10. [SPARK-23079][SQL] Fix query constraints propagation with aliases (commit: aae73a21a42fa366a09c2be1a4b91308ef211beb)
Commit 833a584bb675bf7c643bde62eb51963814c12384 by joseph
[SPARK-23045][ML][SPARKR] Update RFormula to use OneHotEncoderEstimator.
## What changes were proposed in this pull request?
RFormula should use VectorSizeHint and OneHotEncoderEstimator in its
pipeline, to avoid the deprecated OneHotEncoder and to ensure the
model it produces can be used in streaming.
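For illustration, a minimal sketch of the two stages this change wires into RFormula's internal pipeline; the column names and vector size here are hypothetical, not RFormula's internal ones:
```scala
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, VectorSizeHint}

// Unlike the deprecated OneHotEncoder, OneHotEncoderEstimator is fit on the
// data, so the resulting model records the category sizes and can be applied
// to streaming input.
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))

// VectorSizeHint declares a vector column's size up front; streaming
// pipelines need this because they cannot scan the data to infer it.
val sizeHint = new VectorSizeHint()
  .setInputCol("features")
  .setSize(3)
```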
## How was this patch tested?
Unit tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes #20229 from MrBago/rFormula.
(cherry picked from commit 4371466b3f06ca171b10568e776c9446f7bae6dd)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
(commit: 833a584bb675bf7c643bde62eb51963814c12384)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala
The file was modified R/pkg/R/mllib_utils.R
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
Commit 41d1a323c0e876352060d81f8c281a1565619ca1 by gatorsmile
[SPARK-23095][SQL] Decorrelation of scalar subquery fails with
java.util.NoSuchElementException
## What changes were proposed in this pull request?
The following SQL, which involves a correlated scalar subquery, fails with a map lookup exception (`java.util.NoSuchElementException`):
```sql
SELECT t1a
FROM   t1
WHERE  t1a = (SELECT count(*)
              FROM   t2
              WHERE  t2c = t1c
              HAVING count(*) >= 1)
```
```
key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378)
    at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430)
    at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426)
```
In this case, after statically evaluating the HAVING clause `count(*) >= 1`
against the aggregation result on empty input, where `count(*)` evaluates
to 0 and the predicate is therefore false, we determine that this query
does not have the COUNT bug, and `evalSubqueryOnZeroTups` should simply
return an empty value.
## How was this patch tested?
A new test was added in the Subquery bucket.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #20283 from dilipbiswal/scalar-count-defect.
(cherry picked from commit 0c2ba427bc7323729e6ffb34f1f06a97f0bf0c1d)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 41d1a323c0e876352060d81f8c281a1565619ca1)
The file was modified sql/core/src/test/resources/sql-tests/results/subquery/scalar-subquery/scalar-subquery-predicate.sql.out
The file was modified sql/core/src/test/resources/sql-tests/inputs/subquery/scalar-subquery/scalar-subquery-predicate.sql
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
Commit 08252bb38e48a940b4d8fb975fe12a020fe36b97 by gatorsmile
[SPARK-22361][SQL][TEST] Add unit test for Window Frames
## What changes were proposed in this pull request?
There are already quite a few integration tests that use window frames, but
unit test coverage is not ideal.
This PR reorganizes and extends the existing tests, and adds new cases
where gaps were found.
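For reference, a minimal sketch of the kind of window-frame expression these tests exercise; `df` is an assumed input DataFrame, and the column names are hypothetical:
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// A rows-based frame covering the two preceding rows and the current row.
val frame = Window.partitionBy("k").orderBy("v")
  .rowsBetween(-2, Window.currentRow)

// Running sum over the frame, per partition of "k".
val withRunningSum = df.withColumn("runningSum", sum(col("v")).over(frame))
```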
## How was this patch tested?
Automated: passes Jenkins.
Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Closes #20019 from gaborgsomogyi/SPARK-22361.
(cherry picked from commit a9b845ebb5b51eb619cfa7d73b6153024a6a420d)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 08252bb38e48a940b4d8fb975fe12a020fe36b97)
The file was added sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ExpressionParserSuite.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala
Commit 0a441d2edb0a3f6c6c7c370db8917e1c07f211e7 by tathagata.das1565
[SPARK-22908][SS] Roll forward continuous processing Kafka support with
fix to continuous Kafka data reader
## What changes were proposed in this pull request?
The Kafka reader is now interruptible and can close itself.
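For context, a hedged sketch of the Kafka-to-Kafka continuous query this roll-forward re-enables; the broker address, topic names, and checkpoint path are placeholders, and a SparkSession named `spark` is assumed:
```scala
import org.apache.spark.sql.streaming.Trigger

// Continuous processing is experimental in Spark 2.3; Trigger.Continuous
// selects it, and the Kafka source/sink pair is the path this PR fixes.
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "input-topic")
  .load()
  .selectExpr("key", "value") // the Kafka sink writes the key/value columns
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/checkpoint")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```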
## How was this patch tested?
I locally ran one of the ContinuousKafkaSourceSuite tests in a tight
loop. Before the fix, my machine ran out of open file descriptors a few
iterations in; now it works fine.
Author: Jose Torres <jose@databricks.com>
Closes #20253 from jose-torres/fix-data-reader.
(cherry picked from commit 16670578519a7b787b0c63888b7d2873af12d5b9)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
(commit: 0a441d2edb0a3f6c6c7c370db8917e1c07f211e7)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
The file was added external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochCoordinator.scala
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceOffset.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
The file was added external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousWriter.scala
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriteTask.scala
The file was added external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala
The file was added external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousTest.scala
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
The file was added external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousDataSourceRDDIter.scala
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala
Commit b9339eee1304c0309be4ea74f8cdc3d37a8048d3 by sameerag
Revert "[SPARK-23020][CORE] Fix races in launcher code, test."
This reverts commit 20c69816a63071b82b1035d4b48798c358206421.
(commit: b9339eee1304c0309be4ea74f8cdc3d37a8048d3)
The file was modified launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java
The file was modified launcher/src/main/java/org/apache/spark/launcher/InProcessAppHandle.java
The file was modified launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
The file was modified launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java
The file was modified launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java
The file was modified core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java
The file was modified launcher/src/test/java/org/apache/spark/launcher/BaseSuite.java
The file was modified launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
Commit 00c744e40be3a96f1fe7c377725703fc7b9ca3e3 by zsxwing
Fix merge between 07ae39d0ec and 1667057851
## What changes were proposed in this pull request?
The first commit added a new test, and the second refactored the class
the test was in. The automatic merge put the test in the wrong place.
## How was this patch tested?
-
Author: Jose Torres <jose@databricks.com>
Closes #20289 from jose-torres/fix.
(cherry picked from commit a963980a6d2b4bef2c546aa33acf0aa501d2507b)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
(commit: 00c744e40be3a96f1fe7c377725703fc7b9ca3e3)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
Commit 8ef323c572cee181e3bdbddeeb7119eda03d78f4 by wenchen
[SPARK-23072][SQL][TEST] Add a Unicode schema test for file-based data
sources
## What changes were proposed in this pull request?
After [SPARK-20682](https://github.com/apache/spark/pull/19651), Apache
Spark 2.3 is able to read ORC files with a Unicode schema; previously, this
raised an `org.apache.spark.sql.catalyst.parser.ParseException`.
This PR adds a Unicode schema test for the CSV/JSON/ORC/Parquet file-based
data sources. Note that the TEXT data source only has [a single column with
a fixed name
'value'](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L71).
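As a hypothetical illustration of the case the new suite covers — a schema round trip with a non-ASCII column name — where the path and column names are placeholders and a SparkSession named `spark` is assumed:
```scala
import spark.implicits._

val path = "/tmp/unicode-schema-orc"
// Write and read back a schema containing a Unicode column name.
Seq((1, 2)).toDF("id", "колонка").write.orc(path)
spark.read.orc(path).printSchema() // the Unicode column name round-trips
```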
## How was this patch tested?
Pass the newly added test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20266 from dongjoon-hyun/SPARK-23072.
(cherry picked from commit a0aedb0ded4183cc33b27e369df1cbf862779e26)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 8ef323c572cee181e3bdbddeeb7119eda03d78f4)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Commit bfbc2d41b8a9278b347b6df2d516fe4679b41076 by gatorsmile
[SPARK-23062][SQL] Improve EXCEPT documentation
## What changes were proposed in this pull request?
Make the default behavior of EXCEPT (i.e. EXCEPT DISTINCT) more explicit
in the documentation, and call out the change in behavior from 1.x.
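A small sketch of the documented default, using the Dataset API (a SparkSession named `spark` is assumed to provide the implicits):
```scala
import spark.implicits._

val df1 = Seq(1, 1, 2, 3).toDF("c")
val df2 = Seq(3).toDF("c")
// except() is EXCEPT DISTINCT: each qualifying row appears at most once,
// so the duplicate 1 in df1 is collapsed in the result.
df1.except(df2).show() // rows with c = 1 and c = 2
```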
Author: Henry Robinson <henry@cloudera.com>
Closes #20254 from henryr/spark-23062.
(cherry picked from commit 1f3d933e0bd2b1e934a233ed699ad39295376e71)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: bfbc2d41b8a9278b347b6df2d516fe4679b41076)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
The file was modified python/pyspark/sql/dataframe.py
The file was modified R/pkg/R/DataFrame.R
Commit cbb6bda437b0d2832496b5c45f8264e5527f1cce by wenchen
[SPARK-21783][SQL] Turn on ORC filter push-down by default
## What changes were proposed in this pull request?
ORC filter push-down is disabled by default from the beginning,
[SPARK-2883](https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149
).
Now, Apache Spark starts to depend on Apache ORC 1.4.1. For Apache Spark
2.3, this PR turns on ORC filter push-down by default like Parquet
([SPARK-9207](https://issues.apache.org/jira/browse/SPARK-21783)) as a
part of
[SPARK-20901](https://issues.apache.org/jira/browse/SPARK-20901),
"Feature parity for ORC with Parquet".
## How was this patch tested?
Pass the existing tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20265 from dongjoon-hyun/SPARK-21783.
(cherry picked from commit 0f8a28617a0742d5a99debfbae91222c2e3b5cec)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: cbb6bda437b0d2832496b5c45f8264e5527f1cce)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
Commit aae73a21a42fa366a09c2be1a4b91308ef211beb by wenchen
[SPARK-23079][SQL] Fix query constraints propagation with aliases
## What changes were proposed in this pull request?
Previously, PR #19201 fixed the problem of non-converging constraints.
After that, PR #19149 improved the loop so that constraints are inferred
only once, so the problem of non-converging constraints is gone.
However, the case below fails:
```
spark.range(5).write.saveAsTable("t")
val t = spark.read.table("t")
val left = t.withColumn("xid", $"id" + lit(1)).as("x")
val right = t.withColumnRenamed("id", "xid").as("y")
val df = left.join(right, "xid").filter("id = 3").toDF()
checkAnswer(df, Row(4, 3))
```
This happens because `aliasMap` replaces all of the aliased children; see
the test case in the PR for details.
This PR fixes the bug by removing the now-useless code that prevented
non-converging constraints. It could also be fixed with #20270, but this
approach is much simpler and cleans up the code.
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes #20278 from gengliangwang/FixConstraintSimple.
(cherry picked from commit 8598a982b4147abe5f1aae005fea0fd5ae395ac4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: aae73a21a42fa366a09c2be1a4b91308ef211beb)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/InferFiltersFromConstraintsSuite.scala