SuccessChanges

Summary

  1. add one supported type missing from the javadoc (commit: c7c0b086a0b18424725433ade840d5121ac2b86e) (details)
  2. [SPARK-24573][INFRA] Runs SBT checkstyle after the build to work around (commit: b0a935255951280b49c39968f6234163e2f0e379) (details)
  3. [SPARK-23772][SQL] Provide an option to ignore column of all null values (commit: e219e692ef70c161f37a48bfdec2a94b29260004) (details)
  4. [SPARK-24526][BUILD][TEST-MAVEN] Spaces in the build dir causes failures (commit: bce177552564a4862bc979d39790cf553a477d74) (details)
  5. [SPARK-24548][SQL] Fix incorrect schema of Dataset with tuple encoders (commit: 8f225e055c2031ca85d61721ab712170ab4e50c1) (details)
  6. [SPARK-24478][SQL][FOLLOWUP] Move projection and filter push down to (commit: 1737d45e08a5f1fb78515b14321721d7197b443a) (details)
  7. [SPARK-24542][SQL] UDF series UDFXPathXXXX allow users to pass carefully (commit: 9a75c18290fff7d116cf88a44f9120bf67d8bd27) (details)
  8. [SPARK-24521][SQL][TEST] Fix ineffective test in CachedTableSuite (commit: a78a9046413255756653f70165520efd486fb493) (details)
  9. [SPARK-24556][SQL] Always rewrite output partitioning in (commit: 9dbe53eb6bb5916d28000f2c0d646cf23094ac11) (details)
  10. [SPARK-24534][K8S] Bypass non spark-on-k8s commands (commit: 13092d733791b19cd7994084178306e0c449f2ed) (details)
  11. [SPARK-24565][SS] Add API for in Structured Streaming for exposing (commit: 2cb976355c615eee4ebd0a86f3911fa9284fccf6) (details)
  12. [SPARK-24583][SQL] Wrong schema type in InsertIntoDataSourceCommand (commit: bc0498d5820ded2b428277e396502e74ef0ce36d) (details)
  13. [SPARK-23778][CORE] Avoid unneeded shuffle when union gets an empty RDD (commit: bc111463a766a5619966a282fbe0fec991088ceb) (details)
  14. [MINOR][SQL] Remove invalid comment from SparkStrategies (commit: c8ef9232cf8b8ef262404b105cea83c1f393d8c3) (details)
Commit c7c0b086a0b18424725433ade840d5121ac2b86e by rxin
add one supported type missing from the javadoc
## What changes were proposed in this pull request?
The supported `java.math.BigInteger` type is not mentioned in the javadoc
of `Encoders.bean()`.
## How was this patch tested?
Javadoc fix only.
Author: James Yu <james@ispot.tv>
Closes #21544 from yuj/master.
(commit: c7c0b086a0b18424725433ade840d5121ac2b86e)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala (diff)
Commit b0a935255951280b49c39968f6234163e2f0e379 by hyukjinkwon
[SPARK-24573][INFRA] Runs SBT checkstyle after the build to work around
a side-effect
## What changes were proposed in this pull request?
Checkstyle seems to affect the build in the PR builder in Jenkins. I can't
reproduce it locally; it seems it can only be reproduced in the PR builder.
I was checking the places it goes through, and this is just a speculation
that checkstyle's compilation in SBT has a side effect on the assembly
build.
This PR proposes to run the SBT checkstyle after the build.
## How was this patch tested?
Jenkins tests.
Author: hyukjinkwon <gurwls223@apache.org>
Closes #21579 from HyukjinKwon/investigate-javastyle.
(commit: b0a935255951280b49c39968f6234163e2f0e379)
The file was modified dev/run-tests.py (diff)
Commit e219e692ef70c161f37a48bfdec2a94b29260004 by hyukjinkwon
[SPARK-23772][SQL] Provide an option to ignore column of all null values
or empty array during JSON schema inference
## What changes were proposed in this pull request?
This PR adds a new JSON option `dropFieldIfAllNull` to ignore columns
whose values are all null, or all empty arrays/structs, during JSON schema
inference.
## How was this patch tested?
Added tests in `JsonSuite`.
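The effect of the option can be sketched conceptually. Below is a minimal pure-Python sketch of the inference behavior described above — not Spark's implementation; `inferred_fields` and its toy return value (a sorted list of kept field names) are illustrative only:

```python
import json

def inferred_fields(records, drop_field_if_all_null=True):
    # Conceptual sketch of dropFieldIfAllNull (not Spark's implementation):
    # during schema inference, drop any field whose value is null or an
    # empty array in every record.
    keys = {k for rec in records for k in rec}
    def all_null(key):
        return all(rec.get(key) is None or rec.get(key) == [] for rec in records)
    return sorted(k for k in keys if not (drop_field_if_all_null and all_null(k)))

rows = [json.loads(s) for s in (
    '{"a": 1, "b": null, "c": []}',
    '{"a": 2, "b": null, "c": []}',
)]
```

With the option on, only `a` survives inference; with it off, `b` and `c` are kept as all-null columns.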
Author: Takeshi Yamamuro <yamamuro@apache.org>
Author: Xiangrui Meng <meng@databricks.com>
Closes #20929 from maropu/SPARK-23772.
(commit: e219e692ef70c161f37a48bfdec2a94b29260004)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modified python/pyspark/sql/readwriter.py (diff)
Commit bce177552564a4862bc979d39790cf553a477d74 by hyukjinkwon
[SPARK-24526][BUILD][TEST-MAVEN] Spaces in the build dir causes failures
in the build/mvn script
## What changes were proposed in this pull request?
Fix the call to ${MVN_BIN} to be wrapped in quotes so it will handle
having spaces in the path.
## How was this patch tested?
Ran the following to confirm that the build/mvn tool now works without
error when the build dir contains a space:
```
mkdir /tmp/test\ spaces
cd /tmp/test\ spaces
git clone https://github.com/apache/spark.git
cd spark
# Remove all mvn references in PATH so the script will download mvn to the local dir
./build/mvn -DskipTests clean package
```
Author: trystanleftwich <trystan@atscale.com>
Closes #21534 from trystanleftwich/SPARK-24526.
(commit: bce177552564a4862bc979d39790cf553a477d74)
The file was modified build/mvn (diff)
Commit 8f225e055c2031ca85d61721ab712170ab4e50c1 by wenchen
[SPARK-24548][SQL] Fix incorrect schema of Dataset with tuple encoders
## What changes were proposed in this pull request?
When creating tuple expression encoders, we should give the serializer
expressions of tuple items correct names, so we can have correct output
schema when we use such tuple encoders.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #21576 from viirya/SPARK-24548.
(commit: 8f225e055c2031ca85d61721ab712170ab4e50c1)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala (diff)
The file was modified sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java (diff)
Commit 1737d45e08a5f1fb78515b14321721d7197b443a by wenchen
[SPARK-24478][SQL][FOLLOWUP] Move projection and filter push down to
physical conversion
## What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/21503, to
completely move operator pushdown to the planner rule.
The code is mostly from https://github.com/apache/spark/pull/21319.
## How was this patch tested?
Existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #21574 from cloud-fan/followup.
(commit: 1737d45e08a5f1fb78515b14321721d7197b443a)
The file was modified sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportStatistics.java (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
Commit 9a75c18290fff7d116cf88a44f9120bf67d8bd27 by wenchen
[SPARK-24542][SQL] UDF series UDFXPathXXXX allow users to pass carefully
crafted XML to access arbitrary files
## What changes were proposed in this pull request?
The UDFXPathXXXX UDF series allows users to pass carefully crafted XML to
access arbitrary files. Spark does not have built-in access control. When
users use an external access control library, they might bypass it and
access the file contents.
This PR basically ports the Hive fix to Apache Spark:
https://issues.apache.org/jira/browse/HIVE-18879
## How was this patch tested?
A unit test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes #21549 from gatorsmile/xpathSecurity.
(commit: 9a75c18290fff7d116cf88a44f9120bf67d8bd27)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/xml/UDFXPathUtilSuite.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/xml/UDFXPathUtil.java (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/xml/XPathExpressionSuite.scala (diff)
Commit a78a9046413255756653f70165520efd486fb493 by gatorsmile
[SPARK-24521][SQL][TEST] Fix ineffective test in CachedTableSuite
## What changes were proposed in this pull request?
test("withColumn doesn't invalidate cached dataframe") in
CachedTableSuite doesn't work because the UDF is executed and the test
count incremented when `df.cache()` is called, so the subsequent
`df.collect()` has no effect on the test result.
This PR fixes this test and adds another test for caching UDFs.
## How was this patch tested?
Add new tests.
Author: Li Jin <ice.xelloss@gmail.com>
Closes #21531 from icexelloss/fix-cache-test.
(commit: a78a9046413255756653f70165520efd486fb493)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala (diff)
Commit 9dbe53eb6bb5916d28000f2c0d646cf23094ac11 by wenchen
[SPARK-24556][SQL] Always rewrite output partitioning in
ReusedExchangeExec and InMemoryTableScanExec
## What changes were proposed in this pull request?
Currently, ReusedExchange and InMemoryTableScanExec only rewrite the
output partitioning if the child's partitioning is HashPartitioning, and
do nothing for other partitionings, e.g., RangePartitioning. We should
always rewrite it; otherwise, an unnecessary shuffle could be introduced,
as in https://issues.apache.org/jira/browse/SPARK-24556.
## How was this patch tested?
Add new tests.
Author: yucai <yyu1@ebay.com>
Closes #21564 from yucai/SPARK-24556.
(commit: 9dbe53eb6bb5916d28000f2c0d646cf23094ac11)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/Exchange.scala (diff)
Commit 13092d733791b19cd7994084178306e0c449f2ed by eerlands
[SPARK-24534][K8S] Bypass non spark-on-k8s commands
## What changes were proposed in this pull request?
This PR changes entrypoint.sh to provide an option to run non
spark-on-k8s commands (init, driver, executor), letting the user keep the
normal workflow without hacking the image to bypass the entrypoint.
## How was this patch tested?
This patch was built manually on my local machine and I ran some tests
with a combination of `docker run` commands.
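The bypass pattern can be sketched in shell. This is a hypothetical sketch of the dispatch shape only, not the actual entrypoint.sh; the `entrypoint` function and its echo output are illustrative:

```shell
#!/bin/sh
# Hypothetical sketch of the dispatch pattern: handle the known
# spark-on-k8s roles, and run any other command unchanged (the bypass).
entrypoint() {
  case "$1" in
    init|driver|executor)
      echo "spark-role:$1" ;;   # stand-in for the real role handling
    *)
      "$@" ;;                   # bypass: run the user's command as-is
  esac
}

entrypoint driver
entrypoint echo "hello from a non-spark command"
```

With the real entrypoint, this lets `docker run image bash` (or any other command) work without the image special-casing it.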
Author: rimolive <ricardo.martinelli.oliveira@gmail.com>
Closes #21572 from rimolive/rimolive-spark-24534.
(commit: 13092d733791b19cd7994084178306e0c449f2ed)
The file was modified resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh (diff)
Commit 2cb976355c615eee4ebd0a86f3911fa9284fccf6 by zsxwing
[SPARK-24565][SS] Add API for in Structured Streaming for exposing
output rows of each microbatch as a DataFrame
## What changes were proposed in this pull request?
Currently, the micro-batches in the MicroBatchExecution are not exposed
to the user through any public API. This was because we did not want to
expose the micro-batches, so that all the APIs we expose can eventually
be supported in the Continuous engine. But now that we have a better
sense of building a ContinuousExecution, I am considering adding APIs
which will run only in the MicroBatchExecution. I have quite a few use
cases where exposing the micro-batch output as a dataframe is useful.
- Pass the output rows of each batch to a library that is designed only
for batch jobs (for example, many ML libraries need to collect() while
learning).
- Reuse batch data sources for output whose streaming version does not
exist (e.g. the redshift data source).
- Write the output rows to multiple places by writing twice for each
batch. This is not the most elegant thing to do for multiple-output
streaming queries but is likely to be better than running two streaming
queries processing the same data twice.
The proposal is to add a method `foreachBatch(f: Dataset[T] => Unit)` to
Scala/Java/Python `DataStreamWriter`.
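The contract can be sketched in plain Python. This is a conceptual sketch of the callback shape only, not Spark's engine; `run_micro_batches` is a made-up driver standing in for MicroBatchExecution:

```python
# Conceptual sketch of the foreachBatch contract (not Spark's engine):
# the streaming engine invokes the user's callback once per micro-batch,
# passing that batch's rows and the batch id.
def run_micro_batches(batches, foreach_batch_fn):
    for batch_id, rows in enumerate(batches):
        foreach_batch_fn(rows, batch_id)

seen = []
run_micro_batches([["a", "b"], ["c"]],
                  lambda rows, batch_id: seen.append((batch_id, rows)))
```

The user-supplied callback can then call any batch-only API on each batch's rows, which is exactly the flexibility the use cases above need.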
## How was this patch tested?
New unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #21571 from tdas/foreachBatch.
(commit: 2cb976355c615eee4ebd0a86f3911fa9284fccf6)
The file was modified python/pyspark/streaming/context.py (diff)
The file was modified python/pyspark/sql/utils.py (diff)
The file was modified python/pyspark/sql/tests.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/ForeachBatchSinkSuite.scala
The file was modified python/pyspark/java_gateway.py (diff)
The file was modified python/pyspark/sql/streaming.py (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ForeachBatchSink.scala
Commit bc0498d5820ded2b428277e396502e74ef0ce36d by gatorsmile
[SPARK-24583][SQL] Wrong schema type in InsertIntoDataSourceCommand
## What changes were proposed in this pull request?
Change the insert input schema type from `insertRelationType` to
`insertRelationType.asNullable`, in order to avoid nullability being
overridden.
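The effect of `asNullable` can be sketched conceptually. Below is a pure-Python sketch over a toy schema representation — not Spark's StructType; the tuple encoding is illustrative only:

```python
# Toy schema: a list of (field_name, data_type, nullable) tuples.
def as_nullable(schema):
    # Conceptual sketch of asNullable: relax every field to nullable so the
    # relation's (possibly non-nullable) schema does not override the
    # nullability of the incoming data.
    return [(name, dtype, True) for name, dtype, _ in schema]

relation_schema = [("id", "int", False), ("name", "string", True)]
```

Using the relaxed schema for the insert input avoids stamping a non-nullable type onto data that may actually contain nulls.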
## How was this patch tested?
Added one test in InsertSuite.
Author: Maryann Xue <maryannxue@apache.org>
Closes #21585 from maryannxue/spark-24583.
(commit: bc0498d5820ded2b428277e396502e74ef0ce36d)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoDataSourceCommand.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala (diff)
Commit bc111463a766a5619966a282fbe0fec991088ceb by wenchen
[SPARK-23778][CORE] Avoid unneeded shuffle when union gets an empty RDD
## What changes were proposed in this pull request?
When a `union` is invoked on several RDDs of which one is an empty RDD,
the result of the operation is a `UnionRDD`. This causes an unneeded
extra shuffle when all the other RDDs have the same partitioning.
The PR ignores incoming empty RDDs in the union method.
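The idea can be sketched over plain lists. This is a conceptual sketch of the fix's shape, not Spark's RDD machinery; `union` here is a toy stand-in for `SparkContext.union`:

```python
def union(rdds):
    # Conceptual sketch of the SPARK-23778 fix: drop empty inputs first.
    # If exactly one non-empty input remains, return it unchanged, so no
    # union wrapper (and hence no extra shuffle) is introduced.
    non_empty = [r for r in rdds if r]
    if len(non_empty) == 1:
        return non_empty[0]
    return [x for r in non_empty for x in r]

a = [1, 2, 3]
```

Returning the surviving input as-is is what preserves its existing partitioning in the real fix.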
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21333 from mgaido91/SPARK-23778.
(commit: bc111463a766a5619966a282fbe0fec991088ceb)
The file was modified core/src/main/scala/org/apache/spark/SparkContext.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala (diff)
Commit c8ef9232cf8b8ef262404b105cea83c1f393d8c3 by hvanhovell
[MINOR][SQL] Remove invalid comment from SparkStrategies
## What changes were proposed in this pull request?
This patch removes an invalid comment from SparkStrategies, given that
such a TODO-like comment is no longer preferred, as discussed here:
https://github.com/apache/spark/pull/21388#issuecomment-396856235
Removing the invalid comment will prevent contributors from spending time
on work that is not going to be merged.
## How was this patch tested?
N/A
Author: Jungtaek Lim <kabhwan@gmail.com>
Closes #21595 from
HeartSaVioR/MINOR-remove-invalid-comment-on-spark-strategies.
(commit: c8ef9232cf8b8ef262404b105cea83c1f393d8c3)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)