Success: Changes

Summary

  1. [SPARK-24369][SQL] Correct handling for multiple distinct aggregations (commit: 66289a3e067769cb8ed35953187f6363463791e1)
  2. [SPARK-23754][BRANCH-2.3][PYTHON] Re-raising StopIteration in client (commit: e1c0ab16c71f102bfd9f5133647d168e49ae06bc)
  3. [SPARK-24384][PYTHON][SPARK SUBMIT] Add .py files correctly into (commit: 3a024a4db5b531025fbd7761bccf2525f83f4234)
Commit 66289a3e067769cb8ed35953187f6363463791e1 by wenchen
[SPARK-24369][SQL] Correct handling for multiple distinct aggregations
having the same argument set
## What changes were proposed in this pull request?
This PR fixes an issue with multiple distinct aggregations that have the same argument set, e.g.,
```
scala> :paste
val df = sql(
  s"""SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
     | FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y)
   """.stripMargin)

java.lang.RuntimeException: You hit a query analyzer bug. Please report
your query to Spark user mailing list.
```
The root cause is that `RewriteDistinctAggregates` cannot detect multiple distinct aggregations if they have the same argument set. This PR modifies the code so that `RewriteDistinctAggregates` counts the number of aggregate expressions with `isDistinct=true`.
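For reference, a minimal PySpark reproduction of the same query (a hypothetical local session, not part of the patch); before this fix it hit the analyzer error shown above:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Two distinct aggregates over the same argument set {x, y}.
spark.sql("""
    SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
    FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y)
""").show()  # raised "You hit a query analyzer bug" before this fix
```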
## How was this patch tested?
Added tests in `DataFrameAggregateSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes #21443 from maropu/SPARK-24369.
(cherry picked from commit 1e46f92f956a00d04d47340489b6125d44dbd47b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 66289a3e067769cb8ed35953187f6363463791e1)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
The file was modified: sql/core/src/test/resources/sql-tests/inputs/group-by.sql
The file was modified: sql/core/src/test/resources/sql-tests/results/group-by.sql.out
Commit e1c0ab16c71f102bfd9f5133647d168e49ae06bc by hyukjinkwon
[SPARK-23754][BRANCH-2.3][PYTHON] Re-raising StopIteration in client
code
## What changes are proposed
Make sure that `StopIteration`s raised in users' code do not silently interrupt processing by Spark, but are raised as exceptions to the users. The users' functions are wrapped in `safe_iter` (in `shuffle.py`), which re-raises `StopIteration`s as `RuntimeError`s.
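A minimal sketch of that wrapping pattern, assuming the helper name `safe_iter` from the commit message; the actual implementation in `shuffle.py` may differ in detail:
```python
def safe_iter(f):
    """Wrap a user function so a StopIteration it raises is re-raised as a
    RuntimeError, failing the task instead of silently ending iteration."""
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except StopIteration as exc:
            raise RuntimeError("StopIteration raised from user code", exc)
    return wrapper

# Usage: Spark would wrap the user's function before iterating over its results.
# safe_udf = safe_iter(user_supplied_function)
```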
## How were the changes tested
Unit tests, making sure that the exceptions are indeed raised. I am not sure how to check whether a `Py4JJavaError` contains my exception, so I simply looked for the exception message in the Java exception's `toString`. Can you propose a better way?
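The kind of check described there, as a hedged sketch (names here are illustrative, not the actual test code):
```python
from py4j.protocol import Py4JJavaError

def assert_user_exception_surfaces(run_job, expected_message):
    """Run a Spark job expected to fail and look for the original message in
    the Java exception's string form (its toString)."""
    try:
        run_job()
    except Py4JJavaError as e:
        assert expected_message in str(e.java_exception)
    else:
        raise AssertionError("expected the job to fail with Py4JJavaError")
```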
This is my original work, licensed in the same way as Spark.
---
Author: e-dorigatti <emilio.dorigatti@gmail.com>
Closes #21383 from e-dorigatti/fix_spark_23754.
(cherry picked from commit 0ebb0c0d4dd3e192464dc5e0e6f01efa55b945ed)
Author: e-dorigatti <emilio.dorigatti@gmail.com>
Closes #21463 from e-dorigatti/branch-2.3.
(commit: e1c0ab16c71f102bfd9f5133647d168e49ae06bc)
The file was modified: python/pyspark/rdd.py
The file was modified: python/pyspark/sql/tests.py
The file was modified: python/pyspark/sql/udf.py
The file was modified: python/pyspark/util.py
The file was modified: python/pyspark/shuffle.py
The file was modified: python/pyspark/tests.py
Commit 3a024a4db5b531025fbd7761bccf2525f83f4234 by vanzin
[SPARK-24384][PYTHON][SPARK SUBMIT] Add .py files correctly into
PythonRunner in submit with client mode in spark-submit
## What changes were proposed in this pull request?
Specifically, in client mode before context initialization, a .py file passed via `--py-files` doesn't work when the application is a Python file. See below:
```
$ cat /home/spark/tmp.py
def testtest():
    return 1
```
This works:
```
$ cat app.py
import pyspark
pyspark.sql.SparkSession.builder.getOrCreate()
import tmp
print("************************%s" % tmp.testtest())
$ ./bin/spark-submit --master yarn --deploy-mode client --py-files /home/spark/tmp.py app.py
...
************************1
```
but this doesn't:
```
$ cat app.py
import pyspark
import tmp
pyspark.sql.SparkSession.builder.getOrCreate()
print("************************%s" % tmp.testtest())
$ ./bin/spark-submit --master yarn --deploy-mode client --py-files /home/spark/tmp.py app.py
Traceback (most recent call last):
  File "/home/spark/spark/app.py", line 2, in <module>
    import tmp
ImportError: No module named tmp
```
### How did it happen?
In client mode specifically, the paths are added into `PythonRunner` as they are:
https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L430
https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L49-L88
The problem is that .py files shouldn't be added as they are, since `PYTHONPATH` expects a directory or an archive such as a zip or egg.
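A hypothetical demo of that constraint, outside Spark: Python's import machinery only searches `sys.path`/`PYTHONPATH` entries that are directories or archives, so a bare .py path contributes nothing:
```python
import sys

sys.path.insert(0, "/home/spark/tmp.py")  # a plain file: ignored by the importer
# import tmp  # -> ImportError: No module named tmp

sys.path.insert(0, "/home/spark")         # the containing directory: works
# import tmp  # -> succeeds
```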
### How does this PR fix it?
We shouldn't simply add a file's parent directory, because other files in that directory would then also be added into the `PYTHONPATH` in client mode before context initialization.
Therefore, we copy the .py files into a temp directory and add that directory to `PYTHONPATH`, as sketched below.
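A rough sketch of the idea in Python (the actual change is in `PythonRunner.scala`; the helper name and details here are illustrative):
```python
import os
import shutil
import tempfile

def build_python_path(py_files, existing=""):
    """Copy bare .py files into one temp directory and return a PYTHONPATH
    that lists that directory plus any archives (zips/eggs) as-is."""
    temp_dir = tempfile.mkdtemp(prefix="spark-pyfiles-")
    entries = []
    for path in py_files:
        if path.endswith(".py"):
            shutil.copy(path, temp_dir)  # becomes importable via temp_dir
        else:
            entries.append(path)         # archives may sit on PYTHONPATH directly
    entries.append(temp_dir)
    if existing:
        entries.append(existing)
    return os.pathsep.join(entries)

# e.g. os.environ["PYTHONPATH"] = build_python_path(["/home/spark/tmp.py"])
```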
## How was this patch tested?
Unit tests were added, and the change was manually tested in both standalone and YARN client modes with spark-submit.
Author: hyukjinkwon <gurwls223@apache.org>
Closes #21426 from HyukjinKwon/SPARK-24384.
(cherry picked from commit b142157dcc7f595eea93d66dda8b1d169a38d95c)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: 3a024a4db5b531025fbd7761bccf2525f83f4234)
The file was modified: core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala
The file was modified: resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala