SuccessChanges

Summary

  1. [PYTHON] Fix typo in serializer exception (commit: 7f1708a44759724b116742683e2d4290362a3b59) (details)
  2. revert [SPARK-21743][SQL] top-most limit should not cause memory leak (commit: d3255a57109a5cea79948aa4192008b988961aa3) (details)
  3. [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1 (commit: a7d378e78d73503d4d1ad37d94641200a9ea1b2d) (details)
  4. [SPARK-24452][SQL][CORE] Avoid possible overflow in int add or multiply (commit: d42610440ac2e58ef77fcf42ad81ee4fdf5691ba) (details)
Commit 7f1708a44759724b116742683e2d4290362a3b59 by hyukjinkwon
[PYTHON] Fix typo in serializer exception
## What changes were proposed in this pull request?
Fix typo in exception raised in Python serializer
## How was this patch tested?
No code changes
Author: Ruben Berenguel Montoro <ruben@mostlymaths.net>
Closes #21566 from rberenguel/fix_typo_pyspark_serializers.
(cherry picked from commit 6567fc43aca75b41900cde976594e21c8b0ca98a)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(commit: 7f1708a44759724b116742683e2d4290362a3b59)
The file was modified python/pyspark/serializers.py (diff)
Commit d3255a57109a5cea79948aa4192008b988961aa3 by hvanhovell
revert [SPARK-21743][SQL] top-most limit should not cause memory leak
## What changes were proposed in this pull request?
There is a performance regression in Spark 2.3. When we read a big
compressed text file that is unsplittable (e.g. gz) and then take the
first record, Spark scans all the data in the text file, which is very
slow. For example, with `spark.read.text("/tmp/test.csv.gz").head(1)`,
the SQL UI shows that the file is fully scanned.
![image](https://user-images.githubusercontent.com/3182036/41445252-264b1e5a-6ffd-11e8-9a67-4c31d129a314.png)
This was introduced by #18955 , which adds a LocalLimit to the query when
executing `Dataset.head`. The fundamental problem is that `Limit` is not
well whole-stage-codegened: it keeps consuming the input even after we
have already hit the limit.
However, if we just fix LIMIT whole-stage-codegen, the memory leak test
will fail, as we would no longer fully consume the inputs that trigger
the resource cleanup.
To fix it completely, we should:
1. fix LIMIT whole-stage-codegen to stop consuming inputs after hitting
the limit;
2. in whole-stage-codegen, provide a way to release the resources of the
parent operator, and apply it in LIMIT;
3. automatically release resources when the task ends.
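The difference between the current behavior and step 1 above can be sketched in plain Java (this is not Spark's generated code; the class, method names, and sizes are hypothetical, purely for illustration):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class LimitSketch {
    // Current behavior (simplified): the limit operator keeps pulling
    // from the upstream iterator even after it has collected n rows.
    static List<Integer> naiveLimit(Iterator<Integer> input, int n, int[] consumed) {
        List<Integer> out = new ArrayList<>();
        while (input.hasNext()) {              // scans the whole input
            int row = input.next();
            consumed[0]++;
            if (out.size() < n) out.add(row);
        }
        return out;
    }

    // Step 1 of the proposed fix (simplified): stop consuming once the
    // limit is hit. Upstream resources then still need explicit cleanup,
    // which is what steps 2 and 3 address.
    static List<Integer> earlyExitLimit(Iterator<Integer> input, int n, int[] consumed) {
        List<Integer> out = new ArrayList<>();
        while (out.size() < n && input.hasNext()) {
            out.add(input.next());
            consumed[0]++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) data.add(i);

        int[] consumedNaive = {0};
        naiveLimit(data.iterator(), 1, consumedNaive);
        int[] consumedEarly = {0};
        earlyExitLimit(data.iterator(), 1, consumedEarly);

        System.out.println(consumedNaive[0]);  // 1000: whole input consumed
        System.out.println(consumedEarly[0]);  // 1: stops after the limit
    }
}
```

The trade-off described above falls out of this sketch: the early-exit version no longer drains the upstream iterator, so any cleanup that was piggybacked on full consumption has to happen some other way.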
However, this is a non-trivial change and is risky to backport to Spark
2.3.
This PR proposes to revert #18955 in Spark 2.3. The memory leak is not a
big issue: when a task ends, Spark releases all the pages allocated by
that task, which frees most of the resources.
I'll submit an exhaustive fix to master later.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #21573 from cloud-fan/limit.
(commit: d3255a57109a5cea79948aa4192008b988961aa3)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
Commit a7d378e78d73503d4d1ad37d94641200a9ea1b2d by vanzin
[SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
The PR updates the 2.3 version tested to the new release, 2.3.1.
Tested with existing UTs.
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21543 from mgaido91/patch-1.
(cherry picked from commit 3bf76918fb67fb3ee9aed254d4fb3b87a7e66117)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(commit: a7d378e78d73503d4d1ad37d94641200a9ea1b2d)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala (diff)
Commit d42610440ac2e58ef77fcf42ad81ee4fdf5691ba by wenchen
[SPARK-24452][SQL][CORE] Avoid possible overflow in int add or multiply
This PR fixes possible overflow in int addition or multiplication. In
particular, the overflows in multiplication were detected by
[Spotbugs](https://spotbugs.github.io/).
The following assignments may overflow on the right-hand side, making
the result negative.
```
long = int * int
long = int + int
```
To avoid this problem, this PR casts from int to long on the right-hand
side.
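The problem and the fix can be shown with a self-contained example (the variable names and sizes below are made up for illustration, not taken from the patched files):

```java
public class OverflowSketch {
    public static void main(String[] args) {
        int capacity = 300_000;     // hypothetical sizes for illustration
        int elementSize = 10_000;

        // Bug pattern: both operands are int, so the multiplication is
        // done in 32-bit arithmetic and wraps around BEFORE the result
        // is widened to long.
        long wrong = capacity * elementSize;

        // Fix: cast one operand to long first, so the multiplication
        // itself is done in 64-bit arithmetic.
        long right = (long) capacity * elementSize;

        System.out.println(wrong);  // -1294967296 (wrapped, negative)
        System.out.println(right);  // 3000000000
    }
}
```

The same pattern applies to `int + int` assigned to a `long`: casting one operand to `long` before the operation forces the addition to be performed in 64-bit arithmetic.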
Tested with existing UTs.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #21481 from kiszk/SPARK-24452.
(cherry picked from commit 90da7dc241f8eec2348c0434312c97c116330bc4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: d42610440ac2e58ef77fcf42ad81ee4fdf5691ba)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java (diff)
The file was modified streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManager.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java (diff)
The file was modified core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (diff)