SuccessChanges

Summary

  1. [SPARK-23448][SQL] Clarify JSON and CSV parser behavior in document (commit: fe9cb4afe39944b394b39cc25622e311554ae187) (details)
  2. [SPARK-23508][CORE] Fix BlockManagerId in case blockManagerIdCache cause (commit: dfa43792feb78b4cc3776606b3a13eff3586fbb1) (details)
  3. [SPARK-23517][PYTHON] Make `pyspark.util._exception_message` produce the (commit: a4eb1e47ad2453b41ebb431272c92e1ac48bb310) (details)
Commit fe9cb4afe39944b394b39cc25622e311554ae187 by hyukjinkwon
[SPARK-23448][SQL] Clarify JSON and CSV parser behavior in document
## What changes were proposed in this pull request?
Clarify JSON and CSV reader behavior in the documentation.
JSON doesn't support partial results for corrupted records. CSV only
supports partial results for records with more or fewer tokens than expected.
## How was this patch tested?
Pass existing tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #20666 from viirya/SPARK-23448-2.
(cherry picked from commit b14993e1fcb68e1c946a671c6048605ab4afdf58)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: fe9cb4afe39944b394b39cc25622e311554ae187)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modified python/pyspark/sql/streaming.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala (diff)
The file was modified python/pyspark/sql/readwriter.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala (diff)
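The distinction the documentation clarifies can be illustrated with a plain-Python sketch (using the stdlib `json` and `csv` modules, not Spark itself; the function names here are made up for illustration): a corrupted JSON record yields no partial result at all, while a CSV record with fewer tokens than the schema still yields the tokens it does have.

```python
import csv
import io
import json

def parse_json_record(line):
    # JSON parsing is all-or-nothing: a corrupted record yields no
    # partial result, mirroring what the docs now state for Spark's
    # JSON reader (the whole record goes to the corrupt-record column).
    try:
        return json.loads(line)
    except ValueError:
        return None

def parse_csv_record(line, num_fields):
    # CSV tokens are positional, so a row with fewer (or more) tokens
    # than the schema still yields the tokens that are present; missing
    # trailing fields are padded with None, extras are dropped.
    tokens = next(csv.reader(io.StringIO(line)))
    tokens = tokens[:num_fields]
    return tokens + [None] * (num_fields - len(tokens))

print(parse_json_record('{"a": 1, "b": 2'))    # corrupted -> None
print(parse_csv_record('1,2', num_fields=3))   # partial -> ['1', '2', None]
```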
Commit dfa43792feb78b4cc3776606b3a13eff3586fbb1 by wenchen
[SPARK-23508][CORE] Fix BlockManagerId in case blockManagerIdCache causes
OOM
## What changes were proposed in this pull request?
blockManagerIdCache in BlockManagerId does not remove old values, which
may cause OOM:
`val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId,
BlockManagerId]()`
Whenever we create a new BlockManagerId, it is put into this map.
This patch uses a Guava cache for blockManagerIdCache instead.
A heap dump is shown in
[SPARK-23508](https://issues.apache.org/jira/browse/SPARK-23508).
## How was this patch tested? Existing tests.
Author: zhoukang <zhoukang199191@gmail.com>
Closes #20667 from caneGuy/zhoukang/fix-history.
(cherry picked from commit 6a8abe29ef3369b387d9bc2ee3459a6611246ab1)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: dfa43792feb78b4cc3776606b3a13eff3586fbb1)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala (diff)
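The leak and its fix can be sketched as a Python analogue (not the actual Scala code; the class and names are illustrative): an unbounded map, like the old `ConcurrentHashMap`, grows with every distinct key, while a size-bounded cache in the spirit of Guava's `CacheBuilder.maximumSize` evicts old entries.

```python
from collections import OrderedDict

class BoundedCache:
    """Size-bounded, LRU-evicting cache, analogous in spirit to the
    Guava cache the patch switches to. A plain dict (like the old
    ConcurrentHashMap) would grow without bound and exhaust the heap."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._data = OrderedDict()

    def get_or_put(self, key, value):
        # Loosely mirrors getCachedBlockManagerId: return the canonical
        # cached instance for `key`, inserting `value` if absent.
        if key in self._data:
            self._data.move_to_end(key)   # mark as recently used
            return self._data[key]
        self._data[key] = value
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the oldest entry
        return value

cache = BoundedCache(max_size=2)
for i in range(10):
    cache.get_or_put(("host%d" % i, 7077), i)
print(len(cache._data))  # stays bounded at 2, regardless of inserts
```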
Commit a4eb1e47ad2453b41ebb431272c92e1ac48bb310 by hyukjinkwon
[SPARK-23517][PYTHON] Make `pyspark.util._exception_message` produce the
trace from Java side by Py4JJavaError
## What changes were proposed in this pull request?
This PR proposes for `pyspark.util._exception_message` to produce the
trace from Java side by `Py4JJavaError`.
Currently, in Python 2, it uses the `message` attribute, which
`Py4JJavaError` does not happen to have:
```python
>>> from pyspark.util import _exception_message
>>> try:
...     sc._jvm.java.lang.String(None)
... except Exception as e:
...     pass
...
>>> e.message
''
```
It seems we should use `str` instead for now:

https://github.com/bartdag/py4j/blob/aa6c53b59027925a426eb09b58c453de02c21b7c/py4j-python/src/py4j/protocol.py#L412
but this doesn't address the problem with non-ASCII strings from the
Java side:
https://github.com/bartdag/py4j/issues/306
So, we could directly call `__str__()`:
```python
>>> e.__str__()
u'An error occurred while calling
None.java.lang.String.\n: java.lang.NullPointerException\n\tat
java.lang.String.<init>(String.java:588)\n\tat
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)\n\tat
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat
java.lang.reflect.Constructor.newInstance(Constructor.java:422)\n\tat
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)\n\tat
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat
py4j.Gateway.invoke(Gateway.java:238)\n\tat
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat
py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat
java.lang.Thread.run(Thread.java:745)\n'
```
which does not coerce unicode to `str` in Python 2.
This can actually be a problem:
```python
from pyspark.sql.functions import udf
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.range(1).select(udf(lambda x: [[]])()).toPandas()
```
**Before**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in
toPandas
   raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
RuntimeError: Note: toPandas attempted Arrow optimization because
'spark.sql.execution.arrow.enabled' is set to true. Please set it to
false to disable this.
```
**After**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in
toPandas
   raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
RuntimeError: An error occurred while calling
o47.collectAsArrowToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 7 in stage 0.0 failed 1 times, most recent failure: Lost task 7.0
in stage 0.0 (TID 7, localhost, executor driver):
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
File "/.../spark/python/pyspark/worker.py", line 245, in main
   process()
File "/.../spark/python/pyspark/worker.py", line 240, in process
... Note: toPandas attempted Arrow optimization because
'spark.sql.execution.arrow.enabled' is set to true. Please set it to
false to disable this.
```
## How was this patch tested?
Manually tested and unit tests were added.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20680 from HyukjinKwon/SPARK-23517.
(cherry picked from commit fab563b9bd1581112462c0fc0b299ad6510b6564)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: a4eb1e47ad2453b41ebb431272c92e1ac48bb310)
The file was modified python/pyspark/util.py (diff)
The file was modified python/pyspark/tests.py (diff)
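The fallback logic described above can be sketched in plain Python (an illustrative reimplementation, not the actual `pyspark.util._exception_message`; `FakePy4JJavaError` is a hypothetical stand-in mimicking Py4JJavaError's empty `message` attribute):

```python
def exception_message(excp):
    # Prefer a non-empty `message` attribute (Python 2 exceptions);
    # otherwise fall back to __str__(), which for Py4JJavaError carries
    # the Java-side stack trace.
    message = getattr(excp, "message", None)
    if message:
        return message
    return str(excp)

class FakePy4JJavaError(Exception):
    # Hypothetical stand-in: `message` is empty, as observed in the PR,
    # while __str__ returns the Java-side trace.
    message = ""
    def __str__(self):
        return "An error occurred while calling None.java.lang.String."

print(exception_message(FakePy4JJavaError()))
# -> An error occurred while calling None.java.lang.String.
```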