SuccessChanges

Summary

  1. [SPARK-24976][PYTHON] Allow None for Decimal type conversion (specific to PyArrow 0.9.0) (details)
Commit 5b187a85a24c788e19742e03f1300662d475bab8 by cutlerb
[SPARK-24976][PYTHON] Allow None for Decimal type conversion (specific
to PyArrow 0.9.0)
## What changes were proposed in this pull request?
See [ARROW-2432](https://jira.apache.org/jira/browse/ARROW-2432). It
seems that using `from_pandas` to convert decimals fails if it
encounters a value of `None`:
```python
import pyarrow as pa
import pandas as pd
from decimal import Decimal

pa.Array.from_pandas(pd.Series([Decimal('3.14'), None]),
                     type=pa.decimal128(3, 2))
```
**Arrow 0.8.0**
```
<pyarrow.lib.Decimal128Array object at 0x10a572c58>
[
Decimal('3.14'),
NA
]
```
**Arrow 0.9.0**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
File "array.pxi", line 177, in pyarrow.lib.array
File "error.pxi", line 77, in pyarrow.lib.check_status
File "error.pxi", line 77, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error converting from Python objects to
Decimal: Got Python object of type NoneType but can only handle these
types: decimal.Decimal
```
This PR proposes to work around this by using `Decimal('NaN')`:
```python
pa.Array.from_pandas(pd.Series([Decimal('3.14'), Decimal('NaN')]),
                     type=pa.decimal128(3, 2))
```
```
<pyarrow.lib.Decimal128Array object at 0x10ffd2e68>
[
Decimal('3.14'),
NA
]
```
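In Spark itself, this substitution lands in the Arrow serialization path
of `python/pyspark/serializers.py`. The sketch below shows the idea; the
function name and the exact version guard are illustrative, not the
literal patch:
```python
from decimal import Decimal
from distutils.version import LooseVersion

import pandas as pd
import pyarrow as pa


def create_array_with_decimal_workaround(s, t):
    """Convert a pandas Series to a pyarrow Array, substituting
    Decimal('NaN') for None on decimal types (see ARROW-2432)."""
    mask = s.isnull()
    if (t is not None and pa.types.is_decimal(t)
            and LooseVersion(pa.__version__) >= LooseVersion("0.9.0")):
        # PyArrow 0.9.0 rejects NoneType during decimal conversion but
        # accepts Decimal('NaN'); the null mask still marks these slots
        # as NA in the resulting array.
        s = s.apply(lambda v: Decimal('NaN') if v is None else v)
    return pa.Array.from_pandas(s, mask=mask, type=t)


# Nulls now survive the conversion as NA instead of raising ArrowInvalid.
arr = create_array_with_decimal_workaround(
    pd.Series([Decimal('3.14'), None]), pa.decimal128(3, 2))
```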
## How was this patch tested?
Manually tested:
```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests ScalarPandasUDFTests
```
**Before**
```
Traceback (most recent call last):
File "/.../spark/python/pyspark/sql/tests.py", line 4672, in
test_vectorized_udf_null_decimal
   self.assertEquals(df.collect(), res.collect())
File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
   sock_info = self._jdf.collectToPython()
File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
line 1257, in __call__
   answer, self.gateway_client, self.target_id, self.name)
File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
   return f(*a, **kw)
File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line
328, in get_return_value
   format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o51.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0
in stage 1.0 (TID 7, localhost, executor driver):
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
File "/.../spark/python/pyspark/worker.py", line 320, in main
   process()
File "/.../spark/python/pyspark/worker.py", line 315, in process
   serializer.dump_stream(func(split_index, iterator), outfile)
File "/.../spark/python/pyspark/serializers.py", line 274, in
dump_stream
   batch = _create_batch(series, self._timezone)
File "/.../spark/python/pyspark/serializers.py", line 243, in
_create_batch
   arrs = [create_array(s, t) for s, t in series]
File "/.../spark/python/pyspark/serializers.py", line 241, in
create_array
   return pa.Array.from_pandas(s, mask=mask, type=t)
File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
File "array.pxi", line 177, in pyarrow.lib.array
File "error.pxi", line 77, in pyarrow.lib.check_status
File "error.pxi", line 77, in pyarrow.lib.check_status
ArrowInvalid: Error converting from Python objects to Decimal: Got
Python object of type NoneType but can only handle these types:
decimal.Decimal
```
**After**
```
Running tests...
----------------------------------------------------------------------
Setting default log level to "WARN". To adjust logging level use
sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
.......S.............................
----------------------------------------------------------------------
Ran 37 tests in 21.980s
```
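The fixed path can be exercised end to end with a minimal scalar
`pandas_udf` over a decimal column containing a null. This is a hedged
sketch, not the actual `test_vectorized_udf_null_decimal` from
`pyspark/sql/tests.py`:
```python
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.getOrCreate()

# A decimal column containing None goes through pa.Array.from_pandas
# in the Python worker, which previously raised ArrowInvalid.
df = spark.createDataFrame(
    [(Decimal('3.14'),), (None,)], schema='v decimal(38, 18)')

# Identity scalar pandas_udf: the returned pandas Series is converted
# back to Arrow on the way to the JVM.
identity = pandas_udf(lambda s: s, 'decimal(38, 18)')
df.select(identity(col('v'))).collect()
```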
Author: hyukjinkwon <gurwls223@apache.org>
Closes #21928 from HyukjinKwon/SPARK-24976.
(cherry picked from commit f4772fd26f32b11ae54e7721924b5cf6eb27298a)
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
The file was modified: python/pyspark/serializers.py (diff)