Failed > Changes

Summary

  1. [SPARK-30091][SQL][PYTHON] Document mergeSchema option directly in the PySpark Parquet APIs (commit: e766a323bc3462763b03f9d892a0b3fdf2cb29db)
  2. [SPARK-30113][SQL][PYTHON] Expose mergeSchema option in PySpark's ORC APIs (commit: c8922d9145a9bc60c0f423a6c1b7d4f0bfa2e585)
Commit e766a323bc3462763b03f9d892a0b3fdf2cb29db by gurwls223
[SPARK-30091][SQL][PYTHON] Document mergeSchema option directly in the
PySpark Parquet APIs
### What changes were proposed in this pull request?
This change properly documents the `mergeSchema` option directly in the
Python APIs for reading Parquet data.
### Why are the changes needed?
The docstring for `DataFrameReader.parquet()` mentions `mergeSchema`, but
the option isn't exposed in the method signature. It appears to be a simple oversight.
Before this PR, you'd have to do this to use `mergeSchema`:
```python
spark.read.option('mergeSchema', True).parquet('test-parquet').show()
```
After this PR, you can use the option as (I believe) it was intended to
be used:
```python
spark.read.parquet('test-parquet', mergeSchema=True).show()
```
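Mechanically, a change like this amounts to forwarding the new keyword into the reader's option map. Here is a minimal, runnable sketch of that pattern; `SimpleReader` and `_set_opt` are hypothetical stand-ins for PySpark's internals, not the actual diff:
```python
# Hypothetical stand-in for a PySpark-style reader; not the actual Spark code.
class SimpleReader:
    def __init__(self):
        self._options = {}

    def _set_opt(self, key, value):
        # Only record the option when it was explicitly passed, so leaving the
        # keyword at its default (None) keeps any session-level config in effect.
        if value is not None:
            self._options[key] = value

    def parquet(self, *paths, mergeSchema=None):
        self._set_opt('mergeSchema', mergeSchema)
        return f"reading {paths} with options {self._options}"

print(SimpleReader().parquet('test-parquet', mergeSchema=True))
# reading ('test-parquet',) with options {'mergeSchema': True}
```
The `None` check is what makes the precedence test below work: an unset keyword never clobbers `spark.sql.parquet.mergeSchema` set on the session.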
### Does this PR introduce any user-facing change?
Yes, this PR changes the signatures of `DataFrameReader.parquet()` and
`DataStreamReader.parquet()` to match their docstrings.
### How was this patch tested?
Testing the `mergeSchema` option directly seems to be left to the Scala
side of the codebase. I tested my change manually to confirm the API
works.
I also confirmed that setting `spark.sql.parquet.mergeSchema` at the
session level does not get overridden by leaving `mergeSchema` at its
default when calling `parquet()`:
```
>>> spark.conf.set('spark.sql.parquet.mergeSchema', True)
>>> spark.range(3).write.parquet('test-parquet/id')
>>> spark.range(3).withColumnRenamed('id', 'name').write.parquet('test-parquet/name')
>>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet').show()
+----+----+
|  id|name|
+----+----+
|null|   1|
|null|   2|
|null|   0|
|   1|null|
|   2|null|
|   0|null|
+----+----+
>>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=False).show()
+----+
|  id|
+----+
|null|
|null|
|null|
|   1|
|   2|
|   0|
+----+
```
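For reference, the same precedence behavior can be reproduced in a standalone script. This is a hedged recap of the session above, assuming a local PySpark installation; the `/tmp/merge-demo` path is illustrative:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()
spark.range(3).write.mode('overwrite').parquet('/tmp/merge-demo/id')
spark.range(3).withColumnRenamed('id', 'name').write.mode('overwrite').parquet('/tmp/merge-demo/name')

spark.conf.set('spark.sql.parquet.mergeSchema', True)

# mergeSchema left unset: the session config applies, so the schemas are merged.
print(spark.read.option('recursiveFileLookup', True).parquet('/tmp/merge-demo').columns)
# ['id', 'name']

# Explicit mergeSchema=False: the per-call option wins over the session config.
print(spark.read.option('recursiveFileLookup', True).parquet('/tmp/merge-demo', mergeSchema=False).columns)
# a single column, e.g. ['id']
```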
Closes #26730 from nchammas/parquet-merge-schema.
Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: e766a323bc3462763b03f9d892a0b3fdf2cb29db)
The file was modified: python/pyspark/sql/streaming.py
The file was modified: python/pyspark/sql/readwriter.py
Commit c8922d9145a9bc60c0f423a6c1b7d4f0bfa2e585 by gurwls223
[SPARK-30113][SQL][PYTHON] Expose mergeSchema option in PySpark's ORC
APIs
### What changes were proposed in this pull request?
This PR is a follow-up to #24043 and a cousin of #26730. It exposes the
`mergeSchema` option directly in the ORC APIs.
### Why are the changes needed?
So that the Python API matches the Scala API.
### Does this PR introduce any user-facing change?
Yes, it adds a new option directly to the ORC reader method signatures.
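Mirroring the Parquet change, the before-and-after usage looks roughly like this; a hedged sketch assuming a local session, with an illustrative `/tmp/orc-demo` path:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()
spark.range(3).write.mode('overwrite').orc('/tmp/orc-demo')

# Before this PR: the option had to be routed through .option().
spark.read.option('mergeSchema', True).orc('/tmp/orc-demo').show()

# After this PR: the option can be passed directly as a keyword.
spark.read.orc('/tmp/orc-demo', mergeSchema=True).show()
```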
### How was this patch tested?
I tested this manually as follows:
```
>>> spark.range(3).write.orc('test-orc')
>>> spark.range(3).withColumnRenamed('id', 'name').write.orc('test-orc/nested')
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
>>> spark.conf.set('spark.sql.orc.mergeSchema', True)
>>> spark.read.orc('test-orc', recursiveFileLookup=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
```
Closes #26755 from nchammas/SPARK-30113-ORC-mergeSchema.
Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: c8922d9145a9bc60c0f423a6c1b7d4f0bfa2e585)
The file was modified: python/pyspark/sql/streaming.py
The file was modified: python/pyspark/sql/readwriter.py