SuccessChanges

Summary

  1. [SPARK-24257][SQL] LongToUnsafeRowMap calculate the new size may be (commit: 3d2ae0ba866a3cb5b0681c2b8de67d5ead339712) (details)
  2. [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4 (commit: 75e2cd1313551a816480a6c830fb5fa73d21c426) (details)
  3. [SPARK-24364][SS] Prevent InMemoryFileIndex from failing if file path (commit: 068c4ae3437981824b65d56efb7889232d5a3fb7) (details)
  4. [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase with dictionary (commit: f48d62400a757470c44191b9e5581c10236fe976) (details)
  5. [SPARK-24378][SQL] Fix date_trunc function incorrect examples (commit: d0f30e3f36f50abfc18654e379fd03c1360c4fd6) (details)
Commit 3d2ae0ba866a3cb5b0681c2b8de67d5ead339712 by wenchen
[SPARK-24257][SQL] LongToUnsafeRowMap calculate the new size may be
wrong
LongToUnsafeRowMap has a mistake when growing its page array: it blindly
grows to `oldSize * 2`, but the new record may be larger than
`oldSize * 2`. As a result, querying this map can return a malformed
UnsafeRow whose actual data is smaller than its declared size, i.e. the
data is corrupted.
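As an illustration, here is a minimal sketch of the intended growth logic (the names `page` and `cursor` are assumptions for this sketch, not the actual `HashedRelation` fields): the page must grow to at least the size the incoming record needs, because doubling alone is not enough when a single record exceeds the whole existing page.
```scala
// Minimal sketch of the corrected sizing, with assumed field names.
class PageSketch {
  private var page: Array[Long] = new Array[Long](64)
  private var cursor: Int = 0 // next free slot, in 8-byte words

  def ensureCapacity(neededWords: Int): Unit = {
    if (cursor + neededWords > page.length) {
      // Grow to at least what the incoming record requires; `page.length * 2`
      // by itself can still be too small for a single large record.
      val newSize = math.max(page.length * 2, cursor + neededWords)
      val newPage = new Array[Long](newSize)
      System.arraycopy(page, 0, newPage, 0, cursor)
      page = newPage
    }
  }
}
```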
Author: sychen <sychen@ctrip.com>
Closes #21311 from cxzl25/fix_LongToUnsafeRowMap_page_size.
(cherry picked from commit 888340151f737bb68d0e419b1e949f11469881f9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 3d2ae0ba866a3cb5b0681c2b8de67d5ead339712)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala (diff)
Commit 75e2cd1313551a816480a6c830fb5fa73d21c426 by wenchen
[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4
ORC 1.4.4 includes [nine fixes](https://issues.apache.org/jira/issues/?filter=12342568&jql=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4).
One of them addresses a `Timestamp` bug (ORC-306) that occurs when the
`native` ORC vectorized reader reads the ORC column vector's sub-vectors
`times` and `nanos`. ORC-306 fixes this according to the [original
definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46),
and this PR includes the updated interpretation of ORC column vectors.
Note that the `hive` ORC reader and the ORC MR reader are not affected.
```scala
scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--------------------------+
|value                     |
+--------------------------+
|1900-05-05 12:34:55.000789|
+--------------------------+
```
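For reference, a minimal sketch of what the corrected interpretation looks like, assuming the `TimestampColumnVector` contract that `time` holds `Timestamp.getTime()` (milliseconds) and `nanos` holds `Timestamp.getNanos()`; this illustrates the idea rather than the exact patched code:
```scala
// Sketch only: convert one ORC TimestampColumnVector entry to Catalyst microseconds.
// `timeMillis` already includes the integral milliseconds of the fractional second,
// so only the sub-millisecond part of `nanos` needs to be added back.
def toMicros(timeMillis: Long, nanos: Int): Long =
  timeMillis * 1000L + (nanos / 1000) % 1000
```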
This PR updates Apache Spark to use ORC 1.4.4.
**FULL LIST**
ID | TITLE
-- | --
ORC-281 | Fix compiler warnings from clang 5.0
ORC-301 | `extractFileTail` should open a file in `try` statement
ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api
ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp
ORC-324 | Add support for ARM and PPC arch
ORC-330 | Remove unnecessary Hive artifacts from root pom
ORC-332 | Add syntax version to orc_proto.proto
ORC-336 | Remove avro and parquet dependency management entries
ORC-360 | Implement error checking on subtype fields in Java
Passed the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #21372 from dongjoon-hyun/SPARK_ORC144.
(cherry picked from commit 486ecc680e9a0e7b6b3c3a45fb883a61072096fc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 75e2cd1313551a816480a6c830fb5fa73d21c426)
The file was modified dev/deps/spark-deps-hadoop-2.7 (diff)
The file was modified dev/deps/spark-deps-hadoop-2.6 (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala (diff)
The file was modified pom.xml (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java (diff)
Commit 068c4ae3437981824b65d56efb7889232d5a3fb7 by hyukjinkwon
[SPARK-24364][SS] Prevent InMemoryFileIndex from failing if file path
doesn't exist
## What changes were proposed in this pull request?
This PR proposes to follow up https://github.com/apache/spark/pull/15153
and complete SPARK-17599.
A `FileSystem` operation (`fs.getFileBlockLocations`) can still fail if
the file path does not exist; see the exception message below:
```
Error occurred while processing: File does not exist: /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv
...
java.io.FileNotFoundException: File does not exist: /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv
...
  org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:249)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:229)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles$3.apply(InMemoryFileIndex.scala:314)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles$3.apply(InMemoryFileIndex.scala:297)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles(InMemoryFileIndex.scala:297)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$1.apply(InMemoryFileIndex.scala:174)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$1.apply(InMemoryFileIndex.scala:173)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles(InMemoryFileIndex.scala:173)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:126)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:91)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:67)
  at org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$lzycompute$1(DataSource.scala:161)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$tempFileIndex$1(DataSource.scala:152)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:166)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:261)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:94)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:94)
  at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:196)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:206)
  at com.hwx.StreamTest$.main(StreamTest.scala:97)
  at com.hwx.StreamTest.main(StreamTest.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:906)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv
...
```
So, this PR fixes it by logging a warning and skipping the missing file
instead of failing.
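A rough Scala sketch of the defensive pattern (not the literal patch; the helper name and the logging hook are assumptions):
```scala
import java.io.FileNotFoundException

import org.apache.hadoop.fs.{BlockLocation, FileStatus, FileSystem}

// Sketch: fetch block locations defensively. A file deleted after the directory
// was listed but before this call produces a warning and an empty result,
// instead of failing the whole listing.
def blockLocationsOrWarn(
    fs: FileSystem,
    status: FileStatus,
    warn: String => Unit): Array[BlockLocation] = {
  try {
    fs.getFileBlockLocations(status, 0, status.getLen)
  } catch {
    case _: FileNotFoundException =>
      warn(s"The file ${status.getPath} was not found; it may have been deleted " +
        "very recently. Skipping it.")
      Array.empty[BlockLocation]
  }
}
```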
## How was this patch tested?
It's hard to write a test. Manually tested multiple times.
Author: hyukjinkwon <gurwls223@apache.org>
Closes #21408 from HyukjinKwon/missing-files.
(cherry picked from commit 8a545822d0cc3a866ef91a94e58ea5c8b1014007)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(commit: 068c4ae3437981824b65d56efb7889232d5a3fb7)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala (diff)
Commit f48d62400a757470c44191b9e5581c10236fe976 by wenchen
[SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase with dictionary
filters.
## What changes were proposed in this pull request?
I missed this commit when preparing #21070.
When Parquet is able to filter out blocks with dictionary filtering, the
expected total value count in Spark is too high, leading to an error when
there are fewer row groups to process than expected. Spark should get the
row groups from Parquet so that it picks up new filtering schemes in
Parquet, such as dictionary filtering.
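A rough sketch of the idea (assuming parquet-hadoop's `ParquetFileReader.getRowGroups` API; this is not the literal patch):
```scala
import scala.collection.JavaConverters._

import org.apache.parquet.hadoop.ParquetFileReader

// Sketch: once the reader has applied its row-group filters (including dictionary
// filtering), derive the expected row count from the row groups it actually kept,
// not from the unfiltered footer metadata.
def expectedRowCount(reader: ParquetFileReader): Long =
  reader.getRowGroups.asScala.map(_.getRowCount).sum
```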
## How was this patch tested?
In use in production at Netflix. Added a test case for dictionary-filtered
blocks.
Author: Ryan Blue <blue@apache.org>
Closes #21295 from rdblue/SPARK-24230-fix-parquet-block-tracking.
(cherry picked from commit 3469f5c989e686866051382a3a28b2265619cab9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: f48d62400a757470c44191b9e5581c10236fe976)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java (diff)
Commit d0f30e3f36f50abfc18654e379fd03c1360c4fd6 by hyukjinkwon
[SPARK-24378][SQL] Fix date_trunc function incorrect examples
## What changes were proposed in this pull request?
Fix the incorrect examples for the `date_trunc` function.
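For context, `date_trunc` takes the truncation level first and the timestamp second; a quick illustrative call (the expected result is shown as a comment, not taken from the patch):
```scala
// Illustrative usage: truncation level first, then the timestamp.
spark.sql("SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359') AS t").show()
// expected value of t: 2015-01-01 00:00:00
```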
## How was this patch tested?
N/A
Author: Yuming Wang <yumwang@ebay.com>
Closes #21423 from wangyum/SPARK-24378.
(commit: d0f30e3f36f50abfc18654e379fd03c1360c4fd6)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)