Success Changes

Summary

  1. [SPARK-25626][SQL][TEST] Improve the test execution time of HiveClientSuites (details)
  2. [SPARK-25635][SQL][BUILD] Support selective direct encoding in native ORC write (details)
  3. [SPARK-25653][TEST] Add tag ExtendedHiveTest for HiveSparkSubmitSuite (details)
  4. [SPARK-25610][SQL][TEST] Improve execution time of DatasetCacheSuite: cache UDF result correctly (details)
Commit a433fbcee66904d1b7fa98ab053e2bdf81e5e4f2 by gatorsmile
[SPARK-25626][SQL][TEST] Improve the test execution time of
HiveClientSuites
## What changes were proposed in this pull request?
Improve the runtime by reducing the number of partitions created in the test. The number of partitions is reduced from 280 to 60.
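The speedup comes from the fact that the suite builds its test partitions as a cross product of partition-column values, so shrinking one dimension shrinks the total. A hypothetical sketch of that effect (the column names and value ranges below are illustrative, not the suite's actual ones):
```
// Hypothetical illustration only: partition specs built as a cross product of
// partition-column values. Shrinking one dimension cuts the total count.
val dsValues    = Seq("20170101", "20170102", "20170103", "20170104", "20170105") // 5 values
val chunkValues = Seq("aa", "ab", "ba", "bb")                                      // 4 values

def partitionSpecs(hours: Range): Seq[Map[String, String]] =
  for {
    ds    <- dsValues
    h     <- hours
    chunk <- chunkValues
  } yield Map("ds" -> ds, "h" -> h.toString, "chunk" -> chunk)

assert(partitionSpecs(0 until 14).size == 280) // before: 5 * 14 * 4 partitions
assert(partitionSpecs(0 until 3).size == 60)   // after:  5 *  3 * 4 partitions
```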
Here are the test times for the `getPartitionsByFilter returns all partitions` test on my laptop.
```
[info] - 0.13: getPartitionsByFilter returns all partitions when
hive.metastore.try.direct.sql=false (4 seconds, 230 milliseconds)
[info] - 0.14: getPartitionsByFilter returns all partitions when
hive.metastore.try.direct.sql=false (3 seconds, 576 milliseconds)
[info] - 1.0: getPartitionsByFilter returns all partitions when
hive.metastore.try.direct.sql=false (3 seconds, 495 milliseconds)
[info] - 1.1: getPartitionsByFilter returns all partitions when
hive.metastore.try.direct.sql=false (6 seconds, 728 milliseconds)
[info] - 1.2: getPartitionsByFilter returns all partitions when
hive.metastore.try.direct.sql=false (7 seconds, 260 milliseconds)
[info] - 2.0: getPartitionsByFilter returns all partitions when
hive.metastore.try.direct.sql=false (8 seconds, 270 milliseconds)
[info] - 2.1: getPartitionsByFilter returns all partitions when
hive.metastore.try.direct.sql=false (6 seconds, 856 milliseconds)
[info] - 2.2: getPartitionsByFilter returns all partitions when
hive.metastore.try.direct.sql=false (7 seconds, 587 milliseconds)
[info] - 2.3: getPartitionsByFilter returns all partitions when
hive.metastore.try.direct.sql=false (7 seconds, 230 milliseconds)
## How was this patch tested?
Test only.
Closes #22644 from dilipbiswal/SPARK-25626.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: gatorsmile
<gatorsmile@gmail.com>
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala (diff)
Commit 1c9486c1acceb73e5cc6f1fa684b6d992e187a9a by gatorsmile
[SPARK-25635][SQL][BUILD] Support selective direct encoding in native
ORC write
## What changes were proposed in this pull request?
Before ORC 1.5.3, `orc.dictionary.key.threshold` and
`hive.exec.orc.dictionary.key.size.threshold` are applied to all
columns. This has been a big hurdle to enabling dictionary encoding. From
ORC 1.5.3, `orc.column.encoding.direct` is added to enforce direct
encoding selectively in a column-wise manner. This PR aims to add that
feature by upgrading ORC from 1.5.2 to 1.5.3.
The following are the patches in ORC 1.5.3; this feature is the only
one directly related to Spark.
```
ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts multi-byte data (gopalv)
ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
ORC-405: Remove calcite as a dependency from the benchmarks.
ORC-375: Fix libhdfs on gcc7 by adding #include <functional> two places.
ORC-383: Parallel builds fails with ConcurrentModificationException
ORC-382: Apache rat exclusions + add rat check to travis
ORC-401: Fix incorrect quoting in specification.
ORC-385: Change RecordReader to extend Closeable.
ORC-384: [C++] fix memory leak when loading non-ORC files
ORC-391: [c++] parseType does not accept underscore in the field name
ORC-397: Allow selective disabling of dictionary encoding. Original patch was by Mithun Radhakrishnan.
ORC-389: Add ability to not decode Acid metadata columns
```
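As a usage sketch (not the PR's test code; the output path and the column names are made up), the new column-level property can be passed as an ORC data source option, so the named column is written with direct encoding while other columns remain eligible for dictionary encoding:
```
// Sketch only: option pass-through on the native ORC data source.
// "uniqueId", "zipcode" and the output path are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-direct-encoding-sketch").getOrCreate()
import spark.implicits._

val df = (0 until 1000).map(i => (i % 10, s"id-$i")).toDF("zipcode", "uniqueId")

df.write
  .format("orc")
  .option("orc.column.encoding.direct", "uniqueId") // force direct encoding for this column only
  .mode("overwrite")
  .save("/tmp/orc_direct_encoding_sketch")
```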
## How was this patch tested?
Passes Jenkins with the newly added test cases.
Closes #22622 from dongjoon-hyun/SPARK-25635.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by:
gatorsmile <gatorsmile@gmail.com>
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala (diff)
The file was modified dev/deps/spark-deps-hadoop-2.6 (diff)
The file was modified pom.xml (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7 (diff)
The file was modified dev/deps/spark-deps-hadoop-3.1 (diff)
Commit bbd038d2436c17ff519c08630a016f3ec796a282 by gatorsmile
[SPARK-25653][TEST] Add tag ExtendedHiveTest for HiveSparkSubmitSuite
## What changes were proposed in this pull request?
The total run time of `HiveSparkSubmitSuite` is about 10 minutes. Since
the related code is stable, add the tag `ExtendedHiveTest` to it.
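For context, a minimal sketch of the tagging pattern (not the actual suite, which extends Spark's own test utilities): the `ExtendedHiveTest` annotation from the spark-tags test module is applied at the class level, which tags every test in the suite so runs can include or exclude it as a group.
```
// Sketch only: class-level ScalaTest tag annotation. The real suite has a
// different parent class and many tests; ExtendedHiveTest comes from the
// spark-tags test module.
import org.apache.spark.tags.ExtendedHiveTest
import org.scalatest.funsuite.AnyFunSuite // assumption: plain ScalaTest suite for illustration

@ExtendedHiveTest
class ExampleHiveHeavySuite extends AnyFunSuite {
  test("a long-running Hive integration scenario") {
    // body elided; the class-level tag applies to every test in the suite
  }
}
```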
## How was this patch tested?
Unit test.
Closes #22642 from gengliangwang/addTagForHiveSparkSubmitSuite.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala (diff)
Commit 2c6f4d61bbf7f0267a7309b4a236047f830bd6ee by gatorsmile
[SPARK-25610][SQL][TEST] Improve execution time of DatasetCacheSuite:
cache UDF result correctly
## What changes were proposed in this pull request?
In this test case, we verify that the result of a UDF is cached when the
underlying DataFrame is cached, and that the UDF is not evaluated again
when the cached DataFrame is used.
To reduce the runtime we do the following:
1) Use a single-partition DataFrame, so the total execution time of the UDF is more deterministic.
2) Cut down the size of the DataFrame from 10 rows to 2.
3) Reduce the sleep time in the UDF from 5 seconds to 2 seconds.
4) Reduce the `failAfter` condition from 3 seconds to 2 seconds.
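A sketch of the verification pattern with the reduced parameters (this mirrors the approach described above, not the suite's exact code, which runs inside Spark's shared test session):
```
// Sketch only: a slow UDF, a cached single-partition DataFrame with 2 rows,
// and a failAfter bound that would trip if the UDF were re-evaluated.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.scalatest.concurrent.{Signaler, ThreadSignaler}
import org.scalatest.concurrent.TimeLimits._
import org.scalatest.time.SpanSugar._

implicit val signaler: Signaler = ThreadSignaler

val spark = SparkSession.builder().master("local[1]").appName("udf-cache-sketch").getOrCreate()
import spark.implicits._

val slowUdf = udf { (x: Int) =>
  Thread.sleep(2000) // 2-second sleep per row makes an uncached evaluation clearly visible
  x + 1
}

// Single partition, 2 rows: the first (uncached) evaluation takes roughly 4 seconds.
val df = spark.range(0, 2, 1, numPartitions = 1).toDF("a").withColumn("b", slowUdf($"a"))
df.cache()
df.count() // materializes the cache, evaluating the UDF once per row

failAfter(2.seconds) { // if the UDF ran again, this would take ~4 seconds and fail
  assert(df.count() == 2)
}
```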
With the above change, it takes about 4 seconds to cache the first
DataFrame, and each subsequent check takes a few hundred milliseconds. The
new runtime for 5 consecutive runs of this test is as follows:
```
[info] - cache UDF result correctly (4 seconds, 906 milliseconds)
[info] - cache UDF result correctly (4 seconds, 281 milliseconds)
[info] - cache UDF result correctly (4 seconds, 288 milliseconds)
[info] - cache UDF result correctly (4 seconds, 355 milliseconds)
[info] - cache UDF result correctly (4 seconds, 280 milliseconds)
```
## How was this patch tested?
This is a test-only fix.
Closes #22638 from dilipbiswal/SPARK-25610.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: gatorsmile
<gatorsmile@gmail.com>
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala (diff)