Commit
dcb08204339e2291727be8e1a206e272652f9ae4
by yamamuro
[SPARK-32785][SQL][DOCS][FOLLOWUP] Update migration guide for incomplete interval literals

### What changes were proposed in this pull request?
Address the comments in https://github.com/apache/spark/pull/29635#discussion_r507241899 to improve the migration guide.

### Why are the changes needed?
To improve the migration guide.

### Does this PR introduce _any_ user-facing change?
No, this is only a doc update.

### How was this patch tested?
Passing GitHub Actions.

Closes #30113 from yaooqinn/SPARK-32785-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (commit: dcb0820) |
 | docs/sql-migration-guide.md (diff) |
Commit
618695b78fe93ae6506650ecfbebe807a43c5f0c
by srowen
[SPARK-33111][ML][FOLLOW-UP] AFT transform optimization - predictQuantiles

### What changes were proposed in this pull request?
Optimize `predictQuantiles` by pre-computing an auxiliary variable.

### Why are the changes needed?
In https://github.com/apache/spark/pull/30000, I optimized the `transform` method. We can also optimize `predictQuantiles` by pre-computing an auxiliary variable; it is about 56% faster than the existing implementation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes #30034 from zhengruifeng/aft_quantiles_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com> (commit: 618695b) |
 | mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (diff) |
 | mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala (diff) |
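The pattern behind this optimization is generic: hoist any value that depends only on the fitted model out of the per-row prediction path. A minimal sketch of the idea follows; the class and field names are hypothetical stand-ins, not the actual `AFTSurvivalRegressionModel` internals.
```scala
// Illustrative only: hoisting a model-constant term out of the per-row prediction path.
// `quantileProbabilities` and `scale` are hypothetical stand-ins for model parameters.
class QuantilePredictor(quantileProbabilities: Array[Double], scale: Double) {
  // Pre-computed once per model instead of once per predicted row.
  private val auxiliary: Array[Double] =
    quantileProbabilities.map(p => math.exp(math.log(-math.log1p(-p)) * scale))

  def predictQuantiles(predictedMean: Double): Array[Double] =
    auxiliary.map(_ * predictedMean) // per-row work is now one multiply per quantile
}
```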
Commit
1b7367ccd7cdcbfc9ff9a3893693a3261a5eb7c1
by viirya
[SPARK-33205][BUILD] Bump snappy-java version to 1.1.8

### What changes were proposed in this pull request?
This PR intends to upgrade snappy-java from 1.1.7.5 to 1.1.8.

### Why are the changes needed?
For performance improvements; the released `snappy-java` bundles the latest `Snappy` v1.1.8 binaries with small performance improvements.
- snappy-java release note: https://github.com/xerial/snappy-java/releases/tag/1.1.8
- snappy release note: https://github.com/google/snappy/releases/tag/1.1.8

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA tests.

Closes #30120 from maropu/Snappy1.1.8.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (commit: 1b7367c) |
 | dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff) |
 | dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff) |
 | pom.xml (diff) |
Commit
7aed81d4926c8f13ffb38f7ff90162b15c876016
by dhyun
[SPARK-33202][CORE] Fix BlockManagerDecommissioner to return the correct migration status

### What changes were proposed in this pull request?
This PR changes `<` into `>` in the following to fix data loss during storage migrations.
```scala
// If we found any new shuffles to migrate or otherwise have not migrated everything.
- newShufflesToMigrate.nonEmpty || migratingShuffles.size < numMigratedShuffles.get()
+ newShufflesToMigrate.nonEmpty || migratingShuffles.size > numMigratedShuffles.get()
```

### Why are the changes needed?
`refreshOffloadingShuffleBlocks` should return `true` when the migration is still on-going. Since `migratingShuffles` is defined like the following, `migratingShuffles.size > numMigratedShuffles.get()` means the migration is not finished.
```scala
// Shuffles which are either in queue for migrations or migrated
protected[storage] val migratingShuffles = mutable.HashSet[ShuffleBlockInfo]()
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CI with the updated test cases.

Closes #30116 from dongjoon-hyun/SPARK-33202.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (commit: 7aed81d) |
 | core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionUnitSuite.scala (diff) |
 | core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala (diff) |
Commit
66005a323625fc8c7346d28e9a8c52f91ae8d1a0
by cutlerb
[SPARK-31964][PYTHON][FOLLOW-UP] Use is_categorical_dtype instead of deprecated is_categorical

### What changes were proposed in this pull request?
This PR is a small followup of https://github.com/apache/spark/pull/28793 and proposes to use `is_categorical_dtype` instead of the deprecated `is_categorical`. `is_categorical_dtype` exists from the minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated in pandas 1.1.0 (https://github.com/pandas-dev/pandas/commit/87a1cc21cab751c16fda4e6f0a95988a8d90462b).

### Why are the changes needed?
To avoid using deprecated APIs, and remove warnings.

### Does this PR introduce _any_ user-facing change?
Yes, it will remove the warning that says `is_categorical` is deprecated.

### How was this patch tested?
By running any pandas UDF with pandas 1.1.0+:
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

def func(x: pd.Series) -> pd.Series:
    return x

spark.range(10).select(pandas_udf(func, "long")("id")).show()
```

Before:
```
/.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
...
```

After:
```
...
```

Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com> (commit: 66005a3) |
 | python/pyspark/sql/pandas/serializers.py (diff) |
Commit
bbf2d6f6df0011c3035d829a56b035a2b094295c
by gurwls223
[SPARK-33160][SQL][FOLLOWUP] Update benchmarks of INT96 type rebasing

### What changes were proposed in this pull request?
1. Turn off/on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite`, which was added by https://github.com/apache/spark/pull/30056, in `DateTimeRebaseBenchmark`. The parquet readers should infer the correct rebasing mode automatically from metadata.
2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk` |

### Why are the changes needed?
To have up-to-date info about INT96 performance, which is the default type for Catalyst's timestamp type.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By updating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```

Closes #30118 from MaxGekk/int96-rebase-benchmark.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org> (commit: bbf2d6f) |
 | sql/core/benchmarks/DateTimeRebaseBenchmark-results.txt (diff) |
 | sql/core/benchmarks/DateTimeRebaseBenchmark-jdk11-results.txt (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DateTimeRebaseBenchmark.scala (diff) |
Commit
4a33cd928df4739e69ae9530aae23964e470d2f8
by dhyun
[SPARK-33203][PYTHON][TEST] Fix tests failing with rounding errors

### What changes were proposed in this pull request?
Increase the tolerance for two tests that fail in some environments and pass in others (flaky? Pass/fail is constant within the same environment).

### Why are the changes needed?
The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` fail with
```
File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in test_raw_and_probability_prediction
    self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1))
AssertionError: False is not true
```
```
File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in __main__.ALS
Failed example:
    predictions[0]
Expected:
    Row(user=0, item=2, newPrediction=0.6929101347923279)
Got:
    Row(user=0, item=2, newPrediction=0.6929104924201965)
...
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This patch changes a test target. Just executed the tests to verify they pass.

Closes #30104 from AlessandroPatti/apatti/rounding-errors.
Authored-by: Alessandro Patti <ale812@yahoo.it>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (commit: 4a33cd9) |
 | python/pyspark/ml/tests/test_algorithms.py (diff) |
 | python/pyspark/ml/recommendation.py (diff) |
Commit
ba13b94f6b2b477a93c0849c1fc776ffd5f1a0e6
by wenchen
[SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default

### What changes were proposed in this pull request?
1. Set the default value for the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
2. Update the SQL migration guide.

### Why are the changes needed?
The current default value `LEGACY` may lead to shifting timestamps in read or in write. We should leave the decision about rebasing to users.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By existing test suites like `ParquetIOSuite`.

Closes #30121 from MaxGekk/int96-exception-by-default.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com> (commit: ba13b94) |
 | sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff) |
 | sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala (diff) |
 | docs/sql-migration-guide.md (diff) |
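For users who still want the old behaviour after this change, the two configs named above can be set explicitly per session. A minimal sketch, using only the config names and values (`EXCEPTION`, `LEGACY`) that appear in the commit message; the paths are placeholders:
```scala
// Opt back into the pre-change behaviour for INT96 timestamps instead of failing
// when timestamps fall into the range affected by calendar rebasing.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")

spark.sql("SELECT TIMESTAMP '1000-01-01 00:00:00' AS ts")
  .write.mode("overwrite").parquet("/tmp/int96-legacy")   // placeholder path
spark.read.parquet("/tmp/int96-legacy").show(false)
```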
Commit
cb3fa6c9368e64184a5f7b19688181d11de9511c
by d_tsai
[SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile

### What changes were proposed in this pull request?
This switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop 2.7, we'll still use the same modules such as hadoop-client. In order to still keep the default Hadoop profile as hadoop-3.2, this defines the following Maven properties:
```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```
which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times; for this I have to disable the Maven enforcer rule `banDuplicatePomDependencyVersions`.

Besides the above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster`, which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when the Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).

### Why are the changes needed?
This serves two purposes:
- to unblock Spark from upgrading to Hadoop 3.2.2/3.3.0+. The latest Hadoop versions have upgraded to Guava 27+, and in order to adopt them in Spark we'll need to resolve the Guava conflicts. This takes the approach of switching to the shaded client jars provided by Hadoop.
- avoid pulling 3rd party dependencies from Hadoop and avoid potential future conflicts.

### Does this PR introduce _any_ user-facing change?
When people use Spark with the `hadoop-provided` option, they should make sure the class path contains the `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the class path order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.

### How was this patch tested?
Relying on existing tests.

Closes #29843 from sunchao/SPARK-29250.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com> (commit: cb3fa6c) |
 | resource-managers/yarn/pom.xml (diff) |
 | core/pom.xml (diff) |
 | hadoop-cloud/pom.xml (diff) |
 | launcher/pom.xml (diff) |
 | common/network-yarn/pom.xml (diff) |
 | resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (diff) |
 | pom.xml (diff) |
 | core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (diff) |
 | external/kinesis-asl-assembly/pom.xml (diff) |
 | sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala (diff) |
 | sql/hive/pom.xml (diff) |
 | external/kafka-0-10-sql/pom.xml (diff) |
 | sql/catalyst/pom.xml (diff) |
 | external/kafka-0-10-token-provider/pom.xml (diff) |
 | resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala (diff) |
 | dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff) |
 | dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff) |
 | external/kafka-0-10-assembly/pom.xml (diff) |
Commit
eb33bcb4b2db2a13b3da783e58feb8852e04637b
by wenchen
[SPARK-30796][SQL] Add parameter position for REGEXP_REPLACE

### What changes were proposed in this pull request?
`REGEXP_REPLACE` replaces all substrings of a string that match a regexp with a replacement string, but it lacks some flexibility, for example converting camel case strings to a string of lower case words separated by underscores: AddressLine1 -> address_line_1. If we support the parameter position, we can do this (e.g. in Oracle):
```
WITH strings as (
  SELECT 'AddressLine1' s FROM dual union all
  SELECT 'ZipCode' s FROM dual union all
  SELECT 'Country' s FROM dual
)
SELECT s "STRING", lower(regexp_replace(s, '([A-Z0-9])', '_\1', 2)) "MODIFIED_STRING"
FROM strings;
```
The output:
```
STRING               MODIFIED_STRING
-------------------- --------------------
AddressLine1         address_line_1
ZipCode              zip_code
Country              country
```
Several mainstream databases support the syntax.

**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490

**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html

### Why are the changes needed?
The parameter position for `REGEXP_REPLACE` is very useful.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Jenkins test.

Closes #29891 from beliefer/add-position-for-regex_replace.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com> (commit: eb33bcb) |
 | sql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff) |
 | sql/core/src/test/resources/sql-tests/results/regexp-functions.sql.out (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (diff) |
 | sql/core/src/test/resources/sql-tests/inputs/regexp-functions.sql (diff) |
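The same camel-case conversion should now be expressible in Spark SQL. A minimal sketch, assuming the new optional argument is a fourth, 1-based position parameter mirroring the Oracle form above; note that Spark's regexp replacement uses Java-style `$1` backreferences rather than Oracle's `\1`:
```scala
// Hypothetical usage of the position argument added by SPARK-30796.
// Starting the match at position 2 leaves the first character unprefixed,
// so 'AddressLine1' becomes 'address_line_1' rather than '_address_line_1'.
import spark.implicits._

val df = Seq("AddressLine1", "ZipCode", "Country").toDF("s")
df.createOrReplaceTempView("strings")
spark.sql(
  """SELECT s AS str,
    |       lower(regexp_replace(s, '([A-Z0-9])', '_$1', 2)) AS modified_string
    |FROM strings""".stripMargin).show(false)
```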
Commit
a908b67502164d5b1409aca912dac7042e825586
by dhyun
[SPARK-33218][CORE] Update misleading log messages for removed shuffle blocks

### What changes were proposed in this pull request?
This updates the misleading log messages for removed shuffle blocks during migration.

### Why are the changes needed?
1. For deleted shuffle blocks, `IndexShuffleBlockResolver` shows users a WARN message saying `skipping migration`. However, `BlockManagerDecommissioner` inconsistently shows users an INFO message including `Migrated ShuffleBlockInfo(...)`. Technically, we didn't migrate it, so we should not show the `Migrated` message in this case.
```
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (2 / 3)
WARN IndexShuffleBlockResolver: Failed to resolve shuffle block ShuffleBlockInfo(109,18924), skipping migration. This is expected to occur if a block is removed after decommissioning has started.
INFO BlockManagerDecommissioner: Got migration sub-blocks List()
...
INFO BlockManagerDecommissioner: Migrated ShuffleBlockInfo(109,18924) to BlockManagerId(...)
```
2. In addition, if the shuffle file is deleted while the information is in the queue, the above messages are repeated multiple times, up to `spark.storage.decommission.maxReplicationFailuresPerBlock`. We had better use one line instead of the group of messages for that case.
```
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (0 / 3)
...
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (1 / 3)
...
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (2 / 3)
```
3. Skipping or not is a role of the `BlockManagerDecommissioner` class. `IndexShuffleBlockResolver.getMigrationBlocks` is used twice, in different ways; we had better inform users once, in `BlockManagerDecommissioner`.
- At the beginning, to get the sub-blocks.
- In case of `IOException`, to determine whether to ignore it or re-throw. And `BlockManagerDecommissioner` shows a WARN message (`Skipping block ...`) again.

### Does this PR introduce _any_ user-facing change?
No. This is an update for log message info to be consistent.

### How was this patch tested?
Manually.

Closes #30129 from dongjoon-hyun/SPARK-33218.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (commit: a908b67) |
 | core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala (diff) |
 | core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala (diff) |
Commit
d9ee33cfb95e1f05878e498c93c5cc65ce449f0e
by yamamuro
[SPARK-26533][SQL] Support query auto timeout cancel on thriftserver

### What changes were proposed in this pull request?
Support automatically cancelling queries that run too long on the thriftserver. This is a rework of #28991 and the credit should go to the original author, leoluan2009.

Closes #28991

### Why are the changes needed?
In some cases, we use the thriftserver as a long-running application. Sometimes we want no query to run longer than a given time. In these cases, we can enable auto cancel for time-consuming queries, which lets us release resources for other queries to run.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added tests.

Closes #29933 from maropu/pr28991.
Lead-authored-by: Xuedong Luan <luanxuedong2009@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Luan <luanxuedong2009@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (commit: d9ee33c) |
 | sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala (diff) |
 | sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2AppStatusStore.scala (diff) |
 | sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2ListenerSuite.scala (diff) |
 | sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala (diff) |
 | sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala (diff) |
 | sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2EventManager.scala (diff) |
 | sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (diff) |
 | sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/SQLOperation.java (diff) |
 | sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperationSuite.scala (diff) |
 | sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/OperationManager.java (diff) |
 | sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff) |
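A quick way to see the new behaviour from a JDBC client is to set a timeout and then run a deliberately slow query. This is only a sketch: the config name `spark.sql.thriftServer.queryTimeout` is an assumption (the commit message does not name it; check `SQLConf` before relying on it), and the connection details are placeholders.
```scala
import java.sql.DriverManager

// Assumption: the timeout is controlled by spark.sql.thriftServer.queryTimeout.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()
stmt.execute("SET spark.sql.thriftServer.queryTimeout=5s")
try {
  // A deliberately slow query; expected to be cancelled after roughly 5 seconds.
  stmt.executeQuery("SELECT java_method('java.lang.Thread', 'sleep', 60000L)")
} catch {
  case e: java.sql.SQLException => println(s"Query cancelled as expected: ${e.getMessage}")
} finally {
  stmt.close(); conn.close()
}
```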
Commit
8cae7f88b011939473fc9a6373012e23398bbc07
by wenchen
[SPARK-33095][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)

### What changes were proposed in this pull request?
Override the default SQL strings for:
- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY

in the MySQL JDBC dialect, according to the official documentation, and write MySQL integration tests for JDBC.

### Why are the changes needed?
Improved code coverage and support for the MySQL dialect for JDBC.

### Does this PR introduce _any_ user-facing change?
Yes. Support ALTER TABLE in the JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect).

### How was this patch tested?
Added tests.

Closes #30025 from ScrapCodes/mysql-dialect.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com> (commit: 8cae7f8) |
 | sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala (diff) |
 | external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala |
 | external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala (diff) |
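For context, these dialect strings are exercised when a MySQL database is registered as a v2 catalog and DDL is issued through Spark SQL. A hedged sketch: the catalog name, JDBC URL, driver, and table are made up, and the `JDBCTableCatalog` class and `ALTER COLUMN` syntax are what I would expect here rather than anything stated in the commit message.
```scala
// Register a JDBC v2 catalog backed by MySQL (illustrative values only).
spark.conf.set("spark.sql.catalog.mysql",
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.mysql.url", "jdbc:mysql://localhost:3306/testdb")
spark.conf.set("spark.sql.catalog.mysql.driver", "com.mysql.cj.jdbc.Driver")

// Update a column's type and nullability; Spark translates these statements into
// the MySQL-specific ALTER TABLE SQL strings provided by the dialect.
spark.sql("ALTER TABLE mysql.testdb.people ALTER COLUMN age TYPE BIGINT")
spark.sql("ALTER TABLE mysql.testdb.people ALTER COLUMN age DROP NOT NULL")
```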
Commit
a1629b4a5790dce1a57e2c2bad9e04c627b88d29
by wenchen
[SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

### What changes were proposed in this pull request?
Support an HDFS location for `spark.sql.hive.metastore.jars`. When users need to use a path to set the hive metastore jars, they should set `spark.sql.hive.metastore.jars=path` and put the real path in `spark.sql.hive.metastore.jars.path`, since we use `File.pathSeparator` to split the path, and `File.pathSeparator` is `:` on unix, which would split an HDFS location like `hdfs://nameservice/xx`. So this adds the new config `spark.sql.hive.metastore.jars.path` to set comma separated paths, keeping both ways supported.

### Why are the changes needed?
All spark apps can fetch an internal version of the hive jars from an HDFS location, with no need to distribute them to all nodes.

### Does this PR introduce _any_ user-facing change?
Users can use an HDFS location to store hive metastore jars.

### How was this patch tested?
Manually tested.

Closes #29881 from AngersZhuuuu/SPARK-32852.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com> (commit: a1629b4) |
 | sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala (diff) |
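Putting the two configs named in the commit message together, the intended usage looks roughly like the sketch below. The `path` value for `spark.sql.hive.metastore.jars` comes from the commit message; the metastore version and the HDFS jar paths are placeholders (in practice you would list every jar the metastore client needs).
```scala
import org.apache.spark.sql.SparkSession

// Hive metastore client jars fetched from HDFS instead of a local directory.
val spark = SparkSession.builder()
  .appName("hive-metastore-jars-from-hdfs")
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "2.3.7")        // placeholder version
  .config("spark.sql.hive.metastore.jars", "path")
  // Comma-separated list; avoids File.pathSeparator splitting of hdfs:// URIs.
  .config("spark.sql.hive.metastore.jars.path",
    "hdfs://nameservice/hive/lib/hive-metastore-2.3.7.jar," +
    "hdfs://nameservice/hive/lib/hive-exec-2.3.7.jar")        // placeholder jars
  .getOrCreate()
```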
Commit
b38f3a5557b45503e0f8d67bc77c5d390a67a42f
by wenchen
[SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

### What changes were proposed in this pull request?
The purpose of this PR is to resolve SPARK-32978. The main reason for the bad case described in SPARK-32978 is that `BasicWriteTaskStatsTracker` directly reports the number of newly added partitions of each task, which makes it impossible to remove duplicate data on the driver side. The main change of this PR is to report the partitionValues to the driver and remove duplicate data at the driver side, to make sure the number of dynamic part metric is correct.

### Why are the changes needed?
The number of dynamic part metric we display on the UI should be correct.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new test case for the scenario described in SPARK-32978.

Closes #30026 from LuciferYang/SPARK-32978.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com> (commit: b38f3a5) |
 | sql/core/benchmarks/InsertTableWithDynamicPartitionsBenchmark-jdk11-results.txt |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala |
 | sql/core/benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt |
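The fix boils down to counting distinct partition values across all write tasks on the driver rather than summing per-task counts. A minimal sketch of that idea; the names and types are illustrative, not the actual `BasicWriteTaskStatsTracker` or driver-side code.
```scala
// Illustrative only: two tasks may both create partition "dt=2020-10-21";
// summing per-task counts double-counts it, while a driver-side set does not.
case class TaskWriteStats(partitionValues: Set[String])

def numDynamicPartitions(taskStats: Seq[TaskWriteStats]): Int = {
  val perTaskSum = taskStats.map(_.partitionValues.size).sum              // old approach, over-counts
  val distinctOnDriver = taskStats.flatMap(_.partitionValues).toSet.size  // new approach, correct
  assert(distinctOnDriver <= perTaskSum)
  distinctOnDriver
}

// Example: task A writes {dt=1, dt=2}, task B writes {dt=2, dt=3} -> 3 partitions, not 4.
val stats = Seq(TaskWriteStats(Set("dt=1", "dt=2")), TaskWriteStats(Set("dt=2", "dt=3")))
println(numDynamicPartitions(stats)) // 3
```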
Commit
a03d77d32696f5a33770e9bee654acde904da7d4
by wenchen
[SPARK-33160][SQL][FOLLOWUP] Replace the parquet metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`

### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`.
2. Change the condition under which the new key is saved to parquet metadata: it should be saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change the handling of the metadata key in read:
   - If the key is not present in the parquet metadata, take the rebase mode from the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead`.
   - If the parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for the INT96 type.
   - For files written by Spark >= 3.1.0, if `org.apache.spark.legacyINT96` is present in the metadata, perform rebasing; otherwise don't.

### Why are the changes needed?
- To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION`, after https://github.com/apache/spark/pull/30121.
- To have an implementation similar to `org.apache.spark.legacyDateTime`.
- To minimise the impact on other subsystems that are based on file sizes, like gathering statistics.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified a test in `ParquetIOSuite`.

Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com> (commit: a03d77d) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala (diff) |
 | sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/package.scala (diff) |
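Since this entry lists the read-side rules in prose, a compact sketch of the same decision logic may help. This is an illustrative reconstruction from the bullets above, not the actual `DataSourceUtils` code; in particular, it reads the first bullet as applying to files that carry no Spark writer metadata at all, and the version comparison is deliberately naive.
```scala
// Illustrative decision logic for rebasing INT96 timestamps on read.
sealed trait RebaseMode
case object Legacy extends RebaseMode      // rebase from the legacy hybrid calendar
case object Corrected extends RebaseMode   // read timestamps as-is

def isOlderThan310(version: String): Boolean = {
  // Naive major/minor comparison, good enough for this sketch.
  val Array(major, minor) = version.split("\\.").take(2).map(_.toInt)
  major < 3 || (major == 3 && minor < 1)
}

def int96RebaseModeOnRead(
    writerSparkVersion: Option[String],  // from Spark's writer-version metadata, if any
    hasLegacyInt96Key: Boolean,          // "org.apache.spark.legacyINT96" present in the file?
    modeFromSqlConf: RebaseMode): RebaseMode =
  writerSparkVersion match {
    case None => modeFromSqlConf                            // no Spark metadata: use the SQL config
    case Some(v) if isOlderThan310(v) => Legacy             // Spark < 3.1.0 always rebased INT96
    case Some(_) => if (hasLegacyInt96Key) Legacy else Corrected // newer writers mark legacy files
  }
```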