Commit
4b0e23e646b579b852056ffc87164b16adef5a09
by kabhwan.opensource: [SPARK-33215][WEBUI] Speed up event log download by skipping UI rebuild

### What changes were proposed in this pull request?
This patch separates the view permission checks from getAppUI in FsHistoryProvider, enabling SHS to check the view permissions of a given attempt for a given user without rebuilding the UI. This is achieved by adding a method "checkUIViewPermissions(appId: String, attemptId: Option[String], user: String): Boolean" to several layers of history server components. Currently, this feature is useful for event log download.

### Why are the changes needed?
Right now, when we want to download the event logs from the Spark history server, SHS needs to parse the entire event log to rebuild the UI, just for the view permission checks. UI rebuilding is a time-consuming and memory-intensive task, especially for large logs, and it is unnecessary for event log download. With this patch, the UI rebuild can be skipped when downloading event logs from the history server. The time to download a GB-scale event log can thus be reduced from several minutes to several seconds, and the memory consumption of UI rebuilding can be avoided.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added test cases to confirm that the view permission checks work properly and that downloading event logs won't trigger UI loading. Also did some manual tests to verify that the download speed is drastically improved and the authentication works properly.

Closes #30126 from baohe-zhang/bypass_ui_rebuild_for_log_download.
Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 4b0e23e)
 | core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala (diff) |
 | core/src/main/scala/org/apache/spark/ui/SparkUI.scala (diff) |
 | core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala (diff) |
 | core/src/main/scala/org/apache/spark/status/api/v1/ApiRootResource.scala (diff) |
 | core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala (diff) |
 | core/src/main/scala/org/apache/spark/deploy/history/ApplicationHistoryProvider.scala (diff) |
 | core/src/main/scala/org/apache/spark/status/api/v1/OneApplicationResource.scala (diff) |
 | core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala (diff) |
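A minimal sketch of the pattern described in SPARK-33215 above. The `checkUIViewPermissions` signature is taken from the commit message; the surrounding trait and download endpoint names are illustrative placeholders, not the actual Spark classes:

```scala
// Illustrative sketch only: the method signature follows the commit message, but the
// provider/endpoint wiring below is simplified and hypothetical.
trait ApplicationHistoryProviderLike {
  // Check whether `user` may view the given attempt WITHOUT rebuilding its SparkUI.
  def checkUIViewPermissions(appId: String, attemptId: Option[String], user: String): Boolean
}

class EventLogDownloadEndpoint(provider: ApplicationHistoryProviderLike) {
  // Before this change, the download path loaded the full UI (parsing the whole event
  // log) just to evaluate ACLs; now only the lightweight permission check runs.
  def downloadEventLogs(appId: String, attemptId: Option[String], user: String): Unit = {
    if (!provider.checkUIViewPermissions(appId, attemptId, user)) {
      throw new SecurityException(s"User $user cannot view $appId")
    }
    // ... stream the raw event log files to the client without rebuilding the UI ...
  }
}
```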
Commit
537a49fc0966b0b289b67ac9c6ea20093165b0da
by wenchen: [SPARK-33140][SQL] Remove SQLConf and SparkSession in all sub-classes of Rule[QueryPlan]

### What changes were proposed in this pull request?
Since [SPARK-33139](https://issues.apache.org/jira/browse/SPARK-33139) has been done, SQLConf.get and SparkSession.active are now the more reliable way to obtain the config and session. This PR refines the existing code that passes SQLConf and SparkSession into sub-classes of Rule[QueryPlan]:

* remove SQLConf from the constructor parameters of all sub-classes of Rule[QueryPlan];
* use SQLConf.get to replace the original SQLConf instance;
* remove SparkSession from the constructor parameters of all sub-classes of Rule[QueryPlan];
* use SparkSession.active to replace the original SparkSession instance.

### Why are the changes needed?
Code refinement.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #30097 from leanken/leanken-SPARK-33140.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 537a49f)
 | sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinals.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/timeZoneAnalysis.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoin.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantProjectsSuite.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SelectedFieldSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/DisableUnnecessaryBucketedScan.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FallBackFileSourceV2.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/RemoveRedundantProjects.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala (diff) |
 | sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitionsSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisTest.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/joins/InnerJoinSuite.scala (diff) |
 | sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala (diff) |
 | sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupingAnalyticsSuite.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/joins/ExistenceJoinSuite.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/Rule.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveLambdaVariablesSuite.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTablesSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/Exchange.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/AggregateOptimizeSuite.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/InsertAdaptiveSparkPlan.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ReuseAdaptiveSubquery.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolvedUuidExpressionsSuite.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DataSourceV2AnalysisSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala (diff) |
 | sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoinSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/DemoteBroadcastHashJoin.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/exchange/EnsureRequirementsSuite.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/sources/DataSourceAnalysisSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeLocalShuffleReader.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PlanDynamicPruningFilters.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/ColumnarRulesSuite.scala (diff) |
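A before/after sketch of the refactoring described in SPARK-33140 above. `MyRule` is a hypothetical rule used only for illustration; `SQLConf.get` and `SparkSession.active` are the real entry points the commit switches to:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.internal.SQLConf

// Before: the conf and session are threaded through the constructor.
case class MyRuleBefore(sqlConf: SQLConf, session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    if (sqlConf.ansiEnabled) plan else plan
}

// After: no constructor parameters; the rule reads the active conf/session when it runs.
object MyRuleAfter extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    val ansi = SQLConf.get.ansiEnabled          // replaces the ctor-injected SQLConf
    val catalog = SparkSession.active.catalog   // replaces the ctor-injected SparkSession
    // ... rule logic would branch on `ansi` / use `catalog` here ...
    plan
  }
}
```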
Commit
281f99c70b2fab2839495638d07acc1e534e5ad6
by yamamuro: [SPARK-33225][SQL] Extract AliasHelper trait

### What changes were proposed in this pull request?
Extract the methods related to handling Aliases into a trait.

### Why are the changes needed?
Avoids code duplication.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs cover this.

Closes #30134 from tanelk/SPARK-33225_aliasHelper.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 281f99c)
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushDownLeftSemiAntiJoin.scala (diff) |
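A purely illustrative sketch of the "extract shared helpers into a trait" pattern the commit describes; the trait and method names below are hypothetical and not the actual AliasHelper API:

```scala
// Hypothetical sketch, not the real AliasHelper: alias-handling helpers that several
// analyzer/optimizer rules previously duplicated, gathered into one trait.
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, NamedExpression}

trait AliasHelperSketch {
  // Strip a top-level Alias, if present, to get at the underlying expression.
  protected def trimAlias(e: Expression): Expression = e match {
    case Alias(child, _) => child
    case other => other
  }

  // Map each alias name to the expression it points to.
  protected def aliasMap(exprs: Seq[NamedExpression]): Map[String, Expression] =
    exprs.collect { case Alias(child, name) => name -> child }.toMap
}

// Rules then mix in the trait instead of re-implementing these helpers locally.
```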
Commit
f284218dae23bf91e72e221943188cdb85e13dac
by wenchen: [SPARK-33137][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (Postgres dialect)

### What changes were proposed in this pull request?
Override the default SQL strings in the Postgres dialect for:

- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY

Also add the new docker integration test suite `jdbc/v2/PostgresIntegrationSuite.scala`.

### Why are the changes needed?
Supports Postgres-specific ALTER TABLE syntax.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added the new test suite `PostgresIntegrationSuite`.

Closes #30089 from huaxingao/postgres_docker.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: f284218)
 | external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresIntegrationSuite.scala |
 | external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala (diff) |
 | external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala (diff) |
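A sketch of the PostgreSQL statements the dialect has to emit for the two operations above. The helper object and method names here are illustrative, not the actual JdbcDialect API; only the generated SQL syntax is the point:

```scala
// Illustrative only: shows the Postgres-specific ALTER TABLE syntax the dialect must
// produce (Postgres requires "TYPE" for type changes and SET/DROP NOT NULL for nullability).
object PostgresAlterTableSql {
  def updateColumnType(table: String, column: String, newType: String): String =
    s"""ALTER TABLE $table ALTER COLUMN "$column" TYPE $newType"""

  def updateColumnNullability(table: String, column: String, nullable: Boolean): String = {
    val action = if (nullable) "DROP NOT NULL" else "SET NOT NULL"
    s"""ALTER TABLE $table ALTER COLUMN "$column" $action"""
  }
}
```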
Commit
98f0a219915dc9ed696602b9bfad82d9cf6c4113
by dhyun: [SPARK-33231][SPARK-33262][CORE] Make pod allocation executor timeouts configurable & allow scheduling with pending pods

### What changes were proposed in this pull request?
Make the pod allocation executor timeouts configurable. Take all known pods into account when allocating executors, to avoid over-allocating if the pending time is much higher than the allocation interval. This PR increases the default wait time to 600s from the current 60s. Since pods can now remain "pending" for long periods of time, we allow additional batches to be scheduled during pending allocation but take the total number of pods into account.

### Why are the changes needed?
The current executor timeouts do not match those of all real-world clusters, especially under load. While this can be worked around by increasing the allocation batch delay, doing so decreases the speed at which the total number of executors can be requested. The increase in the default timeout is needed to handle the real-world testing environments I've encountered on moderately busy clusters and on K8s clusters with their own underlying dynamic scale-up of hardware (e.g. GKE, EKS, etc.).

### Does this PR introduce _any_ user-facing change?
Yes, a new configuration property.

### How was this patch tested?
Updated an existing test to use the timeout from the new configuration property. Verified that the test failed without the update.

Closes #30155 from holdenk/SPARK-33231-make-pod-creation-timeout-configurable.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 98f0a21)
 | resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala (diff) |
 | resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala (diff) |
 | resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala (diff) |
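For orientation, a spark-shell style sketch of the configuration surface this touches. The batch size/delay keys are existing Spark-on-K8s settings; the commit message does not name the new timeout property, so the last key below is an explicitly hypothetical placeholder (the real key is defined in the k8s Config.scala changed above):

```scala
// Sketch only; the last key is a placeholder, not the actual property name.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.allocation.batch.size", "10")   // pods requested per allocation round
  .set("spark.kubernetes.allocation.batch.delay", "1s")  // wait between allocation rounds
  // Hypothetical placeholder for the new configurable pending-pod timeout
  // (default raised to 600s per the commit message); check the Spark docs for the real key.
  .set("<new-executor-allocation-timeout-property>", "600s")
```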
Commit
3f2a2b5fe6ada37ef86f00737387e6cf2496df74
by dhyun: [SPARK-33260][SQL] Fix incorrect results from SortExec when sortOrder is Stream

### What changes were proposed in this pull request?
The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a `SortExec` node, and (2) it contains a duplicate grouping key, causing `RemoveRepetitionFromGroupExpressions` to produce a sort order stored as a `Stream`.

```sql
SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
FROM table_4
GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
```

When the sort order is stored as a `Stream`, the line `ordering.map(_.child.genCode(ctx))` in `GenerateOrdering#createOrderKeys()` produces unpredictable side effects on `ctx`. This is because `genCode(ctx)` modifies `ctx`. When the ordering is a `Stream`, the modifications do not happen immediately as intended, but instead occur lazily when the returned `Stream` is used later. Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680. The fix is to check whether `ordering` is a `Stream` and, if so, force the modifications to happen immediately.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a unit test for `SortExec` where `sortOrder` is a `Stream`. The test previously failed and now passes.

Closes #30160 from ankurdave/SPARK-33260.
Authored-by: Ankur Dave <ankurdave@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 3f2a2b5)
 | sql/core/src/test/scala/org/apache/spark/sql/execution/SortSuite.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala (diff) |
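A minimal, self-contained illustration of the Stream pitfall behind SPARK-33260 (not the Spark code itself): mapping over a `Stream` defers the function body, so side effects that are expected to happen immediately only happen once the stream is forced.

```scala
// Standalone demo of lazy Stream evaluation and deferred side effects.
object LazyStreamDemo extends App {
  val buffer = scala.collection.mutable.ArrayBuffer.empty[String]   // stands in for `ctx`

  def register(name: String): Int = { buffer += name; buffer.size } // mutates the "context"

  val names = Stream("a", "b", "c")
  val codes = names.map(register)   // only the head is evaluated eagerly
  println(buffer.size)              // prints 1, not 3: the rest is deferred

  codes.force                       // forcing the Stream runs the remaining side effects
  println(buffer.size)              // prints 3
}
```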
Commit
7d11d972c356140d21909c6a62cdb8d813bd015e
by yamamuro: [SPARK-33246][SQL][DOCS] Correct documentation for null semantics of "NULL AND False"

### What changes were proposed in this pull request?
The documentation of the Spark SQL null semantics states that "NULL AND False" yields NULL. This is incorrect: "NULL AND False" yields False.

```
Seq[(java.lang.Boolean, java.lang.Boolean)](
  (null, false)
)
  .toDF("left_operand", "right_operand")
  .withColumn("AND", 'left_operand && 'right_operand)
  .show(truncate = false)

+------------+-------------+-----+
|left_operand|right_operand|AND  |
+------------+-------------+-----+
|null        |false        |false|
+------------+-------------+-----+
```

I propose that the documentation be updated to reflect that "NULL AND False" yields False. This contribution is my original work and I license it to the project under the project’s open source license.

### Why are the changes needed?
This change improves the accuracy of the documentation.

### Does this PR introduce _any_ user-facing change?
Yes. This PR introduces a fix to the documentation.

### How was this patch tested?
Since this is only a documentation change, no tests were added.

Closes #30161 from stwhit/SPARK-33246.
Authored-by: Stuart White <stuart@spotright.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 7d11d97)
 | docs/sql-ref-null-semantics.md (diff) |
Commit
ea709d67486dd6329977df6c3ed7a443b835dd48
by gurwls223: [SPARK-33258][R][SQL] Add asc_nulls_* and desc_nulls_* methods to SparkR

### What changes were proposed in this pull request?
This PR adds the following `Column` methods to the R API:

- asc_nulls_first
- asc_nulls_last
- desc_nulls_first
- desc_nulls_last

### Why are the changes needed?
Feature parity.

### Does this PR introduce _any_ user-facing change?
No, new methods.

### How was this patch tested?
New unit tests.

Closes #30159 from zero323/SPARK-33258.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: ea709d6)
 | R/pkg/R/generics.R (diff) |
 | R/pkg/NAMESPACE (diff) |
 | R/pkg/R/column.R (diff) |
 | R/pkg/tests/fulltests/test_sparkSQL.R (diff) |
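For context on the parity claim: the Scala `Column` API already exposes these null-ordering variants, which the new SparkR methods mirror. A spark-shell style sketch, assuming an active SparkSession `spark`:

```scala
// Null-ordering sort variants on the Scala Column API (mirrored by the new SparkR methods).
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq((Option(1), "a"), (Option.empty[Int], "b")).toDF("x", "y")
df.orderBy(col("x").asc_nulls_first).show()   // null rows sorted before non-null values
df.orderBy(col("x").desc_nulls_last).show()   // null rows sorted after non-null values
```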
Commit
c2bea045e3628081bca1ba752669a5bc009ebd00
by yamamuro: [SPARK-33264][SQL][DOCS] Add a dedicated page for SQL-on-file in SQL documents

### What changes were proposed in this pull request?
This PR intends to add a dedicated page for SQL-on-file in the SQL documents. This comes from the comment: https://github.com/apache/spark/pull/30095/files#r508965149

### Why are the changes needed?
For better documentation.

### Does this PR introduce _any_ user-facing change?
(Screenshot of the new page: "Screen Shot 2020-10-28 at 9 56 59", https://user-images.githubusercontent.com/692303/97378051-c1fbcb80-1904-11eb-86c0-a88c5269d41c.png)

### How was this patch tested?
N/A

Closes #30165 from maropu/DocForFile.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: c2bea04)
 | docs/sql-ref-syntax-qry.md (diff) |
 | docs/sql-ref-syntax-qry-select.md (diff) |
 | docs/_data/menu-sql.yaml (diff) |
 | docs/sql-ref-syntax-qry-select-file.md |
 | docs/sql-ref-syntax.md (diff) |
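For reference, the SQL-on-file feature the new page documents lets you query a file directly by qualifying a path with the data source name. The paths below are illustrative; a SparkSession `spark` is assumed:

```scala
// Query files directly, without registering a table first.
spark.sql("SELECT * FROM parquet.`/path/to/events.parquet`").show()
spark.sql("SELECT count(*) FROM csv.`/path/to/logs.csv`").show()
```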
Commit
fcf8aa59b5025dde9b4af36953146894659967e2
by wenchen: [SPARK-33240][SQL] Fail fast when fails to instantiate configured v2 session catalog

### What changes were proposed in this pull request?
This patch changes the behavior to fail fast when Spark fails to instantiate the configured v2 session catalog.

### Why are the changes needed?
The current behavior goes against the intention of end users: if they configure a session catalog that Spark fails to initialize, Spark swallows the error, only logs the error message, and silently uses the default catalog implementation. The change follows the consensus on the [discussion thread](https://lists.apache.org/thread.html/rdfa22a5ebdc4ac66e2c5c8ff0cd9d750e8a1690cd6fb456d119c2400%40%3Cdev.spark.apache.org%3E) on the dev mailing list.

### Does this PR introduce _any_ user-facing change?
Yes. After this PR, Spark fails immediately if it cannot instantiate the configured session catalog.

### How was this patch tested?
New UT added.

Closes #30147 from HeartSaVioR/SPARK-33240.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: fcf8aa5)
 | sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/connector/SupportsCatalogOptionsSuite.scala (diff) |
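A sketch of the scenario SPARK-33240 addresses. `spark.sql.catalog.spark_catalog` is the key used to plug in a v2 session catalog; the catalog class name below is hypothetical:

```scala
// Sketch only: com.example.MyBrokenSessionCatalog is a made-up class name.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Before this PR: if this class could not be instantiated, Spark logged the error and
  // silently fell back to the default session catalog. After this PR: Spark fails fast.
  .config("spark.sql.catalog.spark_catalog", "com.example.MyBrokenSessionCatalog")
  .getOrCreate()
```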
Commit
528160f0014206eaceb01ae0f3ad316bfbdc6885
by wenchen: [SPARK-33174][SQL] Migrate DROP TABLE to use UnresolvedTableOrView to resolve the identifier

### What changes were proposed in this pull request?
This PR proposes to migrate `DROP TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or the [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?
The current behavior is not consistent between v1 and v2 commands when resolving a temp view. In v2, the `t` in the following example is resolved to a table:

```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE testcat.ns")
sql("DROP TABLE t") // 't' is resolved to testcat.ns.t
```

whereas in v1, the `t` is resolved to a temp view:

```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("DROP TABLE t") // 't' is resolved to a temp view
```

### Does this PR introduce _any_ user-facing change?
After this PR, for v2, `DROP TABLE t` is resolved to the temp view `t` instead of `testcat.ns.t`, consistent with the v1 behavior.

### How was this patch tested?
Added a new test.

Closes #30079 from imback82/drop_table_consistent.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 528160f)
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveNoopDropTable.scala |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/connector/TestV2SessionCatalogBase.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala (diff) |
Commit
9fb45361fd00b046e04748e1a1c8add3fa09f01c
by wenchen: [SPARK-33183][SQL] Fix Optimizer rule EliminateSorts and add a physical rule to remove redundant sorts

### What changes were proposed in this pull request?
This PR aims to fix a correctness bug in the optimizer rule `EliminateSorts`. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.

### Why are the changes needed?
A global sort should not be eliminated even if its child is ordered, since we don't know whether the child's ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort, even though the sort orders are the same for both sorts.

```
Sort(orders, global = True, ...)
  Sort(orders, global = False, ...)
```

Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Unit tests.

Closes #30093 from allisonwang-db/fix-sort.
Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 9fb4536)
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantSortsSuite.scala |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff) |
 | sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/RemoveRedundantSorts.scala |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff) |
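A concrete DataFrame-level version of the scenario described in SPARK-33183 above, assuming a SparkSession `spark`: the outer `orderBy` is a global sort whose child is already locally sorted on the same key, yet it must not be eliminated.

```scala
// Global sort over a local (within-partition) sort on the same key.
import spark.implicits._

val df = spark.range(0, 100, 1, 4).toDF("id")      // 4 partitions
val locallySorted = df.sortWithinPartitions($"id") // Sort(global = false)
val globallySorted = locallySorted.orderBy($"id")  // Sort(global = true): must be kept
globallySorted.explain()
```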
Commit
3c3ad5f7c00f6f68bc659d4cf7020fa944b7bc69
by wenchen: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and refactor the OffsetWindowFunction

### What changes were proposed in this pull request?
Spark SQL supports window functions like `NTH_VALUE`. If we specify a window frame like `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, we can eliminate some calculations. For example, if we execute the SQL shown below:

```
SELECT NTH_VALUE(col, 2) OVER(ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM tab;
```

the output for rows whose row number is greater than 1 is a fixed value; otherwise, it is null. So we only need to calculate the value once and check whether the row number is less than 2. The `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` case is even simpler.

### Why are the changes needed?
Improve the performance for `NTH_VALUE`, `FIRST_VALUE` and `LAST_VALUE`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Jenkins tests.

Closes #29800 from beliefer/optimize-nth_value.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 3c3ad5f)
 | sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala (diff) |
 | sql/core/src/test/resources/sql-tests/inputs/window.sql (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala (diff) |
 | sql/core/src/test/resources/sql-tests/results/window.sql.out (diff) |
Commit
2b8fe6d9ae2fe31d1545da98003f931ee1aa11d5
by gurwls223: [SPARK-33269][INFRA] Ignore ".bsp/" directory in Git

### What changes were proposed in this pull request?
After the SBT upgrade to 1.4.0 and above, there is always a ".bsp" directory after sbt starts: https://github.com/sbt/sbt/releases/tag/v1.4.0 This PR puts the directory into `.gitignore`.

### Why are the changes needed?
The ".bsp" directory shows up as an untracked file in git during development, which is annoying.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual local test.

Closes #30171 from gengliangwang/ignoreBSP.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 2b8fe6d)
 | .gitignore (diff) |
Commit
b26ae98407c6c017a4061c0c420f48685ddd6163
by wenchen: [SPARK-33208][SQL] Update the document of SparkSession#sql

Change-Id: I82db1f9e8f667573aa3a03e05152cbed0ea7686b

### What changes were proposed in this pull request?
Update the documentation of SparkSession#sql to mention that this API eagerly runs DDL/DML commands, but not SELECT queries.

### Why are the changes needed?
To clarify the behavior of SparkSession#sql.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Not needed.

Closes #30168 from waitinfuture/master.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: b26ae98)
 | sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff) |
 | sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala (diff) |
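A short illustration of the behavior being documented, assuming a SparkSession `spark` and an illustrative table name: DDL/DML statements are executed eagerly when `sql()` is called, while a SELECT only builds a lazy DataFrame that runs when an action is triggered.

```scala
// DDL/DML run eagerly at the sql() call.
spark.sql("CREATE TABLE IF NOT EXISTS people (name STRING, age INT) USING parquet") // runs now
spark.sql("INSERT INTO people VALUES ('Alice', 30)")                                // runs now

// SELECT is lazy: nothing is executed until an action is called.
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")  // no execution yet
adults.show()                                                      // execution happens here
```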
Commit
a6216e2446b6befc3f6d6b370e694421aadda9dd
by dhyun: [SPARK-33268][SQL][PYTHON] Fix bugs for casting data from/to PythonUserDefinedType

### What changes were proposed in this pull request?
This PR intends to fix bugs for casting data from/to PythonUserDefinedType. A sequence of queries to reproduce this issue is as follows:

```
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import col
>>> from pyspark.sql.types import *
>>> from pyspark.testing.sqlutils import *
>>>
>>> row = Row(point=ExamplePoint(1.0, 2.0))
>>> df = spark.createDataFrame([row])
>>> df.select(col("point").cast(PythonOnlyUDT()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/dataframe.py", line 1402, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.select.
: java.lang.NullPointerException
  at org.apache.spark.sql.types.UserDefinedType.acceptsType(UserDefinedType.scala:84)
  at org.apache.spark.sql.catalyst.expressions.Cast$.canCast(Cast.scala:96)
  at org.apache.spark.sql.catalyst.expressions.CastBase.checkInputDataTypes(Cast.scala:267)
  at org.apache.spark.sql.catalyst.expressions.CastBase.resolved$lzycompute(Cast.scala:290)
  at org.apache.spark.sql.catalyst.expressions.CastBase.resolved(Cast.scala:290)
```

A root cause of this issue is that, since `PythonUserDefinedType#userClass` is always null, `isAssignableFrom` in `UserDefinedType#acceptsType` throws a NullPointerException. To fix it, this PR defines `acceptsType` in `PythonUserDefinedType` and filters out the null case in `UserDefinedType#acceptsType`.

### Why are the changes needed?
Bug fixes.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added tests.

Closes #30169 from maropu/FixPythonUDTCast.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: a6216e2)
 | python/pyspark/sql/tests/test_types.py (diff) |
 | sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala (diff) |
Commit
a744fea3be12f1a53ab553040b95da730210bc88
by dhyun: [SPARK-33267][SQL] Fix NPE issue on 'In' filter when one of the values contains null

### What changes were proposed in this pull request?
This PR proposes to fix the NPE issue on the `In` filter when one of its values contains null. In a real case, you can trigger this issue when you try to push down a filter with `in (..., null)` against a V2 source table. `DataSourceStrategy` caches the mapping (filter instance -> expression) in a HashMap, which relies on the hash code of the key, hence it can trigger the NPE issue.

### Why are the changes needed?
This is an obvious bug, as the `In` filter doesn't account for null values when calculating its hash code.

### Does this PR introduce _any_ user-facing change?
Yes. Previously, a query with `null` in an "in" condition against a data source V2 table supporting filter push-down failed with an NPE, whereas after this PR the query will not fail.

### How was this patch tested?
UT added. The new UT fails without the PR and passes with the PR.

Closes #30170 from HeartSaVioR/SPARK-33267.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: a744fea)
 | sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala (diff) |
 | sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala (diff) |
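A minimal sketch of the hashing pitfall behind SPARK-33267, not the actual Spark fix: if a filter's hash code naively dereferences each value, a null element throws an NPE as soon as the filter is used as a HashMap key. The class below is hypothetical.

```scala
// Hypothetical filter used only to illustrate a null-safe hashCode.
case class InFilterSketch(attribute: String, values: Array[Any]) {
  // Naive version (NPEs when a value is null): values.map(_.hashCode).sum
  // Null-safe version: treat null elements as hash 0.
  override def hashCode(): Int = {
    val valuesHash =
      values.foldLeft(0)((acc, v) => 31 * acc + (if (v == null) 0 else v.hashCode))
    31 * attribute.hashCode + valuesHash
  }
}

// Using the filter as a HashMap key no longer blows up on the null element.
val key = InFilterSketch("col", Array(1, 2, null))
val cache = scala.collection.mutable.HashMap(key -> "pushed-down expression")
```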