Commit
5debde94019d46d4ab66f7927d9e5e8c4d16a7ec
by dongjoon [SPARK-32357][INFRA] Publish failed and succeeded test reports in GitHub Actions

### What changes were proposed in this pull request?

This PR proposes to report the failed and succeeded tests in GitHub Actions, in order to improve development velocity, by leveraging [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report). See the example below:

Note that we cannot use [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report) directly in Apache Spark, because PRs come from forked repositories, where GitHub secrets are unavailable for security reasons. This plugin, like all similar plugins, requires a GitHub token with write access in order to post test results, but such a token is unavailable in PRs. To work around this limitation, I took the following approach:

1. In workflow A, run the tests and upload the JUnit XML test results. GitHub Actions provides a way to upload and download artifacts.
2. GitHub introduced the new event type [`workflow_run`](https://github.blog/2020-08-03-github-actions-improvements-for-fork-and-pull-request-workflows/) 10 days ago. Leveraging it, the completion of workflow A triggers another workflow B.
3. Workflow B lives in the main repository instead of the fork, so it has the write access the plugin needs. In workflow B, it downloads the artifacts uploaded by workflow A (from the forked repository).
4. Workflow B generates the test reports from the JUnit XML files.
5. Workflow B looks up the PR and posts the test reports on it.

The `workflow_run` event is a very new feature, and few GitHub Actions plugins support it yet. In order to make this work with [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report), I had to fork two GitHub Actions plugins:

- [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report), with this custom fix: https://github.com/HyukjinKwon/action-surefire-report/commit/c96094cc35061fcf154a7cb46807f2f3e2339476 It adds a `commit` argument to specify the commit to post the test reports against. With `workflow_run`, workflow B can use it to reference the commit from workflow A.
- [dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact), with this custom fix: https://github.com/HyukjinKwon/action-download-artifact/commit/750b71af351aba467757d7be6924199bb08db4ed It adds support for downloading all artifacts from workflow A in workflow B; by default, the plugin only supports downloading an artifact by name.

Note that I was not able to use the official [actions/download-artifact](https://github.com/actions/download-artifact), because it does not support downloading artifacts between different workflows; see also https://github.com/actions/download-artifact/issues/3. Once this issue is resolved, we can switch back to [actions/download-artifact](https://github.com/actions/download-artifact). I plan to make pull requests to both upstream repositories so we don't have to rely on the forks.

### Why are the changes needed?

Currently, it is difficult to check failed tests: you have to scroll through long logs in the GitHub Actions output.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested at https://github.com/HyukjinKwon/spark/pull/17, https://github.com/HyukjinKwon/spark/pull/18, https://github.com/HyukjinKwon/spark/pull/19, https://github.com/HyukjinKwon/spark/pull/20, and the master branch of my forked repository.

Closes #29333 from HyukjinKwon/SPARK-32357-fix.
Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (commit: 5debde9)
- .github/workflows/test_report.yml
- .github/workflows/master.yml (diff)
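The two-workflow handshake described above can be sketched as a minimal `workflow_run` workflow. This is an illustrative sketch, not the exact `test_report.yml` added by the PR: the triggering workflow name, report paths, and action versions are assumptions.

```yaml
# Workflow B: runs in the main repository with write access, triggered
# when workflow A ("Build and test") completes -- including runs from forks.
name: Report test results
on:
  workflow_run:
    workflows: ["Build and test"]
    types: [completed]

jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      # Download the JUnit XML artifacts that workflow A uploaded.
      - name: Download test results
        uses: dawidd6/action-download-artifact@master
        with:
          workflow: ${{ github.event.workflow_run.workflow_id }}
          commit: ${{ github.event.workflow_run.head_sha }}
      # Post the parsed report back onto the commit behind the PR.
      - name: Publish test report
        uses: scacap/action-surefire-report@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          commit: ${{ github.event.workflow_run.head_sha }}
          report_paths: "**/target/test-reports/*.xml"
```

Because workflow B runs in the main repository's context, `secrets.GITHUB_TOKEN` is available here, which is exactly what the forked PR's own workflow lacks.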
Commit
339eec5f321fac07e01441a3fd83024f4b92a0d6
by wenchen [SPARK-20680][SQL][FOLLOW-UP] Add HiveVoidType in HiveClientImpl

### What changes were proposed in this pull request?

Following the discussion in [this comment](https://github.com/apache/spark/pull/29244#issuecomment-671746329), add a `HiveVoidType` class in `HiveClientImpl` so that we can replace `NullType` with `HiveVoidType` before we call the Hive client.

### Why are the changes needed?

Better compatibility with Hive. More details in [#29244](https://github.com/apache/spark/pull/29244).

### Does this PR introduce _any_ user-facing change?

Yes, users can create a view with a null-typed column in Hive.

### How was this patch tested?

New test.

Closes #29423 from ulysses-you/add-HiveVoidType.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com> (commit: 339eec5)
- sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (diff)
- sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSQLViewSuite.scala (diff)
- sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (diff)
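The idea behind the change can be illustrated with a small sketch. The class and function names here are hypothetical; Spark's actual implementation lives in `HiveClientImpl` and differs in detail. The point is that any null-typed column is rewritten to a `void` placeholder type before the schema reaches Hive.

```python
# Hypothetical sketch of the HiveVoidType idea: before a schema is handed
# to the Hive client, any NullType column is swapped for a "void"
# placeholder type that Hive's catalog understands.

class DataType:
    name = "unknown"

class NullType(DataType):
    name = "null"

class StringType(DataType):
    name = "string"

class HiveVoidType(DataType):
    # Hive's catalog says "void" where Spark says "null".
    name = "void"

def to_hive_type(dt: DataType) -> DataType:
    """Replace NullType with HiveVoidType; pass other types through."""
    return HiveVoidType() if isinstance(dt, NullType) else dt

schema = [("id", StringType()), ("unused", NullType())]
hive_schema = [(col, to_hive_type(dt)) for col, dt in schema]
print([dt.name for _, dt in hive_schema])  # → ['string', 'void']
```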
Commit
14003d4c30e881940b6e3bf2d978bc8dca941146
by wenchen [SPARK-32590][SQL] Remove fullOutput from RowDataSourceScanExec

### What changes were proposed in this pull request?

Remove `fullOutput` from `RowDataSourceScanExec`.

### Why are the changes needed?

`RowDataSourceScanExec` requires the full output instead of the scan output after column pruning. However, in the v2 code path we no longer have the full output, so we just pass the pruned output. `RowDataSourceScanExec.fullOutput` is therefore meaningless and should be removed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29415 from huaxingao/rm_full_output.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com> (commit: 14003d4)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
- sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala (diff)
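Column pruning, which motivates passing only the pruned output, can be sketched in a few lines. This is illustrative only: Spark represents a scan's output as a `Seq[Attribute]`, not strings.

```python
# Illustrative sketch of column pruning: after pruning, the scan node only
# carries the attributes the query actually references, so also keeping the
# source's full schema (the old `fullOutput`) is unnecessary.
full_schema = ["id", "name", "age", "email"]   # the source's full output
referenced = {"name", "age"}                   # columns the query needs

pruned_output = [col for col in full_schema if col in referenced]
print(pruned_output)  # → ['name', 'age']
```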
Commit
10edeafc69250afef8c71ed7b3c77992f67aa4ff
by yamamuro [MINOR][SQL] Fixed approx_count_distinct rsd param description

### What changes were proposed in this pull request?

In the docs for `approx_count_distinct`, change the description of the `rsd` parameter from **_maximum estimation error allowed_** to **_maximum relative standard deviation allowed_**.

### Why are the changes needed?

"Maximum estimation error allowed" can be misleading: you can set the target relative standard deviation, which affects the estimation error, but on a given run the estimation error can still exceed the `rsd` parameter.

### Does this PR introduce _any_ user-facing change?

This PR should make it easier for users reading the docs to understand that the `rsd` parameter of `approx_count_distinct` doesn't cap the estimation error, but only sets the target deviation.

### How was this patch tested?

No tests, as no code changes were made.

Closes #29424 from Comonut/fix-approx_count_distinct-rsd-param-description.

Authored-by: alexander-daskalov <alexander.daskalov@adevinta.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (commit: 10edeaf)
- sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala (diff)
- python/pyspark/sql/functions.py (diff)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala (diff)
- R/pkg/R/functions.R (diff)
- sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
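The distinction matters because HyperLogLog++, which backs `approx_count_distinct`, has an *expected* relative standard deviation of roughly 1.04 / sqrt(m) for m registers: `rsd` chooses m to meet that target, but any single estimate can still land outside it. The precision-selection loop below is an assumption for illustration, not Spark's exact code.

```python
import math

def precision_for_rsd(rsd: float) -> int:
    """Smallest HLL precision p such that the expected relative standard
    deviation, 1.04 / sqrt(2**p), is at or below the requested rsd."""
    p = 4  # HLL implementations typically require at least 2**4 registers
    while 1.04 / math.sqrt(2 ** p) > rsd:
        p += 1
    return p

# rsd is a *target* standard deviation, not a hard error cap: individual
# estimates can still deviate by more than the chosen rsd.
print(precision_for_rsd(0.05))  # → 9   (512 registers for the default rsd)
print(precision_for_rsd(0.01))  # → 14  (16384 registers)
```

Halving the target `rsd` roughly quadruples the number of registers, which is why very small `rsd` values cost noticeably more memory.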
Commit
c6be2074cc704ec1578eb05d1d8a696227cd2bbf
by wenchen [SPARK-32616][SQL] Window operators should be added determinedly

### What changes were proposed in this pull request?

Use a `LinkedHashMap` instead of an `immutable.Map` to hold the `Window` expressions in `ExtractWindowExpressions.addWindow`.

### Why are the changes needed?

This is a bug fix for https://github.com/apache/spark/pull/29270. In that PR, the generated plans (especially for the queries q47, q49, q57) on Jenkins could never match the golden plans generated on my laptop. This happens because `ExtractWindowExpressions.addWindow` uses an `immutable.Map` to hold the `Window` expressions, keyed by `(spec.partitionSpec, spec.orderSpec, WindowFunctionType.functionType(expr))`, and converts the map to a `Seq` at the end. The `Seq` is then used to add Window operators on top of the child plan. However, for the same query, the order of Window expressions inside the `Seq` can change when the expression ids change (which affects the keys). As a result, the same query can produce different plans because of the undetermined order of Window operators. Therefore, we use a `LinkedHashMap`, which records the insertion order of entries, to make the order of the added operators deterministic.

### Does this PR introduce _any_ user-facing change?

Maybe yes: users now always see the same plan for the same query with multiple Window operators.

### How was this patch tested?

It is really hard to build a reproducing demo. I tested manually with https://github.com/apache/spark/pull/29270 and it looks good.

Closes #29432 from Ngone51/fix-addWindow.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com> (commit: c6be207)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
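The ordering bug can be sketched in a few lines (hypothetical key shapes; Spark's keys are `(partitionSpec, orderSpec, functionType)` tuples containing expression ids):

```python
# Sketch of the fix: the map keys embed expression ids that can change
# between analyses of the same query. A hash-ordered map may then iterate
# the Window specs in a different order each time, whereas an insertion-
# ordered map (Scala's LinkedHashMap, like a Python dict) keeps the order
# in which the specs were first seen.

def window_order(expr_ids, insertion_ordered):
    keys = [(f"window_{i}", eid) for i, eid in enumerate(expr_ids)]
    if not insertion_ordered:
        keys = sorted(keys, key=hash)  # stand-in for hash-bucket iteration
    return [name for name, _ in keys]

# The same query analyzed twice, with different expression ids assigned:
run_a = window_order([1, 2, 3], insertion_ordered=True)
run_b = window_order([101, 102, 103], insertion_ordered=True)

# With insertion ordering, the Window operators always come out in the
# same order, regardless of the ids embedded in the keys.
assert run_a == run_b == ["window_0", "window_1", "window_2"]
```

With `insertion_ordered=False` the result depends on the hash of each key, so changing the ids (or the hash seed) can reshuffle the operators, which is exactly the golden-plan mismatch described above.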
Commit
1a4c8f718f748d8c2a39f0a3b209ce606bffe958
by mridulatgmail.com [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

### What changes were proposed in this pull request?

This PR changes `Executor` to load the jars and files added by `--jars` and `--files` on executor initialization. To avoid downloading those jars/files twice, they are associated with `startTime` as their upload timestamp.

### Why are the changes needed?

`ExecutorPlugin` doesn't work with Standalone Cluster and Kubernetes when a jar containing the plugins, and the files used by the plugins, are added via the `--jars` and `--files` options of `spark-submit`. This is because jars and files added by `--jars` and `--files` are not loaded on executor initialization. I confirmed it works with YARN, because there jars/files are distributed via the distributed cache.

### Does this PR introduce _any_ user-facing change?

Yes: jars/files added by `--jars` and `--files` are downloaded on each executor on initialization.

### How was this patch tested?

Added a new test case.

Closes #28939 from sarutak/fix-plugin-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (commit: 1a4c8f7)
- core/src/main/scala/org/apache/spark/SparkContext.scala (diff)
- core/src/main/scala/org/apache/spark/executor/Executor.scala (diff)
- core/src/test/java/test/org/apache/spark/JavaSparkContextSuite.java (diff)
- core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala (diff)
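The timestamp trick described above, using `startTime` to avoid downloading the same jar twice, can be sketched as a small cache (hypothetical names; Spark's executor tracks per-file timestamps internally):

```python
# Hypothetical sketch of the timestamp-keyed download cache: a jar is
# fetched only when its (path, timestamp) pair has not been seen before,
# so files registered at executor startup are not downloaded twice, while
# a later re-add with a new timestamp still triggers a fresh download.
downloaded = {}   # path -> timestamp of the copy we already have
fetch_log = []    # record of simulated downloads

def fetch(path, timestamp):
    if downloaded.get(path) != timestamp:
        fetch_log.append(path)      # simulate the actual download
        downloaded[path] = timestamp

start_time = 1_000
fetch("plugin.jar", start_time)   # initial load on executor startup
fetch("plugin.jar", start_time)   # same timestamp: skipped
fetch("plugin.jar", 2_000)        # re-added later with a new timestamp
print(fetch_log)                  # → ['plugin.jar', 'plugin.jar']
```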