Failed

Changes

Summary

  1. [SPARK-31918][R] Ignore S4 generic methods under SparkR namespace in (commit: 11d2b07b74c73ce6d59ac4f7446f1eb8bc6bbb4b)
  2. [SPARK-32073][R] Drop R < 3.5 support (commit: b62e2536db9def0d11605ceac8990f72a515e9a0)
  3. [SPARK-32028][WEBUI] fix app id link for multi attempts app in history (commit: eedc6cc37df9b32995f41bd0e1779101ba1df1b8)
  4. [SPARK-32075][DOCS] Fix a few issues in parameters table (commit: 986fa01747db4b52bb8ca1165e759ca2d46d26ff)
  5. [SPARK-32072][CORE][TESTS] Fix table formatting with benchmark results (commit: 045106e29d6b3cbb7be61b46604b85297c405aa3)
  6. [SPARK-32062][SQL] Reset listenerRegistered in SparkSession (commit: 9f540fac2e50bbcc214351f0c80690eae7be6b98)
  7. [SPARK-32074][BUILD][R] Update AppVeyor R version to 4.0.2 (commit: e29ec428796eac4ebdb9c853131465d7570dc2f1)
  8. [SPARK-32080][SPARK-31998][SQL] Simplify ArrowColumnVector ListArray (commit: df04107934241965199bd5454c62e1016bb3bdd9)
  9. [SPARK-32087][SQL] Allow UserDefinedType to use encoder to deserialize (commit: 47fb9d60549da02b869a3f0aad2ccb34d455c963)
  10. [SPARK-32089][R][BUILD] Upgrade R version to 4.0.2 in the release (commit: 71b6d462fbeebf5e7e9a95896f0dca8297d0b8dd)
  11. [SPARK-32078][DOC] Add a redirect to sql-ref from sql-reference (commit: d06604f60a8a2ba0877616370a20aa18be15f8c4)
Commit 11d2b07b74c73ce6d59ac4f7446f1eb8bc6bbb4b by gurwls223
[SPARK-31918][R] Ignore S4 generic methods under SparkR namespace in
closure cleaning to support R 4.0.0+
### What changes were proposed in this pull request?
This PR proposes to ignore S4 generic methods under SparkR namespace in
closure cleaning to support R 4.0.0+.
Currently, code that runs R native code fails as below with R 4.0.0:
```r
df <- createDataFrame(lapply(seq(100), function (e) list(value=e)))
count(dapply(df, function(x) as.data.frame(x[x$value < 50,]), schema(df)))
```
```
org.apache.spark.SparkException: R unexpectedly exited. R worker produced errors:
Error in lapply(part, FUN) : attempt to bind a variable to R_UnboundValue
```
The root cause seems to be that an S4 generic method is manually included in the closure's environment via `SparkR:::cleanClosure`. For example, when an RRDD is created via `createDataFrame`, which calls `lapply` to convert, `lapply` itself:
https://github.com/apache/spark/blob/f53d8c63e80172295e2fbc805c0c391bdececcaa/R/pkg/R/RDD.R#L484
is added into the environment of the cleaned closure because it is not in an exposed namespace; however, this is broken in R 4.0.0+ for an unknown reason, with an error message such as "attempt to bind a variable to R_UnboundValue".
Actually, we don't need to add `lapply` to the environment of the closure because it is not supposed to be called on the worker side. In fact, to my understanding, no private generic method in SparkR is supposed to be called on the worker side at all.
Therefore, this PR takes a simpler path and works around the issue by explicitly excluding S4 generic methods under the SparkR namespace, to support R 4.0.0+ in SparkR.
### Why are the changes needed?
To support R 4.0.0+ with SparkR and unblock the releases on CRAN. CRAN requires the tests to pass with the latest R.
### Does this PR introduce _any_ user-facing change?
Yes, it will support R 4.0.0 to end-users.
### How was this patch tested?
Manually tested. Both the CRAN checks and the tests passed with R 4.0.1:
```
══ testthat results ═══════════════════════════════════════════════════════════
[ OK: 13 | SKIPPED: 0 | WARNINGS: 0 | FAILED: 0 ]
✔ |  OK F W S | Context
✔ |  11       | binary functions [2.5 s]
✔ |   4       | functions on binary files [2.1 s]
✔ |   2       | broadcast variables [0.5 s]
✔ |   5       | functions in client.R
✔ |  46       | test functions in sparkR.R [6.3 s]
✔ |   2       | include R packages [0.3 s]
✔ |   2       | JVM API [0.2 s]
✔ |  75       | MLlib classification algorithms, except for tree-based algorithms [86.3 s]
✔ |  70       | MLlib clustering algorithms [44.5 s]
✔ |   6       | MLlib frequent pattern mining [3.0 s]
✔ |   8       | MLlib recommendation algorithms [9.6 s]
✔ | 136       | MLlib regression algorithms, except for tree-based algorithms [76.0 s]
✔ |   8       | MLlib statistics algorithms [0.6 s]
✔ |  94       | MLlib tree-based algorithms [85.2 s]
✔ |  29       | parallelize() and collect() [0.5 s]
✔ | 428       | basic RDD functions [25.3 s]
✔ |  39       | SerDe functionality [2.2 s]
✔ |  20       | partitionBy, groupByKey, reduceByKey etc. [3.9 s]
✔ |   4       | functions in sparkR.R
✔ |  16       | SparkSQL Arrow optimization [19.2 s]
✔ |   6       | test show SparkDataFrame when eager execution is enabled. [1.1 s]
✔ | 1175      | SparkSQL functions [134.8 s]
✔ |  42       | Structured Streaming [478.2 s]
✔ |  16       | tests RDD function take() [1.1 s]
✔ |  14       | the textFile() function [2.9 s]
✔ |  46       | functions in utils.R [0.7 s]
✔ |   0     1 | Windows-specific tests
────────────────────────────────────────────────────────────────────────────────
test_Windows.R:22: skip: sparkJars tag in SparkContext
Reason: This test is only for Windows, skipped
────────────────────────────────────────────────────────────────────────────────
══ Results ═════════════════════════════════════════════════════════════════════
Duration: 987.3 s
OK:       2304
Failed:   0
Warnings: 0
Skipped:  1
...
Status: OK
+ popd
Tests passed.
```
Note that I also tested building SparkR with R 4.0.0 and running the tests with R 3.6.3; it all passed. See also [the comment in the JIRA](https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142837&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17142837).
Closes #28907 from HyukjinKwon/SPARK-31918.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 11d2b07b74c73ce6d59ac4f7446f1eb8bc6bbb4b)
The file was modified R/pkg/tests/fulltests/test_mllib_classification.R (diff)
The file was modified R/pkg/tests/fulltests/test_mllib_clustering.R (diff)
The file was modified R/pkg/tests/fulltests/test_mllib_regression.R (diff)
The file was modified R/pkg/R/utils.R (diff)
The file was modified R/pkg/tests/fulltests/test_context.R (diff)
Commit b62e2536db9def0d11605ceac8990f72a515e9a0 by gurwls223
[SPARK-32073][R] Drop R < 3.5 support
### What changes were proposed in this pull request?
Spark 3.0 accidentally dropped support for R < 3.5. It is built with R 3.6.3, which does not support R < 3.5:
```
Error in readRDS(pfile) : cannot read workspace version 3 written by R 3.6.3; need R 3.5.0 or newer version.
```
In fact, with SPARK-31918, we will have to drop R < 3.5 entirely to support R 4.0.0. This is unavoidable for releasing on CRAN because CRAN requires the tests to pass with the latest R.
### Why are the changes needed?
To show the supported versions correctly, and support R 4.0.0 to unblock
the releases.
### Does this PR introduce _any_ user-facing change?
Strictly speaking, no, because Spark 3.0.0 already does not work with R < 3.5. Compared to Spark 2.4, yes: R < 3.5 no longer works.
### How was this patch tested?
Jenkins should test it out.
Closes #28908 from HyukjinKwon/SPARK-32073.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: b62e2536db9def0d11605ceac8990f72a515e9a0)
The file was modified R/pkg/inst/profile/general.R (diff)
The file was modified R/pkg/DESCRIPTION (diff)
The file was modified docs/index.md (diff)
The file was modified R/WINDOWS.md (diff)
The file was modified R/pkg/inst/profile/shell.R (diff)
Commit eedc6cc37df9b32995f41bd0e1779101ba1df1b8 by srowen
[SPARK-32028][WEBUI] fix app id link for multi attempts app in history
summary page
### What changes were proposed in this pull request?
Fix the app id link for multi-attempt applications in the history summary page. If an attempt id is available (YARN), the app id link URL contains the correct attempt id, like `/history/application_1561589317410_0002/1/jobs/`. If no attempt id is available (standalone), the app id link URL contains no fake attempt id, like `/history/app-20190404053606-0000/jobs/`.
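As a rough illustration of that rule (the actual fix is JavaScript in `historypage.js`; this Scala sketch uses hypothetical names):
```scala
// Build the history-page link: include the attempt id segment only when the
// cluster manager (e.g. YARN) actually provides one; standalone apps get none.
def appIdLink(appId: String, attemptId: Option[String]): String =
  attemptId match {
    case Some(id) => s"/history/$appId/$id/jobs/"
    case None     => s"/history/$appId/jobs/"
  }
```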
### Why are the changes needed?
This PR fixes [SPARK-32028](https://issues.apache.org/jira/browse/SPARK-32028). The app id link used the application attempt count as the attempt id, which produced wrong link URLs in these cases:
1. When there are multiple attempts, all links point to the last attempt:
![multi_same](https://user-images.githubusercontent.com/10524738/85098505-c45c5500-b1af-11ea-8912-fa5fd72ce064.JPG)
2. When there is a single attempt whose id is not 1 (an earlier attempt may have crashed or failed to generate an event file), the link points to the wrong attempt (1):
![wrong_attemptJPG](https://user-images.githubusercontent.com/10524738/85098513-c9b99f80-b1af-11ea-8cbc-fd7f745c1080.JPG)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested this manually.
Closes #28867 from zhli1142015/fix-appid-link-in-history-page.
Authored-by: Zhen Li <zhli@microsoft.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
(commit: eedc6cc37df9b32995f41bd0e1779101ba1df1b8)
The file was modified core/src/main/resources/org/apache/spark/ui/static/historypage.js (diff)
The file was modified core/src/main/resources/org/apache/spark/ui/static/historypage-template.html (diff)
Commit 986fa01747db4b52bb8ca1165e759ca2d46d26ff by gurwls223
[SPARK-32075][DOCS] Fix a few issues in parameters table
### What changes were proposed in this pull request?
Fix a few issues in parameters table in
structured-streaming-kafka-integration doc.
### Why are the changes needed?
Make the title of the table consistent with the data.
### Does this PR introduce _any_ user-facing change?
Yes.
Before:
![image](https://user-images.githubusercontent.com/67275816/85414316-8475e300-b59e-11ea-84ec-fa78ecc980b3.png)
After:
![image](https://user-images.githubusercontent.com/67275816/85414562-d61e6d80-b59e-11ea-9fe6-247e0ad4d9ee.png)
Before:
![image](https://user-images.githubusercontent.com/67275816/85414467-b8510880-b59e-11ea-92a0-7205542fe28b.png)
After:
![image](https://user-images.githubusercontent.com/67275816/85414589-de76a880-b59e-11ea-91f2-5073eaf3444b.png)
Before:
![image](https://user-images.githubusercontent.com/67275816/85414502-c69f2480-b59e-11ea-837f-1201f10a56b6.png)
After:
![image](https://user-images.githubusercontent.com/67275816/85414615-e9313d80-b59e-11ea-9b1a-fc11da0b6bc5.png)
### How was this patch tested?
Manually built and checked.
Closes #28910 from sidedoorleftroad/SPARK-32075.
Authored-by: sidedoorleftroad <sidedoorleftroad@163.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 986fa01747db4b52bb8ca1165e759ca2d46d26ff)
The file was modified docs/structured-streaming-kafka-integration.md (diff)
Commit 045106e29d6b3cbb7be61b46604b85297c405aa3 by wenchen
[SPARK-32072][CORE][TESTS] Fix table formatting with benchmark results
### What changes were proposed in this pull request?
Set the width of the benchmark-name column to the maximum of:
1. 40 (the width before this PR),
2. the length of the benchmark name, and
3. the maximum length of the case names.
See the sketch after this list.
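A minimal Scala sketch of that rule, with illustrative names (the real change lives in `core/src/test/scala/org/apache/spark/benchmark/Benchmark.scala`):
```scala
// The name column is at least 40 characters wide (the previous fixed width),
// but grows to fit the benchmark name and the longest case name.
def nameColumnWidth(benchmarkName: String, caseNames: Seq[String]): Int =
  (Seq(40, benchmarkName.length) ++ caseNames.map(_.length)).max
```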
### Why are the changes needed?
To improve the readability of benchmark results, for example for `MakeDateTimeBenchmark`.
Before:
```
make_timestamp():                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
prepare make_timestamp()                             3636           3673          38          0.3        3635.7       1.0X
make_timestamp(2019, 1, 2, 3, 4, 50.123456)            94             99           4         10.7          93.8      38.8X
make_timestamp(2019, 1, 2, 3, 4, 60.000000)            68             80          13         14.6          68.3      53.2X
make_timestamp(2019, 12, 31, 23, 59, 60.00)            65             79          19         15.3          65.3      55.7X
make_timestamp(*, *, *, 3, 4, 50.123456)              271            280          14          3.7         270.7      13.4X
```
After:
```
make_timestamp():                            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------
prepare make_timestamp()                              3694           3745          82          0.3        3694.0       1.0X
make_timestamp(2019, 1, 2, 3, 4, 50.123456)             82             90           9         12.2          82.3      44.9X
make_timestamp(2019, 1, 2, 3, 4, 60.000000)             72             77           5         13.9          71.9      51.4X
make_timestamp(2019, 12, 31, 23, 59, 60.00)             67             71           5         15.0          66.8      55.3X
make_timestamp(*, *, *, 3, 4, 50.123456)               273            289          14          3.7         273.2      13.5X
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By re-generating benchmark results for `MakeDateTimeBenchmark`:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark"
```
in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 |
Closes #28906 from MaxGekk/benchmark-table-formatting.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 045106e29d6b3cbb7be61b46604b85297c405aa3)
The file was modified sql/core/benchmarks/MakeDateTimeBenchmark-jdk11-results.txt (diff)
The file was modified sql/core/benchmarks/MakeDateTimeBenchmark-results.txt (diff)
The file was modified core/src/test/scala/org/apache/spark/benchmark/Benchmark.scala (diff)
Commit 9f540fac2e50bbcc214351f0c80690eae7be6b98 by wenchen
[SPARK-32062][SQL] Reset listenerRegistered in SparkSession
### What changes were proposed in this pull request?
Reset `listenerRegistered` when the application ends.
### Why are the changes needed?
Within one JVM, stopping and creating a `SparkContext` multiple times triggers the bug.
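A minimal sketch of the idea, assuming a mutable `listenerRegistered` flag as named in the title (illustrative; the real implementation is in `SparkSession.scala`):
```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Clear the flag when the application ends, so a SparkContext created later in
// the same JVM registers the session listener again instead of skipping it.
class ResetFlagListener(reset: () => Unit) extends SparkListener {
  override def onApplicationEnd(event: SparkListenerApplicationEnd): Unit = reset()
}
```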
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add UT.
Closes #28899 from ulysses-you/SPARK-32062.
Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 9f540fac2e50bbcc214351f0c80690eae7be6b98)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
Commit e29ec428796eac4ebdb9c853131465d7570dc2f1 by gurwls223
[SPARK-32074][BUILD][R] Update AppVeyor R version to 4.0.2
### What changes were proposed in this pull request?
R version 4.0.2 was released; see
https://cran.r-project.org/doc/manuals/r-release/NEWS.html. This PR
upgrades the R version in the AppVeyor CI environment.
### Why are the changes needed?
To test the latest R versions before the release, and see if there are
any regressions.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
AppVeyor will test.
Closes #28909 from HyukjinKwon/SPARK-32074.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: e29ec428796eac4ebdb9c853131465d7570dc2f1)
The file was modified R/install-dev.bat (diff)
The file was modified dev/appveyor-install-dependencies.ps1 (diff)
Commit df04107934241965199bd5454c62e1016bb3bdd9 by gurwls223
[SPARK-32080][SPARK-31998][SQL] Simplify ArrowColumnVector ListArray
accessor
### What changes were proposed in this pull request?
This change simplifies the ArrowColumnVector ListArray accessor to use the Arrow APIs available as of v0.15.0 to calculate element indices.
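A Scala sketch of the simplified index lookup, assuming Arrow's `ListVector` element-index API from v0.15.0 (the actual accessor is Java code in `ArrowColumnVector.java`):
```scala
import org.apache.arrow.vector.complex.ListVector

// Instead of decoding the offset buffer by hand, ask the vector for the start
// and end indices of the child elements that make up the list at `rowId`.
def elementRange(vector: ListVector, rowId: Int): (Int, Int) =
  (vector.getElementStartIndex(rowId), vector.getElementEndIndex(rowId))
```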
### Why are the changes needed?
This simplifies the code by avoiding manual calculations on the Arrow
offset buffer and makes use of more stable APIs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #28915 from
BryanCutler/arrow-simplify-ArrowColumnVector-ListArray-SPARK-32080.
Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: df04107934241965199bd5454c62e1016bb3bdd9)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java (diff)
Commit 47fb9d60549da02b869a3f0aad2ccb34d455c963 by wenchen
[SPARK-32087][SQL] Allow UserDefinedType to use encoder to deserialize
rows in ScalaUDF as well
### What changes were proposed in this pull request?
This PR addresses the comment
https://github.com/apache/spark/pull/28645#discussion_r442183888. It
changes `canUpCast`/`canCast` to allow a cast from a sub-UDT to its base
UDT, in order to allow UserDefinedType to use `ExpressionEncoder` to
deserialize rows in ScalaUDF as well.
One thing worth mentioning: even though we allow the cast from a sub-UDT
to its base UDT, `Cast` does not actually perform it, because a sub-UDT
and its base UDT are considered the same type (because of #16660), see:
https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala#L81-L86
https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala#L92-L95
Therefore, the optimizer rule `SimplifyCast` eliminates the cast in the end.
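A hedged sketch of the new cast rule (simplified; Spark's actual check in `Cast` uses internal APIs, so this uses the public `userClass` instead):
```scala
import org.apache.spark.sql.types.{DataType, UserDefinedType}

// Permit a cast between two UDTs when the target's user class is a supertype
// of the source's, i.e. a sub-UDT can flow to its base UDT; the cast is then
// effectively a no-op that SimplifyCast removes.
def canCastUdt(from: DataType, to: DataType): Boolean = (from, to) match {
  case (f: UserDefinedType[_], t: UserDefinedType[_]) =>
    t.userClass.isAssignableFrom(f.userClass)
  case _ => false
}
```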
### Why are the changes needed?
To reduce the special-case handling of `UserDefinedType` in `ResolveEncodersInUDF` and `ScalaUDF`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
It should be covered by the test of `SPARK-19311`, which is also updated
a little in this PR.
Closes #28920 from Ngone51/fix-udf-udt.
Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 47fb9d60549da02b869a3f0aad2ccb34d455c963)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit 71b6d462fbeebf5e7e9a95896f0dca8297d0b8dd by wenchen
[SPARK-32089][R][BUILD] Upgrade R version to 4.0.2 in the release
Dockerfile
### What changes were proposed in this pull request?
This PR proposes to upgrade the R version to 4.0.2 in the release Docker image. As of SPARK-31918, we should make a release with R 4.0.0+, which also works with R 3.5+.
### Why are the changes needed?
To unblock releases on CRAN.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested via scripts under `dev/create-release`, manually
attaching to the container and checking the R version.
Closes #28922 from HyukjinKwon/SPARK-32089.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: 71b6d462fbeebf5e7e9a95896f0dca8297d0b8dd)
The file was modified dev/create-release/spark-rm/Dockerfile (diff)
Commit d06604f60a8a2ba0877616370a20aa18be15f8c4 by dongjoon
[SPARK-32078][DOC] Add a redirect to sql-ref from sql-reference
### What changes were proposed in this pull request?
This PR adds a redirect to sql-ref.html.
### Why are the changes needed?
Before the Spark 3.0 release, we used sql-reference.md, which has since
been replaced by sql-ref.md. A number of Google searches I’ve done today
turned up https://spark.apache.org/docs/latest/sql-reference.html, which
no longer exists. Thus, we should add a redirect to sql-ref.html.
### Does this PR introduce _any_ user-facing change?
https://spark.apache.org/docs/latest/sql-reference.html will be
redirected to https://spark.apache.org/docs/latest/sql-ref.html
### How was this patch tested?
Built it in my local environment; it works well. The generated sql-reference.html file contains:
```
<!DOCTYPE html>
<html lang="en-US">
<meta charset="utf-8">
<title>Redirecting&hellip;</title>
<link rel="canonical" href="http://localhost:4000/sql-ref.html">
<script>location="http://localhost:4000/sql-ref.html"</script>
<meta http-equiv="refresh" content="0; url=http://localhost:4000/sql-ref.html">
<meta name="robots" content="noindex">
<h1>Redirecting&hellip;</h1>
<a href="http://localhost:4000/sql-ref.html">Click here if you are not redirected.</a>
</html>
```
Closes #28914 from gatorsmile/addRedirectSQLRef.
Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: d06604f60a8a2ba0877616370a20aa18be15f8c4)
The file was modified docs/_config.yml (diff)
The file was modified docs/sql-ref.md (diff)