Failed

Changes

Summary

  1. [SPARK-19939][ML] Add support for association rules in ML (commit: d1255297b85d9b39376bb479821cfb603bc7b47b) (details)
  2. [SPARK-20249][ML][PYSPARK] Add training summary for LinearSVCModel (commit: 879513370767f647765ff5b96adb08f5b8c46489) (details)
  3. [SPARK-32088][PYTHON] Pin the timezone in timestamp_seconds doctest (commit: ac3a0551d82c8e808d01aecbd1f6918cfe331ec4) (details)
  4. [SPARK-32099][DOCS] Remove broken link in cloud integration (commit: 44aecaa9124fb2158f009771022c64ede4b582dc) (details)
  5. [SPARK-31845][CORE][TESTS] Refactor DAGSchedulerSuite by introducing (commit: 7445c7534ba11bcbdf2e05259cd4f5cde13fe5fb) (details)
  6. [SPARK-32058][BUILD] Use Apache Hadoop 3.2.0 dependency by default (commit: 9c134b57bff5b7e7f9c85aeed2e9539117a5b57d) (details)
  7. [SPARK-32071][SQL][TESTS] Add `make_interval` benchmark (commit: 8c44d744631516a5cdaf63406e69a9dd11e5b878) (details)
Commit d1255297b85d9b39376bb479821cfb603bc7b47b by srowen
[SPARK-19939][ML] Add support for association rules in ML
### What changes were proposed in this pull request?
Add the support measure to association rules in Spark ml.fpm.
### Why are the changes needed?
Support indicates how frequently the itemset of an association rule
appears in the database and suggests whether the rule is generally
applicable to the dataset. Refer to the
[wiki](https://en.wikipedia.org/wiki/Association_rule_learning#Support)
for more details.
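For reference (a standard definition from the linked article, not part
of the original commit message), the support of an itemset $X$ over a
transaction set $T$ is
$$\operatorname{supp}(X) = \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|},$$
and the support of a rule $X \Rightarrow Y$ is commonly taken as
$\operatorname{supp}(X \cup Y)$.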
### Does this PR introduce _any_ user-facing change?
Yes. Association rules now carry a support measure (a usage sketch follows below).
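As an illustration of that change, a minimal sketch (assuming an active
SparkSession `spark`; the data and parameter values here are made up,
not taken from this patch):
```scala
import org.apache.spark.ml.fpm.FPGrowth

import spark.implicits._

// Toy transactions, one space-separated basket per row.
val dataset = Seq("1 2 5", "1 2 3 5", "1 2")
  .toDF("raw")
  .selectExpr("split(raw, ' ') AS items")

val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(dataset)

// With this change, the returned DataFrame is expected to include a
// "support" column alongside antecedent, consequent and confidence.
model.associationRules.show()
```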
### How was this patch tested?
Existing and new unit tests.
Closes #28903 from huaxingao/fpm.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
(commit: d1255297b85d9b39376bb479821cfb603bc7b47b)
The file was modified R/pkg/R/mllib_fpm.R (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala (diff)
The file was modified R/pkg/tests/fulltests/test_mllib_fpm.R (diff)
The file was modified python/pyspark/ml/fpm.py (diff)
The file was modified python/pyspark/ml/tests/test_algorithms.py (diff)
Commit 879513370767f647765ff5b96adb08f5b8c46489 by srowen
[SPARK-20249][ML][PYSPARK] Add training summary for LinearSVCModel
### What changes were proposed in this pull request?
Add a training summary for LinearSVCModel.
### Why are the changes needed?
So that users can inspect the training process, such as the loss value
at each iteration and the total number of iterations.
### Does this PR introduce _any_ user-facing change?
Yes: `LinearSVCModel.summary` and `LinearSVCModel.evaluate` (see the sketch below).
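A minimal sketch of how the new API might be used (assuming an active
SparkSession `spark` and the sample LIBSVM file shipped in the Spark
source tree; the summary field names follow Spark ML's usual
training-summary conventions and are not copied from this patch):
```scala
import org.apache.spark.ml.classification.LinearSVC

// Hypothetical sample data path from the Spark source tree.
val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val model = new LinearSVC().setMaxIter(10).setRegParam(0.1).fit(training)

// Training summary: loss at each iteration and the total iteration count.
val summary = model.summary
println(summary.objectiveHistory.mkString(", "))
println(summary.totalIterations)

// Evaluate the fitted model on a dataset (here, the training data again).
val evalSummary = model.evaluate(training)
```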
### How was this patch tested?
New tests.
Closes #28884 from huaxingao/svc_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
(commit: 879513370767f647765ff5b96adb08f5b8c46489)
The file was modified python/pyspark/ml/tests/test_training_summary.py (diff)
The file was modified python/pyspark/ml/classification.py (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala (diff)
Commit ac3a0551d82c8e808d01aecbd1f6918cfe331ec4 by dongjoon
[SPARK-32088][PYTHON] Pin the timezone in timestamp_seconds doctest
### What changes were proposed in this pull request?
Pin an American timezone in the `timestamp_seconds` doctest.
### Why are the changes needed?
The `timestamp_seconds` doctest in `functions.py` relied on the default
timezone to produce its expected result. For example:
```python
>>> time_df = spark.createDataFrame([(1230219000,)], ['unix_time'])
>>> time_df.select(timestamp_seconds(time_df.unix_time).alias('ts')).collect()
[Row(ts=datetime.datetime(2008, 12, 25, 7, 30))]
```
But with a non-American timezone, the test case produces a different
result. For example, with the current timezone set to `Asia/Shanghai`,
the result becomes
```
[Row(ts=datetime.datetime(2008, 12, 25, 23, 30))]
```
Pinning the timezone to one specific area makes the test case produce
the same expected result no matter where it runs (see the sketch below).
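A minimal sketch of the pinning idea, written here against the Scala API
(assuming an active SparkSession `spark`; the actual patch edits the
Python doctest, and `America/Los_Angeles` is just one illustrative zone):
```scala
import org.apache.spark.sql.functions.timestamp_seconds
import spark.implicits._

// Pin the session timezone so the rendered timestamp no longer depends
// on the machine's default timezone.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

val timeDF = Seq(1230219000L).toDF("unix_time")
timeDF.select(timestamp_seconds($"unix_time").as("ts")).show()
// Expected: 2008-12-25 07:30:00, regardless of where this runs.

// Restore the default behaviour afterwards.
spark.conf.unset("spark.sql.session.timeZone")
```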
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes #28932 from GuoPhilipse/SPARK-32088-fix-timezone-issue.
Lead-authored-by: GuoPhilipse
<46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by:
GuoPhilipse <guofei_ok@126.com> Signed-off-by: Dongjoon Hyun
<dongjoon@apache.org>
(commit: ac3a0551d82c8e808d01aecbd1f6918cfe331ec4)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
Commit 44aecaa9124fb2158f009771022c64ede4b582dc by dongjoon
[SPARK-32099][DOCS] Remove broken link in cloud integration
documentation
### What changes were proposed in this pull request?
The third link in the `IBM Cloud Object Storage connector for Apache Spark`
section is broken. The PR removes this link.
### Why are the changes needed?
The link is broken.
### Does this PR introduce _any_ user-facing change?
Yes, the broken link is removed from the doc.
### How was this patch tested?
Doc generation passes successfully, as before.
Closes #28927 from guykhazma/spark32099.
Authored-by: Guy Khazma <guykhag@gmail.com> Signed-off-by: Dongjoon Hyun
<dongjoon@apache.org>
(commit: 44aecaa9124fb2158f009771022c64ede4b582dc)
The file was modified docs/cloud-integration.md (diff)
Commit 7445c7534ba11bcbdf2e05259cd4f5cde13fe5fb by dongjoon
[SPARK-31845][CORE][TESTS] Refactor DAGSchedulerSuite by introducing
completeAndCheckAnswer and using completeNextStageWithFetchFailure
### What changes were proposed in this pull request?
**First**
`DAGSchedulerSuite` provides `completeNextStageWithFetchFailure` to make
all tasks in a non-first stage fail with `FetchFailed`, but many test
cases call `complete` directly, as follows:
```scala
complete(taskSets(1), Seq(
    (FetchFailed(makeBlockManagerId("hostA"),
       shuffleDep1.shuffleId, 0L, 0, 0, "ignored"), null)))
```
We should reuse `completeNextStageWithFetchFailure` instead.
**Second**
`DAGSchedulerSuite` also checks results as shown below:
```scala
complete(taskSets(0), Seq((Success, 42)))
assert(results === Map(0 -> 42))
```
We can extract this pattern into a generic helper (`completeAndCheckAnswer`).
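A rough sketch of what such a helper might look like inside
`DAGSchedulerSuite` (hypothetical signature; it relies on the suite's
existing `complete` method, `results` map and `taskSets`, as quoted above):
```scala
// Hypothetical helper: completes the given task set and verifies that
// the collected results match the expectation.
private def completeAndCheckAnswer(
    taskSet: TaskSet,
    taskEndInfos: Seq[(TaskEndReason, Any)],
    expected: Map[Int, Any]): Unit = {
  complete(taskSet, taskEndInfos)
  assert(results === expected)
}

// Usage, mirroring the quoted test code:
// completeAndCheckAnswer(taskSets(0), Seq((Success, 42)), Map(0 -> 42))
```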
### Why are the changes needed?
To reuse `completeNextStageWithFetchFailure`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins tests.
Closes #28866 from beliefer/reuse-completeNextStageWithFetchFailure.
Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Dongjoon Hyun
<dongjoon@apache.org>
(commit: 7445c7534ba11bcbdf2e05259cd4f5cde13fe5fb)
The file was modified core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala (diff)
Commit 9c134b57bff5b7e7f9c85aeed2e9539117a5b57d by dongjoon
[SPARK-32058][BUILD] Use Apache Hadoop 3.2.0 dependency by default
### What changes were proposed in this pull request?
According to the dev mailing list discussion, this PR aims to switch the
default Apache Hadoop dependency from 2.7.4 to 3.2.0 for Apache Spark
3.1.0 in December 2020.
| Item | Default Hadoop Dependency |
|------|-----------------------------|
| Apache Spark Website | 3.2.0 |
| Apache Download Site | 3.2.0 |
| Apache Snapshot | 3.2.0 |
| Maven Central | 3.2.0 |
| PyPI | 2.7.4 (We will switch later) |
| CRAN | 2.7.4 (We will switch later) |
| Homebrew | 3.2.0 (already) |
In the Apache Spark 3.0.0 release, we focused on other features. This PR
targets [Apache Spark 3.1.0, scheduled for December
2020](https://spark.apache.org/versioning-policy.html).
### Why are the changes needed?
Apache Hadoop 3.2 has many fixes and new cloud-friendly features.
**Reference**
- 2017-08-04: https://hadoop.apache.org/release/2.7.4.html
- 2019-01-16: https://hadoop.apache.org/release/3.2.0.html
### Does this PR introduce _any_ user-facing change?
Since the default Hadoop dependency changes, users will get better
support in cloud environments.
### How was this patch tested?
Pass the Jenkins.
Closes #28897 from dongjoon-hyun/SPARK-32058.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 9c134b57bff5b7e7f9c85aeed2e9539117a5b57d)
The file was modified resource-managers/kubernetes/integration-tests/pom.xml (diff)
The file was modified dev/create-release/release-build.sh (diff)
The file was modified pom.xml (diff)
The file was modified dev/run-tests.py (diff)
Commit 8c44d744631516a5cdaf63406e69a9dd11e5b878 by dongjoon
[SPARK-32071][SQL][TESTS] Add `make_interval` benchmark
### What changes were proposed in this pull request?
Add benchmarks for the interval constructor `make_interval` and measure
the performance of 4 cases (a usage sketch follows this list):
1. Constant (year, month)
2. Constant (week, day)
3. Constant (hour, minute, second, second fraction)
4. All fields are NOT constant.
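For orientation, a minimal sketch of what such calls look like (assuming
an active SparkSession `spark`; these are illustrative queries, not the
exact benchmark code):
```scala
// Constant arguments: year/month only, then hour/minute/second with a fraction.
spark.sql("SELECT make_interval(1, 2, 0, 0, 0, 0, 0)").show()
spark.sql("SELECT make_interval(0, 0, 0, 0, 3, 4, 5.678)").show()

// A non-constant argument: the day field comes from a column.
spark.range(3).createOrReplaceTempView("t")
spark.sql("SELECT make_interval(0, 0, 0, CAST(id AS INT), 0, 0, 0) FROM t").show()
```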
The benchmark results were generated in the following environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 |
### Why are the changes needed?
To have a baseline for future performance improvements of
`make_interval`, and to prevent performance regressions in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By running `IntervalBenchmark` via:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark"
```
Closes #28905 from MaxGekk/benchmark-make_interval.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun
<dongjoon@apache.org>
(commit: 8c44d744631516a5cdaf63406e69a9dd11e5b878)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/IntervalBenchmark.scala (diff)
The file was modified sql/core/benchmarks/IntervalBenchmark-results.txt (diff)
The file was modified sql/core/benchmarks/IntervalBenchmark-jdk11-results.txt (diff)