Success - Changes

Summary

  1. [SPARK-31638][WEBUI] Clean Pagination code for all webUI pages (commit: 765105b6f1b65493755386293f504f65d9fa8d25) (details)
  2. [SPARK-31803][ML] Make sure instance weight is not negative (commit: 50492c0bd353b5bd6c06099a53d55377a91a6705) (details)
  3. [SPARK-31719][SQL] Refactor JoinSelection (commit: f6f1e51072d6d7cb67486257f4c86447d959718f) (details)
  4. [SPARK-31835][SQL][TESTS] Add zoneId to codegen related tests in DateExpressionsSuite (commit: 311fe6a880f371c20ca5156ca6eb7dec5a15eff6) (details)
  5. [SPARK-31762][SQL][FOLLOWUP] Avoid double formatting in legacy fractional formatter (commit: b5eb0933acf4de6a07906617f5bf426a7047a366) (details)
  6. [SPARK-31827][SQL] fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form (commit: 1528fbced83a7fbcf70e09d6a898728370d8fa62) (details)
  7. [SPARK-31764][CORE] JsonProtocol doesn't write RDDInfo#isBarrier (commit: d19b173b47af04fe6f03e2b21b60eb317aeaae4f) (details)
  8. [SPARK-31730][CORE][TEST] Fix flaky tests in BarrierTaskContextSuite (commit: efe7fd2b6bea4a945ed7f3f486ab279c505378b4) (details)
  9. [SPARK-25351][SQL][PYTHON] Handle Pandas category type when converting from Python with Arrow (commit: 339b0ecadb9c66ec8a62fd1f8e5a7a266b465aef) (details)
  10. [SPARK-31763][PYSPARK] Add `inputFiles` method in PySpark DataFrame Class (commit: 2f92ea0df4ef1f127d25a009272005e5ad8811fa) (details)
  11. [SPARK-31839][TESTS] Delete duplicate code in CastSuite (commit: dfbc5edf20040e8163ee3beef61f2743a948c508) (details)
  12. [SPARK-25351][PYTHON][TEST][FOLLOWUP] Fix test assertions to be consistent (commit: 8bbb666622e042c1533da294ac7b504b6aaa694a) (details)
Commit 765105b6f1b65493755386293f504f65d9fa8d25 by srowen
[SPARK-31638][WEBUI] Clean Pagination code for all webUI pages
### What changes were proposed in this pull request?
Pagination code across the web UI pages needed cleanup. This PR clears out:
* Unused methods
* Unused method arguments
* Redundant `if` expressions (see the sketch below)
* Indentation issues
### Why are the changes needed?
This fix makes the code more readable and removes unnecessary methods and variables.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually
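A made-up sketch of one of the removed patterns, the redundant `if` expression (names here are illustrative, not actual Spark UI code):

```scala
// before: the `if` adds nothing, the condition is already a Boolean
def hasNextPage(totalItems: Int, page: Int, pageSize: Int): Boolean = {
  if (page * pageSize < totalItems) true else false
}

// after: return the condition directly
def hasNextPageClean(totalItems: Int, page: Int, pageSize: Int): Boolean =
  page * pageSize < totalItems
```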
Closes #28448 from iRakson/refactorPagination.
Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
(commit: 765105b6f1b65493755386293f504f65d9fa8d25)
The file was modified core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/ui/StagePageSuite.scala (diff)
Commit 50492c0bd353b5bd6c06099a53d55377a91a6705 by srowen
[SPARK-31803][ML] Make sure instance weight is not negative
### What changes were proposed in this pull request?
In the algorithms that support instance weight, add checks to make sure the instance weight is not negative.
### Why are the changes needed?
Instance weight has to be >= 0.0.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually tested
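A minimal sketch of the kind of validation this adds; the helper name is made up, and the real checks live inside each algorithm's training path:

```scala
// Reject negative instance weights up front rather than letting them
// silently skew the model.
def validateWeight(weight: Double): Double = {
  require(weight >= 0.0, s"instance weight $weight has to be >= 0.0")
  weight
}
```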
Closes #28621 from huaxingao/weight_check.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
(commit: 50492c0bd353b5bd6c06099a53d55377a91a6705)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/Predictor.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/functions.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala (diff)
Commit f6f1e51072d6d7cb67486257f4c86447d959718f by wenchen
[SPARK-31719][SQL] Refactor JoinSelection
### What changes were proposed in this pull request?
This PR extracts the logic for selecting the planned join type out of the `JoinSelection` rule and moves it to `JoinSelectionHelper` in Catalyst.
### Why are the changes needed?
This change both cleans up the code in `JoinSelection` and puts the logic in one place, so it can be used by other rules that need to make decisions based on the join type before planning time.
### Does this PR introduce _any_ user-facing change?
`BuildSide`, `BuildLeft`, and `BuildRight` are moved from `org.apache.spark.sql.execution` to Catalyst, in `org.apache.spark.sql.catalyst.optimizer`.
### How was this patch tested?
This is a refactoring; it passes existing tests.
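For code that pattern matches on the build side, the visible effect is the new import location; a minimal sketch:

```scala
// The build-side ADT now lives in Catalyst rather than in
// org.apache.spark.sql.execution.
import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, BuildSide}

def describe(side: BuildSide): String = side match {
  case BuildLeft  => "build the hash table on the left child"
  case BuildRight => "build the hash table on the right child"
}
```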
Closes #28540 from dbaliafroozeh/RefactorJoinSelection.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: f6f1e51072d6d7cb67486257f4c86447d959718f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeLocalShuffleReader.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/LogicalQueryStageStrategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/ExistenceJoinSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/InnerJoinSuite.scala (diff)
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinSelectionHelperSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PlanDynamicPruningFilters.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala (diff)
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/joins/package.scala
Commit 311fe6a880f371c20ca5156ca6eb7dec5a15eff6 by wenchen
[SPARK-31835][SQL][TESTS] Add zoneId to codegen related tests in
DateExpressionsSuite
### What changes were proposed in this pull request?
This PR modifies some codegen-related tests that exercise escape characters for datetime functions which are time zone aware. If the time zone is absent, the formatter can produce `null` (caused by `java.util.NoSuchElementException: None.get`), bypassing the real intention of those test cases.
### Why are the changes needed?
Fix tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Passing the modified test cases.
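A sketch of the pattern, taking the `Hour` expression as a representative time zone aware function (evaluating it without a zone fails with `None.get`):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.catalyst.expressions.{Hour, Literal}

// Time zone aware expressions carry an Option[String] zoneId; when it is
// None, evaluation fails with java.util.NoSuchElementException: None.get.
// The modified tests avoid this by always passing the zoneId explicitly.
val withoutZone = Hour(Literal(Timestamp.valueOf("2020-01-01 12:00:00")))
val withZone = Hour(Literal(Timestamp.valueOf("2020-01-01 12:00:00")),
  timeZoneId = Some("America/Los_Angeles"))
```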
Closes #28653 from yaooqinn/SPARK-31835.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 311fe6a880f371c20ca5156ca6eb7dec5a15eff6)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala (diff)
Commit b5eb0933acf4de6a07906617f5bf426a7047a366 by wenchen
[SPARK-31762][SQL][FOLLOWUP] Avoid double formatting in legacy
fractional formatter
### What changes were proposed in this pull request?
Currently, the legacy fractional formatter is based on the implementation from Spark 2.4, which formats the input timestamp twice to strip trailing zeros:
```
   val timestampString = ts.toString
   val formatted = legacyFormatter.format(ts)
```
This PR proposes to avoid the first formatting by forming the second fraction directly.
### Why are the changes needed?
It makes the legacy fractional formatter faster.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By the existing test "format fraction of second" in `TimestampFormatterSuite`, plus an added test for timestamps before 1970-01-01 00:00:00Z.
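A minimal sketch of the idea, with illustrative names: derive the fractional part directly from the microsecond component instead of formatting the whole timestamp a second time.

```scala
// 6-digit zero-padded microseconds, with trailing zeros stripped,
// e.g. 123000 -> "123000" -> ".123"; zero micros yields no fraction.
def formatFraction(microsOfSecond: Long): String = {
  if (microsOfSecond == 0) ""
  else "." + "%06d".format(microsOfSecond).replaceAll("0+$", "")
}
```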
Closes #28643 from MaxGekk/optimize-legacy-fract-format.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: b5eb0933acf4de6a07906617f5bf426a7047a366)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/util/TimestampFormatterSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala (diff)
Commit 1528fbced83a7fbcf70e09d6a898728370d8fa62 by wenchen
[SPARK-31827][SQL] fail datetime parsing/formatting if detect the Java 8
bug of stand-alone form
### What changes were proposed in this pull request?
If `LLL`/`qqq` is used in the datetime pattern string, and the current
JDK in use has a bug for the stand-alone form (see
https://bugs.openjdk.java.net/browse/JDK-8114833), throw an exception
with a clear error message.
### Why are the changes needed?
To keep backward compatibility with Spark 2.4.
### Does this PR introduce _any_ user-facing change?
Yes.
Spark 2.4:
```
scala> sql("select date_format('1990-1-1', 'LLL')").show
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
|                                          Jan|
+---------------------------------------------+
```
Spark 3.0 with Java 11:
```
scala> sql("select date_format('1990-1-1', 'LLL')").show
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
|                                          Jan|
+---------------------------------------------+
```
Spark 3.0 with Java 8:
```
// before this PR
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
|                                            1|
+---------------------------------------------+
// after this PR
scala> sql("select date_format('1990-1-1', 'LLL')").show
org.apache.spark.SparkUpgradeException
```
### How was this patch tested?
Manual test with Java 8 and 11.
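The underlying JDK behavior is reproducible without Spark; a small sketch:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// On Java 8 builds affected by JDK-8114833, the stand-alone month text
// is missing, so 'LLL' falls back to the numeric form.
val formatted = LocalDate.of(1990, 1, 1)
  .format(DateTimeFormatter.ofPattern("LLL", Locale.US))
// affected Java 8: "1"
// Java 11:         "Jan"
```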
Closes #28646 from cloud-fan/format.
Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: 1528fbced83a7fbcf70e09d6a898728370d8fa62)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala (diff)
The file was modified docs/sql-ref-datetime-pattern.md (diff)
Commit d19b173b47af04fe6f03e2b21b60eb317aeaae4f by xingbo.jiang
[SPARK-31764][CORE] JsonProtocol doesn't write RDDInfo#isBarrier
### What changes were proposed in this pull request?
This PR changes JsonProtocol to write RDDInfo#isBarrier.
### Why are the changes needed?
JsonProtocol reads RDDInfo#isBarrier but does not write it, so it's a bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a test case.
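A sketch of the shape of the fix, assuming json4s (which JsonProtocol builds on); the field selection and JSON key names here are assumptions, not the exact JsonProtocol output:

```scala
import org.json4s.JValue
import org.json4s.JsonDSL._
import org.apache.spark.storage.RDDInfo

// Write the barrier flag out, so that what JsonProtocol reads back is
// also serialized in the first place.
def rddInfoToJson(rddInfo: RDDInfo): JValue = {
  ("RDD ID" -> rddInfo.id) ~
  ("Name" -> rddInfo.name) ~
  ("Barrier" -> rddInfo.isBarrier)
}
```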
Closes #28583 from sarutak/SPARK-31764.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by:
Xingbo Jiang <xingbo.jiang@databricks.com>
(commit: d19b173b47af04fe6f03e2b21b60eb317aeaae4f)
The file was modified core/src/main/scala/org/apache/spark/util/JsonProtocol.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala (diff)
Commit efe7fd2b6bea4a945ed7f3f486ab279c505378b4 by xingbo.jiang
[SPARK-31730][CORE][TEST] Fix flaky tests in BarrierTaskContextSuite
### What changes were proposed in this pull request?
Wait until all the executors have started before submitting any job. This avoids the flakiness caused by executors still coming up.
### How was this patch tested?
Existing tests.
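A sketch of the approach using the public status tracker; the real suite uses a test helper, but the idea is to poll until the expected number of executors has registered:

```scala
import org.apache.spark.SparkContext

def waitUntilExecutorsUp(sc: SparkContext, numExecutors: Int, timeoutMs: Long): Unit = {
  val deadline = System.currentTimeMillis() + timeoutMs
  // getExecutorInfos also counts the driver, hence the + 1
  while (sc.statusTracker.getExecutorInfos.length < numExecutors + 1) {
    if (System.currentTimeMillis() > deadline) {
      throw new IllegalStateException(s"Executors did not register within ${timeoutMs}ms")
    }
    Thread.sleep(100)
  }
}
```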
Closes #28584 from jiangxb1987/barrierTest.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by:
Xingbo Jiang <xingbo.jiang@databricks.com>
(commit: efe7fd2b6bea4a945ed7f3f486ab279c505378b4)
The file was modified core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala (diff)
Commit 339b0ecadb9c66ec8a62fd1f8e5a7a266b465aef by cutlerb
[SPARK-25351][SQL][PYTHON] Handle Pandas category type when converting
from Python with Arrow
Handle the Pandas category type when converting from Python with Arrow enabled. The category column will be converted to whatever type the category elements are, as is the case with Arrow disabled.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New unit tests were added for `createDataFrame` and scalar `pandas_udf`.
Closes #26585 from jalpan-randeri/feature-pyarrow-dictionary-type.
Authored-by: Jalpan Randeri <randerij@amazon.com> Signed-off-by: Bryan
Cutler <cutlerb@gmail.com>
(commit: 339b0ecadb9c66ec8a62fd1f8e5a7a266b465aef)
The file was modified python/pyspark/sql/pandas/types.py (diff)
The file was modified python/pyspark/sql/tests/test_arrow.py (diff)
The file was modified python/pyspark/sql/pandas/serializers.py (diff)
The file was modified python/pyspark/sql/tests/test_pandas_udf_scalar.py (diff)
Commit 2f92ea0df4ef1f127d25a009272005e5ad8811fa by gurwls223
[SPARK-31763][PYSPARK] Add `inputFiles` method in PySpark DataFrame
Class
### What changes were proposed in this pull request?
Adds an `inputFiles()` method to the PySpark `DataFrame` class. Using this, PySpark users can list all files constituting a `DataFrame`.
**Before changes:**
```
>>> spark.read.load("examples/src/main/resources/people.json", format="json").inputFiles()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/***/***/spark/python/pyspark/sql/dataframe.py", line 1388, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'inputFiles'
```
**After changes:**
```
>>> spark.read.load("examples/src/main/resources/people.json", format="json").inputFiles()
[u'file:///***/***/spark/examples/src/main/resources/people.json']
```
### Why are the changes needed?
This method is already supported in Spark with Scala and Java.
### Does this PR introduce _any_ user-facing change?
Yes. Users can now list all files of a `DataFrame` using `inputFiles()`.
### How was this patch tested?
UT added.
Closes #28652 from iRakson/SPARK-31763.
Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 2f92ea0df4ef1f127d25a009272005e5ad8811fa)
The file was modified python/pyspark/sql/tests/test_dataframe.py (diff)
The file was modified python/pyspark/sql/dataframe.py (diff)
Commit dfbc5edf20040e8163ee3beef61f2743a948c508 by gurwls223
[SPARK-31839][TESTS] Delete duplicate code in CastSuite
### What changes were proposed in this pull request?
Delete duplicate code in CastSuite.
### Why are the changes needed?
Keep Spark code clean.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No tests needed.
Closes #28655 from GuoPhilipse/delete-duplicate-code-castsuit.
Lead-authored-by: GuoPhilipse
<46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by:
GuoPhilipse <guofei_ok@126.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: dfbc5edf20040e8163ee3beef61f2743a948c508)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala (diff)
Commit 8bbb666622e042c1533da294ac7b504b6aaa694a by gurwls223
[SPARK-25351][PYTHON][TEST][FOLLOWUP] Fix test assertions to be
consistent
### What changes were proposed in this pull request?
Follow-up to make the assertions from the recent test consistent with the rest of the module.
### Why are the changes needed?
Better to use assertions from `unittest` and be consistent.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #28659 from BryanCutler/arrow-category-test-fix-SPARK-25351.
Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 8bbb666622e042c1533da294ac7b504b6aaa694a)
The file was modified python/pyspark/sql/tests/test_pandas_udf_scalar.py (diff)
The file was modified python/pyspark/sql/tests/test_arrow.py (diff)