SuccessChanges

Summary

  1. [SPARK-23569][PYTHON] Allow pandas_udf to work with python3 style type-annotated functions (commit: c8aa6fbb049795195e414597839fe61ae3f56d92) (details)
  2. [MINOR][DOCS] Fix a link in "Compatibility with Apache Hive" (commit: 88dd335f6f36ce68862d33720959aaf62742f86d) (details)
  3. [SPARK-23329][SQL] Fix documentation of trigonometric functions (commit: 232b9f81f02ec00fc698f610ecc1ca25740e8802) (details)
  4. [SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification (commit: 4550673b1a94e9023a0c6fdc6a92e4b860e1cfb2) (details)
  5. [SPARK-23457][SQL][BRANCH-2.3] Register task completion listeners first (commit: 911b83da42fa850eb3ae419687c204cb2e25767b) (details)
  6. [SPARK-23434][SQL][BRANCH-2.3] Spark should not warn `metadata directory` for a HDFS file path (commit: b9ea2e87bb24c3731bd2dbd044d10d18dbbf9c6f) (details)
Commit c8aa6fbb049795195e414597839fe61ae3f56d92 by hyukjinkwon
[SPARK-23569][PYTHON] Allow pandas_udf to work with python3 style type-annotated functions
## What changes were proposed in this pull request?
Check the Python version to determine whether to use `inspect.getargspec` or
`inspect.getfullargspec` before applying the `pandas_udf` core logic to a
function. The former is the Python 2.7 API (deprecated in Python 3); the
latter is the Python 3.x API and correctly accounts for type annotations,
which are syntax errors in Python 2.x.
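The version check described above can be sketched as follows (a minimal illustration, not the actual `pyspark.sql.udf` code; `argspec_of` is a hypothetical helper name):

```python
import inspect
import sys

def argspec_of(func):
    # getargspec() is the Python 2.7 API (deprecated in Python 3) and fails
    # on functions with type annotations; getfullargspec() is the Python 3
    # API and handles annotations correctly.
    if sys.version_info[0] < 3:
        return inspect.getargspec(func)
    return inspect.getfullargspec(func)

def annotated(x: int, y: int) -> int:
    # Annotations like these are syntax errors under Python 2.x
    return x + y

spec = argspec_of(annotated)
print(spec.args)  # ['x', 'y']
```

On Python 3, `spec.annotations` additionally exposes the annotation mapping, which `getargspec` could never represent.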
## How was this patch tested?
Locally, on python 2.7 and 3.6.
Author: Michael (Stu) Stewart <mstewart141@gmail.com>
Closes #20728 from mstewart141/pandas_udf_fix.
(cherry picked from commit 7965c91d8a67c213ca5eebda5e46e7c49a8ba121)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: c8aa6fbb049795195e414597839fe61ae3f56d92)
The file was modified python/pyspark/sql/udf.py (diff)
The file was modified python/pyspark/sql/tests.py (diff)
Commit 88dd335f6f36ce68862d33720959aaf62742f86d by gatorsmile
[MINOR][DOCS] Fix a link in "Compatibility with Apache Hive"
## What changes were proposed in this pull request?
This PR fixes a broken link as below:
**Before:**
<img width="678" alt="2018-03-05 12 23 58" src="https://user-images.githubusercontent.com/6477701/36957930-6d00ebda-207b-11e8-9ae4-718561b0428c.png">
**After:**
<img width="680" alt="2018-03-05 12 23 20" src="https://user-images.githubusercontent.com/6477701/36957934-6f834ac4-207b-11e8-97b4-18832b2b80cd.png">
Also see
https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#compatibility-with-apache-hive
## How was this patch tested?
Manually tested. I checked the same instances in the `docs` directory; this
seems to be the only one.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20733 from HyukjinKwon/minor-link.
(cherry picked from commit 269cd53590dd155aeb5269efc909a6e228f21e22)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit: 88dd335f6f36ce68862d33720959aaf62742f86d)
The file was modified docs/sql-programming-guide.md (diff)
Commit 232b9f81f02ec00fc698f610ecc1ca25740e8802 by hyukjinkwon
[SPARK-23329][SQL] Fix documentation of trigonometric functions
## What changes were proposed in this pull request?
Provide more details in the trigonometric function documentation.
Referenced `java.lang.Math` for further details in the descriptions.
## How was this patch tested?
Ran full build, checked generated documentation manually
Author: Mihaly Toth <misutoth@gmail.com>
Closes #20618 from misutoth/trigonometric-doc.
(cherry picked from commit a366b950b90650693ad0eb1e5b9a988ad028d845)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit: 232b9f81f02ec00fc698f610ecc1ca25740e8802)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala (diff)
The file was modified R/pkg/R/functions.R (diff)
The file was modified python/pyspark/sql/functions.py (diff)
Commit 4550673b1a94e9023a0c6fdc6a92e4b860e1cfb2 by joseph
[SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification
## What changes were proposed in this pull request?
Adds Structured Streaming tests for all Models/Transformers in
spark.ml.classification.
## How was this patch tested?
N/A
Author: WeichenXu <weichen.xu@databricks.com>
Closes #20121 from WeichenXu123/ml_stream_test_classification.
(cherry picked from commit 98a5c0a35f0a24730f5074522939acf57ef95422)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
(commit: 4550673b1a94e9023a0c6fdc6a92e4b860e1cfb2)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/RandomForestClassifierSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala (diff)
Commit 911b83da42fa850eb3ae419687c204cb2e25767b by wenchen
[SPARK-23457][SQL][BRANCH-2.3] Register task completion listeners first in ParquetFileFormat
## What changes were proposed in this pull request?
ParquetFileFormat leaks opened files in some cases. This PR prevents that
by registering task completion listeners before initialization.
- [spark-branch-2.3-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/205/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/)
- [spark-master-test-sbt-hadoop-2.6](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4228/testReport/junit/org.apache.spark.sql.execution.datasources.parquet/ParquetQuerySuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/)
```
Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
  at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
  at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
  at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
  at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
  at
```
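The ordering fix can be illustrated with a small sketch (Python rather than the actual Scala; `FakeTaskContext` and `FakeReader` are made-up stand-ins for Spark's `TaskContext` and the Parquet reader): registering the close callback *before* initialization guarantees the file handle is released even when initialization throws, as in the stack trace above.

```python
class FakeTaskContext:
    """Hypothetical stand-in for Spark's TaskContext."""
    def __init__(self):
        self._listeners = []

    def add_task_completion_listener(self, fn):
        self._listeners.append(fn)

    def complete(self):
        # Spark runs completion listeners when the task finishes or fails.
        for fn in reversed(self._listeners):
            fn()

class FakeReader:
    """Stand-in for a Parquet record reader holding an open file."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def initialize(self):
        # Simulates the OOM raised during vectorized reader initialization.
        raise MemoryError("simulated OOM during initialize()")

ctx = FakeTaskContext()
reader = FakeReader()                           # file handle "opened" here
ctx.add_task_completion_listener(reader.close)  # register close FIRST
try:
    reader.initialize()                         # may raise
except MemoryError:
    pass
ctx.complete()                                  # listener still closes reader
print(reader.closed)  # True -> no leaked file handle
```

Had the listener been registered after `initialize()`, the exception would skip the registration and leak the open file, which is exactly the bug this patch fixes.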
## How was this patch tested?
Manual. The following test case generates the same leakage.
```scala
test("SPARK-23457 Register task completion listeners first in ParquetFileFormat") {
  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> s"${Int.MaxValue}") {
    withTempDir { dir =>
      val basePath = dir.getCanonicalPath
      Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, "first").toString)
      Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, "second").toString)
      val df = spark.read.parquet(
        new Path(basePath, "first").toString,
        new Path(basePath, "second").toString)
      val e = intercept[SparkException] {
        df.collect()
      }
      assert(e.getCause.isInstanceOf[OutOfMemoryError])
    }
  }
}
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20714 from dongjoon-hyun/SPARK-23457-2.3.
(commit: 911b83da42fa850eb3ae419687c204cb2e25767b)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala (diff)
Commit b9ea2e87bb24c3731bd2dbd044d10d18dbbf9c6f by wenchen
[SPARK-23434][SQL][BRANCH-2.3] Spark should not warn `metadata directory` for a HDFS file path
## What changes were proposed in this pull request?
In a kerberized cluster, when Spark reads a file path (e.g. `people.json`),
it logs a misleading warning while looking up `people.json/_spark_metadata`.
The root cause is a behavioral difference between `LocalFileSystem` and
`DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, whereas
`DistributedFileSystem.exists` raises
`org.apache.hadoop.security.AccessControlException`.
```scala
scala> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory.
```
After this PR,
```scala
scala> spark.read.json("hdfs:///tmp/people.json").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20713 from dongjoon-hyun/SPARK-23434-2.3.
(commit: b9ea2e87bb24c3731bd2dbd044d10d18dbbf9c6f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala (diff)