SuccessChanges

Summary

  1. [SPARK-30457][ML] Use PeriodicRDDCheckpointer instead of NodeIdCache (details)
  2. [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not (details)
  3. [SPARK-28152][SQL][FOLLOWUP] Add a legacy conf for old (details)
  4. [SPARK-21869][SS][DOCS][FOLLOWUP] Document Kafka producer pool (details)
Commit 308ae287a989f38daf22c72fbb7543a55744f43e by ruifengz
[SPARK-30457][ML] Use PeriodicRDDCheckpointer instead of NodeIdCache
### What changes were proposed in this pull request?
1. Delete `NodeIdCache` and use `PeriodicRDDCheckpointer` instead.
2. Reuse the broadcasted `Splits` throughout training.
### Why are the changes needed?
1. The functionality of `NodeIdCache` and `PeriodicRDDCheckpointer` is highly similar, and the update process of nodeIds is simple. One goal of "Generalize PeriodicGraphCheckpointer for RDDs" in SPARK-5561 is to use the checkpointer in RandomForest (see the sketch below).
2. `Splits` only needs to be broadcast once.
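A minimal sketch of the periodic-checkpointing pattern described above, written against the public RDD API rather than the internal `PeriodicRDDCheckpointer` class; the node-id update rule and the queue of persisted RDDs are illustrative assumptions, not the actual RandomForest code.
```
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object NodeIdCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("node-id-checkpoint-sketch").getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/spark-checkpoints")   // a checkpoint dir must be set before checkpoint()

    val checkpointInterval = 10
    // All rows start at the root node; the real code keeps one node id per tree per row.
    var nodeIds: RDD[Int] = sc.parallelize(Seq.fill(1000)(1))
    val persisted = mutable.Queue.empty[RDD[Int]]

    for (iter <- 1 to 30) {
      // Stand-in for the real update: each row moves to a left or right child.
      val updated = nodeIds.map(id => 2 * id + id % 2).persist()
      if (iter % checkpointInterval == 0) {
        updated.checkpoint()                        // periodically truncate the growing lineage
      }
      updated.count()                               // materialize before unpersisting the parent
      persisted.enqueue(updated)
      if (persisted.size > 2) persisted.dequeue().unpersist()
      nodeIds = updated
    }
    spark.stop()
  }
}
```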
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing test suites.
Closes #27145 from zhengruifeng/del_NodeIdCache.
Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by:
zhengruifeng <ruifengz@foxmail.com>
The file was removed mllib/src/main/scala/org/apache/spark/ml/tree/impl/NodeIdCache.scala
The file was modified mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala (diff)
Commit 8ce7962931680c204e84dd75783b1c943ea9c525 by gurwls223
[SPARK-30245][SQL] Add cache for Like and RLike when pattern is not
static
### What changes were proposed in this pull request?
Add cache for Like and RLike when pattern is not static
### Why are the changes needed?
When the pattern is not static, we should avoid recompiling it for every row when the pattern value has not changed. Here are the perf numbers; they include 3 test groups and use `range` to keep the setup simple. A sketch of the caching idea follows the numbers.
```
// ---------------------
// 10,000 rows and 10 partitions
val df1 = spark.range(0, 10000, 1, 10).withColumnRenamed("id", "id1")
val df2 = spark.range(0, 10000, 1, 10).withColumnRenamed("id", "id2")
val start = System.currentTimeMillis
df1.join(df2).where("id2 like id1").count()
// before  16939
// after    6352
println(System.currentTimeMillis - start)
// ---------------------
// 10,000 rows and 100 partitions
val df1 = spark.range(0, 10000, 1, 100).withColumnRenamed("id", "id1")
val df2 = spark.range(0, 10000, 1, 100).withColumnRenamed("id", "id2")
val start = System.currentTimeMillis
df1.join(df2).where("id2 like id1").count()
// before  11070
// after    4680
println(System.currentTimeMillis - start)
// ---------------------
// 20,000 rows and 10 partitions
val df1 = spark.range(0, 20000, 1, 10).withColumnRenamed("id", "id1")
val df2 = spark.range(0, 20000, 1, 10).withColumnRenamed("id", "id2")
val start = System.currentTimeMillis
df1.join(df2).where("id2 like id1").count()
// before 66962
// after  29934
println(System.currentTimeMillis - start)
```
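For context, a minimal sketch of the caching idea (not the exact code added to `regexpExpressions.scala`): keep the most recently compiled pattern and recompile only when the runtime pattern value changes; class and field names here are illustrative.
```
import java.util.regex.Pattern

class CachedRegexMatcher {
  private var lastRegex: String = _
  private var compiled: Pattern = _

  def matches(value: String, regex: String): Boolean = {
    if (regex != lastRegex) {          // pattern changed since the previous row
      lastRegex = regex
      compiled = Pattern.compile(regex)
    }
    compiled.matcher(value).matches()
  }
}
```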
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Closes #26875 from ulysses-you/SPARK-30245.
Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (diff)
Commit 28fc0437ce6d2f6fbcd83be38aafb8a491c1a67d by dhyun
[SPARK-28152][SQL][FOLLOWUP] Add a legacy conf for old
MsSqlServerDialect numeric mapping
### What changes were proposed in this pull request?
This is a follow-up for https://github.com/apache/spark/pull/25248 .
### Why are the changes needed?
The new behavior cannot access existing tables that were created with the old behavior. This PR provides a way for existing users to avoid the new behavior.
### Does this PR introduce any user-facing change?
Yes. This will fix the broken behavior on existing tables; a usage sketch of the legacy flag follows.
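A hedged usage sketch only: the exact conf key is declared in `SQLConf.scala` in this change, and the key name below is assumed for illustration.
```
// Hedged sketch: opt back into the old MsSqlServerDialect numeric mapping.
// The conf key name below is an assumption; check SQLConf.scala in this change.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("legacy-mssql-numeric-mapping")
  .config("spark.sql.legacy.mssqlserver.numericMapping.enabled", "true")
  .getOrCreate()

// JDBC reads of existing SQL Server tables then use the old numeric mapping.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;databaseName=db")  // placeholder connection info
  .option("dbtable", "dbo.existing_table")
  .load()
```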
### How was this patch tested?
Pass Jenkins and manually run the JDBC integration tests:
```
build/mvn install -DskipTests
build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test
```
Closes #27184 from dongjoon-hyun/SPARK-28152-CONF.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala (diff)
Commit eefcc7d762a627bf19cab7041a1a82f88862e7e1 by dhyun
[SPARK-21869][SS][DOCS][FOLLOWUP] Document Kafka producer pool
configuration
### What changes were proposed in this pull request?
This patch documents the configuration for the Kafka producer pool, newly revised via SPARK-21869 (#26845).
### Why are the changes needed?
The explanation of the new Kafka producer pool configuration is missing, whereas the doc already covers the Kafka consumer pool configuration. A hedged configuration sketch follows.
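For orientation only: the producer pool is tuned via `spark.kafka.producer.cache.*` properties; the authoritative keys and defaults are in `structured-streaming-kafka-integration.md` as updated by this PR, and the two keys below are assumptions for illustration.
```
// Hedged sketch: tuning the Kafka producer pool at session startup.
// Both property names are assumed; see the doc page modified by this PR.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-producer-pool-tuning")
  .config("spark.kafka.producer.cache.timeout", "10m")                  // idle eviction timeout (assumed key)
  .config("spark.kafka.producer.cache.evictorThreadRunInterval", "1m")  // eviction check interval (assumed key)
  .getOrCreate()
```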
### Does this PR introduce any user-facing change?
Yes. This is a documentation change.
![Screen Shot 2020-01-12 at 11 16 19
PM](https://user-images.githubusercontent.com/9700541/72238148-c8959e00-3591-11ea-87fc-a8918792017e.png)
### How was this patch tested?
N/A
Closes #27146 from HeartSaVioR/SPARK-21869-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
The file was modified docs/structured-streaming-kafka-integration.md (diff)