SuccessChanges

Summary

  1. [SPARK-28825][SQL][DOC] Documentation for Explain Command (details)
  2. [SPARK-30381][ML] Refactor GBT to reuse treePoints for all trees (details)
  3. [SPARK-30302][SQL] Complete info for show create table for views (details)
  4. [SPARK-30453][BUILD][R] Update AppVeyor R version to 3.6.2 (details)
  5. [SPARK-30429][SQL] Optimize catalogString and usage in ValidateExternalType.errMsg to avoid OOM (details)
Commit ed73ed83d36e2c832889e281c32f50046c6fbec5 by yamamuro
[SPARK-28825][SQL][DOC] Documentation for Explain Command
## What changes were proposed in this pull request?
Document the EXPLAIN statement in the SQL Reference Guide.
## Why are the changes needed?
Adds documentation to the SQL reference.
## Does this PR introduce any user-facing change?
Yes.
Before: there was no documentation for this. After:
![image (11)](https://user-images.githubusercontent.com/51401130/71816281-18fb9000-30a8-11ea-94cb-8380de1d5da4.png)
![image (10)](https://user-images.githubusercontent.com/51401130/71816282-18fb9000-30a8-11ea-8505-1ef3effb01ac.png)
![image (9)](https://user-images.githubusercontent.com/51401130/71816283-19942680-30a8-11ea-9c20-b81e18c7d7e2.png)
## How was this patch tested?
Verified with `jekyll build` and `jekyll serve`.
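For reference, a minimal sketch of the statement being documented, assuming a `spark-shell` session and a hypothetical table `t`:

```scala
// Hypothetical table just for the example.
spark.sql("CREATE TABLE t (k INT, v STRING) USING parquet")

// Basic EXPLAIN prints the physical plan.
spark.sql("EXPLAIN SELECT k, count(v) FROM t GROUP BY k").show(truncate = false)

// EXTENDED also prints the parsed, analyzed, and optimized logical plans.
spark.sql("EXPLAIN EXTENDED SELECT k FROM t WHERE k > 0").show(truncate = false)
```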
Closes #26970 from PavithraRamachandran/explain_doc.
Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com> Signed-off-by:
Takeshi Yamamuro <yamamuro@apache.org>
The file was modified docs/sql-ref-syntax-qry-explain.md (diff)
Commit d7c7e37ae09eb8536ca3a7e47ff3e5fda826b376 by ruifengz
[SPARK-30381][ML] Refactor GBT to reuse treePoints for all trees
### What changes were proposed in this pull request?
Make GBT reuse splits/treePoints for all trees:

1. Reuse splits/treePoints for all trees: the existing implementation finds feature splits and transforms the input vectors to treePoints for each tree, while other well-known implementations such as XGBoost/LightGBM build global splits/binned features once and reuse them for all trees. Note: the sampling rate used to build `splits` in the existing implementation is not the param `subsamplingRate` but the output of `RandomForest.samplesFractionForFindSplits`, which depends on `maxBins` and `numExamples`. Note II: the existing implementation does not guarantee that splits are the same across iterations, so this change may cause a small difference in convergence.

2. Do not cache the input vectors: the existing implementation caches the input twice: first, `input: RDD[Instance]` is used to compute/update predictions and errors; second, at each iteration the input is transformed to bagged points, which are cached for that iteration. In this PR, `input: RDD[Instance]` is no longer cached, since it is only used three times: to compute metadata, to find splits, and to convert to treePoints. Instead, the treePoints `RDD[TreePoint]` is cached; at each iteration it is converted to bagged points by attaching an extra `labelWithCounts: RDD[(Double, Int)]` carrying residual/sampleCount information. This RDD is relatively small (like the cached `norms` in KMeans). To compute/update predictions and errors, a new prediction method based on binned features is added in `Node`. A sketch of this caching pattern follows below.
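A minimal sketch of the caching pattern described above, with toy data standing in for the actual `GradientBoostedTrees` internals (the object name and residual update here are illustrative, not the PR's code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object GbtCachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("gbt-caching-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Binned features (`treePoints`): computed once and cached for all trees.
    val treePoints = sc.parallelize(Seq(Array(0, 1), Array(1, 0), Array(1, 1)), 2)
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Small per-iteration RDD analogous to `labelWithCounts: RDD[(Double, Int)]`
    // in the PR: (residual, sampleCount).
    var labelWithCounts = sc.parallelize(Seq((1.0, 1), (0.0, 1), (1.0, 1)), 2)

    for (iter <- 0 until 3) {
      // Bagged points for this iteration: a cheap zip, no re-binning of the input.
      val bagged = treePoints.zip(labelWithCounts)
      val loss = bagged.map { case (_, (residual, cnt)) => residual * residual * cnt }.sum()
      println(s"iter=$iter loss=$loss")
      // Toy residual update standing in for fitting one more tree.
      labelWithCounts = labelWithCounts.map { case (r, c) => (r * 0.5, c) }
    }
    treePoints.unpersist()
    spark.stop()
  }
}
```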
### Why are the changes needed?
Performance improvement: 1) 40%~50% faster than the existing implementation; 2) saves 30%~50% RAM.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing test suites and several manual tests in the REPL.
Closes #27103 from zhengruifeng/gbt_reuse_bagged.
Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by:
zhengruifeng <ruifengz@foxmail.com>
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/Split.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala (diff)
Commit 9535776e288da7c1a582de09f2079c34dfba1fed by yamamuro
[SPARK-30302][SQL] Complete info for show create table for views
### What changes were proposed in this pull request?
Add table/column comments and table properties to the result of SHOW CREATE TABLE for views.
### Does this PR introduce any user-facing change?
After this patch, the result of SHOW CREATE TABLE on a view can contain table/column comments and table properties if they exist, as sketched below.
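A minimal sketch of the change, assuming a `spark-shell` session; the table/view names, comments, and properties are illustrative:

```scala
// Hypothetical base table and a view carrying comments and properties.
spark.sql("CREATE TABLE base (a INT COMMENT 'col a') USING parquet")
spark.sql(
  """CREATE VIEW v1 (a COMMENT 'view col a')
    |COMMENT 'a simple view'
    |TBLPROPERTIES ('owner' = 'analytics')
    |AS SELECT a FROM base""".stripMargin)

// After this patch, the generated statement should also carry the column
// comment, the view comment, and the TBLPROPERTIES clause.
spark.sql("SHOW CREATE TABLE v1").show(truncate = false)
```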
### How was this patch tested?
Added new tests.
Closes #26944 from wzhfy/complete_show_create_view.
Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Takeshi
Yamamuro <yamamuro@apache.org>
The file was modified sql/core/src/test/resources/sql-tests/results/show-create-table.sql.out (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/show-create-table.sql (diff)
Commit 390e6bd7bcd6aa624a5887b2cce2720abeb67b00 by dhyun
[SPARK-30453][BUILD][R] Update AppVeyor R version to 3.6.2
### What changes were proposed in this pull request?
R version 3.6.2 (Dark and Stormy Night) was released on 2019-12-12. This PR upgrades the R installation in the AppVeyor CI environment.
### Why are the changes needed?
To test the latest R versions before the release and see if there are any regressions.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
AppVeyor will test it.
Closes #27124 from HyukjinKwon/upgrade-r-version-appveyor.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
The file was modified dev/appveyor-install-dependencies.ps1 (diff)
Commit 1160457eedf2756e920b9d47277dcd6483963a11 by dhyun
[SPARK-30429][SQL] Optimize catalogString and usage in ValidateExternalType.errMsg to avoid OOM
### What changes were proposed in this pull request?
This patch proposes:
1. Fix the OOM in WideSchemaBenchmark: make `ValidateExternalType.errMsg` a lazy variable, i.e. do not initialize it in the constructor.
2. Truncate `errMsg`: replace `catalogString` with `simpleString`, which is truncated.
3. Optimize `override def catalogString` in `StructType`: make `catalogString` more efficient at string generation by using `StringConcat`. A sketch of the truncation idea follows this list.
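A minimal sketch of the bounded-string-building idea behind items 2 and 3, using a hypothetical length-capped appender in place of Catalyst's actual `StringConcat` and `simpleString` logic:

```scala
// Hypothetical length-capped appender approximating the effect described
// above; not the actual StringConcat/simpleString code in Catalyst.
class CappedConcat(maxLength: Int = 1024) {
  private val sb = new StringBuilder
  def append(s: String): Unit =
    if (sb.length < maxLength) sb.append(s.take(maxLength - sb.length))
  override def toString: String = sb.toString
}

object CappedConcatDemo {
  def main(args: Array[String]): Unit = {
    // A struct with a million fields no longer materializes a giant string.
    val concat = new CappedConcat()
    concat.append("struct<")
    (0 until 1000000).foreach(i => concat.append(s"value_$i:bigint,"))
    concat.append(">")
    println(concat.toString.length) // bounded by maxLength
  }
}
```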
### Why are the changes needed?
As described in the JIRA, WideSchemaBenchmark fails with an OOM like:
```
[error] Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: validateexternaltype(getexternalrowfield(input[0, org.apache.spark.sql.Row, true], 0, a), StructField(b,StructType(StructField(c,StructType(StructField(value_1,LongType,true), StructField(value_10,LongType,true), StructField(value_100,LongType,true), StructField(value_1000,LongType,true), StructField(value_1001,LongType,true), StructField(value_1002,LongType,true), StructField(value_1003,LongType,true), StructField(value_1004,LongType,true), StructField(value_1005,LongType,true), StructField(value_1006,LongType,true), StructField(value_1007,LongType,true), StructField(value_1008,LongType,true), StructField(value_1009,LongType,true), StructField(value_101,LongType,true), StructField(value_1010,LongType,true), StructField(value_1011,LongType, ... ue), StructField(value_99,LongType,true), StructField(value_990,LongType,true), StructField(value_991,LongType,true), StructField(value_992,LongType,true), StructField(value_993,LongType,true), StructField(value_994,LongType,true), StructField(value_995,LongType,true), StructField(value_996,LongType,true), StructField(value_997,LongType,true), StructField(value_998,LongType,true), StructField(value_999,LongType,true)),true))
[error]         at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:435)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:408)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
....
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:404)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:307)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
[error]         at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.<init>(ExpressionEncoder.scala:198)
[error]         at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:71)
[error]         at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
[error]         at org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:554)
[error]         at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:476)
[error]         at org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark$.$anonfun$wideShallowlyNestedStructFieldReadAndWrite$1(WideSchemaBenchmark.scala:126)
...
[error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[error]         at java.util.Arrays.copyOf(Arrays.java:3332)
[error]         at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
[error]         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
[error]         at java.lang.StringBuilder.append(StringBuilder.java:136)
[error]         at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:213)
[error]         at scala.collection.TraversableOnce.$anonfun$addString$1(TraversableOnce.scala:368)
[error]         at scala.collection.TraversableOnce$$Lambda$67/667447085.apply(Unknown Source)
[error]         at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
[error]         at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[error]         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
[error]         at scala.collection.TraversableOnce.addString(TraversableOnce.scala:362)
[error]         at scala.collection.TraversableOnce.addString$(TraversableOnce.scala:358)
[error]         at scala.collection.mutable.ArrayOps$ofRef.addString(ArrayOps.scala:198)
[error]         at scala.collection.TraversableOnce.mkString(TraversableOnce.scala:328)
[error]         at scala.collection.TraversableOnce.mkString$(TraversableOnce.scala:327)
[error]         at scala.collection.mutable.ArrayOps$ofRef.mkString(ArrayOps.scala:198)
[error]         at scala.collection.TraversableOnce.mkString(TraversableOnce.scala:330)
[error]         at scala.collection.TraversableOnce.mkString$(TraversableOnce.scala:330)
[error]         at scala.collection.mutable.ArrayOps$ofRef.mkString(ArrayOps.scala:198)
[error]         at org.apache.spark.sql.types.StructType.catalogString(StructType.scala:411)
[error]         at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1695)
[error]         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[error]         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[error]         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[error]         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$7(TreeNode.scala:468)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$934/387827651.apply(Unknown Source)
[error]         at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$1(TreeNode.scala:467)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$929/449240381.apply(Unknown Source)
[error]         at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:435)
```
This started after commit cb5ea201df5fae8aacb653ffb4147b9288bca1e9, which refactors `ExpressionEncoder`.
The stack trace shows that it fails in `transformUp` on `objSerializer` in `ExpressionEncoder`. In particular, it fails while initializing `ValidateExternalType.errMsg`, which interpolates the `catalogString` of the given `expected` data type into a string. WideSchemaBenchmark uses a very deeply nested data type, so when we transform a serializer that contains `ValidateExternalType`, we create a redundant, huge `errMsg` string. Because we are only transforming the node and do not use the message yet, it is useless and wastes a lot of memory.
After making `ValidateExternalType.errMsg` a lazy variable, WideSchemaBenchmark works; a sketch of the effect follows below.
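A minimal sketch of why the lazy variable helps, with toy names standing in for the actual `ValidateExternalType` code in objects.scala:

```scala
object LazyErrMsgSketch {
  // Expensive stand-in for catalogString on a very wide schema.
  def wideCatalogString(nFields: Int): String =
    (0 until nFields).map(i => s"value_$i:bigint").mkString("struct<", ",", ">")

  // Toy expression node: with `lazy val`, copying the node (as Catalyst's
  // makeCopy does during transformUp) never materializes errMsg.
  final case class ValidateSketch(nFields: Int) {
    lazy val errMsg: String =
      s"not a valid external type for schema of ${wideCatalogString(nFields)}"
  }

  def main(args: Array[String]): Unit = {
    // Making many copies is cheap because errMsg is never touched...
    val copies = (0 until 1000).map(_ => ValidateSketch(100000))
    // ...and the big string is built only if the message is actually read.
    println(copies.head.errMsg.length)
  }
}
```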
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual test with WideSchemaBenchmark.
Closes #27117 from viirya/SPARK-30429.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by:
Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun
<dhyun@apple.com>
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala (diff)