1. [SPARK-9612][ML][FOLLOWUP] fix GBT support weights if subsamplingRate<1
2. [SPARK-30426][SS][DOC] Fix the disorder of structured-streaming-kafka-integration page
Commit f8cfefaf8d27924a8c357a084f944b278f6c9170 by ruifengz
[SPARK-9612][ML][FOLLOWUP] fix GBT support weights if subsamplingRate<1
### What changes were proposed in this pull request?
1, fix `BaggedPoint.convertToBaggedRDD` when `subsamplingRate < 1.0`
2, reorganize `RandomForest.runWithMetadata` along the way
### Why are the changes needed?
In GBT, instance weights will be discarded if `subsamplingRate < 1`.
1, `baggedPoint: BaggedPoint[TreePoint]` is used during tree growth to find the best split;
2, `BaggedPoint[TreePoint]` contains two weights:
```scala
class BaggedPoint[Datum](val datum: Datum, val subsampleCounts: Array[Int], val sampleWeight: Double = 1.0)
class TreePoint(val label: Double, val binnedFeatures: Array[Int], val weight: Double)
```
3, only the `sampleWeight` field of `BaggedPoint` is used; the `weight` field of `TreePoint` is never used when finding splits;
4, the method `BaggedPoint.convertToBaggedRDD` was previously changed only for decision trees, so only the following code path was updated:
```
if (numSubsamples == 1 && subsamplingRate == 1.0) {
  convertToBaggedRDDWithoutSampling(input, extractSampleWeight)
```
5, earlier, I made GBT support weights, but only tested it with the default `subsamplingRate == 1`. GBT with `subsamplingRate < 1` converts treePoints to baggedPoints via
```scala
convertToBaggedRDDSamplingWithoutReplacement(input, subsamplingRate, numSubsamples, seed)
```
in which the original weights from `weightCol` are discarded and every `sampleWeight` is assigned the default 1.0.
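The idea behind the fix can be illustrated with a minimal, self-contained sketch. This is not the actual patch: `BaggingSketch`, `sampleWithoutReplacement`, and the use of `scala.util.Random` over a plain `Seq` are assumptions made for illustration, and the two case classes are simplified stand-ins for the Spark classes quoted in the commit message. The point is only that the sampling path takes a weight extractor and carries the weight through, instead of falling back to the default 1.0.

```scala
import scala.util.Random

// Simplified stand-ins for the Spark classes (hypothetical, not the real implementations).
final case class TreePoint(label: Double, binnedFeatures: Array[Int], weight: Double)
final case class BaggedPoint[Datum](
    datum: Datum,
    subsampleCounts: Array[Int],
    sampleWeight: Double = 1.0)

object BaggingSketch {
  // Sampling without replacement: each point is included in each subsample
  // with probability `subsamplingRate`, so every count is 0 or 1.
  // The weight is read from the datum through `extractSampleWeight`,
  // so it is not silently replaced by the default 1.0.
  def sampleWithoutReplacement[Datum](
      input: Seq[Datum],
      subsamplingRate: Double,
      numSubsamples: Int,
      extractSampleWeight: Datum => Double,
      seed: Long): Seq[BaggedPoint[Datum]] = {
    val rng = new Random(seed)
    input.map { datum =>
      val counts = Array.fill(numSubsamples) {
        if (rng.nextDouble() < subsamplingRate) 1 else 0
      }
      BaggedPoint(datum, counts, extractSampleWeight(datum))
    }
  }
}
```

Calling `BaggingSketch.sampleWithoutReplacement[TreePoint](points, 0.5, 1, _.weight, 42L)` yields bagged points whose `sampleWeight` equals each point's original `weight`, regardless of which points end up in the subsample.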
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Updated test suites.
Closes #27070 from zhengruifeng/gbt_sampling.
Authored-by: zhengruifeng
Signed-off-by: zhengruifeng
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
The file was modified mllib/src/test/scala/org/apache/spark/ml/tree/impl/BaggedPointSuite.scala
The file was modified mllib/src/test/scala/org/apache/spark/ml/regression/GBTRegressorSuite.scala
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/impl/BaggedPoint.scala
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala
Commit bc16bb1dd095c9e1c8deabf6ac0d528441a81d88 by wenchen
[SPARK-30426][SS][DOC] Fix the disorder of structured-streaming-kafka-integration page
### What changes were proposed in this pull request?
Fix the disorder of the `structured-streaming-kafka-integration` page caused by #23747.
### Why are the changes needed?
A typo messed up the HTML page.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Tested locally with Jekyll.
Closes #27098 from xuanyuanking/SPARK-30426.
Authored-by: Yuanjian Li
Signed-off-by: Wenchen Fan
The file was modified docs/