SuccessChanges

Summary

  1. [SPARK-30215][SQL] Remove PrunedInMemoryFileIndex and merge its functionality into InMemoryFileIndex
  2. [SPARK-30410][SQL] Calculating size of table with large number of partitions causes flooding logs
  3. [MINOR][CORE] Progress bar should print new line to avoid polluting logs
  4. [MINOR][ML][INT] Array.fill(0) -> Array.ofDim; Array.empty -> Array.emptyIntArray
  5. [SPARK-30445][CORE] Accelerator aware scheduling handle setting configs to 0
  6. [SPARK-30281][SS] Consider partitioned/recursive option while verifying archive path on FileStreamSource
Commit 047bff06c3ff11b84dbc2297fda943ce16ec0db5 by wenchen
[SPARK-30215][SQL] Remove PrunedInMemoryFileIndex and merge its
functionality into InMemoryFileIndex
### What changes were proposed in this pull request?
Remove PrunedInMemoryFileIndex and merge its functionality into
InMemoryFileIndex.
### Why are the changes needed?
PrunedInMemoryFileIndex is only used in CatalogFileIndex.filterPartitions,
and its name is somewhat confusing; its functionality can be merged
entirely into InMemoryFileIndex and the class removed.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes #26850 from fuwhu/SPARK-30215.
Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala (diff)
Commit fa36966b1ee878b1cf8b66da79b8e0cc283f55ae by srowen
[SPARK-30410][SQL] Calculating size of table with large number of
partitions causes flooding logs
### What changes were proposed in this pull request?
For a partitioned table, if the number of partitions is very large,
e.g. tens of thousands or more, calculating its total size floods the
logs. The flooding happens in two places: 1. `calculateLocationSize`
logs the start and end of the location-size calculation, and it is
called once per partition; 2. `bulkListLeafFiles` logs every partition
path.
This PR simplifies the logging when calculating the size of a
partitioned table.
### How was this patch tested?
Not applicable (logging-only change).
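The idea can be sketched as follows. This is a hedged illustration, not the actual `CommandUtils` code: all names here are invented, and the point is only that per-partition messages move to debug level while a single summary line covers the whole table.

```scala
// Hedged sketch of the logging change (illustrative names, not the actual
// CommandUtils implementation): per-partition details are logged at debug
// level, and only one summary line is emitted for the whole table.
object SizeLoggingSketch {
  def totalSize(partitionSizes: Seq[Long],
                logDebug: String => Unit,
                logInfo: String => Unit): Long = {
    partitionSizes.zipWithIndex.foreach { case (size, i) =>
      logDebug(s"partition $i: $size bytes") // was: per-partition info logging
    }
    val total = partitionSizes.sum
    logInfo(s"total size: $total bytes across ${partitionSizes.length} partitions")
    total
  }
}
```

With tens of thousands of partitions, this reduces the info-level output from O(partitions) lines to one.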
Closes #27079 from wzhfy/improve_log.
Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala (diff)
Commit b3c2d735d4ead101a0438bb7bbfef833b3a7b68f by srowen
[MINOR][CORE] Progress bar should print new line to avoid polluting logs
### What changes were proposed in this pull request?
Use `println()` instead of `print()` to show the progress bar in the console.
### Why are the changes needed?
Logs are polluted by the progress bar:
![image](https://user-images.githubusercontent.com/16397174/71623360-f59f9380-2c16-11ea-8e27-858a10caf1f5.png)
This is easy to reproduce:
1. start `./bin/spark-shell`
2. `sc.setLogLevel("INFO")`
3. run: `spark.range(100000000).coalesce(1).write.parquet("/tmp/result")`
### Does this PR introduce any user-facing change?
Yes, a friendlier format in the console.
### How was this patch tested?
Tested manually.
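The difference can be illustrated with a minimal sketch (this is not the actual `ConsoleProgressBar` code; the rendering is invented): a bar emitted with `print` leaves the cursor on the same console line, so the next log message is appended right after it, while `println` terminates the line first.

```scala
// Minimal sketch, not the real ConsoleProgressBar: render a text progress
// bar and end it with a newline so subsequent log lines start fresh.
object ProgressBarSketch {
  def renderBar(done: Int, total: Int, width: Int = 20): String = {
    val filled = (done.toLong * width / total).toInt
    "[" + "=" * filled + ">" + " " * (width - filled) + s"] $done/$total"
  }

  def main(args: Array[String]): Unit = {
    (1 to 5).foreach { i =>
      // print(renderBar(i, 5))  // old behavior: next log line pollutes the bar
      println(renderBar(i, 5))   // fixed behavior: bar ends with a newline
    }
  }
}
```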
Closes #27061 from Ngone51/fix-processbar.
Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
The file was modified core/src/main/scala/org/apache/spark/ui/ConsoleProgressBar.scala (diff)
Commit a93b9966358f4e21818cf2d42cf65cd8dcd3fa41 by gurwls223
[MINOR][ML][INT] Array.fill(0) -> Array.ofDim; Array.empty ->
Array.emptyIntArray
### What changes were proposed in this pull request?
1. For primitive types, `Array.fill(n)(0)` -> `Array.ofDim(n)`;
2. for `AnyRef` types, `Array.fill(n)(null)` -> `Array.ofDim(n)`;
3. for primitive types, `Array.empty[XXX]` -> `Array.emptyXXXArray`.
### Why are the changes needed?
`Array.ofDim` avoids the per-element assignments;
`Array.emptyXXXArray` avoids creating a new object.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing test suites.
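The substitutions above can be shown side by side. `Array.ofDim[Int](n)` allocates a zero-initialized array directly, whereas `Array.fill(n)(0)` also runs an assignment loop, and `Array.emptyIntArray` returns a shared empty instance instead of allocating a new one:

```scala
// The three substitutions from this commit, as standalone Scala.
object ArrayIdioms {
  val zeros: Array[Int]    = Array.ofDim[Int](5)    // was: Array.fill(5)(0)
  val nulls: Array[String] = Array.ofDim[String](5) // was: Array.fill(5)(null)
  val empty: Array[Int]    = Array.emptyIntArray    // was: Array.empty[Int]
}
```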
Closes #27133 from zhengruifeng/minor_fill_ofDim.
Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/SortBenchmark.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/clustering/LocalKMeans.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/FMRegressor.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/SizeEstimator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/util/LinearDataGenerator.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/ClusteringSummary.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDFRunner.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/tree/impl/TreePoint.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSqTest.scala (diff)
Commit 0a72dba6f5530de215eb842a8e3242fcc94db342 by dhyun
[SPARK-30445][CORE] Accelerator aware scheduling handle setting configs
to 0
### What changes were proposed in this pull request?
Handle accelerator-aware scheduling configs being set to 0: this PR
simply ignores resource requests whose amount is 0.
### Why are the changes needed?
Better user experience
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit tests added and manually tested on yarn, standalone, local, k8s.
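The behavior can be sketched in a few lines. This is a hedged illustration, not Spark's actual `ResourceUtils` API: the type and function names are invented, and the point is only that an amount of 0 is treated as "not requested" and dropped rather than scheduled.

```scala
// Hedged sketch (illustrative names, not Spark's ResourceUtils): resource
// requests with amount == 0 are dropped instead of being scheduled.
object ResourceFilterSketch {
  final case class ResourceRequest(resourceName: String, amount: Int)

  def effectiveRequests(requested: Seq[ResourceRequest]): Seq[ResourceRequest] =
    requested.filter(_.amount > 0) // amount == 0 means "not requested"; ignore it
}
```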
Closes #27118 from tgravescs/SPARK-30445.
Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
The file was modified core/src/test/scala/org/apache/spark/resource/ResourceUtilsSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/SparkConfSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala (diff)
Commit bd7510bcb75bf6543ac4065f95723e2943114dcb by vanzin
[SPARK-30281][SS] Consider partitioned/recursive option while verifying
archive path on FileStreamSource
### What changes were proposed in this pull request?
This patch renews the verification logic of the archive path for
FileStreamSource, as the existing logic doesn't take the
partitioned/recursive options into account.
Before the patch, it only requires the archive path to have a depth
greater than 2 (two subdirectories from the root), leveraging the fact
that FileStreamSource normally reads files whose parent directory
matches the pattern, or which match the pattern themselves. Given that
the 'archive' operation moves files into the base archive path while
retaining the full path, the archive path tends to be safe if its depth
is greater than 2, meaning FileStreamSource doesn't re-read archived
files as new source files.
With the partitioned/recursive options, this no longer holds, as
FileStreamSource can read files at any depth of subdirectories of the
source pattern. To deal with this correctly, we have to renew the
verification logic, which may not be intuitive or simple but works for
all cases.
The new verification logic prevents both cases:
1) The archive path matches the source pattern as a "prefix" (the depth
of the archive path > the depth of the source pattern).
e.g.
* source pattern: `/hello*/spar?`
* archive path: `/hello/spark/structured/streaming`
Any files in the archive path will match the source pattern when the
recursive option is enabled.
2) The source pattern matches the archive path as a "prefix" (the depth
of the source pattern > the depth of the archive path).
e.g.
* source pattern: `/hello*/spar?/structured/hello2*`
* archive path: `/hello/spark/structured`
Some archived files will not match the source pattern, e.g. file path:
`/hello/spark/structured/hello2`, then final archived path:
`/hello/spark/structured/hello/spark/structured/hello2`.
But some other archived files will still match the source pattern, e.g.
file path: `/hello2/spark/structured/hello2`, then final archived path:
`/hello/spark/structured/hello2/spark/structured/hello2`, which matches
the source pattern when recursive is enabled.
Implicitly it also prevents the archive path from matching the source
pattern as a full match (same depth).
We want to prevent any archived source files from being picked up as
new source files again, so the patch takes the most restrictive
approach to rule out the possible cases.
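The prefix rule above can be sketched as follows. This is illustrative code, not the actual `FileStreamSource` verification (which resolves real glob semantics through the filesystem API); it handles only the `*` and `?` glob forms used in the examples, and flags a conflict whenever the shallower of the two paths matches the other segment by segment:

```scala
// Illustrative sketch of the "most restrictive" rule, not FileStreamSource
// code: a conflict exists when either the source pattern or the archive path
// is a path-prefix of the other (equal depth counts as a full match).
object ArchivePathCheck {
  def conflicts(sourceGlob: String, archivePath: String): Boolean = {
    val src = sourceGlob.split("/").filter(_.nonEmpty)
    val arc = archivePath.split("/").filter(_.nonEmpty)
    // zip truncates to the shallower path, so this checks prefix matching
    // in both directions at once.
    src.zip(arc).forall { case (pat, seg) => globSegmentMatches(pat, seg) }
  }

  // Translate the minimal glob subset from the examples (* and ?) to a regex.
  private def globSegmentMatches(pattern: String, segment: String): Boolean = {
    val regex = pattern.flatMap {
      case '*' => ".*"
      case '?' => "."
      case c   => java.util.regex.Pattern.quote(c.toString)
    }
    segment.matches(regex)
  }
}
```

On the examples above, both case 1 (`/hello*/spar?` vs `/hello/spark/structured/streaming`) and case 2 (`/hello*/spar?/structured/hello2*` vs `/hello/spark/structured`) are rejected as conflicts.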
### Why are the changes needed?
Without this patch, there's a chance archived files are included as new
source files when the partitioned/recursive option is enabled, as the
current condition doesn't take these options into account.
### Does this PR introduce any user-facing change?
Only for Spark 3.0.0-preview (only preview 1 for now, but possibly
preview 2 as well): end users are required to provide an archive path
that satisfies somewhat more complicated conditions, instead of simply
one more than 2 levels deep.
### How was this patch tested?
New UT.
Closes #26920 from HeartSaVioR/SPARK-30281.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
The file was modified docs/structured-streaming-programming-guide.md (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala (diff)