Failed Changes

Summary

  1. [SPARK-29943][SQL] Improve error messages for unsupported data type (commit: 332e593093460c34fae303a913a862b1b579c83f)
  2. [SPARK-30044][ML] MNB/CNB/BNB use empty sigma matrix instead of null (commit: 4021354b73dd86ee765f50ff90ab777edfc21bdb)
  3. [SPARK-29537][SQL] throw exception when user defined a wrong base path (commit: 075ae1eeaf198792650287cd5b3f607a05c574bf)
  4. [SPARK-29348][SQL] Add observable Metrics for Streaming queries (commit: d7b268ab3264b892c4477cf8af30fb78c2694748)
  5. [SPARK-30048][SQL] Enable aggregates with interval type values for RelationalGroupedDataset (commit: 39291cff951639a7ae4b487ea2c606affa5ff76f)
Commit 332e593093460c34fae303a913a862b1b579c83f by gurwls223
[SPARK-29943][SQL] Improve error messages for unsupported data type
### What changes were proposed in this pull request? Improve error
messages for unsupported data type.
### Why are the changes needed? When Spark reads a Hive table and
encounters an unsupported field type, the exception message contains only
the unsupported type, so the user cannot tell which field of which table
caused the problem.
### Does this PR introduce any user-facing change? No.
### How was this patch tested?
```sql
create view t AS SELECT STRUCT('a' AS `$a`, 1 AS b) as q;
```
current:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>
with this change:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>, column: q
```sql
select * from t,t_normal_1,t_normal_2
```
current:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>
with this change:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>, column: q, db: default, table: t
Closes #26577 from cxzl25/unsupport_data_type_msg.
Authored-by: sychen <sychen@ctrip.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 332e593093460c34fae303a913a862b1b579c83f)
The file was modified: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (diff)
Commit 4021354b73dd86ee765f50ff90ab777edfc21bdb by ruifengz
[SPARK-30044][ML] MNB/CNB/BNB use empty sigma matrix instead of null
### What changes were proposed in this pull request? MNB/CNB/BNB use
empty sigma matrix instead of null
### Why are the changes needed?
1. Using an empty sigma matrix simplifies the implementation.
2. I am reviewing the FM implementation these days; FMModels have an
optional bias and linear part. It seems more reasonable to represent an
optional part as an empty vector/matrix or a zero value than as `null`.
### Does this PR introduce any user-facing change? Yes, `sigma` changes
from `null` to an empty matrix.
### How was this patch tested? Updated test suites.
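Illustrative sketch (not from this patch): a minimal example of how the change surfaces to callers, assuming the `NaiveBayesModel.sigma` accessor and a made-up toy training set.
```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("nb-empty-sigma").getOrCreate()
import spark.implicits._

// Tiny toy training set with non-negative features, as multinomial NB requires.
val train = Seq(
  (0.0, Vectors.dense(1.0, 0.0)),
  (1.0, Vectors.dense(0.0, 1.0))
).toDF("label", "features")

// "complement" and "bernoulli" are affected in the same way as "multinomial".
val model = new NaiveBayes().setModelType("multinomial").fit(train)

// Before this change sigma was null for MNB/CNB/BNB; with this change it is
// expected to be an empty (0 x 0) matrix instead.
println(s"sigma: ${model.sigma.numRows} x ${model.sigma.numCols}")

spark.stop()
```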
Closes #26679 from zhengruifeng/nb_use_empty_sigma.
Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by:
zhengruifeng <ruifengz@foxmail.com>
(commit: 4021354b73dd86ee765f50ff90ab777edfc21bdb)
The file was modified: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala (diff)
The file was modified: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala (diff)
The file was modified: python/pyspark/ml/classification.py (diff)
Commit 075ae1eeaf198792650287cd5b3f607a05c574bf by wenchen
[SPARK-29537][SQL] throw exception when user defined a wrong base path
### What changes were proposed in this pull request?
When the user defines a base path that is not an ancestor directory of all
the input paths, throw an exception immediately.
### Why are the changes needed?
Assume we have a DataFrame[c1, c2] written out in Parquet and
partitioned by c1.
When using `spark.read.parquet("/path/to/data/c1=1")` to read the data,
we'll have a DataFrame with column c2 only.
But if we use `spark.read.option("basePath",
"/path/from").parquet("/path/to/data/c1=1")` to read the data, we'll
have a DataFrame with column c1 and c2.
This happens because a wrong base path has no effect in
`parsePartition()`, so partition parsing continues until it reaches a
directory without "=".
The result of the second read does not make sense.
### Does this PR introduce any user-facing change?
Yes. With this change, the user hits an `IllegalArgumentException` when
given a wrong base path, whereas the previous behavior did not throw.
### How was this patch tested?
Added UT.
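Illustrative sketch (not from this patch) of the two reads described above; the paths are the illustrative ones from this description, and partitioned Parquet data is assumed to exist at `/path/to/data/c1=1`.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// basePath is an ancestor of the input path: the partition column c1 is
// recovered in addition to c2, which is the intended use of the option.
val withPartitionColumn = spark.read
  .option("basePath", "/path/to/data")
  .parquet("/path/to/data/c1=1")

// basePath is NOT an ancestor of the input path: with this change the read
// fails fast with IllegalArgumentException instead of silently returning a
// DataFrame that still contains c1.
val wrongBasePath = spark.read
  .option("basePath", "/path/from")
  .parquet("/path/to/data/c1=1")
```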
Closes #26195 from Ngone51/dev-wrong-basePath.
Lead-authored-by: wuyi <ngone_5451@163.com> Co-authored-by: wuyi
<yi.wu@databricks.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 075ae1eeaf198792650287cd5b3f607a05c574bf)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala (diff)
Commit d7b268ab3264b892c4477cf8af30fb78c2694748 by herman
[SPARK-29348][SQL] Add observable Metrics for Streaming queries
### What changes were proposed in this pull request? Observable metrics
are arbitrary, named aggregate functions that can be defined on a query
(DataFrame). As soon as the execution of a DataFrame reaches a
completion point (e.g. finishes a batch query or reaches a streaming
epoch), a named event is emitted that contains the metrics for the data
processed since the last completion point.
A user can observe these metrics by attaching a listener to the Spark
session; which listener to attach depends on the execution mode:
- Batch: `QueryExecutionListener`. This will be called when the query
completes. A user can access the metrics by using the
`QueryExecution.observedMetrics` map.
- (Micro-batch) Streaming: `StreamingQueryListener`. This will be called
when the streaming query completes an epoch. A user can access the
metrics by using the `StreamingQueryProgress.observedMetrics` map.
Please note that we currently do not support continuous execution
streaming.
### Why are the changes needed? This enables observable metrics.
### Does this PR introduce any user-facing change? Yes. It adds the
`observe` method to `Dataset`.
### How was this patch tested?
- Added unit tests for the `CollectMetrics` logical node to the
`AnalysisSuite`.
- Added unit tests for `StreamingProgress` JSON serialization to the
`StreamingQueryStatusAndProgressSuite`.
- Added integration tests for streaming to the
`StreamingQueryListenerSuite`.
- Added integration tests for batch to the `DataFrameCallbackSuite`.
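Illustrative sketch (not from this patch): a minimal batch-mode example of the API described above, assuming a local `SparkSession`; the metric name `my_metrics` and the column `value` are made up for illustration.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.functions.{col, count, lit, sum}
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1L, 2L, 3L).toDF("value")

// Batch mode: register a QueryExecutionListener and read the named metrics
// from QueryExecution.observedMetrics once the action completes.
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    qe.observedMetrics.get("my_metrics").foreach { row =>
      println(s"rows=${row.getAs[Long]("rows")}, total=${row.getAs[Long]("total")}")
    }
  }
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
})

// Attach named aggregate metrics to the Dataset and trigger execution.
df.observe("my_metrics", count(lit(1)).as("rows"), sum(col("value")).as("total"))
  .collect()
```
For micro-batch streaming, the same metrics would surface through `StreamingQueryListener` via `StreamingQueryProgress.observedMetrics`, as noted above.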
Closes #26127 from hvanhovell/SPARK-29348.
Authored-by: herman <herman@databricks.com> Signed-off-by: herman
<herman@databricks.com>
(commit: d7b268ab3264b892c4477cf8af30fb78c2694748)
The file was modified: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/AggregatingAccumulator.scala (diff)
The file was modified: project/MimaExcludes.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was added: sql/core/src/main/scala/org/apache/spark/sql/execution/CollectMetricsExec.scala
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/PlanHelper.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryStatusAndProgressSuite.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenerSuite.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/AggregatingAccumulatorSuite.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
Commit 39291cff951639a7ae4b487ea2c606affa5ff76f by wenchen
[SPARK-30048][SQL] Enable aggregates with interval type values for
RelationalGroupedDataset
### What changes were proposed in this pull request?
Now that min/max/sum/avg support interval values, we should also enable
them in RelationalGroupedDataset.
### Why are the changes needed?
API consistency improvement
### Does this PR introduce any user-facing change?
Yes, Dataset supports min/max/sum/avg (mean) on intervals.
### How was this patch tested?
Added unit tests.
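Illustrative sketch (not from this patch) of the behavior this enables, assuming a local `SparkSession`; the interval values are built with SQL interval literals.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, min, sum}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Interval-typed column (CalendarIntervalType) built from SQL interval literals.
val df = spark.sql(
  "SELECT * FROM VALUES (1, interval 1 day), (1, interval 2 days), (2, interval 3 hours) AS t(k, i)")

// With this change, the RelationalGroupedDataset shortcuts accept interval
// values, matching the SQL-side min/max/sum/avg support.
df.groupBy("k").agg(min("i"), max("i"), sum("i"), avg("i")).show(false)
```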
Closes #26681 from yaooqinn/SPARK-30048.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 39291cff951639a7ae4b487ea2c606affa5ff76f)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala (diff)