SuccessChanges

Summary

  1. [SPARK-30214][SQL] A new framework to resolve v2 commands (details)
  2. [SPARK-30225][CORE] Correct read() behavior past EOF in NioBufferedFileInputStream (details)
  3. [SPARK-29768][SQL][FOLLOW-UP] Improve handling non-deterministic filter of ScanOperation (details)
Commit c49388a48431e37745ab9ee690b3ef3743f4b3b5 by wenchen
[SPARK-30214][SQL] A new framework to resolve v2 commands
### What changes were proposed in this pull request?
Currently, we have a v2 adapter for the v1 catalog (`V2SessionCatalog`), so all the
table/namespace commands can be implemented via v2 APIs.
Usually, a command needs to know which catalog it needs to operate on, but
different commands have different requirements about what to resolve. A
few examples:
- `DROP NAMESPACE`: only needs to know the name of the namespace.
- `DESC NAMESPACE`: needs to look up the namespace and get its metadata, but this is done during execution.
- `DROP TABLE`: needs to look up the table and make sure it's a table, not a (temp) view.
- `DESC TABLE`: needs to look up the table and get its metadata.
For namespaces, the analyzer only needs to find the catalog and the
namespace name. The command can do lookup during execution if needed.
For tables, mostly commands need the analyzer to do lookup.
Note that tables and namespaces differ here: `DESC NAMESPACE testcat` works
and describes the root namespace under `testcat`, while `DESC TABLE testcat`
fails if there is no table `testcat` under the current catalog. This is
because a namespace can have an empty name (`[]`), but a table can't. A
command should therefore explicitly specify whether it operates on a
namespace or a table.
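For illustration only (not part of this commit's diff), assuming a v2 catalog named `testcat` is registered, the difference looks like this from a Spark shell:
```scala
// Spark shell sketch; `testcat` is an illustrative v2 catalog name.
spark.sql("DESC NAMESPACE testcat").show() // works: describes the root namespace under `testcat`
spark.sql("DESC TABLE testcat").show()     // fails if no table `testcat` exists in the current catalog
```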
In this pull request, we introduce a new framework to resolve v2 commands:
1. The parser creates logical plans or commands with
`UnresolvedNamespace`/`UnresolvedTable`/`UnresolvedView`/`UnresolvedRelation`.
(CREATE TABLE still keeps `Seq[String]`, as it doesn't need to look up relations.)
2. The analyzer converts:
    2.1 `UnresolvedNamespace` to `ResolvedNamespace` (contains catalog and namespace identifier)
    2.2 `UnresolvedTable` to `ResolvedTable` (contains catalog, identifier and `Table`)
    2.3 `UnresolvedView` to `ResolvedView` (will be added later when we migrate view commands)
    2.4 `UnresolvedRelation` to a relation.
3. An extra analyzer rule matches commands containing `V1Table` and converts
them to the corresponding v1 commands. This will be added later when we
migrate existing commands.
4. The planner matches commands and converts them to the corresponding physical nodes.
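As a rough sketch of the Unresolved-to-Resolved pattern described above (using simplified stand-in case classes, not the actual Spark classes), an analyzer-style rule might look like this:
```scala
// Minimal, self-contained sketch of the Unresolved -> Resolved pattern.
// These case classes are simplified stand-ins, NOT the real Spark classes.
sealed trait LogicalPlan
case class UnresolvedNamespace(nameParts: Seq[String]) extends LogicalPlan
case class ResolvedNamespace(catalog: String, namespace: Seq[String]) extends LogicalPlan
case class DescribeNamespace(child: LogicalPlan) extends LogicalPlan

object ResolveNamespaces {
  // Pretend catalog lookup: the first name part may name a registered catalog;
  // otherwise fall back to the current (session) catalog.
  private val registeredCatalogs = Set("testcat")

  def apply(plan: LogicalPlan): LogicalPlan = plan match {
    case DescribeNamespace(UnresolvedNamespace(parts)) =>
      val (catalog, ns) = parts match {
        case head +: rest if registeredCatalogs.contains(head) => (head, rest)
        case other                                             => ("spark_catalog", other)
      }
      DescribeNamespace(ResolvedNamespace(catalog, ns))
    case other => other
  }
}

// ResolveNamespaces(DescribeNamespace(UnresolvedNamespace(Seq("testcat"))))
//   == DescribeNamespace(ResolvedNamespace("testcat", Seq()))  // root namespace of `testcat`
```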
We also introduce brand-new v2 commands, the `comment` syntaxes, to
illustrate how to work with the newly added framework.
```sql
COMMENT ON (DATABASE|SCHEMA|NAMESPACE) ... IS ...
COMMENT ON TABLE ... IS ...
```
Details about the `comment` syntaxes: under the new catalog v2 design, some
properties become reserved, e.g. `location` and `comment`. We are going to
disable setting reserved properties directly via dbproperties or
tblproperties, to avoid conflicts with their related sub-clauses or specific
commands.
This follows the best practices of PostgreSQL and Presto:
https://www.postgresql.org/docs/12/sql-comment.html
https://prestosql.io/docs/current/sql/comment.html
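A hypothetical usage example of the new syntax from a Spark shell (the catalog, namespace, and table names below are made up):
```scala
// Illustrative names only; `testcat` is assumed to be a configured v2 catalog.
spark.sql("COMMENT ON NAMESPACE testcat.ns1 IS 'business data'")
spark.sql("COMMENT ON TABLE testcat.ns1.t1 IS 'fact table'")
// Setting the reserved `comment` property directly through DBPROPERTIES or
// TBLPROPERTIES is expected to be disallowed in favor of the commands above.
```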
Most of the basic ideas behind the new framework came from the discussion
below with cloud-fan:
https://github.com/apache/spark/pull/26847#issuecomment-564510061
### Why are the changes needed?
To make it easier to add new v2 commands, and easier to unify the table
relation behavior.
### Does this PR introduce any user-facing change?
Yes, it adds new syntax.
### How was this patch tested?
Added unit tests.
Closes #26847 from yaooqinn/SPARK-30214.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/comments.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/table.scala
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/namespace.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
Commit cb9fc4bb6ffe542e526a6006582606eb709fb311 by gurwls223
[SPARK-30225][CORE] Correct read() behavior past EOF in
NioBufferedFileInputStream
This bug manifested itself when another stream would potentially make a
call to NioBufferedFileInputStream.read() after it had reached EOF in
the wrapped stream. In that case, the refill() code would clear the
output buffer the first time EOF was found, leaving it in a readable
state for subsequent read() calls. If any of those calls were made, bad
data would be returned.
By flipping the buffer before returning, even in the EOF case, you get
the correct behavior in subsequent calls. I picked that approach to
avoid keeping more state in this class, although it means calling the
underlying stream even after EOF (which is fine, but perhaps a little
more expensive).
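As a simplified, standalone sketch of why the flip matters (this is not the actual NioBufferedFileInputStream code): after `clear()` a `ByteBuffer` still reports its whole capacity as remaining, whereas `clear()` followed by `flip()` with nothing read leaves zero bytes remaining, so a buffered read() can correctly signal EOF.
```scala
import java.nio.ByteBuffer

// Simplified illustration of the buffer states involved in the fix.
object BufferFlipDemo extends App {
  // Simulate a refill() that hit EOF: nothing was written into the buffer.
  val buf = ByteBuffer.allocate(8)

  buf.clear()
  // Without flip(): the buffer still looks readable even though it was never
  // filled, so a subsequent read() would hand back stale/garbage bytes.
  println(buf.hasRemaining) // true

  buf.clear()
  buf.flip()
  // With flip(): limit becomes 0, nothing appears readable, and read() can return -1.
  println(buf.hasRemaining) // false
}
```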
This showed up (at least) when using encryption, because the
commons-crypto StreamInput class does not track EOF internally, leaving
it for the wrapped stream to behave correctly.
Tested with added unit test + slightly modified test case attached to
SPARK-18105.
Closes #27084 from vanzin/SPARK-30225.
Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
The file was modified core/src/test/java/org/apache/spark/io/GenericFileInputStreamSuite.java (diff)
The file was modified core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java (diff)
Commit e38964c44211c652ffc8c3d504b80e7fa0b6930d by wenchen
[SPARK-29768][SQL][FOLLOW-UP] Improve handling non-deterministic filter
of ScanOperation
### What changes were proposed in this pull request?
1. For `ScanOperation`, if it collects more than one filter, then all of the
filters must be deterministic; a filter can be non-deterministic only if it
is the single collected filter.
2. `FileSourceStrategy` should filter out non-deterministic filters, as a
non-deterministic partition-related filter would otherwise hit a "hasn't
been initialized" exception.
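A minimal sketch of the collection rule in point 1, using a toy predicate model rather than Catalyst's actual `Expression` type:
```scala
// Toy model of point 1; Pred stands in for Catalyst's Expression.
final case class Pred(sql: String, deterministic: Boolean)

// Collected filters are acceptable only if there is at most one filter,
// or every collected filter is deterministic.
def filtersAreCollectible(filters: Seq[Pred]): Boolean =
  filters.size <= 1 || filters.forall(_.deterministic)

filtersAreCollectible(Seq(Pred("a > 1", deterministic = true),
                          Pred("rand() < 0.5", deterministic = false))) // false
filtersAreCollectible(Seq(Pred("rand() < 0.5", deterministic = false))) // true
```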
### Why are the changes needed?
This strictly follows `CombineFilters`'s behavior, which doesn't allow
combining two filters when non-deterministic predicates exist, and avoids
hitting an exception for the file source.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Test exists.
Closes #27073 from Ngone51/SPARK-29768-FOLLOWUP.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala (diff)