FailedChanges

Summary

  1. [SPARK-32625][SQL] Log error message when falling back to interpreter (commit: c280c7f) (details)
  2. [SPARK-32610][DOCS] Fix the link to metrics.dropwizard.io in (commit: 9a79bbc) (details)
  3. [SPARK-32399][SQL] Full outer shuffled hash join (commit: 8f0fef1) (details)
Commit c280c7f529e2766dd7dd45270bde340c28b9d74b by dongjoon
[SPARK-32625][SQL] Log error message when falling back to interpreter
mode
### What changes were proposed in this pull request?
This PR logs the error message when falling back to interpreter mode.
### Why are the changes needed?
Not all error messages are in `CodeGenerator`, such as:
```
21:48:44.612 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr codegen error and falling back to interpreter mode
java.lang.IllegalArgumentException: Can not interpolate org.apache.spark.sql.types.Decimal into code block.
	at org.apache.spark.sql.catalyst.expressions.codegen.Block$BlockHelper$.$anonfun$code$1(javaCode.scala:240)
	at org.apache.spark.sql.catalyst.expressions.codegen.Block$BlockHelper$.$anonfun$code$1$adapted(javaCode.scala:236)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
```
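The fallback-with-logging pattern this change describes can be sketched as follows. This is a simplified illustration, not the actual `CodeGeneratorWithInterpretedFallback` code; the class and method names below are assumptions for the sketch.

```scala
import scala.util.control.NonFatal

// Simplified sketch of a codegen factory with an interpreted fallback.
// The point of the change: log the underlying error (which may originate
// outside CodeGenerator) before silently falling back.
abstract class WithInterpretedFallback[IN, OUT] {
  protected def createCodeGeneratedObject(in: IN): OUT
  protected def createInterpretedObject(in: IN): OUT

  def createObject(in: IN): OUT =
    try {
      createCodeGeneratedObject(in)
    } catch {
      case NonFatal(e) =>
        // Surface the real cause in the log, then fall back.
        Console.err.println(
          s"Expr codegen error and falling back to interpreter mode: $e")
        createInterpretedObject(in)
    }
}

object FallbackDemo extends WithInterpretedFallback[String, String] {
  override protected def createCodeGeneratedObject(in: String): String =
    throw new IllegalArgumentException(
      s"Can not interpolate $in into code block.")
  override protected def createInterpretedObject(in: String): String =
    s"interpreted:$in"
}
```

Calling `FallbackDemo.createObject("Decimal")` exercises the failing codegen path, logs the `IllegalArgumentException`, and returns the interpreted result.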
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #29440 from wangyum/SPARK-32625.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: c280c7f)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallback.scala (diff)
Commit 9a79bbc8b6e426e7b29a9f4867beb396014d8046 by srowen
[SPARK-32610][DOCS] Fix the link to metrics.dropwizard.io in
monitoring.md to refer the proper version
### What changes were proposed in this pull request?
This PR fixes the links to metrics.dropwizard.io in monitoring.md so that they refer to the proper version of the library.
### Why are the changes needed?
There are links to metrics.dropwizard.io in monitoring.md, but the link targets refer to version 3.1.0 while we use 4.1.1. Now that users can create their own metrics using the Dropwizard library, it's better to fix the links to refer to the proper version.
### Does this PR introduce _any_ user-facing change?
Yes. The modified links refer to version 4.1.1.
### How was this patch tested?
Build the docs and visit all the modified links.
Closes #29426 from sarutak/fix-dropwizard-url.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 9a79bbc)
The file was modified docs/monitoring.md (diff)
The file was modified pom.xml (diff)
Commit 8f0fef18438aa8fb07f5ed885ffad1339992f102 by yamamuro
[SPARK-32399][SQL] Full outer shuffled hash join
### What changes were proposed in this pull request?
Add support for full outer join inside shuffled hash join. Currently, if the query is a full outer join, we only use sort merge join as the physical operator. However, sort merge join can be CPU- and IO-intensive when the input tables are large. Shuffled hash join, on the other hand, saves the sort CPU and IO compared to sort merge join, especially when the tables are large.
This PR implements the full outer join as follows:
* Process rows from the stream side by looking up the hash relation, and mark the matched rows from the build side:
  * for joins with a unique key, a `BitSet` is used to record matched rows from the build side (`key index` represents each row);
  * for joins with a non-unique key, a `HashSet[Long]` is used to record matched rows from the build side (`key index` + `value index` represent each row).

  `key index` is defined as the index into the key addressing array `longArray` in `BytesToBytesMap`. `value index` is defined as the iterator index of the values for the same key.
* Process rows from the build side by iterating the hash relation, and filter out rows from the build side that were already looked up (done in `ShuffledHashJoinExec.fullOuterJoin`).
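The two-phase idea described above can be sketched in plain Scala. This is a hypothetical illustration: the real implementation tracks matches by `key index`/`value index` into `BytesToBytesMap`, not by tuples in Scala collections, and the names below are assumptions.

```scala
import scala.collection.mutable

// Sketch: full outer join over a build-side hash relation, tracking which
// build rows were matched so unmatched ones can be emitted afterwards.
object FullOuterShjSketch {
  def fullOuterJoin(
      streamRows: Seq[(Int, String)],               // (join key, payload)
      buildRows: Map[Int, Seq[String]]): Seq[(Option[String], Option[String])] = {
    // For non-unique keys, record (key index, value index) of every matched
    // build row; with unique keys a BitSet over key indices would suffice.
    val matched = mutable.HashSet.empty[(Int, Int)]
    val keyIndex = buildRows.keys.toSeq.zipWithIndex.toMap

    // Phase 1: stream side looks up the hash relation and marks matches.
    val streamOut = streamRows.flatMap { case (k, s) =>
      buildRows.get(k) match {
        case Some(vs) =>
          vs.zipWithIndex.map { case (b, vi) =>
            matched += ((keyIndex(k), vi))
            (Some(s), Some(b))
          }
        case None => Seq((Some(s), None))   // stream row with no build match
      }
    }

    // Phase 2: iterate the build side and emit rows that were never matched.
    val buildOut = buildRows.toSeq.flatMap { case (k, vs) =>
      vs.zipWithIndex.collect {
        case (b, vi) if !matched((keyIndex(k), vi)) => (None, Some(b))
      }
    }
    streamOut ++ buildOut
  }
}
```

The stream side is scanned once and the build side twice (once to build the map, once to emit unmatched rows), which matches the cost argument made in the "Why are the changes needed?" section.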
For context, this PR was originally implemented as follows (up to commit https://github.com/apache/spark/pull/29342/commits/e3322766d4ea6d039f819a46e12dc8641ca59c63):
1. Construct the hash relation from the build side, with an extra boolean value at the end of each row to track lookup information (done in `ShuffledHashJoinExec.buildHashedRelation` and `UnsafeHashedRelation.apply`).
2. Process rows from the stream side by looking up the hash relation, and mark the matched rows from the build side as looked up (done in `ShuffledHashJoinExec.fullOuterJoin`).
3. Process rows from the build side by iterating the hash relation, and filter out rows from the build side that were already looked up (done in `ShuffledHashJoinExec.fullOuterJoin`).
See discussion of pros and cons between these two approaches
[here](https://github.com/apache/spark/pull/29342#issuecomment-672275450),
[here](https://github.com/apache/spark/pull/29342#issuecomment-672288194)
and
[here](https://github.com/apache/spark/pull/29342#issuecomment-672640531).
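The originally proposed variant can be sketched as carrying a mutable matched flag with each build row, instead of a separate match-tracking set. Hypothetical types and names; the real rows are `UnsafeRow`s with an extra boolean column appended.

```scala
// Sketch of the original approach: a matched flag stored with each build row.
final class FlaggedRow(val value: String) {
  var matched: Boolean = false
}

object FlaggedFullOuter {
  def join(
      streamRows: Seq[(Int, String)],
      build: Map[Int, Seq[FlaggedRow]]): Seq[(Option[String], Option[String])] = {
    // Stream side: look up the hash relation and flip the flag on each match.
    val streamOut = streamRows.flatMap { case (k, s) =>
      build.get(k) match {
        case Some(rows) =>
          rows.map { r => r.matched = true; (Some(s), Some(r.value)) }
        case None => Seq((Some(s), None))
      }
    }
    // Build side: unmatched rows are found by checking the in-row flag,
    // rather than consulting a separate BitSet/HashSet.
    val buildOut = build.values.flatten.collect {
      case r if !r.matched => (None, Some(r.value))
    }.toSeq
    streamOut ++ buildOut
  }
}
```

The tradeoff against the final approach (discussed in the linked comments) is that the flag widens every build row in the hash relation, whereas the side data structure only costs space proportional to the matched rows.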
TODO: codegen for full outer shuffled hash join can be implemented in
another followup PR.
### Why are the changes needed?
As implemented in this PR, full outer shuffled hash join has the overhead of iterating the build side twice (once to build the hash map, and once to output non-matching rows) and iterating the stream side once. However, full outer sort merge join needs to iterate both sides twice, and sorting a large table can be CPU- and IO-intensive. So full outer shuffled hash join can be more efficient than sort merge join when the stream side is much larger than the build side.
For the example query below, full outer SHJ saved 30% wall clock time compared to full outer SMJ.
```
def shuffleHashJoin(): Unit = {
  val N: Long = 4 << 22
  withSQLConf(
    SQLConf.SHUFFLE_PARTITIONS.key -> "2",
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "20000000") {
    codegenBenchmark("shuffle hash join", N) {
      val df1 = spark.range(N).selectExpr(s"cast(id as string) as k1")
      val df2 = spark.range(N / 10).selectExpr(s"cast(id * 10 as string) as k2")
      val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
      df.noop()
    }
  }
}
```
```
Running benchmark: shuffle hash join
  Running case: shuffle hash join off
  Stopped after 2 iterations, 16602 ms
  Running case: shuffle hash join on
  Stopped after 5 iterations, 31911 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join off                              7900           8301         567          2.1         470.9       1.0X
shuffle hash join on                               6250           6382          95          2.7         372.5       1.3X
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `JoinSuite.scala`,
`AbstractBytesToBytesMapSuite.java` and `HashedRelationSuite.scala`.
Closes #29342 from c21/full-outer-shj.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 8f0fef1)
The file was modified core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledJoin.scala (diff)
The file was modified core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)