spark
[SPARK-48658][SQL] Encode/Decode functions report coding errors instead of mojibake for unmappable characters
#47017
Closed

[SPARK-48658][SQL] Encode/Decode functions report coding errors instead of mojibake for unmappable characters #47017

yaooqinn wants to merge 12 commits into apache:master from yaooqinn:SPARK-48658
yaooqinn
yaooqinn323 days ago

What changes were proposed in this pull request?

This PR makes encode/decode functions report coding errors instead of mojibake for unmappable characters, take select encode('渭城朝雨浥轻尘', 'US-ASCII') as an example

Before this PR,

???????

After this PR,

org.apache.spark.SparkRuntimeException
{
  "errorClass" : "MALFORMED_CHARACTER_CODING",
  "sqlState" : "22000",
  "messageParameters" : {
    "charset" : "US-ASCII",
    "function" : "`encode`"
  }
}

Why are the changes needed?

Improve data quality.

Does this PR introduce any user-facing change?

Yes.

When set spark.sql.legacy.codingErrorAction to true, encode/decode functions replace unmappable characters with mojibake instead of reporting coding errors.

How was this patch tested?

new unit tests

Was this patch authored or co-authored using generative AI tooling?

no
yaooqinn [SPARK-48658][SQL] Encode/Decode functions report coding error instea…
6426fddd
github-actions github-actions added SQL
yaooqinn [SPARK-48658][SQL] Encode/Decode functions report coding error instea…
aee78a58
yaooqinn [SPARK-48658][SQL] Encode/Decode functions report coding error instea…
afb2d08a
yaooqinn fix ExplainSuite
f6dd4fa9
yaooqinn fix golden file tests
851135c9
yaooqinn fix golden file tests
d3473a45
github-actions github-actions added CONNECT
yaooqinn
yaooqinn323 days ago
cloud-fan
cloud-fan commented on 2024-06-19
Conversation is marked as resolved
Show resolved
connector/connect/common/src/test/resources/query-tests/explain-results/function_decode.explain
1Project [decode(cast(g#0 as binary), UTF-8, false) AS decode(g, UTF-8)#0]
1
Project [decode(cast(g#0 as binary), UTF-8, false, false) AS decode(g, UTF-8)#0]
cloud-fan323 days ago

shall we exclude these two flags from the pretty string of the expression?

yaooqinn323 days ago

OK

cloud-fan
cloud-fan commented on 2024-06-19
Conversation is marked as resolved
Show resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
27352759 }
27362760
27372761 override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
2738 nullSafeCodeGen(ctx, ev, (bytes, charset) => {
2739 val fromCharset = ctx.freshName("fromCharset")
2740 val sc = JavaCode.global(
2741 ctx.addReferenceObj("supportedCharsets", supportedCharsets),
2742 supportedCharsets.getClass)
2743 s"""
2744 String $fromCharset = $charset.toString();
2745 try {
2746 if ($legacyCharsets || $sc.contains($fromCharset.toUpperCase(java.util.Locale.ROOT))) {
2747 ${ev.value} = UTF8String.fromString(new String($bytes, $fromCharset));
2748 } else {
2749 throw new java.io.UnsupportedEncodingException();
2750 }
2751 } catch (java.io.UnsupportedEncodingException e) {
2752 throw QueryExecutionErrors.invalidCharsetError("$prettyName", $fromCharset);
2753 }
2754 """
2755 })
2762
val expr = ctx.addReferenceObj("this", this)
cloud-fan323 days ago

since we are here, shall we re-implement this expression with StaticInvoke?

yaooqinn Encode RuntimeReplaceable with StaticInvoke
3e90976d
yaooqinn Decode RuntimeReplaceable with StaticInvoke
b0cf6eeb
yaooqinn fix tests
64f3c393
yaooqinn Merge branch 'master' into SPARK-48658
9d2583c7
yaooqinn fix
8f5a2360
yaooqinn
yaooqinn321 days ago

Hi @cloud-fan, I have addressed your comments. The expressions are now replaced at runtime by static invoke, and the string representations no longer contain those legacy flags.

yaooqinn yaooqinn requested a review from cloud-fan cloud-fan 318 days ago
cloud-fan
cloud-fan commented on 2024-06-24
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
107107
strExpr = StringDecode(Encode(strExpr, "utf-8"), "utf-8")
cloud-fan318 days ago

can we use a different expression for testing? The codegen size is greatly decreased after using StaticInvoke in Encode.

cloud-fan318 days ago

e.g. StringTrim

yaooqinn318 days ago

Nice catch!

cloud-fan
cloud-fan approved these changes on 2024-06-24
HyukjinKwon
HyukjinKwon commented on 2024-06-24
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
5016 "instead of reporting coding errors.")
5017 .version("4.0.0")
5018 .booleanConf
5019
.createWithDefault(false)
HyukjinKwon318 days ago

I wonder if it should be a fallback conf to ANSI.

yaooqinn318 days ago

The reasons I'd like to make it independent of ANSI are:

  • Part of the implication of ANSI is Hive-incompatibility,
  • Hive also reports coding errors, so it was a mistake when we ported this from hive
  • These functions are not ANSI-defined
  • The error behaviors are also not found in ANSI

The reasons mentioned above indicate that this behavior is more of a legacy trait of Spark itself.

yaooqinn address comments
d7a4199f
yaooqinn yaooqinn closed this 318 days ago
yaooqinn
yaooqinn318 days ago

Merged to master.

Thank you @cloud-fan @HyukjinKwon for the help

yaooqinn yaooqinn deleted the SPARK-48658 branch 317 days ago

Login to write a write a comment.

Login via GitHub

Assignees
No one assigned
Labels
Milestone