spark
[SPARK-48658][SQL] Encode/Decode functions report coding errors instead of mojibake for unmappable characters
#47017

Closed

yaooqinn wants to merge 12 commits into apache:master from yaooqinn:SPARK-48658

yaooqinn 323 days ago

What changes were proposed in this pull request?

This PR makes the encode/decode functions report coding errors instead of producing mojibake for unmappable characters. Take select encode('渭城朝雨浥轻尘', 'US-ASCII') as an example.

Before this PR:

???????

After this PR:

org.apache.spark.SparkRuntimeException
{
  "errorClass" : "MALFORMED_CHARACTER_CODING",
  "sqlState" : "22000",
  "messageParameters" : {
    "charset" : "US-ASCII",
    "function" : "`encode`"
  }
}
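
The difference between the two behaviors can be reproduced outside Spark with the JDK's charset encoder. The following is only a minimal sketch, not the code in this PR: the class and method names are hypothetical, and it assumes the old behavior corresponds to CodingErrorAction.REPLACE and the new one to CodingErrorAction.REPORT.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Spark.
final class EncodeErrorActionSketch {

    // Encode `text`, applying `action` to both malformed and unmappable input.
    static byte[] encode(String text, String charsetName, CodingErrorAction action)
            throws CharacterCodingException {
        ByteBuffer out = Charset.forName(charsetName)
                .newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String text = "渭城朝雨浥轻尘";
        try {
            // REPLACE: each unmappable character becomes the charset's replacement byte.
            byte[] replaced = encode(text, "US-ASCII", CodingErrorAction.REPLACE);
            System.out.println(new String(replaced, StandardCharsets.US_ASCII));

            // REPORT: the first unmappable character raises an exception instead.
            encode(text, "US-ASCII", CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            System.out.println("coding error: " + e);
        }
    }
}
```

With REPLACE the data loss is silent, which is the data quality problem this PR targets. With REPORT the caller gets an exception it can surface as an error.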

Why are the changes needed?

Improve data quality.

Does this PR introduce any user-facing change?

Yes.

When spark.sql.legacy.codingErrorAction is set to true, the encode/decode functions replace unmappable characters with mojibake instead of reporting coding errors.
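
As a rough sketch of how a boolean flag like this can switch behavior on the decode side, the example below maps a flag to a JDK CodingErrorAction. The parameter legacyErrorAction is a hypothetical stand-in for spark.sql.legacy.codingErrorAction, the class is illustrative only, and the mapping from flag to action is an assumption based on the description above rather than a copy of Spark's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical illustration class; not part of Spark.
final class LegacyFlagSketch {

    // `legacyErrorAction` stands in for spark.sql.legacy.codingErrorAction (assumed mapping).
    static String decode(byte[] bytes, String charsetName, boolean legacyErrorAction)
            throws CharacterCodingException {
        CodingErrorAction action =
                legacyErrorAction ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        return Charset.forName(charsetName)
                .newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] invalid = {(byte) 'a', (byte) 0xFF, (byte) 'b'};
        try {
            // Legacy mode: the bad byte is replaced with U+FFFD.
            System.out.println(decode(invalid, "UTF-8", true));
            // Default mode: the bad byte is reported as an error.
            decode(invalid, "UTF-8", false);
        } catch (CharacterCodingException e) {
            System.out.println("coding error: " + e);
        }
    }
}
```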

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

[SPARK-48658][SQL] Encode/Decode functions report coding error instea…

6426fddd

github-actions added SQL

[SPARK-48658][SQL] Encode/Decode functions report coding error instea…

aee78a58

[SPARK-48658][SQL] Encode/Decode functions report coding error instea…

afb2d08a

fix ExplainSuite

f6dd4fa9

fix golden file tests

851135c9

fix golden file tests

d3473a45

github-actions added CONNECT

yaooqinn 323 days ago

cc @cloud-fan @dongjoon-hyun @LuciferYang @HyukjinKwon thanks

cloud-fan commented on 2024-06-19

Conversation is marked as resolved

cloud-fan commented on 2024-06-19

Conversation is marked as resolved

Encode RuntimeReplaceable with StaticInvoke

3e90976d

Decode RuntimeReplaceable with StaticInvoke

b0cf6eeb

fix tests

64f3c393

Merge branch 'master' into SPARK-48658

9d2583c7

fix

8f5a2360

yaooqinn 321 days ago

Hi @cloud-fan, I have addressed your comments. The expressions are now replaced at runtime by static invoke, and the string representations no longer contain those legacy flags.

yaooqinn requested a review from cloud-fan 318 days ago

cloud-fan commented on 2024-06-24

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala

107

strExpr = StringDecode(Encode(strExpr, "utf-8"), "utf-8")

cloud-fan 318 days ago

Can we use a different expression for testing? The codegen size is greatly decreased after using StaticInvoke in Encode.

cloud-fan 318 days ago

e.g. StringTrim

yaooqinn 318 days ago

Nice catch!

cloud-fan approved these changes on 2024-06-24

HyukjinKwon commented on 2024-06-24

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

    "instead of reporting coding errors.")
    .version("4.0.0")
    .booleanConf
    .createWithDefault(false)

HyukjinKwon 318 days ago

I wonder if it should be a fallback conf to ANSI.

yaooqinn 318 days ago

The reasons I'd like to make it independent of ANSI are:

- Part of the implication of ANSI is Hive incompatibility.
- Hive also reports coding errors, so the old behavior was a mistake made when we ported this from Hive.
- These functions are not defined by ANSI.
- The error behaviors are not specified by ANSI either.

Together, these reasons indicate that this behavior is more of a legacy trait of Spark itself.

address comments

d7a4199f

yaooqinn closed this 318 days ago

yaooqinn 318 days ago

Merged to master.

Thank you @cloud-fan @HyukjinKwon for the help

yaooqinn deleted the SPARK-48658 branch 317 days ago

Reviewers

cloud-fan

HyukjinKwon

Assignees

No one assigned

Labels

SQL CONNECT

Milestone

No milestone

   }

   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    nullSafeCodeGen(ctx, ev, (bytes, charset) => {
-      val fromCharset = ctx.freshName("fromCharset")
-      val sc = JavaCode.global(
-        ctx.addReferenceObj("supportedCharsets", supportedCharsets),
-        supportedCharsets.getClass)
-      s"""
-        String $fromCharset = $charset.toString();
-        try {
-          if ($legacyCharsets || $sc.contains($fromCharset.toUpperCase(java.util.Locale.ROOT))) {
-            ${ev.value} = UTF8String.fromString(new String($bytes, $fromCharset));
-          } else {
-            throw new java.io.UnsupportedEncodingException();
-          }
-        } catch (java.io.UnsupportedEncodingException e) {
-          throw QueryExecutionErrors.invalidCharsetError("$prettyName", $fromCharset);
-        }
-      """
-    })
+    val expr = ctx.addReferenceObj("this", this)
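
For readers unfamiliar with the removed codegen, the Java it generated amounts roughly to the check below. This is a hedged standalone sketch: the class name is hypothetical, the SUPPORTED set is an assumed stand-in for supportedCharsets, and IllegalArgumentException stands in for QueryExecutionErrors.invalidCharsetError. Note that new String(bytes, charsetName) silently substitutes a replacement character for malformed input, which is the behavior this PR moves away from.

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Spark.
final class CharsetCheckSketch {

    // Assumed contents; stands in for `supportedCharsets` in the removed code.
    static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
            "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16", "UTF-32"));

    // Mirrors the shape of the removed generated code: charset check first, then decode.
    static String decode(byte[] bytes, String charsetName, boolean legacyCharsets) {
        try {
            if (legacyCharsets || SUPPORTED.contains(charsetName.toUpperCase(Locale.ROOT))) {
                // Malformed bytes are replaced silently here, never reported.
                return new String(bytes, charsetName);
            }
            throw new UnsupportedEncodingException(charsetName);
        } catch (UnsupportedEncodingException e) {
            // Stand-in for QueryExecutionErrors.invalidCharsetError.
            throw new IllegalArgumentException("invalid charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        byte[] invalid = {(byte) 0xFF};
        // No error is raised; the result is a single replacement character.
        System.out.println(decode(invalid, "utf-8", false).equals("\uFFFD"));
    }
}
```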

-Project [decode(cast(g#0 as binary), UTF-8, false) AS decode(g, UTF-8)#0]
+Project [decode(cast(g#0 as binary), UTF-8, false, false) AS decode(g, UTF-8)#0]
