cc @cloud-fan @dongjoon-hyun @LuciferYang @HyukjinKwon thanks
Hi @cloud-fan, I have addressed your comments. The expressions are now replaced at runtime by static invoke, and the string representations no longer contain those legacy flags.
strExpr = StringDecode(Encode(strExpr, "utf-8"), "utf-8")
Can we use a different expression for testing, e.g. StringTrim? The codegen size is greatly decreased after using StaticInvoke in Encode.
Nice catch!
"instead of reporting coding errors.")
.version("4.0.0")
.booleanConf
.createWithDefault(false)
I wonder if it should be a fallback conf to ANSI.
The reasons I'd like to make it independent of ANSI are:
The reasons mentioned above indicate that this behavior is more of a legacy trait of Spark itself.
Merged to master.
Thank you @cloud-fan @HyukjinKwon for the help
What changes were proposed in this pull request?
This PR makes the encode/decode functions report coding errors instead of producing mojibake for unmappable characters. Take
select encode('渭城朝雨浥轻尘', 'US-ASCII')
as an example.
Before this PR,
After this PR,
Why are the changes needed?
Improve data quality.
Does this PR introduce any user-facing change?
Yes.
When spark.sql.legacy.codingErrorAction is set to true, the encode/decode functions replace unmappable characters with mojibake instead of reporting coding errors.
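For readers unfamiliar with the two behaviors, here is a minimal sketch of the difference using the plain JVM charset API. It is not the code in this PR: the object name CodingErrorSketch and the helpers encodeLenient and encodeStrict are hypothetical, and the assumption is only that the legacy behavior corresponds to replacing unmappable characters while the new behavior corresponds to reporting them.

```scala
import java.nio.CharBuffer
import java.nio.charset.{CharacterCodingException, Charset, CodingErrorAction}

// Hypothetical illustration only; not the implementation used by Spark.
object CodingErrorSketch {

  // Lenient encoding: String.getBytes(Charset) always substitutes the
  // charset's replacement bytes for unmappable characters.
  def encodeLenient(s: String, charsetName: String): Array[Byte] =
    s.getBytes(Charset.forName(charsetName))

  // Strict encoding: the encoder is told to REPORT problems, so an
  // unmappable character raises a CharacterCodingException, returned as Left.
  def encodeStrict(
      s: String,
      charsetName: String): Either[CharacterCodingException, Array[Byte]] = {
    val encoder = Charset.forName(charsetName)
      .newEncoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try {
      val buf = encoder.encode(CharBuffer.wrap(s))
      val bytes = new Array[Byte](buf.remaining())
      buf.get(bytes)
      Right(bytes)
    } catch {
      case e: CharacterCodingException => Left(e)
    }
  }

  def main(args: Array[String]): Unit = {
    val text = "渭城朝雨浥轻尘"
    // Each character is replaced, so the original text is lost.
    println(new String(encodeLenient(text, "US-ASCII"), "US-ASCII"))
    // The strict variant surfaces the problem instead of hiding it.
    println(encodeStrict(text, "US-ASCII").isLeft)
  }
}
```

The strict variant returns an Either so that a caller can decide how to surface the failure; how Spark maps such a failure to a user-facing error is outside the scope of this sketch.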
How was this patch tested?
new unit tests
Was this patch authored or co-authored using generative AI tooling?
no