trino
Use new parseHiveDate for OpenX reader to remove any characters after yyyy-mm-dd
#25792
Open

ljw9111 wants to merge 1 commit into trinodb:master from ljw9111:openx-date-parse-issue
ljw9111 commented 9 days ago

Description

The parseHiveDate method in HiveFormatUtils.java, which the native OpenX reader was using, only recognized
a space as the delimiter after 'yyyy-mm-dd'. As a result, '2025-01-04 00:00:00.000Z' was correctly parsed
as '2025-01-04', but strings such as '2025-01-04T00:00:00.000Z' or '2025-01-04AA00:00:00.000Z'
threw exceptions and were read as null.
This PR adds a new parseHiveDate method for the OpenX reader that uses a regex to drop any characters after 'yyyy-mm-dd', regardless of the delimiter.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(O) Release notes are required, with the following suggested text:

## Section
* The native OpenX reader now ignores any characters after 'yyyy-mm-dd', regardless of the delimiter. Previously, it only supported space delimiters, but now it correctly parses formats like '2025-01-04T00:00:00.000Z' or '2025-01-04AA00:00:00.000Z' to '2025-01-04' ({issue}issuenumber)
cla-bot added the cla-signed label
ljw9111 requested a review from dain 9 days ago
pettyjamesm commented on 2025-05-14 (9 days ago)

It seems like we have two options here:

  1. Preserve compatibility with the (original) OpenX JSON SerDe, which means preserving the semantics of new SimpleDateFormat("yyyy-MM-dd").parse(value) (see: JavaStringDateObjectInspector.parse)
  2. Match Hive's parsing semantics, in one of the following variations:
    • Later versions of Hive 3, which support parsing timestamps that separate the date and time with either a space or an ISO 8601 style 'T' (see: Hive 3 Date.valueOf)
    • Hive 4+ style parsing, which ignores anything after the date portion (see: Hive 4 Date.valueOf)

If we're preserving the original OpenX serde semantics, that means:

  • Ignoring any contents of the string after the date portion (similar to Hive 4+)
  • (optionally) also supporting date strings that use only a single digit for the month portion, e.g. "2025-1-1", since that's what SimpleDateFormat did.
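
To make the two points above concrete, here is a minimal sketch of how the JDK's SimpleDateFormat("yyyy-MM-dd") treats these inputs. It exercises only the JDK class, not the OpenX SerDe itself; the class name SimpleDateFormatSemantics, the helper parseAndReformat, and the fixed UTC time zone are assumptions made for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Illustrative only: shows JDK SimpleDateFormat behavior, not OpenX SerDe code.
final class SimpleDateFormatSemantics
{
    private SimpleDateFormatSemantics() {}

    // Hypothetical helper: parses with "yyyy-MM-dd" and formats the result back,
    // returning null when the value cannot be parsed at all.
    static String parseAndReformat(String value)
    {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        // Fixed zone (an assumption) so the round trip does not depend on the host
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            // DateFormat.parse(String) reads from the start of the string and
            // is allowed to leave trailing text unconsumed
            Date parsed = format.parse(value);
            return format.format(parsed);
        }
        catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(parseAndReformat("2025-01-04T00:00:00.000Z"));  // 2025-01-04
        System.out.println(parseAndReformat("2025-01-04AA00:00:00.000Z")); // 2025-01-04
        System.out.println(parseAndReformat("2025-1-1"));                  // 2025-01-01
        System.out.println(parseAndReformat("not a date"));                // null
    }
}
```

Trailing characters are tolerated because parse(String) does not require the whole input to be consumed, and single-digit months work because pattern letter counts are not enforced when parsing separated numeric fields.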

I would support matching the permissive OpenX SimpleDateFormat equivalent behavior, but only for the OpenX SerDe (as is the approach here, avoiding using the shared HiveFormatUtils.parseHiveDate method).

cc: @electrum, @dain - any thoughts on what the compatibility goal should be here and whether special casing this logic in OpenX is the right approach to take?

Conversation is marked as resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/openxjson/OpenXJsonDeserializer.java
private static LocalDate parseHiveDate(String value)
{
    value = value.trim();

    String datePortion = value.replaceAll("(\\d+[-]\\d+[-]\\d+).*", "$1");
pettyjamesm 9 days ago

Using a regex to do this is going to be prohibitively expensive (especially when the input doesn't require any adaptation). We'll want a better way to handle this.

ljw9111 9 days ago

Thanks for the review! I updated the method without using regex
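
The updated method is not shown in this thread. As a rough sketch of what a regex-free approach could look like, the following scans the leading digits-hyphen-digits-hyphen-digits prefix by hand and ignores whatever follows. The class name DatePrefixParser, the method parseDatePrefix, the 9-digit cap per field, and returning null on failure are all assumptions for illustration, not the PR's actual code.

```java
import java.time.DateTimeException;
import java.time.LocalDate;

// Illustrative sketch only; names and error handling are hypothetical.
final class DatePrefixParser
{
    private DatePrefixParser() {}

    // Parses a leading "year-month-day" prefix and ignores any trailing text.
    // Returns null if the prefix is missing or is not a valid calendar date.
    static LocalDate parseDatePrefix(String value)
    {
        value = value.trim();
        int[] fields = new int[3];
        int index = 0;
        for (int field = 0; field < 3; field++) {
            int start = index;
            int number = 0;
            // Cap each field at 9 digits so the int accumulator cannot overflow
            while (index < value.length() && index - start < 9) {
                char c = value.charAt(index);
                if (c < '0' || c > '9') {
                    break;
                }
                number = number * 10 + (c - '0');
                index++;
            }
            if (index == start) {
                return null; // no digits where a field was expected
            }
            fields[field] = number;
            // Year and month must each be followed by a hyphen
            if (field < 2) {
                if (index == value.length() || value.charAt(index) != '-') {
                    return null;
                }
                index++;
            }
        }
        try {
            return LocalDate.of(fields[0], fields[1], fields[2]);
        }
        catch (DateTimeException e) {
            return null; // e.g. month 13 or day 32
        }
    }

    public static void main(String[] args)
    {
        System.out.println(parseDatePrefix("2025-01-04T00:00:00.000Z"));  // 2025-01-04
        System.out.println(parseDatePrefix("2025-01-04AA00:00:00.000Z")); // 2025-01-04
        System.out.println(parseDatePrefix("2025-13-01"));                // null
    }
}
```

A single pass like this allocates nothing beyond the small field array when the input is already a plain date, which addresses the cost concern raised about replaceAll.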

ljw9111 assigned ljw9111 9 days ago
ljw9111 force-pushed from 5cd34102 to e8101852 9 days ago
ljw9111 Use new parseHiveDate for OpenX reader to remove any characters after…
46ed36ea
ljw9111 force-pushed from e8101852 to 46ed36ea 12 hours ago
