drill
DRILL-7978: Fixed Width Format Plugin
#2282
Open

DRILL-7978: Fixed Width Format Plugin #2282

MFoss19 wants to merge 46 commits into apache:master from MFoss19:format-fixedwidth
MFoss19
MFoss193 years ago

DRILL-7978: Fixed Width Format Plugin

Description

Developing format plugin to parse fixed width files.

Fixed Width Text File Definition: https://www.oracle.com/webfolder/technetwork/data-quality/edqhelp/Content/introduction/getting_started/configuring_fixed_width_text_file_formats.htm

Documentation

Users can now create a format configuration to parse fixed width files.

Testing

Unit tests added. More to come.

luocooong luocooong assigned MFoss19 MFoss19 3 years ago
luocooong luocooong added documentation
luocooong luocooong added enhancement
luocooong luocooong added new-storage
cgivre
cgivre requested changes on 2021-07-29
cgivre3 years ago👍 1

@MFoss19
Thank you very much for this contribution to Drill and nice work! I left some review comments. General comment would be to please remove any extraneous commented out code and Drill does have an automated style sheet which you can install for Intellij or Eclipse.

Also, to pass the CI and merge this, all the files will need to have the Apache license in them at the beginning of the file. The only exception would be the fwf test files. For that, you're going to have to add an exclusion to the RAT check in the root pom.xml file.

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/pom.xml
18 limitations under the License.
19
20-->
21
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
cgivre3 years ago

Nit: Can you please reformat with 2 space indentation.

MFoss193 years ago

Done

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
85 }
86
87 reader = new BufferedReader(new InputStreamReader(fsStream, Charsets.UTF_8));
88
cgivre3 years ago

Nit: Please remove extra whitespace, here and elsewhere.

MFoss193 years ago

Done

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/test/java/org/apache/drill/exec/store/fixedwidth/TestFixedwidthRecordReader.java
114 System.out.println(expected);
115 assertEquals(25, results.rowCount());
116
117
//System.out.println(results.batchSchema());
cgivre3 years ago

Please remove any calls to System.out.println in unit tests.

MFoss193 years ago

Done

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/test/java/org/apache/drill/exec/store/fixedwidth/TestFixedwidthRecordReader.java
119
120
121 new RowSetComparison(expected).verifyAndClearAll(results);
122
System.out.println("Test complete.");
cgivre3 years ago

These last two lines are not necessary.

MFoss193 years ago

Done

contrib/format-fixedwidth/src/test/java/org/apache/drill/exec/store/fixedwidth/TestFixedwidthRecordReader.java
119
120
121 new RowSetComparison(expected).verifyAndClearAll(results);
122
System.out.println("Test complete.");
cgivre3 years ago

Please add tests for:

  • Serialization/Deserialization
  • Compressed file
  • Various invalid schemata. For instance, what happens if you don't have fields defined in the config and try to query the data?
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
92
93 @Override
94 public boolean next() { // Use loader to read data from file to turn into Drill rows
95
cgivre3 years ago

There is a small issue with the next() function as written. You define maxRecords but don't do anything with it. The maxRecords is the LIMIT which gets pushed down from the query. The idea being that if a user does a SELECT ... LIMIT 10 your reader should stop reading as soon as that limit is reached. Which means that the next function should return false when the limit has been reached. The good news is that your writer object actually has a method called limitReached(<maxRecords>) which will track this for you.

Another thing you might consider doing which would clean up the code a bit would be to make the parseLine method return true if there is more data to read, false if not. Also move the writer start and end to that method, then you could have a next method that looks like this:

@Override
  public boolean next() {
    recordCount = 0;
    while (!writer.isFull()) {
      if (!parseLine(writer)) {
        return false;
      }
    }
    return true;
  }
MFoss193 years ago

Fixed use of MaxRecords in next() function.

Still looking at moving writer start and end to next.

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
107
108 line = reader.readLine();
109 }
110
} catch (Exception e) {
cgivre3 years ago

I'd suggest tightening this a bit. Instead of capturing a generic Exception, perhaps use specific exceptions with different error messages. Alternatively just put the try/catch around the line(s) in which data is being read.

MFoss193 years ago

Modified to capture IOException. At first, had done this deliberately since parseLine would throw Runtime Exception, but captured this to rethrow as IOException for clarity.

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
121 @Override
122 public void close() {
123 try {
124
fsStream.close();
cgivre3 years ago

Consider using Autoclosables here. That will do all the error handling for you.

 @Override
  public void close() {
    if (fsStream != null) {
      AutoCloseables.closeSilently(fsStream);
      fsStream = null;
    }
  }
MFoss193 years ago

Done

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
143 return builder.buildSchema();
144 }
145
146
private void parseLine(String line, RowSetLoader writer) {
cgivre3 years ago

See comment above but I'd recommend making this method return a boolean value.

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
180 break;
181 default:
182 throw new RuntimeException("Unknown data type specified in fixed width. Found data type " + dataType);
183
cgivre3 years ago

Nit: Please remove excess whitespace here and elsewhere.

MFoss193 years ago

Done

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
179 writer.scalar(i).setTimestamp(timeStamp);
180 break;
181 default:
182
throw new RuntimeException("Unknown data type specified in fixed width. Found data type " + dataType);
cgivre3 years ago

Please convert this to a UserException.

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
157 DateTimeFormatter formatter = DateTimeFormatter.ofPattern(dateTimeFormat, Locale.ENGLISH);
158
159 switch (dataType) {
160
case INT:
cgivre3 years ago

It looks like we are only supporting INT, VARCHAR, DATE, TIME and TIMESTAMP. Do we want to support a few other data types such as LONG or DOUBLE?

MFoss193 years ago

Yes, will add LONG and DOUBLE. Also looking to add FLOAT and DECIMAL.

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFieldConfig.java
cgivre3 years ago

Please add Apache license to all files.

MFoss193 years ago

Done

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFieldConfig.java
32 return dataType;
33 }
34
35
// public void setDataType(TypeProtos.MinorType dataType){
cgivre3 years ago

Please remove struts here and elsewhere.

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFieldConfig.java
cgivre3 years ago

For all serializable classes in Drill, I would recommend overriding equals, hashcode, and toString() methods.

MFoss193 years ago

Done

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFormatPlugin.java
65 return EasyFormatConfig.builder()
66 .readable(true)
67 .writable(false)
68
.blockSplittable(false)
cgivre3 years ago

I think this actually might be blocksplittable.

paul-rogers
paul-rogers commented on 2021-08-19
paul-rogers3 years ago

@MFoss19, thanks much for the contribution. This should be very helpful for folks.

My comments are about some of the nuances of creating a storage plugin. I reference a couple of Drill internals. Let me know if you want more details about these.

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFieldConfig.java
37 @JsonProperty("fieldName") String fieldName,
38 @JsonProperty("dateTimeFormat") String dateTimeFormat,
39 @JsonProperty("startIndex") int startIndex,
40
@JsonProperty("fieldWidth") int fieldWidth) {
paul-rogers3 years ago

Drill has no config builder UI, users have to create JSON configs by hand. We've found it is helpful to keep field names short (Go-style.) So, perhaps:

  • dataTypetype
  • fieldNamename
  • dateTimeFormat long, but OK
  • startIndexindex
  • fieldWidthwidth

Note that, since this is a FieldConfig we don't need the field prefix.

Also, it can help readers if the "primary key" fields are first, so perhaps change the order to

  • name
  • index
  • width
  • type
  • dateTimeFormat
MFoss193 years ago

Modified names and order.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFieldConfig.java
42 this.fieldName = fieldName;
43 this.dateTimeFormat = dateTimeFormat;
44 this.startIndex = startIndex;
45
this.fieldWidth = fieldWidth;
paul-rogers3 years ago

Since configs are created by hand, having good defaults is helpful. Perhaps:

  • name: required; throw an exception if blank, or if the stripped name is not a valid SQL symbol.
  • type: default to VARCHAR
  • dateTimeFormat: null is allowed, so no default.
  • index: required, must be >= 0.
  • width: either required, or can be optional. If provided must be > 0. (See below.)

For this plugin, we also have to check the set of fields.

  • No two fields can have the same name.
  • Should fields be allowed to overlap?

We could be clever. Scan all fields and sort into ascending order. If a field omits the width, just compute it from the index of this and the next field.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFieldConfig.java
33 private final int startIndex;
34 private final int fieldWidth;
35
36
public FixedwidthFieldConfig(@JsonProperty("dataType") TypeProtos.MinorType dataType,
paul-rogers3 years ago

Does it work to use the MinorType here? Is that type set up for Jackson serialization? I don't know the answer, just noting we should double-check to ensure it works OK.

MFoss193 years ago

This was tested configuring the plugin within the Drill UI (via manual test). I will also add an automated unit test for parsing the Json. To answer your question, yes it works to use MinorType here. Jackson can always read in Java enums, MinorType was generated as part of our protobuf and for this type of data they generate proper Java enums.

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFormatConfig.java
77 .field("fields", fields)
78 .toString();
79 }
80
}
paul-rogers3 years ago

Nit: GitHub is complaining about a lack of final newlines.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFormatPlugin.java
64 return EasyFormatConfig.builder()
65 .readable(true)
66 .writable(false)
67
.blockSplittable(false)
paul-rogers3 years ago

I'm pretty sure fix-width files are splittable. If every records resides on a single line, they the file is spittable if we add code that, if the start offset !=0, scan to the next newline. And, on read, read rows until the file position is greater than the block end. See the text file (CSV) plugin for details (though, don't follow its implementation as that implementation is rather unique to that one use case.)

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
paul-rogers3 years ago

This uses "EVF V1". Your plugin provides schema information, and is thus a perfect fit for "EVF V2" which can use your schema information to set up the row set loader schema automatically for you.

cgivre3 years ago

@paul-rogers Is there an example somewhere of how to use the rowset loader to set up the schema automatically? Is this as simple as checking to see whether the schema is provided and if so, use that?

paul-rogers3 years ago

@cgivre, it's been a while since I looked at that stuff. As I recall, the CSV reader was converted, but it is a bit obscure. The unit test also show what can be done, IIRC.

Basically, V2 gets schema from multiple sources:

  • Schema info in the plan object (if available in the future)
  • Provided schema (the current interim solution to system-provided schema)
  • Schema implied by the selected columns (for example, a[1] means a has to be an array.)
  • Schema defined by the data source itself (as in Parquet, or in CSV where you can choose the columns array)
  • Schema discovered by the traditional "schema on read"

The V2 SchemaNegotiator provides the needed methods. V2 will then come up with the final schema. V2 lets you read the entire row (columns a, b, c, say) even if the query wants only column a: V2 silently discards the unwanted columns.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
142 for (FixedwidthFieldConfig field : config.getFields()) {
143 value = line.substring(field.getStartIndex() - 1, field.getStartIndex() + field.getFieldWidth() - 1);
144 dataType = field.getDataType();
145
dateTimeFormat = field.getDateTimeFormat();
paul-rogers3 years ago

This is OK, but slow because of the switch. There is a set of field converter classes which can handle the string-to-whatever conversions. With that, there is a direct call per field (inner loop) from reading the field to convert to write into value vectors. The field-specific-type switch is done only once, at setup time.

These converters are used in the CSV reader when a schema is provided. I can dig up more examples if helpful.

paul-rogers3 years ago

To be clearer: what we want is to minimize the per-field work. Ideally, we'd set up an array of column converters so the loop looks like:

for (int i = 0; i < config.getFields().size(); i++) {
FixedwidthFieldConfig field = config.getFields().get(i);
value = line.substring(field.getIndex() - 1, field.getIndex() + field.getWidth() - 1);
writer.scalar(i).setString(value)
}


However, the above requires the user to specify the config. For every file. On every Drill cluster. Better, if the schema is not given, infer it from the first (few) row(s). Then, set up runtime field objects:

```java
for (FieldReader fieldReader : fieldReaders) {
  fieldReader.load(line);
}

The field reader:

  • Remembers the start and offset
  • Pulls out the string, handling a truncated line
  • Calls a cached column writer

@cgivre mentioned the idea of a column converter. There is a defined set for the common cases. The underlying mechanism sets them up for you. (V2 makes it simpler.) That way, a call to setString() directly invokes the thing that converts from string and writes the resulting value: no per-column switch needed.

lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging 2d17f1b into 0c9451e - view on LGTM.com

new alerts:

  • 2 for Unused format argument
lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging 18380ea into f4ea90c - view on LGTM.com

new alerts:

  • 2 for Unused format argument
MFoss19 MFoss19 force pushed from 18380eaf to a91be4c6 3 years ago
lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging a91be4c into f4ea90c - view on LGTM.com

new alerts:

  • 2 for Unused format argument
lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging dc60d28 into b6da35e - view on LGTM.com

new alerts:

  • 2 for Unused format argument
estherbuchwalter estherbuchwalter force pushed from dc60d28e to 05ae3f1d 3 years ago
lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging 05ae3f1 into 58ced60 - view on LGTM.com

new alerts:

  • 2 for Unused format argument
luocooong luocooong removed new-storage
luocooong luocooong added new-format
jnturton
jnturton commented on 2021-11-04
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
138 private boolean parseLine(String line, RowSetLoader writer) throws IOException {
139 int i = 0;
140 TypeProtos.MinorType dataType;
141
String dateTimeFormat;
jnturton3 years ago (edited 3 years ago)

@cgivre @MFoss19 @estherbuchwalter here we are reading column data types from the format config, where we also specify their start and stop offsets. But this format plugin can also accept data types from a provided schema. So my question is: can we drop the data type information from the format config so that we don't introduce multiple ad-hoc ways of specifying this info? This is genuinely a question because I don't know this subject well, but should we not work with data type specs here exactly the same way we do for CSV (cannot be provided in the format config I don't think)?

cgivre3 years ago

My original understanding of this was that for the fixed width plugin was that it would work in a similar manner to the log regex reader where the user provides the schema in the config, either in the format config or at query time using the table() function.

paul-rogers3 years ago

What you want to do is to resolve the schema at open time, not when parsing. At open time, you can:

  • Get the schema from the plugin, if provided, and use that as the schema.
  • Else, sample the first line to infer the schema.

Since this is a fixed format, we don't want to rediscover the schema on every line: that costs too much. (Think of the case of reading 100M or 1B rows: optimizing the inner loop is critical.)

MFoss19 MFoss19 force pushed from 05ae3f1d to 9d66f911 3 years ago
lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging 9d66f91 into 52838ef - view on LGTM.com

new alerts:

  • 2 for Unused format argument
lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging 9b95c45 into 52838ef - view on LGTM.com

new alerts:

  • 2 for Unused format argument
jnturton
jnturton3 years ago (edited 3 years ago)

@MFoss19 @estherbuchwalter following some recent chat with @paul-rogers and my last comment here, how about a reduced format config such as the following? The goal is to get to something terse and consistent with what we do for other text formats.

"fixedwidth": {
  "type": "fixedwidth",
  "extensions": [
    "fwf"
  ],
  "extractHeader": true,
  "trimStrings": true,
  "columnOffsets": [1, 11, 21, 31],
  "columnWidths": [10, 10, 10, 10]
}

Column names and types can already come from a provided schema or aliasing after calls to CAST(). Incidentally, the settings above can be overriden per query using a provided schema too.

There's also a part of that wonders whether we could have justified adding our fixed width functionality to the existing delimited text format reader.

cgivre
cgivre3 years ago

@MFoss19 @estherbuchwalter following some recent chat with @paul-rogers and my last comment here, how about a reduced format config such as the following? The goal is to get to something terse and consistent with what we do for other text formats.

"fixedwidth": {
  "type": "fixedwidth",
  "extensions": [
    "fwf"
  ],
  "extractHeader": true,
  "trimStrings": true,
  "columnOffsets": [1, 11, 21, 31],
  "columnWidths": [10, 10, 10, 10]
}

Column names and types can already come from a provided schema or aliasing after calls to CAST(). Incidentally, the settings above can be overriden per query using a provided schema too.

There's also a part of that wonders whether we could have justified adding our fixed width functionality to the existing delimited text format reader.

@dzamo In this case, I'd respectfully disagree here. In effect, the configuration is providing a schema to the user, similar to the way the logRegex reader works. In this case, the user will get the best data possible if we can include datatypes and field names in the schema, so that they can just do a SELECT * and not have to worry about casting etc.

Let's consider a real world use case: some fixed width log generated by a database. Since the fields may be mashed together, there isn't a delimiter that you can use to divide the fields. You could use however the logRegex reader to do this. That point aside for the moment, the way I imagined someone using this was that different configs could be set up and linked to workspaces such that if a file was in the mysql_logs folder, it would use the mysql log config, and if it was in the postgres it would use another.

My opinion here is that the goal should be to get the cleanest data to the user as possible without the user having to rely on CASTs and other complicating factors.

jnturton
jnturton3 years ago (edited 3 years ago)

Let's consider a real world use case: some fixed width log generated by a database. Since the fields may be mashed together, there isn't a delimiter that you can use to divide the fields. You could use however the logRegex reader to do this. That point aside for the moment, the way I imagined someone using this was that different configs could be set up and linked to workspaces such that if a file was in the mysql_logs folder, it would use the mysql log config, and if it was in the postgres it would use another.

@cgivre This use case would still work after two CREATE SCHEMA statements to set the names and data types, wouldn't it? The schemas would be applied every subsequent query.

My opinion here is that the goal should be to get the cleanest data to the user as possible without the user having to rely on CASTs and other complicating factors.

Let's drop the CASTs, those aren't fun. So we're left with different ways a user can specify column names and types.

  1. With a CREATE SCHEMA against a directory.
  2. With an inline schema to a table function.
  3. With some plugin-specific format config that works for this plugin but generally not for others.

Any one requires some effort, any one gets you to select * returning nice results (disclaimer: is this claim I'm making actually true?) which is super valuable. So shouldn't we avoid the quirky 3 and commit to 1 and 2 consistently wherever we can?

lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging f9e96fe into 42e7b77 - view on LGTM.com

new alerts:

  • 2 for Unused format argument
lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging 428a512 into 14d96d1 - view on LGTM.com

new alerts:

  • 2 for Unused format argument
lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging 428a2dd into 17f3654 - view on LGTM.com

new alerts:

  • 2 for Unused format argument
lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging 881d465 into 38d0c1d - view on LGTM.com

new alerts:

  • 2 for Unused format argument
lgtm-com
lgtm-com3 years ago

This pull request introduces 2 alerts when merging 56d8f6e into 38d0c1d - view on LGTM.com

new alerts:

  • 2 for Unused format argument
cgivre
cgivre requested changes on 2021-11-25
cgivre3 years ago

@MFoss19 @estherbuchwalter I added some review comments. Please make sure the unit tests pass and also that there are no code style violations before pushing to github.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
78 throw UserException
79 .dataReadError(e)
80 .message("Failed to open input file: {}", split.getPath().toString())
81
.addContext(errorContext)
cgivre3 years ago

You can remove this second line. Also, please add e.getMessage() to the message line.

paul-rogers3 years ago

I like that second line (if you mean the message call). Another solution is to change message() to addContext(). This way, we preserve the message from the actual error, and add context to explain the source of the error. Then, as Charles suggested, we don't need the addContext(e.getMessage()) bit.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
88
89 @Override
90 public boolean next() { // Use loader to read data from file to turn into Drill rows
91
String line;
cgivre3 years ago

This line should be in the open() method.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
104 throw UserException
105 .dataReadError(e)
106 .message("Failed to read input file: {}", split.getPath().toString())
107
.addContext(errorContext)
cgivre3 years ago

For the error message, you don't need to have multiple addContext() calls. The main thing is to pass the errorContext. I would add the e.getMessage() to the message() call.

cgivre3 years ago

Here and elsewhere.

paul-rogers3 years ago

See explanation above. Ideally:

      throw UserException
        .dataReadError(e)
        .addContext("Failed to read input file: {}", split.getPath().toString())
        .addContext(errorContext)
        .addContext("Line Number", lineNum)
        .build(logger);

Thanks for adding the line number: nice touch.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
91 String line;
92 RowSetLoader writer = loader.writer();
93
94
try {
cgivre3 years ago

Why not include this in the loop?

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
108 .addContext(e.getMessage())
109 .addContext("Line Number", lineNum)
110 .build(logger);
111
}
cgivre3 years ago

The next() method needs some work. Really this should be called nextBatch() as the next method returns true when there is more data, to read, false if not.

  @Override
  public boolean next() {
    while (!rowWriter.isFull()) {
      if (!processNextLine()) {
        return false;
      }
    }
    return true;
  }

This method will iterate through the batch of data, and when the rowWriter is full, (IE the batch is full) it will stop reading, BUT the method will return true because there is more data to read. The limit is pushed down in the processNextLine() method.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
71 errorContext = negotiator.parentErrorContext();
72 lineNum = 0;
73 try {
74
fsStream = negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
cgivre3 years ago

Here's where you can check to see whether the user provided a schema or not. You could do something like this:

   if (negotiator.hasProvidedSchema()) {
      TupleMetadata providedSchema = negotiator.providedSchema();
      // Build column writer array
      negotiator.tableSchema(finalSchema, true);
    } else {
      negotiator.tableSchema(buildSchema(), true);
    }
paul-rogers3 years ago

This is done internally in V2. V2 does it that way because it became clear that there is no reason for every reader to have to know to do this same pattern.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
114
115 @Override
116 public void close() {
117
if (fsStream != null){
cgivre3 years ago

This line should be out of the if statement.

paul-rogers3 years ago

The reason it an be out of the if statement is that the method itself handles null values. Otherwise, the code would be fine as it is if it called close() directly.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFormatPlugin.java
64 return EasyFormatConfig.builder()
65 .readable(true)
66 .writable(false)
67
.blockSplittable(false) // Change to true
cgivre3 years ago

Change to true.

paul-rogers3 years ago

But only if the additional work described above is done.

paul-rogers
paul-rogers requested changes on 2021-11-25
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
96 while (!writer.isFull() && line != null) {
97 writer.start();
98 parseLine(line, writer);
99
writer.save();
paul-rogers3 years ago

You can lose a line here: read the next line, but the writer is full, so we discard it. As @cgivre suggests:

while (!writer.isFull()) {
  String line = reader.readLine();
  if (line == null) {
    break;
  }
  writer.start();
  parseLine(line, writer);
  writer.save();
}
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
70 split = negotiator.split();
71 errorContext = negotiator.parentErrorContext();
72 lineNum = 0;
73
try {
paul-rogers3 years ago

Fixed-width files seem perfect for HDFS splits.

So there is a trick here for old-school HDFS systems (if anyone still runs them.) A large file will be split across HDFS nodes, often at 256MB boundaries. A reader that supports "splits" has to handle the fact that node 1 will read the first 256 MB, node 2 the next 256 MB, and so on. Standard HDFS stuff.

This means the reader has to accept the offset and scan for the first record separator after that start point. If the fixed-width records are newline-terminated, that means finding the next newline.

Also, the reader has to stop after it finds the record terminator after its assigned split. (On HDFS, that means part of the last record will be read from a remote HDFS node.)

Further, it means that line numbers, as counted here, are relative: they are from the start of the current split (since node 2, reading the second split, doesn't know the number of records in the first split.)

The CSV reader handles this busy-work, but it a way that is a bit hard to follow as an example, sadly.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFieldConfig.java
30
31@JsonTypeName("fixedwidthReaderFieldDescription")
32@JsonInclude(JsonInclude.Include.NON_DEFAULT)
33
public class FixedwidthFieldConfig implements Comparable<FixedwidthFieldConfig> {
paul-rogers3 years ago

Just a high-level design note: we now have multiple plugin configs that ask the user to use ad-hoc formats for specifying a per-file schema. This is not how storage plugins were meant to be used, but it is all we have.

Over time, we need a better solution. The provided-schema is a start: it provides a single, well-defines syntax for schema. But, it is a bit limited in handling per-format specific, such as width. (There are extension properties for this kind of information, but that's a bit fiddly.)

We need a way to provide the schema of a file simply as a file the user creates that sits along side the data file. Something like a schema.json file if all files in a directory have the same schema, or a foo.json if foo.txt has a distinct schema.

Support for such a file should be added to the V2 mechanism. (Just use the new format in place of the provided schema.)

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFieldConfig.java
48 @JsonCreator
49 public FixedwidthFieldConfig(@JsonProperty("name") String name,
50 @JsonProperty("index") int index,
51
@JsonProperty("width") int width,
paul-rogers3 years ago

This makes the user do the math. Given indexes, we can compute the width. Given widths, we can compute the index. The one missing piece might the the width of the field separator (if any).

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFormatConfig.java
162 }
163 }
164
165
@JsonIgnore
paul-rogers3 years ago

No need for this tag: Jackson doesn't know what to do with this method anyway.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFormatConfig.java
57 Collections.sort(fields);
58 this.fields = fields;
59
60
validateFieldInput();
paul-rogers3 years ago

This is just a bit dangerous. We validate on deserialize. This seems like a great idea: we do the deserialize in the UI when the user saves the config. But, we also do it on system start. If the deserialize fails there, Drill won't start and it takes a long time to figure out why.

We don't have a good answer for config validation. I'd suggest adding a validate() method that we call:

  • When saving a config from the UI
  • When planning a query using a config

But, allow an invalid config to be stored. Otherwise, unless everything is perfect, nothing can be saved. And, JSON, in a text editor, is a horrible way to write complex config such as this one.

contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFormatConfig.java
192 if (!Pattern.matches("[a-zA-Z]\\w*", name)) {
193 throw UserException
194 .validationError()
195
.message("Invalid input: " + name)
paul-rogers3 years ago

This message will (I hope) be shown to the poor user trying to get a config right in the Drill web UI. Can we be more specific? Such as, "Name is not valid. Only letters allowed."

But, then, why don't we allow numbers or underscores? "Field_3"? If we allowed that, you could use the Java method. Or, extend the pattern.

Also, it is cleaner to use name = name.strip() to remove both leading and trailing whitespace so we don't leave whitespace in the name. Otherwise, all users have to know to do their own strip() call.

estherbuchwalter3 years ago

The regex here says that the first character must be a letter, and the rest (\w*) must be alphabetical, numerical, or the underscore.

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFormatPlugin.java
89 builder.nullType(Types.optional(TypeProtos.MinorType.VARCHAR));
90 return builder;
91 }
92
}
paul-rogers3 years ago

Nit: missing newline.

MFoss193 years ago

Fixed

Conversation is marked as resolved
Show resolved
contrib/format-fixedwidth/src/test/java/org/apache/drill/exec/store/fixedwidth/TestFixedwidthRecordReader.java
51 FixedwidthFormatConfig formatConfig = new FixedwidthFormatConfig(Lists.newArrayList("fwf"),
52 Lists.newArrayList(
53 new FixedwidthFieldConfig("Number", 1, 5, TypeProtos.MinorType.VARDECIMAL),
54
new FixedwidthFieldConfig("Address",12, 3,TypeProtos.MinorType.INT, ""),
paul-rogers3 years ago

Nit: spacing after comma. Remove unwanted format as per previous line.

MFoss193 years ago

Fixed spacing and removed unnecessary dateTimeFormats

contrib/format-fixedwidth/src/test/java/org/apache/drill/exec/store/fixedwidth/TestFixedwidthRecordReader.java
135 .addNullable("Letter", TypeProtos.MinorType.VARCHAR)
136 .buildSchema();
137 RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
138
.addRow(567, Instant.parse("2021-02-10T15:30:27.00Z"), LocalDate.parse("2021-02-10"), "test")
paul-rogers3 years ago

Generally, for something like this, you can use a LIMIT 2 and only check the first row or two. The code is not going to change column order after the 10th row!

contrib/format-fixedwidth/src/test/resources/fwf/test_blankrow.fwf
977.77 yzzz 777 06-05-7777 05:11:11 06-05-7777T05:11:11.11Z
1088.88 aabb 888 07-06-8888 06:22:22 07-07-8888T06:22:22.22Z
1188.88 aabb 888 07-06-8888 06:22:22 07-07-8888T06:22:22.22Z
12
paul-rogers3 years ago

Is the blank line intentional? Kind of hard to document data files. You could allow comments (lines that start with #, say, but there is always someone who has valid data that starts with #.) Another idea would be to add a "explanation.txt" file that explains the purpose of each of the data files.

paul-rogers3 years ago

Also, I'd hoped to see all manner of invalid files:

  • Truncated line (not just an entirely blank line)
  • String where a number should be.
  • Badly formatted date or time.

Such files will ensure that the error handling works and will raise that perpetual question: if we're a hundred million lines into a fixed-width file, and we hit an error, should we ignore that line and move to the next, or should we fail the query?

MFoss193 years ago

It is intentional to test what would happen if there was a blank line in a user input file. Good idea to add the explanation.txt file and additional invalid files.

jnturton
jnturton3 years ago

Dear PR author and reviewers.

This is a generic message to say that we would like to merge this PR in time for the 1.20 release. Currently we're targeting a master branch freeze date of 2021-12-10 (10 Dec). Please strive to complete development and review by this time, or indicate that the PR will need more time (and how much).

Thank you.

jnturton jnturton marked this pull request as draft 3 years ago
jnturton
jnturton3 years ago

Hi @paul-rogers. We're in the throes of trying to convert this plugin to use EVF v2 / scan.v3. This will be the first instance of this kind in the Drill code base, apart from a very simple mock plugin which supports unit tests (CompliantTextBatchReader remains based on EVF v1, from what I can see).

Something that's confusing me is that the EasyFormatPlugin base class is coded against the ManagedReader interface from EVF v1. So I cannot see that we can both derive from EasyFormatPlugin, and also implement ManagedReader from EVF v2. Am I missing something here?

Thanks, James

paul-rogers
paul-rogers3 years ago (edited 3 years ago)

@jnturton , thanks for pushing the EVF V2 stuff forward! The EasyFormatPlugin should contain "shims" for the original format, for EVF1 and for EVF2. Given that you said you can't find it, and that the CSV reader is still based on V1, I wonder if there is some branch that never got pushed a PR? I'll do some research to determine what's what.

paul-rogers
paul-rogers3 years ago👍 1

@jnturton, turns out the required changes are sitting in a branch in my private repo, csv, that somehow never not converted to a PR. I'll see if I can merge that stuff into a PR.

paul-rogers
paul-rogers3 years ago

PR 2419 has the EVF V2 adapter for the Easy Format Plugin. I suggest that you use that code, and follow the example there, to add EVF V2 support here.

Start of fixed width format plugin
8fd4018e
Work in Progress. Producing Rows. Currently complains about buffer no…
5d1ea8b2
First working version
7ef5cd30
Added more data types, refactored code
6f4a2e77
Checkstyle fixes
f59d4e43
Removed println statement from Batch Reader, Simplified logic
9f2648c2
Modified format, fixed maxRecords in next(), modified Exception handl…
8c3f6ebd
Addressing Review Comments.
7c3b5a23
Added Serialization/Deserialization test, added blank row test file, …
57d49db0
Fixed Serialization/Deserialization test
1a4818ee
Added another constructor to enable user to not have to enter dateTim…
1f1051ea
Added method to validate field name input and verify there are no dup…
cb1b9327
Added two getters to FixedwidthFormatConfig to prep for offset verifi…
84198624
estherbuchwalter Added a check for overlapping fields
31e1549b
estherbuchwalter Updated check for overlapping fields
07edbded
Added field validation for data types, indices, width. Includes creat…
fa47a143
Modified validation for field width and field index. Added comments t…
523366aa
estherbuchwalter Added to field validation for field names. Checks for valid length an…
1a91592e
WIP converting to EVF v2. Pushing to repo for troubleshooting purposes.
4b221b5f
MFoss19 MFoss19 force pushed from 054b4c03 to 4b221b5f 3 years ago
Start of fixed width format plugin
4875367d
Work in Progress. Producing Rows. Currently complains about buffer no…
ef0bc826
First working version
be14e258
Added more data types, refactored code
c9014d2a
Checkstyle fixes
6d7a2a5e
Removed println statement from Batch Reader, Simplified logic
056df138
Modified format, fixed maxRecords in next(), modified Exception handl…
d2097b3e
Addressing Review Comments.
f2920a07
Added Serialization/Deserialization test, added blank row test file, …
7a68da5f
Fixed Serialization/Deserialization test
e0110da4
Added another constructor to enable user to not have to enter dateTim…
f01f1aa1
Added method to validate field name input and verify there are no dup…
5da7a773
Added two getters to FixedwidthFormatConfig to prep for offset verifi…
22ccbcbb
estherbuchwalter Added a check for overlapping fields
ecf6fb8e
estherbuchwalter Updated check for overlapping fields
1978e14c
Added field validation for data types, indices, width. Includes creat…
a79f8a55
Modified validation for field width and field index. Added comments t…
32e53122
estherbuchwalter Added to field validation for field names. Checks for valid length an…
4d534df3
WIP converting to EVF v2. Pushing to repo for troubleshooting purposes.
aa74ec53
Updating pom.xml with new drill snapshot version
1972fb9b
MFoss19 Merge branch 'apache:master' into format-fixedwidth
f4981235
tswagger Renamed classes
1e75757d
tswagger Merge branch 'master' into format-fixedwidth
a134619e
tswagger Merge remote-tracking branch 'megan/format-fixedwidth' into format-fi…
dfe894e5
Merge branch 'format-fixedwidth' of github.com:MFoss19/drill into for…
28df7b2d
Merge branch 'format-fixedwidth' of github.com:MFoss19/drill into for…
b68c54dd
tswagger Updated pom.xml
bf6a16c3
lgtm-com
lgtm-com3 years ago

This pull request introduces 1 alert when merging bf6a16c into 4e97f5c - view on LGTM.com

new alerts:

  • 1 for Unused format argument

Login to write a write a comment.

Login via GitHub

Assignees
Labels
Milestone