DRILL-5080: Memory-managed version of external sort
Please see the JIRA entry for the reasons for the revision, the design
spec, and the list of changes.
This PR covers the changes to the external sort itself. Tests for this
operator require the test framework in DRILL-5126 and the mock data
source in DRILL-5152. Tests for this operator will be issued as a
separate PR once those two dependencies are committed.
Until then, the new operator is disabled by default. It can be enabled
using drill.sort.external.disable_managed: false.
The operator now spills before receiving a new batch. Revised the memory
and merge calculations to make them clearer and to leave more margin of
error for the power-of-two allocations used when allocating vectors.
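
To illustrate why the extra margin matters, here is a minimal sketch of
power-of-two rounding. It assumes the allocator rounds each request up
to the next power of two; the class and method names are hypothetical
and are not taken from the Drill code base.

```java
// Hypothetical illustration only; not the allocator used by the sort.
final class PowerOfTwoMargin {

  // Round a requested size up to the next power of two
  // (assumed allocator behavior).
  static long roundUpToPowerOfTwo(long size) {
    if (size <= 1) {
      return 1;
    }
    long highest = Long.highestOneBit(size);
    return highest == size ? size : highest << 1;
  }

  // Bytes reserved beyond what the data itself needs.
  static long wastedBytes(long size) {
    return roundUpToPowerOfTwo(size) - size;
  }

  public static void main(String[] args) {
    long requested = 9L * 1024 * 1024; // 9 MB of vector data
    System.out.println("requested=" + requested
        + " allocated=" + roundUpToPowerOfTwo(requested)
        + " wasted=" + wastedBytes(requested));
  }
}
```

Under this assumption a vector holding just over half of a power-of-two
size reserves nearly twice the memory it uses, so a memory estimate
based only on data size can be off by almost a factor of two.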
We have two external sort implementations, but only one operator code
for both, so they must share a single Metrics enum. When new metrics
were added to the new version, matching metrics were not added to the
old one. This fixes that issue. (The issue will go away once the old one
is retired.)
Revised memory calculations to reflect the limit of 16 MB per vector.
The current revision limits each output batch to 16 MB to be safe. The
next revision will enforce per-vector limits to allow the overall batch
to be larger when possible.
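
As a rough sketch of how a 16 MB batch cap translates into a row count,
assuming an estimated row width in bytes is available. The class name,
the method name and the 65,536-row ceiling are assumptions made for
illustration, not values taken from this change.

```java
// Hypothetical illustration only; not the calculation used by the sort.
final class OutputBatchLimit {

  // 16 MB cap per output batch, as described above.
  static final long MAX_BATCH_BYTES = 16L * 1024 * 1024;

  // Assumed upper bound on rows in a single batch.
  static final int MAX_ROWS = 65536;

  // Number of rows that fit in one output batch for a given row width.
  // Always returns at least 1 so that a very wide row can still be output.
  static int rowsPerBatch(int estimatedRowWidth) {
    if (estimatedRowWidth <= 0) {
      throw new IllegalArgumentException("row width must be positive");
    }
    long rows = MAX_BATCH_BYTES / estimatedRowWidth;
    return (int) Math.max(1, Math.min(rows, MAX_ROWS));
  }

  public static void main(String[] args) {
    System.out.println(rowsPerBatch(1024)); // 1 KB rows
    System.out.println(rowsPerBatch(16));   // narrow rows hit the row ceiling
  }
}
```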
Also simplified the merge-time calculations.
The original code provided only crude methods to learn the size of a
record batch. This change adds a "RecordBatchSizer" to provide detailed
analysis so the sort can know the amount of memory used to buffer a
batch, the number of rows, and the expected row width once the rows are
copied to a spill file or the output.
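
The three quantities named above can be sketched as follows. This is
not the RecordBatchSizer API; the class, its constructor and its fields
are hypothetical, and the per-column byte counts are assumed to be
supplied by the caller.

```java
// Hypothetical illustration only; not the RecordBatchSizer class.
final class BatchSizeSketch {

  final int rowCount;        // number of rows in the batch
  final long dataBytes;      // bytes that actually hold row data
  final long allocatedBytes; // bytes reserved to buffer the batch

  BatchSizeSketch(int rowCount, long[] columnDataBytes,
                  long[] columnAllocatedBytes) {
    this.rowCount = rowCount;
    long data = 0;
    for (long b : columnDataBytes) {
      data += b;
    }
    long allocated = 0;
    for (long b : columnAllocatedBytes) {
      allocated += b;
    }
    this.dataBytes = data;
    this.allocatedBytes = allocated;
  }

  // Expected width of one row once copied out, rounded up to whole bytes.
  int estimatedRowWidth() {
    if (rowCount == 0) {
      return 0;
    }
    return (int) ((dataBytes + rowCount - 1) / rowCount);
  }
}
```

The gap between allocatedBytes and dataBytes is the part of the buffered
memory that does not carry over when rows are copied to a spill file or
the output.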
Moved generic spill classes to a separate package.
Created parameters for spill batch size and merge batch size. Separated
these values in code. Deprecated the min and max spill parameters as
they no longer add much value. Minor code rearranging.
Bug fix
Fixes a corner case of merging spilled files in a low-memory condition.
Fixes from code review
close apache/drill#717