dataset-viewer
de9116fb - feat: primitive parquet reader with page pruning (#3244)

Commit

49 days ago

feat: primitive parquet reader with page pruning (#3244)

* refactor(libcommon): remove unused `RowsIndex.partial` and `duckdb_index_is_partial()`
* chore: restore RowsIndex.partial
* build: use a single compose file with .env file
* chore: add .env.debug configuration
* chore: add .env.debug
* feat: primitive parquet reader with page pruning
* add poetry build for libviewer
* add libviewer to rows
* refactor: only extract metadata and don't try to calculate offset index
* ci: update dockerfiles to include the rust toolchain and libviewer
* chore: pin python to 3.12.11 in libviewer and update lockfile
* feat: use PageIndexPolicy to optionally read offset index
* feat: support querying RowsIndex with page pruning
* build: add libviewer as a dependency to libcommon
* style: ruff format libcommon changes
* chore: use query_with_page_pruning from the rows endpoint
* chore: fix mypy errors
* style: import Sequence from collections.abc
* build: don't use libviewer as an editable dependency
* build: try to configure poetry to properly install libviewer
* ci: temporarily disable poetry cache
* style: fix ruff check errors
* build: relock projects depending on libcommon
* build: add rust toolchain to more dockerfiles
* build: copy the entire libviewer directory in dockerfiles because poetry install is called at the build phase
* build: turn libviewer into an optional dependency due to build difficulties
* chore: missing api stage from dockerfile
* ci: install libviewer extra in the libcommon build
* style: fix ruff check error in parquet utils
* ci: disable poetry cache
* feat: raise TooBigRows exceptions if the scan size would exceed a limit
* feat: implement binary truncation for page pruning reader
* style: ignore variable shadowing ruff check
* ci: install libviewer in the worker image
* feat: pass hf_token to the opendal store
* chore: remove files_to_index estimation
* chore: poetry lock worker service
* chore: remove redundant gitignore entries from libviewer
* ci: install libviewer in the worker build
* style: fix mypy ignore
* chore: cleanup the libviewer python code
* style: try to please mypy due to missing import
* style: make token optional
* test: make the mocking compatible with the page pruning reader in test_first_rows
* test(libviewer): add a generic test case to exercise sync scanning
* ci(libviewer): try to add a github actions job for libviewer
* chore(libviewer): relock poetry
* chore(libviewer): add and install pytest as a dev dependency
* ci(libviewer): add style build for libviewer
* ci(libviewer): remove style build
* ci(libviewer): don't inherit secrets in the libviewer tests
* chore: debug
* chore: debug
* chore: debug
* chore: debug
* chore: debug
* chore: temp disable libviewer
* chore: don't pass file size to read_metadata
* chore: check that the metadata file exists
* chore: capture backtrace
* chore: capture backtrace
* chore: force capture backtrace
* chore: force dev profile
* chore: try not to load index
* deps(libviewer): use opendal fork supporting custom HF_ENDPOINT
* chore: debug e2e builds
* chore: run only the first rows test
* chore: run a single test for rows endpoint
* chore: run all the e2e tests
* chore: rglob metadata dir
* chore: mark plan errors
* chore: extra error msg
* chore: use more recent fork of opendal
* chore: hardcode ci-hub
* chore: list available datafiles on error
* chore: list available metadatafiles on error
* chore: list available datafiles on error
* chore: more debug
* chore: show revision
* chore: show response
* chore: use refs/convert/parquet
* chore: encode refs/convert/parquet
* chore: update opendal revision
* chore: read the offset index again
* fix: percent encode slash in revision
* chore: skip reading page index
* deps(libviewer): update pyo3 arrow and parquet
* chore: read page indices
* chore: try out with large prefetch
* fix(libviewer): gracefully handle corrupted pyarrow metadata
* chore: some cleanup
* chore: only initialize a single index depending on config
* feat: return files_to_index to indicate whether page pruning was used or not
* chore: ruff format
* chore: cargo format
* chore: adjust tests
* feat: add progressbar to the libviewer indexer
* chore: use global configuration through should_use_libviewer()
* chore: ruff check
* feat(worker): use libviewer to write metadata files
* chore(worker): fix mypy checks
* ci: add an e2e test with libviewer enabled
* chore: make LibviewerConfig frozen
* style: fix import order
* ci: fix gha config
* chore: fix first_rows errors
* chore: define num_rows_total at the RowsIndex level
* chore: forward hf_endpoint
* chore: forward hf_endpoint
* chore: mypy libcommon
* chore: default for LIBVIEWER_ENABLE_FOR_DATASETS
* chore: remove comment from dotenv
* chore: only enable libviewer for the rows service
* ci: don't run the two e2e tests in parallel
* ci: don't pass the libviewer enable flag
* ci: enable libviewer in worker
* ci: enable libviewer in rows endpoint
* ci: enable libviewer in both rows and worker
* chore(worker): force to use old metadata writing
* chore: hardcode ci-hub
* chore: restore
* test: add a libviewer test variant for worker parquet_metadata.py
* test: fix mypy issue
* fix admin ui for make dev-start
* chore: disable python<->rust log bridge
* ci: re-enable poetry cache
* enable for a few datasets
* minor
* mypy

---------

Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
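
The page pruning named in the commit title can be pictured with a small sketch. It is an illustration only and is not taken from libviewer. It assumes each page of a column chunk records the index of its first row (as a Parquet offset index does), and that a request asks for `length` rows starting at `offset`. The function name `pages_for_rows` and its parameters are hypothetical.

```python
from bisect import bisect_right


def pages_for_rows(first_row_indices, num_rows, offset, length):
    """Return the indices of the pages that overlap rows [offset, offset + length).

    first_row_indices: sorted first-row index of every page (hypothetical input,
    modelled on the per-page first row index stored in a Parquet offset index).
    num_rows: total number of rows covered by those pages.
    """
    if length <= 0 or offset >= num_rows:
        return []
    end = min(offset + length, num_rows)  # exclusive end of the requested range
    # The last page whose first row is <= offset holds the first requested row.
    first_page = bisect_right(first_row_indices, offset) - 1
    # The last page whose first row is <= end - 1 holds the last requested row.
    last_page = bisect_right(first_row_indices, end - 1) - 1
    return list(range(first_page, last_page + 1))


# Hypothetical column chunk with four pages starting at rows 0, 100, 200 and 300.
pages = pages_for_rows([0, 100, 200, 300], num_rows=400, offset=150, length=100)
print(pages)  # [1, 2]
```

Only the selected pages need to be fetched and decoded; all other pages are skipped, which is what makes a row-range request cheaper than scanning the whole file.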

References

#3244 - feat: primitive parquet reader with page pruning

Author

kszucs

Parents

00c87456
