feat: primitive parquet reader with page pruning (#3244)
* refactor(libcommon): remove unused `RowsIndex.partial` and `duckdb_index_is_partial()`
* chore: restore RowsIndex.partial
* build: use a single compose file with .env file
chore: add .env.debug configuration
chore: add .env.debug
feat: primitive parquet reader with page pruning
add poetry build for libviewer
add libviewer to rows
refactor: only extract metadata and don't try to calculate offset index
ci: update dockerfiles to include the rust toolchain and libviewer
chore: pin python to 3.12.11 in libviewer and update lockfile
feat: use PageIndexPolicy to optionally read offset index
feat: support querying RowsIndex with page pruning
build: add libviewer as a dependency to libcommon
style: ruff format libcommon changes
chore: use query_with_page_pruning from the rows endpoint
chore: fix mypy errors
style: import Sequence from collections.abc
build: don't use libviewer as an editable dependency
build: try to configure poetry to properly install libviewer
ci: temporarily disable poetry cache
style: fix ruff check errors
build: relock projects depending on libcommon
build: add rust toolchain to more dockerfiles
build: copy the entire libviewer directory in dockerfiles because poetry install is called at the build phase
build: turn libviewer into an optional dependency due to build difficulties
chore: missing api stage from dockerfile
ci: install libviewer extra in the libcommon build
style: fix ruff check error in parquet utils
ci: disable poetry cache
feat: raise TooBigRows exceptions if the scan size would exceed a limit
feat: implement binary truncation for page pruning reader
style: ignore variable shadowing ruff check
ci: install libviewer in the worker image
feat: pass hf_token to the opendal store
chore: remove files_to_index estimation
chore: poetry lock worker service
chore: remove redundant gitignore entries from libviewer
ci: install libviewer in the worker build
style: fix mypy ignore
chore: cleanup the libviewer python code
style: try to please mypy due to missing import
style: make token optional
test: make the mocking compatible with the page pruning reader in test_first_rows
* test(libviewer): add a generic test case to exercise sync scanning
* ci(libviewer): try to add a github actions job for libviewer
* chore(libviewer): relock poetry
* chore(libviewer): add and install pytest as a dev dependency
* ci(libviewer): add style build for libviewer
* ci(libviewer): remove style build
* ci(libviewer): don't inherit secrets in the libviewer tests
* chore: debug
* chore: debug
* chore: debug
* chore: debug
* chore: debug
* chore: temporarily disable libviewer
* chore: don't pass file size to read_metadata
* chore: check that the metadata file exists
* chore: capture backtrace
* chore: capture backtrace
* chore: force capture backtrace
* chore: force dev profile
* chore: try not to load index
* deps(libviewer): use opendal fork supporting custom HF_ENDPOINT
* chore: debug e2e builds
* chore: run only the first rows test
* chore: run a single test for rows endpoint
* chore: run all the e2e tests
* chore: rglob metadata dir
* chore: mark plan errors
* chore: extra error msg
* chore: use more recent fork of opendal
* chore: hardcode ci-hub
* chore: list available datafiles on error
* chore: list available metadatafiles on error
* chore: list available datafiles on error
* chore: more debug
* chore: show revision
* chore: show response
* chore: use refs/convert/parquet
* chore: encode refs/convert/parquet
* chore: update opendal revision
* chore: read the offset index again
* fix: percent encode slash in revision
* chore: skip reading page index
* deps(libviewer): update pyo3 arrow and parquet
* chore: read page indices
* chore: try out with large prefetch
* fix(libviewer): gracefully handle corrupted pyarrow metadata
* chore: some cleanup
* chore: only initialize a single index depending on config
* feat: return files_to_index to indicate whether page pruning was used or not
* chore: ruff format
* chore: cargo format
* chore: adjust tests
* feat: add progressbar to the libviewer indexer
* chore: use global configuration through should_use_libviewer()
* chore: ruff check
* feat(worker): use libviewer to write metadata files
* chore(worker): fix mypy checks
* ci: add an e2e test with libviewer enabled
* chore: make LibviewerConfig frozen
* style: fix import order
* ci: fix gha config
* chore: fix first_rows errors
* chore: define num_rows_total at the RowsIndex level
* chore: forward hf_endpoint
* chore: forward hf_endpoint
* chore: mypy libcommon
* chore: default for LIBVIEWER_ENABLE_FOR_DATASETS
* chore: remove comment from dotenv
* chore: only enable libviewer for the rows service
* ci: don't run the two e2e tests in parallel
* ci: don't pass the libviewer enable flag
* ci: enable libviewer in worker
* ci: enable libviewer in rows endpoint
* ci: enable libviewer in both rows and worker
* chore(worker): force use of the old metadata writing
* chore: hardcode ci-hub
* chore: restore
* test: add a libviewer test variant for worker parquet_metadata.py
* test: fix mypy issue
* fix admin ui for make dev-start
* chore: disable python<->rust log bridge
* ci: re-enable poetry cache
* enable libviewer for a few datasets
* minor
* mypy
---------
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>