feat: 🎸 make the queue agnostic to the types of jobs (#608)
Important changes:
- the queue is now agnostic to the types of jobs. libqueue now also provides a Worker class that manages the jobs. The service `services/worker` does not exist anymore and is replaced with two small projects: `workers/splits` and `workers/first_rows`. It will make it easier to contribute a new worker.
- upgrade the datasets library
- for simplicity, remove the concept of job "retries", because we never really evaluated whether it helps
- remove the `created_at` field from /admin/pending_jobs for coherence
- docker: install libsndfile1 from the Ubuntu repository instead of building from source
- tests: fix a small issue with the authentication, see https://github.com/huggingface/datasets/issues/4875#issuecomment-1280744233
---
all commits:
* feat: 🎸 make the queue agnostic to the types of jobs
Before, we had two collections: one for splits jobs and one for
first-rows jobs. Now there is only one collection, named "jobs", with a
"type" field. Note that the job arguments are still restricted to
dataset (required) and, optionally, config and split.
BREAKING CHANGE: 🧨 two collections are removed and a new one is created. The function names
have changed too.
* feat: 🎸 publish new version
* feat: 🎸 upgrade to libqueue 0.3.0
* feat: 🎸 remove created_at field in pending_jobs
* style: 💄 fix style
* feat: 🎸 upgrade libqueue to 0.3.0
* refactor: 💡 use an enum to prevent typos
* fix: 🐛 fix mypy
* feat: 🎸 upgrade libqueue to 0.3.0 and datasets
* test: 💍 fix test
* refactor: 💡 pack the queue functions into a Queue class
* refactor: 💡 use relative imports
* feat: 🎸 upgrade to libqueue 0.3.1
* refactor: 💡 use relative imports
* feat: 🎸 upgrade to libqueue 0.3.1
* refactor: 💡 use relative imports
* feat: 🎸 upgrade to libqueue 0.3.1
* refactor: 💡 use a common Worker class for the loop logic
* refactor: 💡 simplify the code
* refactor: 💡 factor process_job in Worker, and remove refresh
refresh... functions are now the "compute" abstract method
* test: 💍 temporarily disable unrelated failing tests
* test: 💍 fix tests
* refactor: 💡 add Worker to libqueue
* chore: 🤖 install types
* feat: 🎸 upgrade to libqueue 0.3.2
also: move types-requests dependency to dev dependencies.
* refactor: 💡 new project isolating the /first-rows worker
Note: we removed apache-beam for now because of an issue with the installation.
It must be added again later.
* feat: 🎸 create a new project: worker_splits
it only contains the splits/ worker
* chore: 🤖 add the commented-out dependency as a reminder to reinstall it
* feat: 🎸 replace services/workers with the workers/
beware: the docker images don't exist, we will have to update
* ci: 🎡 fix argument name
* fix: 🐛 fix details and upgrade docker images for admin and api
* feat: 🎸 upgrade docker images for the two workers
* fix: 🐛 reinstall apache beam, pinned to 2.41.0
* fix: 🐛 use absolute imports, not relative imports
* fix: 🐛 upgrade httplib2 to remove safety alert
* test: 💍 hack the tests order to fix the CI?
* test: 💍 restore the tests order to show the problem
if the tests fail, it means that a side effect occurs somewhere
* test: 💍 force datasets to patch csv for streaming every time
see
https://github.com/huggingface/datasets/issues/4875#issuecomment-1280821172
* style: 💄 fix style
* test: 💍 don't store the HF token on the disk
we explicitly pass it as an argument, so no need to store it on the
disk
* chore: 🤖 install libsndfile1 from repos instead of building it
now the current package version is 1.0.31, no need to build it from
source.
* chore: 🤖 add a missing package to have ICU work
* feat: 🎸 update the docker images
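---
illustration (not the actual libqueue code):
A minimal sketch of the design described above: a single "jobs" collection with a "type" field, an enum for the job types, and a common Worker class whose "compute" abstract method is implemented by each worker. Everything here is an assumption made for illustration: the in-memory queue, the `JobType` values, and the names `add_job`, `start_job`, `process_next_job` and `SplitsWorker` are hypothetical and may not match libqueue.

```python
from abc import ABC, abstractmethod
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Deque, Optional


class JobType(str, Enum):
    # Hypothetical values; an enum avoids typos in job type strings.
    SPLITS = "/splits"
    FIRST_ROWS = "/first-rows"


@dataclass
class Job:
    type: JobType
    dataset: str                  # required
    config: Optional[str] = None  # optional
    split: Optional[str] = None   # optional


class Queue:
    """In-memory stand-in for the single "jobs" collection."""

    def __init__(self) -> None:
        self._jobs: Deque[Job] = deque()

    def add_job(self, job: Job) -> None:
        self._jobs.append(job)

    def start_job(self, job_type: JobType) -> Optional[Job]:
        # Return the oldest waiting job of the requested type, if any.
        for job in self._jobs:
            if job.type == job_type:
                self._jobs.remove(job)
                return job
        return None


class Worker(ABC):
    """Common loop logic; subclasses only implement compute()."""

    job_type: JobType

    def __init__(self, queue: Queue) -> None:
        self.queue = queue

    @abstractmethod
    def compute(self, job: Job) -> str:
        """Do the work for one job and return a result."""

    def process_next_job(self) -> Optional[str]:
        job = self.queue.start_job(self.job_type)
        if job is None:
            return None
        return self.compute(job)


class SplitsWorker(Worker):
    job_type = JobType.SPLITS

    def compute(self, job: Job) -> str:
        # Placeholder result; a real worker would compute the splits.
        return f"splits for {job.dataset}"
```

With this shape, contributing a new worker would only require a new `JobType` member and a `Worker` subclass that implements `compute`; the queue itself does not need to know about the new type.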