Adds inspect-ai (#1022)
Adds inspect-ai as a backend for lighteval, offloading backend implementation and maintenance.
This allows for:
- better logs
- better parallelization
- easier task additions
Tasks compatible with inspect-ai so far (eventually all tasks will be compatible):
- gpqa (few-shot compatible)
- ifeval
- hle
- gsm8k (few-shot compatible)
- agieval
- aime24, aime25
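The commands below select tasks with comma-separated spec strings. A minimal sketch of how such a string is composed, assuming the `suite|task|num_fewshot` format inferred from the examples in this PR (the variable name `TASKS` is just an illustration, not a CLI feature):

```
# Each spec is "suite|task|num_fewshot"; multiple specs are joined with commas.
# "|3" requests a 3-shot prompt for tasks that support few-shot (e.g. gsm8k).
TASKS="lighteval|gsm8k|0,lighteval|gsm8k|3"
echo "$TASKS"
```

Passing this string as the task argument of `lighteval eval` would run both the 0-shot and 3-shot variants in one invocation.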
### Run Llama-3.1-8B across all `hf-inference-providers` providers on `gpqa`, `agieval`, and `aime25`:
```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gpqa|0,lighteval|agieval|0,lighteval|aime25|0" \
  --max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```
Result:
```
| Model |agieval|aime25|gpqa|
|----------------------------------------------------------------------|------:|-----:|---:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras | 0.53| 0|0.33|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai| 0.71| 1|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai | 0.71| 0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius | 0.53| 0|0.20|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita | 0.65| 0|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova | 0.71| 0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway | 0.35| 0|0.25|
```
### Compare few-shot settings on `gsm8k`:
```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gsm8k|0,lighteval|gsm8k|3" \
  --max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```
```
| Model |gsm8k|gsm8k_3_shots|
|----------------------------------------------------------------------|----:|------------:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras | 0.6| 0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai| 0.7| 0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai | 0.7| 0.8|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius | 0.6| 0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita | 0.5| 0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova | 0.7| 0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway | 0.4| 0.8|
```
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>