Legal NLP tasks on Swiss data (#1032)
* Legal NLP tasks on Swiss data
* refactor: split Swiss legal multilingual tasks into modular package
Move Swiss legal task definitions into prompts/metrics/main modules and keep backward compatibility via swiss_legal_evals re-export.
* refactor: update higher_is_better type in MetricGrouping
Changed the type of higher_is_better from a dictionary of callables to a dictionary of booleans for clarity and type safety. This matches how the field has already been used.
* refactor: Updated prompts and implementation to match the latest SwiLTra-Bench code
* refactor: Enhance COMET and GEMBA metric loading with error handling
- Introduced functions to load COMET and GEMBA metrics, providing clear error messages for missing dependencies.
- Disabled specific COMET metrics due to numpy version conflicts, with warnings logged for skipped metrics.
- Updated GPU metrics list to reflect the disabled COMET metrics.
- Improved code organization and clarity in metric processing functions.
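A minimal sketch of the loading pattern described above. It is not the code from this change: `require_module`, `select_metrics`, and the metric names used with them are hypothetical, and only the standard library is used.

```python
import importlib
import logging
from types import ModuleType

logger = logging.getLogger(__name__)


def require_module(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a hint so the user knows how to fix the environment.
        raise ImportError(
            f"'{module_name}' is required for this metric. {install_hint}"
        ) from exc


def select_metrics(requested: list[str], disabled: set[str]) -> list[str]:
    """Drop disabled metrics and log a warning for each one skipped."""
    enabled = []
    for name in requested:
        if name in disabled:
            logger.warning("Skipping metric %s: currently disabled.", name)
            continue
        enabled.append(name)
    return enabled
```

Raising at load time rather than at first use surfaces a missing dependency before any generation has been run.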
* Add GEMBA dependency for Swiss legal evaluations and remove `suite` parameter from TranslationTask constructor.
* Fix batched metric aggregation for grouped metric names
* Fixed missing system prompt
* Judge models are now accessed through OpenRouter
* fix reasoning model token handling when max_tokens is unset
* chore: trigger PR update
* fix: return raw score for BLEU, CHRF, and TER metrics instead of scaled values
* fix: replaced accidental default value assignment with intended type hint
* fix: add error handling for unsupported languages in Swiss Landmark Decision Summarization judge
* fix: avoid huge negative BERTScore from baseline rescaling
Default `rescale_with_baseline` to False in BertScoreMultilingual. With
near-1.0 baselines (e.g. German xlm-roberta-large layer-24 ≈ 0.98), the
(score - baseline) / (1 - baseline) formula amplifies deviations ~50x,
and the subsequent x100 scaling compounds it, so empty or weak
predictions could yield scores like -5000.
Make rescaling opt-in, warn when enabled, and skip the x100 scaling in
that case since rescaled scores are not bounded to [0, 1].
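
The arithmetic can be checked with a short sketch. The 0.98 baseline is the approximate value quoted above; the raw score of 0.0 for an empty prediction is an assumption chosen for illustration, and the function below is a standalone helper, not the library's implementation.

```python
def rescale_with_baseline(score: float, baseline: float) -> float:
    """Apply the (score - baseline) / (1 - baseline) rescaling."""
    return (score - baseline) / (1.0 - baseline)


baseline = 0.98                  # approximate baseline quoted above
gain = 1.0 / (1.0 - baseline)    # amplification factor, about 50

raw_empty = 0.0                  # assumed raw score for an empty prediction
rescaled = rescale_with_baseline(raw_empty, baseline)  # about -49
scaled = rescaled * 100                                # about -4900

print(f"gain={gain:.1f} rescaled={rescaled:.1f} scaled={scaled:.1f}")
```

Under these assumptions a single empty prediction lands near -4900, the same order of magnitude as the scores reported above.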
---------
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>