Legal NLP tasks on Swiss data (#1032)

Commit

6 days ago

Legal NLP tasks on Swiss data (#1032)

* Legal NLP tasks on Swiss data

* refactor: split Swiss legal multilingual tasks into modular package

  Move Swiss legal task definitions into prompts/metrics/main modules and keep backward compatibility via swiss_legal_evals re-export.

* refactor: update higher_is_better type in MetricGrouping

  Changed the type of higher_is_better from a dictionary of callables to a dictionary of booleans for improved clarity and type safety. This is also how it has been used so far.

* refactor: Updated prompts and implementation to match the latest SwiLTra-Bench code

* refactor: Enhance COMET and GEMBA metric loading with error handling

  - Introduced functions to load COMET and GEMBA metrics, providing clear error messages for missing dependencies.
  - Disabled specific COMET metrics due to numpy version conflicts, with warnings logged for skipped metrics.
  - Updated GPU metrics list to reflect the disabled COMET metrics.
  - Improved code organization and clarity in metric processing functions.

* Add Gemba dependency for Swiss legal evaluations and remove `suite` parameter from TranslationTask constructor.

* Fix batched metric aggregation for grouped metric names

* Fixed missing system prompt

* Judge models now are used through OpenRouter

* fix reasoning model token handling when max_tokens is unset

* chore: trigger PR update

* fix: return raw score for BLEU, CHRF, and TER metrics instead of scaled values

* fix: replaced accidental default value assignment with intended type hint

* fix: add error handling for unsupported languages in Swiss Landmark Decision Summarization judge

* fix: avoid huge negative BERTScore from baseline rescaling

  Default `rescale_with_baseline` to False in BertScoreMultilingual. With near-1.0 baselines (e.g. German xlm-roberta-large layer-24 ≈ 0.98), the (score - baseline) / (1 - baseline) formula amplifies deviations ~50x, and the subsequent x100 scaling compounds it: empty/weak predictions could yield scores like -5000.

  Make rescaling opt-in, warn when enabled, and skip the x100 scaling in that case since rescaled scores are not bounded to [0, 1].

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
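
The arithmetic behind the BERTScore fix can be checked with a short sketch. It only reproduces the formula quoted in the commit message, and it is not the BertScoreMultilingual implementation. The function name `rescale` and the raw score of 0.0 for an empty prediction are assumptions made for illustration.

```python
def rescale(score: float, baseline: float) -> float:
    """Baseline rescaling formula quoted in the commit message."""
    return (score - baseline) / (1.0 - baseline)


# Approximate baseline quoted in the commit message
# (German xlm-roberta-large, layer 24).
baseline = 0.98

# A deviation of d from the baseline becomes d / (1 - baseline),
# so the amplification factor is roughly 50.
amplification = 1.0 / (1.0 - baseline)

# Hypothetical raw score for an empty or very weak prediction (assumption).
weak_raw_score = 0.0

rescaled = rescale(weak_raw_score, baseline)  # roughly -49
scaled = rescaled * 100                       # roughly -4900

print(f"amplification ~ {amplification:.1f}x")
print(f"rescaled = {rescaled:.1f}, after x100 = {scaled:.1f}")
```

Under these assumptions the result lands near -4900, which is in line with the "scores like -5000" mentioned in the commit message. It also shows why rescaled scores are not bounded to [0, 1].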

References

#1032 - Legal NLP tasks on Swiss data

Author

rolshoven

Parents

3fd15266

lighteval 8d29839e - Legal NLP tasks on Swiss data (#1032)
