Add swiss legal evals as new community tasks #389
Add swiss legal evals as new community tasks
e2a27a72
Removed nltk and numpy dependencies.
aa409c83
Added short dataset descriptions.
a8ee2a5c
Merge branch 'main' into add_swiss_legal_evals
8f688444
Removed open judge models and added COMET and METEOR.
c7f70380
Merge branch 'main' into add_swiss_legal_evals
0ca5af6a
Merge branch 'main' into add_swiss_legal_evals
1d51a01f
Ran pre-commit hooks.
5d41ce0a
Changed prompt template.
81941254
Added legal translation specific judge prompt.
c58ae447
Improved judge prompt.
ff3705f9
Changed metric selection.
091ec113
Made generation_size dependent on the config.
5a479564
Fixed error in config.
6bf7fa24
Fixed error in config.
6cf1c2ac
Added support for multiple devices.
b5488017
Fixed some bugs for evaluation on GPUs.
ee2a83c0
Added batch inference for heavy metrics and multiplied each score by …
36b7e943
Added few shot examples and did some refactoring.
5ba218f8
Merge branch 'main' into add_swiss_legal_evals
8490841e
Switched to an own judge class.
576b847b
Fixed issue with judge metric not showing up in results.
41bb59ae
Fixed issue with evaluation on GPUs.
d82cd91a
Speed up metric computation on GPUs.
1b13d9fc
Added more logging.
df0f3f02
Switched to sample level scores for faster evaluation.
980c2571
Added rescale_with_baseline for BERTScore for better differentiation.
9a60dc0f
Merge branch 'main' into add_swiss_legal_evals
8c7814fc
Adapted metrics.
819b949c
Switched to sacrebleu implementation for sentence level translation m…
e758316f
Added more stop sequences.
d08163fa
Made stop_sequence level specific.
86c67bc3
Added gemba metric.
f1099455
Updated logging.
f357176e
Updated stop_sequence.
2d4c0ed8
Merge branch 'main' into add_swiss_legal_evals
44ad734c
Made metric selection easier.
7b779727
Fixed dict issue.
fcd95052
Added metric dependencies.
5a8ca464
Moving metrics to extended tasks.
bab94af4
Merge branch 'main' into add_swiss_legal_evals
37468493
Merge branch 'main' into add_swiss_legal_evals
ddaadbf2
Added support for judges from different providers.
09be56d8
Added additional system and user prompts and few shot examples.
0aa86077
Removed debug relics.
c49e1e23
Fixed issue in judge prompt.
4418e82b
Adapted getting predictions to new way for all metrics.
075ebd2e
Added gemba mqm metric by default.
8ee2dbc7
Fixed error in gemba score when errors are no dicts.
4408d0d0
Added different judge configurations for gpt 4o.
be6d9abe
Fixed typo.
c7ca83f5
Disabled short metrics for evaluation of longer sequences.
930cbc57
Added xcomet metrics to sentence level metrics.
61058b16
Fixed error in bleurt and enabled lazy loading of metrics to save on …
e043ee81
Refactored judge metric creation.
1c38c0ab
Improved detailed judge prompt and changed secondary judge models fro…
e05ac6a0
Changed judge order.
0aed0632
Merge branch 'main' into add_swiss_legal_evals
d9078a7f
Fixed stop sequence issue in press releases.
46eb62ae
Fixed error in xcomet scores.
a78bc03b
Made metric groups more easily configurable.
f6b50b4c
Made comet score more robust.
7f36065c
Moved unpack to the pipeline code.
cb6bfb41
Merge branch 'huggingface:main' into add_swiss_legal_evals
306ee766
Fixed bug in comet score.
866e7708
Added additional judge prompt configurations.
e7f9a096
Fixed judge setup.
186a6c83
Added more judge models.
c62647e8
Made the best judge the default.
2610c920