PR #389 Add swiss legal evals as new community tasks

JoelNiklaus wants to merge 69 commits into huggingface:main from JoelNiklaus:add_swiss_legal_evals

Add swiss legal evals as new community tasks

e2a27a72

clefourrier requested a review from hynky1999 1 year ago

clefourrier commented on 2024-11-12

Removed nltk and numpy dependencies.

aa409c83

Added short dataset descriptions.

a8ee2a5c

Merge branch 'main' into add_swiss_legal_evals

8f688444

Removed open judge models and added COMET and METEOR.

c7f70380

Merge branch 'main' into add_swiss_legal_evals

0ca5af6a

NathanHB commented on 2024-11-19

Merge branch 'main' into add_swiss_legal_evals

1d51a01f

Ran pre-commit hooks.

5d41ce0a

Changed prompt template.

81941254

Added legal translation specific judge prompt.

c58ae447

Improved judge prompt.

ff3705f9

Changed metric selection.

091ec113

Made generation_size dependent on the config.

5a479564

Fixed error in config.

6bf7fa24

Fixed error in config.

6cf1c2ac

Added support for multiple devices.

b5488017

Fixed some bugs for evaluation on GPUs.

ee2a83c0

Added batch inference for heavy metrics and multiplied each score by …

36b7e943

Added few shot examples and did some refactoring.

5ba218f8

Merge branch 'main' into add_swiss_legal_evals

8490841e

Switched to an own judge class.

576b847b

Fixed issue with judge metric not showing up in results.

41bb59ae

Fixed issue with evaluation on GPUs.

d82cd91a

Speed up metric computation on GPUs.

1b13d9fc

Added more logging.

df0f3f02

Switched to sample level scores for faster evaluation.

980c2571

Added rescale_with_baseline for BERTScore for better differentiation.

9a60dc0f

Merge branch 'main' into add_swiss_legal_evals

8c7814fc

Adapted metrics.

819b949c

Switched to sacrebleu implementation for sentence level translation m…

e758316f

Added more stop sequences.

d08163fa

Made stop_sequence level specific.

86c67bc3

Added gemba metric.

f1099455

Updated logging.

f357176e

Updated stop_sequence.

2d4c0ed8

Merge branch 'main' into add_swiss_legal_evals

44ad734c

Made metric selection easier.

7b779727

Fixed dict issue.

fcd95052

Added metric dependencies.

5a8ca464

Moving metrics to extended tasks.

bab94af4

Merge branch 'main' into add_swiss_legal_evals

37468493

Merge branch 'main' into add_swiss_legal_evals

ddaadbf2

Added support for judges from different providers.

09be56d8

Added additional system and user prompts and few shot examples.

0aa86077

Removed debug relics.

c49e1e23

Fixed issue in judge prompt.

4418e82b

Adapted getting predictions to new way for all metrics.

075ebd2e

Added gemba mqm metric by default.

8ee2dbc7

Fixed error in gemba score when errors are no dicts.

4408d0d0

Added different judge configurations for gpt 4o.

be6d9abe

Fixed typo.

c7ca83f5

Disabled short metrics for evaluation of longer sequences.

930cbc57

Added xcomet metrics to sentence level metrics.

61058b16

Fixed error in bleurt and enabled lazy loading of metrics to save on …

e043ee81

Refactored judge metric creation.

1c38c0ab

Improved detailed judge prompt and changed secondary judge models fro…

e05ac6a0

Changed judge order.

0aed0632

Merge branch 'main' into add_swiss_legal_evals

d9078a7f

Fixed stop sequence issue in press releases.

46eb62ae

Fixed error in xcomet scores.

a78bc03b

Made metric groups more easily configurable.

f6b50b4c

Made comet score more robust.

7f36065c

Moved unpack to the pipeline code.

cb6bfb41

Merge branch 'huggingface:main' into add_swiss_legal_evals

306ee766

Fixed bug in comet score.

866e7708

Added additional judge prompt configurations.

e7f9a096

Fixed judge setup.

186a6c83

Added more judge models.

c62647e8

Made the best judge the default.

2610c920

Reviewers

NathanHB

clefourrier

hynky1999

lighteval Add swiss legal evals as new community tasks #389 Open
