Add MMLU-Redux and fix caching (#883)
Added MMLU-Redux; results are similar to Qwen's when using a generative metric.
Three changes to fix caching:
- Removed the tokenization saving system, since it was unused and bloated the code.
- Added a hash of the task config so that we only compare generations from the same task version (for example, changing task params changes the task hash). Side note: this required adding a lot of __str__ methods to get pretty prints for logged classes.
- Separated samples from loglikelihood metrics from samples from generative metrics.
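
For illustration only, a minimal sketch of the idea behind the last two changes, not the code in this PR. The helper names (task_config_hash, cache_key), the config fields, and the key layout are all hypothetical; the sketch only assumes that a task config can be serialized deterministically.

```python
import hashlib
import json


def task_config_hash(config: dict) -> str:
    """Return a short, deterministic hash of a task config (hypothetical helper)."""
    # sort_keys makes the hash independent of dict ordering;
    # default=str covers values that JSON cannot encode natively.
    payload = json.dumps(config, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def cache_key(task_name: str, config: dict, metric_kind: str) -> str:
    """Build a cache key that separates task versions and metric kinds."""
    if metric_kind not in ("loglikelihood", "generative"):
        raise ValueError(f"unknown metric kind: {metric_kind}")
    return f"{task_name}/{task_config_hash(config)}/{metric_kind}"


# Hypothetical config fields, chosen only for the example.
base = {"name": "mmlu_redux", "num_fewshot": 0, "generation_size": 32}
changed = {**base, "num_fewshot": 5}

# Changing a task param changes the hash, so stale samples are never reused.
print(task_config_hash(base) != task_config_hash(changed))
# Loglikelihood and generative samples land under different keys.
print(cache_key("mmlu_redux", base, "loglikelihood"))
print(cache_key("mmlu_redux", base, "generative"))
```

Serializing with sorted keys keeps the hash stable across runs and dict orderings, which is what makes it usable as a cache key.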