Fixing MixEval (#1006)
* option1
* also debugging the judge
* debug
* eval tracker fix 1
* likely fix for the GSM+ issue
* stringify model judge + use the max_length that's actually passed instead of setting a bunch of overrides
* more memory for flow judge