Fix GPQA and index extractive metric (#829)
* too many false positives with the current GPQA metric extraction, making it stricter
* fixing whitespace and instruction in prompt
* use strict extraction for index extraction in general, not just for GPQA
* added comment
* fix tests, need to invert condition
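
For illustration only, a minimal sketch of why a loose index extraction gives false positives and a strict one does not. This is not the code changed in this PR: the function names, the A-D choice set, and the "Answer: X" format are all assumptions made for the example.

```python
import re


def extract_loose(text):
    # Hypothetical loose extraction: the first standalone choice letter
    # anywhere in the text is taken as the answer.
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else None


def extract_strict(text):
    # Hypothetical strict extraction: only the last line is considered,
    # and it must match the assumed "Answer: X" format in full.
    lines = text.strip().splitlines()
    if not lines:
        return None
    match = re.fullmatch(r"Answer:\s*\(?([A-D])\)?\.?", lines[-1].strip())
    return match.group(1) if match else None


response = "A careful reading suggests option C is wrong.\nAnswer: B"
loose = extract_loose(response)    # picks up the article "A" at the start
strict = extract_strict(response)  # reads only the final "Answer: B" line
```

The trade-off is that the strict version returns nothing when the model ignores the requested format, so the prompt instruction has to state that format clearly.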