Improve eval chain prompt (#2798)
The eval chain is currently very sensitive to differences in phrasing,
punctuation, and tangential information. This prompt has worked better
for me on my examples.
More general question: do we have any framework for evaluating changes to
default prompts? We could start doing some regression testing.
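
One possible shape for that regression testing, as a rough sketch only: keep a small fixed set of graded cases and measure how often a grader agrees with the expected label before and after a prompt change. Everything here is hypothetical (`Case`, `CASES`, `agreement`, and both stand-in graders); none of it is existing library API. In practice `grade` would wrap the eval chain with the prompt under test.

```python
from typing import Callable, NamedTuple, Sequence


class Case(NamedTuple):
    question: str
    reference: str   # the known-good answer
    prediction: str  # the answer being graded
    expected: str    # label a human would assign: "CORRECT" or "INCORRECT"


# Hypothetical cases covering the failure modes named above:
# punctuation/casing, tangential information, and a genuinely wrong answer.
CASES = [
    Case("What is the capital of France?", "Paris", "paris.", "CORRECT"),
    Case("What is the capital of France?", "Paris",
         "The capital is Paris, a city on the Seine.", "CORRECT"),
    Case("What is the capital of France?", "Paris", "Lyon", "INCORRECT"),
]

Grader = Callable[[str, str, str], str]


def agreement(grade: Grader, cases: Sequence[Case]) -> float:
    """Fraction of cases where the grader's label matches the expected label."""
    hits = sum(
        grade(c.question, c.reference, c.prediction) == c.expected for c in cases
    )
    return hits / len(cases)


# Stand-in graders (assumptions for illustration, not the real eval chain).
def strict_grader(question: str, reference: str, prediction: str) -> str:
    # Exact string match: mimics a grader that is overly sensitive to phrasing.
    return "CORRECT" if prediction == reference else "INCORRECT"


def lenient_grader(question: str, reference: str, prediction: str) -> str:
    # Case-insensitive containment: tolerates punctuation and extra detail.
    return "CORRECT" if reference.lower() in prediction.lower() else "INCORRECT"


if __name__ == "__main__":
    print("strict :", agreement(strict_grader, CASES))
    print("lenient:", agreement(lenient_grader, CASES))
```

A check like this could fail CI when agreement for a new default prompt drops below the score of the prompt it replaces.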