[libc++] Improve error resilience when running historical benchmarks
In benchmark-historical, don't skip gathering the results when the
lit command fails. Such failures are expected in normal operation,
since at least one test is likely to fail when running historical
benchmarks. Instead, gather whatever results were produced.
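As a rough illustration of the idea (not the actual code of
benchmark-historical), the pattern is to run the command without
raising on a non-zero exit status and to gather results either way.
The names run_then_gather and gather are hypothetical:

```python
import subprocess
import sys


def run_then_gather(command, gather):
    """Run `command`, then call `gather()` whether or not it succeeded.

    `gather` is a hypothetical callable that collects whatever results
    exist; it stands in for the real result-gathering step.
    """
    # check=False: a non-zero exit status does not raise an exception.
    completed = subprocess.run(command, check=False)
    if completed.returncode != 0:
        print(
            f"warning: command exited with status {completed.returncode}; "
            "gathering partial results",
            file=sys.stderr,
        )
    # Gather unconditionally so that a single failing test does not
    # discard the results of the tests that did run.
    return completed.returncode, gather()
```

The exit status is returned alongside the results so that a caller can
still report the failure after the results have been saved.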
Also, output the build log in spec.gen.py upon failure so we can see
the reason for the failure.
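A minimal sketch of printing a captured log only on failure, assuming
the build is run as a subprocess. The helper run_build and its
log_stream parameter are hypothetical and are not taken from
spec.gen.py:

```python
import subprocess
import sys


def run_build(command, log_stream=sys.stderr):
    """Run a build command, echoing its captured log only if it fails."""
    # Capture stdout and stderr together so the log keeps its ordering.
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    if completed.returncode != 0:
        # Surface the log so the reason for the failure is visible.
        log_stream.write(completed.stdout)
    return completed.returncode
```

Successful builds stay quiet under this scheme, and a failing build
leaves its full output in the benchmark run's log.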