Parallelize integration tests for faster CI execution

- Split integration tests by file (6 files x 2 runtimes = 12 parallel jobs)
- Enable pytest-xdist within each job for intra-job parallelization
- Reduce runner size to 8-core since work is distributed across more jobs
- Add --maxfail=3 so each job stops after three failures: feedback stays fast, but more than the first error is reported
- Remove the settings that forced sequential, fail-fast runs (the -x flag and PYTEST_XDIST_WORKER_COUNT=0)
- Expected speedup: ~30min → ~5-10min while maintaining error visibility
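
For illustration, the job layout described above can be sketched roughly as follows. This is not the actual workflow definition: the file paths and runtime names are hypothetical placeholders, and `-n auto` is an assumed way of enabling pytest-xdist (the change itself may pass a fixed worker count).

```python
from itertools import product

# Hypothetical names: the real test files and runtimes are not listed here.
TEST_FILES = [f"tests/integration/test_part{i}.py" for i in range(1, 7)]
RUNTIMES = ["runtime-a", "runtime-b"]


def build_matrix(files, runtimes):
    """Return one CI job entry per (test file, runtime) pair."""
    return [{"test_file": f, "runtime": r} for f, r in product(files, runtimes)]


def pytest_command(test_file):
    """Build the pytest invocation for a single job."""
    # -n auto (assumed): pytest-xdist starts one worker per available CPU.
    # --maxfail=3: stop after three failures rather than after the first (-x).
    return ["pytest", test_file, "-n", "auto", "--maxfail=3"]


jobs = build_matrix(TEST_FILES, RUNTIMES)
print(len(jobs))  # 6 files x 2 runtimes
print(" ".join(pytest_command(jobs[0]["test_file"])))
```

Each job runs a single file, so the slowest file bounds total wall-clock time; xdist then spreads that file's tests across the cores of the 8-core runner.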