Add retries for get_workflow_job_id and try catch in upload_test_stats (#93401)
upload_test_stats keeps failing because it can't handle an id of the form workflow-<workflow_id>, so wrap that step in a try/except.
Add retries to get_workflow_job_id to reduce the number of times the id can't be found.
Failure to upload test stats and inability to get the job id make our sharding infra and slow test infra (and probably flaky test detection) less effective. This does not completely resolve the issue, since we still rely on the job id.
Failure to get the workflow job id happens tragically often; hopefully retries will help.
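
For illustration only, here is a minimal sketch of the two ideas above: retrying a lookup that sometimes fails, and tolerating an id that is not numeric. It is not the code in this PR. The names `with_retries` and `parse_job_id`, the retry count, and the delay are all hypothetical, and the sketch assumes that a valid job id is a plain integer string while the fallback looks like workflow-<workflow_id>.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3, delay_s: float = 0.0) -> T:
    """Hypothetical helper: call fn until it succeeds or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_exc: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # broad on purpose for this sketch
            last_exc = exc
            if attempt < attempts - 1:
                time.sleep(delay_s)  # wait before the next attempt
    assert last_exc is not None
    raise last_exc  # every attempt failed: surface the last error


def parse_job_id(raw: str) -> Optional[int]:
    """Hypothetical helper: return the numeric job id, or None for a
    fallback value such as 'workflow-<workflow_id>' (assumed format)."""
    try:
        return int(raw)
    except ValueError:
        return None
```

A caller could wrap the lookup as `with_retries(lambda: lookup(), attempts=3)`, where `lookup` stands in for whatever actually queries the job id, and then skip the upload when `parse_job_id` returns `None` rather than crash.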
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93401
Approved by: https://github.com/huydhn