[TD] Emit metrics to compare heuristic quality (#108192)
When a test fails, we will now emit fine-grained details about how accurately each heuristic predicted the relevance of that test.
## Context
Why only look at failing tests? Our only signal that a PR is likely relevant to a test is whether that test fails on it. Green tests don't tell us whether the success was because the code was good or because the change was irrelevant to the test. This isn't a perfect measure, since it can miscategorize unstable and flaky failures as having been "missed" by the heuristics, but it's a reasonable approximation.
## What's measured?
The metrics this PR collects are designed to answer the following questions:
### How comprehensive are the heuristics?
- What's the false negative rate, i.e. the % of failures that ideally should have been prioritized but weren't? (Both at an aggregate level and at a per-heuristic level; see the sketch below.)
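
A minimal, hypothetical sketch of how the aggregate and per-heuristic false negative rates could be computed. The data structure and the heuristic names (`edited_files`, `historical_correlation`) are assumptions for illustration only, not the actual implementation in this PR:

```python
from typing import Dict, Set


def false_negative_rates(failures: Dict[str, Set[str]], heuristics: Set[str]) -> Dict[str, float]:
    """Fraction of failed tests that were not prioritized, overall and per heuristic.

    `failures` maps each failed test to the set of heuristics that prioritized it;
    an empty set means no heuristic flagged the test (a miss).
    """
    total = len(failures)
    if total == 0:
        return {}
    rates = {"aggregate": sum(1 for hs in failures.values() if not hs) / total}
    for h in heuristics:
        rates[h] = sum(1 for hs in failures.values() if h not in hs) / total
    return rates


# Example: two failures, only one was flagged by any heuristic.
print(false_negative_rates(
    {"test_a": {"edited_files"}, "test_b": set()},
    heuristics={"edited_files", "historical_correlation"},
))
# -> {'aggregate': 0.5, 'edited_files': 0.5, 'historical_correlation': 1.0}
```
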
### How precise are the heuristics?
- What % of failed tests were prioritized by a given heuristic? What % were prioritized overall?
- How relevant was a failed test considered to be? (Both at an aggregate level and at a per-heuristic level)
- What % of the time did a given heuristic prioritize a failing test higher than any other heuristic? (See the sketch below.)
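
A similar hypothetical sketch for the precision-style metrics, assuming each failing test records the rank each heuristic assigned to it (a lower rank means higher priority), or `None` if the heuristic didn't prioritize it at all. The function and field names here are illustrative, not the names used in the actual code:

```python
from typing import Dict, Optional


def precision_metrics(failures: Dict[str, Dict[str, Optional[int]]]) -> Dict[str, Dict[str, float]]:
    """Per-heuristic precision stats over failed tests."""
    if not failures:
        return {}
    total = len(failures)
    heuristics = {h for ranks in failures.values() for h in ranks}
    metrics: Dict[str, Dict[str, float]] = {}
    for h in heuristics:
        # Ranks this heuristic actually assigned to failing tests.
        prioritized = [ranks[h] for ranks in failures.values() if ranks.get(h) is not None]
        # How often this heuristic ranked the failing test strictly higher
        # (smaller rank) than every other heuristic.
        best = sum(
            1
            for ranks in failures.values()
            if ranks.get(h) is not None
            and all(
                other == h or ranks[other] is None or ranks[h] < ranks[other]
                for other in ranks
            )
        )
        metrics[h] = {
            "pct_prioritized": len(prioritized) / total,
            "avg_rank": sum(prioritized) / len(prioritized) if prioritized else float("nan"),
            "pct_ranked_highest": best / total,
        }
    return metrics


# Example: heuristic ranks assigned to two failing tests.
print(precision_metrics({
    "test_a": {"edited_files": 1, "historical_correlation": 3},
    "test_b": {"edited_files": None, "historical_correlation": 2},
}))
```
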
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108192
Approved by: https://github.com/huydhn
ghstack dependencies: #108117