Add IoU-based accuracy checking for segmentation models in inductor tests (#171927)
Summary:
# Add IoU-based accuracy checking for segmentation models
### Summary
Introduces an IoU (Intersection over Union) metric for boolean mask accuracy checking in inductor benchmarks. This provides a more appropriate accuracy comparison for segmentation models like SAM that output boolean masks.
These tests block viable/strict, so there is an interest in maintaining their quality.
### Problem
The `sam` model was failing accuracy checks intermittently in CI (`inductor-test / test (inductor_torchbench, *, *, linux.g5.4xlarge.nvidia.gpu)`):
```
sam FAIL: accuracy=fail_accuracy, expected=pass
```
The error logs showed:
```
Accuracy failed: uint8 tensor did not match
Accuracy failed for key name masks
```
**Root cause:** Segmentation models output boolean masks that are derived by thresholding floating-point values. Small numerical differences (e.g., 0.4999 vs 0.5001) can cause pixels to flip between `True` and `False`. The existing accuracy check requires exact boolean matching, which is too strict for this use case.
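The flip described above can be reproduced in a few lines. This is a minimal sketch in plain Python rather than the benchmark code: the scores, the drift values, and the 0.5 threshold are illustrative assumptions chosen to mirror the 0.4999 vs 0.5001 example.

```python
THRESHOLD = 0.5  # assumed mask threshold, for illustration only

# Hypothetical per-pixel scores from the reference (eager) run.
eager_scores = [0.4999, 0.7000, 0.2000, 0.5001]
# Simulated numerical drift of the kind a differently compiled kernel might add.
drift = [0.0002, 0.0, 0.0, -0.0002]
compiled_scores = [s + d for s, d in zip(eager_scores, drift)]

# Thresholding turns the floating-point scores into boolean masks.
eager_mask = [s > THRESHOLD for s in eager_scores]
compiled_mask = [s > THRESHOLD for s in compiled_scores]

max_drift = max(abs(a - b) for a, b in zip(eager_scores, compiled_scores))
flipped = sum(a != b for a, b in zip(eager_mask, compiled_mask))

print(f"max score drift: {max_drift:.4f}")
print(f"flipped pixels:  {flipped} of {len(eager_mask)}")
print(f"exact match:     {eager_mask == compiled_mask}")
```

The scores differ by roughly 0.0002 at most, yet two of the four pixels change value, so an exact boolean comparison reports a mismatch.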
### Solution
Instead of suppressing the failures (via `flaky_models` or `non_deterministic`), this PR implements a semantically appropriate comparison method:
- **IoU (Intersection over Union)** - A standard metric for comparing segmentation masks
- Models can be configured to use an IoU ≥ 0.99 threshold for boolean tensor comparison
- This catches real accuracy problems while allowing minor pixel-level variations
### Changes
1. **`benchmarks/dynamo/torchbench.yaml`**
   - Added `tolerance.use_iou_for_bool_masks` config list for models that should use IoU
2. **`benchmarks/dynamo/torchbench.py`**
   - Added `use_iou_for_bool_accuracy()` method to `TorchBenchmarkRunner`
3. **`benchmarks/dynamo/common.py`**
   - Added base `use_iou_for_bool_accuracy()` method to `BenchmarkRunner`
   - Passed the new flag to the `same()` function
4. **`torch/_dynamo/utils.py`**
   - Added `use_iou_for_bool` parameter to `same()` function
   - Implemented IoU comparison logic for boolean tensors:
```python
intersection = (ref & res).sum().float()
union = (ref | res).sum().float()
iou = intersection / union
# Pass if IoU >= 0.99 (the intersection covers at least 99% of the union)
```
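For readers who want to try the comparison in isolation, here is a self-contained sketch built around the snippet above. The helper names `bool_mask_iou` and `masks_match` are hypothetical and not part of `torch/_dynamo/utils.py`. Treating two empty masks as a perfect match (IoU = 1.0) is an assumption made here to avoid dividing by zero; this description does not say how the PR handles that case.

```python
import torch


def bool_mask_iou(ref: torch.Tensor, res: torch.Tensor) -> float:
    """Intersection over union of two boolean masks of equal shape."""
    if ref.dtype != torch.bool or res.dtype != torch.bool:
        raise TypeError("both masks must be boolean tensors")
    if ref.shape != res.shape:
        raise ValueError("masks must have the same shape")
    intersection = (ref & res).sum().item()
    union = (ref | res).sum().item()
    # Assumption: two all-False masks agree perfectly, so return 1.0
    # instead of dividing by zero.
    if union == 0:
        return 1.0
    return intersection / union


def masks_match(ref: torch.Tensor, res: torch.Tensor, threshold: float = 0.99) -> bool:
    """Hypothetical pass/fail check using the 0.99 threshold from this PR."""
    return bool_mask_iou(ref, res) >= threshold


# Reference mask: a 60x60 square of True pixels (3600 in total).
ref = torch.zeros(100, 100, dtype=torch.bool)
ref[20:80, 20:80] = True

# Ten boundary pixels flipped: IoU = 3590 / 3600.
res_close = ref.clone()
res_close[20, 20:30] = False

# Only half of the square predicted: IoU = 1800 / 3600.
res_far = torch.zeros_like(ref)
res_far[20:80, 20:50] = True

print(masks_match(ref, res_close))  # small boundary noise
print(masks_match(ref, res_far))    # real disagreement
```

Under these assumptions the first comparison passes and the second fails: the check tolerates a few flipped boundary pixels but still rejects a mask that is substantially wrong.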
### Models enabled for IoU comparison
- `sam` - Segment Anything Model
- `sam_fast` - Fast variant of SAM
- `vision_maskrcnn` - Mask R-CNN (also outputs segmentation masks)
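The per-model opt-in can be pictured as a simple membership test against the config list. The sketch below is not the runner code: `CONFIG` stands in for the parsed `torchbench.yaml`, its exact nesting is assumed from the `tolerance.use_iou_for_bool_masks` key name, and `model_uses_iou` is a hypothetical helper rather than the real `use_iou_for_bool_accuracy()` method.

```python
# Stand-in for the parsed YAML config; the layout is an assumption based on
# the key name `tolerance.use_iou_for_bool_masks`.
CONFIG = {
    "tolerance": {
        "use_iou_for_bool_masks": ["sam", "sam_fast", "vision_maskrcnn"],
    }
}


def model_uses_iou(model_name: str, config: dict = CONFIG) -> bool:
    """Hypothetical helper: True if the model opted into IoU comparison."""
    enabled = config.get("tolerance", {}).get("use_iou_for_bool_masks", [])
    return model_name in enabled


for name in ("sam", "sam_fast", "vision_maskrcnn", "resnet50"):
    mode = "IoU >= 0.99" if model_uses_iou(name) else "exact boolean match"
    print(f"{name}: {mode}")
```

Models that are not listed keep the existing exact-match behavior, so the relaxed comparison stays limited to models that explicitly opt in.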
### Why IoU over alternatives?
| Approach | Pros | Cons |
|----------|------|------|
| `flaky_models` | Visible failures, doesn't block CI | Doesn't fix the underlying issue |
| `non_deterministic` | Simple config | Silently passes all failures, hides real problems |
| **IoU (this PR)** | Semantically correct metric, catches real bugs | Slightly more code |
### Test Plan
- Models in `use_iou_for_bool_masks` will use IoU ≥ 0.99 for boolean tensor comparison
- Real accuracy problems (IoU < 0.99) will still fail
- CI should no longer flake on `sam` model accuracy checks
- `sam_fast` can now be verified for accuracy and we can detect regressions
X-link: https://github.com/pytorch/pytorch/pull/171927
Approved by: https://github.com/malfet, https://github.com/yangw-dev
Reviewed By: atalman
Differential Revision: D90364697
fbshipit-source-id: 9b25f18ebbe6079a245af1fc7c3660cb9f171e34