[GPU] Optimize MVN and reorders for nnUNet INT8 5D model (#34949)
### Summary
Optimizes the OpenVINO GPU plugin for nnUNet INT8 5D inference by
reducing redundant reorders, extending shape-agnostic coverage of
blocked reorder kernels, and preventing a oneDNN deconvolution fallback
to the reference kernel. On Intel Arc B390 (DUT4580PTLH), end-to-end
inference latency drops from *18509 ms → 3781 ms* (*4.89×*).
### Changes (11 commits, TEST → IMPL paired)
| Opt | TEST commit | IMPL commit | Scope |
|-----|-------------|-------------|-------|
| 1 | `44b7c06d89` | `5d720c278e` | Prevent oneDNN deconv from selecting `ocl:ref` kernel |
| 2 | `6fc39924b7` | `2d2d2566bd` | MVN fsv16↔fsv32 cross-layout fusing; dynamic-shape MVN b_fs_yx_fsv16; 5D int8 concat preferred format |
| 3 | `0452e8fe72` | `0a8ab0a467` | Dynamic-shape support for `reorder_data_bfyx_to_blocked_format` |
| 4 | `87caeb366f` | `06fd68db38` + `e42585d73a` | New `reorder_data_fsv` kernel for blocked↔blocked fsv conversion + vload/vstore vectorization |
| 5 | `f1c3cc437d` | `2bbcf03107` | Rename `_imad` kernel to `mvn_gpu_b_fs_yx_fsv16` (no longer int-only); extend dynamic reorder registry with the blocked formats the new kernels serve |
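
For readers unfamiliar with the blocked layouts, the index math behind a blocked↔blocked fsv conversion (Opt 4) can be sketched as below. This is a host-side Python illustration only, not the `reorder_data_fsv` OpenCL kernel: it assumes a single batch, a 4D `fs_yx_fsv<N>` layout with no padding between slices, and the helper names `blocked_offset` and `reorder_fsv` are hypothetical.

```python
def blocked_offset(f, y, x, Y, X, fsv):
    """Linear offset of logical element (f, y, x) in an fs_yx_fsv<fsv> buffer.

    Features are grouped into slices of `fsv` channels; the channels of one
    slice are stored innermost, so they are contiguous for a given (y, x).
    """
    fs, fv = divmod(f, fsv)  # slice index, position inside the slice
    return ((fs * Y + y) * X + x) * fsv + fv


def reorder_fsv(src, F, Y, X, src_fsv, dst_fsv):
    """Copy a buffer from fs_yx_fsv<src_fsv> to fs_yx_fsv<dst_fsv>."""
    dst_slices = -(-F // dst_fsv)  # ceil(F / dst_fsv)
    dst = [0] * (dst_slices * Y * X * dst_fsv)  # unused tail channels stay 0
    for f in range(F):
        for y in range(Y):
            for x in range(X):
                s = blocked_offset(f, y, x, Y, X, src_fsv)
                d = blocked_offset(f, y, x, Y, X, dst_fsv)
                dst[d] = src[s]
    return dst


# Build a 32-channel, 2x3 tensor in fsv16 and convert it to fsv32.
F, Y, X = 32, 2, 3
src16 = [0] * (2 * Y * X * 16)
for f in range(F):
    for y in range(Y):
        for x in range(X):
            src16[blocked_offset(f, y, x, Y, X, 16)] = f * 100 + y * 10 + x
dst32 = reorder_fsv(src16, F, Y, X, 16, 32)
```

Because the innermost `fsv` channels are contiguous in both layouts, a kernel can move them with vector loads and stores rather than one element at a time, which is presumably what the vload/vstore vectorization in Opt 4 exploits.
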
### Graph-level impact (main program final stage)
| Stage | total nodes | reorder | mvn |
|-------|---:|---:|---:|
| master (baseline) | 311 | 56 | 22 |
| after Opt1 | 348 | 56 | 22 |
| after Opt2+ | **305** | **13** | 22 |

Opt2 removes 43 reorders (56 → 13) by allowing MVN to accept cross-layout
fsv16/fsv32 input/output. The consumer-direction rule is symmetric to the
existing producer-direction rule in `can_fuse_reorder_to_prev`.
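
The effect of the cross-layout rule can be pictured with a toy model of a linear node chain. This is only a sketch under stated assumptions: the tuple representation, the format strings, and the functions `is_cross_fsv` and `fuse_cross_fsv_reorders` are hypothetical and do not mirror the plugin's actual conditions in `can_fuse_reorder_to_prev`, which may check more (data types, padding, users).

```python
# Toy model: a node is (kind, input_format, output_format).
FSV_BLOCKED = {"b_fs_zyx_fsv16", "b_fs_zyx_fsv32"}


def is_cross_fsv(src_fmt, dst_fmt):
    """True when a reorder only switches between the fsv16 and fsv32 layouts."""
    return src_fmt != dst_fmt and {src_fmt, dst_fmt} <= FSV_BLOCKED


def fuse_cross_fsv_reorders(chain):
    """Drop fsv16<->fsv32 reorders next to an MVN and let the MVN absorb them."""
    out = []
    pending_in = None  # layout the next MVN should read directly
    for i, (kind, in_fmt, out_fmt) in enumerate(chain):
        if kind == "reorder" and is_cross_fsv(in_fmt, out_fmt):
            next_kind = chain[i + 1][0] if i + 1 < len(chain) else None
            if next_kind == "mvn":
                # Reorder feeds an MVN: the MVN reads the reorder's input layout.
                pending_in = in_fmt
                continue
            if out and out[-1][0] == "mvn":
                # Reorder follows an MVN: the MVN writes the reorder's output layout.
                prev_kind, prev_in, _ = out[-1]
                out[-1] = (prev_kind, prev_in, out_fmt)
                continue
        if kind == "mvn" and pending_in is not None:
            in_fmt, pending_in = pending_in, None
        out.append((kind, in_fmt, out_fmt))
    return out


chain = [
    ("conv", "b_fs_zyx_fsv32", "b_fs_zyx_fsv32"),
    ("reorder", "b_fs_zyx_fsv32", "b_fs_zyx_fsv16"),
    ("mvn", "b_fs_zyx_fsv16", "b_fs_zyx_fsv16"),
    ("conv", "b_fs_zyx_fsv16", "b_fs_zyx_fsv16"),
    ("reorder", "b_fs_zyx_fsv16", "bfzyx"),
]
fused = fuse_cross_fsv_reorders(chain)
```

In this model the reorder in front of the MVN disappears and the MVN becomes a cross-layout node (fsv32 in, fsv16 out), while the final reorder to a plain layout is kept because it is not an fsv16↔fsv32 conversion.
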
### E2E latency on Intel Arc B390 GPU (96 CUs, 2500 MHz), nnUNet INT8 5D
| Build | Avg [ms] | Device total [s] | vs master |
|-------|---:|---:|---:|
| master | 18509 | 50.54 | 1.00× |
| + Opt1 | 16639 | 44.65 | 1.11× |
| + Opt2 | 6445 | 13.67 | 2.87× |
| + Opt3 | 6763 | 13.71 | 2.74× (noise) |
| + Opt4 | **3781** | **5.79** | **4.89×** |
| + Opt5 | 4137 | 6.84 | 4.47× (noise) |
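
The "vs master" column is the master average latency divided by each build's average latency. A minimal sketch that recomputes it from the numbers in the table (the variable and function names are illustrative):

```python
BASELINE_MS = 18509  # master average latency from the table

avg_ms = {
    "+ Opt1": 16639,
    "+ Opt2": 6445,
    "+ Opt3": 6763,
    "+ Opt4": 3781,
    "+ Opt5": 4137,
}


def speedup(baseline_ms, build_ms):
    """Speedup factor relative to the baseline (higher is faster)."""
    return baseline_ms / build_ms


speedups = {build: speedup(BASELINE_MS, ms) for build, ms in avg_ms.items()}

for build, factor in speedups.items():
    print(f"{build}: {factor:.4f}x")
```

The recomputed factors agree with the table to within 0.01.
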
### Tickets
- 182677
### AI Assistance
- AI assistance used: yes
- AI: root-cause analysis, patch generation, kernel vectorization
- User: design decisions, build, validation
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>