openvino
b344668c - Introduce new check for CompressFloatConstantsImpl (#35415)

35 days ago
### Details:
This PR builds on top of #35106 and addresses the open review comments from @nshchego that were left unanswered there.

### Carried from #35106
Previously, `CompressFloatConstantsImpl` relied on checking whether f64/f32 Constant values fit into the finite f16 range. If 75% of the values did not fit, the Constant's conversion was skipped. However, even when f64/f32 values are inside the f16 range, they might not always be converted with high precision, as f16 has a narrower mantissa. The larger the values grow, the bigger the round-trip error becomes. For values > 1024 the absolute round-trip error can reach 1.0 or more and may cause significant accuracy degradation. This was observed for the LTX-Video model, where the RoPE frequency values were compressed to f16 with substantial degradation.

To prevent this, both absolute and relative error are now accounted for:
1. If a large absolute error (`> f16_compression_max_abs_error`, currently `1.0`) is observed for any in-range value of the Constant, the Constant is kept in its original precision (abs-error veto, `has_lossy=true`).
2. Values with a large relative round-trip error (`> f16_compression_max_rel_error`, currently `1e-4`) are accumulated into the same rejection count as true out-of-range values, and fed into the existing `f16_compression_keep_threshold` (75%) rule.

The logic works uniformly for scalar and non-scalar Constants. This preserves precision-sensitive Constants (RoPE tables, attention scale factors like `log(16)`, …) while keeping typical dense weights compressed, and does not visibly affect compilation time or IR `.bin` size. On wwb the similarity on the generated IR reaches 0.98.

### Additional changes on top of #35106
1. In the slow (non-JIT) `change_constant_precision_to_fp16`, the `src_data[i]` value is cast to `double` before subtracting the f16 round-trip, so the absolute and relative error are measured in `double` precision (the original code lost f32 precision on the f32 source path). The subnormal branch now also writes `static_cast<float16>(src_data[i])` into `dst_data[i]` instead of relying on the Constant's zero-initialisation; this matches the x86 fast path, which produces an FP16 subnormal there.
2. A new `jit_check_f16_compression_avx512` JIT kernel is added. It processes 16 floats per iteration using `zmm` registers and opmasks, with an F16C `vcvtps2ph`/`vcvtph2ps` round-trip and a masked load for the tail. The selection cascade in `check_f16_compression()` becomes `avx512_core + fp16` → `avx2 + fp16` → C++ fallback.
3. The `vcvtps2ph` immediate is changed from `0` (use MXCSR) to `0x08` (force RNE and suppress all FP exceptions, equivalent to `_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC`) in both the AVX2 and AVX512 paths, and applied consistently across every `vcvtps2ph` call site in this file. This makes the JIT result independent of the caller's MXCSR state and bit-identical to `static_cast<float16>` used in the C++ fallback. Note: `0x04` is `_MM_FROUND_CUR_DIRECTION` (use MXCSR), not RNE, a common point of confusion flagged during review.
4. The AVX-512 kernel is written without `POPCNT` or `BMI2` (`bzhi`) instructions; neither is guaranteed by `mayiuse(avx512_core)` / `mayiuse(fp16)`. Lane counts use a branchless 16-bit SWAR popcount; the tail mask uses a baseline `SHL cl` / `dec`.
5. The back-compat symbol `count_out_of_f16_range()` is kept in `openvino/reference/convert.hpp` as a thin C++ wrapper, so external developer-package consumers that linked against the pre-PR symbol stay source- and link-compatible. The in-tree compression path uses `check_f16_compression()`.
6. JIT register and vector aliases are declared as `const auto&` instead of `auto`, so the global Xbyak `Reg64`/`Ymm`/`Zmm` instances are taken by reference instead of copied.
7. Shared FP16 range / threshold constants (`kF16MaxPos`, `kF16MaxNeg`, `kF16MinPos`, `kF16MinNeg`, `kF16CompressionAbsErrVal`, `kF16CompressionRelErrVal`, `kAbsMaskVal`, `kVcvtps2phRneNoExc`) are lifted to the translation-unit anonymous namespace as `inline const` / `inline constexpr`, deduplicating the per-kernel copies. The shared `is_out_of_f16_range` predicate is used from both the `check_f16_compression` slow path and the `count_out_of_f16_range` back-compat shim.
8. `layer_tests/ovc_python_api_tests/test_pytorch.py::create_pytorch_module_convert_pytorch_frontend_oob` no longer uses unseeded `torch.rand` for the weight tensor. Random draws ~half the time produce uniformly-distributed values whose relative round-trip error exceeds the threshold for ≥75% of elements, which (correctly, under the new combined check) rejects compression and diverges from the reference model. Weights are now `torch.full([1, 3, 3, 3], 0.5)` (0.5 is exactly representable in FP16), so the outcome is deterministic regardless of RNG state.

### Tickets:
- 180611
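For readers who want to see the combined rule from "Carried from #35106" in one place, here is a minimal C++ sketch. It is not the code from this PR: `f16_round_trip`, `keep_original_precision` and the `k…` constants are hypothetical names, the f16 rounding is emulated with `frexp`/`ldexp` rather than a real `float16` type, and treating only overflow and non-finite values as "out of range" is a simplifying assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the thresholds named in the commit message.
constexpr double kMaxAbsError = 1.0;     // f16_compression_max_abs_error
constexpr double kMaxRelError = 1e-4;    // f16_compression_max_rel_error
constexpr double kKeepThreshold = 0.75;  // f16_compression_keep_threshold
constexpr double kF16Max = 65504.0;      // largest finite f16 value

// Emulates a value -> f16 -> double round trip for a finite input with
// |x| <= kF16Max. Relies on the default round-to-nearest-even FP mode.
double f16_round_trip(double x) {
    if (x == 0.0) return x;
    int exp = 0;
    std::frexp(x, &exp);  // |x| = m * 2^exp with m in [0.5, 1)
    // f16 keeps 11 significant bits; below the normal range the spacing
    // is fixed at 2^-24 (the subnormal step).
    const int quantum = std::max(exp - 11, -24);
    return std::ldexp(std::nearbyint(std::ldexp(x, -quantum)), quantum);
}

// Returns true when the constant should stay in its original precision.
bool keep_original_precision(const std::vector<float>& values) {
    if (values.empty()) return false;
    std::size_t rejected = 0;
    for (const float v : values) {
        const double src = static_cast<double>(v);  // measure errors in double
        // Assumption of this sketch: non-finite and overflowing values
        // are the only "out of range" cases.
        if (!std::isfinite(src) || std::abs(src) > kF16Max) {
            ++rejected;
            continue;
        }
        const double abs_err = std::abs(src - f16_round_trip(src));
        if (abs_err > kMaxAbsError) return true;  // abs-error veto
        if (src != 0.0 && abs_err / std::abs(src) > kMaxRelError) ++rejected;
    }
    // Large relative errors and out-of-range values share one count.
    return static_cast<double>(rejected) >=
           kKeepThreshold * static_cast<double>(values.size());
}
```

Under these assumptions a constant filled with `0.5` is compressed, while a constant containing `4098.0` is kept as is: in f16 that value rounds to `4096.0`, an absolute error of `2.0`.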
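The two bit tricks mentioned for the AVX-512 kernel (a 16-bit SWAR popcount and a shift-and-decrement tail mask) can be illustrated in plain C++. This is only a scalar restatement of the idea, not the Xbyak code from the PR; `popcount16` and `tail_mask16` are hypothetical names.

```cpp
#include <cstdint>

// Branchless SWAR popcount of the low 16 bits; needs no POPCNT instruction.
inline std::uint32_t popcount16(std::uint32_t m) {
    m &= 0xFFFFu;
    m = m - ((m >> 1) & 0x5555u);              // 2-bit partial sums
    m = (m & 0x3333u) + ((m >> 2) & 0x3333u);  // 4-bit partial sums
    m = (m + (m >> 4)) & 0x0F0Fu;              // 8-bit partial sums
    return (m + (m >> 8)) & 0x1Fu;             // final count, 0..16
}

// Mask with the low `remaining` bits set, for `remaining` in [0, 16].
// Mirrors a shift followed by a decrement, so no BMI2 `bzhi` is needed.
inline std::uint16_t tail_mask16(unsigned remaining) {
    return static_cast<std::uint16_t>((1u << remaining) - 1u);
}
```

The shift is done in 32 bits so that `remaining == 16` yields a full mask instead of overflowing a 16-bit value.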