Introduce new check for CompressFloatConstantsImpl (#35415)
### Details:
This PR builds on top of #35106 and addresses the open review comments
from @nshchego that were left unanswered there.
### Carried from #35106
Previously, `CompressFloatConstantsImpl` relied on checking whether
f64/f32 Constant values fit into the finite f16 range. If 75% of the
values did not fit, the Constant's conversion was skipped. However, even
when f64/f32 values are inside the f16 range, they might not always be
converted with high precision, as f16 has a narrower mantissa. The
larger the values grow, the bigger the round-trip error becomes. For
values > 1024 the f16 spacing is 1.0 or more, and from 2048 upward the
absolute round-trip error can reach 1.0 or more, which may cause
significant accuracy degradation. This was observed for the LTX-Video
model, where the RoPE frequency values were compressed to f16 with
substantial degradation.
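
The growth of the round-trip error can be illustrated with a small standalone sketch. `round_to_f16` is a hypothetical helper written for this description (it is not the OpenVINO `float16` type): it rounds a finite value to the nearest binary16 grid point, ties to even, assuming the default floating-point rounding mode.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper, for illustration only: rounds a finite x to the
// nearest IEEE binary16 value (ties to even) and returns it as a double.
double round_to_f16(double x) {
    assert(std::isfinite(x));  // the sketch handles finite inputs only
    if (x == 0.0) return x;
    int exp = 0;
    std::frexp(x, &exp);  // |x| = m * 2^exp with m in [0.5, 1)
    // Normal binary16 values carry 11 significant bits; below 2^-14 the
    // spacing stays fixed at 2^-24 (subnormal range).
    const int ulp_exp = (exp - 11 < -24) ? -24 : exp - 11;
    const double ulp = std::ldexp(1.0, ulp_exp);
    // nearbyint rounds ties to even under the default rounding mode.
    const double r = std::nearbyint(x / ulp) * ulp;
    if (std::fabs(r) > 65504.0)  // rounded past the largest finite f16
        return std::copysign(std::numeric_limits<double>::infinity(), x);
    return r;
}

// Absolute round-trip error of a finite value that stays finite in f16.
double f16_abs_error(double x) {
    return std::fabs(x - round_to_f16(x));
}
```

With this helper, `f16_abs_error(1025.5)` is 0.5, `f16_abs_error(2049.0)` is 1.0 and `f16_abs_error(4098.0)` is 2.0, while `f16_abs_error(0.5)` is 0.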
To prevent this, both absolute and relative error are now accounted for:
1. If a large absolute error (`> f16_compression_max_abs_error`,
currently `1.0`) is observed for any in-range value of the Constant, the
Constant is kept in its original precision (abs-error veto,
`has_lossy=true`).
2. Values with a large relative round-trip error (`>
f16_compression_max_rel_error`, currently `1e-4`) are accumulated into
the same rejection count as true out-of-range values, and fed into the
existing `f16_compression_keep_threshold` (75%) rule.
The logic works uniformly for scalar and non-scalar Constants. This
preserves precision-sensitive Constants (RoPE tables, attention scale
factors like `log(16)`, …) while keeping typical dense weights
compressed, and does not visibly affect compilation time or IR `.bin`
size. On wwb the similarity on the generated IR reaches 0.98.
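
The combined rule described above can be sketched roughly as follows. This is not the code from the PR: all names (`check_values`, `keep_original_precision`, `round_to_f16`, the `k…` constants) are hypothetical, the 75% bound is assumed to be inclusive, "out of range" is simplified to `|x| > 65504`, and inf/NaN handling is left out because the text does not describe it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Thresholds quoted in the description; the constant names are hypothetical.
constexpr double kMaxAbsErr = 1.0;        // f16_compression_max_abs_error
constexpr double kMaxRelErr = 1e-4;       // f16_compression_max_rel_error
constexpr double kKeepThreshold = 0.75;   // f16_compression_keep_threshold
constexpr double kF16Max = 65504.0;       // largest finite binary16 value

// Hypothetical helper: nearest binary16 value (ties to even) of a finite,
// in-range x, assuming the default floating-point rounding mode.
double round_to_f16(double x) {
    assert(std::isfinite(x) && std::fabs(x) <= kF16Max);
    if (x == 0.0) return x;
    int exp = 0;
    std::frexp(x, &exp);  // |x| = m * 2^exp with m in [0.5, 1)
    const int ulp_exp = (exp - 11 < -24) ? -24 : exp - 11;
    const double ulp = std::ldexp(1.0, ulp_exp);
    return std::nearbyint(x / ulp) * ulp;
}

struct CheckResult {
    bool has_lossy;        // abs-error veto triggered by some in-range value
    std::size_t rejected;  // out-of-range values plus rel-error offenders
};

CheckResult check_values(const std::vector<float>& values) {
    CheckResult res{false, 0};
    for (float v : values) {
        const double x = static_cast<double>(v);  // measure errors in double
        if (!std::isfinite(x)) continue;          // assumption: not modelled here
        if (std::fabs(x) > kF16Max) {             // true out-of-range value
            ++res.rejected;
            continue;
        }
        const double abs_err = std::fabs(x - round_to_f16(x));
        if (abs_err > kMaxAbsErr) {
            res.has_lossy = true;                 // rule 1: abs-error veto
        } else if (x != 0.0 && abs_err / std::fabs(x) > kMaxRelErr) {
            ++res.rejected;                       // rule 2: same rejection count
        }
    }
    return res;
}

// True when the Constant should stay in its original precision.
bool keep_original_precision(const std::vector<float>& values) {
    const CheckResult res = check_values(values);
    if (res.has_lossy) return true;
    // Assumption: the 75% rule is an inclusive bound.
    return !values.empty() &&
           static_cast<double>(res.rejected) >=
               kKeepThreshold * static_cast<double>(values.size());
}
```

Under these assumptions a single value such as `4098.0f` vetoes compression on its own (absolute error 2.0), while `0.1f` only adds to the rejection count (relative error of about 2.4e-4), so a Constant is kept in its original precision once such values make up three quarters of it.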
### Additional changes on top of #35106
1. In the slow (non-JIT) `change_constant_precision_to_fp16`, the
`src_data[i]` value is cast to `double` before subtracting the f16
round-trip, so the absolute and relative error are measured in `double`
precision (the original code lost f32 precision on the f32 source path).
The subnormal branch now also writes `static_cast<float16>(src_data[i])`
into `dst_data[i]` instead of relying on the Constant's
zero-initialisation — this matches the x86 fast path, which produces an
FP16 subnormal there.
2. A new `jit_check_f16_compression_avx512` JIT kernel is added. It
processes 16 floats per iteration using `zmm` registers and opmasks,
with an F16C `vcvtps2ph`/`vcvtph2ps` round-trip and a masked load for
the tail. The selection cascade in `check_f16_compression()` becomes
`avx512_core + fp16` → `avx2 + fp16` → C++ fallback.
3. The `vcvtps2ph` immediate is changed from `0` (use MXCSR) to `0x08`
(force RNE and suppress all FP exceptions — equivalent to
`_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC`) in both the AVX2 and
AVX512 paths, and applied consistently across every `vcvtps2ph` call
site in this file. This makes the JIT result independent of the caller's
MXCSR state and bit-identical to `static_cast<float16>` used in the C++
fallback. Note: `0x04` is `_MM_FROUND_CUR_DIRECTION` (use MXCSR), not
RNE — a common point of confusion flagged during review.
4. The AVX-512 kernel is written without `POPCNT` or `BMI2` (`bzhi`)
instructions — neither is guaranteed by `mayiuse(avx512_core)` /
`mayiuse(fp16)`. Lane counts use a branchless 16-bit SWAR popcount; the
tail mask uses a baseline `SHL cl` / `dec`.
5. The back-compat symbol `count_out_of_f16_range()` is kept in
`openvino/reference/convert.hpp` as a thin C++ wrapper — external
developer-package consumers that linked against the pre-PR symbol stay
source- and link-compatible. The in-tree compression path uses
`check_f16_compression()`.
6. JIT register and vector aliases are declared as `const auto&` instead
of `auto`, so the global Xbyak `Reg64`/`Ymm`/`Zmm` instances are taken
by reference instead of copied.
7. Shared FP16 range / threshold constants (`kF16MaxPos`, `kF16MaxNeg`,
`kF16MinPos`, `kF16MinNeg`, `kF16CompressionAbsErrVal`,
`kF16CompressionRelErrVal`, `kAbsMaskVal`, `kVcvtps2phRneNoExc`) are
lifted to the translation-unit anonymous namespace as `inline const` /
`inline constexpr`, deduplicating the per-kernel copies. The shared
`is_out_of_f16_range` predicate is used from both the
`check_f16_compression` slow path and the `count_out_of_f16_range`
back-compat shim.
8. `layer_tests/ovc_python_api_tests/test_pytorch.py::create_pytorch_module_convert_pytorch_frontend_oob`
no longer uses unseeded `torch.rand` for the weight tensor. About half of
the random draws produced uniformly distributed values whose relative
round-trip error exceeded the threshold for ≥75% of elements, which
(correctly, under the new combined check) rejected compression and made
the result diverge from the reference model. Weights are now
`torch.full([1, 3, 3, 3], 0.5)` (0.5 is exactly representable in FP16),
so the outcome is deterministic regardless of RNG state.
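
For item 4, the two bit tricks can be written out in plain C++ as a rough scalar equivalent. This is only a sketch of the technique, not the emitted JIT code; the function names `popcount16` and `tail_mask16` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Branchless SWAR population count of a 16-bit lane mask (no POPCNT needed).
inline std::uint32_t popcount16(std::uint32_t m) {
    m &= 0xFFFFu;                               // keep the 16 lane bits
    m = m - ((m >> 1) & 0x5555u);               // per-2-bit sums
    m = (m & 0x3333u) + ((m >> 2) & 0x3333u);   // per-4-bit sums
    m = (m + (m >> 4)) & 0x0F0Fu;               // per-byte sums
    return (m + (m >> 8)) & 0x1Fu;              // total in 0..16
}

// Tail mask for the last n (0..16) elements: shift left, then decrement,
// which avoids BMI2 `bzhi`.
inline std::uint32_t tail_mask16(unsigned n) {
    assert(n <= 16);
    return (1u << n) - 1u;
}
```

For example, `tail_mask16(5)` yields `0x1F` and `popcount16(0x8001)` yields 2; both helpers rely only on shifts, masks, adds and subtracts.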
### Tickets:
- 180611