[GPU] Fix oneDNN FP16 convolution format selection for channel expansion operations (#33131)
### Details:
- When an FP16 dynamic convolution has few input channels (≤ 4) and many
output channels (e.g., 1024), the current format selection logic chooses
`bfyx → fsv16`, which triggers the oneDNN reference kernel instead of an
optimized JIT kernel, causing significant performance degradation.
- Override the output format to planar (bfyx) when input channels are small
(≤ 16) and output channels are large (≥ 32).
**Current behavior:**
- Input: 3 channels → Converted to `bfyx`
- Output: 1024 channels → Remains `fsv16` (the format is only overridden when output channels ≤ 4)
- Result: `bfyx → fsv16` combination uses **reference kernel** (slow)
#### Root Cause
The fsv16 blocked format is optimized for reading many channels but
introduces overhead when used for writing outputs in channel-expansion
scenarios (small input → large output). oneDNN's reference kernel is
selected because:
1. **Inefficient write pattern**: fsv16 output requires interleaved
writes every 16 elements (non-contiguous)
2. **No optimized implementation**: oneDNN does not provide a JIT-optimized
kernel for generating fsv16 output from small input channels
3. **Scatter write overhead**: Writing 1024 channels in fsv16 format
requires complex block-strided access
### Tickets:
- [CVS-177671](https://jira.devtools.intel.com/browse/CVS-177671)
Signed-off-by: Andrew Park <andrew.park@intel.com>