llama.cpp
560445bf - metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477)

Commit

26 days ago

metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477)

For a given output position j on the time axis, only input positions i such that i*s0 <= j < i*s0 + K contribute -- i.e. i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1]. That's at most ceil(K/s0) values (typically 2 for stride==K/2 transposed convs).

The current kernel iterates the full IL range and filters with an `if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320, K=10, s0=5 -- a representative codec-decoder shape). On Apple M1 the wasted work trips the macOS GPU watchdog (kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long graphs.

Compute i_min, i_max analytically before the inner loop and iterate only [i_min, i_max]. Output is bit-identical (same multiplies and adds in the same order); loop bound shrinks by IL/ceil(K/s0).

Tested on M1 with a downstream consumer running a TTS codec at full T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits across long synthesis runs vs ~30% pre-patch.
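The bound computation described in the message can be sketched on the CPU in plain C. This is an illustrative single-channel reference, not the Metal kernel itself: the function names (floor_div, input_range, out_full, out_tight) and the flat weight/input layout are assumptions made for the example, and the real kernel additionally loops over channels.

```c
#include <assert.h>

/* Floor division that stays correct for negative numerators (requires b > 0).
 * C's `/` truncates toward zero, so negative inexact quotients need a fix-up. */
static int floor_div(int a, int b) {
    int q = a / b;
    return (a % b != 0 && a < 0) ? q - 1 : q;
}

/* Range of input positions i that contribute to output position j:
 * [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1]. */
static void input_range(int j, int K, int s0, int IL, int *i_min, int *i_max) {
    assert(s0 > 0);
    /* ceil(x / s0) == floor((x + s0 - 1) / s0) for s0 > 0 */
    int lo = floor_div(j - K + 1 + s0 - 1, s0);
    int hi = floor_div(j, s0);
    *i_min = lo < 0 ? 0 : lo;
    *i_max = hi > IL - 1 ? IL - 1 : hi;
}

/* Pre-patch shape: scan every input position and filter with an `if`. */
static float out_full(const float *w, const float *x, int j, int K, int s0, int IL) {
    float v = 0.0f;
    for (int i = 0; i < IL; ++i) {
        if (j >= i * s0 && j < i * s0 + K) {
            v += w[j - i * s0] * x[i];
        }
    }
    return v;
}

/* Post-patch shape: visit only [i_min, i_max], in the same ascending order,
 * so the same multiplies and adds happen in the same sequence. */
static float out_tight(const float *w, const float *x, int j, int K, int s0, int IL) {
    int i_min, i_max;
    input_range(j, K, s0, IL, &i_min, &i_max);
    float v = 0.0f;
    for (int i = i_min; i <= i_max; ++i) {
        v += w[j - i * s0] * x[i];
    }
    return v;
}
```

For the shape quoted in the message (IL=320, K=10, s0=5), input_range yields at most ceil(10/5) = 2 positions per output, against 320 iterations in the full scan, which is where the ~160x figure comes from. Because both variants accumulate in ascending i, comparing out_full and out_tight with `==` over all output positions j in [0, (IL-1)*s0 + K) is a reasonable way to check the bit-identical claim on the CPU.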

References

#23143 - sync : ggml

Author

CrispStrobe

Committer

ggerganov

Parents

2eb3e6b2
