Add support for CUMSUM and TRI for CUDA. #17584
Add support for CUMSUM and TRI for CUDA.
d138a03d
Minor optimizations.
67207d21
Correct warp_prefix_inclusive_sum in float2 variant to return float2
fab00294
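The helper named in this commit is a warp-level inclusive prefix sum. A minimal sketch of the shuffle-based pattern, assuming a fixed 32-lane warp and illustrative signatures (the PR's actual helpers may differ); the float2 overload is the one this fix makes return float2:

```cuda
__device__ __forceinline__ float warp_prefix_inclusive_sum(float v) {
    const int lane = threadIdx.x % 32;
#pragma unroll
    for (int offset = 1; offset < 32; offset <<= 1) {
        const float up = __shfl_up_sync(0xFFFFFFFF, v, offset);
        if (lane >= offset) {
            v += up;
        }
    }
    return v;
}

// The float2 variant scans both components and must return float2,
// which is the bug this commit corrects.
__device__ __forceinline__ float2 warp_prefix_inclusive_sum(float2 v) {
    const int lane = threadIdx.x % 32;
#pragma unroll
    for (int offset = 1; offset < 32; offset <<= 1) {
        const float up_x = __shfl_up_sync(0xFFFFFFFF, v.x, offset);
        const float up_y = __shfl_up_sync(0xFFFFFFFF, v.y, offset);
        if (lane >= offset) {
            v.x += up_x;
            v.y += up_y;
        }
    }
    return v;
}
```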
Optimize TRI
51c40a5a
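TRI writes a triangular pattern, so the kernel is essentially a per-element predicate on the row and column indices. A minimal sketch of the idea for a lower-triangular fill, with hypothetical names and a contiguous row-major layout, one thread per element; the PR's optimized kernel may distribute work differently:

```cuda
__global__ void tri_lower_kernel(float * dst, const int64_t ne0, const int64_t ne1, const float value) {
    const int64_t i0 = blockIdx.x*(int64_t) blockDim.x + threadIdx.x; // column
    const int64_t i1 = blockIdx.y;                                    // row
    if (i0 >= ne0 || i1 >= ne1) {
        return;
    }
    // Keep `value` on and below the diagonal, zero above it.
    dst[i1*ne0 + i0] = i0 <= i1 ? value : 0.0f;
}
```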
Whitespace
c30f5654
Fix strides.
31b55fab
Implement double loop
d1ca1c25
Whitespace
5289b530
Fix HIP compilation bugs
f422ba8e
Optimizations + big case performance tests
df917ccf
Implement using CUB with fallback to custom kernel
76382d79
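For long rows, CUMSUM can be delegated to CUB's device-wide scan, keeping the custom warp-level kernel as a fallback for builds where CUB is not available. A rough sketch for one contiguous row, assuming plain CUDA stream-ordered allocations for the temporary storage (ggml would normally draw this from its CUDA pool instead):

```cuda
#include <cub/cub.cuh>

static void cumsum_row_cub(const float * src, float * dst, const int n, cudaStream_t stream) {
    // First call only queries the required temporary storage size.
    size_t tmp_bytes = 0;
    cub::DeviceScan::InclusiveSum(nullptr, tmp_bytes, src, dst, n, stream);

    void * tmp = nullptr;
    cudaMallocAsync(&tmp, tmp_bytes, stream);
    cub::DeviceScan::InclusiveSum(tmp, tmp_bytes, src, dst, n, stream);
    cudaFreeAsync(tmp, stream);
}
```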
Remove error message.
01d4033e
am17an commented on 2025-12-03

Fixes from code review
10a2ea9d
Comment out CPU-unsupported F16/BF16 cases to fix CI
7a83b056
Fine, you win :P
bbe37435
CISC commented on 2025-12-04
am17an commented on 2025-12-03
Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS
069413ab
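NO_DEVICE_CODE and GGML_UNUSED_VARS come from ggml's CUDA headers: the former traps when a kernel body is compiled for an architecture that cannot execute it, and the latter silences unused-parameter warnings on that path. A hedged illustration of the usage pattern only; the guard macro and the kernel itself are placeholders, not the PR's code:

```cuda
#include "common.cuh" // ggml-cuda header providing NO_DEVICE_CODE and GGML_UNUSED_VARS

__global__ void example_kernel(const float * src, float * dst, const int n) {
#ifdef FP16_AVAILABLE
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i];
    }
#else
    GGML_UNUSED_VARS(src, dst, n);
    NO_DEVICE_CODE;
#endif // FP16_AVAILABLE
}
```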
Vary warp-size based on physical warp size
5aa7438e
Add GGML_UNUSED_VARS in tri as well
579eba6e
Use constexpr and call prefix_inclusive with warp_size template param
08b3f2d2
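Passing the physical warp size (32 on NVIDIA, 32 or 64 on AMD) as a constexpr template parameter lets the shuffle loop unroll fully, with the lane index computed as tid % warp_size as in the later commit. A minimal sketch under those assumptions, using NVIDIA-style intrinsics (the HIP path maps these differently); names are illustrative:

```cuda
template <int warp_size>
__device__ __forceinline__ float warp_prefix_inclusive_sum(float v, const int lane) {
#pragma unroll
    for (int offset = 1; offset < warp_size; offset <<= 1) {
        const float up = __shfl_up_sync(0xFFFFFFFF, v, offset, warp_size);
        if (lane >= offset) {
            v += up;
        }
    }
    return v;
}

// Caller side: the warp size is known at compile time, the lane index is not.
// const int lane = tid % warp_size;
// const float scanned = warp_prefix_inclusive_sum<WARP_SIZE>(x, lane);
```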
Update ggml/src/ggml-cuda/cumsum.cu
9cd0eff1
Apply suggestions from code review
9574264c
Change to tid % warp_size
efd619a6
IMbackK requested changes on 2025-12-04
Fix strides; hardcode mask; add ggml_lane_mask_t
86a0853f
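ggml_lane_mask_t names the integer type used for shuffle and ballot masks, which must be 64 bits wide on wave64 AMD GPUs and 32 bits wide elsewhere. A rough sketch of the intent only; GGML_WAVE64 and GGML_LANE_MASK_ALL are placeholder names, not the definitions in ggml:

```cuda
#include <cstdint>

#ifdef GGML_WAVE64 // placeholder for however a 64-lane wavefront is detected
typedef uint64_t ggml_lane_mask_t;
#else
typedef uint32_t ggml_lane_mask_t;
#endif

// "Hardcode mask" then amounts to using the all-ones value of this type:
static constexpr ggml_lane_mask_t GGML_LANE_MASK_ALL = (ggml_lane_mask_t) -1;
```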
Missing renames, remove unused get_warp_mask(), explicit calls to ggm…
de45c632
Too hasty...
8a7375c8
IMbackK approved these changes on 2025-12-04
pwilkin merged 96fe9bad into master 15 days ago
Assignees: No one assigned
Labels: testing, Nvidia GPU, ggml