Add support for CUMSUM and TRI for CUDA. #17584
Add support for CUMSUM and TRI for CUDA.
d138a03d
Minor optimizations.
67207d21
Correct warp_prefix_inclusive_sum in float2 variant to return float2
fab00294
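The helper named in this commit is a warp-level inclusive prefix sum. A minimal sketch of the shuffle-based pattern, assuming a fixed 32-lane warp and illustrative signatures (the PR's actual helpers may differ); the float2 overload is the one this fix makes return float2:

```cuda
__device__ __forceinline__ float warp_prefix_inclusive_sum(float v) {
    const int lane = threadIdx.x % 32;
#pragma unroll
    for (int offset = 1; offset < 32; offset <<= 1) {
        const float up = __shfl_up_sync(0xFFFFFFFF, v, offset);
        if (lane >= offset) {
            v += up;
        }
    }
    return v;
}

// The float2 variant scans both components and must return float2,
// which is the bug this commit corrects.
__device__ __forceinline__ float2 warp_prefix_inclusive_sum(float2 v) {
    const int lane = threadIdx.x % 32;
#pragma unroll
    for (int offset = 1; offset < 32; offset <<= 1) {
        const float up_x = __shfl_up_sync(0xFFFFFFFF, v.x, offset);
        const float up_y = __shfl_up_sync(0xFFFFFFFF, v.y, offset);
        if (lane >= offset) {
            v.x += up_x;
            v.y += up_y;
        }
    }
    return v;
}
```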
Optimize TRI
51c40a5a
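TRI writes a triangular pattern, so the kernel is essentially a per-element predicate on the row and column indices. A minimal sketch of the idea for a lower-triangular fill, with hypothetical names and a contiguous row-major layout, one thread per element; the PR's optimized kernel may distribute work differently:

```cuda
__global__ void tri_lower_kernel(float * dst, const int64_t ne0, const int64_t ne1, const float value) {
    const int64_t i0 = blockIdx.x*(int64_t) blockDim.x + threadIdx.x; // column
    const int64_t i1 = blockIdx.y;                                    // row
    if (i0 >= ne0 || i1 >= ne1) {
        return;
    }
    // Keep `value` on and below the diagonal, zero above it.
    dst[i1*ne0 + i0] = i0 <= i1 ? value : 0.0f;
}
```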
Whitespace
c30f5654
Fix strides.
31b55fab
Implement double loop
d1ca1c25
Whitespace
5289b530
Fix HIP compilation bugs
f422ba8e
Optimizations + big case performance tests
df917ccf
Implement using CUB with fallback to custom kernel
76382d79
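For long rows, CUMSUM can be delegated to CUB's device-wide scan, keeping the custom warp-level kernel as a fallback for builds where CUB is not available. A rough sketch for one contiguous row, assuming plain CUDA stream-ordered allocations for the temporary storage (ggml would normally draw this from its CUDA pool instead):

```cuda
#include <cub/cub.cuh>

static void cumsum_row_cub(const float * src, float * dst, const int n, cudaStream_t stream) {
    // First call only queries the required temporary storage size.
    size_t tmp_bytes = 0;
    cub::DeviceScan::InclusiveSum(nullptr, tmp_bytes, src, dst, n, stream);

    void * tmp = nullptr;
    cudaMallocAsync(&tmp, tmp_bytes, stream);
    cub::DeviceScan::InclusiveSum(tmp, tmp_bytes, src, dst, n, stream);
    cudaFreeAsync(tmp, stream);
}
```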
Remove error message.
01d4033e
am17an commented on 2025-12-03

Fixes from code review
10a2ea9d
Comment out CPU-unsupported F16/BF16 cases to fix CI
7a83b056
Fine, you win :P
bbe37435
CISC commented on 2025-12-04
am17an commented on 2025-12-03
Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS
069413ab
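NO_DEVICE_CODE and GGML_UNUSED_VARS come from ggml's CUDA headers: the former traps when a kernel body is compiled for an architecture that cannot execute it, and the latter silences unused-parameter warnings on that path. A hedged illustration of the usage pattern only; the guard macro and the kernel itself are placeholders, not the PR's code:

```cuda
#include "common.cuh" // ggml-cuda header providing NO_DEVICE_CODE and GGML_UNUSED_VARS

__global__ void example_kernel(const float * src, float * dst, const int n) {
#ifdef FP16_AVAILABLE
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i];
    }
#else
    GGML_UNUSED_VARS(src, dst, n);
    NO_DEVICE_CODE;
#endif // FP16_AVAILABLE
}
```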
Vary warp-size based on physical warp size
5aa7438e
Add GGML_UNUSED_VARS in tri as well
579eba6e
Use constexpr and call prefix_inclusive with warp_size template param
08b3f2d2
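Passing the physical warp size (32 on NVIDIA, 32 or 64 on AMD) as a constexpr template parameter lets the shuffle loop unroll fully, with the lane index computed as tid % warp_size as in the later commit. A minimal sketch under those assumptions, using NVIDIA-style intrinsics (the HIP path maps these differently); names are illustrative:

```cuda
template <int warp_size>
__device__ __forceinline__ float warp_prefix_inclusive_sum(float v, const int lane) {
#pragma unroll
    for (int offset = 1; offset < warp_size; offset <<= 1) {
        const float up = __shfl_up_sync(0xFFFFFFFF, v, offset, warp_size);
        if (lane >= offset) {
            v += up;
        }
    }
    return v;
}

// Caller side: the warp size is known at compile time, the lane index is not.
// const int lane = tid % warp_size;
// const float scanned = warp_prefix_inclusive_sum<WARP_SIZE>(x, lane);
```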
Update ggml/src/ggml-cuda/cumsum.cu
9cd0eff1
Apply suggestions from code review
9574264c
Change to tid % warp_size
efd619a6
IMbackK requested changes on 2025-12-04
Fix strides; hardcode mask; add ggml_lane_mask_t
86a0853f
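ggml_lane_mask_t names the integer type used for shuffle and ballot masks, which must be 64 bits wide on wave64 AMD GPUs and 32 bits wide elsewhere. A rough sketch of the intent only; GGML_WAVE64 and GGML_LANE_MASK_ALL are placeholder names, not the definitions in ggml:

```cuda
#include <cstdint>

#ifdef GGML_WAVE64 // placeholder for however a 64-lane wavefront is detected
typedef uint64_t ggml_lane_mask_t;
#else
typedef uint32_t ggml_lane_mask_t;
#endif

// "Hardcode mask" then amounts to using the all-ones value of this type:
static constexpr ggml_lane_mask_t GGML_LANE_MASK_ALL = (ggml_lane_mask_t) -1;
```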
Missing renames, remove unused get_warp_mask(), explicit calls to ggm…
de45c632
Too hasty...
8a7375c8
IMbackK approved these changes on 2025-12-04
pwilkin merged 96fe9bad into master 15 days ago
Assignees: No one assigned
Labels: testing, Nvidia GPU, ggml