llama.cpp
e6525661 - Readd `cub::DeviceScan::InclusiveSum`-based CumSum

Commit
9 days ago
Readd `cub::DeviceScan::InclusiveSum`-based CumSum For single rows and large columns doing a for-loop over the function `cub::DeviceScan::InclusiveSum` offered by CUB outperforms the `cumsum_cub_kernel` where `cub::BlockScan` is used. Numbers before this change Backend 1/3: CUDA0 Device description: NVIDIA RTX 6000 Ada Generation Device memory: 48510 MB (48039 MB free) CUMSUM(type=f32,ne=[128,128,4,4]): 311258 runs - 3.26 us/run - 2048 kB/run - 599.76 GB/s CUMSUM(type=f32,ne=[2048,16,5,4]): 229390 runs - 4.40 us/run - 5120 kB/run - 1110.23 GB/s CUMSUM(type=f32,ne=[20000,10,4,1]): 37583 runs - 29.63 us/run - 6250 kB/run - 201.18 GB/s CUMSUM(type=f32,ne=[128,1,1,1]): 892819 runs - 1.12 us/run - 1 kB/run - 0.85 GB/s CUMSUM(type=f32,ne=[1024,1,1,1]): 450505 runs - 2.25 us/run - 8 kB/run - 3.39 GB/s CUMSUM(type=f32,ne=[4096,1,1,1]): 155629 runs - 6.61 us/run - 32 kB/run - 4.62 GB/s CUMSUM(type=f32,ne=[8192,1,1,1]): 81910 runs - 12.60 us/run - 64 kB/run - 4.85 GB/s CUMSUM(type=f32,ne=[16384,1,1,1]): 49146 runs - 23.99 us/run - 128 kB/run - 5.09 GB/s CUMSUM(type=f32,ne=[32768,1,1,1]): 24573 runs - 47.10 us/run - 256 kB/run - 5.18 GB/s CUMSUM(type=f32,ne=[65536,1,1,1]): 16382 runs - 93.57 us/run - 512 kB/run - 5.22 GB/s CUMSUM(type=f32,ne=[131072,1,1,1]): 8191 runs - 184.79 us/run - 1024 kB/run - 5.29 GB/s CUMSUM(type=f32,ne=[200000,1,1,1]): 8191 runs - 280.43 us/run - 1562 kB/run - 5.31 GB/s CUMSUM(type=f32,ne=[2000000,1,1,1]): 2148 runs - 2771.23 us/run - 15625 kB/run - 5.38 GB/s CUMSUM(type=f32,ne=[128,4,1,1]): 458696 runs - 2.21 us/run - 4 kB/run - 1.73 GB/s CUMSUM(type=f32,ne=[1024,4,1,1]): 360404 runs - 2.82 us/run - 32 kB/run - 10.83 GB/s CUMSUM(type=f32,ne=[4096,4,1,1]): 147438 runs - 7.12 us/run - 128 kB/run - 17.15 GB/s CUMSUM(type=f32,ne=[8192,4,1,1]): 81910 runs - 12.90 us/run - 256 kB/run - 18.92 GB/s CUMSUM(type=f32,ne=[16384,4,1,1]): 49146 runs - 24.32 us/run - 512 kB/run - 20.08 GB/s CUMSUM(type=f32,ne=[32768,4,1,1]): 24573 runs - 47.28 us/run - 1024 kB/run - 20.66 GB/s CUMSUM(type=f32,ne=[65536,4,1,1]): 16382 runs - 93.21 us/run - 2048 kB/run - 20.96 GB/s CUMSUM(type=f32,ne=[131072,4,1,1]): 8191 runs - 185.04 us/run - 4096 kB/run - 21.11 GB/s CUMSUM(type=f32,ne=[200000,4,1,1]): 5369 runs - 282.08 us/run - 6250 kB/run - 21.13 GB/s CUMSUM(type=f32,ne=[2000000,4,1,1]): 537 runs - 2806.46 us/run - 62500 kB/run - 21.26 GB/s CUMSUM(type=f32,ne=[128,8,1,1]): 458696 runs - 2.20 us/run - 8 kB/run - 3.47 GB/s CUMSUM(type=f32,ne=[1024,8,1,1]): 360404 runs - 2.82 us/run - 64 kB/run - 21.66 GB/s CUMSUM(type=f32,ne=[4096,8,1,1]): 147438 runs - 7.12 us/run - 256 kB/run - 34.28 GB/s CUMSUM(type=f32,ne=[8192,8,1,1]): 81910 runs - 12.90 us/run - 512 kB/run - 37.84 GB/s CUMSUM(type=f32,ne=[16384,8,1,1]): 49146 runs - 24.32 us/run - 1024 kB/run - 40.15 GB/s CUMSUM(type=f32,ne=[32768,8,1,1]): 24573 runs - 47.28 us/run - 2048 kB/run - 41.31 GB/s CUMSUM(type=f32,ne=[65536,8,1,1]): 16382 runs - 93.20 us/run - 4096 kB/run - 41.92 GB/s CUMSUM(type=f32,ne=[131072,8,1,1]): 8194 runs - 185.05 us/run - 8192 kB/run - 42.22 GB/s CUMSUM(type=f32,ne=[200000,8,1,1]): 5370 runs - 282.15 us/run - 12500 kB/run - 42.26 GB/s CUMSUM(type=f32,ne=[2000000,8,1,1]): 269 runs - 4067.61 us/run - 125000 kB/run - 29.36 GB/s CUMSUM(type=f32,ne=[128,16,1,1]): 303067 runs - 3.32 us/run - 16 kB/run - 4.60 GB/s CUMSUM(type=f32,ne=[1024,16,1,1]): 303067 runs - 3.32 us/run - 128 kB/run - 36.76 GB/s CUMSUM(type=f32,ne=[4096,16,1,1]): 147438 runs - 7.17 us/run - 512 kB/run - 68.13 GB/s CUMSUM(type=f32,ne=[8192,16,1,1]): 81910 runs - 12.90 us/run - 1024 kB/run - 75.68 GB/s CUMSUM(type=f32,ne=[16384,16,1,1]): 49146 runs - 24.33 us/run - 2048 kB/run - 80.28 GB/s CUMSUM(type=f32,ne=[32768,16,1,1]): 24573 runs - 47.30 us/run - 4096 kB/run - 82.59 GB/s CUMSUM(type=f32,ne=[65536,16,1,1]): 12291 runs - 93.24 us/run - 8192 kB/run - 83.80 GB/s CUMSUM(type=f32,ne=[131072,16,1,1]): 6147 runs - 185.07 us/run - 16384 kB/run - 84.45 GB/s CUMSUM(type=f32,ne=[200000,16,1,1]): 4029 runs - 282.40 us/run - 25000 kB/run - 84.46 GB/s CUMSUM(type=f32,ne=[2000000,16,1,1]): 270 runs - 4118.40 us/run - 250000 kB/run - 58.11 GB/s Backend CUDA0: OK Backend 2/3: CUDA1 Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition Device memory: 97250 MB (96677 MB free) CUMSUM(type=f32,ne=[128,128,4,4]): 368595 runs - 2.73 us/run - 2048 kB/run - 715.83 GB/s CUMSUM(type=f32,ne=[2048,16,5,4]): 216282 runs - 4.72 us/run - 5120 kB/run - 1035.32 GB/s CUMSUM(type=f32,ne=[20000,10,4,1]): 32214 runs - 34.33 us/run - 6250 kB/run - 173.64 GB/s CUMSUM(type=f32,ne=[128,1,1,1]): 810909 runs - 1.24 us/run - 1 kB/run - 0.77 GB/s CUMSUM(type=f32,ne=[1024,1,1,1]): 401359 runs - 2.52 us/run - 8 kB/run - 3.03 GB/s CUMSUM(type=f32,ne=[4096,1,1,1]): 139247 runs - 7.44 us/run - 32 kB/run - 4.10 GB/s CUMSUM(type=f32,ne=[8192,1,1,1]): 73719 runs - 14.27 us/run - 64 kB/run - 4.28 GB/s CUMSUM(type=f32,ne=[16384,1,1,1]): 40955 runs - 27.24 us/run - 128 kB/run - 4.48 GB/s CUMSUM(type=f32,ne=[32768,1,1,1]): 24573 runs - 53.46 us/run - 256 kB/run - 4.57 GB/s CUMSUM(type=f32,ne=[65536,1,1,1]): 16382 runs - 105.29 us/run - 512 kB/run - 4.64 GB/s CUMSUM(type=f32,ne=[131072,1,1,1]): 8191 runs - 210.15 us/run - 1024 kB/run - 4.65 GB/s CUMSUM(type=f32,ne=[200000,1,1,1]): 8191 runs - 318.22 us/run - 1562 kB/run - 4.68 GB/s CUMSUM(type=f32,ne=[2000000,1,1,1]): 2148 runs - 3142.23 us/run - 15625 kB/run - 4.74 GB/s CUMSUM(type=f32,ne=[128,4,1,1]): 303067 runs - 3.34 us/run - 4 kB/run - 1.14 GB/s CUMSUM(type=f32,ne=[1024,4,1,1]): 253921 runs - 4.03 us/run - 32 kB/run - 7.58 GB/s CUMSUM(type=f32,ne=[4096,4,1,1]): 122865 runs - 8.20 us/run - 128 kB/run - 14.89 GB/s CUMSUM(type=f32,ne=[8192,4,1,1]): 73719 runs - 14.96 us/run - 256 kB/run - 16.32 GB/s CUMSUM(type=f32,ne=[16384,4,1,1]): 40955 runs - 28.66 us/run - 512 kB/run - 17.04 GB/s CUMSUM(type=f32,ne=[32768,4,1,1]): 24573 runs - 54.21 us/run - 1024 kB/run - 18.01 GB/s CUMSUM(type=f32,ne=[65536,4,1,1]): 16382 runs - 106.49 us/run - 2048 kB/run - 18.34 GB/s CUMSUM(type=f32,ne=[131072,4,1,1]): 8191 runs - 210.88 us/run - 4096 kB/run - 18.52 GB/s CUMSUM(type=f32,ne=[200000,4,1,1]): 5369 runs - 321.77 us/run - 6250 kB/run - 18.53 GB/s CUMSUM(type=f32,ne=[2000000,4,1,1]): 537 runs - 3191.79 us/run - 62500 kB/run - 18.69 GB/s CUMSUM(type=f32,ne=[128,8,1,1]): 376786 runs - 2.67 us/run - 8 kB/run - 2.86 GB/s CUMSUM(type=f32,ne=[1024,8,1,1]): 245730 runs - 4.10 us/run - 64 kB/run - 14.90 GB/s CUMSUM(type=f32,ne=[4096,8,1,1]): 122865 runs - 8.20 us/run - 256 kB/run - 29.79 GB/s CUMSUM(type=f32,ne=[8192,8,1,1]): 65528 runs - 16.38 us/run - 512 kB/run - 29.82 GB/s CUMSUM(type=f32,ne=[16384,8,1,1]): 40955 runs - 28.69 us/run - 1024 kB/run - 34.04 GB/s CUMSUM(type=f32,ne=[32768,8,1,1]): 24573 runs - 55.28 us/run - 2048 kB/run - 35.33 GB/s CUMSUM(type=f32,ne=[65536,8,1,1]): 16382 runs - 108.50 us/run - 4096 kB/run - 36.00 GB/s CUMSUM(type=f32,ne=[131072,8,1,1]): 8194 runs - 213.75 us/run - 8192 kB/run - 36.55 GB/s CUMSUM(type=f32,ne=[200000,8,1,1]): 5370 runs - 326.31 us/run - 12500 kB/run - 36.54 GB/s CUMSUM(type=f32,ne=[2000000,8,1,1]): 538 runs - 3252.68 us/run - 125000 kB/run - 36.72 GB/s CUMSUM(type=f32,ne=[128,16,1,1]): 303067 runs - 3.32 us/run - 16 kB/run - 4.60 GB/s CUMSUM(type=f32,ne=[1024,16,1,1]): 253921 runs - 4.06 us/run - 128 kB/run - 30.09 GB/s CUMSUM(type=f32,ne=[4096,16,1,1]): 122865 runs - 8.20 us/run - 512 kB/run - 59.57 GB/s CUMSUM(type=f32,ne=[8192,16,1,1]): 65528 runs - 16.38 us/run - 1024 kB/run - 59.63 GB/s CUMSUM(type=f32,ne=[16384,16,1,1]): 40955 runs - 28.69 us/run - 2048 kB/run - 68.09 GB/s CUMSUM(type=f32,ne=[32768,16,1,1]): 24573 runs - 55.28 us/run - 4096 kB/run - 70.67 GB/s CUMSUM(type=f32,ne=[65536,16,1,1]): 12291 runs - 108.50 us/run - 8192 kB/run - 72.02 GB/s CUMSUM(type=f32,ne=[131072,16,1,1]): 6147 runs - 213.60 us/run - 16384 kB/run - 73.17 GB/s CUMSUM(type=f32,ne=[200000,16,1,1]): 4029 runs - 326.04 us/run - 25000 kB/run - 73.15 GB/s CUMSUM(type=f32,ne=[2000000,16,1,1]): 270 runs - 5458.69 us/run - 250000 kB/run - 43.84 GB/s ---- Numbers after: Backend 1/3: CUDA0 Device description: NVIDIA RTX 6000 Ada Generation Device memory: 48510 MB (48039 MB free) CUMSUM(type=f32,ne=[128,128,4,4]): 311258 runs - 3.25 us/run - 2048 kB/run - 601.62 GB/s CUMSUM(type=f32,ne=[2048,16,5,4]): 229390 runs - 4.40 us/run - 5120 kB/run - 1110.14 GB/s CUMSUM(type=f32,ne=[20000,10,4,1]): 37583 runs - 29.67 us/run - 6250 kB/run - 200.89 GB/s CUMSUM(type=f32,ne=[128,1,1,1]): 892819 runs - 1.12 us/run - 1 kB/run - 0.85 GB/s CUMSUM(type=f32,ne=[1024,1,1,1]): 458696 runs - 2.21 us/run - 8 kB/run - 3.45 GB/s CUMSUM(type=f32,ne=[4096,1,1,1]): 376786 runs - 2.66 us/run - 32 kB/run - 11.46 GB/s CUMSUM(type=f32,ne=[8192,1,1,1]): 393168 runs - 2.59 us/run - 64 kB/run - 23.57 GB/s CUMSUM(type=f32,ne=[16384,1,1,1]): 393168 runs - 2.59 us/run - 128 kB/run - 47.15 GB/s CUMSUM(type=f32,ne=[32768,1,1,1]): 376786 runs - 2.69 us/run - 256 kB/run - 90.69 GB/s CUMSUM(type=f32,ne=[65536,1,1,1]): 327640 runs - 3.06 us/run - 512 kB/run - 159.65 GB/s CUMSUM(type=f32,ne=[131072,1,1,1]): 311258 runs - 3.28 us/run - 1024 kB/run - 297.77 GB/s CUMSUM(type=f32,ne=[200000,1,1,1]): 270303 runs - 3.74 us/run - 1562 kB/run - 398.14 GB/s CUMSUM(type=f32,ne=[2000000,1,1,1]): 137472 runs - 7.35 us/run - 15625 kB/run - 2026.94 GB/s CUMSUM(type=f32,ne=[128,4,1,1]): 876437 runs - 1.14 us/run - 4 kB/run - 3.33 GB/s CUMSUM(type=f32,ne=[1024,4,1,1]): 442314 runs - 2.28 us/run - 32 kB/run - 13.39 GB/s CUMSUM(type=f32,ne=[4096,4,1,1]): 155629 runs - 6.69 us/run - 128 kB/run - 18.24 GB/s CUMSUM(type=f32,ne=[8192,4,1,1]): 81910 runs - 12.53 us/run - 256 kB/run - 19.49 GB/s CUMSUM(type=f32,ne=[16384,4,1,1]): 49146 runs - 24.18 us/run - 512 kB/run - 20.20 GB/s CUMSUM(type=f32,ne=[32768,4,1,1]): 65528 runs - 15.34 us/run - 1024 kB/run - 63.66 GB/s CUMSUM(type=f32,ne=[65536,4,1,1]): 73719 runs - 14.76 us/run - 2048 kB/run - 132.35 GB/s CUMSUM(type=f32,ne=[131072,4,1,1]): 65528 runs - 16.01 us/run - 4096 kB/run - 244.07 GB/s CUMSUM(type=f32,ne=[200000,4,1,1]): 64428 runs - 16.51 us/run - 6250 kB/run - 360.97 GB/s CUMSUM(type=f32,ne=[2000000,4,1,1]): 33831 runs - 29.59 us/run - 62500 kB/run - 2016.08 GB/s CUMSUM(type=f32,ne=[128,8,1,1]): 868246 runs - 1.16 us/run - 8 kB/run - 6.59 GB/s CUMSUM(type=f32,ne=[1024,8,1,1]): 442314 runs - 2.28 us/run - 64 kB/run - 26.76 GB/s CUMSUM(type=f32,ne=[4096,8,1,1]): 155629 runs - 6.69 us/run - 256 kB/run - 36.48 GB/s CUMSUM(type=f32,ne=[8192,8,1,1]): 81910 runs - 12.53 us/run - 512 kB/run - 38.97 GB/s CUMSUM(type=f32,ne=[16384,8,1,1]): 49146 runs - 24.17 us/run - 1024 kB/run - 40.41 GB/s CUMSUM(type=f32,ne=[32768,8,1,1]): 24573 runs - 47.53 us/run - 2048 kB/run - 41.10 GB/s CUMSUM(type=f32,ne=[65536,8,1,1]): 16382 runs - 61.25 us/run - 4096 kB/run - 63.77 GB/s CUMSUM(type=f32,ne=[131072,8,1,1]): 32776 runs - 31.79 us/run - 8192 kB/run - 245.82 GB/s CUMSUM(type=f32,ne=[200000,8,1,1]): 32220 runs - 32.90 us/run - 12500 kB/run - 362.35 GB/s CUMSUM(type=f32,ne=[2000000,8,1,1]): 6725 runs - 151.99 us/run - 125000 kB/run - 785.77 GB/s CUMSUM(type=f32,ne=[128,16,1,1]): 851864 runs - 1.18 us/run - 16 kB/run - 12.97 GB/s CUMSUM(type=f32,ne=[1024,16,1,1]): 442314 runs - 2.30 us/run - 128 kB/run - 53.13 GB/s CUMSUM(type=f32,ne=[4096,16,1,1]): 155629 runs - 6.68 us/run - 512 kB/run - 73.13 GB/s CUMSUM(type=f32,ne=[8192,16,1,1]): 81910 runs - 12.68 us/run - 1024 kB/run - 77.00 GB/s CUMSUM(type=f32,ne=[16384,16,1,1]): 40955 runs - 24.56 us/run - 2048 kB/run - 79.53 GB/s CUMSUM(type=f32,ne=[32768,16,1,1]): 24573 runs - 47.52 us/run - 4096 kB/run - 82.21 GB/s CUMSUM(type=f32,ne=[65536,16,1,1]): 12291 runs - 93.44 us/run - 8192 kB/run - 83.62 GB/s CUMSUM(type=f32,ne=[131072,16,1,1]): 16392 runs - 63.36 us/run - 16384 kB/run - 246.68 GB/s CUMSUM(type=f32,ne=[200000,16,1,1]): 16116 runs - 65.25 us/run - 25000 kB/run - 365.53 GB/s CUMSUM(type=f32,ne=[2000000,16,1,1]): 3375 runs - 304.46 us/run - 250000 kB/run - 785.98 GB/s Backend CUDA0: OK Backend 2/3: CUDA1 Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition Device memory: 97250 MB (96677 MB free) CUMSUM(type=f32,ne=[128,128,4,4]): 376786 runs - 2.69 us/run - 2048 kB/run - 727.04 GB/s CUMSUM(type=f32,ne=[2048,16,5,4]): 216282 runs - 4.64 us/run - 5120 kB/run - 1053.30 GB/s CUMSUM(type=f32,ne=[20000,10,4,1]): 32214 runs - 34.21 us/run - 6250 kB/run - 174.27 GB/s CUMSUM(type=f32,ne=[128,1,1,1]): 819100 runs - 1.22 us/run - 1 kB/run - 0.78 GB/s CUMSUM(type=f32,ne=[1024,1,1,1]): 409550 runs - 2.47 us/run - 8 kB/run - 3.09 GB/s CUMSUM(type=f32,ne=[4096,1,1,1]): 303067 runs - 3.31 us/run - 32 kB/run - 9.21 GB/s CUMSUM(type=f32,ne=[8192,1,1,1]): 237539 runs - 4.33 us/run - 64 kB/run - 14.08 GB/s CUMSUM(type=f32,ne=[16384,1,1,1]): 237539 runs - 4.33 us/run - 128 kB/run - 28.17 GB/s CUMSUM(type=f32,ne=[32768,1,1,1]): 188393 runs - 5.37 us/run - 256 kB/run - 45.47 GB/s CUMSUM(type=f32,ne=[65536,1,1,1]): 188393 runs - 5.41 us/run - 512 kB/run - 90.20 GB/s CUMSUM(type=f32,ne=[131072,1,1,1]): 188393 runs - 5.41 us/run - 1024 kB/run - 180.41 GB/s CUMSUM(type=f32,ne=[200000,1,1,1]): 188393 runs - 5.41 us/run - 1562 kB/run - 275.27 GB/s CUMSUM(type=f32,ne=[2000000,1,1,1]): 128880 runs - 7.76 us/run - 15625 kB/run - 1920.33 GB/s CUMSUM(type=f32,ne=[128,4,1,1]): 802718 runs - 1.26 us/run - 4 kB/run - 3.03 GB/s CUMSUM(type=f32,ne=[1024,4,1,1]): 401359 runs - 2.51 us/run - 32 kB/run - 12.18 GB/s CUMSUM(type=f32,ne=[4096,4,1,1]): 139247 runs - 7.51 us/run - 128 kB/run - 16.26 GB/s CUMSUM(type=f32,ne=[8192,4,1,1]): 73719 runs - 14.17 us/run - 256 kB/run - 17.23 GB/s CUMSUM(type=f32,ne=[16384,4,1,1]): 40955 runs - 27.37 us/run - 512 kB/run - 17.84 GB/s CUMSUM(type=f32,ne=[32768,4,1,1]): 40955 runs - 26.33 us/run - 1024 kB/run - 37.10 GB/s CUMSUM(type=f32,ne=[65536,4,1,1]): 40955 runs - 26.19 us/run - 2048 kB/run - 74.59 GB/s CUMSUM(type=f32,ne=[131072,4,1,1]): 40955 runs - 26.35 us/run - 4096 kB/run - 148.26 GB/s CUMSUM(type=f32,ne=[200000,4,1,1]): 42952 runs - 24.18 us/run - 6250 kB/run - 246.51 GB/s CUMSUM(type=f32,ne=[2000000,4,1,1]): 32757 runs - 31.01 us/run - 62500 kB/run - 1923.68 GB/s CUMSUM(type=f32,ne=[128,8,1,1]): 786336 runs - 1.28 us/run - 8 kB/run - 5.95 GB/s CUMSUM(type=f32,ne=[1024,8,1,1]): 393168 runs - 2.57 us/run - 64 kB/run - 23.73 GB/s CUMSUM(type=f32,ne=[4096,8,1,1]): 131056 runs - 7.67 us/run - 256 kB/run - 31.82 GB/s CUMSUM(type=f32,ne=[8192,8,1,1]): 73719 runs - 14.43 us/run - 512 kB/run - 33.84 GB/s CUMSUM(type=f32,ne=[16384,8,1,1]): 40955 runs - 27.90 us/run - 1024 kB/run - 35.01 GB/s CUMSUM(type=f32,ne=[32768,8,1,1]): 24573 runs - 54.63 us/run - 2048 kB/run - 35.75 GB/s CUMSUM(type=f32,ne=[65536,8,1,1]): 16382 runs - 72.24 us/run - 4096 kB/run - 54.08 GB/s CUMSUM(type=f32,ne=[131072,8,1,1]): 20485 runs - 52.66 us/run - 8192 kB/run - 148.37 GB/s CUMSUM(type=f32,ne=[200000,8,1,1]): 21480 runs - 48.00 us/run - 12500 kB/run - 248.42 GB/s CUMSUM(type=f32,ne=[2000000,8,1,1]): 16140 runs - 61.99 us/run - 125000 kB/run - 1926.51 GB/s CUMSUM(type=f32,ne=[128,16,1,1]): 786336 runs - 1.28 us/run - 16 kB/run - 11.90 GB/s CUMSUM(type=f32,ne=[1024,16,1,1]): 393168 runs - 2.57 us/run - 128 kB/run - 47.57 GB/s CUMSUM(type=f32,ne=[4096,16,1,1]): 131056 runs - 7.65 us/run - 512 kB/run - 63.83 GB/s CUMSUM(type=f32,ne=[8192,16,1,1]): 73719 runs - 14.42 us/run - 1024 kB/run - 67.74 GB/s CUMSUM(type=f32,ne=[16384,16,1,1]): 40955 runs - 27.87 us/run - 2048 kB/run - 70.09 GB/s CUMSUM(type=f32,ne=[32768,16,1,1]): 24573 runs - 54.54 us/run - 4096 kB/run - 71.63 GB/s CUMSUM(type=f32,ne=[65536,16,1,1]): 12291 runs - 107.53 us/run - 8192 kB/run - 72.66 GB/s CUMSUM(type=f32,ne=[131072,16,1,1]): 10245 runs - 105.10 us/run - 16384 kB/run - 148.70 GB/s CUMSUM(type=f32,ne=[200000,16,1,1]): 10744 runs - 95.36 us/run - 25000 kB/run - 250.11 GB/s CUMSUM(type=f32,ne=[2000000,16,1,1]): 5400 runs - 186.97 us/run - 250000 kB/run - 1279.90 GB/s
Author
Committer
Parents
Loading