[JS/WebGPU] Improve MatMulNBits perf #19974
satyajandhyala
marked this pull request as ready for review 2 years ago
satyajandhyala
changed the title from "[WIP][JS/WebGPU] Improve MatMulNBits perf" to "[JS/WebGPU] Improve MatMulNBits perf" 2 years ago
Improve perf
9ee1edfd
Fix lint error.
fdbe3e34
Format
f2cf7345
Changes to make any combination of components work.
cb7256c6
Perform blockwise matmul
8e196cf0
format
8a87b0ca
Fixed some errors.
d4896808
Added workgroupSize and dispatchGroup.
09c9acac
Use bit operations instead of multiplications and divisions
57583401
Added maxComputeWorkgroupSizes function to get retrieve workgroup siz…
68ae511b
Added batch dim
688cc795
Added batch support
8ac464cc
Removed separate reduce step.
20863096
minor fix
cfd49ccb
WIP: adding components.
42f3ebbc
Format
9fe360ea
Added outputNumber back.
06285942
Only the leading shader in the workgroup needs to write output.
9cbd993c
Prefetch necessary input tensor data
1c5b7a18
Unroll innermost loops to reduce loop overhead
a4ade113
Removed function call overhead.
76926d02
Added getMaxWorkgroupStorageSize
1cc10115
Compute workgroupSizeX as multiple of nBlocksPerCol
14243243
Removed unused uniforms.
7842a41a
Removed outputNumber
ec871fc8
Removed block_size variable
18cac07d
Choose components based on memory availability and produced fatal error
681b9938
Reroll the last loop nest
5b8bbb4b
Added fallback option to blockwise matmulnbits
2a702ef6
Removed unused variable.
1dc620ae
typo
19cd478a
Temporary commit
1f990080
Code optimization and clean up.
7ac51249
Modified getMaxComponents to accept arbitrary number of arguments.
81868ef4
Added rectangular output testcases.
ba93c4bf
Prefer using BlockwiseMatMulNBits.
8a7bc25b
Removed workgroup shared memory initialization to 0.
6bf16612
Performance tuning
6fd81d6c
Removed pre-fetching input data.
4da4a8e2
Re-roll the for loops.
f4de76ab
Prefer additions over multiplications.
df3688ea
Fixed hint for the fallback
8c78dae5
Use unpack4xU8
e3c858e3
Load 8 elements of input at a time
1dd7a882
Fixed zero_point offset calculation.
4e9fd96a
Use near multiple of 4 when calculating components.
bd9fc91f
Deal with odd numbers.
811ce128
Renamed variables m and n to row and col
42e43223
Added processOneBlock to refactor code.
95ded112
Added bBlocksPerCol and blobSize to attributes to avoid recalculating.
ce73fc37
Added missing semicolon
63d13244
Simplified component calculation
5d37de2d
Cleaned-up uniforms
515091fc
Removed backup file added by mistake
087afc92
minor change
56a14290
Revert "Added bBlocksPerCol and blobSize to attributes to avoid recal…
546c26ed
Reverted changes to getMaxComponents.
a59d736d
Format
38e501ee
guschmue
approved these changes
on 2024-04-12
satyajandhyala
deleted the sajandhy/webgpu_matmulnbits_perf branch 1 year ago
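For readers unfamiliar with the operator, several commits above ("Perform blockwise matmul", "Use bit operations instead of multiplications and divisions", "Load 8 elements of input at a time") can be illustrated with a small CPU reference. This is only a sketch under stated assumptions, not the WGSL shader from this PR: the names `unpackNibbles` and `matMulNBits4` are hypothetical, the weights are assumed to be 4-bit values packed two per byte (low nibble first) in a `[N, nBlocksPerCol, blobSize]` layout, the zero point is assumed to default to 8, and `K` is assumed to be a multiple of `blockSize` with `blockSize` a multiple of 8.

```typescript
// Hypothetical CPU reference for a blockwise 4-bit matmul (illustration only).

// Split a 32-bit word into eight 4-bit values using only shifts and masks.
function unpackNibbles(word: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < 8; i++) {
    out.push((word >>> (4 * i)) & 0xf);
  }
  return out;
}

// y[row, col] = sum over k of a[row, k] * (q[col, k] - zeroPoint) * scale[col, block(k)]
function matMulNBits4(
  a: Float32Array, b: Uint8Array, scales: Float32Array,
  M: number, N: number, K: number, blockSize: number, zeroPoint = 8,
): Float32Array {
  const nBlocksPerCol = K / blockSize; // assumes K % blockSize === 0
  const blobSize = blockSize >> 1;     // bytes per block: two weights per byte
  const y = new Float32Array(M * N);
  for (let row = 0; row < M; row++) {
    for (let col = 0; col < N; col++) {
      let sum = 0;
      for (let block = 0; block < nBlocksPerCol; block++) {
        const scale = scales[col * nBlocksPerCol + block];
        const bOffset = (col * nBlocksPerCol + block) * blobSize;
        const aOffset = row * K + block * blockSize;
        let blockSum = 0;
        // Consume 4 bytes (8 weights) per iteration.
        for (let w = 0; w < blobSize; w += 4) {
          const o = bOffset + w;
          const word = b[o] | (b[o + 1] << 8) | (b[o + 2] << 16) | (b[o + 3] << 24);
          const q = unpackNibbles(word);
          for (let i = 0; i < 8; i++) {
            blockSum += a[aOffset + 2 * w + i] * (q[i] - zeroPoint);
          }
        }
        // Apply the per-block scale once per block instead of once per element.
        sum += blockSum * scale;
      }
      y[row * N + col] = sum;
    }
  }
  return y;
}
```

Factoring the scale out of the inner loop is a choice made for this sketch; whether the shader in this PR does the same is not stated in the commit titles.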