bitsandbytes
BitsandBytes Enablement on ROCm
#1207
Merged


pnunna93
pnunna93 commented 1 year ago

Overview

This PR introduces bitsandbytes enablement on ROCm for AMD GPUs. It adds hipified versions of the CUDA kernels and ops, which allow bitsandbytes API function calls to be routed to optimized HIP kernels on AMD GPUs.

In the multi-backend-refactor branch, there is a proposal to separate the various backends to support multiple GPUs/accelerators. The core of bitsandbytes is built on top of PyTorch and selects the API implementation for an individual GPU/accelerator based on the device_type of the tensor, as highlighted here. ROCm registers as the cuda device type in PyTorch and runs seamlessly without any changes to application code. Hence, this PR updates the cuda backend in bitsandbytes to enable its functionality on ROCm for AMD GPUs. It also adds ROCm support to the CMake build and enables key bitsandbytes functionality on AMD GPUs.
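As a concrete illustration of this device_type-based routing, here is a minimal sketch (not the library's actual dispatch code; torch.version.hip is the standard way to tell a ROCm build of PyTorch apart from a CUDA build):

```python
import torch

def select_backend(tensor: torch.Tensor) -> str:
    # On ROCm builds of PyTorch, AMD GPUs report device.type == "cuda",
    # so the same branch serves both NVIDIA and AMD; the HIP kernels are
    # chosen at build time, not here.
    if tensor.device.type == "cuda":
        return "cuda"
    return tensor.device.type  # e.g. "cpu"

# torch.version.hip is set (and torch.version.cuda is None) on ROCm builds.
print("ROCm build:", torch.version.hip is not None)
```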

Summary of Changes

  • Updated CUDA backend to work seamlessly on ROCm

  • Integrated HIP environment into bitsandbytes through hipified versions of CUDA kernels and ops

  • CMake build updates for ROCm (see the build sketch after this list)

  • Enabled key features in the bitsandbytes functional and autograd APIs
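
For reference, the build flow enabled by the CMake changes looks roughly like this (a hedged sketch; the repository URL, branch, and the COMPUTE_BACKEND / BNB_ROCM_ARCH options should be verified against the README updated in this PR):

```sh
# Illustrative from-source build of bitsandbytes on a ROCm system.
git clone -b multi-backend-refactor https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
pip install -r requirements-dev.txt
cmake -DCOMPUTE_BACKEND=hip -S .   # select the ROCm/HIP backend instead of CUDA
# optionally target specific GPU architectures, e.g.:
#   cmake -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH="gfx90a;gfx942" -S .
make
pip install .
```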

Impact

This PR enables building and supporting bitsandbytes on ROCm for AMD GPUs. bitsandbytes users can port applications smoothly onto AMD GPUs, since minimal changes are required on their side. It also ensures that the ROCm changes do not affect the CUDA environment, so existing CUDA users are unaffected.

CC: @Titus-von-Koeller @matthewdouglas @arlo-phoenix

Lzy17 hipify the csrc repo
2e10f678
Lzy17 hipify pythoninterface
19289600
Lzy17 copy from agrocylo
8ca0b5ca
Lzy17 hipify cuparse and cublas calls
8acbcf24
Lzy17 fix compile error and Makefile
e80a60cd
Lzy17 fixed runtime error (low accuracy)
fb780a0a
Lzy17 FIX LOW ACCURACY
1048264b
Lzy17 Update README.md
c3300208
Lzy17 add benchmarks
fcee2d66
Lzy17 Update README.md
4c0ca08a
jpvillam-amd First draft, getting error
c7986168
jpvillam-amd Small transform fix, still errors on igemm
37045e51
pnunna93 create HIP_ENVIRONMENT variable
524fa573
pnunna93 Skip failing tests on rocm
d7f7a829
pnunna93 Add default value for HIP_ENVIRONMENT
28b80564
amathews-amd Merge pull request #1 from ROCmSoftwarePlatform/skip_rocm_failing_tests
9dca4fa3
pnunna93 skip failing triton tests on rocm
38c934ed
amathews-amd Merge pull request #2 from ROCmSoftwarePlatform/skip_triton
71bf2df6
pnunna93 Enable col to row transformation
657ca4bf
pnunna93 Add make functions for row to col transformation
a390e0c4
pnunna93 Update get_transform_buffer for row to col in HIP
99ad6b57
pnunna93 Update igemmlt for col format
039b8086
pnunna93 Unskip test_igemmlt_int on ROCm
1a052ee3
pnunna93 Update igemmlt_int test for col inputs
b7ca5cf7
pnunna93 Skip transpose igemmlt test on ROCm
a2cd90d1
pnunna93 Revert "Update igemmlt_int test for col inputs"
5b6c5ac3
pnunna93 Return nvidia_transform from transform for HIP
218bf662
pnunna93 Fix syntax error
8bb5c2f7
pnunna93 Add comment for shape change
eb2edf7e
pnunna93 Enable nvidia_transform tests
a38ea0fd
pnunna93 Merge branch 'fix_igemmlt_int' of https://github.com/pnunna93/bitsand…
fbacd7ac
pnunna93 Enable igemmlt_half tests
67c383bc
pnunna93 Revert col32 check in nvidia_transform test
42b860f3
amathews-amd Merge pull request #3 from pnunna93/fix_igemmlt_int
7198d6bb
pnunna93 Merge remote-tracking branch 'upstream/main' into IFU-master-2024-01-24
b1d484aa
Lzy17 Update README.md
c36085d6
pnunna93 Update hip files with upstream changes
0e91e481
pnunna93 Skip failing tests for now
1295d53c
amathews-amd Merge pull request #4 from ROCm/IFU-master-2024-01-24
48b7fa9a
iiisak ops.hip: adapt to enum naming changes in ROCm/hipBLASLt@95131d6 and R…
f1a0b8b3
Lzy17 fix wmma api parity
a84c369a
Lzy17 hipify wmma datatype
b044010a
pnunna93 Enable estimate quantile tests
7aa42bee
pnunna93 Merge pull request #5 from iiisak/rocm_enabled
85377e16
amathews-amd Merge pull request #7 from ROCm/fix_estimate_quantiles
ffb0c5db
Lzy17 Merge pull request #6 from ROCm/rocwmma_merge
2b77380c
pnunna93 Enable transpose flag for row to col transform
fad79188
pnunna93 Update descriptors for transpose flag
e3021ee0
pnunna93 revert nvidia_transform to transform
8c3476f2
update changes
5e1b152d
pnunna93 Merge pull request #8 from ROCm/enable_transform_with_transpose
386e16c2
fixed minor mistakes
389bb7d0
pnunna93 Merge pull request #9 from ROCm/rocm_enabled_fix_bfloat16
b6770bff
pnunna93 remove blocksize 64 on rocm
fa288281
pnunna93 remove block size 64 and enable remaining tests
d86d24cb
pnunna93 Fix cuda build errors
cf4a5066
pnunna93 remove workspace in igemmlt
70771956
pnunna93 Enabled igemmlt in matmul
ec32fc1c
pnunna93 Fix shape issue in transform function
4536b251
pnunna93 Enable igemmlt int8 output
66e34c18
pnunna93 Add col format for extract outliers
7e5e2231
pnunna93 Enable dequant_mm
2e42adb8
pnunna93 Enable matmullt tests
e32d2770
pnunna93 Enabled linear_serialization tests
8206bd18
pnunna93 fix error with dequant_mm change
973a9f8c
pnunna93 Enable extract outliers test
387a9b79
pnunna93 Enable test overflow
93dfb51a
pnunna93 Skip overflow and linear serialization for now
90bbdc60
pnunna93 Merge pull request #10 from ROCm/remove_blocksize_64
9890d5d4
pnunna93 Merge pull request #11 from ROCm/fix_cuda_build_errs
1b6dd482
pnunna93 Merge pull request #12 from ROCm/igemm_workspace
fc9bf4d7
pnunna93 Merge pull request #13 from ROCm/enable_matmul
f30dc38d
improve the gemv 4bit accuracy by forcing the hipcub to 32
3dc14e85
Lzy17 Merge pull request #14 from ROCm/fix_gemv_4bit
f4ac9ac1
pnunna93 Update skip comment
485ba8f8
pnunna93 Merge pull request #15 from ROCm/gemv_skip_comment
a36bd1d2
pnunna93 Merge remote-tracking branch 'upstream/main' into IFU-master-2024-03-28
a551c160
update instructions
a2672217
amathews-amd Merge pull request #19 from ROCm/updated_readme
bcdcc0b4
pnunna93 Update README.md
ff333714
pnunna93 Merge branch 'rocm_enabled' into IFU-master-2024-03-28
1157e734
pnunna93 fix PEP errors
702ca1ae
pnunna93 Fix typos
8c23dc01
pnunna93 Merge branch 'IFU-master-2024-03-28' of https://github.com/ROCm/bitsa…
971f4b1d
pnunna93 Fix formatting in README file
4d6408a6
pnunna93 Update gpu arch setting
79cb5548
pnunna93 Add ROCM_PATH variable
5c0414e2
pnunna93 Add HIP_VERSION variable
47795f55
pnunna93 Add BNB_HIP_VERSION variable
6d904524
pnunna93 Update supports igemmlt based on HIP version
049a2dc5
pnunna93 Skip failing tests based on HIP version
47a0bc3b
pnunna93 pre-commit fixes
1b2a0951
pnunna93 Update README file
4515a218
pnunna93 Update default arch list
e7ef75fc
pnunna93 update readme
c0d244c9
lcskrishna Merge pull request #17 from ROCm/IFU-master-2024-03-28
c037a306
pnunna93 Merge remote-tracking branch 'TD_BnB/multi-backend-refactor' into dev…
73f4f059
pnunna93 update igemmlt for hip
79652a58
pnunna93 Update mm_dequant for hip
aedfa8fa
pnunna93 Update transform function for hip
7835282a
adding arch detection for test_gemv_eye_4bit
60d7560a
implement get_rocm_gpu_arch
cae33c38
fixing lint
da53f39f
fixing lint
ae4dcec5
correct lint error
21d5ff60
pnunna93 Merge pull request #21 from ROCm/rocm_enabled_arch_detect
5bada9ba
pnunna93 Merge branch 'rocm_enabled' into device_abstraction
01abfdeb
lcskrishna update extract_outliers, quantize_4bit, dequantize_4bit
765bfc83
lcskrishna minor fixes for extract_outliers
d00c026a
lcskrishna update blocksizes for quantize and dequantize
e5574bdc
Merge branch 'rocm_enabled' of https://github.com/ROCm/bitsandbytes i…
a00bd1f2
lcskrishna update reg expression for detecting arch
7ab3a054
lcskrishna linter updates
9cd1d8c7
lcskrishna Merge branch 'device_abstraction' into cl/update-device-abs
62f8ed96
pnunna93 Merge pull request #23 from ROCm/cl/update-device-abs
d9e48034
pnunna93 Merge remote-tracking branch 'upstream/multi-backend-refactor' into d…
2af8568d
pnunna93 skip linear no igemmlt test
06f6b251
pnunna93 Remove archive functional file
2359452d
pnunna93 Sync README with upstream
f76d6abc
pnunna93 Remove bnb_accuracy file
576b62cd
pnunna93 Remove cuda_setup
dfb531b7
pnunna93 Remove test_delete_later.c
31b1cbc5
pnunna93 Sync with upstream
ed774769
pnunna93 Sync files with upstream
943c57a2
pnunna93 Fix lint errors
71d17023
pnunna93 Exclude hip files from typo checks
6886bc8f
pnunna93 update ops.hip
0d445f4f
lcskrishna Merge pull request #27 from ROCm/dev_abs_IFU
bc6d0b7a
pnunna93 Add install steps for ROCm
15c7f779
pnunna93 Fix lint error
d62c8358
lcskrishna Merge pull request #28 from ROCm/dev_abs_add_install_steps
8aae7c95
Titus-von-Koeller assigned Titus-von-Koeller 1 year ago
matthewdouglas
matthewdouglas commented on 2024-05-14
bitsandbytes/backends/cuda.py

```diff
     quant_type: Literal["fp4", "nf4"] = "fp4",
     quant_storage=torch.uint8,
 ) -> Tuple[torch.Tensor, QuantState]:
+    if blocksize is None:
+        blocksize = 64 if not HIP_ENVIRONMENT else 128
```
matthewdouglas commented 1 year ago

Is there a short explanation we can add here to explain why this is the default, and likewise below why 64 is not supported?

pnunna93 commented 1 year ago

It's because of the warp size difference between AMD and NVIDIA GPUs. I have added comments - 410f499
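
To make the warp-size point concrete, a small sketch of the rule in the diff above (illustrative only; the helper and the validation list are hypothetical, but the 32-wide NVIDIA warp vs. 64-wide AMD wavefront is standard):

```python
NVIDIA_WARP_SIZE = 32    # CUDA warp
AMD_WAVEFRONT_SIZE = 64  # ROCm wavefront is twice as wide, so the minimum
                         # quantization blocksize doubles from 64 to 128.

def resolve_blocksize(blocksize=None, hip_environment=False):
    # Mirrors the default in the diff above.
    if blocksize is None:
        return 64 if not hip_environment else 128
    # Hypothetical validation illustrating why 64 is rejected on ROCm.
    valid = [4096, 2048, 1024, 512, 256, 128] + ([64] if not hip_environment else [])
    if blocksize not in valid:
        raise ValueError(f"blocksize {blocksize} is not supported on this backend")
    return blocksize
```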

tpimh
tpimh commented 1 year ago

Issue #149: can Intel Arc GPUs be supported in a similar manner?

pnunna93 Add comments for HIP changes
410f4998
matthewdouglas
matthewdouglas commented 1 year ago

> Issue #149: can Intel Arc GPUs be supported in a similar manner?

@tpimh There's separate work in progress for Intel. So far there's been work on CPU with IPEX (#1178, #1206) and separately a SYCL port: #747.

tpimh
tpimh commented 1 year ago

Thanks! This looks promising.

I will try on both AMD and Intel Arc.

Titus-von-Koeller
Titus-von-Koeller commented 363 days ago

Dear @pnunna93,

thanks to you and your team for the amazing work. We're super excited about this and I'm very happy with what I'm seeing at an initial superficial review.

It would be great to have the AMD runner available relatively soon; otherwise it remains quite messy and work-intensive to keep track of the correctness of the various backend implementations. Please let me know what I can do to help, and I'll make sure to pull the right strings.

Regarding the review, as communicated in Slack, I have to first focus on wrapping up my deep dive into evaluating tensor-driven dispatch by integrating with the PyTorch dispatcher via the torch.library APIs. I don't see any reason not to merge your PR, but I need to take another thorough look. I also think it would be helpful for everyone to have clarity on the backend abstraction / dispatch mechanism asap, so everyone can then refactor their code to account for it; I am therefore prioritizing that.
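
For readers unfamiliar with the torch.library approach mentioned above, a minimal sketch of tensor-driven dispatch (the op namespace and signature are hypothetical, not bitsandbytes code; Library/define/impl are the public PyTorch APIs):

```python
import torch

lib = torch.library.Library("bnb_demo", "DEF")  # hypothetical namespace
lib.define("dequant(Tensor x) -> Tensor")

def dequant_cuda(x: torch.Tensor) -> torch.Tensor:
    # Registered under the "CUDA" dispatch key; on ROCm builds of PyTorch
    # this same key serves AMD GPUs, mirroring how this PR reuses the
    # cuda backend for HIP.
    return x.float()

def dequant_cpu(x: torch.Tensor) -> torch.Tensor:
    return x.float()

lib.impl("dequant", dequant_cuda, "CUDA")
lib.impl("dequant", dequant_cpu, "CPU")

# The op is now dispatched based on the device of its tensor arguments:
y = torch.ops.bnb_demo.dequant(torch.ones(4))
```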

In that context, one important question came up:

Our paged optimizers use CUDA unified memory, as described in detail here.

Is that feature available on ROCm devices in one way or another? This would be quite important to understand for my analysis, as the handling of unified memory in relation to PyTorch is one of my last open questions. It's quite a special case: it's a cornerstone of preventing OOMs in low-resource environments -- a key feature for our user group -- and is not implemented/accounted for in PyTorch, so we use it directly through CUDA-related APIs. The underlying CUDA function is cudaMemPrefetchAsync, AFAICT.
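
For what it's worth, HIP does document unified-memory counterparts to these CUDA calls, hipMallocManaged and hipMemPrefetchAsync. A hedged ctypes sketch, assuming a standard ROCm install with libamdhip64 on the loader path:

```python
import ctypes

# Illustrative check of HIP's unified-memory APIs from Python; not bitsandbytes code.
hip = ctypes.CDLL("libamdhip64.so")

ptr = ctypes.c_void_p()
size = 1 << 20          # 1 MiB
hipMemAttachGlobal = 1  # flag value per hip_runtime_api.h

# hipError_t hipMallocManaged(void** dev_ptr, size_t size, unsigned int flags)
err = hip.hipMallocManaged(ctypes.byref(ptr), ctypes.c_size_t(size), hipMemAttachGlobal)
assert err == 0, f"hipMallocManaged failed: {err}"

# hipError_t hipMemPrefetchAsync(const void* dev_ptr, size_t count, int device, hipStream_t stream)
err = hip.hipMemPrefetchAsync(ptr, ctypes.c_size_t(size), 0, None)  # prefetch to device 0
assert err == 0, f"hipMemPrefetchAsync failed: {err}"

hip.hipFree(ptr)
```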

Thanks πŸ€— and congrats on the great work in this PR, we're super excited about this ❀️

Titus-von-Koeller
Titus-von-Koeller commented 363 days ago

Dear @pnunna93 et al,

Unfortunately we're (mostly me alone) quite resource-constrained and humbled by the workload associated with the multi-backend-refactor. I just talked with my colleague @younesbelkada about how best to handle the next steps.

We both took a look at this PR and the one from Intel, and at first glance everything looks really good. At this time, neither Younes nor I am in a position to give detailed feedback, and I need to focus on concretizing the path forward on integrating with the PyTorch dispatcher (tensor-driven dispatch, as requested) through the torch.library Python-level APIs. After extensive research and yesterday's consultation with three PyTorch devs at Meta who are experts on the topic, I need to focus on making this new input concrete.

However, for the purpose of iterative progress (as agreed in our prior conversations), we've decided to go ahead and merge both the open Intel and AMD branches into multi-backend-refactor, where interested parties can then compile from source and give the new functionality (which we're so excited and grateful about!) thorough testing.

Once we've made some progress on the torch.library-based refactor, I'll next focus on enabling nightly releases for that branch as well. We're also looking forward to your feedback on this torch.library / tensor-driven dispatch topic once the code is there as a basis for discussion (and for refactoring the backend-specific code toward that new target, once we've agreed with all of you that this is the right path).

Among other things, there's also been extensive ongoing work in the background on things like moving BNB to a new independent/non-profit GitHub org, under the umbrella of Hugging Face and with the support of their infra team for managing the complexities of the CI/CD backend and runners. We're also working to make GitHub runners for the different hardware platforms a reality (thanks for your help on that!).

Thanks again for the good work and active collaboration! ❀️ πŸš€

Titus-von-Koeller merged eb3b816e into multi-backend-refactor 363 days ago
Titus-von-Koeller
Titus-von-Koeller commented 363 days ago (edited)

P.S. Also see this: README: asking for help from volunteer alpha testers

Let us know if you have further thoughts on this and how you think it's best to communicate about this.
