    quant_type: Literal["fp4", "nf4"] = "fp4",
    quant_storage=torch.uint8,
) -> Tuple[torch.Tensor, QuantState]:
    if blocksize is None:
        blocksize = 64 if not HIP_ENVIRONMENT else 128
Is there a short explanation we can add here to explain why this is the default, and likewise below why 64 is not supported?
It's because of the warp size difference between AMD and NVIDIA GPUs. I have added comments - 410f499
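For readers skimming the thread: NVIDIA GPUs execute threads in warps of 32, while AMD GPUs use wavefronts of 64, which is why the minimum/default quantization block is larger on ROCm. A minimal illustrative sketch of the rule (the helper name is made up; only the 64-vs-128 choice mirrors the diff above):

```python
# Illustrative sketch only; the helper below is not bitsandbytes source code.
NVIDIA_WARP_SIZE = 32    # threads per warp on NVIDIA GPUs
AMD_WAVEFRONT_SIZE = 64  # threads per wavefront on AMD GPUs

def default_4bit_blocksize(hip_environment: bool) -> int:
    """Mirror `blocksize = 64 if not HIP_ENVIRONMENT else 128`.

    With the wider AMD wavefront, the smallest supported block doubles,
    which is also why blocksize=64 is not supported on ROCm.
    """
    return 128 if hip_environment else 64
```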
Issue #149: can Intel Arc GPUs be supported in a similar manner?
Thanks! This looks promising.
I will try on both AMD and Intel Arc.
Dear @pnunna93,
Thanks to you and your team for the amazing work. We're super excited about this, and I'm very happy with what I'm seeing on an initial, superficial review.
It would be great to have the AMD runner available relatively soon; otherwise it remains quite messy and work-intensive to keep track of the correctness of the various backend implementations. Please let me know what I can do to help and I'll make sure to pull the right strings.
Regarding the review, as communicated in Slack, I first have to focus on wrapping up my deep dive into evaluating tensor-driven dispatch by integrating with the PyTorch dispatcher via the torch.library APIs. I don't see any reason not to merge your PR, but I need to take another thorough look. I also think it would be helpful for everyone to have clarity on the backend abstraction / dispatch mechanism asap, so I'm prioritizing that; everyone can then refactor their code to account for it.
In that context, one important question came up:
Our paged optimizers use CUDA unified memory, as described in detail here.
Is that feature available on ROCm devices in one way or another? This would be quite important to understand for my analysis, as the handling of unified memory in relation to PyTorch is one of my last open questions. It's quite a special case: it's a cornerstone of preventing OOMs in low-resource environments (a key feature for our user group) and is not implemented or accounted for in PyTorch, so we use that feature directly through the CUDA-related APIs. The underlying CUDA function is cudaMemPrefetchAsync, AFAICT.
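For context, a minimal sketch of the managed-memory pattern in question, calling the CUDA runtime directly via ctypes. The cudaMallocManaged / cudaMemPrefetchAsync entry points are the standard CUDA runtime API; everything else (library path, helper name) is illustrative and not bitsandbytes code. HIP documents analogous calls (hipMallocManaged / hipMemPrefetchAsync), and whether they cover this paged-optimizer use case on ROCm is exactly the open question above:

```python
import ctypes

# Illustrative only: allocate unified memory and prefetch it to a device,
# which is the pattern the paged optimizers rely on.
cudart = ctypes.CDLL("libcudart.so")  # may need a versioned name, e.g. libcudart.so.12

def managed_prefetch_demo(nbytes: int, device: int = 0) -> None:
    ptr = ctypes.c_void_p()
    # cudaMallocManaged(void** devPtr, size_t size, unsigned int flags); 1 == cudaMemAttachGlobal
    err = cudart.cudaMallocManaged(ctypes.byref(ptr), ctypes.c_size_t(nbytes), 1)
    assert err == 0, f"cudaMallocManaged failed: {err}"
    # cudaMemPrefetchAsync(const void* devPtr, size_t count, int dstDevice, cudaStream_t stream)
    err = cudart.cudaMemPrefetchAsync(ptr, ctypes.c_size_t(nbytes), device, None)
    assert err == 0, f"cudaMemPrefetchAsync failed: {err}"
    cudart.cudaFree(ptr)
```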
Thanks, and congrats on the great work in this PR; we're super excited about this ❤️
Dear @pnunna93 et al,
Unfortunately we're (mostly me alone) quite resource-constrained and humbled by the workload associated with the multi-backend-refactor. I just talked with my colleague @younesbelkada about how best to handle the next steps.
We both took a look at this PR and the one from Intel and think that, at first glance, everything looks really good. At this time, neither Younes nor I is in a position to give detailed feedback, and I need to focus on concretizing the path forward for integrating with the PyTorch dispatcher (tensor-driven dispatch, as requested) through the torch.library Python-level APIs. After extensive research and yesterday's consultation with three PyTorch devs at Meta who are experts on the topic, I need to focus on making this new input concrete.
However, for the purpose of iterative progress (as agreed in our prior conversations), we've decided to go ahead and merge both the open Intel and AMD branches into multi-backend-refactor, where interested parties can then compile from source and give the new functionality (we're so excited and grateful about this!) a thorough testing.
Once we've made some progress on the torch.library-based refactor, I'll next focus on enabling nightly releases for that branch as well. We're also looking forward to your feedback on this torch.library / tensor-driven dispatch topic once the code is there as a basis for discussion (and for refactoring the backend-specific code towards that new target, after we've agreed with all of you that this is the right path).
Among other things, there's also been extensive ongoing work in the background on moving BNB to a new independent/non-profit GitHub org, but under the umbrella of Hugging Face and with the support of their infra team for managing the complexities of the CI/CD backend and runners. We're also working to make GitHub runners for the different hardware platforms a reality (thanks for your help on that!).
Thanks again for the good work and active collaboration! ❤️
P.S. Also see this: README: asking for help from volunteer alpha testers
Let us know if you have further thoughts on this and how you think we should best communicate about it.
Overview
This PR introduces bitsandbytes enablement on ROCm for AMD GPUs. It adds hipified versions of the CUDA kernels and ops, which allows bitsandbytes API calls to be routed to optimized HIP kernels on AMD GPUs.
In the multi-backend-refactor branch, there is a proposal to separate the various backends in order to support multiple GPUs/accelerators. The core of bitsandbytes is built on top of PyTorch and selects the implementation for each GPU/accelerator based on the device_type of the tensor, as highlighted here. PyTorch on ROCm reports AMD GPUs under the cuda device type, so applications run seamlessly without any changes to application code. Hence, this PR updates the cuda backend in bitsandbytes to enable its functionality on ROCm for AMD GPUs. It also adds ROCm support to the CMake build and enables key bitsandbytes functionality on AMD GPUs.
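As an aside, here is a minimal sketch of what device_type-driven routing looks like; the class and registry names are illustrative, not the actual bitsandbytes backend interface:

```python
import torch

class CUDABackend:
    # On ROCm builds the same entry serves AMD GPUs, since PyTorch reports them as "cuda".
    def quantize_4bit(self, A: torch.Tensor):
        ...  # hand off to the CUDA/HIP kernels

backends = {"cuda": CUDABackend()}  # other backends ("cpu", "xpu", ...) would register here

def dispatch_quantize_4bit(A: torch.Tensor):
    backend = backends[A.device.type]  # "cuda" on both NVIDIA and AMD/ROCm
    return backend.quantize_4bit(A)
```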
Summary of Changes
Updated CUDA backend to work seamlessly on ROCm
Integrated HIP environment into bitsandbytes through hipified versions of CUDA kernels and ops
CMake build updates for ROCm
Enabled key features in the bitsandbytes functional and autograd APIs
Impact
This PR enables building and supporting bitsandbytes on ROCm for AMD GPUs. bitsandbytes users can port applications smoothly to AMD GPUs, since minimal changes are required on their side. It also ensures that the ROCm changes do not affect the CUDA environment, so existing CUDA users are unaffected.
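A hedged usage sketch of the unchanged user-side code path, assuming the current bitsandbytes.functional names; blocksize=None resolves to the platform default shown in the diff above:

```python
import torch
import bitsandbytes.functional as F

# On a ROCm build of PyTorch, AMD GPUs still appear as device "cuda",
# so this snippet is identical on NVIDIA and AMD hardware.
A = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# 4-bit NF4 quantization round trip; blocksize=None picks 64 on CUDA, 128 on ROCm.
qA, state = F.quantize_4bit(A, blocksize=None, quant_type="nf4")
A_deq = F.dequantize_4bit(qA, state)
print("mean abs error:", (A - A_deq).abs().mean().item())
```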
CC: @Titus-von-Koeller @matthewdouglas @arlo-phoenix