PR #6864 Add required libcuda.so

sdake298 days ago (edited 298 days ago)

It is necessary to add libcuda.so so that vllm will run against rebuilds of pytorch. I built my own version of pytorch (v2.3.1) to workaround an issue with GPU compute versions to run Neural Magic's dynamic quantization model. The reason for my recompile, although not super relevant, was the related to BLAS in PyTorch failing with and A40 (compute 8.6) and only expecting compute>8.9, which is unrelated to this issue.

I then tried something less exoticfacebookresearch/opt125m, and a problem was exposed showing that cuTensorMapEncodeTiled could not be loaded as part of import vllm._C. The problem resulted in failure of the llvm python process. The cuTensorMapEncodedTiled symbol is provided by cutlass and used by pytorch. the
_C.abi3.so file shows the symbol is undefined (nm -g /home/sdake/v-llm/lib/python3.11/site-packages/vllm/_C.abi3.so | grep cuTensorMapEncodeTiled shows U meaning the symbol isn't available via the system's dynamic loader and this is caused by the symbol not being linked into the dymamic library. I tried patchelf to add the dynamic library to _C.abi3.so which works.

I could use the patchelf workaround, however, a more pressing problem is that when cloning pytorch v2.3.1, the symbol is consumed. I am not clear why this doesn't show up in broader community testing.

Running this build script with this Dockerfile will reproduce the problem nicely. After building, a pytorch wheel is output to ${CWD}/target/torch-2.3.1-cp311-cp311-linux_x86_64.whl

Add required libcuda.so

9efb1d63

github-actions298 days ago

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which consists a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of default ones by unblocking the steps in your fast-check build on Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

Comment /ready on the PR
Add ready label to the PR
Enable auto-merge.

🚀

sdake298 days ago

/ready

github-actions added ready

cliffwoolley226 days ago

This looks correct to us on the NVIDIA side. We're testing it locally and @kushanam will report back. Thanks @sdake !

cliffwoolley226 days ago

LGTM

sdake224 days ago

fwiw I am now uncertain my conclusion on the is PR was formed correctly. This shared object should only be needed if the cuda driver library is in use. When the cuda runtime library is in use, the cuda library dependency comes in another way. The reason for my doubt is that I didn’t clearly understand what was dependent upon the cuda driver library. Ideally we would always use the cuda runtime library and the cuda driver library would not be part of the link. Cliff, does this match your understanding?

…

On Mon, Oct 7, 2024 at 9:47 PM Cliff Woolley ***@***.***> wrote: LGTM — Reply to this email directly, view it on GitHub <#6864 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAFYRCIA3C2I7LDWCOFT62DZ2NPW5AVCNFSM6AAAAABLR444ACVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGOJYHAZDMNJSGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

kushanam224 days ago

Some of the vLLM abi for custom ops require functionality from the CUDA driver rather than the runtime. e.g: cuTensorMapEncodeTiled. The library seems to get loaded when built with default packages (possibly by one of the older versions of the dependencies), but upgrading them would result in custom ops not getting built correctly.

cliffwoolley224 days ago

There are two driver API functions (not runtime API) in use, and that definitely requires linking directly to libcuda or else dlopen/dlsym of the same so that the symbols can be resolved. This dependency is coming in from CUTLASS, it looked like to me.

DarkLight1337 requested a review from

youkaichao 203 days ago

DarkLight1337203 days ago

@youkaichao is this still relevant?

mergify added ci/build

mergify203 days ago

This pull request has merge conflicts that must be resolved before it can be
merged. @sdake please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify added needs-rebase

cliffwoolley203 days ago

Yes this is still relevant. It's needed because of https://github.com/vllm-project/vllm/blob/main/CMakeLists.txt#L399 .

cliffwoolley approved these changes on 2024-11-01

youkaichao201 days ago

we usually use the libcuda.so brought by pytorch. if you build pytorch from scratch, chances are pytorch statically link against libcuda.so , and you have to make it available in the linker's path.

I don't think this is needed for the general public.

cliffwoolley201 days ago

There is no static linking to libcuda.so (the CUDA usermode driver). Anyway this is about linkage correctness, not about packaging. The PR is correct.

…

On Fri, Nov 1, 2024, 6:43 PM youkaichao ***@***.***> wrote: we usually use the libcuda.so brought by pytorch. if you build pytorch from scratch, chances are pytorch statically link against libcuda.so , and you have to make it available in the linker's path. I don't think this is needed for the general public. — Reply to this email directly, view it on GitHub <#6864 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACJ33XQJ3S6OJDWW2JAIQ33Z6QU47AVCNFSM6AAAAABLR444ACVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINJSG44TAOBTGU> . You are receiving this because you commented.Message ID: ***@***.***>

simon-mo requested a review from

tlrmchlsmth 177 days ago

cliffwoolley118 days ago

This is still needed -- can we go ahead and get it merged please?

tlrmchlsmth117 days ago

It's needed because of https://github.com/vllm-project/vllm/blob/main/CMakeLists.txt#L399 .

@cliffwoolley could you say a bit more about why it's needed for this?

kushanam117 days ago👍 2

It's needed because of https://github.com/vllm-project/vllm/blob/main/CMakeLists.txt#L399 .

@cliffwoolley could you say a bit more about why it's needed for this?

Hi Tyler, the reason this doesn't throw an error by default is that PyTorch loads libcuda.so, and it happens to be loaded before vLLM makes any calls to CUDA. However, if that order is altered for any reason, lack of it leads to issues.

cliffwoolley117 days ago (edited 117 days ago)👍 1

Because

It's needed because of https://github.com/vllm-project/vllm/blob/main/CMakeLists.txt#L399 .

@cliffwoolley could you say a bit more about why it's needed for this?

Where we are using the CUDA driver API -- which the linked line instructs CUTLASS to do -- then we have to have some form of linkage to the CUDA driver library, or else we can get an undefined symbol error like in the OP at runtime. This can be addressed in one of two ways: one is to directly link to the CUDA driver, which is what this PR accomplishes. The other way is via dlopen/dlsym -- or the corresponding convenience wrapper from the CUDA Runtime API that is cudaGetDriverEntryPoint. The latter might be preferable for executables that still have to load without CUDA (perhaps CPU-only, or on some other device type), though it would require some care to ensure that the symbol loaded is given the same name and signature as the CUDA driver would provide while being local/private to to vLLM (so as not to conflict with the actual CUDA driver library, which may also be loaded).

We get away without this only when something else in the address space -- e.g. libtorch -- has already loaded libcuda.so into the process and the dynamic loader can resolve our symbol even though there wasn't actual linkage to it. But it's possible in some cases to call into parts of vLLM that don't happen fo first call through PyTorch to initialize the GPU, and then the problem appears.

tlrmchlsmth117 days ago

@kushanam and @cliffwoolley that is pretty convincing, thanks for the thorough explanation. Any reason not to do the samefor the other CUDA .so files as well (_moe_C.abi3.so and cumem_allocator.abi3.so)?

cliffwoolley117 days ago

It's best to limit this to places where it's strictly needed. You can find that out by looking at nm -D on whichever library and looking for U symbols whose names start like cuSomething (rather than cudaSomething, which are from the runtime API, aka libcudart).

tlrmchlsmth approved these changes on 2025-01-24

mergify removed needs-rebase

tlrmchlsmth closed this 117 days ago

vllm
Add required libcuda.so
#6864

Closed

Add required libcuda.so #6864

vllm Add required libcuda.so #6864 Closed

Add required libcuda.so #6864

vllm
Add required libcuda.so
#6864

Closed