pytorch
8bce88d9 - [caffe2] dont call cudnnDestroy on thread exit (crashes on windows with cuda 11/12) (#95382)


Commit

1 year ago

[caffe2] dont call cudnnDestroy on thread exit (crashes on windows with cuda 11/12) (#95382)

Summary:
My team has been hitting a mysterious crash for a few months on a Windows binary that uses Caffe2 inside a worker thread. When this thread gets destroyed, there is an error at this line in context_gpu.h, where the status returned by this operation is CUDNN_STATUS_INTERNAL_ERROR instead of CUDNN_STATUS_SUCCESS.

When enabling cuDNN debug logs (via the environment variables NVIDIA specifies), I can see that the context is destroyed twice, even though this code only destroys it once, so something outside this code is causing a double free.

This seems very similar to the issue and fix described here for PyTorch:
https://github.com/pytorch/pytorch/issues/17658
https://github.com/apache/tvm/pull/8267

PyTorch handles this in the same way, by simply not calling cudnnDestroy.

This seems to have become an issue with CUDA 11, but I tested CUDA 12 as well and found that the issue persists, so it needs to be fixed.

Test Plan:
CI

I checked that the specific Windows binary I am using is able to create and destroy Caffe2-invoking threads without causing the application to crash.

buck run arvr/mode/win/cuda11/opt //arvr/projects/nimble/prod/tools/MonoHandTrackingVis

Differential Revision: D43538017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95382
Approved by: https://github.com/malfet
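The workaround described in the summary (skip handle destruction when the owning thread exits) can be illustrated with a minimal sketch. This is not the actual Caffe2 diff: `FakeHandle`, `fake_create`, `fake_destroy`, `PerThreadHandle`, and `run_worker_and_count_destroys` are hypothetical stand-ins so the example compiles without CUDA or cuDNN. In real code the handle would be a `cudnnHandle_t` created with `cudnnCreate`, and the skipped call would be `cudnnDestroy`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-ins for cudnnHandle_t / cudnnCreate / cudnnDestroy.
struct FakeHandle { int id; };
static std::atomic<int> g_created{0};
static std::atomic<int> g_destroyed{0};

FakeHandle* fake_create() {
  return new FakeHandle{++g_created};
}

void fake_destroy(FakeHandle* h) {
  ++g_destroyed;
  delete h;
}

// Hypothetical per-thread owner of a lazily created handle.
class PerThreadHandle {
 public:
  FakeHandle* get() {
    if (handle_ == nullptr) {
      handle_ = fake_create();
    }
    assert(handle_ != nullptr);
    return handle_;
  }

  ~PerThreadHandle() {
    // Deliberately do NOT call fake_destroy(handle_) here. The commit
    // reports that destroying the cuDNN handle at thread exit crashes on
    // Windows with CUDA 11/12, so the handle is intentionally leaked and
    // left for process teardown to reclaim.
  }

 private:
  FakeHandle* handle_ = nullptr;
};

// Runs a worker that touches its thread-local handle, then exits.
// Returns how many times the destroy function was called.
int run_worker_and_count_destroys() {
  std::thread worker([] {
    thread_local PerThreadHandle handle;
    handle.get();  // first use creates the handle for this thread
  });
  worker.join();   // the thread-local destructor has run by this point
  return g_destroyed.load();
}
```

Calling `run_worker_and_count_destroys()` from a `main` function creates one handle on the worker thread and destroys none. The trade-off is a small, bounded leak per thread in exchange for avoiding the double destroy seen in the debug logs.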

Author

fuzic

Committer

pytorchmergebot

Parents

76cac709
