pytorch
54ffb05e - better error message between C2 and glow (#41603)


Commit

4 years ago

better error message between C2 and glow (#41603)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41603
Pull Request resolved: https://github.com/pytorch/glow/pull/4704

Previously, in the glow onnxifi path, when an error was encountered we logged it to stderr and then just returned ONNXIFI_STATUS_INTERNAL_ERROR to C2. C2 then does CAFFE2_ENFORCE_EQUAL(return_code, ONNXIFI_STATUS_SUCCESS). The error message that eventually reached the user was something like

[enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0

This diff adds plumbing to get a human-readable error message out of glow and into C2.

Test Plan:
Run ads replayer. Overload it with traffic. The error message sent back to the client used to be

E0707 00:57:45.697196 3709559 Caffe2DisaggAcceleratorTask.cpp:493] During running REMOTE_OTHER net: [enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0 (Error from operator:....

Now it's
```
E0707 16:46:48.366263 1532943 Client.cpp:966] Exception when calling caffe2_run_disagg_accelerator on remote predictor for model 190081310_0 : apache::thrift::TApplicationException: c10::Error: [enforce fail at onnxifi_op.cc:556] .
Error code: RUNTIME_REQUEST_REFUSED
Error message: The number of allowed queued requests has been exceeded. queued requests: 100 allowed requests: 100
Error return stack:
glow/glow/lib/Runtime/HostManager/HostManager.cpp:673
glow/glow/lib/Onnxifi/HostMana (Error from operator:...
```

Reviewed By: gcatron, yinghai

Differential Revision: D22416857

fbshipit-source-id: 564bc7644d9666eb660725c2dca5637affae9b73
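The change described above can be illustrated with a small standalone sketch. It is not the code from this commit: `Status`, `BackendError`, `describeStatusOnly`, `describeWithDetails`, and `enforceSuccess` are hypothetical names, and the numeric status values are placeholders rather than the real ONNXIFI constants. The only idea taken from the commit message is that the backend hands the caller an error code and message alongside the status, so the enforce failure can report something more useful than "1030 vs 0".

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical status codes for illustration only; these are NOT the
// real ONNXIFI constants.
enum class Status : std::int32_t { Success = 0, InternalError = 1 };

// Hypothetical carrier for the details the backend already knows when it
// fails. In the old flow only `status` would reach the caller.
struct BackendError {
  Status status;
  std::string code;     // e.g. "RUNTIME_REQUEST_REFUSED"
  std::string message;  // human-readable explanation
};

// Old-style report: the caller can only compare the status to success.
std::string describeStatusOnly(const BackendError& err) {
  std::ostringstream os;
  os << "status == Success. " << static_cast<std::int32_t>(err.status)
     << " vs 0";
  return os.str();
}

// New-style report: the backend's code and message are passed through.
std::string describeWithDetails(const BackendError& err) {
  std::ostringstream os;
  os << "Error code: " << err.code << " Error message: " << err.message;
  return os.str();
}

// Caller-side check: throws with the detailed text on any failure.
void enforceSuccess(const BackendError& err) {
  if (err.status != Status::Success) {
    throw std::runtime_error(describeWithDetails(err));
  }
}
```

A caller would build a `BackendError` at the failure site and pass it to `enforceSuccess`; the thrown exception then carries the backend's own explanation instead of a bare status comparison.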

Author

tracelogfb

Committer

facebook-github-bot

Parents

aa4e91a6
