[fbgemm_gpu] Use the latest philox_cuda_state API for stochastic rounding (#51004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51004
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/493
Follow-up on the FP16 stochastic rounding failure case:
- https://github.com/pytorch/pytorch/pull/50148
- D26006041
From Natalia:
- https://github.com/pytorch/pytorch/pull/50916 is the fix; note that philox_engine_inputs is deprecated, so refactoring to philox_cuda_state would be great.
- Instructions for changing the call: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83. Using philox_cuda_state will be important for graph capture.
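For context, here is a minimal sketch of the host/device pattern those header instructions describe. The kernel and launcher below are illustrative stand-ins, not the actual fbgemm_gpu code, and the include paths are assumed from the referenced note:
```
#include <ATen/CUDAGeneratorImpl.h>
#include <ATen/cuda/CUDAGraphsUtils.cuh>  // for at::cuda::philox::unpack
#include <curand_kernel.h>
#include <mutex>

// Device side: unpack the PhiloxCudaState *inside* the kernel. Under CUDA
// graph capture the Philox offset lives in device memory, so it must be
// read here rather than snapshotted on the host.
__global__ void stochastic_rounding_kernel(at::PhiloxCudaState philox_args) {
  auto seeds = at::cuda::philox::unpack(philox_args);
  int64_t idx = blockIdx.x * blockDim.x + threadIdx.x;
  curandStatePhilox4_32_10_t state;
  curand_init(std::get<0>(seeds),  // seed
              idx,                 // per-thread subsequence
              std::get<1>(seeds),  // offset
              &state);
  // ... curand_uniform4(&state) then supplies the random bits used to
  // stochastically round FP32 updates down to FP16 ...
}

// Host side: replace the deprecated philox_engine_inputs call with
// philox_cuda_state, holding the generator mutex while advancing state.
void launch_stochastic_rounding() {
  auto* gen = at::get_generator_or_default<at::CUDAGeneratorImpl>(
      c10::nullopt, at::cuda::detail::getDefaultCUDAGenerator());
  at::PhiloxCudaState rng_engine_inputs;
  {
    // See Note [Acquire lock when using random generators]
    std::lock_guard<std::mutex> lock(gen->mutex_);
    // Was: auto seeds = gen->philox_engine_inputs(4);  // deprecated
    rng_engine_inputs = gen->philox_cuda_state(4);  // 4 randoms per thread
  }
  stochastic_rounding_kernel<<<256, 512>>>(rng_engine_inputs);
}
```
The key point is that unpack runs on the device: during graph capture, philox_cuda_state hands the kernel a pointer to the offset in device memory instead of a host-side value, so replayed graphs still draw fresh random numbers.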
Benchmark:
- Before this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $ buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee before_diff.log
PARSING BUCK FILES: FINISHED IN 0.4s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DOWNLOADED 0 ARTIFACTS, 0.00 BYTES, 0.0% CACHE MISS
BUILDING: FINISHED IN 5.3s (100%) 6474/6474 JOBS, 0 UPDATED
BUILD SUCCEEDED
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters: 0.41 GParam, 0.82GB
INFO:root:Accessed weights per batch: 83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW: 607.48GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW: 220.85GB/s, T: 1139us
```
- After this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $ buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee after_diff.log
PARSING BUCK FILES: FINISHED IN 1.1s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters: 0.41 GParam, 0.82GB
INFO:root:Accessed weights per batch: 83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW: 608.80GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW: 229.17GB/s, T: 1098us
```
Test Plan: CI
Reviewed By: ngimel
Differential Revision: D26038596
fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4