Migrate `smooth_l1_loss` from the TH to Aten (CPU & CUDA) (#27962)
Summary:
This is a port of the TH `SmoothL1Criterion` to ATen using `TensorIterator`. The forward implementation has been placed in BinaryOpsKernel.cpp/.cu, while the backward implementation was added to PointwiseOpsKernel.cpp/.cu. CPU performance has improved for both the forward and backward passes. On CUDA, the forward pass has slightly degraded compared to the TH implementation (see benchmark results below).
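For reference, the elementwise math being ported (with the default threshold of 1) is small enough to sketch in a few lines of Python. `smooth_l1` and `smooth_l1_grad` are illustrative names, not the actual kernel entry points:

```python
import math

def smooth_l1(x, y):
    # Forward: 0.5 * d^2 inside the |d| < 1 region, |d| - 0.5 outside,
    # so the two branches join smoothly at |d| = 1.
    d = x - y
    return 0.5 * d * d if abs(d) < 1.0 else abs(d) - 0.5

def smooth_l1_grad(x, y, grad_output=1.0):
    # Backward w.r.t. x: the derivative is d inside the quadratic
    # region and sign(d) outside it, scaled by the incoming gradient.
    d = x - y
    g = d if abs(d) < 1.0 else math.copysign(1.0, d)
    return grad_output * g
```

The forward is a plain two-input elementwise op, while the backward consumes an extra `grad_output` tensor, which is presumably why the two ended up in BinaryOpsKernel and PointwiseOpsKernel respectively.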
### Questions:
1. Is the placement of the implementation OK (I followed https://github.com/pytorch/pytorch/pull/26529), or should we create a separate .cpp/.h file pair for each operator implementation (e.g. to keep things together)?
2. The GPU forward pass now consistently takes longer than the old version. Any ideas what we could try to bring it on par with the old implementation?
## WITH patch benchmark result:
```
CPU warmup 1000 took 0.00018124299822375178
CPU warmup 10000 took 0.00021713999740313739
CPU warmup 100000 took 0.0016273759974865243
CPU warmup TOTAL time 0.0020758909959113225
CPU forward 1000 took 6.229899736354128e-05
CPU forward 10000 took 0.00013340599980438128
CPU forward 100000 took 0.0008730469999136403
CPU forward 1000000 took 0.011010036003426649
CPU forward 10000000 took 0.11133221499767387
CPU forward 100000000 took 1.0425375220002024
CPU forward TOTAL time 1.1660894790038583
CPU for- & backward 1000 took 0.0002662249971763231
CPU for- & backward 10000 took 0.00023712700203759596
CPU for- & backward 100000 took 0.002531945996452123
CPU for- & backward 1000000 took 0.010394354998425115
CPU for- & backward 10000000 took 0.23814761800167616
CPU for- & backward 100000000 took 1.2651235049997922
CPU for- & backward TOTAL time 1.516897434994462
GPU warmup 1000 took 0.00020941899856552482
GPU warmup 10000 took 8.128300396492705e-05
GPU warmup 100000 took 8.551499922759831e-05
GPU warmup TOTAL time 0.0004199420000077225
GPU forward 1000 took 7.060499774524942e-05
GPU forward 10000 took 7.116600318113342e-05
GPU forward 100000 took 9.825800225371495e-05
GPU forward 1000000 took 0.000499356996442657
GPU forward 10000000 took 0.002032470001722686
GPU forward 100000000 took 0.018638986002770253
GPU forward TOTAL time 0.02148268099699635
GPU for- & backward 1000 took 0.00035967300209449604
GPU for- & backward 10000 took 0.00032710300001781434
GPU for- & backward 100000 took 0.0003689270015456714
GPU for- & backward 1000000 took 0.0007732619997113943
GPU for- & backward 10000000 took 0.02127284000016516
GPU for- & backward 100000000 took 0.2022330649997457
GPU for- & backward TOTAL time 0.2254496300010942
```
## WITHOUT patch benchmark result:
```
CPU warmup 1000 took 0.00011545199959073216
CPU warmup 10000 took 0.00016227000014623627
CPU warmup 100000 took 0.0013456509987008758
CPU warmup TOTAL time 0.001648657998885028
CPU forward 1000 took 2.627600042615086e-05
CPU forward 10000 took 0.00015939700097078457
CPU forward 100000 took 0.001139313004387077
CPU forward 1000000 took 0.013769682998827193
CPU forward 10000000 took 0.13163026500114938
CPU forward 100000000 took 1.321879123999679
CPU forward TOTAL time 1.4687001089987461
CPU for- & backward 1000 took 0.0002569290008977987
CPU for- & backward 10000 took 0.00033315900509478524
CPU for- & backward 100000 took 0.0016096779945655726
CPU for- & backward 1000000 took 0.014474845003860537
CPU for- & backward 10000000 took 0.1564881520025665
CPU for- & backward 100000000 took 1.5787935900007142
CPU for- & backward TOTAL time 1.7521004869995522
GPU warmup 1000 took 0.00025611399905756116
GPU warmup 10000 took 0.00014123699656920508
GPU warmup 100000 took 0.00012580600014189258
GPU warmup TOTAL time 0.0005591579974861816
GPU forward 1000 took 0.00031183200189843774
GPU forward 10000 took 0.00011483799607958645
GPU forward 100000 took 0.00010807999933604151
GPU forward 1000000 took 0.0007842139966669492
GPU forward 10000000 took 0.0017624700049054809
GPU forward 100000000 took 0.01519905700115487
GPU forward TOTAL time 0.018341148999752477
GPU for- & backward 1000 took 0.00047569099842803553
GPU for- & backward 10000 took 0.0003539700046530925
GPU for- & backward 100000 took 0.000808880002296064
GPU for- & backward 1000000 took 0.001639469999645371
GPU for- & backward 10000000 took 0.021154599002329633
GPU for- & backward 100000000 took 0.19268552300491137
GPU for- & backward TOTAL time 0.2172460189976846
```
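To make the two runs easier to compare, the TOTAL times above reduce to the following rough ratios (old TH time divided by new ATen time, so values above 1.0 favor this patch):

```python
# TOTAL times copied from the benchmark output above (seconds).
with_patch = {
    'CPU forward': 1.1660894790038583,
    'CPU for- & backward': 1.516897434994462,
    'GPU forward': 0.02148268099699635,
    'GPU for- & backward': 0.2254496300010942,
}
without_patch = {
    'CPU forward': 1.4687001089987461,
    'CPU for- & backward': 1.7521004869995522,
    'GPU forward': 0.018341148999752477,
    'GPU for- & backward': 0.2172460189976846,
}

for name in with_patch:
    ratio = without_patch[name] / with_patch[name]
    print('{}: {:.2f}x'.format(name, ratio))
```

The CPU path comes out roughly 1.16-1.26x faster with the patch, while the GPU ratios below 1.0 reflect the forward-pass regression discussed above.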
### Code used for performance testing
```python
import torch
import torch.nn.functional as F
import torch.nn as nn
from timeit import default_timer

torch.manual_seed(0)

cpu = torch.device('cpu')
gpu = torch.device('cuda')

loss_fn = F.smooth_l1_loss

def run_benchmark(name, depth, require_grad, device, fn):
    total_start = default_timer()
    y = None
    a = None
    for i in range(3, 3 + depth):
        start = default_timer()
        n = 10 ** i
        a = torch.rand(n, requires_grad=require_grad, device=device)
        b = torch.rand(n, device=device)
        y = fn(a, b)
        y.cpu()  # get result (potentially wait for gpu)
        if a.grad is not None:
            a.grad.cpu()
        end = default_timer()
        print('{} {} took {}'.format(name, n, end - start))
    total_end = default_timer()
    print('{} TOTAL time {}'.format(name, total_end - total_start))

def fwd_only(a, b):
    out = loss_fn(a, b)
    return out

def fwd_bck(a, b):
    out = loss_fn(a, b)
    out.backward()
    return out

def sanity_check(name, device):
    print('{} Operator sanity check:'.format(name))
    a = torch.randn(16, requires_grad=True, device=device)
    b = torch.randn(16, device=device) * 2
    out = loss_fn(a, b)
    print('out', out)
    out.backward()
    print(a.grad)
    print('double backward')
    loss = loss_fn(a, b)
    loss2 = torch.autograd.grad(loss, a, create_graph=True)
    z = loss2[0].sum()
    print(z)
    z.backward()
    print('ok')
    print()

print('PyTorch version:', torch.__version__)
sanity_check('CPU', cpu)
if torch.cuda.is_available():
    sanity_check('GPU', gpu)
print()

run_benchmark('CPU warmup', 3, False, cpu, fwd_only)
run_benchmark('CPU forward', 6, False, cpu, fwd_only)
run_benchmark('CPU for- & backward', 6, True, cpu, fwd_bck)
print()

if torch.cuda.is_available():
    run_benchmark('GPU warmup', 3, False, gpu, fwd_only)
    run_benchmark('GPU forward', 6, False, gpu, fwd_only)
    run_benchmark('GPU for- & backward', 6, True, gpu, fwd_bck)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27962
Differential Revision: D18061942
Pulled By: ezyang
fbshipit-source-id: 0d1fc528b59d47d4773b03240c3368db021cb9db