Migrate nll_loss from TH to ATen (CPU) (#28270)
Summary:
This ports the TH implementation of the negative log likelihood loss to ATen. `torch.nn.functional.nll_loss()` uses this implementation for 2d inputs of shape (N, C).
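For readers unfamiliar with the operation, the following is a minimal pure-Python sketch of what the loss computes for an (N, C) input of log-probabilities. It is not the ATen code from this PR; the function name `nll_loss_2d` and the nested-list input format are assumptions made for illustration, and the semantics follow the documented behaviour of `nll_loss` (optional per-class weights, `ignore_index`, and `none`/`sum`/`mean` reductions).

```python
def nll_loss_2d(log_probs, targets, weights=None, ignore_index=-100, reduction="mean"):
    """Illustrative NLL loss for (N, C) log-probabilities given as nested lists.

    Hypothetical helper, not the ATen kernel: it only mirrors the documented
    semantics of torch.nn.functional.nll_loss for 2d inputs.
    """
    n_classes = len(log_probs[0])
    if weights is None:
        weights = [1.0] * n_classes

    losses = []
    total_weight = 0.0
    for row, target in zip(log_probs, targets):
        if target == ignore_index:
            # Ignored samples contribute neither loss nor weight.
            losses.append(0.0)
            continue
        if not 0 <= target < n_classes:
            raise IndexError(f"target {target} is out of bounds")
        w = weights[target]
        # Per-sample loss is the negated, weighted log-probability of the target class.
        losses.append(-w * row[target])
        total_weight += w

    if reduction == "none":
        return losses
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        # The mean is normalised by the summed weights of the non-ignored targets.
        if total_weight == 0.0:
            return float("nan")
        return sum(losses) / total_weight
    raise ValueError(f"unknown reduction: {reduction}")


if __name__ == "__main__":
    log_probs = [[-0.5, -1.0], [-2.0, -0.1]]  # N=2 samples, C=2 classes
    targets = [0, 1]
    print(nll_loss_2d(log_probs, targets))
```

With unit weights the `mean` reduction is simply the average of the negated target log-probabilities; with class weights it becomes a weighted average, which is why the sketch tracks `total_weight` separately.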
## Performance Impact
I measured no significant performance difference between the port and the original implementation using this [benchmark test script](https://gist.github.com/andreaskoepf/3c8e3698607773db2788dfd8885a9ed9).
### With PR applied:
```
CPU forward 1000 took 2.5290995836257935e-05
CPU forward 10000 took 5.757302278652787e-05
CPU forward 100000 took 0.0004873779835179448
CPU forward 1000000 took 0.0051894880016334355
CPU forward 10000000 took 0.026263039995683357
CPU forward TOTAL time 0.8068871730065439
CPU for- & backward 1000 took 0.00018794499919749796
CPU for- & backward 10000 took 0.0002642899926286191
CPU for- & backward 100000 took 0.0011828370043076575
CPU for- & backward 1000000 took 0.01250307000009343
CPU for- & backward 10000000 took 0.11453165800776333
CPU for- & backward TOTAL time 0.824805997981457
```
### Original TH version:
```
CPU forward 1000 took 2.1958985598757863e-05
CPU forward 10000 took 6.608400144614279e-05
CPU forward 100000 took 0.0004632119962479919
CPU forward 1000000 took 0.005477247992530465
CPU forward 10000000 took 0.02681165697867982
CPU forward TOTAL time 0.8073387439944781
CPU for- & backward 1000 took 0.00020634100656025112
CPU for- & backward 10000 took 0.00031720998231321573
CPU for- & backward 100000 took 0.0011843869870062917
CPU for- & backward 1000000 took 0.010876987013034523
CPU for- & backward 10000000 took 0.09893897600704804
CPU for- & backward TOTAL time 0.8271351839939598
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28270
Differential Revision: D18009584
Pulled By: ezyang
fbshipit-source-id: 77daf47c61a9dd9bb3b5a8d3e48585bbb665e979