Port of multilabel_margin_loss from TH to ATen (CPU) (#28205)
Summary:
This is a port of the CPU version of TH MultiLabelMarginCriterion to ATen.
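For context, a minimal pure-Python sketch of what the criterion computes for a single 1-D sample, following the formula documented for `torch.nn.MultiLabelMarginLoss`. The helper name `multilabel_margin_loss_ref` is hypothetical and this is not the ported ATen kernel; it assumes the default mean over classes and a target list terminated by the first `-1`.

```python
def multilabel_margin_loss_ref(x, y):
    """Illustrative reference for one 1-D sample; not the ATen code."""
    # Collect target class indices up to (not including) the first -1.
    targets = []
    for t in y:
        if t < 0:
            break
        targets.append(t)
    is_target = set(targets)

    # Sum the hinge term over every (target, non-target) pair.
    total = 0.0
    for t in targets:
        for i in range(len(x)):
            if i not in is_target:
                total += max(0.0, 1.0 - (x[t] - x[i]))

    # Normalize by the number of classes.
    return total / len(x)


# Targets are classes 3 and 0; the entry after -1 is ignored.
loss = multilabel_margin_loss_ref([0.1, 0.2, 0.4, 0.8], [3, 0, -1, 1])
print(loss)  # approximately 0.85
```

A sketch like this can serve as a slow cross-check against the ported kernel on small inputs.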
Benchmark results ([source of script used](https://gist.github.com/andreaskoepf/ce96eedb09e9480ae2263d31822ef26e)):
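Forward alone is slightly slower with the patch (TOTAL time 4.28 vs. 4.18, about 2%, probably acceptable); forward and backward combined are slightly faster (TOTAL time 7.32 vs. 7.76, about 6%).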
### WITH patch:
```
CPU forward 1000 took 0.0002544010058045387
CPU forward 10000 took 0.0022866200015414506
CPU forward 100000 took 0.02240650000749156
CPU forward 1000000 took 0.22985397902084514
CPU forward 10000000 took 2.227811124001164
CPU forward TOTAL time 4.282580643019173
CPU for- & backward 1000 took 0.0006969539972487837
CPU for- & backward 10000 took 0.004804529016837478
CPU for- & backward 100000 took 0.07736711099278182
CPU for- & backward 1000000 took 0.5985556179948617
CPU for- & backward 10000000 took 4.761040163983125
CPU for- & backward TOTAL time 7.318476865999401
```
### WITHOUT patch:
```
CPU forward 1000 took 0.00026982801500707865
CPU forward 10000 took 0.002569925010902807
CPU forward 100000 took 0.024335263995453715
CPU forward 1000000 took 0.2151200629887171
CPU forward 10000000 took 2.114590842014877
CPU forward TOTAL time 4.184845258976566
CPU for- & backward 1000 took 0.0007158009975682944
CPU for- & backward 10000 took 0.005468863993883133
CPU for- & backward 100000 took 0.05931608600076288
CPU for- & backward 1000000 took 0.5732014369859826
CPU for- & backward 10000000 took 5.2500802429858595
CPU for- & backward TOTAL time 7.7646528169861995
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28205
Differential Revision: D18001407
Pulled By: ezyang
fbshipit-source-id: 68cbd9ce0aacf99dd8c44fb4da9c09b3ffc1e59a