port masked_fill from TH to ATen (#33330)
Summary:
Port `masked_fill` from TH to ATen using TensorIterator.
Single-core performance stays roughly the same; single-socket performance improves by **3~16x**.
Note: `masked_fill` is not listed in https://github.com/pytorch/pytorch/issues/24507.
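
For reference, a minimal pure-Python sketch of the element-wise semantics the port is expected to preserve. `masked_fill_reference` is a hypothetical helper written for illustration only; it is not the ATen kernel, and it assumes same-length flat inputs (no broadcasting or dtype handling).

```python
def masked_fill_reference(values, mask, fill_value):
    """Return a copy of `values` with `fill_value` wherever `mask` is true.

    Hypothetical reference helper, not the ATen implementation.
    Assumes `values` and `mask` are flat sequences of equal length.
    """
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    # Each output element depends only on the matching input element and
    # mask entry, so the loop has no cross-element dependencies.
    return [fill_value if m else v for v, m in zip(values, mask)]


if __name__ == "__main__":
    print(masked_fill_reference([1.0, 2.0, 3.0], [True, False, True], 0.0))
```

Because each element is computed independently, this kind of operation can be split across threads, which is consistent with the gain showing up in the single-socket numbers rather than the single-core ones.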
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33330
Differential Revision: D20098812
Pulled By: VitalyFedyunin
fbshipit-source-id: ff20712ffc00cc665550997abcfdfb8916c18e40