Intrusive_ptr implementation slower than shared_ptr (#30810)
Summary:
This started as a casual coding exercise, so I did not put much effort into it. I wondered whether the current intrusive_ptr implementation is optimized enough, so I compared it with shared_ptr (using std::enable_shared_from_this).
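My benchmark results show that intrusive_ptr is actually slower: about 40% slower for construct/destroy (14 ns vs 10 ns) and roughly 3% slower in the array case. On my MacBook the timings are: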
```
---------------------------------------------------------------
Benchmark                         Time         CPU   Iterations
---------------------------------------------------------------
BM_IntrusivePtrCtorDtor          14 ns       14 ns     52541902
BM_SharedPtrCtorDtor             10 ns       10 ns     71898849
BM_IntrusivePtrArray          14285 ns    14112 ns        49775
BM_SharedPtrArray             13821 ns    13384 ns        51602
```
I am sharing the results in case someone is interested in taking a closer look.
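The benchmark source is not reproduced in this message. As a rough sketch of what a construct/destroy comparison like the CtorDtor pair measures, the snippet below times a hand-rolled pointer against std::shared_ptr. Everything in it is an assumption made for illustration: `Counted`, `IntrusiveRef`, `Shared`, `ns_per_iter`, `time_intrusive`, and `time_shared` are hypothetical names, `IntrusiveRef` is not c10::intrusive_ptr, and std::chrono stands in for a real benchmark harness. It also assumes the shared_ptr side uses std::make_shared, which may not match the actual benchmark.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>

// Written to on every iteration so the loop body has an observable effect.
volatile int sink = 0;

// Hypothetical intrusive target: the reference count is stored in the object.
struct Counted {
  std::atomic<std::size_t> refcount{0};
  int payload = 0;
};

// Hypothetical minimal intrusive pointer (illustration only, NOT c10::intrusive_ptr).
class IntrusiveRef {
 public:
  explicit IntrusiveRef(Counted* p) : p_(p) {
    if (p_) p_->refcount.fetch_add(1, std::memory_order_relaxed);
  }
  IntrusiveRef(const IntrusiveRef&) = delete;
  IntrusiveRef& operator=(const IntrusiveRef&) = delete;
  ~IntrusiveRef() {
    // Delete the object when the last reference goes away.
    if (p_ && p_->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete p_;
    }
  }
  Counted* get() const { return p_; }

 private:
  Counted* p_;
};

// shared_ptr counterpart, using enable_shared_from_this as in the comparison above.
struct Shared : std::enable_shared_from_this<Shared> {
  int payload = 0;
};

// Runs body() iters times and returns the average wall-clock nanoseconds per call.
template <typename F>
double ns_per_iter(std::size_t iters, F&& body) {
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iters; ++i) body();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(stop - start).count() /
         static_cast<double>(iters);
}

// Allocate, take one reference, and release it again.
double time_intrusive(std::size_t iters) {
  return ns_per_iter(iters, [] {
    IntrusiveRef r(new Counted());
    sink = r.get()->payload;
  });
}

// Same shape with std::make_shared (object and control block in one allocation).
double time_shared(std::size_t iters) {
  return ns_per_iter(iters, [] {
    auto p = std::make_shared<Shared>();
    sink = p->payload;
  });
}
```

Calling `time_intrusive(n)` and `time_shared(n)` from a `main` with a large `n` gives per-iteration timings that can be compared side by side. A loop like this is much cruder than a proper benchmark harness (no warm-up, no repetition statistics, and the optimizer may simplify the bodies), so the numbers should be treated as indicative only.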
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30810
Reviewed By: yinghai
Differential Revision: D18828785
Pulled By: bddppq
fbshipit-source-id: 202e9849c9d8a3da17edbe568572a74bb70cb6c5