Improve flops profiler functionality (#1065)
* use the original function's name as the key in the old_functions dict
* update profile output format
* print at global rank 0
* add flops calculation for the backward pass using time from DeepSpeed timers
* improve aggregated profiling output to show all depths
* print samples/second
* update readme and examples
* update docs
* fix typo and reorder printing
* fix format
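
Note on the old_functions change: a minimal sketch of the idea, not the profiler's actual code. Only the name old_functions comes from this change; the `ops` namespace, `wrap_func`, `flop_log`, `patch` and `unpatch` are hypothetical names used for illustration. The point is that keying the dict by the original function's name lets the wrapper be swapped back out later, provided the wrapper keeps that name.

```python
import functools
import types

# Hypothetical stand-in for a library namespace whose functions get patched.
ops = types.SimpleNamespace()


def relu(x):
    return max(x, 0.0)


ops.relu = relu

old_functions = {}  # original function name -> original function
flop_log = []       # (function name, flops) recorded on each call


def wrap_func(func, flops_fn):
    """Wrap func so each call records flops; remember the original by name."""
    name = func.__name__
    old_functions[name] = func

    @functools.wraps(func)  # the wrapper keeps the original __name__
    def wrapper(*args, **kwargs):
        flop_log.append((name, flops_fn(*args, **kwargs)))
        return func(*args, **kwargs)

    return wrapper


def patch():
    # Assumed cost model for the sketch: one flop per call.
    ops.relu = wrap_func(ops.relu, lambda x: 1)


def unpatch():
    # Works because the wrapper's __name__ matches the dict key.
    ops.relu = old_functions[ops.relu.__name__]


patch()
result = ops.relu(-2.0)
unpatch()
```

After unpatch(), ops.relu is the original function again and flop_log holds one entry for the wrapped call.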