Add debug option PYTORCH_NVFUSER_DUMP=perf_debug_verbose (#1657)
When a fusion or segment runs, this option prints:
1. Markers for where the segment begins and ends execution
2. FusionIR associated with the operation
3. Metadata for all input tensors
4. Compiler log with register usage
5. Scheduler parameters
6. Launch parameters
7. Bytes processed by the fusion
8. Kernel execution time
9. Effective bandwidth (GB/s)
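Items 7, 8 and 9 are presumably related: effective bandwidth is the bytes processed divided by the kernel execution time. Below is a minimal Python sketch of that arithmetic. It rests on two assumptions: decimal gigabytes (1 GB = 1e9 bytes), and that "bytes processed" counts every input and output byte the kernel touches. The exact definition nvFuser uses is not stated here, and `effective_gbps` is a hypothetical helper, not an nvFuser API.

```python
def effective_gbps(bytes_processed: int, kernel_time_ms: float) -> float:
    """Effective bandwidth in GB/s.

    Hypothetical helper; assumes 1 GB = 1e9 bytes and that bytes_processed
    already sums all bytes read and written by the kernel.
    """
    if kernel_time_ms <= 0:
        raise ValueError("kernel_time_ms must be positive")
    seconds = kernel_time_ms / 1e3  # convert milliseconds to seconds
    return bytes_processed / seconds / 1e9


if __name__ == "__main__":
    # Example: a pointwise fusion reading two 1M-element fp32 inputs and
    # writing one 1M-element fp32 output moves 3 * 1M * 4 bytes = 12e6 bytes.
    n = 1_000_000
    bytes_processed = 3 * n * 4
    print(f"{effective_gbps(bytes_processed, 0.02):.1f} GB/s")  # 600.0 GB/s
```

Comparing this number against the GPU's peak memory bandwidth gives a rough sense of how close a memory-bound fusion is to the hardware limit.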