[fbcode][static runtime] out-variant for quantized::linear_dynamic_fp16 (#67663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67663
Mostly follows the example of quantized::linear (D28428734 (https://github.com/pytorch/pytorch/commit/4d7abdbdadd440cb4b8412f1e309cae14a687b49)) to enable the out variant for quantized::linear_dynamic_fp16.
Reason: in the MP tab CTR PyTorch model migration, we observed that quantized::linear_dynamic_fp16 is the operator with the highest cost, but it does not have an out variant enabled yet: https://fburl.com/phabricator/b5juus2d
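For readers unfamiliar with the term, here is a rough sketch of what an out variant means, written in plain Python rather than the actual C++ op. The names `linear` and `linear_out` are hypothetical and only illustrate the idea: the functional form allocates its result on every call, while the out variant writes into a buffer the caller already owns, which is what lets a memory planner reuse storage across iterations.

```python
def linear(x, w, b):
    """Functional variant (hypothetical): allocates a new output on every call."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + bi for row, bi in zip(w, b)]


def linear_out(x, w, b, out):
    """Out variant (hypothetical): writes into a caller-provided buffer."""
    for i, (row, bi) in enumerate(zip(w, b)):
        out[i] = sum(xi * wi for xi, wi in zip(x, row)) + bi
    return out  # same object that was passed in, no new allocation


# Toy inputs: a 2x2 identity weight and a constant bias.
x = [1.0, 2.0]
w = [[1.0, 0.0], [0.0, 1.0]]
b = [0.5, 0.5]

buf = [0.0, 0.0]  # preallocated once, reused across calls
result = linear_out(x, w, b, buf)
```

Both forms compute the same values; the difference is only in who owns the output storage.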
Test Plan:
buck build mode/opt caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench
sudo watch -n 20 /usr/local/fbprojects/dynamoserver/bin/turboDriver disable
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- --scripted_model=/home/bwen/models/991103061_4/991103061_4.predictor --pt_inputs=/home/bwen/models/991103061_4/pt_inputs --method_name=forward --pt_cleanup_activations=1 --pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=1000 --num_threads=1 --repetitions=3 --do_profile=1 --do_benchmark=1 --set_compatibility=1 --compare_results=1 --pt_enable_static_runtime 2>&1 | pastry
before: P465201159
0.929067 ms. 31.808%. quantized::linear_dynamic_fp16 (16 nodes)
0.921679 ms. 31.7324%. quantized::linear_dynamic_fp16 (16 nodes)
0.919127 ms. 31.7404%. quantized::linear_dynamic_fp16 (16 nodes)
after: P465203015
0.90898 ms. 31.0205%. quantized::linear_dynamic_fp16 (16 nodes, out variant)
0.9127 ms. 30.62%. quantized::linear_dynamic_fp16 (16 nodes, out variant)
0.879148 ms. 31.0161%. quantized::linear_dynamic_fp16 (16 nodes, out variant)
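A quick way to summarize the three repetitions above is to compare the mean per-op time before and after. The snippet below is only a convenience calculation over the numbers already listed; with three noisy repetitions it should be read as a rough indication (about a 2.5% reduction in mean time for this op), not a precise measurement.

```python
from statistics import mean

# Per-repetition times (ms) for quantized::linear_dynamic_fp16, copied from above.
before_ms = [0.929067, 0.921679, 0.919127]
after_ms = [0.90898, 0.9127, 0.879148]


def relative_change(before, after):
    """Return (mean(after) - mean(before)) / mean(before)."""
    b = mean(before)
    a = mean(after)
    return (a - b) / b


change = relative_change(before_ms, after_ms)
print(f"mean before: {mean(before_ms):.6f} ms")
print(f"mean after:  {mean(after_ms):.6f} ms")
print(f"relative change: {change * 100:.2f}%")
```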
Unit test logic follows https://fburl.com/code/vv0rry13
buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D32001168
fbshipit-source-id: 873d9f77434b9c4bafb298c871173f9a560dd2a3