7a4a8f32 - Add new NVFuser Python Frontend Record Keeping for Cache enablement. (#81578)

This PR does not include an NVFuser frontend cache, but it decouples the Python API from direct construction of the backing Fusion IR: the requested definition is recorded so that, once a cache exists, it can be replayed to build a Fusion only when one does not already exist. A follow-up PR will add the actual caching.

The main change in the Python Frontend is that the NVFuser Fusion IR is no longer defined directly through the interface. Currently, there is a direct connection between the Python API and the creation of the Fusion IR and `Fusion` object: the user defines TensorViews and Scalars and calls arith functions (IR Expressions) on those IR Values. The goal is to disconnect the Python API from directly specifying the Fusion IR and to enable caching of the IR, so that a `Fusion` object is not necessarily built every time a Fusion definition is seen. The `FusionDefinition` in Python will mostly look the same, except the definition is now captured in a lightweight representation: a "recording" of Records. If the definition is not already cached, the Records are executed to build the Fusion IR. Initially, there is no caching, because I am trying to bring up the representation first and get it working correctly.
The Records are functors that are called when it is necessary to build the Fusion IR (`torch/csrc/jit/codegen/cuda/python_frontend/fusion_record.h`).

**Tensor Definition Record**

_Note: The Tensor Definition will change for runtime contiguity caching; I am just matching what is already there for now._

```
struct InputTensorRecord : RecordFunctor {
  InputTensorRecord(
      std::vector<size_t> _outputs,
      std::vector<int64_t> _symbolic_sizes,
      std::vector<bool> _contiguous_info,
      NvfDataType _dtype)
      : RecordFunctor({}, std::move(_outputs)),
        symbolic_sizes(std::move(_symbolic_sizes)),
        contiguous_info(std::move(_contiguous_info)),
        dtype(_dtype) {}

  void operator()(FusionDefinition& fd) final {
    auto tv = TensorViewBuilder()
                  .ndims(symbolic_sizes.size())
                  .contiguity(contiguous_info)
                  .shape(symbolic_sizes)
                  .dtype(dtype)
                  .build();

    fd.fusion_state.at(outputs.at(0)) = tv;
    fd.addInput(tv);
  }

  std::vector<int64_t> symbolic_sizes;
  std::vector<bool> contiguous_info;
  NvfDataType dtype;
};
```

**Generic Templatized Op Record Definition**

Op Records are notable because they record Fusion IR arith functions as the `fusion_op_`.

```
template <class OutType, class... ArgTypes>
struct OpRecord : RecordFunctor {
  OpRecord(
      std::vector<size_t> _args,
      std::vector<size_t> _outputs,
      std::function<OutType(ArgTypes...)> fusion_op)
      : RecordFunctor(std::move(_args), std::move(_outputs)),
        fusion_op_(fusion_op) {}

  template <class TupleType, std::size_t... Is>
  OutType opFunc(
      FusionDefinition& fd,
      TupleType& tp,
      std::index_sequence<Is...>) {
    return fusion_op_(
        dynamic_cast<typename std::tuple_element<Is, TupleType>::type>(
            fd.fusion_state.at(args.at(Is)))...);
  }

  void operator()(FusionDefinition& fd) final {
    using arg_tuple_t = std::tuple<ArgTypes...>;
    auto indices =
        std::make_index_sequence<std::tuple_size<arg_tuple_t>::value>();
    arg_tuple_t inputs;
    auto output = opFunc(fd, inputs, indices);
    fd.fusion_state.at(outputs.at(0)) = output;
  }

 private:
  std::function<OutType(ArgTypes...)> fusion_op_;
};
```

Perhaps the most confusing aspect of the Python Frontend is the `FusionDefinition`. The C++ class that is bound to is purposely very lightweight. To make sure users don't have to touch more than one file when adding new ops (assuming an appropriate Record has already been defined), the Python bindings effectively create functions that act on the `FusionDefinition`: they appear as part of the class in Python but are not part of the class in C++. Below is an example of a unary op macro. It creates the binding to a lambda function that effectively appears as a `FusionDefinition` operation in Python. The other way to do this would have been to create a class method directly in the C++ `FusionDefinition` and have a separate binding to that method.
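Before the macro itself, a hypothetical Python sketch (illustrative names, not NVFuser's API) of what each generated binding does may help: it allocates a new index in the recording state, appends an op record pairing the fusion op with its argument indices, and returns the new handle.

```python
class OpRecord:
    """Pairs a fusion op with state indices, mirroring the C++ OpRecord."""
    def __init__(self, arg_indices, out_index, fusion_op):
        self.arg_indices = arg_indices
        self.out_index = out_index
        self.fusion_op = fusion_op

    def __call__(self, fusion_state):
        # Replay: look up arguments by index, store the result by index.
        args = [fusion_state[i] for i in self.arg_indices]
        fusion_state[self.out_index] = self.fusion_op(*args)

class SketchDefinition:
    """Hypothetical analogue of the bound FusionDefinition state."""
    def __init__(self):
        self.recording = []
        self.recording_state = []   # handles (here just indices)
        self.fusion_state = {}      # filled in only during replay

    def unary_op(self, fn, input_index):
        # Roughly what the bound lambda in the C++ macro does:
        output_index = len(self.recording_state)
        self.recording_state.append(output_index)
        self.recording.append(OpRecord([input_index], output_index, fn))
        return output_index
```

Note that calling `unary_op` touches only the recording; the fusion state is populated later, when the records are replayed.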
```
#define NVFUSER_PYTHON_BINDING_UNARY_OP(op_str, op_name)              \
  nvf_ops.def(                                                        \
      op_str,                                                         \
      [](nvfuser::FusionDefinition::Operators& self,                  \
         nvfuser::Tensor* input) -> nvfuser::Tensor* {                \
        nvfuser::Tensor* output = new nvfuser::Tensor(                \
            self.fusion_definition->recording_state.size());          \
        self.fusion_definition->recording_state.emplace_back(output); \
        self.fusion_definition->recording.emplace_back(               \
            new nvfuser::OpRecord<NvfTensorView*, NvfTensorView*>(    \
                {input->index},                                       \
                {output->index},                                      \
                static_cast<NvfTensorView* (*)(NvfTensorView*)>(      \
                    torch::jit::fuser::cuda::op_name)));              \
        return output;                                                \
      },                                                              \
      py::return_value_policy::reference);                            \
```

Here is the `FusionDefinition` class, edited for brevity. The replaying of the Records will be found under the `exit()` method, where "exit" refers to exiting the Python context manager. A `FusionDefinition` is captured through a context manager like the following:

```
fusion = Fusion()

with FusionDefinition(fusion) as fd:
    t0 = fd.define_tensor(sizes=[5], strides=[1])
    t1 = fd.ops.abs(t0)
    fd.add_output(t1)
```

```
class FusionDefinition {
 public:
  FusionDefinition(FusionOwner* fusion_owner)
      : fusion_owner_(fusion_owner),
        prev_fusion_(nullptr),
        recording(),
        recording_state(),
        fusion_state(),
        ops(this) {}

  // Context Manager Methods
  FusionDefinition* enter() {
    prev_fusion_ = FusionGuard::getCurFusion();
    FusionGuard::setCurFusion(fusionPtr());
    return this;
  }

  void exit() {
    // Found in the Python Bindings, currently.
    // for (auto& record : recording) {
    //   auto functor = record.get();
    //   (*functor)(self);
    // }
    FusionGuard::setCurFusion(prev_fusion_);
    prev_fusion_ = nullptr;
  }

  void addInput(torch::jit::fuser::cuda::Val* input) {
    fusionPtr()->addInput(input);
  }
  void addOutput(torch::jit::fuser::cuda::Val* output) {
    fusionPtr()->addOutput(output);
  }

  Fusion* fusionPtr() {
    return fusion_owner_->fusionPtr();
  }

 private:
  FusionOwner* fusion_owner_;
  Fusion* prev_fusion_;

 public:
  std::vector<std::unique_ptr<RecordFunctor>> recording;
  std::vector<std::unique_ptr<State>> recording_state;
  std::vector<NvfVal*> fusion_state;

  struct Operators {
    Operators(FusionDefinition* fd) : fusion_definition(fd) {}
    // Python operations are effectively bound here.
    FusionDefinition* fusion_definition;
  };

  Operators ops;
};
```

The Fusion IR doesn't have `define_tensor` or `define_scalar` functions. I made them up as names for the Python `FusionDefinition`, as a more understandable and convenient way to define input tensors and scalars. `TensorView` objects and Fusion IR `Val` objects are not typically defined outside of a Fusion IR `Expr` output (typically arith function outputs), except for inputs to a graph. Mechanically speaking, there are two things you need to do to define an input in the Fusion IR: define the IR `TensorView`/`Val` object, and then record that object as an input in the `Fusion` object that encapsulates the Fusion IR. Since the `FusionDefinition` does not correspond one-to-one with the Fusion IR, and `define_tensor` and `define_scalar` are made-up functions, I decided to combine the `Val` object creation and the recording of the input in the `Fusion` object into one step, to reduce the amount of syntax required to define a Fusion in the Python interface.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81578
Approved by: https://github.com/jjsjann123, https://github.com/IvanYashchuk, https://github.com/SherlockNoMad
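As a footnote, the full context-manager flow described above (including an input record that combines value creation and input registration, as `define_tensor`/`define_scalar` do) can be sketched with a hypothetical Python class. This is illustrative only; the real `enter`/`exit` live in C++ and the bindings.

```python
class SketchFusionDefinition:
    """Hypothetical sketch: records ops, replays them on context exit."""
    def __init__(self):
        self.recording = []     # list of functors, like the C++ recording
        self.fusion_state = {}  # built only when the records are replayed

    def __enter__(self):
        return self

    def define_input(self, index, value):
        # One record both creates the value and registers it as an input.
        self.recording.append(
            lambda state, i=index, v=value: state.__setitem__(i, v))

    def unary(self, fn, in_index, out_index):
        self.recording.append(
            lambda state, f=fn, a=in_index, o=out_index:
                state.__setitem__(o, f(state[a])))

    def __exit__(self, exc_type, exc, tb):
        # The replay loop that exit() will host once caching lands.
        for record in self.recording:
            record(self.fusion_state)
        return False
```

Nothing is evaluated inside the `with` body; the fusion state only exists after exit replays the recording.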