📚 Approach for Adapting the Compiler to the New IR #56879

## 1. CINN Pipeline

**The main path by which the main framework currently hands computation off to the CINN backend is:**
```
framework::ProgramDesc  =>  ir::Graph   =>  frontend::Program (NetBuilder layer)   =>  hlir::Graph =>
Compute/Schedule()   =>  AST level   =>  Module::Builder  =>  CodeGen+NVRTC =>  Runtime::Program
```

![image](https://github.com/PaddlePaddle/Paddle/assets/9301846/6a870d9c-c6a9-425f-bc59-9b989c7edfb4)

Diagram of the new optimization flow (click the image to view it at full size)

![image](https://github.com/PaddlePaddle/Paddle/assets/9301846/0a84ac3d-fa6c-4880-8255-5cf4172d7a08)


```cpp
template <typename T, typename Context>
void ConvKernel(const Context& dev_ctx,
                const DenseTensor& input,
                const DenseTensor& filter,
                const std::vector<int>& strides,
                const std::vector<int>& paddings,
                const std::string& padding_algorithm,
                const std::vector<int>& dilations,
                int groups,
                const std::string& data_format,
                DenseTensor* out) {
      /*
      */
}
```

There are two major design principles here:

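1. To keep the overall flow stable, the new IR must ensure that every op in the graph stays connected to its neighbors. If one large graph can represent the entire computation network, it is not advisable to cut it apart and then rely on names or some other mechanism for "soft" connections. By design, this requires the subgraphs produced by fusion merge (together with the ops they contain) to remain connected to the whole graph.
2. Per the earlier design review discussions, the CUDA C kernels generated by CINN are placed into a CINN JIT Instruction for execution. The CINN JIT Instruction is responsible for preparing the kernel inputs and outputs, while garbage collection (GC) as a whole is managed centrally by the executor.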
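The first principle can be illustrated with a small sketch. It assumes a toy graph in which each op records the index of the op that produces each operand, and a fused group is just a set of op indices into that same graph. `ToyOp`, `ToyGroup`, `ExternalProducers`, and `ExternalUsers` are hypothetical names for illustration, not New IR or CINN APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy op: not a New IR or CINN type.
struct ToyOp {
  std::string name;
  // Index of the op that produces each operand; -1 marks a graph input.
  std::vector<int> producers;
};

// A fused group only records which ops of the whole graph it covers,
// so the ops are never copied out of (or disconnected from) the graph.
using ToyGroup = std::set<int>;

// Ops outside the group whose results are consumed inside the group.
std::set<int> ExternalProducers(const std::vector<ToyOp>& graph,
                                const ToyGroup& group) {
  std::set<int> result;
  for (int op : group) {
    assert(op >= 0 && static_cast<std::size_t>(op) < graph.size());
    for (int producer : graph[static_cast<std::size_t>(op)].producers) {
      if (producer >= 0 && group.count(producer) == 0) result.insert(producer);
    }
  }
  return result;
}

// Ops outside the group that consume a result produced inside the group.
std::set<int> ExternalUsers(const std::vector<ToyOp>& graph,
                            const ToyGroup& group) {
  std::set<int> result;
  for (std::size_t op = 0; op < graph.size(); ++op) {
    const int op_id = static_cast<int>(op);
    if (group.count(op_id) != 0) continue;
    for (int producer : graph[op].producers) {
      if (group.count(producer) != 0) result.insert(op_id);
    }
  }
  return result;
}
```

Because the group is only a view over the whole graph, its boundary can be recomputed from real def-use edges at any time, with no name matching required.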

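For details on how the fusion merge pass is implemented, see: [Group融合Pass流程简述 (Brief Overview of the Group Fusion Pass Flow)]()

The fusion merge architecture will be upgraded to the following structure: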

![image](https://github.com/PaddlePaddle/Paddle/assets/9301846/03dcb6d6-cf87-4b0a-8195-9aa906b338d6)


**Stage 1:** `ProgramDesc` => `ir::Graph` => `CinnCompiler`, handled by `build_cinn_pass`

```cpp
    auto compilation_key = cinn_compiler->AddGraph(std::move(subgraph));
    VLOG(4) << "Compilation Key:\n"
            << cinn_compiler->ReadableKey(compilation_key);

    // Replace the found cluster to a new cinn op node
    ReplaceSubGraphWithCinnOpNode(cluster_set,
                                  cluster_inputs,
                                  cluster_outputs,
                                  cluster_internals,
                                  compilation_key,
                                  graph);
```
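The idea behind this excerpt (register the subgraph, get a key back, and leave a single placeholder node carrying that key in the graph) can be sketched as follows. This is a toy model under stated assumptions: `ToyNode`, `ToyCompilerCache`, `ReplaceClusterWithPlaceholder`, and the `"toy_cinn_op"` node type are all hypothetical, the cluster is assumed to be a contiguous range of nodes, and the key format is invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy node: not the Paddle ir::Node type.
struct ToyNode {
  std::string type;
  std::string compilation_key;  // only set on placeholder nodes
};

// Toy registry that owns the extracted subgraphs.
class ToyCompilerCache {
 public:
  // Stores the subgraph and returns the key under which it can be found later.
  std::string AddGraph(std::vector<ToyNode> subgraph) {
    std::string key = "subgraph_" + std::to_string(graphs_.size());
    graphs_.emplace(key, std::move(subgraph));
    return key;
  }

  const std::vector<ToyNode>& FindGraph(const std::string& key) const {
    return graphs_.at(key);
  }

 private:
  std::map<std::string, std::vector<ToyNode>> graphs_;
};

// Replaces nodes [begin, end) with one placeholder node that carries the key.
std::vector<ToyNode> ReplaceClusterWithPlaceholder(
    const std::vector<ToyNode>& nodes, std::size_t begin, std::size_t end,
    ToyCompilerCache* cache) {
  assert(cache != nullptr);
  assert(begin <= end && end <= nodes.size());
  const auto first = nodes.begin() + static_cast<std::ptrdiff_t>(begin);
  const auto last = nodes.begin() + static_cast<std::ptrdiff_t>(end);

  std::string key = cache->AddGraph(std::vector<ToyNode>(first, last));

  std::vector<ToyNode> result(nodes.begin(), first);
  result.push_back({"toy_cinn_op", key});
  result.insert(result.end(), last, nodes.end());
  return result;
}
```

At run time the placeholder node only needs its key to look the subgraph up again, which is what makes deferred (JIT) compilation possible in this toy model.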

**Stage 2:** `ir::Graph`  =>  `frontend::Program` => `hlir::Graph`, handled by `CinnCompiler`

```cpp
    const CinnCompiledObject &CinnCompiler::Compile(
        const Graph &graph,
        const std::map<std::string, const phi::DenseTensor *> &input_tensors,
        const Target &target,
        void *stream) {
        
        auto compiled_res =
          CompileGraph(graph, input_tensors, target, compiled_num, stream);
          
        }
   
     std::unique_ptr<CinnCompiledObject> CinnCompiler::CompileGraph(....){
      CinnGraphSymbolization symbol{compiled_num, graph, target, input_tensors};
       auto frontend_program = symbol();  // <----- key point 1
       
       auto cinn_graph = Optimize(&frontend_program, fetch_ids, target); // <---- key point 2
       
       auto graph_compiler = std::make_unique<GraphCompiler>(target, scope, cinn_graph); // <--- key point 3
       
       auto compiled_res = graph_compiler->Build(options, std::move(fetch_ids), stream);

 }
```

**Stage 3:** `hlir::Graph` => `Compute/Schedule` => `AST level` =>  `Module::Builder`  => `CodeGen+NVRTC` => `Runtime::Program`, handled by `GraphCompiler`

```cpp
     GraphCompiler::CompilationResult GraphCompiler::Build(const GraphCompiler::CompileOptions& options,
                                                      std::unordered_set<std::string>&& fetch_var_ids,
                                                      void* stream) {
    
       for (int i = 0; i < groups.size(); i++) {
          if (groups[i].size() == 1) {
            lowered_func = GetOpFunc(groups[i][0]);  // calls impl->fcompute and impl->fschedule and returns lang::LowerVec; [important] this is already the AST level
          } else {
            lowered_func = GetOpFunc(groups[i]);
          }
          local_lowered_funcs.emplace_back(std::move(lowered_func));
        }
        
       for (auto&& lowered_func : lowered_funcs) {
              this->ProcessFunction(lowered_func);   // calls m_builder_.AddFunction(func); [important] this is already the AST level
        }
 
        auto build_module = m_builder_.Build(); // ir::Module::Builder
        
        
        auto out = codegen.Compile(build_module, CodeGenC::OutputKind::CImpl); // CodeGen
        compiler_ = backends::Compiler::Create(target_);
        compiler_->Build(build_module, options.attached_code);  // NVRTC compiler
        
        auto instructions = BuildInstructions(groups, options.groups.empty() ? graph_->fusion_groups : options.groups);
        result.runtime_program.reset(new Program(scope_, std::move(instructions)));
}
```
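The shape of this stage (lower each group to a function, collect the functions in a module builder, then generate source from the built module) can be sketched with a self-contained toy. All names here (`ToyLoweredFunc`, `ToyModuleBuilder`, `LowerGroup`, `GenerateSource`, `BuildModule`) are hypothetical and the "generated source" is a placeholder string; the sketch mirrors only the control flow of the excerpt, not CINN's actual lowering or code generation.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a lowered (AST-level) function.
struct ToyLoweredFunc {
  std::string name;
  std::vector<std::string> body;  // one toy statement per fused op
};

struct ToyModule {
  std::vector<ToyLoweredFunc> functions;
};

// Collects lowered functions, then hands them over as one module.
class ToyModuleBuilder {
 public:
  void AddFunction(ToyLoweredFunc func) {
    module_.functions.push_back(std::move(func));
  }
  ToyModule Build() { return std::move(module_); }

 private:
  ToyModule module_;
};

// Lowers one fusion group (a list of op names) to a single toy function.
ToyLoweredFunc LowerGroup(const std::vector<std::string>& group) {
  assert(!group.empty());
  ToyLoweredFunc func;
  func.name = "fn";
  for (const auto& op : group) {
    func.name += "_" + op;
    func.body.push_back(op + "();");
  }
  return func;
}

// Toy "codegen": prints every function of the module as C-like text.
std::string GenerateSource(const ToyModule& module) {
  std::string source;
  for (const auto& func : module.functions) {
    source += "void " + func.name + "() {";
    for (const auto& stmt : func.body) source += " " + stmt;
    source += " }\n";
  }
  return source;
}

// One lowered function per group, all collected into one module.
ToyModule BuildModule(const std::vector<std::vector<std::string>>& groups) {
  ToyModuleBuilder builder;
  for (const auto& group : groups) builder.AddFunction(LowerGroup(group));
  return builder.Build();
}
```

In this toy, the module is the only thing the code generator sees, which is the decoupling from the graph-level IR that the later discussion of `Module::Builder` relies on.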

## 2. Adaptation Plan

To migrate from: framework::ProgramDesc → ir::Graph → frontend::Program (NetBuilder layer) → hlir::Graph → Compute/Schedule → AST level → Module::Builder → CodeGen+NVRTC → Runtime::Program

to: framework::ProgramDesc → ProgramTranslator → New IR → New IR Graph → Compute/Schedule → AST level → Module::Builder → CodeGen+NVRTC → Runtime::Program

<p align="center">
<img width="60%" alt="image" src="https://github.com/PaddlePaddle/Paddle/assets/9301846/714e60c2-0dae-4e9f-ad9a-9da3463a9fd7">
</p>

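**the roles of the individual modules change as follows:**

- Unify the main framework's ir::Graph and CINN's frontend::Program and hlir::Graph onto the New IR, which represents both the Program and the Graph concepts.
- Unify the NetBuilder component. Both CINN's core APIs and what `Build()` generates dynamically will be New IR rather than frontend::Program. The main framework already has such a component, so we need to consider how to serve both Phi and CINN and abstract it further for extensibility.
- Complete the definition of the New IR Graph and consider whether a new Dialect or a proxy component is needed. CINN's existing logic relies on hlir::Graph to prepare for lowering (e.g. preparing inputs and outputs) and then calls Compute and Schedule to descend to the AST level.
- Decide whether the role of Module::Builder should be kept. My personal inclination is to keep it: Builder, Codegen, and NVRTC are all intermediate processing modules that are decoupled from the IR level and bridge compile time and run time, so they need no adjustment for now.
- Runtime::Program is analogous to a Phi Kernel and is a runtime concept. It may be necessary to redefine it as a Dialect so that it returns to the New IR system and can be conveniently handed to the execution engine.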
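To make the first point concrete, here is a minimal sketch of one data structure that serves as both a Program (an ordered list of operations) and a Graph (def-use edges between values). `ToyProgram`, `ToyOperation`, and `ToyValue` are hypothetical names and the builder-style `AddOp` is an assumption for illustration; none of this is the actual New IR API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy value: records who defines it and who uses it.
struct ToyValue {
  int defining_op;         // index into the op list, -1 for a program input
  std::vector<int> users;  // indices of the ops that consume this value
};

// Hypothetical toy operation with a single result.
struct ToyOperation {
  std::string name;
  std::vector<int> operands;  // value ids
  int result;                 // value id
};

class ToyProgram {
 public:
  // Creates a value that no op defines (a program input).
  int AddInput() {
    values_.push_back(ToyValue{-1, {}});
    return static_cast<int>(values_.size()) - 1;
  }

  // Appends an op in program order and wires up the def-use edges.
  int AddOp(const std::string& name, const std::vector<int>& operands) {
    const int op_id = static_cast<int>(ops_.size());
    for (int v : operands) {
      assert(v >= 0 && static_cast<std::size_t>(v) < values_.size());
      values_[static_cast<std::size_t>(v)].users.push_back(op_id);
    }
    values_.push_back(ToyValue{op_id, {}});
    const int result = static_cast<int>(values_.size()) - 1;
    ops_.push_back(ToyOperation{name, operands, result});
    return result;
  }

  // Program view: ops in insertion order.
  const std::vector<ToyOperation>& ops() const { return ops_; }

  // Graph view: definition and users of a value.
  const ToyValue& value(int id) const {
    return values_.at(static_cast<std::size_t>(id));
  }

 private:
  std::vector<ToyOperation> ops_;
  std::vector<ToyValue> values_;
};
```

An op-by-op executor can walk `ops()` in order, while a fusion pass can follow `value(id).users` without building a separate graph object.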

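**Approach:**

- In the early phase, to decouple the work, rely on ProgramDesc + ProgramTranslator as the entry point that feeds the New IR.
- Treat migrating build_cinn_pass as a non-essential dependency of the feasibility-verification phase, while still keeping an eye on and evaluating the later implementation path during that phase.
- By technical design, the New IR carries both op-by-op and Graph semantics. CINN depends heavily on the latter, which needs to be driven by practice and refined.
- The initial plan is to use `GraphCompiler` as the core entry point and to verify the work through unit tests.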



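## 3. Development Notes

- Using `builder.Build<paddle::dialect::XX>` requires including the header `#include "paddle/fluid/ir/dialect/pd_op.h"`.
- Under the new IR, Instruction.Run() reported an "illegal memory address access" error. Fix PR:
  - This kind of problem is usually caused by (1) an out-of-bounds array access or (2) an access to an address outside this process's address space (e.g. a wild pointer).
  - Analysis showed that when the generated full kernel was executed, the out_ptr function argument was a wild pointer: cudaMalloc had never been called for it.
  - Although BuildScope takes target as a parameter, it never uses it. So does Scope->Var(Tensor).Resize() allocate device memory? Answer: no, it only sets the shape.
  - `InsertBufferHandlers` dynamically inserts extra Instructions for memory malloc and free.
- In CINN, what is the relationship between ir::Tensor and Expr? The code frequently uses interfaces such as as_tensor, as_expr, and as_tensor_ref; which scenario is each meant for?
- Isolated implementations of OpLowerer for the new and old IR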
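The memory issue described in the notes above can be reproduced in miniature on the host. The sketch below is a toy model, not CINN code: `ToyTensor`, `RunToyFullKernel`, `ToyInstruction`, and `InsertToyBufferHandlers` are hypothetical names, host vectors stand in for device memory, and the malloc/free placement (before the first and after the last kernel that touches a tensor) is an assumption about what such a pass might do.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tensor; not the CINN or Paddle tensor class.
class ToyTensor {
 public:
  // Only records the shape (and drops stale storage); reserves nothing.
  void Resize(std::vector<std::size_t> shape) {
    shape_ = std::move(shape);
    storage_.clear();
    allocated_ = false;
  }

  // Reserves storage for the current shape: the step that was missing.
  void Allocate() {
    storage_.assign(NumElements(), 0.0f);
    allocated_ = true;
  }

  std::size_t NumElements() const {
    std::size_t n = 1;
    for (std::size_t d : shape_) n *= d;
    return n;
  }

  // Returns nullptr instead of a dangling address when nothing is allocated.
  float* data() { return allocated_ ? storage_.data() : nullptr; }

 private:
  std::vector<std::size_t> shape_;
  std::vector<float> storage_;
  bool allocated_ = false;
};

// Toy "full" kernel: refuses to run on an unallocated output.
bool RunToyFullKernel(ToyTensor* out, float value) {
  assert(out != nullptr);
  float* ptr = out->data();
  if (ptr == nullptr) return false;
  for (std::size_t i = 0; i < out->NumElements(); ++i) ptr[i] = value;
  return true;
}

struct ToyInstruction {
  std::string kind;    // "malloc", "kernel", or "free"
  std::string tensor;  // name of the tensor the instruction touches
};

// Wraps each tensor's kernels with a malloc before the first use and a free
// after the last use.
std::vector<ToyInstruction> InsertToyBufferHandlers(
    const std::vector<ToyInstruction>& kernels) {
  std::map<std::string, std::size_t> first_use, last_use;
  for (std::size_t i = 0; i < kernels.size(); ++i) {
    first_use.emplace(kernels[i].tensor, i);  // keeps the earliest index
    last_use[kernels[i].tensor] = i;          // ends up at the latest index
  }
  std::vector<ToyInstruction> result;
  for (std::size_t i = 0; i < kernels.size(); ++i) {
    const std::string& t = kernels[i].tensor;
    if (first_use.at(t) == i) result.push_back({"malloc", t});
    result.push_back(kernels[i]);
    if (last_use.at(t) == i) result.push_back({"free", t});
  }
  return result;
}
```

In the toy, calling `Resize` alone leaves `data()` null, so the kernel cannot run until something allocates the buffer, either explicitly or through inserted malloc instructions.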

<img width="801" alt="image" src="https://github.com/PaddlePaddle/Paddle/assets/9301846/5bfd9939-0d4b-43f0-82c2-4f36f52522b3">
