[Docathon][Add Overview Doc No.15-17] add doc of docathon 15-17 #6595
Conversation
…tella_docathon_branch
Thanks for contributing to the PaddlePaddle docs. The doc preview is building; it will be viewable once Docs-New finishes. Preview link: http://preview-pr-6595.paddle-docs-preview.paddlepaddle.org.cn/documentation/docs/zh/api/index_cn.html
@@ -59,6 +60,7 @@ Fleet 分布式高层 API
" :ref:`destroy_process_group <cn_api_paddle_distributed_destroy_process_group>` ", "销毁分布式通信组"
" :ref:`get_backend <cn_api_paddle_distributed_get_backend>` ", "获取指定分布式通信组后端的名称"
This line can be removed.
@@ -123,6 +125,9 @@ Stream 集合通信高级 API
" :ref:`stream.reduce_scatter <cn_api_paddle_distributed_stream_reduce_scatter>` ", "规约一组 tensor,随后将规约结果分发到每个进程"
" :ref:`stream.send <cn_api_paddle_distributed_stream_send>` ", "发送一个 tensor 到指定进程"
" :ref:`stream.recv <cn_api_paddle_distributed_stream_recv>` ", "接收一个来自指定进程的 tensor"
" :ref:`gloo_init_parallel <cn_api_paddle_distributed_gloo_init_parallel>` ", "初始化 ``GLOO`` 上下文用于 CPU 间的通信"
Isn't this API name wrong? It should be "gloo_init_parallel_env".
@@ -123,6 +125,9 @@ Stream 集合通信高级 API
" :ref:`stream.reduce_scatter <cn_api_paddle_distributed_stream_reduce_scatter>` ", "规约一组 tensor,随后将规约结果分发到每个进程"
" :ref:`stream.send <cn_api_paddle_distributed_stream_send>` ", "发送一个 tensor 到指定进程"
" :ref:`stream.recv <cn_api_paddle_distributed_stream_recv>` ", "接收一个来自指定进程的 tensor"
" :ref:`gloo_init_parallel <cn_api_paddle_distributed_gloo_init_parallel>` ", "初始化 ``GLOO`` 上下文用于 CPU 间的通信"
" :ref:`gloo_init_parallel <cn_api_paddle_distributed_gloo_init_parallel>` ", "使用初始化的 gloo 上下文直接调用基于 gloo 封装的 barrier 函数"
Which API is this supposed to be?
" :ref:`shard_layer <cn_api_paddle_distributed_shard_layer>` ", "按照指定方式将 Layer 中的参数转换为分布式 Tensor"
" :ref:`reshard <cn_api_paddle_distributed_reshard>`", "对一个带有分布式信息的 Tensor 重新进行分布/切片"
" :ref:`to_static <cn_api_paddle_distributed_to_static>`", "将带有分布式切分信息的动态图模型转换为静态图分布式模型"
" :ref:`Strategy <cn_api_paddle_distributed_Strategy>`", "配置静态图分布式训练时所使用的并行策略和优化策略"
" :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
Move this API into a new "数据分片" (data sharding) section.
" :ref:`shard_layer <cn_api_paddle_distributed_shard_layer>` ", "按照指定方式将 Layer 中的参数转换为分布式 Tensor"
" :ref:`reshard <cn_api_paddle_distributed_reshard>`", "对一个带有分布式信息的 Tensor 重新进行分布/切片"
" :ref:`to_static <cn_api_paddle_distributed_to_static>`", "将带有分布式切分信息的动态图模型转换为静态图分布式模型"
" :ref:`Strategy <cn_api_paddle_distributed_Strategy>`", "配置静态图分布式训练时所使用的并行策略和优化策略"
" :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
This API should be placed under the 集合通信 (collective communication) API section.
@@ -155,7 +160,13 @@ RPC API
:widths: 20, 50

" :ref:`shard_tensor <cn_api_paddle_distributed_shard_tensor>` ", "创建带有分布式切分信息的分布式 Tensor"
" :ref:`dtensor_from_fn <cn_api_paddle_distributed_dtensor_from_fn>` ", "通过一个 paddle API 结合分布式属性 placements 创建一个带分布式属性的 Tensor"
This API should go under 自动并行 (auto parallel).
" :ref:`shard_layer <cn_api_paddle_distributed_shard_layer>` ", "按照指定方式将 Layer 中的参数转换为分布式 Tensor"
" :ref:`reshard <cn_api_paddle_distributed_reshard>`", "对一个带有分布式信息的 Tensor 重新进行分布/切片"
" :ref:`to_static <cn_api_paddle_distributed_to_static>`", "将带有分布式切分信息的动态图模型转换为静态图分布式模型"
" :ref:`Strategy <cn_api_paddle_distributed_Strategy>`", "配置静态图分布式训练时所使用的并行策略和优化策略"
" :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
Same as above.
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
" :ref:`DisAttr <cn_api_paddle_distributed_DisAttr>` ", "指定 Tensor 在 ProcessMesh 上的分布或切片方式"
" :ref:`shard_optimizer <cn_api_paddle_distributed_shard_optimizer>` ", "将单卡视角的优化器转变为分布式视角"
Put it under "自动并行" (auto parallel).
" :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
" :ref:`DisAttr <cn_api_paddle_distributed_DisAttr>` ", "指定 Tensor 在 ProcessMesh 上的分布或切片方式"
Put it under "自动并行" (auto parallel).
…tella_docathon_branch
@@ -100,6 +101,7 @@ Fleet 分布式高层 API
" :ref:`send <cn_api_paddle_distributed_send>` ", "发送一个 tensor 到指定进程"
" :ref:`recv <cn_api_paddle_distributed_recv>` ", "接收一个来自指定进程的 tensor"
" :ref:`barrier <cn_api_paddle_distributed_barrier>` ", "同步路障,阻塞操作以实现组内进程同步"
" :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
This one fits better under 环境配置和训练启动管理 (environment setup and training launch management): it retrieves an instance of an already-created communication group, so it is part of configuring the distributed environment. Put it right below new_group.
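A minimal rst sketch of the suggested placement; the new_group row and its ref label here are assumptions about the existing 环境配置和训练启动管理 table, not quoted from the diff:

```rst
.. csv-table::
    :header: "API 名称", "API 功能"
    :widths: 20, 50

    " :ref:`new_group <cn_api_paddle_distributed_new_group>` ", "创建分布式通信组"
    " :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
```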
" :ref:`gloo_init_parallel <cn_api_paddle_distributed_gloo_init_parallel>` ", "初始化 ``GLOO`` 上下文用于 CPU 间的通信"
" :ref:`gloo_barrier <cn_api_paddle_distributed_gloo_barrier>` ", "使用初始化的 gloo 上下文直接调用基于 gloo 封装的 barrier 函数"
" :ref:`gloo_release <cn_api_paddle_distributed_gloo_release>` ", "释放当前并行环境的 gloo 上下文"
gloo_init_parallel_env and gloo_release should be classified under 环境配置和训练启动管理, since they respectively initialize and release a Gloo-specific parallel environment. gloo_barrier fits better under the 集合通信 (collective communication) API section.
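A sketch of that split in the file's own csv-table style; the section headings and ref labels (e.g. `cn_api_paddle_distributed_gloo_init_parallel_env`) are assumptions following the naming pattern elsewhere in the file:

```rst
环境配置和训练启动管理
::::::::::::::::::::::::::

.. csv-table::
    :header: "API 名称", "API 功能"
    :widths: 20, 50

    " :ref:`gloo_init_parallel_env <cn_api_paddle_distributed_gloo_init_parallel_env>` ", "初始化 ``GLOO`` 上下文用于 CPU 间的通信"
    " :ref:`gloo_release <cn_api_paddle_distributed_gloo_release>` ", "释放当前并行环境的 gloo 上下文"

集合通信 API
::::::::::::::::::::::::::

.. csv-table::
    :header: "API 名称", "API 功能"
    :widths: 20, 50

    " :ref:`gloo_barrier <cn_api_paddle_distributed_gloo_barrier>` ", "使用初始化的 gloo 上下文直接调用基于 gloo 封装的 barrier 函数"
```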
" :ref:`shard_layer <cn_api_paddle_distributed_shard_layer>` ", "按照指定方式将 Layer 中的参数转换为分布式 Tensor"
" :ref:`reshard <cn_api_paddle_distributed_reshard>`", "对一个带有分布式信息的 Tensor 重新进行分布/切片"
" :ref:`to_static <cn_api_paddle_distributed_to_static>`", "将带有分布式切分信息的动态图模型转换为静态图分布式模型"
" :ref:`Strategy <cn_api_paddle_distributed_Strategy>`", "配置静态图分布式训练时所使用的并行策略和优化策略"
" :ref:`DisAttr <cn_api_paddle_distributed_DisAttr>` ", "指定 Tensor 在 ProcessMesh 上的分布或切片方式"
Suggested change:
" :ref:`DisAttr <cn_api_paddle_distributed_DisAttr>` ", "指定 Tensor 在 ProcessMesh 上的分布或切片方式"
" :ref:`DistAttr <cn_api_paddle_distributed_DistAttr>` ", "指定 Tensor 在 ProcessMesh 上的分布或切片方式"
数据分片 API
::::::::::::::::::::::::::

" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
Suggested change:
数据分片 API
::::::::::::::::::::::::::
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
Sharding API
::::::::::::::::::::::::::
.. csv-table::
    :header: "API 名称", "API 功能"
    :widths: 20, 50
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
…,gloo_barrier, DistAttr and Sharding API
…tella_docathon_branch
" :ref:`group_sharded_parallel <cn_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <cn_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
:widths: 20, 50

" :ref:`sharding.group_sharded_parallel <cn_api_paddle_distributed_sharding_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`sharding.save_group_sharded_model <cn_api_paddle_distributed_sharding_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
" :ref:`split <cn_api_paddle_distributed_split>` ", "切分指定操作的参数到多个设备,并且并行计算得到结果"
Why is this under sharding? It should be under 自动并行 (auto parallel), shouldn't it?
LGTM
LGTM
15 paddle.distributed.get_group、paddle.distributed.group_sharded_parallel、paddle.distributed.save_group_sharded_model
16 paddle.distributed.gloo_init_parallel_env、paddle.distributed.gloo_barrier、paddle.distributed.gloo_release
17 paddle.distributed.is_initialized、paddle.distributed.is_initialized、paddle.distributed.DistAttr、paddle.distributed.dtensor_from_fn_cn、paddle.distributed.shard_optimizer
Chinese docs link: #6427
@sunzhongkai588 @Turingg