[Docathon][Add Overview Doc No.15-17] add doc of docathon 15-17 #6595
Conversation
…tella_docathon_branch
Thanks for contributing to the PaddlePaddle docs. The doc preview is building; it will be viewable once Docs-New finishes. Preview link: http://preview-pr-6595.paddle-docs-preview.paddlepaddle.org.cn/documentation/docs/zh/api/index_cn.html
@@ -59,6 +60,7 @@ Fleet 分布式高层 API
" :ref:`destroy_process_group <cn_api_paddle_distributed_destroy_process_group>` ", "销毁分布式通信组"
" :ref:`get_backend <cn_api_paddle_distributed_get_backend>` ", "获取指定分布式通信组后端的名称"
This line can be removed.
@@ -123,6 +125,9 @@ Stream 集合通信高级 API
" :ref:`stream.reduce_scatter <cn_api_paddle_distributed_stream_reduce_scatter>` ", "规约一组 tensor,随后将规约结果分发到每个进程"
" :ref:`stream.send <cn_api_paddle_distributed_stream_send>` ", "发送一个 tensor 到指定进程"
" :ref:`stream.recv <cn_api_paddle_distributed_stream_recv>` ", "接收一个来自指定进程的 tensor"
" :ref:`gloo_init_parallel <cn_api_paddle_distributed_gloo_init_parallel>` ", "初始化 ``GLOO`` 上下文用于 CPU 间的通信"
Isn't this API name wrong? It should be "gloo_init_parallel_env".
@@ -123,6 +125,9 @@ Stream 集合通信高级 API
" :ref:`stream.reduce_scatter <cn_api_paddle_distributed_stream_reduce_scatter>` ", "规约一组 tensor,随后将规约结果分发到每个进程"
" :ref:`stream.send <cn_api_paddle_distributed_stream_send>` ", "发送一个 tensor 到指定进程"
" :ref:`stream.recv <cn_api_paddle_distributed_stream_recv>` ", "接收一个来自指定进程的 tensor"
" :ref:`gloo_init_parallel <cn_api_paddle_distributed_gloo_init_parallel>` ", "初始化 ``GLOO`` 上下文用于 CPU 间的通信"
" :ref:`gloo_init_parallel <cn_api_paddle_distributed_gloo_init_parallel>` ", "使用初始化的 gloo 上下文直接调用基于 gloo 封装的 barrier 函数"
Which API is this supposed to be?
" :ref:`shard_layer <cn_api_paddle_distributed_shard_layer>` ", "按照指定方式将 Layer 中的参数转换为分布式 Tensor"
" :ref:`reshard <cn_api_paddle_distributed_reshard>`", "对一个带有分布式信息的 Tensor 重新进行分布/切片"
" :ref:`to_static <cn_api_paddle_distributed_to_static>`", "将带有分布式切分信息的动态图模型转换为静态图分布式模型"
" :ref:`Strategy <cn_api_paddle_distributed_Strategy>`", "配置静态图分布式训练时所使用的并行策略和优化策略"
" :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
Move this API into a new "数据分片" (data sharding) section.
" :ref:`shard_layer <cn_api_paddle_distributed_shard_layer>` ", "按照指定方式将 Layer 中的参数转换为分布式 Tensor"
" :ref:`reshard <cn_api_paddle_distributed_reshard>`", "对一个带有分布式信息的 Tensor 重新进行分布/切片"
" :ref:`to_static <cn_api_paddle_distributed_to_static>`", "将带有分布式切分信息的动态图模型转换为静态图分布式模型"
" :ref:`Strategy <cn_api_paddle_distributed_Strategy>`", "配置静态图分布式训练时所使用的并行策略和优化策略"
" :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
This API should be placed under the 集合通信 (collective communication) API section.
@@ -155,7 +160,13 @@ RPC API
:widths: 20, 50

" :ref:`shard_tensor <cn_api_paddle_distributed_shard_tensor>` ", "创建带有分布式切分信息的分布式 Tensor"
" :ref:`dtensor_from_fn <cn_api_paddle_distributed_dtensor_from_fn>` ", "通过一个 paddle API 结合分布式属性 placements 创建一个带分布式属性的 Tensor"
This API should go under 自动并行 (auto parallel).
" :ref:`shard_layer <cn_api_paddle_distributed_shard_layer>` ", "按照指定方式将 Layer 中的参数转换为分布式 Tensor"
" :ref:`reshard <cn_api_paddle_distributed_reshard>`", "对一个带有分布式信息的 Tensor 重新进行分布/切片"
" :ref:`to_static <cn_api_paddle_distributed_to_static>`", "将带有分布式切分信息的动态图模型转换为静态图分布式模型"
" :ref:`Strategy <cn_api_paddle_distributed_Strategy>`", "配置静态图分布式训练时所使用的并行策略和优化策略"
" :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
Same as above.
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
" :ref:`DisAttr <cn_api_paddle_distributed_DisAttr>` ", "指定 Tensor 在 ProcessMesh 上的分布或切片方式"
" :ref:`shard_optimizer <cn_api_paddle_distributed_shard_optimizer>` ", "将单卡视角的优化器转变为分布式视角"
Put it under "自动并行" (auto parallel).
" :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
" :ref:`DisAttr <cn_api_paddle_distributed_DisAttr>` ", "指定 Tensor 在 ProcessMesh 上的分布或切片方式"
Put it under "自动并行" (auto parallel).
…tella_docathon_branch
@@ -100,6 +101,7 @@ Fleet 分布式高层 API
" :ref:`send <cn_api_paddle_distributed_send>` ", "发送一个 tensor 到指定进程"
" :ref:`recv <cn_api_paddle_distributed_recv>` ", "接收一个来自指定进程的 tensor"
" :ref:`barrier <cn_api_paddle_distributed_barrier>` ", "同步路障,阻塞操作以实现组内进程同步"
" :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
This one fits better under 环境配置和训练启动管理 (environment setup and training launch management): it retrieves an instance of an already-created communication group, so it is part of configuring the distributed environment. Put it right below new_group.
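A minimal rst sketch of the suggested placement; the new_group row and its ref label here are assumptions about the existing 环境配置和训练启动管理 table, not quoted from the diff:

```rst
.. csv-table::
    :header: "API 名称", "API 功能"
    :widths: 20, 50

    " :ref:`new_group <cn_api_paddle_distributed_new_group>` ", "创建分布式通信组"
    " :ref:`get_group <cn_api_paddle_distributed_get_group>` ", "通过通信组 id 获取通信组实例"
```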
" :ref:`gloo_init_parallel <cn_api_paddle_distributed_gloo_init_parallel>` ", "初始化 ``GLOO`` 上下文用于 CPU 间的通信"
" :ref:`gloo_barrier <cn_api_paddle_distributed_gloo_barrier>` ", "使用初始化的 gloo 上下文直接调用基于 gloo 封装的 barrier 函数"
" :ref:`gloo_release <cn_api_paddle_distributed_gloo_release>` ", "释放当前并行环境的 gloo 上下文"
gloo_init_parallel_env and gloo_release should be classified under 环境配置和训练启动管理, since they respectively initialize and release a Gloo-specific parallel environment. gloo_barrier fits better under the 集合通信 (collective communication) API section.
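A sketch of that split in the file's own csv-table style; the section headings and ref labels (e.g. `cn_api_paddle_distributed_gloo_init_parallel_env`) are assumptions following the naming pattern elsewhere in the file:

```rst
环境配置和训练启动管理
::::::::::::::::::::::::::

.. csv-table::
    :header: "API 名称", "API 功能"
    :widths: 20, 50

    " :ref:`gloo_init_parallel_env <cn_api_paddle_distributed_gloo_init_parallel_env>` ", "初始化 ``GLOO`` 上下文用于 CPU 间的通信"
    " :ref:`gloo_release <cn_api_paddle_distributed_gloo_release>` ", "释放当前并行环境的 gloo 上下文"

集合通信 API
::::::::::::::::::::::::::

.. csv-table::
    :header: "API 名称", "API 功能"
    :widths: 20, 50

    " :ref:`gloo_barrier <cn_api_paddle_distributed_gloo_barrier>` ", "使用初始化的 gloo 上下文直接调用基于 gloo 封装的 barrier 函数"
```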
" :ref:`shard_layer <cn_api_paddle_distributed_shard_layer>` ", "按照指定方式将 Layer 中的参数转换为分布式 Tensor"
" :ref:`reshard <cn_api_paddle_distributed_reshard>`", "对一个带有分布式信息的 Tensor 重新进行分布/切片"
" :ref:`to_static <cn_api_paddle_distributed_to_static>`", "将带有分布式切分信息的动态图模型转换为静态图分布式模型"
" :ref:`Strategy <cn_api_paddle_distributed_Strategy>`", "配置静态图分布式训练时所使用的并行策略和优化策略"
" :ref:`DisAttr <cn_api_paddle_distributed_DisAttr>` ", "指定 Tensor 在 ProcessMesh 上的分布或切片方式"
Suggested change:
" :ref:`DisAttr <cn_api_paddle_distributed_DisAttr>` ", "指定 Tensor 在 ProcessMesh 上的分布或切片方式"
" :ref:`DistAttr <cn_api_paddle_distributed_DistAttr>` ", "指定 Tensor 在 ProcessMesh 上的分布或切片方式"
数据分片 API
::::::::::::::::::::::::::

" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
Suggested change:
数据分片 API
::::::::::::::::::::::::::
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
Sharding API
::::::::::::::::::::::::::
.. csv-table::
    :header: "API 名称", "API 功能"
    :widths: 20, 50
" :ref:`group_sharded_parallel <an_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <an_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
…,gloo_barrier, DistAttr and Sharding API
…tella_docathon_branch
" :ref:`group_sharded_parallel <cn_api_paddle_distributed_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`save_group_sharded_model <cn_api_paddle_distributed_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
:widths: 20, 50

" :ref:`sharding.group_sharded_parallel <cn_api_paddle_distributed_sharding_group_sharded_parallel>`", "对模型、优化器和 GradScaler 做 group sharded 配置"
" :ref:`sharding.save_group_sharded_model <cn_api_paddle_distributed_sharding_save_group_sharded_model>`", "对 group_sharded_parallel 配置后的模型和优化器状态进行保存"
" :ref:`split <cn_api_paddle_distributed_split>` ", "切分指定操作的参数到多个设备,并且并行计算得到结果"
Why is this under sharding? It should be under 自动并行 (auto parallel), shouldn't it?
LGTM
LGTM
15 paddle.distributed.get_group、paddle.distributed.group_sharded_parallel、paddle.distributed.save_group_sharded_model
16 paddle.distributed.gloo_init_parallel_env、paddle.distributed.gloo_barrier、paddle.distributed.gloo_release
17 paddle.distributed.is_initialized、paddle.distributed.is_initialized、paddle.distributed.DistAttr、paddle.distributed.dtensor_from_fn_cn、paddle.distributed.shard_optimizer
Chinese docs link: #6427
@sunzhongkai588 @Turingg