Featrure/resume #1141

SAKURA-CAT · 2025-07-03T04:02:23Z

Description

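This PR delivers a major new feature: Resume. It lets users do something similar to resuming training from a checkpoint, continuing to upload metrics to an experiment that has already finished.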

closes: #1054

API

For users, Resume requires very little change: passing the following parameters to the init function is enough to resume a run:

import swanlab

run = swanlab.init(resume="allow", id="xxxx")

print(run.id)  # a 21-character string of a-z and 0-9

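Here id is the cloud-side experiment identifier, which currently must be a 21-character string of a-z and 0-9. The resume parameter accepts the following values:

Argument	Description	When the Run ID exists	When the Run ID does not exist	Use case
"must"	Requires SwanLab to resume the experiment identified by the given Run ID.	SwanLab resumes the experiment with that Run ID.	SwanLab raises an error.	Resume an experiment that already exists in the cloud, and guarantee that no new experiment is created automatically.
"allow"	Allows SwanLab to resume the experiment if the Run ID exists.	SwanLab resumes the experiment with that Run ID.	SwanLab initializes a new run with the given Run ID.	Resume an experiment without requiring the Run ID to be unique.
"never"	Forbids SwanLab from resuming the experiment with the given Run ID.	SwanLab raises an error.	SwanLab initializes a new experiment with the given Run ID.	Guarantee that a new experiment is always started instead of resuming an existing one.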
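The table above can be read as a small decision function. The sketch below is only an illustration of that table: the name `resolve_resume`, the returned strings, and the exception types are hypothetical and are not taken from the SwanLab code base.

```python
from typing import Literal

ResumeMode = Literal["must", "allow", "never"]


def resolve_resume(mode: ResumeMode, run_exists: bool) -> str:
    """Return "resume" or "create" following the table, or raise on a conflict.

    Hypothetical helper for illustration; not part of the swanlab API.
    """
    if mode == "must":
        if not run_exists:
            # "must" never creates a new experiment.
            raise RuntimeError("resume='must' but the run ID does not exist")
        return "resume"
    if mode == "allow":
        # "allow" resumes when possible and otherwise creates the run.
        return "resume" if run_exists else "create"
    if mode == "never":
        if run_exists:
            # "never" refuses to touch an existing experiment.
            raise RuntimeError("resume='never' but the run ID already exists")
        return "create"
    raise ValueError(f"unknown resume mode: {mode!r}")
```

Only "allow" succeeds in both columns; "must" and "never" each turn one column into an error.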

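The following additional rules apply:

When resume is not set, it is equivalent to never, which means id must not be passed
A resume value other than never can only be set when mode=cloud
resume also accepts True and False; the former is equivalent to allow and the latter to never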
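A minimal sketch of how these rules could be checked before a run starts. The function name `normalize_resume` and the choice to raise `ValueError` are assumptions made for illustration; the SDK's own validation may be structured differently or react in another way (for example by warning instead of raising). The check that "must" needs an id comes from the notes further down in this description.

```python
from typing import Optional, Union


def normalize_resume(resume: Union[str, bool, None], run_id: Optional[str], mode: str) -> str:
    """Map resume to "must" / "allow" / "never" and enforce the stated rules.

    Hypothetical helper for illustration; not part of the swanlab API.
    """
    # Unset and False both mean "never"; True means "allow".
    if resume is None or resume is False:
        resume = "never"
    elif resume is True:
        resume = "allow"
    if resume not in ("must", "allow", "never"):
        raise ValueError(f"invalid resume value: {resume!r}")
    # Anything other than "never" is only meaningful in cloud mode.
    if resume != "never" and mode != "cloud":
        raise ValueError("resume other than 'never' requires mode='cloud'")
    # With "never" (including the default), id must not be passed.
    if resume == "never" and run_id is not None:
        raise ValueError("id must not be passed when resume is 'never'")
    # With "must", id is required (see the notes section).
    if resume == "must" and run_id is None:
        raise ValueError("id is required when resume is 'must'")
    return resume
```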

Usage example

A typical code example:

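import swanlab

# First run: a new experiment is created and gets its own id
run = swanlab.init()

swanlab.log({"loss": 2, "acc": 0.4})

run.finish()

# Second run: resume the same experiment by id; "must" fails if it does not exist
run = swanlab.init(resume="must", id=run.id)

swanlab.log({"loss": 0.2, "acc": 0.9})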

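loss and acc are then aggregated in the same experiment.

Notes

If resume is must, id must be passed; if resume is never, id must not be passed
id is a 21-character string of a-z and 0-9
If a new process resumes an experiment while it is still running, the old run stops uploading metrics, but its local logs are still recorded
On resume, the experiment name, tags, and description are not updated (for now)
On resume, hardware monitoring is not started (for now)
On resume, environment information is not collected (for now)
On resume, terminal logs are updated automatically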

Modified Client's HTTP methods (post, put, get, patch) to return both the decoded response and the raw response object. Updated internal usage and tests to accommodate this change. Added new properties to ExperimentInfo for flag_id, config, root_proj_cuid, and root_exp_cuid.
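As a rough picture of the new return shape, the sketch below returns a `(decoded, raw)` pair from a request helper. `RawResponse`, `decode`, and `get_sketch` are stand-ins invented for this example; the real Client wraps an actual HTTP session and its decoding rules are not shown in this PR description.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class RawResponse:
    """Stand-in for an HTTP response object (hypothetical)."""
    status_code: int
    text: str
    headers: Dict[str, str]


def decode(resp: RawResponse) -> Any:
    """Parse JSON bodies, otherwise hand back the text unchanged."""
    if "application/json" in resp.headers.get("Content-Type", ""):
        return json.loads(resp.text)
    return resp.text


def get_sketch(resp: RawResponse) -> Tuple[Any, RawResponse]:
    """Return both the decoded payload and the raw response."""
    return decode(resp), resp
```

Returning the pair lets callers keep using the decoded payload while still being able to inspect details such as the status code on the raw object.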

Added new properties and improved error handling in the Client class for experiment and project mounting. The mount_exp and update_state methods now support additional parameters and more robust error reporting. Introduced comprehensive unit tests for experiment-related functionality, including edge cases for project mounting and experiment existence.

Centralized metrics upload logic into a new trace_metrics function that checks the client's pending state before uploading. Removed post_metrics from Client and updated all uploader functions to use trace_metrics, ensuring uploads are skipped if the client is pending. Updated CloudPyCallback to warn and skip state updates if pending, improving robustness against concurrent session issues.
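The guard described here can be sketched as follows. `FakeClient` and `trace_metrics_sketch` are illustrative names, and the signature is an assumption; only the idea of checking the client's pending state before uploading comes from the commit message.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeClient:
    """Stand-in client that records uploads instead of sending them."""
    pending: bool = False
    sent: List[Dict[str, Any]] = field(default_factory=list)

    def post(self, url: str, data: Dict[str, Any]) -> None:
        self.sent.append({"url": url, "data": data})


def trace_metrics_sketch(client: FakeClient, url: str, data: Dict[str, Any]) -> None:
    """Upload metrics unless the client is pending; always returns None."""
    if client.pending:
        # Another session has taken over this experiment: skip the upload.
        return None
    client.post(url, data)
    return None
```

Routing every uploader through one function means the pending check lives in a single place rather than being repeated per metric type.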

Updated cloud callback to avoid saving login info when retrieving it. Added tests to ensure that re-initializing experiments after terminal or code login does not require re-authentication.

Refactored CloudPyCallback to use a static method for client creation, supporting optional login info and improved error handling. Added comprehensive unit tests for cloud callback client creation scenarios. Minor logic update in SwanLabInitializer for reinit check. Updated related tests to cover new behaviors.

Introduced generate_run_id in namer.py to create 21-character lowercase alphanumeric run IDs, and check_run_id_format in formatter.py to validate them. Updated all callbackers to assign run_id on initialization, and enforced run_id presence in SwanLabInitializer. Added corresponding unit tests for run_id generation and validation.
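A minimal sketch of what generating and validating such an ID might look like. The `_sketch` function names are deliberately different from the real `generate_run_id` and `check_run_id_format`, whose implementations are not shown here; the use of `secrets` is an assumption, while the regular expression matches the one quoted in the review further down.

```python
import re
import secrets
import string

# 21 characters drawn from lowercase letters and digits.
_RUN_ID_ALPHABET = string.ascii_lowercase + string.digits
_RUN_ID_PATTERN = re.compile(r"^[a-z0-9]{21}$")


def generate_run_id_sketch() -> str:
    """Create a random 21-character lowercase alphanumeric ID."""
    return "".join(secrets.choice(_RUN_ID_ALPHABET) for _ in range(21))


def check_run_id_format_sketch(run_id: str) -> str:
    """Return run_id unchanged if it is well formed, otherwise raise ValueError."""
    # fullmatch avoids accepting a trailing newline, which "$" alone would allow.
    if not isinstance(run_id, str) or _RUN_ID_PATTERN.fullmatch(run_id) is None:
        raise ValueError("id must be a 21-character string of a-z and 0-9")
    return run_id
```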

Replaces the 'allow_exist' parameter with 'must_exist' in Client.mount_exp and updates related logic to enforce experiment existence when resuming. Adds 'resume' and 'run_id' parameters to the initializer, with validation and propagation through the run store. Updates tests and callback logic to reflect the new resume behavior and error handling.

Introduces a run_id property to SwanLabRun for retrieving the unique run identifier in cloud mode. Updates initialization logic to assert run_id presence, adds tests for run_id behavior across modes, and adjusts test utilities to set run_id explicitly.

Added experiment session flagId to client HTTP headers for better tracking. Refactored uploader functions to consistently return None and improved assertion message in DataPorter for metric error handling.

Moved the SwanLabKey class from exp.py to a new key.py module, improving code organization and encapsulation. Updated SwanLabExp to use the new SwanLabKey signature. Added get_class method to DataWrapper for type consistency checks. Refactored unit tests to use UseMockRunState context manager for better test isolation and reliability.

Introduced the mock_from_remote class method to SwanLabKey for creating mock key objects from remote data, primarily for resume and error marking scenarios. Refactored internal type checks to use chart_type from ColumnInfo, and added comprehensive unit tests for the new method and related behaviors.

Introduces a 'new' attribute to RunStore to indicate if an experiment is newly created or existing. Updates Client.mount_exp to return this status, propagates it through all callbackers, and adds assertions in SwanLabInitializer to ensure correct state handling. Also adds a unit test to verify experiment existence logic.

Update docstrings in SwanLabInitializer class. Clarified the descriptions for 'resume' and 'reinit' parameters in the SwanLabInitializer class docstring to improve documentation accuracy.

Added the is_system_key utility function to the hardware module's exports and included it in __all__. This allows other modules to check if a key is a system key by importing is_system_key directly from the hardware package.

Implements logic to restore experiment state from the cloud, including remote metric metadata and log epoch, when resuming an experiment. Refactors SwanLabExp to initialize keys from remote metrics, updates CloudPyCallback to fetch and parse remote metric summaries and columns, and adjusts SwanLabKey to support remote instantiation. Also updates log proxy to accept an epoch parameter and clarifies config type checks in the initializer.
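One way to picture the restore step: if the remote summary reports the latest step per metric key, the resumed run can continue counting from there. The sketch below assumes a summary shaped as a list of `{"key": ..., "step": ...}` records; that shape and the name `restore_next_steps` are assumptions for illustration, not the actual cloud payload or the SwanLabKey logic.

```python
from typing import Dict, Iterable, Mapping


def restore_next_steps(remote_summaries: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Map each metric key to the next step to use after resuming.

    Assumes each record holds a metric "key" and its latest recorded "step".
    """
    next_steps: Dict[str, int] = {}
    for item in remote_summaries:
        key = str(item["key"])
        latest = int(item["step"])  # type: ignore[arg-type]
        # Continue one past the highest step already stored remotely.
        next_steps[key] = max(next_steps.get(key, 0), latest + 1)
    return next_steps
```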

Moved http.update_state to occur after session closure in CloudPyCallback. Updated test_key.py to handle new return values and parameters for SwanLabKey.mock_from_remote, and adjusted step-related assertions. Removed unused TestResumeNever class from test_main.py and added a placeholder TestResume class in test_sdk.py.

Copilot

Pull Request Overview

This PR implements a “resume” feature, enabling users to continue logging to existing runs by passing resume and run_id to swanlab.init. It adds validation and storage of resume settings, synchronizes remote metrics and configs when resuming, and updates callback logic and client APIs accordingly.

Extend init() API with resume/run_id parameters and enforce resume rules
Enhance RunStore, callbackers, and Client to handle resumed runs (config, metrics, log epoch)
Add tests for run ID formatting, resume behavior, and mock run state

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
tutils/setup.py	Allow injecting `run_id` into `UseMockRunState`
test/unit/data/test_sdk.py	Add tests for re-init and resume init behavior
test/unit/data/test_namer.py	Add tests for `generate_run_id`
test/unit/data/test_formatter.py	Add `check_run_id_format` tests
test/unit/data/run/test_main.py	Wrap `SwanLabRun` tests with `UseMockRunState` and add run_id tests
test/unit/data/run/test_key.py	New tests for `SwanLabKey` mock-from-remote
test/unit/data/run/test_config.py	Wrap config tests with `UseMockRunState` and add config format tests
test/unit/data/callbacker/test_cloud.py	Add cloud resume callback tests
test/unit/core_python/test_client.py	Extend client tests with resume and exp suite
swanlab/log/log.py	Add `epoch` parameter to `start_proxy`
swanlab/data/store.py	Extend `RunStore` with resume, config, metrics, log_epoch
swanlab/data/sdk.py	Add `resume`/`run_id` parameters and validation in `init`
swanlab/data/run/metadata/hardware/utils.py	Add `is_system_key` utility
swanlab/data/run/metadata/hardware/__init__.py	Export `is_system_key`
swanlab/data/run/main.py	Populate `run_id` in `SwanLabRun` and adjust monitoring logic
swanlab/data/run/key.py	Complete `SwanLabKey` to support remote mock columns
swanlab/data/run/exp.py	Init experiment with existing metrics on resume
swanlab/data/run/config.py	Rename `__fmt_config`, add `revert_config`
swanlab/data/porter/__init__.py	Clarify assert message in `trace_metric`
swanlab/data/namer.py	Add `generate_run_id()`
swanlab/data/modules/wrapper.py	Add `get_class()` method
swanlab/data/formatter.py	Add `check_run_id_format()`
swanlab/data/callbacker/offline.py	Set `run_id` and `new` on init
swanlab/data/callbacker/local.py	Set `run_id` and `new` on init
swanlab/data/callbacker/disabled.py	Set `run_id` and `new` on init
swanlab/data/callbacker/cloud.py	Implement resume mount logic and fetch remote data
swanlab/data/callbacker/callback.py	Pass `run_store.log_epoch` into terminal proxy
swanlab/core_python/uploader/upload.py	Adapt `create_data` and add `trace_metrics`
swanlab/core_python/client/model.py	Expose `flag_id`, `config`, `root_proj_cuid`, `root_exp_cuid`
swanlab/core_python/client/__init__.py	Propagate `flagId`, adjust `post/put/get` return values and `mount_exp` logic

Comments suppressed due to low confidence (4)

tutils/setup.py:48

UseMockRunState now sets up only run_dir, media_dir, and log_dir, but misses creating console_dir and file_dir. Tests or code referencing those will fail—add os.mkdir(self.store.console_dir) and os.mkdir(self.store.file_dir).

        os.mkdir(self.store.run_dir)

swanlab/log/log.py:135

AtomicCounter is used here but not imported at the top of the file. Add from .atomic_counter import AtomicCounter or the correct import to avoid a NameError.

        if epoch is not None:

swanlab/data/formatter.py:179

re is not imported in this module. Please add import re at the top of formatter.py.

    if not re.match(r"^[a-z0-9]{21}$", run_id_str):

test/unit/core_python/test_client.py:145

is_skip_cloud_test is not defined in this scope. You likely meant T.is_skip_cloud_test or need to import it from tutils.

@pytest.mark.skipif(is_skip_cloud_test, reason="skip cloud test")

Replaces all occurrences of the 'run_id' parameter with 'id' in the SwanLabInitializer class and updates related documentation and logic. This change improves consistency and clarity in parameter naming.

Renamed the 'run_id' property to 'id' in SwanLabRun for consistency. Added run id format validation using check_run_id_format in SwanLabInitializer. Updated error message in check_run_id_format for clarity.

Zeyi-Lin · 2025-07-04T09:18:32Z

LGTM

Changed the expected error message in test assertions from 'run_id' to 'id' to match updated exception messages in run ID format validation tests.

Replaces references to run.run_id with run.id in test_cloud.py and test_main.py to align with updated attribute naming. Ensures tests use the correct property for run identification.

Introduces RUN_ID and RESUME environment variables to SwanLabEnv and updates SwanLabInitializer to load them from the environment. Also fixes potential issues with summary parsing in cloud callback and always updates experiment state on close.
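A sketch of how an initializer might fall back to the environment when the arguments are not given. The variable names `SWANLAB_RUN_ID` and `SWANLAB_RESUME` and the precedence (explicit arguments win over the environment) are assumptions made for this example; the commit message only says that RUN_ID and RESUME entries were added to SwanLabEnv.

```python
import os
from typing import Mapping, Optional, Tuple


def load_resume_settings(
    resume: Optional[str] = None,
    run_id: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], Optional[str]]:
    """Fill in resume / run_id from the environment when not passed explicitly.

    The environment variable names below are hypothetical.
    """
    if resume is None:
        resume = env.get("SWANLAB_RESUME")
    if run_id is None:
        run_id = env.get("SWANLAB_RUN_ID")
    return resume, run_id
```

Reading the settings from the environment would let a launcher script resume a run without editing the training code.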

Corrects the logic for determining proj_id by using root_proj_cuid instead of root_exp_cuid, ensuring data is uploaded to the correct project.

Introduces test cases for the 'allow' and 'must' resume modes in swanlab. The tests verify correct behavior when resuming runs, handling duplicate steps, and error states for both modes.

Expanded test coverage for the resume feature in various modes (never, allow, must) in test_sdk.py, including parameter validation and error handling. Added time delays in resume tests to simulate real-world scenarios. Updated comments in key.py for clarity on step handling after resume.

Updated ValueError messages in the Client class to provide clearer information when resuming cloned experiments or when experiment-project mismatches occur. This enhances clarity for users encountering these errors.

Merged the split ValueError message into a single string when raising an error for cloned experiments that cannot be resumed.

SAKURA-CAT and others added 13 commits June 26, 2025 16:09

tmp

c9523c6

Fix login info handling and add re-init tests

cc98a8e

tmp stash

1d25951

Refactor upload functions and add flagId to session headers

3a4ab9a

SAKURA-CAT self-assigned this Jul 3, 2025

SAKURA-CAT added the 💪 enhancement New feature or request label Jul 3, 2025

SAKURA-CAT force-pushed the fearure/resume branch from 282cc1b to 9a3c8cc Compare July 3, 2025 06:16

SAKURA-CAT added 3 commits July 4, 2025 16:13

Export is_system_key and add to __all__ in hardware module

4ca178c

SAKURA-CAT marked this pull request as ready for review July 4, 2025 08:52

SAKURA-CAT requested review from Zeyi-Lin and Copilot July 4, 2025 08:55

Copilot AI reviewed Jul 4, 2025

SAKURA-CAT added 2 commits July 4, 2025 17:01

Rename run_id parameter to id in SwanLabInitializer

c638a31

Refactor run_id to id and validate run id format

2d8bf09

Zeyi-Lin approved these changes Jul 4, 2025

SAKURA-CAT added 2 commits July 4, 2025 17:19

Update error message regex in run ID format tests

26e52a7

Update usage of run_id to run.id in tests

a728546

SAKURA-CAT changed the title ~~Fearure/resume~~ Featrure/resume Jul 4, 2025

SAKURA-CAT and others added 6 commits July 4, 2025 21:45

Fix project ID assignment in create_data function

b2d81f4

Add tests for 'allow' and 'must' resume modes

333ebe4

Improve error messages for experiment resume failures

0a12b8d

Combine ValueError message for cloned experiment

7eb8216

SAKURA-CAT merged commit 08f0545 into main Jul 5, 2025
5 checks passed

SAKURA-CAT deleted the fearure/resume branch July 5, 2025 10:51
