@SAKURA-CAT SAKURA-CAT commented Jul 3, 2025

Description

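This PR delivers a major new feature: Resume. It lets users do something similar to resuming training from a checkpoint, continuing to upload metrics to an experiment that has already finished.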

closes: #1054

API

Resume requires very little change on the user side. Passing the following arguments to the init function is enough to resume a run:

import swanlab

run = swanlab.init(resume="allow", id="xxxx")

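print(run.id)  # a 21-character string of `a-z`, `0-9`

Here id is the cloud-side experiment identifier, which currently must be a 21-character string of a-z, 0-9. The resume parameter supports the following values: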

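| Argument | Description | When the Run ID exists | When the Run ID does not exist | Use case |
| --- | --- | --- | --- | --- |
| "must" | Requires SwanLab to resume the experiment identified by the given Run ID. | SwanLab resumes the experiment with that Run ID. | SwanLab raises an error. | Resume an experiment that already exists in the cloud, and make sure no new experiment is created automatically. |
| "allow" | Allows SwanLab to resume the experiment if the Run ID exists. | SwanLab resumes the experiment with that Run ID. | SwanLab initializes a new run with the given Run ID. | Resume an experiment without requiring the Run ID to be unique. |
| "never" | Does not allow SwanLab to resume the experiment with the given Run ID. | SwanLab raises an error. | SwanLab initializes a new experiment with the given Run ID. | Always start a new experiment and never resume an existing one. |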
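The table can be read as a small decision function. The sketch below is illustrative only: `resolve_resume`, its return values, and the exception types are hypothetical and not taken from SwanLab's code. It encodes nothing beyond the three rows above.

```python
def resolve_resume(resume: str, run_exists: bool) -> str:
    """Return "resume" or "create" for a given mode, or raise on a conflict.

    Hypothetical helper for illustration; not part of the swanlab API.
    """
    if resume not in ("must", "allow", "never"):
        raise ValueError(f"unknown resume mode: {resume!r}")
    if resume == "must":
        # "must" never creates a new experiment.
        if not run_exists:
            raise RuntimeError("resume='must' but the run ID does not exist")
        return "resume"
    if resume == "allow":
        # "allow" falls back to creating a run with the given ID.
        return "resume" if run_exists else "create"
    # resume == "never": an existing run ID is a conflict.
    if run_exists:
        raise RuntimeError("resume='never' but the run ID already exists")
    return "create"
```

For example, `resolve_resume("allow", False)` yields `"create"`, while `resolve_resume("must", False)` raises.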

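The following additional rules apply:

  1. Leaving resume unset is equivalent to never, which means id must not be passed
  2. A resume value other than never can only be set when mode=cloud
  3. resume also accepts True and False; the former is equivalent to allow, the latter to never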
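A minimal sketch of how these rules could be checked before a run starts. `normalize_resume` is a hypothetical name, and raising `ValueError` is an assumption: this description does not say whether SwanLab raises or only warns in each case. The requirement that must needs an id comes from the notes in this description.

```python
from typing import Optional, Union


def normalize_resume(
    resume: Union[str, bool, None],
    run_id: Optional[str],
    mode: str = "cloud",
) -> str:
    """Map resume to "must" / "allow" / "never" and check the rules above.

    Hypothetical helper for illustration; not part of the swanlab API.
    """
    # Rules 1 and 3: unset or False means "never", True means "allow".
    if resume is None or resume is False:
        resume = "never"
    elif resume is True:
        resume = "allow"
    if resume not in ("must", "allow", "never"):
        raise ValueError(f"invalid resume value: {resume!r}")
    # Rule 2: anything other than "never" needs cloud mode.
    if resume != "never" and mode != "cloud":
        raise ValueError("resume other than 'never' requires mode='cloud'")
    # Rule 1 (and the notes): "never" forbids id, "must" requires it.
    if resume == "never" and run_id is not None:
        raise ValueError("id must not be passed when resume is 'never'")
    if resume == "must" and run_id is None:
        raise ValueError("id is required when resume is 'must'")
    return resume
```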

Usage example

A typical example:

import swanlab


run = swanlab.init()

swanlab.log({"loss": 2, "acc": 0.4})

run.finish()

run = swanlab.init(resume="must", id=run.id)


swanlab.log({"loss": 0.2, "acc": 0.9})

loss and acc are aggregated into the same experiment:

image

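Notes

  1. If resume is must, id must be passed; if resume is never, id must not be passed
  2. id is a 21-character string of a-z, 0-9
  3. If a new process resumes an experiment while that experiment is still running, the old run stops uploading metrics, although its local logs are still recorded
  4. On resume, the experiment name, tags, and description are not updated (for now)
  5. On resume, hardware monitoring is not started (for now)
  6. On resume, environment information is not collected (for now)
  7. On resume, terminal logs are updated automatically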

SAKURA-CAT and others added 13 commits June 26, 2025 16:09
Modified Client's HTTP methods (post, put, get, patch) to return both the decoded response and the raw response object. Updated internal usage and tests to accommodate this change. Added new properties to ExperimentInfo for flag_id, config, root_proj_cuid, and root_exp_cuid.
Added new properties and improved error handling in the Client class for experiment and project mounting. The mount_exp and update_state methods now support additional parameters and more robust error reporting. Introduced comprehensive unit tests for experiment-related functionality, including edge cases for project mounting and experiment existence.
Centralized metrics upload logic into a new trace_metrics function that checks the client's pending state before uploading. Removed post_metrics from Client and updated all uploader functions to use trace_metrics, ensuring uploads are skipped if the client is pending. Updated CloudPyCallback to warn and skip state updates if pending, improving robustness against concurrent session issues.
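As a rough illustration of the guard this commit describes, the sketch below skips the upload when the client is pending. `StubClient`, its `pending` and `sent` fields, and the `trace_metrics_sketch` signature are all assumptions made for the example; the real Client and trace_metrics may look different.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class StubClient:
    """Stand-in for the HTTP client; field and method names are assumed."""

    pending: bool = False
    sent: List[Tuple[str, Any]] = field(default_factory=list)

    def post(self, url: str, data: Any) -> None:
        # Record the call instead of performing a network request.
        self.sent.append((url, data))


def trace_metrics_sketch(client: StubClient, url: str, data: Any) -> None:
    """Upload data unless the client is pending, in which case do nothing."""
    if client.pending:
        return None
    client.post(url, data)
    return None
```

Routing every uploader through one function like this keeps the pending check in a single place rather than repeating it per metric type.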
Updated cloud callback to avoid saving login info when retrieving it. Added tests to ensure that re-initializing experiments after terminal or code login does not require re-authentication.
Refactored CloudPyCallback to use a static method for client creation, supporting optional login info and improved error handling. Added comprehensive unit tests for cloud callback client creation scenarios. Minor logic update in SwanLabInitializer for reinit check. Updated related tests to cover new behaviors.
Introduced generate_run_id in namer.py to create 21-character lowercase alphanumeric run IDs, and check_run_id_format in formatter.py to validate them. Updated all callbackers to assign run_id on initialization, and enforced run_id presence in SwanLabInitializer. Added corresponding unit tests for run_id generation and validation.
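A minimal sketch of what generating and validating such an ID could look like. The function names carry a `_sketch` suffix because these are not the bodies of generate_run_id or check_run_id_format; the use of `secrets` and of `re.fullmatch` is an assumption. The character class follows the `^[a-z0-9]{21}$` pattern quoted in the review snippet for formatter.py.

```python
import re
import secrets
import string

# 21 characters drawn from lowercase letters and digits.
RUN_ID_LENGTH = 21
RUN_ID_ALPHABET = string.ascii_lowercase + string.digits
RUN_ID_PATTERN = re.compile(r"[a-z0-9]{21}")


def generate_run_id_sketch() -> str:
    """Return a random 21-character lowercase alphanumeric string."""
    return "".join(secrets.choice(RUN_ID_ALPHABET) for _ in range(RUN_ID_LENGTH))


def check_run_id_format_sketch(run_id: object) -> str:
    """Return the ID as a string if well formed, else raise ValueError."""
    run_id_str = str(run_id)
    # fullmatch rejects a trailing newline, which `$` with re.match accepts.
    if RUN_ID_PATTERN.fullmatch(run_id_str) is None:
        raise ValueError(f"id must be 21 characters of a-z, 0-9: {run_id_str!r}")
    return run_id_str
```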
Replaces the 'allow_exist' parameter with 'must_exist' in Client.mount_exp and updates related logic to enforce experiment existence when resuming. Adds 'resume' and 'run_id' parameters to the initializer, with validation and propagation through the run store. Updates tests and callback logic to reflect the new resume behavior and error handling.
Introduces a run_id property to SwanLabRun for retrieving the unique run identifier in cloud mode. Updates initialization logic to assert run_id presence, adds tests for run_id behavior across modes, and adjusts test utilities to set run_id explicitly.
Added experiment session flagId to client HTTP headers for better tracking. Refactored uploader functions to consistently return None and improved assertion message in DataPorter for metric error handling.
Moved the SwanLabKey class from exp.py to a new key.py module, improving code organization and encapsulation. Updated SwanLabExp to use the new SwanLabKey signature. Added get_class method to DataWrapper for type consistency checks. Refactored unit tests to use UseMockRunState context manager for better test isolation and reliability.
Introduced the mock_from_remote class method to SwanLabKey for creating mock key objects from remote data, primarily for resume and error marking scenarios. Refactored internal type checks to use chart_type from ColumnInfo, and added comprehensive unit tests for the new method and related behaviors.
@SAKURA-CAT SAKURA-CAT self-assigned this Jul 3, 2025
@SAKURA-CAT SAKURA-CAT added the 💪 enhancement New feature or request label Jul 3, 2025
Introduces a 'new' attribute to RunStore to indicate if an experiment is newly created or existing. Updates Client.mount_exp to return this status, propagates it through all callbackers, and adds assertions in SwanLabInitializer to ensure correct state handling. Also adds a unit test to verify experiment existence logic.

Update docstrings in SwanLabInitializer class

Clarified the descriptions for 'resume' and 'reinit' parameters in the SwanLabInitializer class docstring to improve documentation accuracy.
Added the is_system_key utility function to the hardware module's exports and included it in __all__. This allows other modules to check if a key is a system key by importing is_system_key directly from the hardware package.
Implements logic to restore experiment state from the cloud, including remote metric metadata and log epoch, when resuming an experiment. Refactors SwanLabExp to initialize keys from remote metrics, updates CloudPyCallback to fetch and parse remote metric summaries and columns, and adjusts SwanLabKey to support remote instantiation. Also updates log proxy to accept an epoch parameter and clarifies config type checks in the initializer.
Moved http.update_state to occur after session closure in CloudPyCallback. Updated test_key.py to handle new return values and parameters for SwanLabKey.mock_from_remote, and adjusted step-related assertions. Removed unused TestResumeNever class from test_main.py and added a placeholder TestResume class in test_sdk.py.
@SAKURA-CAT SAKURA-CAT marked this pull request as ready for review July 4, 2025 08:52
@SAKURA-CAT SAKURA-CAT requested review from Zeyi-Lin and Copilot July 4, 2025 08:55
@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements a “resume” feature, enabling users to continue logging to existing runs by passing resume and run_id to swanlab.init. It adds validation and storage of resume settings, synchronizes remote metrics and configs when resuming, and updates callback logic and client APIs accordingly.

  • Extend init() API with resume/run_id parameters and enforce resume rules
  • Enhance RunStore, callbackers, and Client to handle resumed runs (config, metrics, log epoch)
  • Add tests for run ID formatting, resume behavior, and mock run state

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tutils/setup.py | Allow injecting run_id into UseMockRunState |
| test/unit/data/test_sdk.py | Add tests for re-init and resume init behavior |
| test/unit/data/test_namer.py | Add tests for generate_run_id |
| test/unit/data/test_formatter.py | Add check_run_id_format tests |
| test/unit/data/run/test_main.py | Wrap SwanLabRun tests with UseMockRunState and add run_id tests |
| test/unit/data/run/test_key.py | New tests for SwanLabKey mock-from-remote |
| test/unit/data/run/test_config.py | Wrap config tests with UseMockRunState and add config format tests |
| test/unit/data/callbacker/test_cloud.py | Add cloud resume callback tests |
| test/unit/core_python/test_client.py | Extend client tests with resume and exp suite |
| swanlab/log/log.py | Add epoch parameter to start_proxy |
| swanlab/data/store.py | Extend RunStore with resume, config, metrics, log_epoch |
| swanlab/data/sdk.py | Add resume/run_id parameters and validation in init |
| swanlab/data/run/metadata/hardware/utils.py | Add is_system_key utility |
| swanlab/data/run/metadata/hardware/__init__.py | Export is_system_key |
| swanlab/data/run/main.py | Populate run_id in SwanLabRun and adjust monitoring logic |
| swanlab/data/run/key.py | Complete SwanLabKey to support remote mock columns |
| swanlab/data/run/exp.py | Init experiment with existing metrics on resume |
| swanlab/data/run/config.py | Rename __fmt_config, add revert_config |
| swanlab/data/porter/__init__.py | Clarify assert message in trace_metric |
| swanlab/data/namer.py | Add generate_run_id() |
| swanlab/data/modules/wrapper.py | Add get_class() method |
| swanlab/data/formatter.py | Add check_run_id_format() |
| swanlab/data/callbacker/offline.py | Set run_id and new on init |
| swanlab/data/callbacker/local.py | Set run_id and new on init |
| swanlab/data/callbacker/disabled.py | Set run_id and new on init |
| swanlab/data/callbacker/cloud.py | Implement resume mount logic and fetch remote data |
| swanlab/data/callbacker/callback.py | Pass run_store.log_epoch into terminal proxy |
| swanlab/core_python/uploader/upload.py | Adapt create_data and add trace_metrics |
| swanlab/core_python/client/model.py | Expose flag_id, config, root_proj_cuid, root_exp_cuid |
| swanlab/core_python/client/__init__.py | Propagate flagId, adjust post/put/get return values and mount_exp logic |
Comments suppressed due to low confidence (4)

tutils/setup.py:48

  • UseMockRunState now sets up only run_dir, media_dir, and log_dir, but misses creating console_dir and file_dir. Tests or code referencing those will fail—add os.mkdir(self.store.console_dir) and os.mkdir(self.store.file_dir).
        os.mkdir(self.store.run_dir)

swanlab/log/log.py:135

  • AtomicCounter is used here but not imported at the top of the file. Add from .atomic_counter import AtomicCounter or the correct import to avoid a NameError.
        if epoch is not None:

swanlab/data/formatter.py:179

  • re is not imported in this module. Please add import re at the top of formatter.py.
    if not re.match(r"^[a-z0-9]{21}$", run_id_str):

test/unit/core_python/test_client.py:145

  • is_skip_cloud_test is not defined in this scope. You likely meant T.is_skip_cloud_test or need to import it from tutils.
@pytest.mark.skipif(is_skip_cloud_test, reason="skip cloud test")

Replaces all occurrences of the 'run_id' parameter with 'id' in the SwanLabInitializer class and updates related documentation and logic. This change improves consistency and clarity in parameter naming.
Renamed the 'run_id' property to 'id' in SwanLabRun for consistency. Added run id format validation using check_run_id_format in SwanLabInitializer. Updated error message in check_run_id_format for clarity.
Zeyi-Lin commented Jul 4, 2025

LGTM

Changed the expected error message in test assertions from 'run_id' to 'id' to match updated exception messages in run ID format validation tests.
Replaces references to run.run_id with run.id in test_cloud.py and test_main.py to align with updated attribute naming. Ensures tests use the correct property for run identification.
@SAKURA-CAT SAKURA-CAT changed the title Fearure/resume Featrure/resume Jul 4, 2025
SAKURA-CAT and others added 6 commits July 4, 2025 21:45
Introduces RUN_ID and RESUME environment variables to SwanLabEnv and updates SwanLabInitializer to load them from the environment. Also fixes potential issues with summary parsing in cloud callback and always updates experiment state on close.
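One way such an environment fallback could work is sketched below. The variable names `SWANLAB_RUN_ID` and `SWANLAB_RESUME` are assumptions (the commit message only says RUN_ID and RESUME were added to SwanLabEnv), and so is the precedence rule that explicit arguments win over the environment. `load_resume_from_env` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional, Tuple

# Assumed variable names; check SwanLabEnv for the real ones.
ENV_RUN_ID = "SWANLAB_RUN_ID"
ENV_RESUME = "SWANLAB_RESUME"


def load_resume_from_env(
    resume: Optional[str],
    run_id: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], Optional[str]]:
    """Fill in resume and run_id from the environment when not given."""
    # Assumption: an explicitly passed argument takes precedence.
    if resume is None:
        resume = env.get(ENV_RESUME)
    if run_id is None:
        run_id = env.get(ENV_RUN_ID)
    return resume, run_id
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.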
Corrects the logic for determining proj_id by using root_proj_cuid instead of root_exp_cuid, ensuring data is uploaded to the correct project.
Introduces test cases for the 'allow' and 'must' resume modes in swanlab. The tests verify correct behavior when resuming runs, handling duplicate steps, and error states for both modes.
Expanded test coverage for the resume feature in various modes (never, allow, must) in test_sdk.py, including parameter validation and error handling. Added time delays in resume tests to simulate real-world scenarios. Updated comments in key.py for clarity on step handling after resume.
Updated ValueError messages in the Client class to provide clearer information when resuming cloned experiments or when experiment-project mismatches occur. This enhances clarity for users encountering these errors.
Merged the split ValueError message into a single string when raising an error for cloned experiments that cannot be resumed.
@SAKURA-CAT SAKURA-CAT merged commit 08f0545 into main Jul 5, 2025
5 checks passed
@SAKURA-CAT SAKURA-CAT deleted the fearure/resume branch July 5, 2025 10:51
Development

Successfully merging this pull request may close these issues.

[REQUEST] resume 功能 (resume feature)
2 participants