@SAKURA-CAT SAKURA-CAT commented Jul 18, 2025

This PR refines the logic of the sync feature and wraps the original resume code in a Mounter, which can be reused by both sync and init.

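Description

This PR allows the sync feature to be combined with resuming an interrupted run. Typical scenarios:

  1. The user runs in offline mode and shares logs through a network drive such as a NAS; experiment metrics can then be uploaded by running sync repeatedly.
  2. The network drops during training, and the user runs sync to upload the logs that were not uploaded.

Under the hood, both scenarios are treated as a reuse of the resume feature, and the code is implemented that way as well.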

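About testing

In addition to the regular unit tests, pure-Python sync tests and Jupyter tests were added under the project's test/sync directory; see the comments in those files for details.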

API

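From a product-design perspective, sync still creates a new experiment by default, and resuming is treated as an optional operation. To support this, we add an --id parameter, which accepts the following values:

  1. None: the default behavior, equivalent to new
  2. new: create a new experiment to complete the sync, equivalent to resume=never
  3. auto: use the experiment id configured in the log file to complete the sync, equivalent to resume=allow
  4. str: any other string is treated as an experiment id, equivalent to resume=must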
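
The mapping in the list above can be sketched as a small function. This is only an illustration of the rules as described here, not the code in this PR: the name `resolve_sync_id` and the `(resume mode, experiment id)` return shape are hypothetical.

```python
from typing import Optional, Tuple


def resolve_sync_id(id_value: Optional[str]) -> Tuple[str, Optional[str]]:
    """Map a --id value to (resume mode, explicit experiment id).

    Hypothetical helper: the name and return shape are assumptions made
    for illustration and are not taken from the SwanLab code base.
    """
    if id_value is None or id_value == "new":
        # Default: always create a new experiment.
        return "never", None
    if id_value == "auto":
        # Use the experiment id recorded in the log file.
        return "allow", None
    # Any other string is taken as the id of an existing experiment.
    return "must", id_value
```

Keeping None and new on the same branch reflects the statement that the default behaves exactly like new.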

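Notes

  • Log files from the new and old versions are incompatible: a newer (older) SwanLab version cannot sync log files written by an older (newer) version.
  • Due to current technical limitations of resume, when --id auto is used and the experiment is not new, the experiment run time is not synced (this matches the current resume logic).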

Log file compatibility table

| Log Version | SwanLab Version |
|-------------|-----------------|
| 0           | 0.6.2 ~ 0.6.7   |
| 1           | 0.6.8 ~ latest  |
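
As a rough illustration of how the table above could be used, the sketch below looks up the log version for a SwanLab release and checks whether two releases share a log format. The version ranges come from the table; the function names and the plain `major.minor.patch` parsing are assumptions, and the real check in DataStore may work differently.

```python
from typing import Dict, Optional, Tuple

Version = Tuple[int, int, int]

# Ranges copied from the compatibility table; None means "up to latest".
LOG_VERSION_RANGES: Dict[int, Tuple[Version, Optional[Version]]] = {
    0: ((0, 6, 2), (0, 6, 7)),
    1: ((0, 6, 8), None),
}


def parse_version(text: str) -> Version:
    """Parse a plain 'major.minor.patch' string (no pre-release tags)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def log_version_for(swanlab_version: str) -> Optional[int]:
    """Return the log version written by a SwanLab release, if listed."""
    version = parse_version(swanlab_version)
    for log_version, (low, high) in LOG_VERSION_RANGES.items():
        if version >= low and (high is None or version <= high):
            return log_version
    return None


def can_sync(reader_version: str, writer_version: str) -> bool:
    """True only when both releases use the same known log version."""
    reader = log_version_for(reader_version)
    writer = log_version_for(writer_version)
    return reader is not None and reader == writer
```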

The file compatibility table should be added to the official documentation for easy lookup.


closes: #1156

closes: #1136

@SAKURA-CAT SAKURA-CAT self-assigned this Jul 18, 2025
@SAKURA-CAT SAKURA-CAT added the 💪 enhancement New feature or request label Jul 18, 2025
Expanded the ValidationError exception docstring to include both backend token/api key validation failures and local log file integrity issues.
Replaced assertion with ValidationError for record checksum validation in DataStore. Updated DataPorter to handle ValidationError in parse method with a strict mode. Changed RunStore.run_colors type to Tuple. Added and refactored tests for DataStore validation, moving and expanding test coverage to test/unit/data/porter/test_datastore.py.
Relocated swanlab/data/formatter.py to swanlab/formatter.py and updated all import statements accordingly. Also moved the corresponding test file to match the new structure. This improves project organization by placing shared utilities at the root level.
Added --id and --resume options to the sync CLI command with input validation and error handling. Updated sync logic to support these options and refactored parameter names for clarity. Introduced unit tests to verify correct behavior and error cases for the new options.
Moved LogContent TypedDict from swanlab/log/type.py to swanlab/core_python/uploader/model.py for better modularity. Updated all relevant imports and usages to reference the new location, ensuring type consistency across modules.
Moved experiment mounting and sync logic in CloudPyCallback to use the new Mounter class. Added filter utility functions for metrics, columns, and epochs in swanlab.data.porter.utils, and updated DataPorter to use these for selective uploads. Added unit tests for the new filter utilities. Improved error message in DataStore for unsupported backup versions.
Assigns 'auto' to the id variable if the --resume flag is set, ensuring correct behavior when resuming a sync operation.
Refactored DataPorter to include experiment id and colors, updated parse logic to skip invalid records, and ensured experiment state is updated after synchronization. Mounter no longer handles cleanup and now sets run_colors only if not already set. The sync entrypoint now uses Mounter to set up run_store from parsed data. Updated proto models to require certain fields. Also renamed a test file for clarity and improved the UseMockRunState utility for more flexible test setup.
Replaced references to self.run_store with self._run_store in DataPorter to ensure correct attribute usage. Updated Mounter to handle None config values by defaulting to an empty dict when reverting config.
Added Jupyter notebooks and scripts under test/sync to test the synchronization feature, including run and sync workflows. Updated swanlab.sync.__init__.py to remove unused success state update after synchronization.
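
The commits above replace an assertion with ValidationError for record checksum validation and add a strict mode to the parse step. The sketch below shows that general pattern only. The `ValidationError` class here is a local stand-in, CRC32 is an assumed checksum algorithm, and the `(payload, checksum)` record shape is hypothetical; none of these are taken from the SwanLab source.

```python
import zlib
from typing import Iterable, List, Tuple


class ValidationError(Exception):
    """Local stand-in for the project's exception of the same name."""


def check_record(payload: bytes, expected_crc: int) -> bytes:
    """Raise ValidationError instead of asserting on a checksum mismatch."""
    actual = zlib.crc32(payload) & 0xFFFFFFFF
    if actual != expected_crc:
        raise ValidationError(
            f"checksum mismatch: expected {expected_crc}, got {actual}"
        )
    return payload


def parse_records(
    records: Iterable[Tuple[bytes, int]], strict: bool = True
) -> List[bytes]:
    """Validate each record; in non-strict mode, skip invalid ones."""
    valid: List[bytes] = []
    for payload, expected_crc in records:
        try:
            valid.append(check_record(payload, expected_crc))
        except ValidationError:
            if strict:
                raise
            # Non-strict mode: drop the damaged record and keep going.
    return valid
```

Unlike an assert, an explicit exception still fires when Python runs with -O, and the caller can decide whether a damaged record should abort the sync or just be skipped.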
@SAKURA-CAT SAKURA-CAT marked this pull request as ready for review July 20, 2025 11:39
@SAKURA-CAT
Member Author

@Zeyi-Lin Ready

Some unit tests have not been written yet, but end-to-end testing is already possible.

@SAKURA-CAT SAKURA-CAT requested review from Zeyi-Lin and Copilot July 20, 2025 11:40
@Copilot Copilot AI left a comment

Pull Request Overview

This PR enhances the sync functionality by introducing resumable training capabilities and refactoring the mount logic into a reusable Mounter component. The key changes enable sync to work with existing experiments and improve code reusability between sync and init operations.

  • Adds support for resuming existing experiments during sync operations with --resume and --id CLI options
  • Extracts mounting logic from the cloud callback into a new reusable Mounter class
  • Updates log file format version and improves data validation with better error handling

Reviewed Changes

Copilot reviewed 25 out of 26 changed files in this pull request and generated 6 comments.

| File | Description |
|------|-------------|
| swanlab/data/porter/mounter.py | New Mounter class that handles project/experiment mounting logic previously embedded in cloud callback |
| swanlab/cli/commands/sync/__init__.py | Enhanced sync CLI with resume functionality and parameter validation |
| swanlab/sync/__init__.py | Updated sync function to use new Mounter class and support experiment ID parameters |
| swanlab/data/porter/__init__.py | Refactored DataPorter to use filtering utilities and improved synchronization logic |
| swanlab/data/porter/datastore.py | Updated log file version and improved validation error handling |
| swanlab/toolkit/model.py | New LogContent TypedDict definition |
| test/unit/data/porter/ | Comprehensive test coverage for new porter functionality |

Removed execution outputs and metadata from run.ipynb and sync.ipynb for cleaner version control. Added a README.md to the Jupyter sync test directory. Also added a missing return type annotation to filter_epoch in porter/utils.py.
Added comments to clarify the sequence of parameter generation and experiment mounting in the Mounter class. This improves code readability and maintainability.
Refactored the sync command to remove the --resume option and replace it with an --id option that accepts 'auto' for resuming runs. Updated parameter validation and login handling, and improved experiment ID and resume mode logic in the sync implementation. Adjusted tests and scripts to use the new --id 'auto' pattern and updated error messages for clarity.
Removed the return value from DataPorter.synchronize and updated its usage in swanlab.sync. The method now directly updates the client state based on the footer, improving clarity and reducing unnecessary return value propagation.
Updated run_colors assignment in swanlab.sync.__init__ to use only the first two elements of exp.colors. Removed unused or obsolete CLI sync tests from test_cli_sync.py.
Moved run store setup logic from swanlab/sync/__init__.py to a new utility function set_run_store in swanlab/sync/sync_utils.py for better modularity and reuse. Added comprehensive unit tests for set_run_store. Removed obsolete CLI sync test.
@SAKURA-CAT SAKURA-CAT merged commit 40f04ec into main Jul 25, 2025
5 checks passed
@SAKURA-CAT SAKURA-CAT deleted the feat/sync-pro branch July 25, 2025 15:59
Development

Successfully merging this pull request may close these issues.

[REQUEST] swanlab sync pro · [BUG] Problem syncing an offline logdir with swanlab sync