feature: sync pro #1194
9e7712f to 1b6046e
Expanded the ValidationError exception docstring to include both backend token/api key validation failures and local log file integrity issues.
Replaced assertion with ValidationError for record checksum validation in DataStore. Updated DataPorter to handle ValidationError in parse method with a strict mode. Changed RunStore.run_colors type to Tuple. Added and refactored tests for DataStore validation, moving and expanding test coverage to test/unit/data/porter/test_datastore.py.
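The change from an assertion to a raised exception, combined with a strict mode in the parser, can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the record layout, the SHA-256 checksum, the helper names `encode_record` and `decode_record`, and the exact meaning of `strict` (re-raise when true, skip the record when false) are all assumptions, and the `ValidationError` class here is a local stand-in for the project's exception.

```python
import hashlib
import json


class ValidationError(Exception):
    """Local stand-in for the project's ValidationError (assumption)."""


def encode_record(payload: dict) -> str:
    """Serialize a payload together with its checksum (hypothetical record layout)."""
    body = json.dumps(payload, sort_keys=True)
    checksum = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"checksum": checksum, "body": body})


def decode_record(line: str) -> dict:
    """Return the payload; raise ValidationError instead of asserting on a mismatch."""
    record = json.loads(line)
    expected = hashlib.sha256(record["body"].encode("utf-8")).hexdigest()
    if record["checksum"] != expected:
        raise ValidationError("record checksum mismatch")
    return json.loads(record["body"])


def parse(lines, strict=True):
    """Decode every record; in non-strict mode, skip records that fail validation."""
    parsed = []
    for line in lines:
        try:
            parsed.append(decode_record(line))
        except ValidationError:
            if strict:
                raise  # strict mode: a corrupt record aborts the whole parse
    return parsed
```

Unlike `assert`, an explicit exception still fires when Python runs with `-O`, and the caller can decide whether one corrupt record should abort the sync.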
Relocated swanlab/data/formatter.py to swanlab/formatter.py and updated all import statements accordingly. Also moved the corresponding test file to match the new structure. This improves project organization by placing shared utilities at the root level.
Added --id and --resume options to the sync CLI command with input validation and error handling. Updated sync logic to support these options and refactored parameter names for clarity. Introduced unit tests to verify correct behavior and error cases for the new options.
Moved LogContent TypedDict from swanlab/log/type.py to swanlab/core_python/uploader/model.py for better modularity. Updated all relevant imports and usages to reference the new location, ensuring type consistency across modules.
Moved experiment mounting and sync logic in CloudPyCallback to use the new Mounter class. Added filter utility functions for metrics, columns, and epochs in swanlab.data.porter.utils, and updated DataPorter to use these for selective uploads. Added unit tests for the new filter utilities. Improved error message in DataStore for unsupported backup versions.
Assigns 'auto' to the id variable if the --resume flag is set, ensuring correct behavior when resuming a sync operation.
Refactored DataPorter to include experiment id and colors, updated parse logic to skip invalid records, and ensured experiment state is updated after synchronization. Mounter no longer handles cleanup and now sets run_colors only if not already set. The sync entrypoint now uses Mounter to set up run_store from parsed data. Updated proto models to require certain fields. Also renamed a test file for clarity and improved the UseMockRunState utility for more flexible test setup.
8c10ede to d6f94d9
Replaced references to self.run_store with self._run_store in DataPorter to ensure correct attribute usage. Updated Mounter to handle None config values by defaulting to an empty dict when reverting config.
Added Jupyter notebooks and scripts under test/sync to test the synchronization feature, including run and sync workflows. Updated swanlab.sync.__init__.py to remove unused success state update after synchronization.
@Zeyi-Lin Ready
Pull Request Overview
This PR enhances the sync functionality by introducing resumable training capabilities and refactoring the mount logic into a reusable Mounter component. The key changes enable sync to work with existing experiments and improve code reusability between sync and init operations.
- Adds support for resuming existing experiments during sync operations with --resume and --id CLI options
- Extracts mounting logic from the cloud callback into a new reusable Mounter class
- Updates log file format version and improves data validation with better error handling
Reviewed Changes
Copilot reviewed 25 out of 26 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| swanlab/data/porter/mounter.py | New Mounter class that handles project/experiment mounting logic previously embedded in cloud callback |
| swanlab/cli/commands/sync/__init__.py | Enhanced sync CLI with resume functionality and parameter validation |
| swanlab/sync/__init__.py | Updated sync function to use new Mounter class and support experiment ID parameters |
| swanlab/data/porter/__init__.py | Refactored DataPorter to use filtering utilities and improved synchronization logic |
| swanlab/data/porter/datastore.py | Updated log file version and improved validation error handling |
| swanlab/toolkit/model.py | New LogContent TypedDict definition |
| test/unit/data/porter/ | Comprehensive test coverage for new porter functionality |
Removed execution outputs and metadata from run.ipynb and sync.ipynb for cleaner version control. Added a README.md to the Jupyter sync test directory. Also added a missing return type annotation to filter_epoch in porter/utils.py.
Added comments to clarify the sequence of parameter generation and experiment mounting in the Mounter class. This improves code readability and maintainability.
Refactored the sync command to remove the --resume option and replace it with an --id option that accepts 'auto' for resuming runs. Updated parameter validation and login handling, and improved experiment ID and resume mode logic in the sync implementation. Adjusted tests and scripts to use the new --id 'auto' pattern and updated error messages for clarity.
Removed the return value from DataPorter.synchronize and updated its usage in swanlab.sync. The method now directly updates the client state based on the footer, improving clarity and reducing unnecessary return value propagation.
Updated run_colors assignment in swanlab.sync.__init__ to use only the first two elements of exp.colors. Removed unused or obsolete CLI sync tests from test_cli_sync.py.
Moved run store setup logic from swanlab/sync/__init__.py to a new utility function set_run_store in swanlab/sync/sync_utils.py for better modularity and reuse. Added comprehensive unit tests for set_run_store. Removed obsolete CLI sync test.
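This PR refines the logic of the sync feature and wraps the existing resume code in a Mounter, which can be reused by both sync and init.
Description
This PR allows the sync feature to be used together with resumable training. A typical scenario is:
Under the hood, the scenario above is treated as a reuse of the resume feature, and the code implements it the same way.
About testing
In addition to the regular unit tests, pure-Python sync tests and Jupyter tests have been added under the project's test/sync directory. See the comments in those files for details.
API
From a product design perspective, sync still creates a new experiment by default, and resuming is treated as an optional operation. For this we add an --id parameter, which accepts the following values:
- None: the default behavior, equivalent to new
- new: create a new experiment to complete the sync, equivalent to resume=never
- auto: use the experiment id configured in the log file to complete the sync, equivalent to resume=allow
- str: any other string is treated as an experiment id, equivalent to resume=must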
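The mapping from --id values to resume modes listed above can be sketched as a small function. This is an illustrative sketch only: the function name `resolve_resume` and its return shape are hypothetical and are not taken from the PR's implementation.

```python
from typing import Optional, Tuple


def resolve_resume(id_value: Optional[str]) -> Tuple[str, Optional[str]]:
    """Map an --id value to (resume mode, explicit experiment id).

    Hypothetical helper: mirrors the table in the PR description, not the real code.
    """
    if id_value is None or id_value == "new":
        # Default behavior: always create a new experiment.
        return "never", None
    if id_value == "auto":
        # Use the experiment id stored in the log file; no explicit id is passed.
        return "allow", None
    # Any other string is taken as the id of an existing experiment.
    return "must", id_value
```

For example, `resolve_resume("auto")` yields `("allow", None)`, while a concrete id such as `resolve_resume("abc123")` yields `("must", "abc123")`.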
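Notes
When --id is auto and the experiment is not new, the experiment's run time is not synchronized (this is consistent with the current resume logic).
Log file compatibility table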
closes: #1156
closes: #1136