Feature/resume #1141
Modified Client's HTTP methods (post, put, get, patch) to return both the decoded response and the raw response object. Updated internal usage and tests to accommodate this change. Added new properties to ExperimentInfo for flag_id, config, root_proj_cuid, and root_exp_cuid.
Added new properties and improved error handling in the Client class for experiment and project mounting. The mount_exp and update_state methods now support additional parameters and more robust error reporting. Introduced comprehensive unit tests for experiment-related functionality, including edge cases for project mounting and experiment existence.
Centralized metrics upload logic into a new trace_metrics function that checks the client's pending state before uploading. Removed post_metrics from Client and updated all uploader functions to use trace_metrics, ensuring uploads are skipped if the client is pending. Updated CloudPyCallback to warn and skip state updates if pending, improving robustness against concurrent session issues.
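A minimal sketch of the gating described in this commit, assuming the client exposes a boolean `pending` attribute. `FakeClient` and the `upload` callback are hypothetical stand-ins for illustration; only the name `trace_metrics` and the skip-when-pending behavior come from the commit message.

```python
from typing import Callable, List


class FakeClient:
    """Hypothetical stand-in for the real client; only `pending` matters here."""

    def __init__(self, pending: bool = False) -> None:
        self.pending = pending


def trace_metrics(
    client: FakeClient,
    metrics: List[dict],
    upload: Callable[[List[dict]], None],
) -> None:
    """Upload metrics unless the client is pending (assumed semantics)."""
    if client.pending:
        # Skip the upload entirely while the client is pending.
        return None
    upload(metrics)
    return None
```

Routing every uploader through one function like this keeps the pending check in a single place instead of repeating it per uploader.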
Updated cloud callback to avoid saving login info when retrieving it. Added tests to ensure that re-initializing experiments after terminal or code login does not require re-authentication.
Refactored CloudPyCallback to use a static method for client creation, supporting optional login info and improved error handling. Added comprehensive unit tests for cloud callback client creation scenarios. Minor logic update in SwanLabInitializer for reinit check. Updated related tests to cover new behaviors.
Introduced generate_run_id in namer.py to create 21-character lowercase alphanumeric run IDs, and check_run_id_format in formatter.py to validate them. Updated all callbackers to assign run_id on initialization, and enforced run_id presence in SwanLabInitializer. Added corresponding unit tests for run_id generation and validation.
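As a rough illustration of the ID scheme in this commit, the sketch below generates and validates a 21-character lowercase alphanumeric ID. The function bodies are assumptions, not the code in namer.py or formatter.py: only the two function names, the length, and the alphabet come from the commit message. The use of `secrets` and returning the validated string are choices made for this example.

```python
import re
import secrets
import string

RUN_ID_LENGTH = 21
RUN_ID_ALPHABET = string.ascii_lowercase + string.digits
RUN_ID_PATTERN = re.compile(r"[a-z0-9]{21}")


def generate_run_id() -> str:
    """Return a random 21-character ID drawn from a-z and 0-9."""
    return "".join(secrets.choice(RUN_ID_ALPHABET) for _ in range(RUN_ID_LENGTH))


def check_run_id_format(run_id) -> str:
    """Return the ID as a string if it is valid, otherwise raise ValueError."""
    run_id_str = str(run_id)
    # fullmatch requires the whole string to match, so a trailing newline fails.
    if RUN_ID_PATTERN.fullmatch(run_id_str) is None:
        raise ValueError(f"id must be 21 characters of a-z, 0-9, got {run_id_str!r}")
    return run_id_str
```

A generated ID should always pass validation, for example `check_run_id_format(generate_run_id())`.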
Replaces the 'allow_exist' parameter with 'must_exist' in Client.mount_exp and updates related logic to enforce experiment existence when resuming. Adds 'resume' and 'run_id' parameters to the initializer, with validation and propagation through the run store. Updates tests and callback logic to reflect the new resume behavior and error handling.
Introduces a run_id property to SwanLabRun for retrieving the unique run identifier in cloud mode. Updates initialization logic to assert run_id presence, adds tests for run_id behavior across modes, and adjusts test utilities to set run_id explicitly.
Added experiment session flagId to client HTTP headers for better tracking. Refactored uploader functions to consistently return None and improved assertion message in DataPorter for metric error handling.
Moved the SwanLabKey class from exp.py to a new key.py module, improving code organization and encapsulation. Updated SwanLabExp to use the new SwanLabKey signature. Added get_class method to DataWrapper for type consistency checks. Refactored unit tests to use UseMockRunState context manager for better test isolation and reliability.
Introduced the mock_from_remote class method to SwanLabKey for creating mock key objects from remote data, primarily for resume and error marking scenarios. Refactored internal type checks to use chart_type from ColumnInfo, and added comprehensive unit tests for the new method and related behaviors.
Introduces a 'new' attribute to RunStore to indicate if an experiment is newly created or existing. Updates Client.mount_exp to return this status, propagates it through all callbackers, and adds assertions in SwanLabInitializer to ensure correct state handling. Also adds a unit test to verify experiment existence logic.
Update docstrings in SwanLabInitializer class. Clarified the descriptions for 'resume' and 'reinit' parameters in the SwanLabInitializer class docstring to improve documentation accuracy.
282cc1b to 9a3c8cc
Added the is_system_key utility function to the hardware module's exports and included it in __all__. This allows other modules to check if a key is a system key by importing is_system_key directly from the hardware package.
Implements logic to restore experiment state from the cloud, including remote metric metadata and log epoch, when resuming an experiment. Refactors SwanLabExp to initialize keys from remote metrics, updates CloudPyCallback to fetch and parse remote metric summaries and columns, and adjusts SwanLabKey to support remote instantiation. Also updates log proxy to accept an epoch parameter and clarifies config type checks in the initializer.
Moved http.update_state to occur after session closure in CloudPyCallback. Updated test_key.py to handle new return values and parameters for SwanLabKey.mock_from_remote, and adjusted step-related assertions. Removed unused TestResumeNever class from test_main.py and added a placeholder TestResume class in test_sdk.py.
Pull Request Overview
This PR implements a “resume” feature, enabling users to continue logging to existing runs by passing `resume` and `run_id` to `swanlab.init`. It adds validation and storage of resume settings, synchronizes remote metrics and configs when resuming, and updates callback logic and client APIs accordingly.
- Extend `init()` API with `resume`/`run_id` parameters and enforce resume rules
- Enhance `RunStore`, callbackers, and `Client` to handle resumed runs (config, metrics, log epoch)
- Add tests for run ID formatting, resume behavior, and mock run state
Reviewed Changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.
File | Description |
---|---|
tutils/setup.py | Allow injecting run_id into UseMockRunState |
test/unit/data/test_sdk.py | Add tests for re-init and resume init behavior |
test/unit/data/test_namer.py | Add tests for generate_run_id |
test/unit/data/test_formatter.py | Add check_run_id_format tests |
test/unit/data/run/test_main.py | Wrap SwanLabRun tests with UseMockRunState and add run_id tests |
test/unit/data/run/test_key.py | New tests for SwanLabKey mock-from-remote |
test/unit/data/run/test_config.py | Wrap config tests with UseMockRunState and add config format tests |
test/unit/data/callbacker/test_cloud.py | Add cloud resume callback tests |
test/unit/core_python/test_client.py | Extend client tests with resume and exp suite |
swanlab/log/log.py | Add epoch parameter to start_proxy |
swanlab/data/store.py | Extend RunStore with resume, config, metrics, log_epoch |
swanlab/data/sdk.py | Add resume/run_id parameters and validation in init |
swanlab/data/run/metadata/hardware/utils.py | Add is_system_key utility |
swanlab/data/run/metadata/hardware/__init__.py | Export is_system_key |
swanlab/data/run/main.py | Populate run_id in SwanLabRun and adjust monitoring logic |
swanlab/data/run/key.py | Complete SwanLabKey to support remote mock columns |
swanlab/data/run/exp.py | Init experiment with existing metrics on resume |
swanlab/data/run/config.py | Rename __fmt_config, add revert_config |
swanlab/data/porter/__init__.py | Clarify assert message in trace_metric |
swanlab/data/namer.py | Add generate_run_id() |
swanlab/data/modules/wrapper.py | Add get_class() method |
swanlab/data/formatter.py | Add check_run_id_format() |
swanlab/data/callbacker/offline.py | Set run_id and new on init |
swanlab/data/callbacker/local.py | Set run_id and new on init |
swanlab/data/callbacker/disabled.py | Set run_id and new on init |
swanlab/data/callbacker/cloud.py | Implement resume mount logic and fetch remote data |
swanlab/data/callbacker/callback.py | Pass run_store.log_epoch into terminal proxy |
swanlab/core_python/uploader/upload.py | Adapt create_data and add trace_metrics |
swanlab/core_python/client/model.py | Expose flag_id, config, root_proj_cuid, root_exp_cuid |
swanlab/core_python/client/__init__.py | Propagate flagId, adjust post/put/get return values and mount_exp logic |
Comments suppressed due to low confidence (4)
tutils/setup.py:48
- UseMockRunState now sets up only `run_dir`, `media_dir`, and `log_dir`, but misses creating `console_dir` and `file_dir`. Tests or code referencing those will fail; add `os.mkdir(self.store.console_dir)` and `os.mkdir(self.store.file_dir)`.
os.mkdir(self.store.run_dir)
swanlab/log/log.py:135
`AtomicCounter` is used here but not imported at the top of the file. Add `from .atomic_counter import AtomicCounter` or the correct import to avoid a NameError.
if epoch is not None:
swanlab/data/formatter.py:179
`re` is not imported in this module. Please add `import re` at the top of `formatter.py`.
if not re.match(r"^[a-z0-9]{21}$", run_id_str):
test/unit/core_python/test_client.py:145
`is_skip_cloud_test` is not defined in this scope. You likely meant `T.is_skip_cloud_test` or need to import it from `tutils`.
@pytest.mark.skipif(is_skip_cloud_test, reason="skip cloud test")
Replaces all occurrences of the 'run_id' parameter with 'id' in the SwanLabInitializer class and updates related documentation and logic. This change improves consistency and clarity in parameter naming.
Renamed the 'run_id' property to 'id' in SwanLabRun for consistency. Added run id format validation using check_run_id_format in SwanLabInitializer. Updated error message in check_run_id_format for clarity.
LGTM
Changed the expected error message in test assertions from 'run_id' to 'id' to match updated exception messages in run ID format validation tests.
Replaces references to run.run_id with run.id in test_cloud.py and test_main.py to align with updated attribute naming. Ensures tests use the correct property for run identification.
Introduces RUN_ID and RESUME environment variables to SwanLabEnv and updates SwanLabInitializer to load them from the environment. Also fixes potential issues with summary parsing in cloud callback and always updates experiment state on close.
Corrects the logic for determining proj_id by using root_proj_cuid instead of root_exp_cuid, ensuring data is uploaded to the correct project.
Introduces test cases for the 'allow' and 'must' resume modes in swanlab. The tests verify correct behavior when resuming runs, handling duplicate steps, and error states for both modes.
Expanded test coverage for the resume feature in various modes (never, allow, must) in test_sdk.py, including parameter validation and error handling. Added time delays in resume tests to simulate real-world scenarios. Updated comments in key.py for clarity on step handling after resume.
Updated ValueError messages in the Client class to provide clearer information when resuming cloned experiments or when experiment-project mismatches occur. This enhances clarity for users encountering these errors.
Merged the split ValueError message into a single string when raising an error for cloned experiments that cannot be resumed.
Description
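This PR delivers a major new feature: Resume. It lets users do something like resuming training from a checkpoint, that is, continue uploading metrics to an experiment that has already finished.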
closes: #1054
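API
For users, Resume requires very few changes. Passing the following parameters to the `init` function is enough to resume a run:
Here `id` is the cloud experiment identifier, which currently must be a 21-character string of `a-z`, `0-9`. The `resume` parameter supports the following options:
The following additional rules apply:
- When the `resume` parameter is not passed, it is equivalent to `never`, which means `id` must not be passed
- A `resume` value other than `never` can only be set when `mode=cloud`
- … `allow`; the latter is equivalent to `never`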
一个典型代码案例如下:
loss和acc将在同一实验被聚合:
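Notes
- If `resume` is `must`, `id` must be passed; if `resume` is `never`, `id` must not be passed
- `id` must be a 21-character string of `a-z`, `0-9`
- Once this experiment is `resume`d, the old experiment will no longer upload metrics, but its local logs are still recorded
- On `resume`, the experiment name, tags, and description are not updated (for now)
- On `resume`, hardware monitoring is not enabled (for now)
- On `resume`, environment information is not collected (for now)
- On `resume`, terminal logs are updated automatically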
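The parameter rules in the notes above can be sketched as a small validator. This is an illustration under assumptions, not the check performed by SwanLabInitializer: the function name `validate_resume_args`, its signature, the `never` default, and the error messages are all hypothetical, and the mode names are taken from the callbacker file names in this PR.

```python
import re
from typing import Optional

RESUME_MODES = ("must", "allow", "never")
RUN_ID_PATTERN = re.compile(r"[a-z0-9]{21}")


def validate_resume_args(
    resume: str = "never",
    run_id: Optional[str] = None,
    mode: str = "cloud",
) -> None:
    """Raise ValueError if the resume/id combination breaks the rules above."""
    if resume not in RESUME_MODES:
        raise ValueError(f"resume must be one of {RESUME_MODES}, got {resume!r}")
    # Assumed rule: anything other than `never` is only meaningful in cloud mode.
    if resume != "never" and mode != "cloud":
        raise ValueError("resume other than 'never' requires mode='cloud'")
    if resume == "must" and run_id is None:
        raise ValueError("id must be passed when resume='must'")
    if resume == "never" and run_id is not None:
        raise ValueError("id must not be passed when resume='never'")
    if run_id is not None and RUN_ID_PATTERN.fullmatch(run_id) is None:
        raise ValueError("id must be 21 characters of a-z, 0-9")
```

For example, `validate_resume_args(resume="must", run_id="a" * 21)` passes, while `validate_resume_args(resume="must")` raises because no `id` was supplied.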