feat(dataset): integrate vision API and refactor clean strategy #156
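- Added the VisionApiConfig class for configuring the vision API
- Integrated image recognition into data processing, with support for parallel processing
- Refactored the data cleaning strategy to support both online and offline modes
- Streamlined the data cleaning pipeline to improve extensibility and maintainability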
Pull Request Overview
This PR integrates a Vision API for image recognition and refactors the data cleaning strategy to support multi-modal datasets, including updates to configuration, processing, and template files.
- Introduced VisionApiConfig and updated dataset configurations to conditionally switch data sources based on image recognition.
- Added ImageToTextProcessor to process images via an external API with parallel execution, and refactored cleaning strategies.
- Updated documentation and sample configuration files to reflect the new multi-modal processing options.
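As a rough illustration of the "parallel execution" and "retry logic" mentioned for ImageToTextProcessor, the sketch below fans image paths out over a thread pool and retries each one a few times. It is not the PR's implementation: `describe_with_retry`, `describe_images`, and the injected `describe` callable (standing in for the actual vision API call) are hypothetical names chosen for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def describe_with_retry(
    describe: Callable[[str], str],
    image_path: str,
    max_retries: int = 3,
    backoff_seconds: float = 0.0,
) -> Optional[str]:
    """Call `describe` up to `max_retries` times; return None if every attempt fails."""
    for attempt in range(max_retries):
        try:
            return describe(image_path)
        except Exception:
            # Hypothetical policy: exponential backoff between attempts.
            if attempt < max_retries - 1:
                time.sleep(backoff_seconds * (2 ** attempt))
    return None


def describe_images(
    describe: Callable[[str], str],
    image_paths: Sequence[str],
    max_workers: int = 4,
    max_retries: int = 3,
) -> List[Optional[str]]:
    """Describe images concurrently; results keep the order of `image_paths`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(
                lambda path: describe_with_retry(describe, path, max_retries),
                image_paths,
            )
        )
```

Threads suit this workload because each task mostly waits on network I/O. Passing the API call in as a callable keeps the retry and concurrency logic testable without a live endpoint.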
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
weclone/utils/config_models.py | Added VisionApiConfig and integrated vision_api in MakeDatasetArgs |
weclone/utils/configV2.py | Passed vision_api config into WCTrainSftConfig |
weclone/utils/config.py | Adjusted dataset selection based on vision_api enable flag |
weclone/train/train_sft.py | Updated cleaning strategy usage and dynamic dataset name update |
weclone/prompts/clean_data.py | Modified instructions for evaluating chat quality with style criteria |
weclone/data/utils.py | Added ImageToTextProcessor for image-to-text conversion with retry logic |
weclone/data/qa_generatorV2.py | Integrated image processing in parallel for QA generation |
weclone/data/clean/strategies.py | Refactored cleaning strategies and consolidated online cleaning logic |
settings.template.jsonc & examples/mllm.template.jsonc | Added vision_api configuration parameters |
dataset/res_csv/sft/dataset_info.json | Added dataset info for the cleaned chat-sft dataset |
README.md | Updated documentation to describe multi-modal training and data completion using vision_api |
Comments suppressed due to low confidence (2)
weclone/data/utils.py:63
- The _encode_image_to_base64 method returns None on failure but is documented to return a string. Update the return type annotation to Optional[str] to accurately reflect possible outcomes.
return None
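A minimal sketch of what the suggested annotation could look like, assuming the method reads a file from disk and returns its base64 text. The standalone function name `encode_image_to_base64` and the choice to catch `OSError` are assumptions for this example, not the code in weclone/data/utils.py.

```python
import base64
from pathlib import Path
from typing import Optional


def encode_image_to_base64(image_path: str) -> Optional[str]:
    """Return the file's contents as base64 text, or None if it cannot be read."""
    try:
        data = Path(image_path).read_bytes()
    except OSError:
        # Missing or unreadable file: signal failure with None, as the annotation states.
        return None
    return base64.b64encode(data).decode("ascii")
```

With `Optional[str]` in the signature, type checkers will flag callers that use the result without first checking for `None`.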
weclone/data/clean/strategies.py:159
- The class name 'OlineLLMCleaningStrategy' appears to contain a typo. Consider renaming it to 'OnlineLLMCleaningStrategy' for clarity and consistency.
class OlineLLMCleaningStrategy(CleaningStrategy):
    f"{self.api_url}/chat/completions", headers=headers, json=payload, timeout=60
)
if response.status_code == 200:
    pass
The branch for status code 200 currently only has a 'pass' statement. Consider removing 'pass' and adding a comment to clarify that the response is valid and processing continues.
Suggested change:
-    pass
+    # Response is valid; processing continues below.
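Note that in Python an `if` body needs at least one statement, so a branch containing only a comment would not parse. One alternative is to invert the check into a guard clause so that no empty branch is needed. The sketch below assumes an OpenAI-style chat completions response body (`choices[0].message.content`); that shape and the helper name `extract_message_text` are assumptions for illustration, not taken from the PR.

```python
from typing import Any, Mapping, Optional


def extract_message_text(status_code: int, body: Mapping[str, Any]) -> Optional[str]:
    """Return the first choice's message text, or None on an error status or unexpected shape."""
    # Guard clause: handle the failure case first, so the success path needs no empty branch.
    if status_code != 200:
        return None
    try:
        # Assumed OpenAI-style layout: {"choices": [{"message": {"content": "..."}}]}
        return body["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        return None
```

A caller could pass `response.status_code` and `response.json()` and treat `None` as a signal to retry.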
Add ImageToText