feat(dataset): integrate vision API and refactor clean strategy #156

BAIKEMARK · 2025-06-12T15:30:25Z

Add ImageToText

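- Add a VisionApiConfig class for configuring the vision API
- Integrate image recognition into data processing, with support for parallel processing
- Refactor the data cleaning strategy to support both online and offline modes
- Optimize the data cleaning pipeline to improve extensibility and maintainability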

Copilot

Pull Request Overview

This PR integrates a Vision API for image recognition and refactors the data cleaning strategy to support multi-modal datasets, including updates to configuration, processing, and template files.

Introduced VisionApiConfig and updated dataset configurations to conditionally switch data sources based on image recognition.
Added ImageToTextProcessor to process images via an external API with parallel execution, and refactored cleaning strategies.
Updated documentation and sample configuration files to reflect the new multi-modal processing options.
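The configuration switch described above could look roughly like the following. This is a minimal sketch and not the PR's actual code: only the `VisionApiConfig` name and the existence of an enable flag come from the summary, while the field names (`enable`, `api_url`, `api_key`, `model_name`) and the `select_dataset` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VisionApiConfig:
    """Hypothetical shape of the vision API settings; real field names may differ."""

    enable: bool = False
    api_url: Optional[str] = None
    api_key: Optional[str] = None
    model_name: Optional[str] = None


def select_dataset(
    vision_api: VisionApiConfig,
    dataset_when_enabled: str,
    dataset_when_disabled: str,
) -> str:
    """Pick a dataset name depending on whether image recognition is enabled.

    Hypothetical helper: it only illustrates the idea of switching the data
    source on the enable flag.
    """
    return dataset_when_enabled if vision_api.enable else dataset_when_disabled
```

Keeping the switch in one small function means the training code never has to inspect the vision settings directly.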

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
weclone/utils/config_models.py	Added VisionApiConfig and integrated vision_api in MakeDatasetArgs
weclone/utils/configV2.py	Passed vision_api config into WCTrainSftConfig
weclone/utils/config.py	Adjusted dataset selection based on vision_api enable flag
weclone/train/train_sft.py	Updated cleaning strategy usage and dynamic dataset name update
weclone/prompts/clean_data.py	Modified instructions for evaluating chat quality with style criteria
weclone/data/utils.py	Added ImageToTextProcessor for image-to-text conversion with retry logic
weclone/data/qa_generatorV2.py	Integrated image processing in parallel for QA generation
weclone/data/clean/strategies.py	Refactored cleaning strategies and consolidated online cleaning logic
settings.template.jsonc & examples/mllm.template.jsonc	Added vision_api configuration parameters
dataset/res_csv/sft/dataset_info.json	Added dataset info for the cleaned chat-sft dataset
README.md	Updated documentation to describe multi-modal training and data completion using vision_api
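As an illustration of the parallel image processing mentioned for `weclone/data/qa_generatorV2.py`, here is a minimal sketch using a thread pool. The function name `describe_images_parallel`, its parameters, and the injected `describe` callable are hypothetical; the PR may organize this differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def describe_images_parallel(
    image_paths: Sequence[str],
    describe: Callable[[str], Optional[str]],
    max_workers: int = 4,
) -> List[Optional[str]]:
    """Run `describe` on every image path concurrently.

    `describe` stands in for a call to the vision API (assumption). Threads
    suit this workload because each call mostly waits on network I/O.
    Results are returned in the same order as `image_paths`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(describe, image_paths))
```

Because `ThreadPoolExecutor.map` preserves input order, each description can be matched back to its message by position.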

Comments suppressed due to low confidence (2)

weclone/data/utils.py:63

The _encode_image_to_base64 method returns None on failure but is documented to return a string. Update the return type annotation to Optional[str] to accurately reflect possible outcomes.

return None
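A sketch of what the suggested annotation might look like. The real `_encode_image_to_base64` is a method on `ImageToTextProcessor` and its body is not shown in this review, so the standalone function below and its error handling are assumptions.

```python
import base64
from typing import Optional


def encode_image_to_base64(image_path: str) -> Optional[str]:
    """Return the file's contents as a base64 string, or None if it cannot be read."""
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    except OSError:
        # Missing or unreadable file: signal failure to the caller with None.
        return None
```

With `Optional[str]` in the signature, a type checker will flag callers that use the result without first checking for `None`.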

weclone/data/clean/strategies.py:159

The class name 'OlineLLMCleaningStrategy' appears to contain a typo. Consider renaming it to 'OnlineLLMCleaningStrategy' for clarity and consistency.

class OlineLLMCleaningStrategy(CleaningStrategy):
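If the class were renamed, one possible way to avoid breaking existing imports is to keep the old name as an alias. This is a sketch only: the `CleaningStrategy` base class below is a stand-in, and the `clean` method is a hypothetical interface, not the one defined in `weclone/data/clean/strategies.py`.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class CleaningStrategy(ABC):
    """Stand-in for the project's base class (assumed interface)."""

    @abstractmethod
    def clean(self, items: Iterable[str]) -> List[str]:
        ...


class OnlineLLMCleaningStrategy(CleaningStrategy):
    """Corrected spelling of the class name."""

    def clean(self, items: Iterable[str]) -> List[str]:
        # Placeholder body: the real strategy would score items with an LLM.
        return list(items)


# Backward-compatible alias so code importing the misspelled name keeps working.
OlineLLMCleaningStrategy = OnlineLLMCleaningStrategy
```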

Copilot · 2025-06-12T15:30:54Z

weclone/data/utils.py

+                    f"{self.api_url}/chat/completions", headers=headers, json=payload, timeout=60
+                )
+                if response.status_code == 200:
+                    pass


The branch for status code 200 currently only has a 'pass' statement. Consider removing 'pass' and adding a comment to clarify that the response is valid and processing continues.

Suggested change

pass

# Response is valid; processing continues below.
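One caveat: in Python a comment cannot be the only content of an `if` body, so replacing `pass` with just a comment would not parse. An alternative is to invert the check so that the failure path is handled first and the success path continues without an empty branch. The sketch below assumes a retry loop around the request; `post_with_retry`, its parameters, and the injected `post` callable are hypothetical and stand in for the `requests.post(...)` call in the diff.

```python
import time
from typing import Any, Callable, Optional


def post_with_retry(
    post: Callable[[], Any],
    max_retries: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[Any]:
    """Call `post` until it returns HTTP 200 or the retries are exhausted.

    `post` is expected to return an object with `status_code` and `json()`,
    as a `requests.Response` does. Returns the decoded JSON body on success,
    or None if every attempt failed.
    """
    for _attempt in range(max_retries):
        response = post()
        if response.status_code != 200:
            # Failure path: wait briefly, then try again.
            time.sleep(delay_seconds)
            continue
        # Response is valid; processing continues here.
        return response.json()
    return None
```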

BAIKEMARK added 4 commits June 12, 2025 13:13

refactor(data): refactor data cleaning logic and update evaluation criteria

f3e9efc

Update settings template

f3f0b61

UPDATE README

1f48559

BAIKEMARK requested a review from Copilot June 12, 2025 15:30

Copilot AI reviewed Jun 12, 2025

BAIKEMARK and others added 3 commits June 12, 2025 23:37

Merge branch 'master' into img-rec

2ac3d17

refactor(train_sft): remove configuration items that do not need to be set explicitly

6f465d3

Merge remote-tracking branch 'myfork/img-rec' into img-rec-a

db13823

xming521 merged commit bca3240 into xming521:master Jun 13, 2025
1 of 2 checks passed

BAIKEMARK commented Jun 12, 2025 (edited by xming521)
