Releases: xming521/WeClone
v0.3.02
🎉 What's Changed
Enable configurable thinking in offline cleaning, improve image and GIF handling in QA processing, refactor configuration models for cleaner dataset naming, and bump versions and dependencies for release v0.3.02.
New Features:
- Introduce an enable_thinking flag in LLMCleanConfig to control offline cleaning behavior (see the config sketch after this list)
- Support scoring and cleaning of datasets that contain images (QA pairs that include images are assigned the highest score)
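A rough sketch of where the new flag might sit in the cleaning configuration; only the enable_thinking name and the LLMCleanConfig class come from these notes, while the base class, default value, and any other fields are assumptions.

```python
# Hypothetical sketch of the cleaning config around the new flag; everything
# except the enable_thinking field name is illustrative, not WeClone's schema.
from pydantic import BaseModel


class LLMCleanConfig(BaseModel):
    # Controls whether the cleaning model runs its "thinking" phase during
    # offline (vLLM) scoring; disabling it shortens generation.
    enable_thinking: bool = False
```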
Enhancements:
- Refactor cleaned_dataset_name to be derived dynamically from the original dataset
- Pass enable_thinking through the vLLM inference pipeline and adjust repetition_penalty and max_new_tokens accordingly
- Implement CommonMethods to parse dataset names with modality-based suffixes and remove deprecated config fields
Build:
- Bump project version to 0.3.02 and config_version to 0.3.02
- Update dependencies: openai to 1.87.0, vllm to 0.10.0, torch to 2.7.1, add torchvision, transformers to 4.53.2, and triton to 3.3.1
CI:
- Upgrade pre-commit-hooks to v6.0.0 and ruff to v0.12.8
Full Changelog: v0.3.01...v0.3.02
v0.3.01
🎉 What's Changed
New Features:
- Added retry_on_http_error and retry_openai_api decorators with a backoff strategy to automatically retry online LLM calls (see the sketch after this list)
- feat(dataset): Enhance training data by including time-related options.
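A minimal sketch of how a retry-with-backoff decorator of this kind could be written; the parameters, delays, and exception handling here are assumptions rather than WeClone's actual implementation.

```python
import functools
import time


def retry_on_http_error(max_retries: int = 3, base_delay: float = 1.0):
    """Illustrative retry decorator with exponential backoff for online LLM calls."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:  # in practice, narrow this to HTTP/API errors
                    if attempt == max_retries - 1:
                        raise
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```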
Enhancements:
- performance(PII): add batch PII detection to improve performance
- refactor: unify the combined-content separator to \n
- feat(PII): enhance PII detection for Chinese
Fix:
- Fix the deepspeed version issue (#184)
- fix(dataset): data processing results were missing images
Tests:
- Implement test for PII filtering in dataset generation
- Refactor test fixtures in test_full_pipe, add setup_data_environment and blocked word/image tag assertions
Full Changelog: v0.3.0...v0.3.01
v0.3.0
🎉 What's Changed
- Support fine-tuning of Telegram chat logs
- Utilize presidio for privacy filtering (see the sketch after this list)
- Added multilingual support configuration
- Optimized vllm inference and decoding and parsing
- Optimized the logging system, hooked other dependencies' logging, and added log-level configuration
- Translated log printing and code comments to English
- Migrated the commentjson dependency to pyjson5
- Added/updated CLI commands
- Other changes (e.g., examples, tests, README)
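For context, a minimal presidio example of the kind this privacy filtering builds on; WeClone's actual recognizers, supported languages, and batch handling are not shown here.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Call me at 212-555-0199 tomorrow."
# Detect PII entities, then mask them in the original text.
findings = analyzer.analyze(text=text, language="en")
print(anonymizer.anonymize(text=text, analyzer_results=findings).text)
```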
Version upgraded with some text updates for consistency and clarity.
🐛 Bug fix
Full Changelog: v0.2.24...v0.3.0
v0.2.24
🥰 What's Changed
- Update torch version to 2.7.0 and vllm version to 0.9.1, switch offline inference to chat-style invocation
- Add test_model_args and vllm_args configuration items to allow custom test dataset files
- Add a config file path option to the CLI, with support for the WECLONE_CONFIG_PATH environment variable (see the sketch after this list)
- Update max_new_tokens and enable_thinking parameters in data cleaning strategy to optimize inference
- Partial feature adaptation for qwen3
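One way the new environment variable could be used to point the CLI at a custom config file; the path and the subcommand shown are illustrative.

```python
import os
import subprocess

# Point WeClone at a non-default config file before invoking the CLI.
os.environ["WECLONE_CONFIG_PATH"] = "/path/to/my_settings.jsonc"  # illustrative path
subprocess.run(["weclone-cli", "make-dataset"], check=True)  # subcommand is illustrative
```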
🐛 Bug fix
fix #158 fix #83 fix #77 fix #69
Full Changelog: v0.2.23...v0.2.24
v0.2.23
🥰 What's Changed
- Refactor settings and add an image modality test script, by @xming521 in #153
- Refactor all settings-related functionality using pydantic
- Add an image modality test script
- Unified dataset: chat-sft
- Pure-text model fine-tuning data switched to the ShareGPT format, carrying chat-history context by default (see the sketch after this list)
- Upgrade dependencies to support qwen3
- feat(dataset): Add ImageToText by integrating the vision API and refactor the cleaning strategy, by @BAIKEMARK in #156
- Add pre-commit, format code with ruff, and update .gitignore, pyproject.toml, and README.md, by @xming521 in #149
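An illustrative record in the ShareGPT layout referenced above, with earlier turns serving as chat-history context; the message contents are made up.

```python
# Standard ShareGPT-style structure: a list of turns under "conversations",
# each tagged with "from" (human/gpt) and "value" (the message text).
example_record = {
    "conversations": [
        {"from": "human", "value": "Are we still meeting tomorrow?"},
        {"from": "gpt", "value": "Yes, 3 pm works for me."},
        {"from": "human", "value": "Great, send me the agenda."},
        {"from": "gpt", "value": "Will do, I'll share it tonight."},
    ]
}
```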
Full Changelog: v0.2.22...v0.2.23
v0.2.21
What's Changed
- doc: add LangBot integration by @RockChinQ in #65
- Optimize CSV file reading by @Mundi-Xu in #87
- add log and test pipeline by @xming521 in #118
- Add online LLM data cleaning functionality by @niulinbiao in #119
New Contributors
- @RockChinQ made their first contribution in #65
- @songhahaha66 made their first contribution in #68
- @BAIKEMARK made their first contribution in #74
- @Mundi-Xu made their first contribution in #87
- @niulinbiao made their first contribution in #119
Full Changelog: v0.2.2...v0.2.21
v0.2.20
✨ New Features
- Add LLM data cleaning: score chat records with an LLM judge, using vllm for offline inference
- Support CLI usage via the weclone-cli command
🎈 Enhancements
- Move the blocked_words blocked-word configuration into settings.json (see the sketch after this list)
- Update dependency versions: bump torch and torchaudio to 2.6.0, update openai to 1.52.0 and update test_model accordingly, switch the PyTorch source to cu124, and add vllm
- Switch the config file to a template-based approach
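A hedged sketch of how a blocked-word list from the settings file might be applied when filtering chat records; the key name comes from the notes, the words and the helper are illustrative.

```python
# Example blocked-word check; the list would normally be read from settings.json.
blocked_words = ["password", "id number", "home address"]  # example entries only


def contains_blocked_word(message: str) -> bool:
    return any(word in message for word in blocked_words)


print(contains_blocked_word("my home address is ..."))  # True
```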
🐛 Bug fix
Full Changelog: v0.2.0...v0.2.2
v0.2.1-beta1
- Support CLI usage via the weclone-cli command
- Update dependency versions: bump torch and torchaudio to 2.6.0, update openai to 1.52.0 and update test_model accordingly, switch the PyTorch source to cu124, and add vllm
- Switch the config file to a template-based approach
v0.2.0
What's Changed
- Version 0.2.0 is a complete refactor; the dataset directory and script paths have all changed. After pulling the new code, place the csv folder under dataset and reinstall the dependencies.
- The Qwen2.5-7B-Instruct model is used by default; modify model_name_or_path and template in settings.json to choose another model (see the sketch after this list).
- Upgrade the Python version to 3.10
- Fix DeepSpeed multi-GPU training
- FlashAttention can be used to speed up training
- Improve the documentation
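A hedged sketch of the settings.json keys mentioned above; only model_name_or_path and template are named in the notes, and the values shown are illustrative.

```python
# Illustrative fragment of settings.json, expressed as a Python dict.
model_settings = {
    "model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",  # default model per the notes
    "template": "qwen",  # template name is an assumption
}
```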