Releases: xming521/WeClone
v0.3.02
🎉 What's Changed
Enable configurable thinking in offline cleaning, improve image and GIF handling in QA processing, refactor configuration models for cleaner dataset naming, and bump versions and dependencies for release v0.3.02.
New Features:
- Introduce an enable_thinking flag in LLMCleanConfig to control offline cleaning behavior (see the config sketch after this list)
- Support scoring and cleaning of datasets that contain images (QA pairs that include images are assigned the highest score)
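A rough sketch of where the new flag might sit in the cleaning configuration; only the enable_thinking name and the LLMCleanConfig class come from these notes, while the base class, default value, and any other fields are assumptions.

```python
# Hypothetical sketch of the cleaning config around the new flag; everything
# except the enable_thinking field name is illustrative, not WeClone's schema.
from pydantic import BaseModel


class LLMCleanConfig(BaseModel):
    # Controls whether the cleaning model runs its "thinking" phase during
    # offline (vLLM) scoring; disabling it shortens generation.
    enable_thinking: bool = False
```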
Enhancements:
- Refactor cleaned_dataset_name to be derived dynamically from the original dataset
- Pass enable_thinking through the vLLM inference pipeline and adjust repetition_penalty and max_new_tokens accordingly
- Implement CommonMethods to parse dataset names with modality-based suffixes and remove deprecated config fields
Build:
- Bump project version to 0.3.02 and config_version to 0.3.02
- Update dependencies: openai to 1.87.0, vllm to 0.10.0, torch to 2.7.1, add torchvision, transformers to 4.53.2, and triton to 3.3.1
CI:
- Upgrade pre-commit-hooks to v6.0.0 and ruff to v0.12.8
Full Changelog: v0.3.01...v0.3.02
v0.3.01
🎉 What's Changed
New Features:
- Added retry_on_http_error and retry_openai_api decorators with a backoff strategy to automatically retry online LLM calls (see the sketch after this list)
- feat(dataset): Enhance training data by including time-related options.
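A minimal sketch of how a retry-with-backoff decorator of this kind could be written; the parameters, delays, and exception handling here are assumptions rather than WeClone's actual implementation.

```python
import functools
import time


def retry_on_http_error(max_retries: int = 3, base_delay: float = 1.0):
    """Illustrative retry decorator with exponential backoff for online LLM calls."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:  # in practice, narrow this to HTTP/API errors
                    if attempt == max_retries - 1:
                        raise
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```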
Enhancements:
- performance(PII): add batch PII detection to improve performance
- refactor: unify the combined-content separator to \n
- feat(PII): enhance PII detection for Chinese
Fix:
- Fix the deepspeed version issue (#184)
- fix(dataset): data processing results were missing images
Tests:
- Implement test for PII filtering in dataset generation
- Refactor test fixtures in test_full_pipe, add setup_data_environment and blocked word/image tag assertions
Full Changelog: v0.3.0...v0.3.01
v0.3.0
🎉 What's Changed
- Support fine-tuning of Telegram chat logs
- Utilize presidio for privacy filtering (see the sketch after this list)
- Added multilingual support configuration
- Optimized vllm inference and decoding and parsing
- Optimized the logging system, hooked other dependencies' logging, and added log-level configuration
- Translated log printing and code comments to English
- Migrated the commentjson dependency to pyjson5
- Added/updated CLI commands
- Other changes (e.g., examples, tests, README)
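For context, a minimal presidio example of the kind this privacy filtering builds on; WeClone's actual recognizers, supported languages, and batch handling are not shown here.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Call me at 212-555-0199 tomorrow."
# Detect PII entities, then mask them in the original text.
findings = analyzer.analyze(text=text, language="en")
print(anonymizer.anonymize(text=text, analyzer_results=findings).text)
```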
Version upgraded with some text updates for consistency and clarity.
🐛 Bug fix
Full Changelog: v0.2.24...v0.3.0
v0.2.24
🥰 What's Changed
- Update torch version to 2.7.0 and vllm version to 0.9.1, switch offline inference to chat-style invocation
- Add test_model_args and vllm_args configuration items to allow custom test dataset files
- Add a config file path option to the CLI, with support for the WECLONE_CONFIG_PATH environment variable (see the sketch after this list)
- Update max_new_tokens and enable_thinking parameters in data cleaning strategy to optimize inference
- Partial feature adaptation for qwen3
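One way the new environment variable could be used to point the CLI at a custom config file; the path and the subcommand shown are illustrative.

```python
import os
import subprocess

# Point WeClone at a non-default config file before invoking the CLI.
os.environ["WECLONE_CONFIG_PATH"] = "/path/to/my_settings.jsonc"  # illustrative path
subprocess.run(["weclone-cli", "make-dataset"], check=True)  # subcommand is illustrative
```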
🐛 Bug fix
fix #158 fix #83 fix #77 fix #69
Full Changelog: v0.2.23...v0.2.24
v0.2.23
🥰 What's Changed
- Refactor settings and add an image modality test script, by @xming521 in #153
- Refactor all settings-related functionality using pydantic
- Add an image modality test script
- Unified dataset: chat-sft
- Pure-text model fine-tuning data switched to the ShareGPT format, carrying chat-history context by default (see the sketch after this list)
- Upgrade dependencies to support qwen3
- feat(dataset): Add ImageToText by integrating the vision API and refactor the cleaning strategy, by @BAIKEMARK in #156
- Add pre-commit, format code with ruff, and update .gitignore, pyproject.toml, and README.md, by @xming521 in #149
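An illustrative record in the ShareGPT layout referenced above, with earlier turns serving as chat-history context; the message contents are made up.

```python
# Standard ShareGPT-style structure: a list of turns under "conversations",
# each tagged with "from" (human/gpt) and "value" (the message text).
example_record = {
    "conversations": [
        {"from": "human", "value": "Are we still meeting tomorrow?"},
        {"from": "gpt", "value": "Yes, 3 pm works for me."},
        {"from": "human", "value": "Great, send me the agenda."},
        {"from": "gpt", "value": "Will do, I'll share it tonight."},
    ]
}
```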
Full Changelog: v0.2.22...v0.2.23
v0.2.21
What's Changed
- doc: add LangBot integration by @RockChinQ in #65
- Optimize CSV file reading by @Mundi-Xu in #87
- add log and test pipeline by @xming521 in #118
- Add online LLM data cleaning functionality by @niulinbiao in #119
New Contributors
- @RockChinQ made their first contribution in #65
- @songhahaha66 made their first contribution in #68
- @BAIKEMARK made their first contribution in #74
- @Mundi-Xu made their first contribution in #87
- @niulinbiao made their first contribution in #119
Full Changelog: v0.2.2...v0.2.21
v0.2.20
✨ New Features
- Add LLM data cleaning: score chat records with an LLM judge, using vllm for offline inference
- Support CLI usage via the weclone-cli command
🎈 Enhancements
- Move the blocked_words blocked-word configuration into settings.json (see the sketch after this list)
- Update dependency versions: bump torch and torchaudio to 2.6.0, update openai to 1.52.0 and update test_model accordingly, switch the PyTorch source to cu124, and add vllm
- Switch the config file to a template-based approach
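A hedged sketch of how a blocked-word list from the settings file might be applied when filtering chat records; the key name comes from the notes, the words and the helper are illustrative.

```python
# Example blocked-word check; the list would normally be read from settings.json.
blocked_words = ["password", "id number", "home address"]  # example entries only


def contains_blocked_word(message: str) -> bool:
    return any(word in message for word in blocked_words)


print(contains_blocked_word("my home address is ..."))  # True
```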
🐛 Bug fix
Full Changelog: v0.2.0...v0.2.2
v0.2.1-beta1
- Support CLI usage via the weclone-cli command
- Update dependency versions: bump torch and torchaudio to 2.6.0, update openai to 1.52.0 and update test_model accordingly, switch the PyTorch source to cu124, and add vllm
- Switch the config file to a template-based approach
v0.2.0
What's Changed
- Version 0.2.0 is a complete refactor; the dataset directory and script paths have all changed. After pulling the new code, place the csv folder under dataset and reinstall the dependencies.
- The Qwen2.5-7B-Instruct model is used by default; modify model_name_or_path and template in settings.json to choose another model (see the sketch after this list).
- Upgrade the Python version to 3.10
- Fix DeepSpeed multi-GPU training
- FlashAttention can be used to speed up training
- Improve the documentation
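A hedged sketch of the settings.json keys mentioned above; only model_name_or_path and template are named in the notes, and the values shown are illustrative.

```python
# Illustrative fragment of settings.json, expressed as a Python dict.
model_settings = {
    "model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",  # default model per the notes
    "template": "qwen",  # template name is an assumption
}
```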