Skip to content

Releases: xming521/WeClone

v0.3.02

17 Aug 07:26
a96996f
Compare
Choose a tag to compare

🎉 What's Changed

Enable configurable thinking in offline cleaning, improve image and gif handling in QA processing, refactor configuration models for cleaner dataset naming, and bump versions and dependencies for release v0.3.02

New Features:

  • Introduce enable_thinking flag in LLMCleanConfig to control offline cleaning behavior
  • Supporting scoring and cleaning of datasets containing images (assigning the highest score to QA pairs that include images).

Enhancements:

  • Refactor cleaned_dataset_name to derive dynamically from original dataset
  • Pass enable_thinking through vLLM inference pipeline and adjust repetition_penalty and max_new_tokens accordingly
  • Implement CommonMethods to parse dataset names with modality-based suffixes and remove deprecated config fields

Build:

  • Bump project version to 0.3.02 and config_version to 0.3.02
  • Update dependencies: openai to 1.87.0, vllm to 0.10.0, torch to 2.7.1, add torchvision, transformers to 4.53.2, and triton to 3.3.1

Full Changelog: v0.3.01...v0.3.02

😊 更新内容

在离线清理中启用可配置的“思考”功能,改进问答处理中的图像和 GIF 处理,重构配置模型以实现更清晰的数据集命名,并为发布 v0.3.02 提升版本和依赖项。

新功能:

  • 引入 enable_thinking 以控制离线清理行为
  • 支持对含有图片的数据集打分清洗(含有图片的qa对赋值最高分)

改进:

  • 重构 cleaned_dataset_name 以从原始数据集动态派生
  • enable_thinking 传递给 vLLM 推理管道,并相应调整 repetition_penaltymax_new_tokens
  • 实现 CommonMethods 以解析带有模态后缀的数据集名称,并移除已弃用的配置字段

构建:

  • 将项目版本提升至 0.3.02,配置版本提升至 0.3.02
  • 更新依赖项:openai 至 1.87.0,vllm 至 0.10.0,torch 至 2.7.1,添加 torchvisiontransformers 至 4.53.2,以及 triton 至 3.3.1

CI:

  • pre-commit-hooks 升级至 v6.0.0,ruff 升级至 v0.12.8

v0.3.01

17 Jul 07:23
d65784f
Compare
Choose a tag to compare

🎉 What's Changed

New Features:

  • Added retry_on_http_error and retry_openai_api decorators with backoff strategy to implement automatic retry mechanism for online LLM calls
  • feat(dataset): Enhance training data by including time-related options.

Enhancements:

  • performance(PII): add batch PII detection to improve performance
  • refactor: unifies combined content separator \n
  • feat(PII): enhance PII detection for Chinese

Fix:

  • Fix Regarding deepspeed version #184
  • fix(dataset): Data processing results have no images.

Tests:

  • Implement test for PII filtering in dataset generation
  • Refactor test fixtures in test_full_pipe, add setup_data_environment and blocked word/image tag assertions

Full Changelog: v0.3.0...v0.3.01

🥲更新内容

新功能:

  • 新增退避策略的retry_on_http_error和retry_openai_api装饰器,增加LLM在线调用自动重试机制
  • feat(dataset):训练数据增加包含时间的选项

功能优化:

  • performance(PII):新增批量PII检测以提升性能
  • refactor:统一group内容分隔符为\n
  • feat(PII):增强中文PII检测能力

问题修复:

  • 修复关于deepspeed版本的#184问题
  • fix(dataset):数据处理结果缺失图像

测试相关:

  • 新增PII过滤测试脚本
  • test_full_pipe新增setup_data_environment及禁用词/图片标签断言检查

v0.3.0

05 Jul 07:49
3de74ac
Compare
Choose a tag to compare

🎉 What's Changed

  • Support fine-tuning of Telegram chat logs
  • Utilize presidio for privacy filtering
  • Added multilingual support configuration
  • Optimized vllm inference and decoding parsing
  • Optimized the logging system, hooked other dependency logging, and added log level configuration
  • Translated log printing and code comments to English
  • Migrated commentjson dependency to pyjson5
  • Added/Updated CLI commands
  • Other (e.g., examples, tests, README)
    Version upgraded with some text updates for consistency and clarity.

🐛 Bug fix

fix #172 fix #170

Full Changelog: v0.2.24...v0.3.0

🥰 更新内容

  • 支持Telegram聊天记录微调
  • 使用presidio进行隐私过滤
  • 添加多语种支持配置;
  • 优化vllm推理、解码解析
  • 优化日志系统,hook其他依赖logging,添加日志等级配置
  • 日志打印、代码注释翻译为英文
  • 迁移commentjson依赖为pyjson5
  • 添加/更新了CLI命令
  • 其他(例如示例、测试、README)
    版本升级并进行了一些文本更新以保持一致性和清晰性。

v0.2.24

19 Jun 10:07
dda773a
Compare
Choose a tag to compare

🥰 What's Changed

  • Update torch version to 2.7.0 and vllm version to 0.9.1, switch offline inference to chat-style invocation
  • Add test_model_args and vllm_args configuration items to allow custom test dataset files
  • Add config file path option in CLI, support setting WECLONE_CONFIG_PATH environment variable
  • Update max_new_tokens and enable_thinking parameters in data cleaning strategy to optimize inference
  • Partial feature adaptation for qwen3

🐛 Bug fix

fix #158 fix #83 fix #77 fix #69

Full Changelog: v0.2.23...v0.2.24

🥰 更新内容

  • 更新torch版本至2.7.0,vllm版本到0.9.1,离线推理改为chat方式调用
  • 添加test_model_args and vllm_args配置项,允许自定义测试集文件
  • CLI中添加配置文件路径选项,支持设置WECLONE_CONFIG_PATH环境变量
  • 更新数据清理策略中的max_new_tokens和enable_thinking参数以优化推理过程
  • 部分功能适配qwen3

v0.2.23

13 Jun 08:01
2defd56
Compare
Choose a tag to compare

🥰 What's Changed

  • Refactoring settings, and add an image modality test script. by @xming521 in #153
    • Refactoring the entire settings-related functionality using pydantic
    • add an image modality test script
    • Unified dataset : chat-sft.
    • Pure text model fine-tuning data switched to ShareGPT format, defaulting to carrying chat history context
    • Upgrade dependencies to support qwen3
  • feat(dataset): Add ImageToText by integrate vision API and refactor clean strategy by @BAIKEMARK in #156
  • Add pre-commit, format code with ruff, update .gitignore, update pyproject.toml, update README.md. by @xming521 in #149

Full Changelog: v0.2.22...v0.2.23

🥰 更新内容

  • 重构配置项,并新增图像模态测试脚本。由 @xming521 提交于 #153
    • 使用 pydantic 重构全部配置相关功能
    • 新增图像模态测试脚本
    • 统一数据集格式:chat-sft
    • 纯文本模型微调数据切换为 ShareGPT 格式,默认携带聊天历史上下文
    • 升级依赖以支持 qwen3
  • feat(dataset): 添加图像转文本功能,重构清洗策略 由 @BAIKEMARK 提交于 #156
  • 新增 pre-commit,使用 ruff 格式化代码,更新 .gitignore,更新 pyproject.toml,更新 README.md。由 @xming521 提交于 #149

v0.2.21

23 May 07:56
65233af
Compare
Choose a tag to compare

What's Changed

更新了什么

New Contributors

Full Changelog: v0.2.2...v0.2.21

v0.2.20

08 May 14:11
537a221
Compare
Choose a tag to compare

✨ 新增特性

  • 新增llm清洗数据。使用llm judge对聊天记录进行打分,使用vllm进行离线推理
  • 支持cli 通过命令行 weclone-cli 使用

🎈 功能优化

  • blocked_words 禁用词库配置移到setting.json文件中
  • 更新依赖项版本,提升torch和torchaudio至2.6.0,更新openai至1.52.0 相应更新test_model,调整pytorch源为cu124,添加vllm。
  • 配置文件改为模板方式

🐛 修复 Bug

Full Changelog: v0.2.0...v0.2.2

v0.2.1-beta1

01 May 11:55
Compare
Choose a tag to compare
v0.2.1-beta1 Pre-release
Pre-release
  • 支持cli 通过命令行 weclone-cli 使用
  • 更新依赖项版本,提升torch和torchaudio至2.6.0,更新openai至1.52.0 相应更新test_model,调整pytorch源为cu124,添加vllm。
  • 配置文件改为模板方式

v0.2.0

22 Apr 13:34
00e5ff6
Compare
Choose a tag to compare

更新内容

  • 0.2.0 版本进行了全面重构,数据集目录和脚本路径全部进行了修改,拉取新代码后,csv文件夹放在dataset下,并且需要重新安装依赖。
  • 默认使用Qwen2.5-7B-Instruct模型,可修改settings.jsonmodel_name_or_pathtemplate选择其他模型。
  • python版本升级到3.10
  • 修复ds多卡训练
  • 可以使用FlashAttention加速训练
  • 完善文档

v0.1.3

13 Apr 11:59
9d269e2
Compare
Choose a tag to compare
  • 对数据处理进行了重构

What's Changed

Full Changelog: v0.1.2...v0.1.3