Add online LLM data cleaning functionality #119

niulinbiao · 2025-05-22T12:10:43Z

实现在线LLM清洗数据功能，通过配置文件online_llm_clear参数可选择在线大模型清洗数据，适配所有openai风格的接口。

Copilot

Pull Request Overview

This PR adds an online LLM data cleaning functionality that adapts to OpenAI-style interfaces. The changes include a new prompt for online cleaning, a new cleaning strategy class integrated in the QA generation workflow, and an online LLM inference module along with the corresponding configuration updates.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
weclone/prompts/clean_data.py	Introduces an online cleaning prompt for LLM-based data cleaning.
weclone/data/qa_generator.py	Updates the QA generation process to conditionally use the online cleaning strategy.
weclone/data/clean/strategies_online.py	Implements the new cleaning strategy using an online LLM service.
weclone/core/inference/online_infer.py	Adds an inference client for calling OpenAI-style APIs.
settings.template.jsonc	Adds configuration options for online LLM cleaning.
pyproject.toml	Updates the configuration version and changelog accordingly.

Comments suppressed due to low confidence (3)

weclone/data/clean/strategies_online.py:25

The class name 'OlineLLMCleaningStrategy' appears to be misspelled. For consistency with the rest of the code, consider renaming it to 'OnlineLLMCleaningStrategy'.

class OlineLLMCleaningStrategy(CleaningStrategy):

weclone/data/qa_generator.py:13

The import uses 'OlineLLMCleaningStrategy' which seems to contain a typographical error. Please update the import to use the corrected class name 'OnlineLLMCleaningStrategy' once the class name is updated.

from weclone.data.clean.strategies_online import OlineLLMCleaningStrategy

settings.template.jsonc:37

[nitpick] The configuration key 'online_llm_clear' is inconsistent with the naming used in the prompt (ONLINE_LLM_CLEAN_PROMPT) and cleaning strategy. Consider renaming it to 'online_llm_clean' for consistency.

"online_llm_clear":false,

Undertone0809 · 2025-05-27T05:37:08Z

可否添加更具体的文档来呈现如何配置这一信息呢？

Undertone0809 · 2025-05-27T05:39:51Z

@xming521 作者能否把 vitepress 文档的代码也放入本项目内，这些方便大家一起来维护优化文档。

xming521 · 2025-05-27T05:46:01Z

@xming521 作者能否把 vitepress 文档的代码也放入本项目内，这些方便大家一起来维护优化文档。

https://github.com/xming521/WeClone-docs/tree/main/docs 文档仓库

Undertone0809 · 2025-05-27T06:01:28Z

@xming521 可以考虑一下放在一个仓库里？很多贡献者似乎默认不知道单独文档仓库的位置，他们在写 PR 的时候，很多时候需要完善对应的文档，放在一起可以让他们更加方便地撰写文档。否则有的时候需要在这里和文档仓库构建两个 PR，放在一起可以减少贡献者上手的成本，对仓库文档维护来说也是件好事。

xming521 · 2025-05-27T06:03:00Z

但是会污染主仓库

Undertone0809 · 2025-05-27T06:29:39Z

但是会污染主仓库

您理解的污染，具体指的是？🤔

xming521 · 2025-05-27T06:32:02Z

但是会污染主仓库

您理解的污染，具体指的是？🤔
大家没必要把文档内容克隆下来吧，文档仓库里还有很多图片很占空间

Undertone0809 · 2025-05-27T06:40:26Z

我刚看了一下文档，其实目前没有图片资源，而且 vitepress 文档本身体积也非常小，几乎不会对主仓造成实际负担，和本地模型相比，这个大小其实很微不足道的。我能理解你担心文档混入代码仓库会让目录结构显得不够干净、降低代码仓库的专注度，特别是对一些追求主仓代码整洁度的项目维护者来说，这确实是一个需要权衡的问题。

但从大多数开源项目实践来看，文档和代码放在一个仓库其实是提升协作效率的最佳实践，尤其对贡献者来说有几个直接好处：

减少 PR 心智负担：很多开发者在提功能或修改代码时会同时想更新文档，如果分仓，他们需要同步打开两个 PR，对初学者和临时贡献者门槛较高；
便于 review 关联：代码和文档在一个 PR 里更容易被 reviewer 一起 review，功能逻辑和文档说明可以联动，分开来的话有些人就不会写对应的文档，最后可能要你来维护，你来写文档，这样反而可能会增加你的工作量；
提升文档维护的连贯性：文档内容的版本和代码功能保持一致，不容易因为版本漂移导致文档失效；
符合 Dev-first 项目生态：很多 dev-friendly 项目（如 Vite、Next.js、LangChain）都将文档作为子目录 docs/ 一起维护。
从社区运营角度来看，这也是降低新贡献者上手门槛的一种方式，更容易形成协作氛围。

看看有没有什么我可以帮上忙的。

niulinbiao and others added 5 commits May 22, 2025 19:50

实现在线LLM清洗数据

bcf89d2

实现在线LLM清洗数据

5085a34

Merge branch 'master' into nb-dev

84f2577

删除 .idea 文件夹，清理项目配置文件

1de1373

编写配置文件更新日志

2bf7bcc

xming521 changed the title ~~实现在线LLM清洗数据功能~~ Add online LLM data cleaning functionality May 22, 2025

xming521 merged commit 17b7c82 into xming521:master May 23, 2025
1 check passed

xming521 requested a review from Copilot May 23, 2025 02:38

Copilot AI reviewed May 23, 2025

View reviewed changes

niulinbiao deleted the nb-dev branch June 8, 2025 03:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add online LLM data cleaning functionality #119

Add online LLM data cleaning functionality #119

Uh oh!

niulinbiao commented May 22, 2025

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Undertone0809 commented May 27, 2025

Uh oh!

Undertone0809 commented May 27, 2025

Uh oh!

xming521 commented May 27, 2025

Uh oh!

Undertone0809 commented May 27, 2025

Uh oh!

xming521 commented May 27, 2025

Uh oh!

Undertone0809 commented May 27, 2025

Uh oh!

xming521 commented May 27, 2025

Uh oh!

Undertone0809 commented May 27, 2025 •

edited

Loading

Uh oh!

Uh oh!

Add online LLM data cleaning functionality #119

Add online LLM data cleaning functionality #119

Uh oh!

Conversation

niulinbiao commented May 22, 2025

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Undertone0809 commented May 27, 2025

Uh oh!

Undertone0809 commented May 27, 2025

Uh oh!

xming521 commented May 27, 2025

Uh oh!

Undertone0809 commented May 27, 2025

Uh oh!

xming521 commented May 27, 2025

Uh oh!

Undertone0809 commented May 27, 2025

Uh oh!

xming521 commented May 27, 2025

Uh oh!

Undertone0809 commented May 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Undertone0809 commented May 27, 2025 •

edited

Loading