Skip to content

Add online LLM data cleaning functionality #119

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 23, 2025

Conversation

niulinbiao
Copy link
Contributor

实现在线LLM清洗数据功能,通过配置文件online_llm_clear参数可选择在线大模型清洗数据,适配所有openai风格的接口。

@xming521 xming521 changed the title 实现在线LLM清洗数据功能 Add online LLM data cleaning functionality May 22, 2025
@xming521 xming521 merged commit 17b7c82 into xming521:master May 23, 2025
1 check passed
@xming521 xming521 requested a review from Copilot May 23, 2025 02:38
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds an online LLM data cleaning functionality that adapts to OpenAI-style interfaces. The changes include a new prompt for online cleaning, a new cleaning strategy class integrated in the QA generation workflow, and an online LLM inference module along with the corresponding configuration updates.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
weclone/prompts/clean_data.py Introduces an online cleaning prompt for LLM-based data cleaning.
weclone/data/qa_generator.py Updates the QA generation process to conditionally use the online cleaning strategy.
weclone/data/clean/strategies_online.py Implements the new cleaning strategy using an online LLM service.
weclone/core/inference/online_infer.py Adds an inference client for calling OpenAI-style APIs.
settings.template.jsonc Adds configuration options for online LLM cleaning.
pyproject.toml Updates the configuration version and changelog accordingly.
Comments suppressed due to low confidence (3)

weclone/data/clean/strategies_online.py:25

  • The class name 'OlineLLMCleaningStrategy' appears to be misspelled. For consistency with the rest of the code, consider renaming it to 'OnlineLLMCleaningStrategy'.
class OlineLLMCleaningStrategy(CleaningStrategy):

weclone/data/qa_generator.py:13

  • The import uses 'OlineLLMCleaningStrategy' which seems to contain a typographical error. Please update the import to use the corrected class name 'OnlineLLMCleaningStrategy' once the class name is updated.
from weclone.data.clean.strategies_online import OlineLLMCleaningStrategy

settings.template.jsonc:37

  • [nitpick] The configuration key 'online_llm_clear' is inconsistent with the naming used in the prompt (ONLINE_LLM_CLEAN_PROMPT) and cleaning strategy. Consider renaming it to 'online_llm_clean' for consistency.
"online_llm_clear":false,

@Undertone0809
Copy link

可否添加更具体的文档来呈现如何配置这一信息呢?

@Undertone0809
Copy link

@xming521 作者能否把 vitepress 文档的代码也放入本项目内,这些方便大家一起来维护优化文档。

@xming521
Copy link
Owner

@xming521 作者能否把 vitepress 文档的代码也放入本项目内,这些方便大家一起来维护优化文档。

https://github.com/xming521/WeClone-docs/tree/main/docs 文档仓库

@Undertone0809
Copy link

@xming521 可以考虑一下放在一个仓库里?很多贡献者似乎默认不知道单独文档仓库的位置,他们在写 PR 的时候,很多时候需要完善对应的文档,放在一起可以让他们更加方便地撰写文档。否则有的时候需要在这里和文档仓库构建两个 PR,放在一起可以减少贡献者上手的成本,对仓库文档维护来说也是件好事。

@xming521
Copy link
Owner

但是会污染主仓库

@Undertone0809
Copy link

但是会污染主仓库

您理解的污染,具体指的是?🤔

@xming521
Copy link
Owner

但是会污染主仓库

您理解的污染,具体指的是?🤔
大家没必要把文档内容克隆下来吧,文档仓库里还有很多图片很占空间

@Undertone0809
Copy link

Undertone0809 commented May 27, 2025

我刚看了一下文档,其实目前没有图片资源,而且 vitepress 文档本身体积也非常小,几乎不会对主仓造成实际负担,和本地模型相比,这个大小其实很微不足道的。我能理解你担心文档混入代码仓库会让目录结构显得不够干净、降低代码仓库的专注度,特别是对一些追求主仓代码整洁度的项目维护者来说,这确实是一个需要权衡的问题。

但从大多数开源项目实践来看,文档和代码放在一个仓库其实是提升协作效率的最佳实践,尤其对贡献者来说有几个直接好处:

  • 减少 PR 心智负担:很多开发者在提功能或修改代码时会同时想更新文档,如果分仓,他们需要同步打开两个 PR,对初学者和临时贡献者门槛较高;

  • 便于 review 关联:代码和文档在一个 PR 里更容易被 reviewer 一起 review,功能逻辑和文档说明可以联动,分开来的话有些人就不会写对应的文档,最后可能要你来维护,你来写文档,这样反而可能会增加你的工作量;

  • 提升文档维护的连贯性:文档内容的版本和代码功能保持一致,不容易因为版本漂移导致文档失效;

  • 符合 Dev-first 项目生态:很多 dev-friendly 项目(如 Vite、Next.js、LangChain)都将文档作为子目录 docs/ 一起维护。

  • 从社区运营角度来看,这也是降低新贡献者上手门槛的一种方式,更容易形成协作氛围。

看看有没有什么我可以帮上忙的。

@niulinbiao niulinbiao deleted the nb-dev branch June 8, 2025 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants