Add online LLM data cleaning functionality #119
Conversation
Pull Request Overview
This PR adds online LLM data cleaning that adapts to OpenAI-style interfaces. The changes include a new prompt for online cleaning, a new cleaning strategy class integrated into the QA generation workflow, and an online LLM inference module, along with the corresponding configuration updates.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `weclone/prompts/clean_data.py` | Introduces an online cleaning prompt for LLM-based data cleaning. |
| `weclone/data/qa_generator.py` | Updates the QA generation process to conditionally use the online cleaning strategy. |
| `weclone/data/clean/strategies_online.py` | Implements the new cleaning strategy using an online LLM service. |
| `weclone/core/inference/online_infer.py` | Adds an inference client for calling OpenAI-style APIs. |
| `settings.template.jsonc` | Adds configuration options for online LLM cleaning. |
| `pyproject.toml` | Updates the configuration version and changelog accordingly. |
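The inference module in `weclone/core/inference/online_infer.py` calls OpenAI-style APIs. As a rough illustration of what such a client involves, here is a minimal sketch that builds the standard chat-completions request body; the class and method names are hypothetical, not the PR's actual identifiers, and only the endpoint shape (`/chat/completions`, `choices[0].message.content`) follows the OpenAI convention:

```python
import json
import urllib.request


class OnlineLLMClient:
    """Minimal OpenAI-style chat client (illustrative sketch, not the PR's code)."""

    def __init__(self, base_url: str, api_key: str, model: str):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key
        self.model = model

    def build_payload(self, system_prompt: str, user_content: str) -> dict:
        # Standard OpenAI-style chat-completion request body.
        return {
            "model": self.model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_content},
            ],
            "temperature": 0,
        }

    def chat(self, system_prompt: str, user_content: str) -> str:
        req = urllib.request.Request(
            f"{self.base_url}/chat/completions",
            data=json.dumps(self.build_payload(system_prompt, user_content)).encode("utf-8"),
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json",
            },
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        # OpenAI-style responses place the text under choices[0].message.content.
        return body["choices"][0]["message"]["content"]
```

Any provider exposing this interface (official OpenAI, or a self-hosted gateway passed via `base_url`) would work with such a client, which is what "adapts to all OpenAI-style interfaces" buys the PR.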
Comments suppressed due to low confidence (3)
weclone/data/clean/strategies_online.py:25
- The class name 'OlineLLMCleaningStrategy' appears to be misspelled. For consistency with the rest of the code, consider renaming it to 'OnlineLLMCleaningStrategy'.
class OlineLLMCleaningStrategy(CleaningStrategy):
weclone/data/qa_generator.py:13
- The import uses 'OlineLLMCleaningStrategy' which seems to contain a typographical error. Please update the import to use the corrected class name 'OnlineLLMCleaningStrategy' once the class name is updated.
from weclone.data.clean.strategies_online import OlineLLMCleaningStrategy
settings.template.jsonc:37
- [nitpick] The configuration key 'online_llm_clear' is inconsistent with the naming used in the prompt (ONLINE_LLM_CLEAN_PROMPT) and cleaning strategy. Consider renaming it to 'online_llm_clean' for consistency.
"online_llm_clear":false,
Could you add more specific documentation showing how to configure this?
@xming521 Could you also put the vitepress documentation code into this project, so that everyone can maintain and improve the docs together?
Documentation repository: https://github.com/xming521/WeClone-docs/tree/main/docs
@xming521 Could you consider keeping them in one repository? Many contributors don't seem to know where the separate docs repository is; when they write a PR they often need to update the corresponding documentation, and keeping everything together makes that much easier. Otherwise they sometimes have to open two PRs, one here and one in the docs repository. A single repository lowers the onboarding cost for contributors and is also good for maintaining the docs.
But it would pollute the main repository.
What exactly do you mean by "pollute"? 🤔
I just looked at the docs: there are currently no image assets, and the vitepress docs themselves are very small, so they would add almost no real burden to the main repository; compared with the local models, that size is negligible. I understand the concern that mixing docs into the code repository can make the directory structure look less clean and reduce the repo's focus, which is indeed a trade-off, especially for maintainers who value a tidy main repository. But judging from the practice of most open-source projects, keeping docs and code in one repository is a best practice that improves collaboration efficiency and has several direct benefits for contributors.
Let me know if there is anything I can help with.
Implements online LLM data cleaning: setting the online_llm_clear option in the configuration file enables cleaning data with an online large model, compatible with all OpenAI-style APIs.
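The description above can be sketched as a strategy that is only selected when the flag is on. Everything here except the `online_llm_clear` key is a hypothetical name (the PR's actual class is the misspelled `OlineLLMCleaningStrategy`; the corrected spelling suggested in review is used below), assuming a QA-pair judging workflow:

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional


class CleaningStrategy(ABC):
    """Base interface for data-cleaning strategies (illustrative)."""

    @abstractmethod
    def clean(self, qa_pairs: list) -> list: ...


class OnlineLLMCleaningStrategy(CleaningStrategy):
    """Judges each QA pair with an online OpenAI-style LLM (sketch)."""

    def __init__(self, judge: Callable[[str], str]):
        # judge: maps a prompt string to the model's reply text,
        # e.g. a bound method of an online inference client.
        self.judge = judge

    def clean(self, qa_pairs: list) -> list:
        kept = []
        for qa in qa_pairs:
            prompt = f"Keep or drop this QA pair?\nQ: {qa['q']}\nA: {qa['a']}"
            # Keep the pair only if the model's verdict starts with "keep".
            if self.judge(prompt).strip().lower().startswith("keep"):
                kept.append(qa)
        return kept


def make_strategy(config: dict, judge: Callable[[str], str]) -> Optional[CleaningStrategy]:
    # The PR gates the online strategy behind the online_llm_clear flag.
    if config.get("online_llm_clear", False):
        return OnlineLLMCleaningStrategy(judge)
    return None  # fall back to the default local cleaning path
```

Injecting the judge as a plain callable keeps the strategy testable without network access, which is one plausible way to structure the class.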