Feat text input cleaning #4850

Merged (4 commits) on Jun 24, 2025

HarikalarKutusu
Contributor

@HarikalarKutusu HarikalarKutusu commented Mar 27, 2025

This PR fixes several raw-text problems affecting newly submitted sentences and reports.

  • Introduces simple, central cleaning functions.
  • All free-form text input (sentences, sentence sources, and text from the "other" option in reports) is cleaned before it enters the database.
  • The sentence_id is calculated from the cleaned sentence.
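
As a minimal sketch of the idea in the list above, the following shows one way a central cleaning function and an ID derived from the cleaned text could look. The names `cleanText` and `sentenceId`, the whitespace rule, and the use of a SHA-256 hex digest are assumptions for illustration; the functions actually added by this PR may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper (not necessarily the PR's implementation):
// collapse every whitespace run (TAB, CR, LF, repeated spaces) into a
// single space and trim both ends.
export function cleanText(input: string): string {
  return input.replace(/\s+/g, " ").trim();
}

// Assumption: the ID is a SHA-256 hex digest. The point is only that
// the hash is taken over the cleaned text, never the raw input.
export function sentenceId(sentence: string): string {
  return createHash("sha256")
    .update(cleanText(sentence), "utf8")
    .digest("hex");
}
```

With this shape, `sentenceId("Hello  world")` and `sentenceId("Hello\tworld\r\n")` yield the same ID, because both inputs clean to `Hello world`.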

Notes/warnings:

  • It does not handle any existing data. The bundler still needs to clean the metadata .tsv files so that rows coming from past malformed data have the correct structure (columns).
  • This PR fixes #4614 ([BUG] Irregular spacing between words can result in duplicate sentences added via write page). However, if a sentence was already stored with a double space (or TAB, CR, LF, ...), a new, cleaned version of the same sentence will still be accepted.
  • In dev environments, where sentences are imported from files, the sentence_ids (or even the sentences themselves) might not match, because the sentences are now cleaned of invisible word-separating characters such as TAB or CR/LF.
  • Some existing regex cleaning has been replaced by these functions, but the piped cleaning (server/src/core/sentences/cleaning/multiple-sentences.ts) is untouched.
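
The caveat about existing data can be illustrated with a short sketch. It assumes, hypothetically, that a legacy row's ID was hashed from the uncleaned text and that duplicates are detected by comparing IDs; the helper names and the SHA-256 choice are illustrative and not taken from the PR.

```typescript
import { createHash } from "node:crypto";

// Illustrative helpers, not the project's actual code.
const sha256 = (s: string): string =>
  createHash("sha256").update(s, "utf8").digest("hex");
const clean = (s: string): string => s.replace(/\s+/g, " ").trim();

// Legacy row: stored earlier, ID derived from the raw text (assumed).
const legacyId = sha256("The  quick fox");

// New submission of the same words: cleaned before hashing.
const newId = sha256(clean("The  quick fox"));

// The IDs differ, so an ID-based duplicate check does not flag the new row.
export const collides = legacyId === newId;
```

Under these assumptions `collides` is `false`, which is why a cleaned sentence can coexist with an older, irregularly spaced copy until the existing data is cleaned as well.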

Limited testing in a dev environment, verified against a local (Dockerized) database:

  • Entering a single sentence with TAB characters in the sentence and source => OK
  • Entering a small batch of sentences (with malformed sentence and source) => OK
  • Reporting with "other", using multiple lines and TAB characters => OK

@moz-bozden moz-bozden merged commit 37b6e6b into common-voice:main Jun 24, 2025
2 checks passed
@moz-bozden moz-bozden deleted the feat-text-input-clean branch June 24, 2025 08:05
Successfully merging this pull request may close these issues.

[BUG] Irregular spacing between words can result in duplicate sentences added via write page
3 participants