Skip to content

DB migration for ug Variants, based on earlier work by @moz-bozden for kdy #4891

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

moz-kathyreid
Copy link
Contributor

Pull Request Form

Type of Pull Request

  • DB migration to ascribe all existing sentences of ug Uyghur to ug-Arab Variant to allow for new sentences to be added of ug-Cyrl Variant

  • Related to a listed issue

  • Other

Acknowledging contributors

  • many thanks @moz-bozden for providing an SQL query to work from used for very similar request for Circassian languages

@moz-kathyreid moz-kathyreid requested a review from a team as a code owner May 8, 2025 13:17
@moz-kathyreid moz-kathyreid requested review from data-sync-user and removed request for a team May 8, 2025 13:17
@moz-kathyreid moz-kathyreid added Variant A language variant request or issues related to language variants language:ug ئۇيغۇر تىلى, Уйғур тили, Uyghur tili, Uyƣur tili, Uyğur tili labels May 8, 2025
@moz-kathyreid moz-kathyreid linked an issue May 8, 2025 that may be closed by this pull request
@moz-kathyreid moz-kathyreid marked this pull request as draft May 8, 2025 13:18
@moz-kathyreid
Copy link
Contributor Author

Testing in progress, this is still in draft

  • Test that the migration works on docker compose up without error => yes
  • Test that the database is in the expected state after migration applied => pending, need to import ug sentences and confirm using
SELECT COUNT(*) AS cnt
FROM sentences s
JOIN sentence_metadata sm ON sm.sentence_id = s.id
WHERE sm.variant_id = (
    SELECT variant_id
    FROM variants v
    WHERE v.variant_token = 'ug-Arab'
)

@moz-kathyreid moz-kathyreid changed the title DB migration, based on earlier work by @moz-bulent for kdy wq DB migration, based on earlier work by @moz-bulent for kdy May 8, 2025
@moz-kathyreid moz-kathyreid changed the title DB migration, based on earlier work by @moz-bulent for kdy DB migration, based on earlier work by @moz-bozden for kdy May 8, 2025
Copy link

github-actions bot commented May 8, 2025

Size Change: +6 B (0%)

Total Size: 6.87 MB

ℹ️ View Unchanged
Filename Size Change
web/dist/1-1.****************.png 9.92 kB 0 B
web/dist/1-2.****************.png 7.71 kB 0 B
web/dist/1-3.****************.png 14.2 kB 0 B
web/dist/1-4.****************.png 24.5 kB 0 B
web/dist/1-5.****************.png 15.8 kB 0 B
web/dist/1-6.****************.png 12.4 kB 0 B
web/dist/1-red-copy.****************.svg 9.1 kB 0 B
web/dist/171.****************.css 2.39 kB 0 B
web/dist/171.****************.js 5.92 kB 0 B
web/dist/178.****************.js 3.42 kB 0 B
web/dist/178.****************.css 4.34 kB 0 B
web/dist/185.****************.css 2.96 kB 0 B
web/dist/185.****************.js 6.61 kB 0 B
web/dist/2-1.****************.png 7.14 kB 0 B
web/dist/2-2.****************.png 14.2 kB 0 B
web/dist/2-3.****************.png 10.4 kB 0 B
web/dist/2-4.****************.png 9.92 kB 0 B
web/dist/2-5.****************.png 8.49 kB 0 B
web/dist/227.****************.js 629 B 0 B
web/dist/227.****************.css 391 B 0 B
web/dist/275.****************.css 9.7 kB 0 B
web/dist/275.****************.js 15.8 kB 0 B
web/dist/290.****************.js 6.69 kB 0 B
web/dist/303.****************.css 1.34 kB 0 B
web/dist/303.****************.js 1.68 kB 0 B
web/dist/353.****************.js 2.04 kB 0 B
web/dist/353.****************.css 1.29 kB 0 B
web/dist/357.****************.css 4.98 kB 0 B
web/dist/357.****************.js 16.8 kB 0 B
web/dist/397.****************.js 912 B 0 B
web/dist/397.****************.css 1.27 kB 0 B
web/dist/41.****************.css 3.37 kB 0 B
web/dist/41.****************.js 8.65 kB 0 B
web/dist/425.****************.css 3.12 kB 0 B
web/dist/425.****************.js 5.61 kB 0 B
web/dist/457.****************.js 5.52 kB 0 B
web/dist/457.****************.css 3.28 kB 0 B
web/dist/483.****************.js 3.17 kB 0 B
web/dist/483.****************.css 817 B 0 B
web/dist/513.****************.js 8.43 kB 0 B
web/dist/513.****************.css 5.46 kB 0 B
web/dist/616.****************.css 765 B 0 B
web/dist/659.****************.js 8.69 kB 0 B
web/dist/659.****************.css 4.25 kB 0 B
web/dist/752.****************.js 6.46 kB 0 B
web/dist/752.****************.css 2.78 kB 0 B
web/dist/778.****************.js 495 B 0 B
web/dist/778.****************.css 736 B 0 B
web/dist/846.****************.css 2.23 kB 0 B
web/dist/846.****************.js 3.43 kB 0 B
web/dist/909.****************.js 5.41 kB 0 B
web/dist/alert.****************.svg 745 B 0 B
web/dist/award.****************.svg 334 B 0 B
web/dist/background-wave-error.****************.svg 402 B 0 B
web/dist/background-wave-success.****************.svg 550 B 0 B
web/dist/background-wave.****************.svg 284 B 0 B
web/dist/bbc.****************.svg 5.4 kB 0 B
web/dist/challenge-mtn-post.****************.svg 3.3 kB 0 B
web/dist/challenge-mtn-pre.****************.svg 3.36 kB 0 B
web/dist/checkmark-green.****************.svg 219 B 0 B
web/dist/checkmark.****************.svg 313 B 0 B
web/dist/chevron-left.****************.svg 292 B 0 B
web/dist/chevron-right.****************.svg 294 B 0 B
web/dist/chrome-color.****************.svg 1.96 kB 0 B
web/dist/close-black-light.****************.svg 307 B 0 B
web/dist/close-black.****************.svg 820 B 0 B
web/dist/close.****************.svg 145 B 0 B
web/dist/common-voice-mars-neutral.****************.png 126 kB 0 B
web/dist/contact.****************.svg 971 B 0 B
web/dist/contribute.****************.png 600 kB 0 B
web/dist/cv-logo-black.****************.svg 2.33 kB 0 B
web/dist/cv-logo-white.****************.svg 2.85 kB 0 B
web/dist/dashboard.****************.png 955 kB 0 B
web/dist/datasets-intro-background-triangle.****************.svg 446 B 0 B
web/dist/datasets-intro-background.****************.svg 944 B 0 B
web/dist/deepspeech.****************.png 10.7 kB 0 B
web/dist/deepspeech@2x.****************.png 31.6 kB 0 B
web/dist/deepspeech@3x.****************.png 63.2 kB 0 B
web/dist/discord.****************.svg 18.4 kB 0 B
web/dist/discourse.****************.png 16.4 kB 0 B
web/dist/discourse.****************.svg 473 B 0 B
web/dist/discourse@2x.****************.png 44.8 kB 0 B
web/dist/discourse@3x.****************.png 74 kB 0 B
web/dist/donate-banner-desktop-coral.****************.svg 600 B 0 B
web/dist/donate-banner-desktop-pink.****************.svg 600 B 0 B
web/dist/donate-bg1.****************.png 57.8 kB 0 B
web/dist/donate-bg2.****************.png 43.4 kB 0 B
web/dist/down-arrow.****************.svg 399 B 0 B
web/dist/dropdown-arrow.****************.svg 165 B 0 B
web/dist/email-bg-light.****************.png 12.7 kB 0 B
web/dist/email-bg-md-light.****************.svg 919 B 0 B
web/dist/email-bg-md.****************.svg 1.11 kB 0 B
web/dist/email-bg-partnerships.****************.png 287 kB 0 B
web/dist/email-bg.****************.png 4.27 kB 0 B
web/dist/exclamation.****************.svg 534 B 0 B
web/dist/fast-company.****************.svg 19.8 kB 0 B
web/dist/feedback.****************.png 26.4 kB 0 B
web/dist/feedback@2x.****************.png 82.8 kB 0 B
web/dist/feedback@3x.****************.png 269 kB 0 B
web/dist/ff-color.****************.svg 4.21 kB 0 B
web/dist/ff-logo.****************.svg 91.6 kB 0 B
web/dist/financial-times.****************.svg 42.3 kB 0 B
web/dist/github.****************.svg 851 B 0 B
web/dist/globe_black.****************.svg 930 B 0 B
web/dist/grid.****************.svg 387 B 0 B
web/dist/guidelines-waves-footer-small.****************.png 11.8 kB 0 B
web/dist/guidelines-waves-footer.****************.png 154 kB 0 B
web/dist/guidelines-waves.****************.png 225 kB 0 B
web/dist/happy-mars@2x.****************.png 23 kB 0 B
web/dist/hex-done.****************.svg 336 B 0 B
web/dist/hex.****************.svg 300 B 0 B
web/dist/home.****************.svg 506 B 0 B
web/dist/ibm.****************.svg 498 B 0 B
web/dist/idea-light-bulb.****************.svg 285 B 0 B
web/dist/keyboard.****************.svg 354 B 0 B
web/dist/languages/test.json 20 B 0 B
web/dist/lenovo.****************.svg 912 B 0 B
web/dist/librispeech.****************.png 161 B 0 B
web/dist/listen-bg.****************.svg 1.91 kB 0 B
web/dist/mail.****************.svg 998 B 0 B
web/dist/main-9a8b795a.****************.css 10.5 kB +6 B (+0.06%)
web/dist/main-9a8b795a.****************.js 41.8 kB 0 B
web/dist/main-ef7d455c.****************.js 9.61 kB 0 B
web/dist/mars-avatar.****************.svg 1.35 kB 0 B
web/dist/mars-email-success.****************.svg 7.66 kB 0 B
web/dist/mars-request.****************.svg 6.94 kB 0 B
web/dist/mars-sad.****************.svg 6.14 kB 0 B
web/dist/mars-solo.****************.svg 1.92 kB 0 B
web/dist/mars.****************.svg 2.63 kB 0 B
web/dist/mesh.****************.svg 492 B 0 B
web/dist/mic.****************.svg 435 B 0 B
web/dist/MIT-technology-review.****************.svg 14.4 kB 0 B
web/dist/mozilla-common-voice_foundation-academia.****************.png 10.2 kB 0 B
web/dist/mozilla-common-voice_foundation-community.****************.png 10.9 kB 0 B
web/dist/mozilla-common-voice_foundation-corporates.****************.png 11.9 kB 0 B
web/dist/mozilla-common-voice_foundation-foundations.****************.png 12.3 kB 0 B
web/dist/mozilla-common-voice_foundation-governments.****************.png 7.51 kB 0 B
web/dist/mozilla-common-voice_foundation-mars_small.****************.png 24 kB 0 B
web/dist/mozilla-common-voice_foundation-mars.****************.png 49.3 kB 0 B
web/dist/mozilla-common-voice_foundation-small-business.****************.png 10.6 kB 0 B
web/dist/mozilla-logo.****************.svg 880 B 0 B
web/dist/mozilla.****************.svg 468 B 0 B
web/dist/nvidia-logo.****************.svg 23.8 kB 0 B
web/dist/play.****************.svg 2.1 kB 0 B
web/dist/plus.****************.svg 333 B 0 B
web/dist/question-mark.****************.svg 755 B 0 B
web/dist/red-robot.****************.svg 1.1 MB 0 B
web/dist/releases/cv-corpus-1.json 2.91 kB 0 B
web/dist/releases/cv-corpus-10.0-2022-07-04.json 14.5 kB 0 B
web/dist/releases/cv-corpus-10.0-delta-2022-07-04.json 7.49 kB 0 B
web/dist/releases/cv-corpus-11.0-2022-09-21.json 15 kB 0 B
web/dist/releases/cv-corpus-2.json 4.32 kB 0 B
web/dist/releases/cv-corpus-3.json 4.42 kB 0 B
web/dist/releases/cv-corpus-4-2019-12-10.json 5.48 kB 0 B
web/dist/releases/cv-corpus-5-2020-06-22.json 9.77 kB 0 B
web/dist/releases/cv-corpus-5-singleword.json 2.53 kB 0 B
web/dist/releases/cv-corpus-5.1-2020-06-22.json 9.78 kB 0 B
web/dist/releases/cv-corpus-5.1-singleword.json 2.51 kB 0 B
web/dist/releases/cv-corpus-6.0-2020-12-11.json 10.5 kB 0 B
web/dist/releases/cv-corpus-6.0-singleword.json 3.42 kB 0 B
web/dist/releases/cv-corpus-6.1-2020-12-11.json 10.7 kB 0 B
web/dist/releases/cv-corpus-6.1-singleword.json 3.51 kB 0 B
web/dist/releases/cv-corpus-7.0-2021-07-21.json 13.1 kB 0 B
web/dist/releases/cv-corpus-7.0-singleword.json 3.63 kB 0 B
web/dist/releases/cv-corpus-8.0-2022-01-19.json 13.2 kB 0 B
web/dist/releases/cv-corpus-9.0-2022-04-27.json 14.4 kB 0 B
web/dist/review-waves.****************.png 20.4 kB 0 B
web/dist/robot.****************.png 52.1 kB 0 B
web/dist/runtime.****************.js 2.61 kB 0 B
web/dist/safari-color.****************.svg 8.6 kB 0 B
web/dist/sap.****************.svg 842 B 0 B
web/dist/search.****************.svg 432 B 0 B
web/dist/segment-dots.****************.svg 474 B 0 B
web/dist/sodedif.****************.png 1.96 kB 0 B
web/dist/sound-waves-1.****************.png 54.7 kB 0 B
web/dist/sound-waves-2.****************.png 20.9 kB 0 B
web/dist/speak-bg.****************.svg 1.92 kB 0 B
web/dist/star.****************.svg 562 B 0 B
web/dist/stars-disabled.****************.svg 2.06 kB 0 B
web/dist/stars.****************.svg 2.1 kB 0 B
web/dist/success.****************.svg 1.44 kB 0 B
web/dist/support.****************.svg 1.9 kB 0 B
web/dist/table.****************.png 217 kB 0 B
web/dist/tail.****************.svg 357 B 0 B
web/dist/tatoeba.****************.png 21 kB 0 B
web/dist/ted.****************.png 178 B 0 B
web/dist/upload.****************.svg 546 B 0 B
web/dist/vendors-0bc0478e.****************.js 14.8 kB 0 B
web/dist/vendors-27545368.****************.js 4.43 kB 0 B
web/dist/vendors-2a46b080.****************.js 17.9 kB 0 B
web/dist/vendors-2b4841d6.****************.js 16.8 kB 0 B
web/dist/vendors-5178c671.****************.js 3.27 kB 0 B
web/dist/vendors-6ff3e2ec.****************.js 15.9 kB 0 B
web/dist/vendors-866ab763.****************.js 12.8 kB 0 B
web/dist/vendors-94223cbb.****************.js 13 kB 0 B
web/dist/vendors-96700d3a.****************.js 15.8 kB 0 B
web/dist/vendors-a06af057.****************.js 8.48 kB 0 B
web/dist/vendors-acdd0895.****************.js 140 kB 0 B
web/dist/vendors-b403787e.****************.js 19.8 kB 0 B
web/dist/vendors-c092ac97.****************.js 5.1 kB 0 B
web/dist/vendors-cb98e9ec.****************.js 53.7 kB 0 B
web/dist/vendors-cdd60c62.****************.js 7.6 kB 0 B
web/dist/vendors-d17272b1.****************.js 5.32 kB 0 B
web/dist/vendors-d2eb5610.****************.js 13.9 kB 0 B
web/dist/vendors-e4915c4c.****************.js 16.1 kB 0 B
web/dist/vendors-e5bca7e4.****************.js 16.7 kB 0 B
web/dist/vendors-efdee510.****************.js 11.2 kB 0 B
web/dist/vendors-f501a089.****************.js 1.15 kB 0 B
web/dist/venture-beat.****************.svg 61.7 kB 0 B
web/dist/voxforge.****************.png 10.5 kB 0 B
web/dist/wave-1.****************.svg 446 B 0 B
web/dist/wave-2.****************.svg 809 B 0 B
web/dist/wave-3.****************.svg 468 B 0 B
web/dist/wave-blue.****************.svg 4.39 kB 0 B
web/dist/wave-eq.****************.svg 1.03 kB 0 B
web/dist/wave-fading.****************.svg 346 B 0 B
web/dist/wave-grey.****************.svg 1.83 kB 0 B
web/dist/wave-top.****************.png 5.3 kB 0 B
web/dist/wave.****************.png 196 kB 0 B
web/dist/wave.****************.svg 1.04 kB 0 B
web/dist/waves-md.****************.svg 1.65 kB 0 B
web/dist/waves-small.****************.png 35.1 kB 0 B
web/dist/waves.****************.svg 1.78 kB 0 B
web/dist/waves.****************.png 33.4 kB 0 B
web/dist/waves@2x.****************.png 114 kB 0 B
web/dist/waves@3x.****************.png 228 kB 0 B
web/dist/waving-mars-clipped.****************.svg 92.3 kB 0 B
web/dist/waving-mars.****************.svg 92.3 kB 0 B

compressed-size-action

@moz-kathyreid moz-kathyreid changed the title DB migration, based on earlier work by @moz-bozden for kdy DB migration for ug Variants, based on earlier work by @moz-bozden for kdy May 9, 2025
@moz-kathyreid
Copy link
Contributor Author

I updated the SQL query used here because ...

  • after loading ug sentences and running the migration, I ran this query to validate that the migration was doing what I expected it to do:
SELECT COUNT(*) AS cnt
FROM sentences s
JOIN sentence_metadata sm ON sm.sentence_id = s.id
WHERE sm.variant_id = (
    SELECT variant_id
    FROM variants v
    WHERE v.variant_token = 'ug-Arab'
)

This returned 0 rows, so I did further investigation.

  • The original query I used here, being:
      UPDATE sentence_metadata sm
      JOIN sentences s on s.id = sm.sentence_id
      SET sm.variant_id= (SELECT id FROM variants WHERE variant_token = 'ug-Arab')
      WHERE locale_id = (SELECT id FROM locales WHERE name='ug')
        AND variant_id IS NULL

makes the assumption that the sentence has previously had a corresponding row inserted into sentence_metadata with a variant_id. This is incorrect - as there is no corresponding row for the ug sentences.

  • This means that instead of an UPDATE statement for this migration, an INSERT statement was needed:
      INSERT INTO sentence_metadata (sentence_id, variant_id, created_at)
        SELECT 
          s.id,
          v.id,
          NOW()
        FROM 
          sentences s,
          variants v,
          locales l
        WHERE 
          l.name = 'ug'
          AND s.locale_id = l.id
          AND v.variant_token = 'ug-Arab'
  • This was tested successfully via the following mechanisms:
    • ensuring migration ran OK => OK
common-voice  | [BE] [INFO] Processed migration 02505082230990-set-empty-ug-variants-to-arab
common-voice  | [BE] [INFO] Done
  • running the original validation SQL above => 13,769 rows altered (as expected)

Now ready for review and merge.

@moz-kathyreid moz-kathyreid marked this pull request as ready for review May 9, 2025 04:39
…rt where the equivalent sentence_id does not exist already in sentence_metadata
@moz-kathyreid
Copy link
Contributor Author

Refactored based on discussions with @moz-dfeller

  • There already exist some sentences for ug in the sentence_metadata table in production
  • So we need a way to INSERT a row, with variant_id into sentence_metadata if the sentence_id does not already exist in the table, but to UPDATE it if it already does.
  • We initially discussed using an UPDATE ON DUPLICATE KEY approach, however, sentence_id column in the sentence_metadata table is a foreign key constraint, not a primary key constraint so the UPDATE ON DUPLICATE KEY is not triggered.

Instead, I approached this in two parts:

  1. Update the existing records in sentence_metadata with the new variant
  2. Insert new records into sentence_metadata where they did not already exist

To test this, I:

  1. Deleted all existing rows for ug variants in sentence_metadata
  2. Seeded the table with the following 10 rows which have null variant_id:
INSERT INTO sentence_metadata (sentence_id, created_at)
VALUES ('fffb7a7b400a2677092451fb8dd4852915b833478bff75e59b0028807805d283', NOW())

INSERT INTO sentence_metadata (sentence_id, created_at)
VALUES ('fff9f1d40b87002c1e17f5d0cbf8579b72f438f1d9b88cf03a2ea346ea8e0a2c', NOW())

INSERT INTO sentence_metadata (sentence_id, created_at)
VALUES ('fff09b3c774316891ae9179c8350ca0dc83e9ddf5c30fc2df7dbe357225fa565', NOW())

INSERT INTO sentence_metadata (sentence_id, created_at)
VALUES ('ffeed1f099fc9ee842859f6240bc53a307bc3bd31c6bbe9bf271ab088ce7ac52', NOW())

INSERT INTO sentence_metadata (sentence_id, created_at)
VALUES ('ffed42e558076d9e017b174fe82e5618cf31719afdf511c9ca6fbb6d60ec9a78', NOW())

INSERT INTO sentence_metadata (sentence_id, created_at)
VALUES ('ffe98cab65a55efc87b2b5e4cac528e74dded9ca3939eed86aa0eadae0ebd9a5', NOW())

INSERT INTO sentence_metadata (sentence_id, created_at)
VALUES ('ffe01ec16e9a9d6553a3d0685142fea357f57ce97b53dbe699b0204dba3545ce', NOW())

INSERT INTO sentence_metadata (sentence_id, created_at)
VALUES ('ffde1c90813b233ea8d1a3a0760634fa4f350f207802fbc12dd18d0b3159ec89', NOW())

INSERT INTO sentence_metadata (sentence_id, created_at)
VALUES ('ffd5b2fdf9933c4f6107c70629a93e0d714437a2ccf8d85b7415af6cc7fe4bd5', NOW())

INSERT INTO sentence_metadata (sentence_id, created_at)
VALUES ('ffd5954b414f497871f305847b2cdda9527fda728fce787c1662372c3ffd580b', NOW())
  1. Deleted existing migration from migrations table
  2. Ran docker compose up => check for successful migration run:
common-voice  | [BE] [INFO] Processed migration 02505082230990-set-empty-ug-variants-to-arab
common-voice  | [BE] [INFO] Done
  1. Check that all rows in DB for ug have expected variant_id in sentence_metadata:
SELECT COUNT(*) AS CNT 
FROM sentence_metadata sm 
JOIN variants v on v.variant_token = 'ug-Arab'
WHERE v.id = sm.variant_id 

Expected value = 13,769 (13,759 new, 10 to update) = 13,769 in database

Validation checks passed, sending for review and merge.

Copy link
Contributor

@moz-dfeller moz-dfeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this awesome update! I bet you trained you SQL muscle quite a bit ;) There are just two things:

  1. I believe the timestamp in your file name has an extraneous 0 at the end, which makes the timestamp inconsistent with the rest of the files
  2. The created_at field has a default value of NOW() so there is no need to specify it explicitly in this case, but I don't mind either way, so that's just more of a comment :)

Great job 🚀

@moz-kathyreid moz-kathyreid self-assigned this May 15, 2025
@moz-kathyreid
Copy link
Contributor Author

Thanks @moz-dfeller - yeah this was a good exercise in MySQL - I haven't worked heavily with it for 10 years or so, excellent refresher. I really enjoy SQL so this was a fun challenge.

Excellent pickup on the filename - very precise, I like it.

This is now renamed, sending back for review / merge.

@moz-kathyreid moz-kathyreid merged commit fff5976 into main May 21, 2025
2 checks passed
@moz-kathyreid moz-kathyreid deleted the 4144-add-db-migration-to-ascribe-sentences-to-ug-Arab branch May 21, 2025 07:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
language:ug ئۇيغۇر تىلى, Уйғур тили, Uyghur tili, Uyƣur tili, Uyğur tili Variant A language variant request or issues related to language variants
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Request support for Uyghur Cryllic Script
2 participants