stabilize translations by reverting old ids by text content #33270

l0rinc · 2025-08-29T09:23:56Z

Summary

Previously, recalculating the translation string templates caused a misalignment: after any changed string, subsequent entries were re-numbered and old values received new “stable” IDs.

You can reproduce by checking out #33224 and running:

cmake --preset dev-mode -DWITH_USDT=OFF -DENABLE_IPC=OFF && cmake --build build_dev_mode --target translate

This will change many translation IDs:

git diff | egrep '^[+-].+trans-unit id=' 
-      <trans-unit id="_msg1040">
-      <trans-unit id="_msg1041">
+      <trans-unit id="_msg1040">
...
-      <trans-unit id="_msg1160">
+      <trans-unit id="_msg1159">
+      <trans-unit id="_msg1160">

Details

As mentioned in #33224 (comment), changing a translatable string’s content moves it within the sorted messages and gives it a new sequential ID. Messages between its old and new positions were then misidentified, making translation work on Transifex and similar platforms harder.

Fix

The translate CMake target now runs a small Python script that restores IDs from the previous translation file. It maps each message by (group resname + source text) to old and new IDs, and performs a single-pass search-and-replace on the newly generated XLF to restore the original IDs. Unchanged strings keep their IDs; genuinely new strings get the next sequential _msg following the very last id.

Reproducer

This PR has two commits; the first resets the baseline to make testing simpler.

Deleting a translation

diff --git a/src/init.cpp b/src/init.cpp
--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
+++ b/src/init.cpp	(date 1756457795674)
@@ -904,7 +904,6 @@
     }

     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
-        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
     }

     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.

correctly deletes only that one

 git diff | egrep '^[+-].+trans-unit id='
-      <trans-unit id="_msg1039">

Adding a new one:

diff --git a/src/init.cpp b/src/init.cpp
--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
+++ b/src/init.cpp	(date 1756457688152)
@@ -905,6 +905,7 @@

     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
         InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
+        InitWarning(_("Options2 '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
     }

     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.

keep all other ones intact

git diff | egrep '^[+-].+trans-unit id='
+      <trans-unit id="_msg1239">

Modifying a single one

diff --git a/src/init.cpp b/src/init.cpp
--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
+++ b/src/init.cpp	(date 1756457951822)
@@ -904,7 +904,7 @@
     }

     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
-        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
+        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated and are expected to be removed in a future version."));
     }

     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.

is marked as a clean delete/add

 git diff | egrep '^[+-].+trans-unit id='
-      <trans-unit id="_msg1039">
+      <trans-unit id="_msg1239">

And as a bonus, only changing the case of a translation

diff --git a/src/init.cpp b/src/init.cpp
--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
+++ b/src/init.cpp	(date 1756457080223)
@@ -904,7 +904,7 @@
     }

     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
-        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
+        InitWarning(_("OPTIONS '-DATACARRIER' OR '-DATACARRIERSIZE' ARE SET BUT ARE MARKED AS DEPRECATED. THEY WILL BE REMOVED IN A FUTURE VERSION."));
     }

     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.

results in keeping the message id

git diff | egrep '^[+-].+trans-unit id='
-      <trans-unit id="_msg1039">
+      <trans-unit id="_msg1039">

DrahtBot · 2025-08-29T09:24:01Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33270.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Concept ACK	hebasto

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

DrahtBot · 2025-08-29T09:27:57Z

🚧 At least one of the CI tasks failed.
_{Task lint: https://github.com/bitcoin/bitcoin/runs/49171015206}
_{LLM reason (✨ experimental): Lint failure: Python style error E401 (multiple imports on one line) in contrib/devtools/stabilize_xlf_ids.py.}

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

maflcko · 2025-08-29T10:03:26Z

Not sure if this needs a script. Post-freeze changes should be rare enough to just manually take over the affected string without going through any script.

Previously, recalculating the translation string templates caused a misalignment: after any changed string, subsequent entries were re-numbered and old values received new “stable” IDs. Changing a translatable string’s content moves it within the sorted messages and gives it a new sequential ID. Messages between its old and new positions were then misidentified, making translation work on Transifex and similar platforms harder. The translate CMake target now runs a small Python script that restores IDs from the previous translation file. It maps each message by (group resname + source text) to old and new IDs, and performs a single-pass search-and-replace on the newly generated XLF to restore the original IDs. Unchanged strings keep their IDs; genuinely new strings get the next sequential _msg following the very last id.

l0rinc · 2025-08-29T22:14:46Z

@maflcko, this is for every translation update, after this change previously translated values are kept and recognized by Transifex.

@achow101 created a test Bitcoin Core translation sandbox for me where I've uploaded the latest bitcoin_en.xlf.
Between Core versions most of the translations need to be reassigned (in the example of 29 vs 30 only 10% of the original IDs were kept so the translations all showed that most of the strings don't have pairs)

I have generated a dummy French translation and simulated a review on it:

Before this PR, changing a single line invalidated 10% of the other translations as well because of the sorting bug described above:

Uploaded the same file that's generated with this PR for #33224 agains a 100% approved language:

And since the translation IDs are resurrected, it correctly shows that a single entry needs to be retranslated:

maflcko · 2025-08-30T13:42:50Z

@maflcko, this is for every translation update, after this change previously translated values are kept and recognized by Transifex.

Ah, I see. I wonder if the shasum of the full content can be used as an id for the translation string. Conceptually it seems simpler to get a stable id, than to try to artificially number and re-number the strings, depending on the history. (Obviously this wouldn't help with #33224, but this instance should be trivial to handle manually as a one-off.)

hebasto · 2025-09-01T09:42:06Z

Since we are not using ID-based translations, it seems reasonable to consider removing the id attributes from trans-unit elements in the XML file.

l0rinc · 2025-09-01T18:30:26Z

we are not using ID-based translations

@hebasto, are suggesting that the bitcoin-test clone where we tried it isn't representative? It did seem to me that we managed to reproduce and fix the instability by stabilizing the ids. But if we're not using the ids, why aren't the translations stable? What alternative do you suggest to stabilize them?

shasum of the full content can be used as an id for the translation string

@maflcko we could of course do that, but it would likely invalidate every single text in the next release. We could of course do it step-by-step and only add hashes for the new entries (instead of max-id + 1, as I did in the script here). Note that hashes would prohibit same-text-with-different-translations, e.g. Clear could mean "delete" or "agree", based on context.

hebasto · 2025-09-01T19:44:02Z

we are not using ID-based translations

@hebasto, are suggesting that the bitcoin-test clone where we tried it isn't representative? It did seem to me that we managed to reproduce and fix the instability by stabilizing the ids. But if we're not using the ids, why aren't the translations stable?

We do not use them, but Transifex does.

What alternative do you suggest to stabilize them?

As I wrote in #33270 (comment):

... it seems reasonable to consider removing the id attributes from trans-unit elements in the XML file.

l0rinc · 2025-09-01T19:49:09Z

Do I understand you correctly that if we remove the IDs, Transifex will identify messages by content instead? Any reason we didn't do that before?

hebasto · 2025-09-01T19:55:54Z

Do I understand you correctly that if we remove the IDs, Transifex will identify messages by content instead?

That’s my guess, though I haven’t tested it.

Any reason we didn't do that before?

I’m not aware of any specific reason.

l0rinc · 2025-09-01T20:07:40Z

My understanding is that Translation Memory Fillups are Growth plan feature only.

I have tried uploading the English translation without any ids (called keys in Transifex)

adding back a single id results in only that value being imported:

achow101 · 2025-09-01T22:58:44Z

The id attribute is required by the XLIFF 1.2 spec: https://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#trans-unit

hebasto

stabilize translations...

Concept ACK on this idea.

Could we avoid reintroducing Python scripts into the translate build target?

l0rinc · 2025-09-02T17:42:01Z

It wouldn't be difficult to make this into a script that generates an id mapping that's we'd run manually, if we're worried about python, something like:

sed -E -i \
  -e 's|id="_msg1160"|id="_msg1239"|g' \
  -e 's|id="_msg1159"|id="_msg1160"|g' \
  -e 's|id="_msg120\[0\]"|id="_msg1240\[0\]"|g' \
  -e 's|id="_msg120\[1\]"|id="_msg1240\[1\]"|g' \
  'src/qt/locale/bitcoin_en.xlf'

Would that be more useful in your opinion? What are the worries exactly, can you please point me to the discussion that you were referring to with "reintroducing Python scripts" (or is this just a reference to the cmake migration)? We could of course migrate this to cmake as well, but do we really want to do that :)? I don't...

hebasto · 2025-09-05T14:10:14Z

... can you please point me to the discussion that you were referring to with "reintroducing Python scripts"...

The Python dependency for the translate build target was recently removed in #33209 by @purpleKarrot.

purpleKarrot · 2025-09-06T06:12:41Z

do we really want to do that :)? I don't...

I would do It, but I am concerned about the bus factor, so I would prefer to train someone else.
05255d5 is an example for replacing a sed replacement with cmake code.

translations: recreate baseline to simplify testing

14cb1f8

l0rinc changed the title ~~translations: recreate baseline to simplify testing~~ stabilize translations by reverting old ids by text content Aug 29, 2025

l0rinc force-pushed the l0rinc/stabilize-translations branch from a2568c0 to 70dd57e Compare August 29, 2025 09:27

DrahtBot added the CI failed label Aug 29, 2025

l0rinc mentioned this pull request Aug 29, 2025

doc: unify datacarriersize warning with release notes #33224

Merged

l0rinc force-pushed the l0rinc/stabilize-translations branch from 70dd57e to 291749d Compare August 29, 2025 19:18

l0rinc marked this pull request as ready for review August 29, 2025 20:16

DrahtBot removed the CI failed label Aug 29, 2025

hebasto reviewed Sep 2, 2025

View reviewed changes

stabilize translations by reverting old ids by text content #33270

Are you sure you want to change the base?

stabilize translations by reverting old ids by text content #33270

Conversation

l0rinc commented Aug 29, 2025

Summary

Details

Fix

Reproducer

Deleting a translation

Adding a new one:

Modifying a single one

And as a bonus, only changing the case of a translation

Uh oh!

DrahtBot commented Aug 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Coverage & Benchmarks

Reviews

Uh oh!

DrahtBot commented Aug 29, 2025

Uh oh!

maflcko commented Aug 29, 2025

Uh oh!

l0rinc commented Aug 29, 2025

Uh oh!

maflcko commented Aug 30, 2025

Uh oh!

hebasto commented Sep 1, 2025

Uh oh!

l0rinc commented Sep 1, 2025

Uh oh!

hebasto commented Sep 1, 2025

Uh oh!

l0rinc commented Sep 1, 2025

Uh oh!

hebasto commented Sep 1, 2025

Uh oh!

l0rinc commented Sep 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

achow101 commented Sep 1, 2025

Uh oh!

hebasto left a comment

Choose a reason for hiding this comment

Uh oh!

l0rinc commented Sep 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hebasto commented Sep 5, 2025

Uh oh!

purpleKarrot commented Sep 6, 2025

Uh oh!

Uh oh!

DrahtBot commented Aug 29, 2025 •

edited

Loading

l0rinc commented Sep 1, 2025 •

edited

Loading

l0rinc commented Sep 2, 2025 •

edited

Loading