Skip to content

Conversation

l0rinc
Copy link
Contributor

@l0rinc l0rinc commented Aug 29, 2025

Summary

Previously, recalculating the translation string templates caused a misalignment: after any changed string, subsequent entries were re-numbered and old values received new “stable” IDs.

You can reproduce by checking out #33224 and running:

cmake --preset dev-mode -DWITH_USDT=OFF -DENABLE_IPC=OFF && cmake --build build_dev_mode --target translate

This will change many translation IDs:

git diff | egrep '^[+-].+trans-unit id=' 
-      <trans-unit id="_msg1040">
-      <trans-unit id="_msg1041">
+      <trans-unit id="_msg1040">
...
-      <trans-unit id="_msg1160">
+      <trans-unit id="_msg1159">
+      <trans-unit id="_msg1160">

Details

As mentioned in #33224 (comment), changing a translatable string’s content moves it within the sorted messages and gives it a new sequential ID. Messages between its old and new positions were then misidentified, making translation work on Transifex and similar platforms harder.

Fix

The translate CMake target now runs a small Python script that restores IDs from the previous translation file. It maps each message by (group resname + source text) to old and new IDs, and performs a single-pass search-and-replace on the newly generated XLF to restore the original IDs. Unchanged strings keep their IDs; genuinely new strings get the next sequential _msg following the very last id.

Reproducer

This PR has two commits; the first resets the baseline to make testing simpler.

Deleting a translation

diff --git a/src/init.cpp b/src/init.cpp
--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
+++ b/src/init.cpp	(date 1756457795674)
@@ -904,7 +904,6 @@
     }

     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
-        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
     }

     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.

correctly deletes only that one

 git diff | egrep '^[+-].+trans-unit id='
-      <trans-unit id="_msg1039">

Adding a new one:

diff --git a/src/init.cpp b/src/init.cpp
--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
+++ b/src/init.cpp	(date 1756457688152)
@@ -905,6 +905,7 @@

     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
         InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
+        InitWarning(_("Options2 '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
     }

     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.

keep all other ones intact

git diff | egrep '^[+-].+trans-unit id='
+      <trans-unit id="_msg1239">

Modifying a single one

diff --git a/src/init.cpp b/src/init.cpp
--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
+++ b/src/init.cpp	(date 1756457951822)
@@ -904,7 +904,7 @@
     }

     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
-        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
+        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated and are expected to be removed in a future version."));
     }

     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.

is marked as a clean delete/add

 git diff | egrep '^[+-].+trans-unit id='
-      <trans-unit id="_msg1039">
+      <trans-unit id="_msg1239">

And as a bonus, only changing the case of a translation

diff --git a/src/init.cpp b/src/init.cpp
--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
+++ b/src/init.cpp	(date 1756457080223)
@@ -904,7 +904,7 @@
     }

     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
-        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
+        InitWarning(_("OPTIONS '-DATACARRIER' OR '-DATACARRIERSIZE' ARE SET BUT ARE MARKED AS DEPRECATED. THEY WILL BE REMOVED IN A FUTURE VERSION."));
     }

     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.

results in keeping the message id

git diff | egrep '^[+-].+trans-unit id='
-      <trans-unit id="_msg1039">
+      <trans-unit id="_msg1039">

@DrahtBot
Copy link
Contributor

DrahtBot commented Aug 29, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33270.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK hebasto

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

@l0rinc l0rinc changed the title translations: recreate baseline to simplify testing stabilize translations by reverting old ids by text content Aug 29, 2025
@l0rinc l0rinc force-pushed the l0rinc/stabilize-translations branch from a2568c0 to 70dd57e Compare August 29, 2025 09:27
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Task lint: https://github.com/bitcoin/bitcoin/runs/49171015206
LLM reason (✨ experimental): Lint failure: Python style error E401 (multiple imports on one line) in contrib/devtools/stabilize_xlf_ids.py.

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@maflcko
Copy link
Member

maflcko commented Aug 29, 2025

Not sure if this needs a script. Post-freeze changes should be rare enough to just manually take over the affected string without going through any script.

Previously, recalculating the translation string templates caused a misalignment: after any changed string, subsequent entries were re-numbered and old values received new “stable” IDs.
Changing a translatable string’s content moves it within the sorted messages and gives it a new sequential ID. Messages between its old and new positions were then misidentified, making translation work on Transifex and similar platforms harder.
The translate CMake target now runs a small Python script that restores IDs from the previous translation file. It maps each message by (group resname + source text) to old and new IDs, and performs a single-pass search-and-replace on the newly generated XLF to restore the original IDs. Unchanged strings keep their IDs; genuinely new strings get the next sequential _msg following the very last id.
@l0rinc l0rinc force-pushed the l0rinc/stabilize-translations branch from 70dd57e to 291749d Compare August 29, 2025 19:18
@l0rinc l0rinc marked this pull request as ready for review August 29, 2025 20:16
@l0rinc
Copy link
Contributor Author

l0rinc commented Aug 29, 2025

@maflcko, this is for every translation update, after this change previously translated values are kept and recognized by Transifex.

@achow101 created a test Bitcoin Core translation sandbox for me where I've uploaded the latest bitcoin_en.xlf.
Between Core versions most of the translations need to be reassigned (in the example of 29 vs 30 only 10% of the original IDs were kept so the translations all showed that most of the strings don't have pairs)

I have generated a dummy French translation and simulated a review on it:

Before this PR, changing a single line invalidated 10% of the other translations as well because of the sorting bug described above:

Uploaded the same file that's generated with this PR for #33224 agains a 100% approved language:

And since the translation IDs are resurrected, it correctly shows that a single entry needs to be retranslated:

@maflcko
Copy link
Member

maflcko commented Aug 30, 2025

@maflcko, this is for every translation update, after this change previously translated values are kept and recognized by Transifex.

Ah, I see. I wonder if the shasum of the full content can be used as an id for the translation string. Conceptually it seems simpler to get a stable id, than to try to artificially number and re-number the strings, depending on the history. (Obviously this wouldn't help with #33224, but this instance should be trivial to handle manually as a one-off.)

@hebasto
Copy link
Member

hebasto commented Sep 1, 2025

Since we are not using ID-based translations, it seems reasonable to consider removing the id attributes from trans-unit elements in the XML file.

@l0rinc
Copy link
Contributor Author

l0rinc commented Sep 1, 2025

we are not using ID-based translations

@hebasto, are suggesting that the bitcoin-test clone where we tried it isn't representative? It did seem to me that we managed to reproduce and fix the instability by stabilizing the ids. But if we're not using the ids, why aren't the translations stable? What alternative do you suggest to stabilize them?

shasum of the full content can be used as an id for the translation string

@maflcko we could of course do that, but it would likely invalidate every single text in the next release. We could of course do it step-by-step and only add hashes for the new entries (instead of max-id + 1, as I did in the script here). Note that hashes would prohibit same-text-with-different-translations, e.g. Clear could mean "delete" or "agree", based on context.

@hebasto
Copy link
Member

hebasto commented Sep 1, 2025

we are not using ID-based translations

@hebasto, are suggesting that the bitcoin-test clone where we tried it isn't representative? It did seem to me that we managed to reproduce and fix the instability by stabilizing the ids. But if we're not using the ids, why aren't the translations stable?

We do not use them, but Transifex does.

What alternative do you suggest to stabilize them?

As I wrote in #33270 (comment):

... it seems reasonable to consider removing the id attributes from trans-unit elements in the XML file.

@l0rinc
Copy link
Contributor Author

l0rinc commented Sep 1, 2025

Do I understand you correctly that if we remove the IDs, Transifex will identify messages by content instead? Any reason we didn't do that before?

@hebasto
Copy link
Member

hebasto commented Sep 1, 2025

Do I understand you correctly that if we remove the IDs, Transifex will identify messages by content instead?

That’s my guess, though I haven’t tested it.

Any reason we didn't do that before?

I’m not aware of any specific reason.

@l0rinc
Copy link
Contributor Author

l0rinc commented Sep 1, 2025

My understanding is that Translation Memory Fillups are Growth plan feature only.

I have tried uploading the English translation without any ids (called keys in Transifex)

adding back a single id results in only that value being imported:

@achow101
Copy link
Member

achow101 commented Sep 1, 2025

The id attribute is required by the XLIFF 1.2 spec: https://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#trans-unit

Copy link
Member

@hebasto hebasto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stabilize translations...

Concept ACK on this idea.

Could we avoid reintroducing Python scripts into the translate build target?

@l0rinc
Copy link
Contributor Author

l0rinc commented Sep 2, 2025

It wouldn't be difficult to make this into a script that generates an id mapping that's we'd run manually, if we're worried about python, something like:

sed -E -i \
  -e 's|id="_msg1160"|id="_msg1239"|g' \
  -e 's|id="_msg1159"|id="_msg1160"|g' \
  -e 's|id="_msg120\[0\]"|id="_msg1240\[0\]"|g' \
  -e 's|id="_msg120\[1\]"|id="_msg1240\[1\]"|g' \
  'src/qt/locale/bitcoin_en.xlf'

Would that be more useful in your opinion? What are the worries exactly, can you please point me to the discussion that you were referring to with "reintroducing Python scripts" (or is this just a reference to the cmake migration)? We could of course migrate this to cmake as well, but do we really want to do that :)? I don't...

@hebasto
Copy link
Member

hebasto commented Sep 5, 2025

... can you please point me to the discussion that you were referring to with "reintroducing Python scripts"...

The Python dependency for the translate build target was recently removed in #33209 by @purpleKarrot.

@purpleKarrot
Copy link
Contributor

do we really want to do that :)? I don't...

I would do It, but I am concerned about the bus factor, so I would prefer to train someone else.
05255d5 is an example for replacing a sed replacement with cmake code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants