-
Notifications
You must be signed in to change notification settings - Fork 37.8k
stabilize translations by reverting old ids by text content #33270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33270. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. |
a2568c0
to
70dd57e
Compare
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
Not sure if this needs a script. Post-freeze changes should be rare enough to just manually take over the affected string without going through any script. |
Previously, recalculating the translation string templates caused a misalignment: after any changed string, subsequent entries were re-numbered and old values received new “stable” IDs. Changing a translatable string’s content moves it within the sorted messages and gives it a new sequential ID. Messages between its old and new positions were then misidentified, making translation work on Transifex and similar platforms harder. The translate CMake target now runs a small Python script that restores IDs from the previous translation file. It maps each message by (group resname + source text) to old and new IDs, and performs a single-pass search-and-replace on the newly generated XLF to restore the original IDs. Unchanged strings keep their IDs; genuinely new strings get the next sequential _msg following the very last id.
70dd57e
to
291749d
Compare
@maflcko, this is for every translation update, after this change previously translated values are kept and recognized by Transifex. @achow101 created a test Bitcoin Core translation sandbox for me where I've uploaded the latest I have generated a dummy French translation and simulated a review on it: Before this PR, changing a single line invalidated 10% of the other translations as well because of the sorting bug described above: Uploaded the same file that's generated with this PR for #33224 agains a 100% approved language: And since the translation IDs are resurrected, it correctly shows that a single entry needs to be retranslated: |
Ah, I see. I wonder if the shasum of the full content can be used as an id for the translation string. Conceptually it seems simpler to get a stable id, than to try to artificially number and re-number the strings, depending on the history. (Obviously this wouldn't help with #33224, but this instance should be trivial to handle manually as a one-off.) |
Since we are not using ID-based translations, it seems reasonable to consider removing the |
@hebasto, are suggesting that the bitcoin-test clone where we tried it isn't representative? It did seem to me that we managed to reproduce and fix the instability by stabilizing the ids. But if we're not using the ids, why aren't the translations stable? What alternative do you suggest to stabilize them?
@maflcko we could of course do that, but it would likely invalidate every single text in the next release. We could of course do it step-by-step and only add hashes for the new entries (instead of max-id + 1, as I did in the script here). Note that hashes would prohibit same-text-with-different-translations, e.g. |
We do not use them, but Transifex does.
As I wrote in #33270 (comment):
|
Do I understand you correctly that if we remove the IDs, Transifex will identify messages by content instead? Any reason we didn't do that before? |
That’s my guess, though I haven’t tested it.
I’m not aware of any specific reason. |
The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stabilize translations...
Concept ACK on this idea.
Could we avoid reintroducing Python scripts into the translate
build target?
It wouldn't be difficult to make this into a script that generates an id mapping that's we'd run manually, if we're worried about python, something like: sed -E -i \
-e 's|id="_msg1160"|id="_msg1239"|g' \
-e 's|id="_msg1159"|id="_msg1160"|g' \
-e 's|id="_msg120\[0\]"|id="_msg1240\[0\]"|g' \
-e 's|id="_msg120\[1\]"|id="_msg1240\[1\]"|g' \
'src/qt/locale/bitcoin_en.xlf' Would that be more useful in your opinion? What are the worries exactly, can you please point me to the discussion that you were referring to with "reintroducing Python scripts" (or is this just a reference to the cmake migration)? We could of course migrate this to cmake as well, but do we really want to do that :)? I don't... |
The Python dependency for the |
I would do It, but I am concerned about the bus factor, so I would prefer to train someone else. |
Summary
Previously, recalculating the translation string templates caused a misalignment: after any changed string, subsequent entries were re-numbered and old values received new “stable” IDs.
You can reproduce by checking out #33224 and running:
cmake --preset dev-mode -DWITH_USDT=OFF -DENABLE_IPC=OFF && cmake --build build_dev_mode --target translate
This will change many translation IDs:
Details
As mentioned in #33224 (comment), changing a translatable string’s content moves it within the sorted messages and gives it a new sequential ID. Messages between its old and new positions were then misidentified, making translation work on Transifex and similar platforms harder.
Fix
The translate CMake target now runs a small Python script that restores IDs from the previous translation file. It maps each message by (group resname + source text) to old and new IDs, and performs a single-pass search-and-replace on the newly generated XLF to restore the original IDs. Unchanged strings keep their IDs; genuinely new strings get the next sequential _msg following the very last id.
Reproducer
This PR has two commits; the first resets the baseline to make testing simpler.
Deleting a translation
correctly deletes only that one
Adding a new one:
keep all other ones intact
Modifying a single one
is marked as a clean delete/add
And as a bonus, only changing the case of a translation
results in keeping the message id