BIP39 Add German Wordlist #1071
Thanks @DavidMStraub for starting the first attempt and @cr for the second attempt at a BIP-0039 German wordlist. I hope you will join this PR, whose main difference is the enforcement of a Levenshtein distance (addition, substitution & permutation not lower than 2). In addition to the basic requirements, some further considerations:
|
Great work @SebastianFloKa, thanks for opening this PR with an alternative list. I know that a lot of work has gone into it already from #721, and again thanks to everyone who participated in the previous two attempts for a German wordlist. Hope some of you native German speakers could go through the list here too and leave some comments!
Reviewed the wordlist until line 1000 so far, going through the rest later.
ACK. This list does not include any homophones. LGTM |
Looked through the rest of the list, very good work IMO 👍 Just some minor notes on some words.
Very well-prepared word selection @SebastianFloKa, LGTM 👍 Went through the entire word list word by word, left minor comments on a few words. If you have the chance and especially if you're a native German speaker, please jump in for a review too. |
Checking with https://www.korrekturen.de/rechtschreibpruefung.shtml, the following words (besides the already mentioned ones: "Gumpe", "Tidehub", "Trebe" & "Zuseher") are flagged even though they are all properly listed in https://www.duden.de/. Among other reasons, this seems to be partly related to words that are more common in Austria or Switzerland. I personally think it's good to have some words from different parts of the German-speaking region as long as they are understood everywhere; open to discussion. Allrad @thomasklemm in particular, and maybe @neox5 wants to have a look as well: Shall we replace all of the above words, or would you say we can / should keep some of them? |
If we replace all the words highlighted by @thomasklemm except for "Fresko" & "Tidenhub", as well as all 10 words marked by the spellchecker (Allrad, .... , Zuhause), there are 31 words to be replaced in the next loop. BART Because the words depend on each other, it might be necessary to have some backup words: AKKU So if you prefer to replace some other words from the initial list with the above backup words, that is fine as well. |
Suggestions: Amsel |
If replacements are needed, I'd like to suggest |
Guys thank you for your service, but I can't hide that I'm mostly following this conversation because it reliably makes me giggle. PS: After thinking it through, I would probably not include any of the Breze* words. There are too many regional variations, which will lead at least to confusion, but maybe even to emotion and anger ("why did they dare to include this inferior spelling in my seed phrase?"). |
Thanks @TZocker @nisc @DivineDominion for joining and for your input. A new proposal that implements it, together with the initial comments of @thomasklemm, will follow soon. |
|
@SebastianFloKa "Luv" and maybe "Tidehub", as pointed out by others in comments above |
@TZocker I checked your proposals against the criteria:
Amsel --> NOK - Levenshtein substitution collision with AMPEL
@DivineDominion @nisc all "Breze*" words will be removed |
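The kind of check behind verdicts such as "Amsel --> NOK" can be sketched as follows. This is not the tool used in this PR (which is not shown in the thread), only a minimal illustration assuming the plain Levenshtein distance, where each insertion, deletion or substitution costs 1. A swap of two adjacent letters ("permutation") costs 2 under this definition, so a stricter check would need a Damerau-style variant. The function names and the sample words are chosen for illustration; the word pairs are ones mentioned in this conversation.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


def too_close(words, min_distance=2):
    """Return all pairs of words whose distance is below min_distance."""
    normalized = sorted({w.upper() for w in words})
    return [
        (a, b)
        for a, b in combinations(normalized, 2)
        if levenshtein(a, b) < min_distance
    ]


sample = ["Amsel", "Ampel", "Daumen", "Gaumen", "Zunahme", "Zuname",
          "Gold", "Geld", "Blatt"]
for pair in too_close(sample):
    print(pair)
```

Comparing every pair is quadratic in the list size, which is still only about two million comparisons for 2048 words.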
Co-authored-by: Thomas Klemm <github@tklemm.eu>
Improvement loop mainly based on feedback of @thomasklemm but also @TZocker & @DivineDominion & @nisc
Suggestions: If you could share your tools for checking, I would run the checks myself, so you don't have to do all the work on your own 😉 |
@SebastianFloKa Thanks for incorporating all the feedback into the word list. IMO it's really good work, has had many iterations already and can get merged. @SebastianFloKa You should see a "Resolve conversation" button next to each individual conversation and can close the ones that are now resolved (only the PR author and repo maintainers seem to see it according to https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/commenting-on-a-pull-request#resolving-conversations, so I can't mark my own comments as resolved). To all native German speakers reading this PR: It would be really good if you could take the time to go through the list and leave your comments or a 👍 on the PR. |
@neox5 thanks
Daumen - NOK - Levenshtein-substitution collision with Gaumen Regarding the tool:
|
"Zunahme" too close to "Zuname" and other improvements
I had already finished a draft of it independently in December and unfortunately only now managed to publish it. However, it also contains words that appear in other word lists. Maybe the comparison helps anyway: https://github.com/dys2p/wordlists-de/blob/main/de_2048.md |
I just had a quick look at the list, you could shorten some words even more e.g. |
@b068931cc450442b63f5b3d276ea4297 Hi, thanks for participating. Unfortunately you are right: your list contains many collisions with other BIP-0039 wordlists (278 collisions in total). Besides this, it contains
GOLDADER --> GOLD --> NOK - Levenshtein substitution error with GELD Question: Are you able to filter your list for singular nouns only and share it? Then a countercheck might make sense. |
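Counting collisions with other BIP-0039 wordlists, as in the figure of 278 quoted above, amounts to a case-insensitive set lookup. The sketch below is only an assumption about how such a count could be produced. The function name `collisions` is hypothetical, and the tiny "other lists" are placeholder data, not the real wordlists.

```python
def collisions(candidate, other_lists):
    """Map each colliding word to the names of the lists that contain it."""
    hits = {}
    for name, words in other_lists.items():
        existing = {w.strip().lower() for w in words}
        for word in candidate:
            key = word.strip().lower()
            if key in existing:
                hits.setdefault(key, []).append(name)
    return hits


# Placeholder excerpts for illustration only, not the real wordlists.
other_lists = {
    "list_a": ["abandon", "gold", "hand", "zoo"],
    "list_b": ["abaisser", "zoologie"],
}
candidate = ["Gold", "Hand", "Blatt"]

result = collisions(candidate, other_lists)
print(result)
print(len(result), "collisions in total")
```

In practice each list would be read from its wordlist file, one word per line, before being passed in.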
@SebastianFloKa Thank you. In which form do we want to do this best and why do you actually only want nouns? |
Improvement loop related to input of @b068931cc450442b63f5b3d276ea4297: replacing uncommon / difficult words + reducing "homophone risky words" + reducing amount of words starting with AB*** Thanks for approving if OK or leaving comments if NOK. Also @thomasklemm @TZocker @DivineDominion @nisc @neox5 @rodasmith
See latest proposal - leave comments in case you disagree or approve if OK.
There are advantages for brainwallets, but these can be neglected as brainwallets aren't recommended except for special situations. Mainly, though, it's the same reason why the effort regarding Levenshtein distance is made: to reduce room for misinterpretation, which might cause loss of money.
@thomasklemm @TZocker @DivineDominion @nisc @neox5 @rodasmith @b068931cc450442b63f5b3d276ea4297 |
@SebastianFloKa I think that there should be another fundamental discussion about whether it makes sense to omit verbs and adjectives or not. The other word lists also work with adjectives and verbs and the omission only unnecessarily restricts the possible words.
By making the words as familiar as possible and known to everyone, you probably also reduce the risk of people making mistakes. Words like kurz, lang, rot, blau, laufen, gehen, stehen are known to elementary school students, while words like akazie, amnestie, anagramm, annexion and anode (these are only examples) are far less known.
This is not true, because the used words are always n of 2048 possible words of the list (if only this list was used). So it doesn't matter if there were only nouns or nouns, verbs and adjectives. @thomasklemm @TZocker @DivineDominion @nisc @neox5 @rodasmith What do you guys say, should only nouns be used or also adjectives and verbs in their base form? |
I'm satisfied with the outcome of the earlier conversation in #942 that concluded to use nouns only, avoiding confusion around capitalization. Here's an excerpt from that conversation:
The decision seemed good then and I don't see any reason to revisit it. |
Capitalization is not a problem because the words in the lists are always all lowercase (except @SebastianFloKa who writes everything in capital letters). I think this is also an important point that should be discussed again. I do not share the opinion of @SebastianFloKa:
All previous lists are in lowercase and with nouns, verbs and adjectives. |
@SebastianFloKa In my opinion it is not all that decisive whether verbs etc. are used as well; we should align with the other BIPs. That gives us further alternatives. Levenshtein collisions will rule out a lot. Mnemonic sentences would then be possible.... Sorry, with my suggestion I had not understood the Levenshtein part. Sorry.... I would rather put the emphasis on reducing the vocabulary to the level of a 12-year-old. Words like Zwinger and Ritze should possibly still be replaced (ambiguity). Best regards |
Somebody reconstructing a partially destroyed wallet will appreciate fewer choices of word categories when it comes to guessing hard-to-read words (e.g. after a house fire). Not a super-important advantage, true, but mentioned anyway.
Well, simply doing the same thing would mean an enormous number of Levenshtein errors (English wordlist) or unintended 9 letters per word (Italian wordlist) etc., so you probably mean to focus on the positive progress of other wordlists. Q: Is there any advantage for people (people, not the IT behind it, which can handle capital letters) in the German-speaking area in writing words in the more uncommon “all lowercase” that we might not have taken into consideration yet? Q: Do you require adjectives and/or verbs in the list, or is it because we might not find enough easy nouns? About certain words: Generally: Proposal: |
No, the word category does not play a role but only the word list (incl. used characters and the length of the words).
According to my understanding, the advantage is that you don't have to worry about upper and lower case and therefore write everything in lower case in such lists. This is the case with the other bip39 lists, with diceware lists like https://www.eff.org/deeplinks/2016/07/new-wordlists-random-passphrases and many other projects.
Both. Including them in the list corresponds to the conventions of word choice in the other languages, and it increases the pool of words that most 12-year-olds know. Proposal: Do you use https://bip39validator.readthedocs.io/en/latest/running.html for the tests? |
Neither the list nor the discussion about it is closed from my point of view. If this list gets merged in this form, it would be a missed opportunity. |
Thanks @luke-jr and the other BIP39 authors and those responsible, and "welcome" @b068931cc450442b63f5b3d276ea4297. No worries, "proposed BIP modification" doesn't mean it's merged.
Have you ever tried to recover a partially destroyed wallet (e.g. after a fire) where, say, the first letter of a word is illegible, as well as some letters at the end or in the middle? A normal user doesn't have a tool to filter for words with certain letters at certain positions. This means the user will have to guess possible words, so it's easier to search among nouns only instead of nouns, verbs and adjectives. It's not a must-have or the most important feature, but a small advantage.
no, running my own - but this might be good to work with.
What do you mean by advertisement? Generally it's OK to go through the list and pick out inappropriate words, of course. As for lower case, I'm personally not convinced yet, and I'm not sure about the others. It feels very strange for people from the German-speaking area to write nouns in lower case, in addition to the other reasons (people write more legibly in all caps etc.); this will also be part of the BIP39 authors' decision later. I'm fine to continue step by step (as we have been doing for years now); just let me replace the 10 above-mentioned words with other nouns first (I need a bit of time) and then go through the list again. |
Thank you.
I haven't, but whether it's a noun, verb or adjective doesn't matter at all. Since it is 1 of 2048 that are in the list. Sorry, I meant verbs and wanted to write an example with werben/Werbung (advertisement) first. With your list, I have already submitted as a pull request what I would remove and what I would add if necessary. I am currently working on another list, which could help if we want to add verbs and adjectives. For me it feels strange to see and write everything in capital letters. Even when we write normally, most of the letters used in any normal sentence are lowercase. The contract with the applications I also find a bit far-fetched, I think every person writes in letters, messengers and everywhere much more lowercase and finds it rather strange when someone with capslock writes everything in capital letters. |
Eliminated the words with highest complexity and replaced with simpler ones.
Of course each word in the list is 1 of 2048, but in my example the "word pool" for the user is not the list but all words. Let's take an example: A steel wallet went through a house fire and some words are no longer completely readable, e.g. in one word the second letter is readable as "L", the third as "A", the fourth as "T", while the first letter and the ending are unknown (?LATT???). The user has two options: A) Go through the complete list line by line and check whether each word might fit. Or, the much more realistic scenario, B) "guess" which word could be meant. In our case the noun "BLATT" might come to mind, and you will check in the wordlist directly under "B" whether this is a possible solution. If verbs & adjectives are also included, there are more choices to look up and it will be more time-consuming to figure out which one is intended: "glatt, platt, flattern, etc.". The expectation of limiting complexity to a certain age (e.g. 12-year-olds) sounds nice, but I couldn't find a source for a correlation between "age" and "words", which means it remains our subjective decision which words to accept. Having a few words at a 16-year-old level would statistically mean that every once in a while a newly created wallet includes one or a few words that the user would need to look up (if they are even interested). So far we said this disadvantage is worth all the advantages gained by nouns-only; it makes sense to go through the history of this to get an understanding. But if the community disagrees and requests that many words be replaced and not only a few, I'm open to reworking the list accordingly, of course. What are your positions on this? Or do you want a survey? @thomasklemm @TZocker @DivineDominion @nisc @neox5 @rodasmith @b068931cc450442b63f5b3d276ea4297 |
If I can still read "?LATT???" from the letters I open the list with the 2048 words, press Ctrl+F and enter "LATT". It really doesn't matter to which word category the word belongs. I don't have to go through line by line, and even if I do, it's easier than picking out a much larger number of nouns from the Duden, for example. I see no advantages but many disadvantages in choosing a list of nouns only. The 12 years was just an example. The simpler and more widespread the words are, the better. You can also look at "basic vocabulary" and "extended basic vocabulary", just like the linguistic levels A-B.
No I think you are the only one who says/writes that. |
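The "?LATT???" lookup debated above, whether done by Ctrl+F or by guessing, can also be written as a small filter. This is a sketch under assumptions: the notation is simplified so that `?` stands for exactly one unreadable letter and a trailing `*` stands for a lost ending of any length, and the short word sample stands in for a real 2048-word list. The function name `candidates` is hypothetical.

```python
import re


def candidates(pattern, wordlist):
    """Return the words that are compatible with a damaged reading.

    '?' is exactly one unreadable letter; a trailing '*' means the
    ending is lost and may have any length, including zero.
    """
    open_ended = pattern.endswith("*")
    core = pattern[:-1] if open_ended else pattern
    regex = "".join("." if ch == "?" else re.escape(ch) for ch in core.upper())
    if open_ended:
        regex += ".*"
    compiled = re.compile(regex)
    return [w for w in wordlist if compiled.fullmatch(w.upper())]


# Stand-in sample; a mixed list would contain verbs and adjectives too.
mixed = ["BLATT", "GLATT", "PLATT", "FLATTERN", "LATTE", "BLUME"]
nouns_only = ["BLATT", "LATTE", "BLUME"]

print(candidates("?LATT*", mixed))
print(candidates("?LATT*", nouns_only))
```

With the mixed sample several candidates remain, while the nouns-only sample leaves a single one. This illustrates both positions in the thread: a text search finds the matches either way, and a smaller pool of plausible words leaves fewer matches to choose between.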
I think it's a tough call. Most people today wouldn't know how BIP39 works and that there's a pre-defined list of 2048 words, with each word in the 24-word mnemonic representing 11 bits of a 256+8 bit seed ("What is a Bit?"). Other people wouldn't realize that there's a pattern, i.e., that the seed only includes nouns. I slightly prefer the nouns only version. I think more people see the only-nouns pattern than the 264 bits. In the end it really doesn't matter too much, though. If people lose a lot of money, they'll seek help. Someone will be able to explain it to them. |
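The arithmetic behind "11 bits per word" and "256+8 bits" follows from the BIP-0039 rules: the checksum is one bit per 32 bits of entropy, and each word encodes 11 bits because the list has 2048 = 2^11 entries. A short sketch, with function and constant names chosen for illustration:

```python
WORDLIST_SIZE = 2048
BITS_PER_WORD = WORDLIST_SIZE.bit_length() - 1  # 2048 == 2**11


def mnemonic_layout(entropy_bits):
    """Return (checksum_bits, word_count) for a BIP-0039 entropy size."""
    if entropy_bits % 32 != 0 or not 128 <= entropy_bits <= 256:
        raise ValueError("BIP-0039 allows 128 to 256 bits in steps of 32")
    checksum_bits = entropy_bits // 32
    word_count = (entropy_bits + checksum_bits) // BITS_PER_WORD
    return checksum_bits, word_count


for entropy_bits in range(128, 257, 32):
    checksum_bits, word_count = mnemonic_layout(entropy_bits)
    print(f"{entropy_bits} + {checksum_bits} bits -> {word_count} words")
```

For 256 bits of entropy this gives 8 checksum bits and 24 words, matching the 264 bits mentioned in the comment.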
thanks for the effort. Two considerations
|
Thanks as well! I, as a German, would much prefer all uppercase instead of all lowercase. All lowercase doesn't conform to the rules of orthography, and traditionally uppercase letters are used if only one case is allowed (e.g. in forms or crosswords). Word contours will be wrong with all lowercase, as nouns are normally written with a capital at the beginning. Anyway, in the case of this word list, correctly reading every single letter is probably more important than quickly reading whole words. Maybe all uppercase is even advantageous for this purpose. |
Why exclude Umlaut and ß, they are part of the Language? They could of course be considered equal to their non Umlaut counterparts (äöü -> aou and ß -> ss) as is the case with. See other languages wordlists. It just seems like an arbitrary constraint. |
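The equivalence suggested in this comment (äöü -> aou and ß -> ss) could be expressed as a folding step applied before words are compared. This is only a sketch of that suggestion, not something the PR implements. The mapping follows the comment literally, and the function name `fold` is hypothetical.

```python
# Mapping taken literally from the comment above: umlauts lose their dots,
# and the sharp s becomes "ss".
FOLD_TABLE = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def fold(word):
    """Lower-case a word and replace umlauts and ß with ASCII letters."""
    return word.lower().translate(FOLD_TABLE)


print(fold("Straße"), fold("STRASSE"))  # both spellings fold to the same key
print(fold("Bär"), fold("Bar"))         # two different words also fold together
```

The second line shows the trade-off: once folded, distinct words such as "Bär" and "Bar" become indistinguishable, so a list that admits umlauts would need its distance checks run on the folded forms.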
The BIP-0039 German Wordlist is based on the spelling rules defined in the German "Duden" and was checked for different aspects of quality by native speakers. Words were selected manually and also checked manually to ensure they are sufficiently common and positive. Tools were used to ensure sufficient Levenshtein distance between words, to prevent conflicts with other BIP-0039 wordlists and to eliminate homophones inside the wordlist.
There was a first attempt (#721) and a second attempt (#942) at a BIP-0039 German Wordlist. This third attempt intends to combine the requirements of the Bitcoin community in the German-speaking area with the must-have requirements for BIP-0039 wordlists, such as Levenshtein distance and no homophones.
Special considerations: