Checking your math...

In [The Similar-Words Problem](https://github.com/sts10/generated-wordlists#the-similar-words-problem) in your **Readme**, you wrote:

>If we assumed a hypothetical 18,000 word list that was just 9,000 words and their plurals, I think the odds of getting at least one "awkward double" in a 4-word passphrase is (1/18000) * (2/18000) * (3/18000), which is a really small number. But check my math!

Although your conclusion is correct (_"the odds...is a really small number"_), the actual probability is over 600 million times larger than your estimate.

The correct probability is `1/9000 + 2/9000 + 3/9000 - 11/9000**2 + 6/9000**3`.
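As a quick numeric check of the two figures above, here is a minimal Python sketch. The variable names are my own, and it assumes the Readme's hypothetical list of 9,000 words plus their plurals. It uses exact fractions so that rounding does not affect the ratio.

```python
from fractions import Fraction

# Hypothetical list from the Readme: 9,000 words plus their plurals (18,000 total).
N = 9000

# The estimate given in the Readme: (1/18000) * (2/18000) * (3/18000).
readme_estimate = Fraction(1, 2 * N) * Fraction(2, 2 * N) * Fraction(3, 2 * N)

# The corrected probability quoted above.
corrected = Fraction(6, N) - Fraction(11, N**2) + Fraction(6, N**3)

# How many times larger the corrected value is than the Readme's estimate.
ratio = corrected / readme_estimate

print(f"Readme estimate: {float(readme_estimate):.3e}")
print(f"Corrected value: {float(corrected):.3e}")
print(f"Ratio:           {float(ratio):,.0f}")
```

The ratio comes out a little under 650 million, which is where the "over 600 million" figure comes from.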

To prove this for a word list containing $N$ words and their plurals ($2N$ words total), let $P_1$ be the probability of getting **at least one** "awkward double" and let $P_0$ be the probability of getting **no** awkward doubles. Then

$$P_1 = 1 - P_0$$

The probability of getting no awkward doubles ($P_0$) is the number of passphrases containing only unique stems, divided by the total number of possible passphrases. "Unique stems" means that once a word has been selected, neither that word nor its counterpart (the plural or singular form, whichever was not picked) can be selected again. For a passphrase consisting of $k$ words, the total number of passphrases is

$$N_{\text{total}} = (2N)^k$$

To compute the number of passphrases containing only unique stems, the size of the word pool is reduced by 2 each time a word is selected (because the word itself is eliminated from further consideration, as is the plural/singular form of that word):

$$N_{\text{unique}} = (2N)(2N - 2)(2N - 4) \cdots (2N - 2k + 2)$$

Therefore, the probability of getting only unique stems is 

$$P_0 = \frac{N_{\text{unique}}}{N_{\text{total}}} = \left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right) \cdots \left(1 - \frac{k - 1}{N}\right)$$

Therefore, the general solution for the probability of getting at least one "awkward double" is

$$P_1 = 1 - \left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right) \cdots \left(1 - \frac{k - 1}{N}\right)$$
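To sanity-check the general solution, here is a rough Python sketch that compares the closed-form product against a brute-force count over every possible passphrase on a tiny list. The function names and the encoding of words as integers (word `w` has stem `w // 2`) are my own assumptions for illustration. As in the derivation above, picking the exact same word twice also counts as a repeated stem.

```python
from fractions import Fraction
from itertools import product


def p_awkward(n_stems, k):
    """Closed-form probability of at least one repeated stem in k words."""
    p_none = Fraction(1)
    for j in range(k):
        # After j picks, n_stems - j stems remain unused.
        p_none *= Fraction(n_stems - j, n_stems)
    return 1 - p_none


def p_awkward_brute(n_stems, k):
    """Same probability, by enumerating all (2 * n_stems) ** k passphrases."""
    words = range(2 * n_stems)  # words 2*s and 2*s + 1 share stem s
    total = 0
    hits = 0
    for phrase in product(words, repeat=k):
        total += 1
        if len({w // 2 for w in phrase}) < k:
            hits += 1
    return Fraction(hits, total)


# Only feasible for small lists: 5 stems (10 words), 3-word passphrases.
print(p_awkward(5, 3), p_awkward_brute(5, 3))
```

Both functions should agree on any list small enough to enumerate; the closed form is the only practical option for a list of 18,000 words.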

For $k = 4$, the math works out to the following result:

$$P_1 = \frac{6}{N} - \frac{11}{N^2} + \frac{6}{N^3}$$
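For anyone who wants to see the intermediate step, the three factors for four words expand as follows. The coefficients are the sum of 1, 2, 3, the sum of their pairwise products, and their product.

$$
\begin{aligned}
\left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right)\left(1 - \frac{3}{N}\right)
&= 1 - \frac{1 + 2 + 3}{N} + \frac{1 \cdot 2 + 1 \cdot 3 + 2 \cdot 3}{N^2} - \frac{1 \cdot 2 \cdot 3}{N^3} \\
&= 1 - \frac{6}{N} + \frac{11}{N^2} - \frac{6}{N^3}
\end{aligned}
$$

Subtracting this from 1 flips each sign and gives the result above.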

If, in the general solution, one neglects higher-order terms ($N^{-2}$, $N^{-3}$, etc.), the following approximate solution is obtained:

$$P_1 \approx \frac{1}{N} + \frac{2}{N} + \cdots + \frac{k - 1}{N} = \frac{k(k - 1)}{2N}$$
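To give a feel for how good the approximation is, here is a small Python sketch comparing it with the exact product for the Readme's hypothetical case (9,000 words plus plurals, 4-word passphrases). The function names are my own.

```python
from fractions import Fraction


def p_exact(n_stems, k):
    """Exact probability of at least one awkward double."""
    p_none = Fraction(1)
    for j in range(1, k):
        p_none *= 1 - Fraction(j, n_stems)
    return 1 - p_none


def p_approx(n_stems, k):
    """First-order approximation: k(k - 1) / (2N)."""
    return Fraction(k * (k - 1), 2 * n_stems)


n_stems, k = 9000, 4
exact = p_exact(n_stems, k)
approx = p_approx(n_stems, k)
rel_error = (approx - exact) / exact

print(f"exact:          {float(exact):.6e}")
print(f"approximation:  {float(approx):.6e}")
print(f"relative error: {float(rel_error):.2e}")
```

For a list this large the approximation overestimates by only a small fraction of a percent, so roughly 1 in 1,500 is a fair summary.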
