I listened to German songs and couldn't understand half of the words, so I decided to learn these words using Anki.
Anki is a flashcard program that helps you spend more time on challenging material, and less on what you already know.
This repository provides scripts for extracting words from German (song) texts and generating German-English cards for these words using an LLM accessed via an OpenAI API. You can provide a file with a list of unknown words, and cards will be generated only for these words.
Additionally, this repository provides data files used for generating my deck. You can replace the data if needed.
The demo deck has 20 cards selected from my deck.
You can import the demo deck into Anki and then reuse the deck note type, card types, card templates, and styles. Alternatively, you can recreate everything from templates.
The raw deck and the note type for the demo deck share the following fields:
- `index` - an index;
- `word_deu` - a German word;
- `part_of_speech_deu` - the part of speech of `word_deu`, in English;
- `word_eng` - a translation of `word_deu` to English;
- `sentence_deu` - a German sentence that contains `word_deu`;
- `sentence_eng` - a translation of the German sentence `sentence_deu` to English that contains `word_eng`, the translation of the German word;
- `sentence_lemmatized_deu` - `sentence_deu` where each word was lemmatized.

The goal of lemmatization is to reduce a word to its root form, also called a lemma. For example, the verb `running` would be identified as `run`.
The note type in the demo deck has the following additional fields for sound files:
- `word_sound_deu` - pronunciation of `word_deu`;
- `word_sound_eng` - pronunciation of `word_eng`;
- `sentence_sound_deu` - pronunciation of `sentence_deu`;
- `sentence_sound_eng` - pronunciation of `sentence_eng`.
I wanted to significantly increase my active vocabulary. Therefore, my deck contains ~8300 unique German words spread across ~3200 cards. Such a high density of unique words is the result of the constraints described below.
My cards are in the raw deck file.
You can import the demo deck into Anki, remove demo cards, and then import cards from the raw deck.
Use the following deck options:
- New Cards:
  - Insertion order: Sequential
- Display Order:
  - New card gather order: Ascending position
  - New card sort order: Order gathered
If you want to recreate cards in Anki using the templates instead of reusing the demo deck:
- In Anki, create a new note type from `Basic (and reversed card)`.
- Rename the card types to `En-De` and `De-En`.
- Use the corresponding `.html` templates from `custom/de/card-templates` for the card templates.
- Use `custom/de/card-templates/styling.css` for the card styling.
`./custom/de/deck/deck.csv` is a `|`-separated CSV containing a list of words that have cards, followed by a list of words without cards (a loading sketch follows the file descriptions below).
- `./custom/de/data/external/dewiki-noun-articles.csv` - nouns with articles from dewiktionary. Can be downloaded here.
- `./custom/de/data/external/dwds_lemmata_2025-01-15.csv` - lemmas from DWDS. Can be downloaded here.
- `./custom/de/data/sources/playlist/lemmatized.csv` - almost the same as `playlist/data.csv` except the texts are lemmatized.
- `./custom/de/data/words/`:
  - `words.csv` - words from all texts, in the order of their occurrence in the lemmatized texts in `lemmatized.csv`.
  - `bad-baseform.csv` - for certain German words, maps each word to a special `baseform` that is likely to appear in the corresponding lemmatized German sentence.
  - `counts.csv` - how many times each lemma appears in the deck.
  - `known.csv` - known German words with their parts of speech and translations. The parts of speech, translations, and indices aren't strictly necessary.
  - `lemmas.csv` - provides the full form for each correct (according to the dictionaries in `external`) lemma from the lemmatized sentences.
  - `not-lemmas.csv` - provides the full form for certain words that aren't correct lemmas according to the dictionaries in `external`.
  - `too-frequent.csv` - a list of German words that appear in the deck too often (determined manually), according to `counts.csv`.
- `./custom/de/script/`:
  - `lib.py` - mostly language-agnostic functionality for generating cards for the given raw deck.
  - `main.py` - provides deck-specific functionality for fetching song texts, lemmatization, extracting words, collecting unknown words, preparing the raw deck, and running the cards generator.
  - `api_request_parallel_processor.py` - used for sending parallel requests to the OpenAI API.
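For illustration, here is a minimal sketch of loading the raw deck `deck.csv` with pandas. The `|` separator and the column names come from the descriptions above; the use of pandas and everything else is an assumption, and the repository's own loading code may differ.

```python
import pandas as pd

# The raw deck is `|`-separated: words with cards come first, then words without cards.
deck = pd.read_csv("./custom/de/deck/deck.csv", sep="|")

print(deck.head())
# Rows without card data have only the word and its index filled in.
print("rows with a generated sentence:", deck["sentence_deu"].notna().sum())
```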
Tested on my Linux machine.
- Install the Nix package manager (link) and restart your computer.
  - Method 1.
  - Method 2 + enable flakes permanently. I prefer the single-user installation because it's easier to manage.
- Install direnv (don't forget to hook it into your shell!).
- Clone this repository.

  ```
  git clone https://github.com/deemp/anki-decks
  ```
- Open VS Code in the repository directory.

  ```
  code anki-decks
  ```

- Install the recommended VS Code extensions.
  - You can open the Command Palette (`Ctrl + Shift + P` on Linux), then type and click `Extensions: Show Recommended Extensions` or `Configure Recommended Extensions`.
- Open the terminal (`` Ctrl + ` `` on Linux) and allow `direnv` to work in the repository directory.

  ```
  direnv allow
  ```

  - Answer `yes` to questions.

- Install the Python dependencies.

  ```
  nix develop
  poetry install
  poetry run python -m spacy download de_core_news_lg
  ```
- Open the Command Palette, then type and click `direnv: Reset and reload environment`.
- If you plan to edit Nix files, e.g. `flake.nix`, install `nil`.

  ```
  nix profile install nixpkgs#nil
  ```

- Open the Command Palette, then type and click `Python: Select Interpreter`. Click the option that contains `./.venv/bin/python`.
- Open `main.py`. You should see the `Run cell` buttons above the `# %%` comments.
- Create a `.env` file in the root directory of the repository with your credentials for OpenAI and Genius.com.

  ```
  OPENAI_API_KEY=
  GENIUS_CLIENT_ID=
  GENIUS_CLIENT_SECRET=
  GENIUS_CLIENT_ACCESS_TOKEN=
  ```
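For illustration, here is a minimal sketch of reading these credentials in Python, assuming the variables from `.env` are exported into the environment (for example by direnv). Only the variable names come from the file above; the `require_env` helper is hypothetical.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail early if it's missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value

OPENAI_API_KEY = require_env("OPENAI_API_KEY")
GENIUS_CLIENT_ACCESS_TOKEN = require_env("GENIUS_CLIENT_ACCESS_TOKEN")
```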
My deck uses song texts as the source of words. The user may skip steps 1 and 3 below if they don't want to fetch the song texts.
1. The user provides a new list of songs and their authors in `./custom/de/data/sources/playlist/raw.csv` or appends the data to an existing list. Words from these songs will get indices, which are then used in Anki. Appending (rather than reordering) keeps existing indices valid and hence preserves the learning progress in Anki.
2. In `./custom/de/script/main.py`, the user runs the first two cells to load the necessary functions.
3. In the same file, the user runs the next cell containing `update_songs`.

   `update_songs` joins `./custom/de/data/sources/playlist/raw.csv` with a table constructed from `./custom/de/data/sources/playlist/data.yaml` on `title` and `author`. It then tries to fetch missing texts, unless it is known that these texts are unavailable, and saves the results (potentially reindexed) into `./custom/de/data/sources/playlist/data.csv`. Next, it looks for songs that still don't have a `text` and tries to fetch the corresponding texts, again unless they are known to be unavailable. The songs with unavailable texts are moved to the end of the table. Finally, the data from `./custom/de/data/sources/playlist/data.csv` is copied into `./custom/de/data/sources/playlist/data.yaml`.
4. The user edits `./custom/de/data/sources/playlist/data.yaml`.

   The user may edit texts freely when generating the deck for the first time. Otherwise, the user should only append new texts, so that word indices stay valid and the progress in Anki is preserved after importing the updated version of the deck. `update_songs` doesn't update texts that are already present, so the user's edits are safe.
5. The user runs the next cell containing `update_lemmatized_sources`.

   It reads the texts from `./custom/de/data/sources/playlist/data.yaml`, lemmatizes them, and writes the result into `./custom/de/data/sources/lemmatized.csv`. Lemmatization handles separable verbs like `aufstehen`; a sketch of how this can be done is shown after these steps. Songs without texts remain at the end of the table.
6. The user runs the next cell containing `update_word_lists`.

   `update_word_lists` calls several other functions:

   - `update_sources_words` splits the lemmatized sentences from `./custom/de/data/sources/lemmatized.csv`, producing lists of words. I call them words because some of them are not lemmas. It then concatenates these lists, assigns each word an index, and writes the words to `./custom/de/data/words/words.csv`. The `song_id` is the index in `./custom/de/data/sources/lemmatized.csv` of the song that the word comes from.
   - `update_words_not_lemmas` updates the `./custom/de/data/words/not-lemmas.csv` file. The `word`s and their indices come from `./custom/de/data/words/words.csv`; these words are not lemmas according to the dictionaries. The `lemma` column provides a lemma of the `word`; these lemmas were obtained via an LLM. The `lemma_correct` column provides one of the full forms of the word (e.g., a noun with an article), where the word is a correct lemma. This column isn't strictly necessary.
   - `copy_lemmas_from_words_not_lemmas_to_words_lemmas` copies correct `lemma`s from `./custom/de/data/words/not-lemmas.csv` to `./custom/de/data/words/lemmas.csv`.
   - `update_lemmas_correct` then adds missing `lemma_correct` values in `./custom/de/data/words/lemmas.csv`. For nouns that change their meaning depending on the article, there will be a row with an index like `504.0` for the version with one article and additional rows with indices like `504.001` for versions with other articles. The articles are looked up in the nouns dictionary.
   - `copy_words_lemmas_to_deck` copies `lemma_correct` from `./custom/de/data/words/lemmas.csv` to the raw deck and updates the indices of words there.
   - `filter_deck_raw_by_sentence_length` filters out sentences with an inappropriate length.
   - `update_words_bad_baseform` updates `./custom/de/data/words/bad-baseform.csv`, which maps words from the raw deck to a form that is likely to appear in a lemmatized sentence. This file is used for partitioning.
7. The user runs the next cell containing `generate_deck_data_iteratively`.

   It makes a specified number of iterations. On each iteration, it calls `update_deck_raw`, which does the following:

   - It partitions the raw deck so that rows that have card data come first, followed by rows without that data.
   - It calls `prepare_requests`, which does the following:
     - It calculates the number of blocks of words that will be processed on this iteration. Each block contains a specified number of words, or fewer if not enough words are left in the raw deck.
     - It uniformly randomly selects words without deck data and distributes them into blocks. This way, the blocks are always different, and the LLM that loads these blocks into its context is less likely to generate the same sentences as on one of the previous iterations.
     - It prepares a request for each block.
   - It makes parallel requests to the OpenAI API.
   - It processes the responses and writes them into the raw deck.
   - It partitions the raw deck.
8. The user `git commit`s the changes to the raw deck.
9. The user checks the logs to find words that didn't occur in their lemmatized sentences and adds mappings for these words to `./custom/de/data/words/bad-baseform.csv`, connecting each word with its form in the lemmatized sentence.
10. The user continues running `generate_deck_data_iteratively` until each word in the raw deck has card data.
11. The user debugs things if something doesn't work.
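Step 5 mentions that lemmatization handles separable verbs. Below is a minimal sketch of how this could be done with spaCy's `de_core_news_lg` model (the model installed above). The function name `lemmatize_text` and the reliance on the `svp` dependency label for separable prefixes are assumptions for illustration; the actual code in `main.py` may work differently.

```python
# A hedged sketch: lemmatize German text with spaCy and reattach separable
# verb prefixes (e.g., "Ich stehe früh auf" -> lemma "aufstehen").
# Assumes the German models mark separable prefixes with the "svp" dependency label.
import spacy

nlp = spacy.load("de_core_news_lg")  # installed via `poetry run python -m spacy download ...`

def lemmatize_text(text: str) -> str:
    doc = nlp(text)
    # Map each verb's token index to the separable prefix that depends on it, if any.
    prefixes = {tok.head.i: tok.lemma_ for tok in doc if tok.dep_ == "svp"}
    lemmas = []
    for tok in doc:
        if tok.dep_ == "svp":
            continue  # the prefix is merged into its verb's lemma below
        lemma = tok.lemma_
        if tok.i in prefixes and tok.pos_ == "VERB":
            lemma = prefixes[tok.i] + lemma  # "auf" + "stehen" -> "aufstehen"
        lemmas.append(lemma)
    return " ".join(lemmas)

print(lemmatize_text("Ich stehe jeden Morgen früh auf."))
```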
In my deck, all sentences must be 60 to 70 characters long. Sentences in this length range:
- aren't boring;
- expose various grammatical structures that shorter sentences often lack;
- often contain long words that don't fit into shorter sentences;
- are quicker to review than longer sentences.
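For illustration, here is a minimal sketch of such a length filter, assuming the raw deck is a pandas DataFrame with the columns listed above and inclusive bounds. The function names are hypothetical; the actual `filter_deck_raw_by_sentence_length` in `lib.py` may differ.

```python
import pandas as pd

MIN_LEN, MAX_LEN = 60, 70  # character bounds from the constraint above

def has_appropriate_length(sentence) -> bool:
    # Rows without card data have no sentence yet; keep them untouched.
    if not isinstance(sentence, str):
        return True
    return MIN_LEN <= len(sentence) <= MAX_LEN

def drop_bad_sentences(deck: pd.DataFrame) -> pd.DataFrame:
    # Remove the generated card data (keeping the word and its index)
    # wherever the German sentence violates the length constraint.
    bad = ~deck["sentence_deu"].map(has_appropriate_length)
    card_columns = [c for c in deck.columns if c not in ("index", "word_deu")]
    deck.loc[bad, card_columns] = pd.NA
    return deck
```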
In my deck, each `sentence_lemmatized_deu` must contain at least two words (not counting the `word_deu`) that appear at most three times among all `sentence_lemmatized_deu`. This constraint helped me increase the vocabulary in my deck.
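A minimal sketch of this rare-words check, assuming the lemma counts are available as a `collections.Counter`. The function and parameter names are illustrative; the actual check is part of `check_is_correct_sentence` in `lib.py`.

```python
from collections import Counter

def has_enough_rare_words(
    sentence_lemmatized_deu: str,
    word_deu: str,
    counts: Counter,          # occurrences of each lemma across all lemmatized sentences
    max_count: int = 3,       # a lemma counts as rare if it appears at most this many times
    min_rare: int = 2,        # required number of rare lemmas, not counting word_deu
) -> bool:
    lemmas = sentence_lemmatized_deu.split()
    rare = [lemma for lemma in lemmas if lemma != word_deu and counts[lemma] <= max_count]
    return len(rare) >= min_rare
```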
Each sentence generated for a word must contain that word.
In my deck, I check this condition for the German word (`word_deu`) and the German sentence (`sentence_deu`). However, it's better to also check it for the English translation of the German word (`word_eng`) and the English sentence (`sentence_eng`).
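A minimal sketch of such a containment check, for illustration only; in the actual pipeline the German check works on the lemmatized sentence and the word's baseform, as described in the partitioning section below. The example values are hypothetical, not taken from the deck.

```python
def sentence_contains_word(sentence: str, word: str) -> bool:
    """A naive, case-insensitive whole-token containment check."""
    tokens = sentence.lower().replace(",", " ").replace(".", " ").split()
    return word.lower() in tokens

# Hypothetical example values:
print(sentence_contains_word("Ich stehe jeden Morgen früh auf.", "Morgen"))  # True
print(sentence_contains_word("I get up early every morning.", "morning"))    # True
```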
After writing the generated card data to the raw deck, it's necessary to remove the bad card data and prepare the deck for the next iteration of generation. After the bad card data for a word is removed, only the word and its index remain, just as before the current iteration.
Partitioning is performed by the `partition_deck_raw` function defined in `./custom/de/script/lib.py`, which does the following (a sketch of the final reordering step follows this list):

- It calls several other functions:
  - `filter_deck_raw_by_sentence_length` removes the card data for all words where the `sentence_deu` (German sentence) has an inappropriate length (see Length constraint).
  - `update_deck_raw_lemmatized_sentences` writes lemmatized versions of `sentence_deu` to the `sentence_lemmatized_deu` column.
  - `update_word_counts` calculates how many times each lemma from `sentence_lemmatized_deu` appears in the deck.
  - `update_words_bad_baseform` updates the `./custom/de/data/words/bad-baseform.csv` file (discussed above).
- It selects the rows that have card data and processes them from larger to smaller indices using `check_is_correct_sentence`.

  The `check_is_correct_sentence` function checks that the `sentence_deu` is correct by doing the following:

  - It checks that the `sentence_deu` contains enough rare words.
  - It checks that the lemmatized `sentence_deu` contains the baseform of the German word, obtained via the `make_baseform` function defined in `./custom/de/script/lib.py` or specified in `./custom/de/data/words/bad-baseform.csv`.
  - It decrements the counts for lemmas from `sentence_lemmatized_deu` if the sentence is incorrect.
  - It prints a message if the `sentence_deu` doesn't contain the baseform.
- It removes the card data where the `sentence_deu` is incorrect.
- It reorders rows so that the rows with deck data go first, followed by the other rows. In each group, rows are sorted by the index.
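As mentioned above, here is a minimal sketch of the final reordering step, assuming the raw deck is a pandas DataFrame with an `index` column and a `sentence_deu` column (names follow the field list above). `reorder_deck` is a hypothetical name; the actual `partition_deck_raw` does much more, as described in the list.

```python
import pandas as pd

def reorder_deck(deck: pd.DataFrame) -> pd.DataFrame:
    # A row "has card data" if its German sentence is filled in.
    has_data = deck["sentence_deu"].notna() & (deck["sentence_deu"] != "")
    with_data = deck[has_data].sort_values("index")
    without_data = deck[~has_data].sort_values("index")
    # Rows with card data come first; within each group, rows are sorted by index.
    return pd.concat([with_data, without_data], ignore_index=True)
```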
I assume that the LLM prefers to generate simple words because they are probably used more often in the training corpus.
I want to encounter difficult words in German sentences in the rows with small indices.
So, the goal is to avoid accumulating all simple words in the German sentences in the rows with smaller indices.
During partitioning, sentences are checked from larger to smaller indices. Since the word counts (see the `update_word_counts` description above) are maximal at larger indices, these sentences are more likely to be removed due to the rare words constraint. Each time such a sentence is removed, the counts are decremented for the lemmas in this sentence. Hence, the sentences with smaller indices become less likely to be removed due to the rare words constraint.
A possible solution is to check sentences in a random order rather than from larger to smaller indices. However, this approach has two (potential) downsides. First, I assume it may require more iterations to converge to a stable set of sentences. Second, the LLM may not have enough vocabulary to generate sentences for all words; such vocabulary saturation happened on several other decks.

Meanwhile, I wanted to start learning from the words with smaller indices. Therefore, I kept checking in a non-random order.
The sources are append-only because swapping them would affect indices: notes (sorted by index) would be swapped, and the learning progress would become incorrect.