
Anki deck generator for studying new languages

I listened to German songs and couldn't understand half of the words, so I decided to learn these words using Anki.

Anki is a flashcard program that helps you spend more time on challenging material, and less on what you already know.

(source)

This repository provides scripts for extracting words from German (song) texts and generating German-English cards for these words using an LLM accessed via an OpenAI API. You can provide a file with a list of unknown words, and cards will be generated only for these words.

Additionally, this repository provides data files used for generating my deck. You can replace the data if needed.

Demo deck

The demo deck has 20 cards selected from my deck.

You can import the demo deck into Anki and then reuse the deck note type, card types, card templates, and styles. Alternatively, you can recreate everything from templates.

Cards

Preview in Anki

Card Demo

Fields

Fields in the raw deck

The raw deck and the note type for the demo deck share the following fields:

  1. index - an index;

  2. word_deu - a German word;

  3. part_of_speech_deu - the part of speech of word_deu, in English;

  4. word_eng - a translation of word_deu into English;

  5. sentence_deu - a German sentence that contains word_deu;

  6. sentence_eng - an English translation of sentence_deu that contains word_eng, the English translation of the German word;

  7. sentence_lemmatized_deu - sentence_deu where each word was lemmatized;

    The goal of lemmatization is to reduce a word to its root form, also called a lemma. For example, the verb running would be identified as run.

    (source)
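
Lemmatization in this repository relies on spaCy's German model (de_core_news_lg, installed in the Setup section below). Here is a minimal sketch of what sentence_lemmatized_deu contains, assuming a plain token-by-token approach (the actual pipeline also handles separable verbs, see Usage):

    import spacy

    # de_core_news_lg is the German model installed in the Setup section.
    nlp = spacy.load("de_core_news_lg")

    def lemmatize(sentence: str) -> str:
        # Replace each non-punctuation token with its lemma.
        return " ".join(token.lemma_ for token in nlp(sentence) if not token.is_punct)

    # Exact lemmas depend on the model; expect something like "ich lernen gerne neu deutsch Wort".
    print(lemmatize("Ich lerne gerne neue deutsche Wörter."))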

Fields in the demo deck

The note type in the demo deck has the following additional fields for sound files:

  1. word_sound_deu - pronunciation of word_deu;
  2. word_sound_eng - pronunciation of word_eng;
  3. sentence_sound_deu - pronunciation of sentence_deu;
  4. sentence_sound_eng - pronunciation of sentence_eng.

My deck

Vocabulary in my deck

I wanted to significantly increase my active vocabulary. Therefore, in my deck, there are ~8300 unique German words among ~3200 cards. Such a density of unique words is due to the constraints described below.

Raw deck

My cards are in the raw deck file.

You can import the demo deck into Anki, remove demo cards, and then import cards from the raw deck.

Use the following deck options:

Templates

If you want to recreate cards in Anki using the templates instead of reusing the demo deck:

Files

Raw deck file

  • ./custom/
    • de/
      • deck/
        • deck.csv is a |-separated CSV containing a list of words that have cards followed by a list of words without cards (see the sketch below).
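
For illustration, the raw deck can be loaded with pandas by passing the pipe separator (a sketch; the column names follow the fields listed above):

    import pandas as pd

    # The raw deck is |-separated; its columns follow the Fields section
    # (index, word_deu, part_of_speech_deu, word_eng, sentence_deu, ...).
    deck = pd.read_csv("./custom/de/deck/deck.csv", sep="|")

    # Rows with card data come first, followed by rows that only have a word and an index.
    with_cards = deck[deck["sentence_deu"].notna()]
    without_cards = deck[deck["sentence_deu"].isna()]
    print(len(with_cards), len(without_cards))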

Dictionaries

Dictionary: nouns with articles

DWDS: lemmas

Data files

  • ./custom/
    • de/
      • data/
        • sources/
        • words/
          • bad-baseform.csv - maps certain German words to a special baseform that is likely to appear in the corresponding lemmatized German sentence.
          • counts.csv - how many times each lemma appears in the deck (see the sketch after this list).
          • known.csv - known German words with their parts of speech and translations. Parts of speech, translations, and indices aren't strictly necessary.
          • lemmas.csv - provides the full form for each correct (according to the dictionaries in external) lemma from lemmatized sentences.
          • not-lemmas.csv - provides the full form for certain words that aren't correct lemmas according to the dictionaries in external.
          • too-frequent.csv - a list of German words that appear in the deck too often (determined manually), according to counts.csv.
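
For example, the lemma counts stored in counts.csv can be recomputed from the lemmatized sentences roughly like this (a sketch with illustrative column names; the actual logic lives in update_word_counts in ./custom/de/script/lib.py):

    from collections import Counter

    import pandas as pd

    deck = pd.read_csv("./custom/de/deck/deck.csv", sep="|")

    # Count how many times each lemma appears across all lemmatized sentences.
    counts = Counter(
        lemma
        for sentence in deck["sentence_lemmatized_deu"].dropna()
        for lemma in sentence.split()
    )

    pd.DataFrame(sorted(counts.items()), columns=["lemma", "count"]).to_csv(
        "./custom/de/data/words/counts.csv", index=False
    )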

Scripts

  • ./custom/
    • de/
      • script/
        • lib.py - mostly language-agnostic functionality for generating cards for the given raw deck.
        • main.py - provides deck-specific functionality for fetching song texts, lemmatization, extracting words, collecting unknown words, preparing the raw deck, and running the cards generator.
        • api_request_parallel_processor.py - used for sending parallel requests to the OpenAI API.
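
A minimal sketch of the parallel-request idea with the official openai Python client (this is not the actual api_request_parallel_processor.py; the model name and prompt are illustrative):

    import asyncio

    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def generate_block(words: list[str]) -> str:
        # One request per block of words; the real prompts are built by prepare_requests.
        response = await client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[
                {
                    "role": "user",
                    "content": "Write one German example sentence for each word: " + ", ".join(words),
                }
            ],
        )
        return response.choices[0].message.content

    async def main():
        blocks = [["aufstehen", "verstehen"], ["Gelegenheit", "trotzdem"]]
        for text in await asyncio.gather(*(generate_block(block) for block in blocks)):
            print(text)

    asyncio.run(main())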

Setup

Tested on my Linux machine.

  • Install the Nix package manager (link) and reboot the computer.

  • Install direnv (don't forget to hook it into your shell!).

  • Clone this repository.

    git clone https://github.com/deemp/anki-decks
  • Open VS Code in the repository directory.

    code anki-decks
  • Install the recommended VS Code extensions.

    • You can open the Command Palette (Ctrl + Shift + P on Linux), type and click Extensions: Show Recommended Extensions or Configure Recommended Extensions.
  • Open the terminal (Ctrl + ` on Linux) and allow direnv to work in the repository directory.

    direnv allow
  • Answer yes to questions.

  • Install Python dependencies.

    nix develop
    poetry install
    poetry run python -m spacy download de_core_news_lg
  • Open the Command Palette, type and click direnv: Reset and reload environment.

  • If you plan to edit Nix files, e.g. flake.nix, install nil.

    nix profile install nixpkgs#nil
  • Open the Command Palette, type and click Python: Select Interpreter. Click the option that has ./.venv/bin/python.

  • Open main.py. You should see the Run cell buttons above the # %% comments.

  • Create a .env file in the root directory of the repository with your credentials for OpenAI and Genius.com.

    OPENAI_API_KEY=
    GENIUS_CLIENT_ID=
    GENIUS_CLIENT_SECRET=
    GENIUS_CLIENT_ACCESS_TOKEN=
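
To check that the credentials are picked up, the .env file can be loaded with python-dotenv (a sketch, assuming the scripts read these values from environment variables):

    import os

    from dotenv import load_dotenv

    # Load variables from the .env file in the repository root into the environment.
    load_dotenv()

    for name in ("OPENAI_API_KEY", "GENIUS_CLIENT_ACCESS_TOKEN"):
        print(name, "is set" if os.getenv(name) else "is missing")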

Usage

My deck uses song texts as the source of words. The user may skip steps 1 and 3 if they don't want to fetch the song texts.

  1. The user provides a new list of songs and their authors in ./custom/de/data/sources/playlist/raw.csv or appends the data to an existing list. Words from these songs will get indices that will be used in Anki. Appending keeps the existing indices valid and hence preserves the learning progress in Anki.

  2. In ./custom/de/script/main.py, the user runs the first two cells to load necessary functions.

  3. In the same file, the user runs the next cell containing update_songs.

    update_songs joins the ./custom/de/data/sources/playlist/raw.csv with a table constructed from ./custom/de/data/sources/playlist/data.yaml on title and author.

    It then tries to fetch missing texts unless it is known that these texts are unavailable.

    Next, it saves the results (potentially reindexed) into ./custom/de/data/sources/playlist/data.csv.

    Then, it searches for songs that don't have a text and tries to fetch the corresponding texts, unless it is known that these texts are unavailable.

    The songs with unavailable texts are moved to the end of the table.

    At the end, the data from ./custom/de/data/sources/playlist/data.csv is copied into ./custom/de/data/sources/playlist/data.yaml.

  4. The user edits ./custom/de/data/sources/playlist/data.yaml.

    Here, the user may edit texts if the deck is being generated for the first time. Otherwise, the user only appends new texts so that word indices stay valid and the progress in Anki is preserved after the updated version of the deck is imported.

    update_songs doesn't update the texts that are already present. Hence, the user's edits are safe.

  5. The user runs the next cell containing update_lemmatized_sources.

    It reads the texts from ./custom/de/data/sources/playlist/data.yaml, lemmatizes them, and writes the results into ./custom/de/data/sources/lemmatized.csv.

    Lemmatization handles the separable verbs like aufstehen.

    Songs without texts are still at the end of the table.

  6. The user runs the next cell containing update_word_lists.

    update_word_lists calls several other functions:

    1. update_sources_words splits the lemmatized sentences from ./custom/de/data/sources/lemmatized.csv, producing lists of words. I call them words because some of them are not lemmas. Then, it concatenates these lists of words, assigns each word an index, and writes the words to ./custom/de/data/words/words.csv.

      The song_id is the index in ./custom/de/data/sources/lemmatized.csv of the song that the word comes from.

    2. update_words_not_lemmas updates the ./custom/de/data/words/not-lemmas.csv file.

      The words and their indices come from ./custom/de/data/words/words.csv. These words are not lemmas according to the dictionaries.

      The lemma column provides a lemma of the word. These lemmas were obtained via an LLM.

      The lemma_correct column provides one of the full forms of the word (e.g., a noun with an article), where the word is a correct lemma. This column isn't strictly necessary.

    3. copy_lemmas_from_words_not_lemmas_to_words_lemmas copies correct lemmas from ./custom/de/data/words/not-lemmas.csv to ./custom/de/data/words/lemmas.csv.

    4. update_lemmas_correct then adds missing lemma_correct in ./custom/de/data/words/lemmas.csv.

      For nouns that change their meaning depending on the article, there will be a row with an index like 504.0 for the version with one article and additional rows with indices like 504.001 for versions with other articles. The articles are looked up in the nouns dictionary.

    5. copy_words_lemmas_to_deck copies lemma_correct from ./custom/de/data/words/lemmas.csv to the raw deck and updates the indices of words there.

    6. filter_deck_raw_by_sentence_length filters out sentences with an inappropriate length.

    7. update_words_bad_baseform updates ./custom/de/data/words/bad-baseform.csv that maps words from the raw deck to a form that is likely to appear in a lemmatized sentence. This file is used for partitioning.

  7. The user runs the next cell containing generate_deck_data_iteratively.

    It makes a specified number of iterations. On each iteration, it calls update_deck_raw.

    update_deck_raw does the following:

    1. It partitions the raw deck so that the rows that have card data go first and the rows without that data go next.

    2. It calls prepare_requests which does the following:

      1. It calculates the number of blocks of words that will be processed on this iteration. Each block will contain a specified number of words, or fewer if not enough words remain in the raw deck.

      2. It uniformly randomly selects words without deck data and distributes them into blocks (see the sketch after this list). This way, blocks are always different, and the LLM that loads these blocks into its context is less likely to generate the same sentences as on one of the previous iterations.

      3. It prepares a request for each block.

    3. It makes parallel requests to the OpenAI API.

    4. It processes responses and writes them into the raw deck.

    5. It partitions the raw deck.

  8. The user git commits the changes to the raw deck.

  9. The user checks the logs to find words that didn't occur in their lemmatized sentences. The user adds mappings for these words to ./custom/de/data/words/bad-baseform.csv, connecting the word with its form in the lemmatized sentence.

  10. The user continues running generate_deck_data_iteratively until each word in the raw deck has the card data.

  11. The user debugs things if something doesn't work.
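
The uniform random selection of words into blocks from step 7 can be sketched as follows (illustrative function and parameter names; the actual logic lives in prepare_requests):

    import random

    def make_blocks(words_without_cards: list[str], block_size: int, n_blocks: int) -> list[list[str]]:
        # Uniformly sample words that don't have card data yet and split them into
        # blocks of at most block_size words; the last block may be smaller.
        n_words = min(block_size * n_blocks, len(words_without_cards))
        selected = random.sample(words_without_cards, n_words)
        return [selected[i : i + block_size] for i in range(0, n_words, block_size)]

    # Example: 2 blocks of up to 3 words each.
    print(make_blocks(["gehen", "Haus", "schnell", "obwohl", "Zukunft"], block_size=3, n_blocks=2))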

Constraints

Length constraint

In my deck, all sentences must be 60 to 70 characters long. Sentences in this length range:

  • aren't boring;
  • expose various grammatical structures that shorter sentences often lack;
  • often contain long words that don't fit into shorter sentences;
  • are quicker to review than longer sentences.

Rare words constraint

In my deck, each sentence_lemmatized_deu must contain at least two words (not counting the word_deu) that appear at most three times among all sentence_lemmatized_deu. This constraint helped me increase the vocabulary in my deck.

Occurs constraint

Each sentence generated for a word must contain that word.

In my deck, I check this condition for the German word (word_deu) and the German sentence (sentence_deu). However, it's better to also check it for the English translation of the German word (word_eng) and the English sentence (sentence_eng).
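
A sketch of how these three constraints could be checked for a single row (the actual checks live in check_is_correct_sentence in ./custom/de/script/lib.py; the occurs check is simplified here, since the real one matches a baseform in the lemmatized sentence):

    from collections import Counter

    def satisfies_constraints(
        word_deu: str,
        sentence_deu: str,
        sentence_lemmatized_deu: str,
        counts: Counter,  # lemma -> occurrences across all lemmatized sentences
    ) -> bool:
        # Length constraint: 60 to 70 characters.
        if not (60 <= len(sentence_deu) <= 70):
            return False

        # Occurs constraint (simplified): the sentence must contain the word.
        if word_deu.lower() not in sentence_deu.lower():
            return False

        # Rare words constraint: at least two lemmas other than the word itself
        # appear at most three times among all lemmatized sentences.
        rare = [lemma for lemma in sentence_lemmatized_deu.split() if lemma != word_deu and counts[lemma] <= 3]
        return len(rare) >= 2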

Partitioning the raw deck

After the generated card data is written to the raw deck, it's necessary to remove the bad card data and prepare the deck for the next iteration of generation. After the bad card data for a word is removed, only the word and its index remain, just like before the current iteration.

Partitioning is performed by the partition_deck_raw function defined in ./custom/de/script/lib.py which does the following:

  1. It calls several other functions:

    1. filter_deck_raw_by_sentence_length removes the card data for all words where the sentence_deu (German sentence) has an inappropriate length (see Length constraint).

    2. update_deck_raw_lemmatized_sentences writes lemmatized versions of sentence_deu to the sentence_lemmatized_deu column.

    3. update_word_counts calculates how many times each lemma from sentence_lemmatized_deu appeared in the deck.

    4. update_words_bad_baseform updates the ./custom/de/data/words/bad-baseform.csv file (discussed above).

  2. It selects the rows that have the card data and starts processing them from larger to smaller indices using check_is_correct_sentence.

    The check_is_correct_sentence function checks that the sentence_deu is correct by doing the following:

    1. It checks that the sentence_deu contains enough rare words.

    2. It checks that the lemmatized sentence_deu contains the baseform of the German word obtained via the make_baseform function defined in ./custom/de/script/lib.py or the baseform specified in ./custom/de/data/words/bad-baseform.csv.

    3. It decrements the counts for lemmas from sentence_lemmatized_deu if the sentence is incorrect.

    4. It prints a message if the sentence_deu doesn't contain the baseform.

  3. It removes the card data where the sentence_deu is incorrect.

  4. It reorders the rows so that the rows with card data go first and the other rows go next. In each group, rows are sorted by the index (see the sketch below).
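
The final reordering step can be sketched with pandas (illustrative; partition_deck_raw also performs the checks described above):

    import pandas as pd

    def partition(deck: pd.DataFrame) -> pd.DataFrame:
        # Rows that still have card data (a generated sentence) go first,
        # rows without card data go next; each group is sorted by the index column.
        has_data = deck["sentence_deu"].notna()
        top = deck[has_data].sort_values("index")
        bottom = deck[~has_data].sort_values("index")
        return pd.concat([top, bottom], ignore_index=True)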

Partitioning discussion

I assume that the LLM prefers to generate simple words because they are probably used more often in the training corpus.

I want to encounter difficult words in German sentences in the rows with small indices.

So, the goal is to avoid accumulating all simple words in the German sentences in the rows with smaller indices.

During partitioning, sentences are checked from larger to smaller indices. Since the word counts (see update_word_counts description above) are maximal at larger indices, these sentences are more likely to be removed due to the rare words constraint. Each time such a sentence is removed, the counts are decremented for the lemmas in this sentence. Hence, the sentences with smaller indices become less likely to be removed due to the rare words constraint.

A possible solution is to check in a random order, not from larger to smaller indices. However, this approach has two (potential) downsides. First, I assume that it may require more iterations to converge to a stable set of sentences. Second, the LLM may not have enough vocabulary to generate sentences for all words. Such vocabulary saturation happened with several other decks.

Meanwhile, I wanted to start learning from words with smaller indices. Therefore, I kept checking in a non-random order.

Deck

The deck is append-only because swapping sources would change the indices. Hence, notes (sorted by index) would be swapped and the learning progress would become incorrect.
