
Tokka - Multilingual BPE Tokenizer Toolkit

A Python toolkit for training multilingual BPE tokenizers with support for dataset streaming and efficient processing.

Status: Fully working; multilingual tokenizer training has been tested end to end.

Features

  • Multilingual tokenizer training with configurable language mixing
  • HuggingFace dataset integration with streaming support
  • MosaicML Streaming format conversion for efficient training
  • Flexible configuration using YAML + CLI overrides
  • Memory-efficient processing with dataset streaming

Installation

Uses uv for package management:

uv sync

Quick Start

1. Install Dependencies

uv sync

2. Convert Datasets (Optional)

# Convert HuggingFace datasets to efficient MosaicML Streaming format
# This downloads and converts datasets for 40 human languages + 10 programming languages
./init_datasets.sh

# Or convert specific datasets manually:
uv run convert-hf --dataset HuggingFaceFW/fineweb --subsets sample-10BT --target-column text --output fineweb

3. Train Your First Tokenizer

# Quick test with tiny config (100K samples, ~2 minutes)
uv run train --config configs/tiny.yaml

# Basic training with simple config
uv run train --config configs/simple.yaml

# With CLI overrides
uv run train --config configs/simple.yaml total_samples=25000000

# Complex multilingual setup
uv run train --config configs/complex.yaml output_dir=./my-tokenizer

4. Test Your Tokenizer

# Test the tokenizer you just trained (the hash-prefixed directory name will differ per run)
uv run test --tokenizer-path tokenizers/796e0f-train-tiny-100k/tokenizer.json

# Or test any saved tokenizer
uv run test --tokenizer-path ./path/to/tokenizer.json

Commands

Train Tokenizer

Train a multilingual BPE tokenizer with various configurations.
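The commands below repeat the Quick Start invocations:

# Quick test with the tiny config (100K samples, ~2 minutes)
uv run train --config configs/tiny.yaml

# Basic training with the simple config
uv run train --config configs/simple.yaml

# Full multilingual setup with an output-directory override
uv run train --config configs/complex.yaml output_dir=./my-tokenizer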

Convert HuggingFace Datasets to MDS Format

Convert HuggingFace datasets to MosaicML Streaming format for efficient training:

# Convert FineWeb dataset
uv run convert-hf --dataset HuggingFaceFW/fineweb --target-column text --output ./mds_datasets/fineweb

# Convert StarCoder data with specific language subsets
uv run convert-hf --dataset bigcode/starcoderdata --target-column content --subsets python,java,javascript --output ./mds_datasets/starcoderdata

# Convert with specific data directories
uv run convert-hf --dataset HuggingFaceFW/fineweb --target-column text --data-dirs CC-MAIN-2024-10,CC-MAIN-2024-18 --output ./mds_datasets/fineweb-samples

# Convert with custom split and size limit
uv run convert-hf --dataset wikitext --target-column text --split validation --max-size-gb 0.5 --output ./mds_datasets/wikitext-val

The convert-hf command supports:

  • Multiple subsets/languages: Use --subsets for dataset configurations
  • Data directories: Use --data-dirs for datasets organized by directories
  • Size limits: Control dataset size with --max-size-gb (default: 1GB per configuration)
  • Custom splits: Specify train/validation/test splits
  • Automatic organization: Each subset/data_dir creates its own output folder
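
These options can be combined in a single invocation (the output path here is illustrative):

# Convert two StarCoder language subsets, capped at 0.5 GB each
uv run convert-hf --dataset bigcode/starcoderdata --target-column content --subsets python,java --split train --max-size-gb 0.5 --output ./mds_datasets/starcoder-sample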

Project Structure

tokka/
├── cli/                    # CLI commands
│   ├── train.py           # Tokenizer training
│   ├── test.py            # Tokenizer testing
│   └── convert_hf.py      # HuggingFace to MDS conversion
├── configs/               # YAML configurations
│   ├── tiny.yaml          # Quick test (100K samples)
│   ├── simple.yaml        # Basic setup
│   ├── medium.yaml        # Medium complexity
│   └── complex.yaml       # Full multilingual
├── tokka/                 # Core library
│   ├── config.py          # Configuration management
│   ├── dataset_utils.py   # Dataset processing
│   └── tokenizer_utils.py # Tokenizer utilities
├── init_datasets.sh       # Dataset conversion script (40 languages + 10 programming)
└── mds_datasets/          # Converted MDS datasets (created by init_datasets.sh)

Configuration

Tokenizer training uses YAML configuration files with CLI override support:

# Example: configs/simple.yaml
total_samples: 1000000
vocab_size: 32000
temperature: 0.3
streaming_enabled: true

datasets:
  - name: "HuggingFaceFW/fineweb"
    column: "text"
    priority: 1.0

CLI overrides are passed as key=value pairs; nested keys use dot notation:

uv run train --config configs/simple.yaml total_samples=2000000
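
Several keys from the YAML above can be overridden in one call (this assumes multiple key=value pairs are accepted, as is typical for this override style):

uv run train --config configs/simple.yaml vocab_size=64000 temperature=0.5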

Example Training Output

When you run uv run train --config configs/tiny.yaml, you'll see output like:

🚀 Starting tokenizer training with config: configs/tiny.yaml

🔧 Configuration loaded:
  Total samples: 100,000
  Vocabulary size: 10,000
  Output directory: ./tokenizers/<hash>-train-tiny-100k
  Temperature: 0.3 (controls priority weighting)
  Batch size: 5,000 (performance setting)
  Datasets: 2
    1. fineweb/sample-10BT (priority: 5)
    2. fineweb-2/rus_Cyrl (priority: 5)

📊 Dataset Configuration:
   Temperature: 0.3 (priority weighting)
   Datasets: 2
     sample-10BT: priority=5 → 50.0%
     rus_Cyrl: priority=5 → 50.0%

...

✅ Tokenizer saved using HuggingFace save_pretrained() to: tokenizers/796e0f-train-tiny-100k
🔢 Final vocabulary size: 10000

🚀 You can now load this tokenizer with:
   from transformers import AutoTokenizer
   tokenizer = AutoTokenizer.from_pretrained('tokenizers/796e0f-train-tiny-100k')

The trained tokenizer is tested on multiple languages, and all round-trip (encode → decode) tests pass. ✅
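
A minimal round-trip check along those lines might look like this (the directory name is the example hash from the log above; yours will differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tokenizers/796e0f-train-tiny-100k")

# Encode and decode a few samples; a byte-level BPE tokenizer should
# reproduce the input text exactly.
for text in ["Hello, world!", "Привет, мир!"]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    assert tokenizer.decode(ids) == text, f"round-trip failed for {text!r}"
    print(f"{text!r} -> {len(ids)} tokens, round-trip OK")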

MosaicML Streaming Integration

The toolkit integrates with MosaicML Streaming for efficient dataset processing:

  • Deterministic sample ordering: Reproducible across different hardware configurations
  • Mid-epoch resumption: Resume training quickly after interruptions
  • High throughput: Optimized for streaming large datasets
  • Memory efficiency: Process datasets larger than available RAM
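
As a minimal sketch (not part of the toolkit's CLI), a converted MDS folder can be read directly with the streaming library; the local path and the "text" column below are assumptions based on the conversion examples above:

from streaming import StreamingDataset

# Point 'local' at a folder produced by convert-hf / init_datasets.sh
# (path and column name are assumptions; adjust to your layout).
dataset = StreamingDataset(local="./mds_datasets/fineweb/sample-10BT", shuffle=False)

for i, sample in enumerate(dataset):
    print(sample["text"][:80])  # each sample is a dict keyed by the converted columns
    if i >= 2:
        break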

License

MIT License
