A Python toolkit for training multilingual BPE tokenizers with support for dataset streaming and efficient processing.
✅ Status: Fully Working - Successfully tested with multilingual tokenizer training!
- Multilingual tokenizer training with configurable language mixing
- HuggingFace dataset integration with streaming support
- MosaicML Streaming format conversion for efficient training
- Flexible configuration using YAML + CLI overrides
- Memory-efficient processing with dataset streaming
Uses uv for package management:
uv sync
# Convert HuggingFace datasets to efficient MosaicML Streaming format
# This downloads and converts 40 human languages + 10 programming languages
./init_datasets.sh
# Or convert specific datasets manually:
uv run convert-hf --dataset HuggingFaceFW/fineweb --subsets sample-10BT --target-column text --output fineweb
# Quick test with tiny config (100K samples, ~2 minutes)
uv run train --config configs/tiny.yaml
# Basic training with simple config
uv run train --config configs/simple.yaml
# With CLI overrides
uv run train --config configs/simple.yaml total_samples=25000000
# Complex multilingual setup
uv run train --config configs/complex.yaml output_dir=./my-tokenizer
# Test the tokenizer you just trained
uv run test --tokenizer-path tokenizers/796e0f-train-tiny-100k/tokenizer.json
# Or test any saved tokenizer
uv run test --tokenizer-path ./path/to/tokenizer.json
Train a multilingual BPE tokenizer with various configurations:
Convert HuggingFace datasets to MosaicML Streaming format for efficient training:
# Convert FineWeb dataset
uv run convert-hf --dataset HuggingFaceFW/fineweb --target-column text --output ./mds_datasets/fineweb
# Convert StarCoder data with specific language subsets
uv run convert-hf --dataset bigcode/starcoderdata --target-column content --subsets python,java,javascript --output ./mds_datasets/starcoderdata
# Convert with specific data directories
uv run convert-hf --dataset HuggingFaceFW/fineweb --target-column text --data-dirs CC-MAIN-2024-10,CC-MAIN-2024-18 --output ./mds_datasets/fineweb-samples
# Convert with custom split and size limit
uv run convert-hf --dataset wikitext --target-column text --split validation --max-size-gb 0.5 --output ./mds_datasets/wikitext-val
The `convert-hf` command supports:
- Multiple subsets/languages: Use `--subsets` for dataset configurations
- Data directories: Use `--data-dirs` for datasets organized by directories
- Size limits: Control dataset size with `--max-size-gb` (default: 1 GB per configuration)
- Custom splits: Specify train/validation/test splits
- Automatic organization: Each subset/data_dir creates its own output folder
tokka/
├── cli/ # CLI commands
│ ├── train.py # Tokenizer training
│ ├── test.py # Tokenizer testing
│ └── convert_hf.py # HuggingFace to MDS conversion
├── configs/ # YAML configurations
│ ├── tiny.yaml # Quick test (100K samples)
│ ├── simple.yaml # Basic setup
│ ├── medium.yaml # Medium complexity
│ └── complex.yaml # Full multilingual
├── tokka/ # Core library
│ ├── config.py # Configuration management
│ ├── dataset_utils.py # Dataset processing
│ └── tokenizer_utils.py # Tokenizer utilities
├── init_datasets.sh # Dataset conversion script (40 languages + 10 programming)
└── mds_datasets/ # Converted MDS datasets (created by init_datasets.sh)
Tokenizer training uses YAML configuration files with CLI override support:
# Example: configs/simple.yaml
total_samples: 1000000
vocab_size: 32000
temperature: 0.3
streaming_enabled: true
datasets:
- name: "HuggingFaceFW/fineweb"
column: "text"
priority: 1.0
CLI overrides use `key=value` pairs, with dot notation for nested keys:
uv run train --config configs/simple.yaml total_samples=2000000
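The `priority` values and `temperature` together determine how training samples are mixed across datasets. As a rough, hypothetical illustration of temperature-based weighting (the exponent convention and normalization below are assumptions, not tokka's exact formula):

```python
# Hypothetical sketch of priority + temperature mixing (not tokka's actual code).
# Assumption: each dataset's priority is raised to a temperature-derived exponent
# and the results are normalized; the real formula may differ.

def mixing_ratios(priorities: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn per-dataset priorities into sampling ratios that sum to 1."""
    scaled = {name: priority ** temperature for name, priority in priorities.items()}
    total = sum(scaled.values())
    return {name: weight / total for name, weight in scaled.items()}

# Equal priorities always produce an even split, matching the 50%/50% mix
# shown in the example output below.
print(mixing_ratios({"sample-10BT": 5, "rus_Cyrl": 5}, temperature=0.3))
# -> {'sample-10BT': 0.5, 'rus_Cyrl': 0.5}
```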
When you run `uv run train --config configs/tiny.yaml`, you'll see:
🚀 Starting tokenizer training with config: configs/tiny.yaml
🔧 Configuration loaded:
Total samples: 100,000
Vocabulary size: 10,000
Output directory: ./tokenizers/<hash>-train-tiny-100k
Temperature: 0.3 (controls priority weighting)
Batch size: 5,000 (performance setting)
Datasets: 2
1. fineweb/sample-10BT (priority: 5)
2. fineweb-2/rus_Cyrl (priority: 5)
📊 Dataset Configuration:
Temperature: 0.3 (priority weighting)
Datasets: 2
sample-10BT: priority=5 → 50.0%
rus_Cyrl: priority=5 → 50.0%
...
✅ Tokenizer saved using HuggingFace save_pretrained() to: tokenizers/796e0f-train-tiny-100k
🔢 Final vocabulary size: 10000
🚀 You can now load this tokenizer with:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('tokenizers/796e0f-train-tiny-100k')
The tokenizer is tested on text in multiple languages, and all round-trip (encode → decode) checks reproduce the original input. ✅
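Here, a round-trip test means encoding a string and verifying that decoding the resulting ids gives the same string back. A minimal sketch of such a check using the standard `transformers` API (the tokenizer path is the one from the example above; depending on the tokenizer's settings, byte-exact output may require `clean_up_tokenization_spaces=False`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tokenizers/796e0f-train-tiny-100k")

samples = [
    "Hello, world!",
    "Привет, мир!",                       # Russian (covered by fineweb-2/rus_Cyrl)
    "def add(a, b):\n    return a + b",   # code-like input
]

for text in samples:
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids, clean_up_tokenization_spaces=False)
    assert decoded == text, f"Round-trip failed for {text!r}"

print("All round-trip checks passed ✅")
```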
The toolkit integrates with MosaicML Streaming for efficient dataset processing (a short reading example follows the list below):
- Deterministic sample ordering: Reproducible across different hardware configurations
- Mid-epoch resumption: Resume training quickly after interruptions
- High throughput: Optimized for streaming large datasets
- Memory efficiency: Process datasets larger than available RAM
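Converted datasets can be read back with MosaicML's `streaming` library. A minimal sketch, assuming a dataset produced by `convert-hf` / `init_datasets.sh` (the local path and the `"text"` column name are illustrative and depend on how the conversion was run):

```python
from streaming import StreamingDataset

# Path is illustrative: point it at a folder produced by convert-hf.
dataset = StreamingDataset(local="mds_datasets/fineweb/sample-10BT", shuffle=True)

# Samples come back as dicts keyed by the columns written at conversion time
# (here assumed to be "text", i.e. the --target-column used above).
for i, sample in enumerate(dataset):
    print(sample["text"][:80])
    if i >= 2:
        break
```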
MIT License