kraken extractor
Kractor extracts sequencing reads from fastq [.gz/.bz2]
files using taxonomic classifications obtained via Kraken2.
It supports single or paired-end reads, can optionally include taxonomic parents or children, and uses minimal memory (~4.5 MB for a 17 GB FASTQ file).
The end result is a fast[q/a]
file containing all reads classified as the specified taxon(s).
Kractor significantly enhances processing speed compared to KrakenTools for both paired and unpaired reads.
Performance vs KrakenTools:
- Paired compressed FASTQ: ~21× faster
- Paired uncompressed FASTQ: ~10× faster
- Unpaired: ~4× faster (compressed or uncompressed)
For additional details, refer to the benchmarks
Provides similar functionality to the KrakenTools extract_kraken_reads
python script.
However the main motivation was to enhance speed when processing multiple, large FASTQ files - and as a way to learn Rust.
Precompiled binaries for Linux, MacOS and Windows are attached to the latest release.
conda install -c bioconda kractor
A docker image is available on quay.io
docker pull quay.io/biocontainers/kractor
Requires cargo
cargo install kractor
To install please refer to the rust documentation: docs
git clone https://github.com/Sam-Sims/Kractor
cd Kractor
cargo build --release
export PATH=$PATH:$(pwd)/target/release
All executables will be in the directory Kractor/target/release.
Extract reads from a FASTQ file based on taxonomic classification via Kraken2.
Usage: kractor [OPTIONS] --input [<INPUT>...] --output [<OUTPUT>...] --kraken <KRAKEN> --taxid <TAXID>...
Options:
-i, --input [<INPUT>...]
Input file path(s). Accepts up to 2 files (for paired-end reads)
-o, --output [<OUTPUT>...]
Output file path(s). Accepts up to 2 files (for paired-end reads)
-k, --kraken <KRAKEN>
Kraken2 stdout file path
-r, --report <REPORT>
Kraken2 report file path
-t, --taxid <TAXID>...
One or more taxon IDs to extract reads for
-p, --parents
Include all parent taxon IDs in the output. Requires a Kraken2 report file
-c, --children
Include all child taxon IDs in the output. Requires a Kraken2 report file
--compression-format <OUTPUT_TYPE>
Compression format for output files (gz, bz2). Overides the inferred format
--compression-level <COMPRESSION_LEVEL>
Compression level (1-9) [default: 2]
--exclude
Exclude specified taxon IDs from the output
--output-fasta
Output results in FASTA format
--summary
Enable a JSON summary output written to stdout
-v, --verbose
Enable verbose output
-h, --help
Print help
-V, --version
Print version
# Extract reads classified as E. coli from single end reads
kractor -i sample.fastq -o extracted.fastq -k kraken_output.txt -t 562
# Extract from paired end reads
kractor -i sample_R1.fastq -i sample_R2.fastq -o extracted_R1.fastq -o extracted_R2.fastq -k kraken_output.txt -t 562
# Extract multiple taxids (Bacillaceae and Listeriaceae)
kractor -i sample.fastq -o extracted.fastq -k kraken_output.txt -t 186817 186820
# Extract all children of Enterobacteriaceae family (requires kraken report)
kractor -i sample.fastq -o extracted.fastq -k kraken_output.txt -r kraken_report.txt -t 543 --children
# Extract everything EXCEPT viral reads (using --exclude)
kractor -i sample.fastq -o extracted.fastq -k kraken_output.txt -t 10239 --exclude
# Output FASTA format instead of FASTQ
kractor -i sample.fastq -o extracted.fasta -k kraken_output.txt -t 562 --output-fasta
Use --json-report
to get summary statistics (output to stdout on completion)
{
"total_taxon_count": 2,
"reads_extracted_per_taxon": {
"0": 745591,
"1": 1646
},
"total_reads_in": 3491078,
"total_reads_out": 747237,
"proportion_extracted": 0.2140419091180432,
"input_format": "single",
"output_format": "fastq",
"kractor_version": "2.0.0"
}
-i, --input
Specifies one or two input FASTQ files to extract reads from. Files may be uncompressed or compressed (gz
, bz2
).
Paired end reads can be specified by:
Using --input
twice: -i <R1_fastq_file> -i <R2_fastq_file>
Using --input
once but passing both files: -i <R1_fastq_file> <R2_fastq_file>
-o, --output
Specifies the output file(s) for extracted reads, matching the order of the input files. Compression type is inferred from the file extension (.gz, .bz2). If not recognised, output will be uncompressed.
-k, --kraken
Path to the Standard Kraken Output Format file, containing taxonomic classification of read IDs.
-t, --taxid
One or more taxonomic IDs to extract.
For example: -t 1 2 10
Each taxid is affected by --exclude
, --parents
, and --children
if those options are used.
--compression-format
Manually set output compression format, overriding what is inferred from file names.
Valid values:
gz
– gzip compressionbz2
– bzip2 compressionnone
– no compression
--compression-level
Set compression level (1–9).
- 1 = fastest, largest file
- 9 = slowest, smallest file
Default: 2 (balance of speed and size)
--output-fasta
Output sequences in FASTA format instead of FASTQ.
-r, --report
Path to the Kraken2 report file. Required if using --parents
or --children
.
--parents
Include reads classified between the root and the specified --taxid
. Requires --report
.
--children
Include reads classified at the given taxid and all its descendant taxa. Requires --report
.
--exclude
Extract all reads except those matching the given taxids. Can be combined with --parents
or --children
.
--summary
Write a JSON report to stdout after processing.
Sam Sims. (2025). Sam-Sims/kractor: kractor-1.0.1 (kractor-1.0.1). Zenodo. https://doi.org/10.5281/zenodo.15761838