ArcInstitute/bqtools

bqtools

MIT licensed Crates.io

A command-line utility for working with BINSEQ files.

Overview

bqtools provides tools to encode, decode, manipulate, and analyze BINSEQ files. It supports both BQ (*.bq) and VBQ (*.vbq) files and is built on the binseq library.

BINSEQ is a binary file format family designed for high-performance processing of DNA sequences. It currently has two variants: BQ and VBQ.

  • BQ (*.bq): Optimized for fixed-length DNA sequences without quality scores.
  • VBQ (*.vbq): Optimized for variable-length DNA sequences with optional quality scores.

Both variants support single and paired sequences. Nucleotides are packed efficiently with two-bit encoding via bitnuc, and FASTX input is processed in parallel via paraseq.
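
To make the idea of two-bit encoding concrete, here is a minimal Python sketch that packs a DNA string into an integer at two bits per base. The bit assignment (A=0, C=1, G=2, T=3), the bit order, and the helper names `pack_2bit` and `unpack_2bit` are assumptions for illustration only; the layout bitnuc actually uses may differ.

```python
# Hypothetical 2-bit mapping for illustration; not necessarily bitnuc's layout.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def pack_2bit(seq: str) -> int:
    """Pack a DNA string into an integer, first base in the lowest two bits."""
    packed = 0
    for i, base in enumerate(seq):
        # Raises KeyError for anything other than A, C, G, or T.
        packed |= ENCODE[base] << (2 * i)
    return packed


def unpack_2bit(packed: int, length: int) -> str:
    """Recover `length` bases from a packed integer."""
    return "".join(DECODE[(packed >> (2 * i)) & 0b11] for i in range(length))


seq = "GATTACA"
packed = pack_2bit(seq)
assert unpack_2bit(packed, len(seq)) == seq
# 7 bases fit in at most 14 bits here, versus 56 bits as ASCII text.
assert packed.bit_length() <= 2 * len(seq)
```

In this sketch the sequence length has to be stored separately, because trailing A bases encode as zero bits and would otherwise be lost. It also shows why a base such as N cannot be stored directly: two bits only distinguish four symbols.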

For more information about BINSEQ, see our preprint where we describe the format family and its applications.

Features

  • Encode: Convert FASTA or FASTQ files to a BINSEQ format
  • Decode: Convert a BINSEQ file back to FASTA, FASTQ, or TSV format
  • Cat: Concatenate multiple BINSEQ files
  • Count: Count records in a BINSEQ file
  • Grep: Search for fixed-string or regex patterns in BINSEQ files

Installation

From Cargo

bqtools can be installed using cargo, the Rust package manager:

cargo install bqtools

To install cargo, follow the instructions on the official Rust website.

From Source

# Clone the repository
git clone https://github.com/arcinstitute/bqtools.git
cd bqtools

# Install
cargo install --path .

# Check installation
bqtools --help

Usage

# Get help information
bqtools --help

# Get help for specific commands
bqtools encode --help
bqtools decode --help
bqtools cat --help
bqtools count --help

Encoding

bqtools accepts input from stdin or from file paths.

It will auto-determine the input format and compression status.
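
As a rough picture of how this kind of auto-detection can work, the following Python sketch inspects the first bytes of a stream. It is not bqtools' implementation: the function names `sniff_compression` and `sniff_format` are hypothetical. It relies only on the published gzip and Zstandard magic numbers and on the convention that FASTA records start with `>` and FASTQ records start with `@`.

```python
GZIP_MAGIC = b"\x1f\x8b"          # gzip member header
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # Zstandard frame magic number


def sniff_compression(head: bytes) -> str:
    """Guess the compression of a stream from its leading bytes."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    return "none"


def sniff_format(head: bytes) -> str:
    """Guess FASTA vs FASTQ from the first byte of *uncompressed* data."""
    first = head.lstrip()[:1]
    if first == b">":
        return "fasta"
    if first == b"@":
        return "fastq"
    return "unknown"


print(sniff_compression(b"\x1f\x8b\x08\x00"))       # prints "gzip"
print(sniff_format(b"@read1\nACGT\n+\nIIII\n"))     # prints "fastq"
```

Because both checks need only a few leading bytes, the same approach works for stdin as well as for file paths, provided the bytes that were peeked at are handed on to the parser.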

Convert FASTA/FASTQ files to BINSEQ:

# Encode a single file to bq
bqtools encode input.fastq -o output.bq

# Encode a single file to vbq
bqtools encode input.fastq -o output.vbq

# Encode a file stream to bq (auto-determine input format and compression status)
/bin/cat input.fastq.zst | bqtools encode -o output.bq

# Encode paired-end reads
bqtools encode input_R1.fastq input_R2.fastq -o output.bq

# Encode paired-end reads to vbq
bqtools encode input_R1.fastq input_R2.fastq -o output.vbq

# Encode a SAM/BAM/CRAM file to BINSEQ
bqtools encode input.bam -fb -o output.bq

# Encode a paired-end CRAM file to BINSEQ (sorted by read name)
bqtools encode input.paired.cram -I -fb -o output.vbq

# Specify a policy for handling non-ATCG nucleotides
bqtools encode input.fastq -o output.bq -p r  # Randomly draw A/C/G/T for each N

# Set threads for parallel processing
bqtools encode input.fastq -o output.bq -T 4

Available policies for handling non-ATCG nucleotides:

  • i: Ignore sequences with non-ATCG characters
  • p: Break on invalid sequences
  • r: Randomly draw a nucleotide for each N (default)
  • a: Set all Ns to A
  • c: Set all Ns to C
  • g: Set all Ns to G
  • t: Set all Ns to T
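
The policies above can be pictured with a small Python sketch. It illustrates the behaviour described in the list rather than bqtools' actual code: the function name `apply_policy`, the use of `None` for a skipped record, and the `ValueError` raised for `p` are all hypothetical choices. The substitution policies here replace only `N`, as the list describes.

```python
import random

VALID = set("ACGT")


def apply_policy(seq, policy, rng=None):
    """Return the sequence to encode, or None if the record is skipped."""
    if set(seq) <= VALID:
        return seq  # nothing to fix
    if policy == "i":
        return None  # ignore (skip) the whole record
    if policy == "p":
        raise ValueError(f"invalid nucleotide in sequence: {seq!r}")
    if policy == "r":
        rng = rng or random.Random()
        # Draw an independent random base for each N.
        return "".join(rng.choice("ACGT") if b == "N" else b for b in seq)
    if policy in ("a", "c", "g", "t"):
        return seq.replace("N", policy.upper())
    raise ValueError(f"unknown policy: {policy!r}")


print(apply_policy("ACNT", "a"))  # prints "ACAT"
print(apply_policy("ACNT", "i"))  # prints "None"
```

A fixed replacement (`a`, `c`, `g`, `t`) is reproducible across runs, while `r` avoids introducing artificial homopolymer runs where a read contains many consecutive Ns.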

Decoding

Convert BINSEQ files back to FASTA/FASTQ/TSV:

# Decode to FASTQ (default)
bqtools decode input.bq -o output.fastq

# Decode to compressed FASTQ (gzip/zstd)
bqtools decode input.bq -o output.fastq.gz
bqtools decode input.bq -o output.fastq.zst

# Decode to FASTA
bqtools decode input.bq -o output.fa -f a

# Decode paired-end reads into separate files
bqtools decode input.bq --prefix output
# Creates output_R1.fastq and output_R2.fastq

# Specify which read of a pair to output
bqtools decode input.bq -o output.fastq -m 1  # Only first read
bqtools decode input.bq -o output.fastq -m 2  # Only second read

# Specify output format
bqtools decode input.bq -o output.tsv -f t  # TSV format

Concatenating

Combine multiple BINSEQ files:

bqtools cat file1.bq file2.bq file3.bq -o combined.bq

Counting

Count records in a BINSEQ file:

bqtools count input.bq

Grep

Search for fixed subsequences or regular expressions within BINSEQ files:

# See full options list
bqtools grep --help

# Search for a specific subsequence (in primary sequence)
bqtools grep input.bq -e "ATCG"

# Search for a regular expression (in primary)
bqtools grep input.bq -r "AT[CG]"

# Search for both a subsequence (in extended sequence) and a regular expression (in either)
bqtools grep input.bq -E "ATCG" -P "AT[CG]"
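
For readers unfamiliar with the difference between the two pattern types, here is a minimal Python sketch of fixed-string versus regular-expression matching on decoded sequences. The helper names `contains_fixed` and `contains_regex` and the example records are hypothetical. The sketch does not model how bqtools combines several patterns or how it scans the encoded data.

```python
import re


def contains_fixed(seq: str, pattern: str) -> bool:
    """Fixed-string search: the pattern must appear verbatim."""
    return pattern in seq


def contains_regex(seq: str, pattern: str) -> bool:
    """Regex search: the pattern may describe several alternatives."""
    return re.search(pattern, seq) is not None


# Each record is a (primary, extended) pair of sequences.
records = [("GGATCGTT", "TTTTAAAA"), ("CCCCCCCC", "GGATGGCC")]

for primary, extended in records:
    hit_primary = contains_fixed(primary, "ATCG")
    hit_either = contains_regex(primary, "AT[CG]") or contains_regex(extended, "AT[CG]")
    print(primary, extended, hit_primary, hit_either)
```

The expression `AT[CG]` matches either `ATC` or `ATG`, so it finds the second record through its extended sequence even though the fixed string `ATCG` appears only in the first.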
