Skip to content

tdhock/PeakSegFPOP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Deprecated

Please don’t use this code! If you want to use PeakSeg for genome-wide peak calling, please use PeakSegPipeline instead. (it implements the same algorithms, but with a better user interface; it should be easier to install and use)

Old README below

This repository contains scripts for genome-wide supervised ChIP-seq peak prediction, for a single experiment type (e.g. broad H3K36me3 or sharp H3K4me3 data), jointly using multiple samples and cell types.

  • Labeling. The first step of any supervised analysis is to label a few genomic regions with and without peaks in your data. The create_track_hub.R script creates a track hub from bigWig files, which makes it easy to visualize data on the UCSC genome browser and then create labels (specific genomic regions where you have observed presence or absence of peaks in specific samples). For more details about labeling see Input Files -> Label Data below.
  • Single-sample peak calling using PeakSegFPOP. This repository includes PeakSegFPOP, a command line limited memory PeakSeg functional pruning optimal partitioning algorithm (arXiv:1703.03352), which is used to predict preliminary peaks for each sample independently.
  • Multiple-sample peak calling using PeakSegJoint. After single-sample analysis, peaks are clustered into genomic regions with overlapping peaks. In each cluster we run PeakSegJoint to find the most likely common peak positions across multiple samples – this includes a prediction of which samples are up and down for each genomic region.

https://travis-ci.org/tdhock/PeakSegFPOP.png?branch=master

Example output

Broad H3K36me3 data: Train on 21 immune cell samples, genome-wide prediction on 29 samples of several cell types.

Sharp H3K4me3 data: train on 27 immune cell samples, genome-wide prediction on 37 samples of several cell types.

test_demo.R data: train on four labeled samples (two H3K36me3 + two Input), predict on 8 samples in a subset of chr10

Installation

The following commands can be used to install all the required dependencies on Ubuntu precise (linux.x86_64 architecture). If you have another system, make sure to install R, BerkeleyDB STL, and UCSC tools (bigWigToBedGraph, bedToBigBed). Then execute packages.R to install all required R packages.

sudo aptitude install build-essential r-recommended libdb6.0++-dev libdb6.0-stl-dev bedtools
export BIN=~/bin # or any other directory on your $PATH
curl -L http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph -o $BIN/bigWigToBedGraph
curl -L http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed -o $BIN/bedToBigBed
chmod u+x $BIN/bigWigToBedGraph $BIN/bedToBigBed
make
cp PeakSegFPOP $BIN
Rscript packages.R

BerkeleyDB is used by PeakSegFPOP to write a large temporary file to disk. If you don’t have root or aptitude, then you can always install BerkeleyDB by hand to your home directory

wget http://download.oracle.com/berkeley-db/db-6.2.23.NC.tar.gz
tar xf db-6.2.23.NC.tar.gz
cd db-6.2.23.NC/build_unix
../dist/configure --prefix=$HOME --enable-stl
make
make install

If BerkeleyDB is installed to your home directory or some other non-standard directory, then make sure to edit the Makefile to tell the compiler where to find it.

g++ -std=c++0x -o PeakSegFPOP PeakSegFPOPLog.cpp funPieceListLog.cpp -I$HOME/include -L$HOME/lib -ldb_stl -Wl,-rpath=$HOME/lib

Once everything has been installed, you can test your installation by executing test_demo.R. It will first download some bigWigs and labels to the test/demo directory, then run pipeline.R on them. If everything worked, you can view the results by opening test/demo/index.html in a web browser, and it should be the same as the results shown on http://cbio.mines-paristech.fr/~thocking/hubs/test/demo/

Input Files

The pipeline.R script uses PeakSegFPOP + PeakSegJoint to predict common and different peaks in multiple samples. It requires three kinds of input data:

  • coverage data under project/samples,
  • labels in project/labels,
  • genomic segmentation problems in project/problems.bed.

The last step of pipeline.R is plot_all.R which creates

  • index.html a web page which summarizes the results,
  • peaks_matrix.tsv a binary matrix (peaks x samples) in which 1 means peak and 0 means no peak.
  • peaks_summary.tsv is a table with a row for each genomic region that has a peak in at least one sample. The columns are
    • chrom, peakStart, peakEnd genomic region of peak.
    • specificity if you have labeled peaks in Input samples, the model labels each peak as either specific (few Input samples up), or non-specific (many Input samples up). If you want to filter non-specific Input peaks yourself, you can use the n.Input column, which is the number of Input samples with a peak in this region.
    • loss.diff the likelihood of the peak (larger values mean taller and wider peaks in more samples).
    • chisq.pvalue, fisher.pvalue P-Values from Chi-Squared (chisq.test) and Fisher’s exact test (fisher.test) for whether or not this peak is group-specific (lower values mean strong correlation between peak calls and groups).

To give a concrete example, let us consider the data set that is used when you run test_demo.R.

Coverage data

Each coverage data file should contain counts of aligned sequence reads at every genomic position, for one sample. These files can be in either bedGraph or bigWig format (bigWig is preferable since it is indexed and thus faster to read than bedGraph). For example test_demo.R downloads 8 files:

test/demo/samples/bcell/MS026601/coverage.bigWig
test/demo/samples/bcell_/MS010302/coverage.bigWig
test/demo/samples/Input/MS002201/coverage.bigWig
test/demo/samples/Input/MS026601/coverage.bigWig
test/demo/samples/Input_/MS002202/coverage.bigWig
test/demo/samples/Input_/MS010302/coverage.bigWig
test/demo/samples/kidney/MS002201/coverage.bigWig
test/demo/samples/kidney_/MS002202/coverage.bigWig

In the example above we have the test/demo directory which will contain all data sets, labels, and peak calls for this particular project. The samples directory contains a sub-directory for each sample group (experimental conditions or cell types, e.g. bcell or kidney). Each sample group directory should contain a sub-directory for each sample (e.g. MS002201 or MS010302). Each sample sub-directory should contain either a coverage.bedGraph or coverage.bigWig file with counts of aligned sequence reads.

Note that in this demonstration project, the groups with underscores are un-labeled samples (e.g. bcell_), and the groups without underscores are labeled samples (e.g. bcell). In real projects typically you would combine those two groups into a single labeled group, but in this project we keep them separate in order to demonstrate the prediction accuracy of the learning algorithm.

Label Data

The project/labels/*.txt files contain genomic regions with or without peaks. These labels will be used to train the peak prediction models (automatically select model parameters that yield optimal peak prediction accuracy). A quick and easy way to create labels is by visual inspection as in the McGill ChIP-seq peak detection benchmark (for details please read Hocking et al, Bioinformatics 2016).

To visually label your data first create a project directory on a webserver with project/samples/groupID/sampleID/coverage.bigWig files, then create a track hub using a command such as

Rscript create_track_hub.R project http://your.server.com/~user/path- hg19 email@domain.com

The arguments of the create_track_hub.R script are as follows:

  • The first argument project is the data directory.
  • The second argument http://your.server.com/~user/path- is the URL prefix (appended before the first argument to obtain URLs for the trackDb.txt file).
  • The third argument hg19 is the UCSC genome ID for the genomes.txt file.
  • The fourth argument email@domain.com is the email address for the hub.txt file.

If that command worked, then you should see a message Created http://your.server.com/~user/path-project/hub.txt and then you can paste that URL into My Data -> Track Hubs -> My Hubs then click Add Hub to tell the UCSC genome browser to display your data. Navigate around the genome until you have found some peaks, then add positive and negative labels in project/labels/*.txt files.

For example in test_demo.R the data set contains only one labels file,

test/demo/labels/some_labels.txt

which contains lines such as the following

chr10:33,061,897-33,162,814 noPeaks
chr10:33,456,000-33,484,755 peakStart kidney
chr10:33,597,317-33,635,209 peakEnd kidney
chr10:33,662,034-33,974,942 noPeaks

chr10:35,182,820-35,261,001 noPeaks
chr10:35,261,418-35,314,654 peakStart bcell kidney

A chunk is a group of nearby labels. In the example above there are two chunks (far apart genomic regions, separated by an empty line). The first chunk has two regions with noPeaks labels in all samples, and two regions with positive labels in kidney samples and noPeaks labels in bcell samples. The second chunk has one region with noPeaks in bcell and kidney samples, and one region with a peakStart label in bcell and kidney samples.

In general, the labels file is divided into separate chunks by empty lines. Each chunk should contain lines for several nearby genomic regions, the corresponding label (noPeaks, peakStart, peakEnd, peaks), and the sample groups to which that label should be assigned (all other groups mentioned in the labels file will receive the noPeaks label). Ideally, each chunk should contain

  • At least one label with a peak in all samples.
  • At least one label with no peaks in any samples.
  • At least one label with a peak in some samples but not others (these labels are crucial for the model to be able to learn what is a significant difference between up and down).

Visualizing labels. After having added some labels in project/labels/*.txt files, run Rscript convert_labels.R project to create project/all_labels.bed. Then when you re-run Rscript create_track_hub.R ... it will create a new hub with a track “Manually labeled regions with and without peaks” that displays the labels you have created.

Genomic segmentation problems

The last input file that you need to provide is a list of separate segmentation problems for your reference genome (regions without gaps). This file should be in BED format (e.g. hg19_problems.bed).

If you don’t use hg19, but you do use another standard genome that is hosted on UCSC, then you can use downloadProblems.R

Rscript downloadProblems.R hg38 hg38_problems.bed

If your reference genome does not exist on UCSC, you can use gap2problems.R to make a problems.bed file:

Rscript gap2problems.R yourGenome_gap.bed yourGenome_chromInfo.txt yourGenome_problems.bed

where the chromInfo file contains one line for every chromosome, and the gap file contains one line for every gap in the reference (unknown / NNN sequence). If there are no gaps in your genome, then you can use yourGenome_chromInfo.txt as a problems.bed file.

Running steps of the pipeline in parallel

Since the human genome is so large, we recommend to do model training and peak prediction in parallel. To use a PBS/qsub cluster such as Compute Canada’s guillimin, begin by editing the create_problems_all.R script to reflect your cluster configuration. Then run

cd PeakSegFPOP
Rscript convert_labels.R test/demo
Rscript create_problems_all.R test/demo

That will create problem sub-directories in test/demo/samples/*/*/problems/*. Begin model training by computing target.tsv files:

for lbed in test/demo/samples/*/*/problems/*/labels.bed;do qsub $(echo $lbed|sed 's/labels.bed/target.tsv.sh/');done

The target is the largest interval of log(penalty) values for which PeakSegFPOP returns peak models that have the minimum number of incorrect labels. The target.tsv files are used for training a machine learning model that can predict optimal penalty values, even for un-labeled samples and genome subsets. To train a model, use

Rscript train_model.R test/demo

which trains a model using test/demo/samples/*/*/problems/*/target.tsv files, and saves it to test/demo/model.RData. To compute peak predictions independently for each sample and genomic segmentation problem,

for sh in test/demo/problems/*/jointProblems.bed.sh;do qsub $sh;done

which will launch one job for each genomic segmentation problem. Each job will make peak predictions in all samples, then write test/demo/problems/*/jointProblems/* directories with target.tsv.sh and peaks.bed.sh scripts. One directory and joint segmentation problem will be created for each genomic region which has at least one sample with a predicted peak. To train a joint peak calling model, run

qsub test/demo/joint.model.RData.sh

which will compute test/demo/joint.model.RData and test/demo/jobs/*/jobProblems.bed files. To make joint peak predictions, run

for sh in test/demo/jobs/*/jobPeaks.sh;do qsub $sh;done

To gather all the peak predictions in a summary on test/demo/index.html, run

qsub test/demo/peaks_matrix.tsv.sh

Finally, you can create test/demo/hub.txt which can be used as a track hub on the UCSC genome browser:

Rscript create_track_hub.R test/demo http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP- hg19 email@domain.com

The script will create test/demo/samples/*/*/coverage.bigWig and test/demo/samples/*/*/joint_peaks.bigWig files that will be shown together on the track hub in a multiWig container (for each sample, a colored coverage profile with superimposed peak calls as horizontal black line segments).

The PeakSegFPOP command line program

The PeakSegFPOP program finds the peak positions and corresponding piecewise constant segment means which optimize the penalized Poisson likelihood.

PeakSegFPOP coverage.bedGraph penalty [tmp.db]

The first argument coverage.bedGraph is a plain text file with 4 tab-separated columns: chrom, chromStart, chromEnd, coverage (chrom is character and the others are integers). It should include data for only one chromosome, and no gap regions.

The second argument penalty is a non-negative penalty value, for example 0, 0.1, 1e3, or Inf.

The third argument tmp.db is optional. It is the path for a temporary file which takes O(N log N) disk space (N = number of lines in coverage.bedGraph). In practice you can expect the size of the temporary file and the computation time to be as in the table below. Min and max values show the variation over several values of the penalty parameter (larger penalties require more time and disk space), on an Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz.

Nmin(MB)max(MB)min(time)max(time)
1000012431 sec2 sec
10000018962712 sec25 sec
1000000346271483 min5 min
713595650424169518 min56 min
780608252703342535 min167 min

For a single run with penalty parameter X, the PeakSegFPOP program outputs two files. The coverage.bedGraph_penalty=X_segments.bed file has one line for each segment, and the following tab-separated columns: chrom, chromStart, chromEnd, segment.type, segment.mean. The coverage.bedGraph_penalty=X_loss.tsv has just one line and the following tab-separated columns:

  • penalty input penalty parameter.
  • segments number of segments in the optimal model.
  • peaks number of peaks in the optimal model.
  • bases number of bases in the bedGraph file.
  • mean.pen.cost mean penalized Poisson loss.
  • total.cost total un-penalized Poisson loss. The following equation should hold for all data sets and penalty parameters: (total.cost + penalty * peaks)/bases = mean.pen.cost
  • status is the optimal model feasible for the PeakSeg problem with strict inequality constraints? If infeasible, then there is at least one pair of adjacent segment means which are equal (and there is no optimal solution to the problem with strict inequality constraints).
  • mean.intervals mean count of intervals (Poisson loss function pieces) over all the 2*N cost function models computed by the algorithm.
  • max.intervals maximum number of intervals.

Related work

An in-memory implementation of PeakSegFPOP is available in the coseg R package.

implementationtimememorydisk
command lineO(N log N)O(log N)O(N log N)
R pkg cosegO(N log N)O(N log N)0

Note that although both implementations are O(N log N) time complexity for N data points, the command line program is slower due to disk read/write overhead.

About

Command line limited memory PeakSeg functional pruning optimal partitioning algorithm

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •