hunter-cameron
Contributor

My lab wanted this addition so that we could BLAST the genes from our bins that were identified as marker genes.

It looks similar to what I think issue #22 is requesting, so I thought I'd upload my code. Feel free to pull it into the main branch if this is what you had in mind. The genes could be sorted by contig and by position on the contig to produce an ordered FASTA of marker genes. This isn't currently done for speed reasons, but it could be a good complement to a graphic of the dispersion of marker genes across the contigs. My lab is also interested in how evenly the markers are dispersed over the genome bin, to make sure we haven't just happened to assemble a marker-dense region that makes the bin look more complete than it actually is. So I may work on a dispersion graphic at some point in the future.

The commit message for 8ee5f90 describes the addition in more detail:

Added the ability to print a FASTA file (amino acid) that includes the sequences
of the genes that were identified as marker genes in each bin. Feature is
implemented in the qa subprogram as a new output format (number 10).

Each sequence is printed with an informative header that takes the form:

>$bin_id $contig_name geneId=$gene_num;start=$start;end=$end;strand=$strand marker=$marker;mstart=$mstart;mend=$mend

Where $ denotes a variable.

	Variables:
	- bin_id: FASTA bin the sequence comes from
	- contig_name: contig the sequence comes from
	- gene_num: gene number as assigned by Prodigal
	- start: gene start position on contig (nucleotide)
	- end: gene end position on contig (nucleotide)
	- strand: strand the gene is on, 1 or -1
	- marker: name of the marker that was identified as a match to the gene
	- mstart: start position on gene of alignment with marker (amino acid)
	- mend: end position on gene of alignment with marker (amino acid)
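
For anyone consuming this output downstream, here is a minimal sketch of how the header could be parsed back into its fields and how the records could then be ordered by contig and position, as suggested above. It is not part of CheckM: the names `MarkerGeneHeader`, `parse_header`, and `sort_by_position` are hypothetical, and the sketch assumes that `bin_id` and `contig_name` contain no whitespace and that the header has exactly the four space-separated fields shown in the format line.

```python
from typing import Dict, Iterable, List, NamedTuple


class MarkerGeneHeader(NamedTuple):
    """Fields of one marker-gene FASTA header (hypothetical helper type)."""
    bin_id: str
    contig_name: str
    gene_num: str  # kept as text; the exact Prodigal numbering format is not assumed
    start: int
    end: int
    strand: int
    marker: str
    mstart: int
    mend: int


def _parse_pairs(field: str) -> Dict[str, str]:
    """Split a 'key1=value1;key2=value2' field into a dictionary."""
    return dict(pair.split("=", 1) for pair in field.split(";"))


def parse_header(line: str) -> MarkerGeneHeader:
    """Parse one header line of the form described in the commit message.

    Assumes bin_id and contig_name contain no whitespace.
    """
    line = line.strip()
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    parts = line[1:].split()
    if len(parts) != 4:
        raise ValueError("expected 4 space-separated fields, got %d" % len(parts))
    bin_id, contig_name, gene_field, marker_field = parts
    gene = _parse_pairs(gene_field)
    marker = _parse_pairs(marker_field)
    return MarkerGeneHeader(
        bin_id=bin_id,
        contig_name=contig_name,
        gene_num=gene["geneId"],
        start=int(gene["start"]),
        end=int(gene["end"]),
        strand=int(gene["strand"]),
        marker=marker["marker"],
        mstart=int(marker["mstart"]),
        mend=int(marker["mend"]),
    )


def sort_by_position(headers: Iterable[MarkerGeneHeader]) -> List[MarkerGeneHeader]:
    """Order records by bin, then contig name, then gene start on the contig."""
    return sorted(headers, key=lambda h: (h.bin_id, h.contig_name, h.start))
```

Contig names are compared as plain strings here, so a natural-sort key would be needed if contigs should be ordered numerically (for example `contig_2` before `contig_10`).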
@donovan-h-parks
Contributor

Thanks Cameron,

I have added this to the main branch. I hope to release a new version of CheckM in the next few weeks. I've also added you as a contributor on the CheckM wiki. I very much appreciate the contribution.

Cheers,
Donovan

donovan-h-parks added a commit that referenced this pull request Jun 8, 2015
@donovan-h-parks donovan-h-parks merged commit c2ca2ef into Ecogenomics:master Jun 8, 2015
@hunter-cameron
Contributor Author

Sounds good!
Thanks for such a useful tool; we use it to check all of our genomes.
Hunter
