Description
@benedictpaten wants Giraffe to be "fully simple". This has two elements, as I understand it:
- We should have standards and specs for the input formats that we want from HPRC and that we will run on.
- We should have a pipeline suitable for a BME230 (2nd year graduate course) student to run Giraffe on an input (extended) GFA, without them complaining that they need a giant-memory machine, or that it takes forever to go through all the connected components, or that they can't understand how to build the indexes from the tool help.
So far, we have identified three things we need to do to set this up:
- Support some form or forms of GFA-with-haplotypes as a single-file input format for indexing for Giraffe. We should be able to go straight from this (blunt but otherwise unprocessed file, possibly with large nodes or string/0 node names) to the indexes Giraffe needs. We want to take this as input for a new tool-oriented indexing approach (maybe `vg index --giraffe`), with help that points to that and doesn't overwhelm the user with 15 different possible index formats they could make. This would require a bit of workflow work so that the one command can decide the right number of connected components to build indexes for in memory at a time, without running out of memory or taking forever in serial.
- Combine the GBWT and GBWTGraph into one file. This would reduce the number of files that Giraffe needs to three (GBWT/graph, distance index, minimizer index). @jltsiren thinks this will be straightforward.
- Abstract away node chopping and node boundaries, with some kind of system and specified format for translating between GFA coordinates and chopped-graph, numbered-node vg coordinates. The new grad students and other new users keep complaining that vg has chopped up their graph and that the node IDs are "wrong". If we have a way to tell them the coordinate translation we are using (or maybe even to output mappings in the original GFA coordinates?), they will hopefully be mollified. Eventually, we might like a way for the HandleGraph API to let you ignore where node boundaries fall.
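To make the coordinate-translation idea concrete, here is a rough sketch of what such a translation table could look like. It is not how vg does or will do it: the class name, the method names, and the ID assignment scheme (consecutive integers starting at 1) are all assumptions made up for illustration.

```python
from bisect import bisect_right


class ChopTranslation:
    """Hypothetical table mapping GFA segment coordinates to chopped-node
    coordinates and back. Not an actual vg data structure."""

    def __init__(self, segments, max_node_length):
        # segments: dict of GFA segment name -> sequence (assumed non-empty)
        self.by_segment = {}  # name -> (node_ids, node_starts, segment_length)
        self.by_node = {}     # node_id -> (name, start_in_segment, node_length)
        next_id = 1           # assumed ID scheme: consecutive integers from 1
        for name, seq in segments.items():
            starts = list(range(0, len(seq), max_node_length))
            ids = list(range(next_id, next_id + len(starts)))
            next_id += len(starts)
            self.by_segment[name] = (ids, starts, len(seq))
            for node_id, start in zip(ids, starts):
                node_length = min(max_node_length, len(seq) - start)
                self.by_node[node_id] = (name, start, node_length)

    def gfa_to_node(self, name, offset):
        """(segment name, offset) -> (node ID, offset within that node)."""
        ids, starts, length = self.by_segment[name]
        if not 0 <= offset < length:
            raise IndexError(f"offset {offset} outside segment {name!r}")
        # Find the last chopped node that starts at or before the offset.
        i = bisect_right(starts, offset) - 1
        return ids[i], offset - starts[i]

    def node_to_gfa(self, node_id, node_offset):
        """(node ID, offset within node) -> (segment name, offset)."""
        name, start, node_length = self.by_node[node_id]
        if not 0 <= node_offset < node_length:
            raise IndexError(f"offset {node_offset} outside node {node_id}")
        return name, start + node_offset


# A 10 bp segment chopped at 4 bp becomes nodes 1, 2, 3; segment "0" becomes node 4.
table = ChopTranslation({"s1": "ACGTACGTAC", "0": "GG"}, max_node_length=4)
print(table.gfa_to_node("s1", 5))
print(table.node_to_gfa(3, 1))
```

With a table like this written out next to the indexes, a mapping reported against a numbered node could be converted back to the segment name and offset the user had in their GFA.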
@ekg What do you think? We also probably would need `vg index --map` and `vg index --mpmap`, right? And as for GFA-with-haplotypes, do we have a consensus on what style or styles we should accept for paths with haplotype semantics?
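For the "decide the right number of connected components to build indexes for in memory at a time" part of the first item, one possible approach is a greedy first-fit-decreasing packing of components into batches under a memory budget. This is only a sketch: the per-component memory estimates are assumed inputs (vg does not provide them as far as this issue says), and `plan_batches` is a made-up name.

```python
def plan_batches(component_costs, memory_budget):
    """Group connected components into batches whose estimated memory use
    fits within memory_budget (first-fit decreasing).

    component_costs: dict of component name -> estimated peak memory,
    in the same (assumed) unit as memory_budget.
    Returns a list of batches, each a list of component names.
    """
    batches = []  # each batch: {"free": remaining budget, "names": [...]}
    # Place the largest components first so small ones fill the gaps.
    ordered = sorted(component_costs.items(), key=lambda kv: kv[1], reverse=True)
    for name, cost in ordered:
        if cost > memory_budget:
            raise ValueError(f"component {name!r} alone exceeds the budget")
        for batch in batches:
            if batch["free"] >= cost:
                batch["names"].append(name)
                batch["free"] -= cost
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append({"free": memory_budget - cost, "names": [name]})
    return [batch["names"] for batch in batches]


# Hypothetical estimates in GB, with a 10 GB budget.
costs = {"chr1": 8, "chr2": 7, "chr3": 3, "chrM": 1}
print(plan_batches(costs, memory_budget=10))
```

Components within a batch could then be indexed in parallel, with the batches themselves run one after another, which trades off between running out of memory and going fully serial.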