Description
With the implementation of the new data structure (#327), it has become rather tricky to get information
from the AIRR rearrangement schema (e.g. what is the "c_call" of the primary "VJ" chain?).
Previously this was possible simply with
adata.obs["IR_VJ_1_c_call"]
With the new data structure,
adata[:, "c_call"].X
just yields an awkward array with a variable number of chains per cell. The information about which
chain is VJ/VDJ and primary or secondary is hidden away in adata.obsm.
This motivates implementing the long-proposed easy-access getter/setter functions. At the very least, they should
- retrieve AIRR rearrangement variables for a certain chain.
Possibly they could also include convenience functions, e.g.
- to retrieve the most abundant categories (previously discussed in #51, "Straightforward way for getting most abundant categories").
The latter is less important, but the two interfaces need to be designed jointly, so it is also a topic of this issue.
To get AIRR data, we need something like
ir.get(adata, "locus", "VJ_1") -> pd.Series
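To make the idea concrete, here is a minimal sketch of what such a getter might do internally. It does not use the real data structure: plain Python dicts and lists stand in for the awkward array and for the chain indices in adata.obsm, and the names `get_airr`, `chains` and `chain_indices` are hypothetical, chosen for illustration only.

```python
import pandas as pd


def get_airr(chains, chain_indices, airr_variable, chain):
    """Hypothetical getter: one AIRR variable for one chain, per cell.

    chains:        {cell_id: [chain_record, ...]}, a variable number of chains per cell
                   (stand-in for the awkward array)
    chain_indices: {cell_id: {"VJ": [primary, secondary], "VDJ": [primary, secondary]}}
                   where each entry is an index into chains[cell_id] or None
                   (stand-in for the information stored in adata.obsm)
    chain:         e.g. "VJ_1" (primary VJ chain) or "VDJ_2" (secondary VDJ chain)
    """
    receptor_arm, rank = chain.rsplit("_", 1)  # "VJ_1" -> ("VJ", "1")
    position = int(rank) - 1  # 1-based rank -> 0-based list position
    values = {}
    for cell_id, cell_chains in chains.items():
        idx = chain_indices[cell_id][receptor_arm][position]
        # Cells without the requested chain get a missing value
        values[cell_id] = None if idx is None else cell_chains[idx].get(airr_variable)
    return pd.Series(values, name=f"{chain}_{airr_variable}", dtype="object")


# Toy data: cell1 has two chains, cell2 has one, cell3 has none
chains = {
    "cell1": [{"locus": "TRA", "c_call": "TRAC"}, {"locus": "TRB", "c_call": "TRBC1"}],
    "cell2": [{"locus": "TRB", "c_call": "TRBC2"}],
    "cell3": [],
}
chain_indices = {
    "cell1": {"VJ": [0, None], "VDJ": [1, None]},
    "cell2": {"VJ": [None, None], "VDJ": [0, None]},
    "cell3": {"VJ": [None, None], "VDJ": [None, None]},
}

vj_1_c_call = get_airr(chains, chain_indices, "c_call", "VJ_1")
vdj_1_locus = get_airr(chains, chain_indices, "locus", "VDJ_1")
```

The result is indexed by cell, so it could be assigned to a column of adata.obs, which would recover something close to the old `IR_VJ_1_c_call` access pattern.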
We regularly need to get the top n of a category, e.g. v-gene or clonotype.
This can be achieved with a pandas one-liner, if one knows how to do it...
# This is probably hacky, we might think about a better way, but we need the most abundant clonotypes
top_clonotypes = adata.obs.clonotype.value_counts()[:8].index.tolist() # A better way might be needed, especially to take normalization into account
top_vgenes = adata.obs.TRB_1_v_gene.value_counts()[:8].index.tolist()
It would be more user-friendly to have a convenience function for this,
for instance:
ir.tl.top_n(col="clonotype", n=10)
Use it for plotting:
sc.pl.umap(adata, color="clonotype", groups=ir.tl.top_n("clonotype", 10))
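A possible shape for such a helper, as a rough sketch only: the function below wraps the pandas one-liner from above and operates on a plain DataFrame standing in for adata.obs. The name and signature are assumptions for illustration, not an existing scirpy API, and normalization is deliberately left out.

```python
import pandas as pd


def top_n(obs, col, n=10):
    """Hypothetical helper: the n most abundant categories in obs[col].

    value_counts() sorts by descending frequency and drops missing values
    by default, so cells without an annotation are ignored.
    """
    return obs[col].value_counts().head(n).index.tolist()


# Toy stand-in for adata.obs
obs = pd.DataFrame(
    {"clonotype": ["c1", "c1", "c1", "c2", "c2", "c3", None]}
)

top_clonotypes = top_n(obs, "clonotype", n=2)
```

The returned list can be passed directly as the `groups` argument in the plotting call above.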
To discuss
- retrieve multiple columns at once / vectorization?
- what about "extra" chains?
- plotting