Description of feature
In the course of implementing the new data structure (#327), I plan to make MuData the default way
of interacting with paired single-cell gene expression/AIRR data.
I'm thinking about how the API should be adapted for this.
Data structure recap
We are talking about a MuData object that looks like this:
MuData object with n_obs × n_vars = 3000 × 30727
  2 modalities
    gex: 3000 x 30727
      obs: 'cluster_orig', 'patient', 'sample', 'source'
      uns: 'cluster_orig_colors'
      obsm: 'X_umap_orig'
    airr: 3000 x 0
      obs: 'high_confidence', 'is_cell', 'clonotype_orig'
      obsm: 'airr', 'chain_indices'
The gex modality contains the gene expression data, the airr modality the receptor data. The airr modality has no .X; the relevant data are stored in .obsm.
- Most scirpy functions only operate on the airr modality.
- Some functions use both airr and gex data.
- For visualization, it is useful to plot airr.obs on top of gex embeddings, or use columns from both gex.obs and airr.obs in a single plot.
Since the airr modality only has obs and obsm, it would be conceivable to (additionally) support a single AnnData object with gene expression data in .X and receptor data in .obsm.
API consideration for unimodal data
(i.e. scirpy functions that only use the airr modality)
1. For a function that only operates on the AIRR data, what is the preferred way to interact with MuData?
ir.tl.chain_qc(mdata, airr_key="airr", **kwargs)
or
ir.tl.chain_qc(mdata['airr'], **kwargs)
2. Should a function that only operates on the AIRR data add columns to mdata or adata?
import numpy as np

def chain_qc(mdata, airr_key="airr", **kwargs):
    adata = mdata[airr_key]
    adata.obs["new_col"] = np.zeros((adata.n_obs,))
    # should this be called by the function automatically?
    mdata.update_obs()
3. Use muon for plotting or scanpy?
Is it preferable to call
mu.pl.umap(mdata, color="gex:cluster")
or
sc.pl.umap(mdata['gex'], color="cluster")
If the former, is there a recommended way to transfer .obsm from the GEX AnnData to MuData (similar to update_obs for .obs)?
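To make question 2 concrete, here is a minimal, library-free sketch of the "write to the modality, then propagate" pattern. ToyAnnData and ToyMuData are hypothetical stand-ins, not the real anndata/mudata classes, and the toy update_obs only imitates the "modality:column" naming used in the examples above; it is not meant to reproduce the actual MuData implementation.

```python
class ToyAnnData:
    """Hypothetical stand-in for AnnData: `.obs` is a plain dict of columns."""

    def __init__(self, n_obs):
        self.n_obs = n_obs
        self.obs = {}


class ToyMuData:
    """Hypothetical stand-in for MuData: a container of named modalities."""

    def __init__(self, mod):
        self.mod = mod
        self.obs = {}

    def __getitem__(self, key):
        return self.mod[key]

    def update_obs(self):
        # Copy every modality column to the top level as "<modality>:<column>".
        for name, adata in self.mod.items():
            for col, values in adata.obs.items():
                self.obs[f"{name}:{col}"] = list(values)


def chain_qc(mdata, airr_key="airr"):
    """Write a result column to the AIRR modality and propagate it upwards."""
    adata = mdata[airr_key]
    adata.obs["new_col"] = [0.0] * adata.n_obs
    # Assumption for this sketch: the function triggers the update itself.
    mdata.update_obs()


mdata = ToyMuData({"gex": ToyAnnData(3), "airr": ToyAnnData(3)})
chain_qc(mdata)
print(sorted(mdata.obs))
```

With this variant the new column is reachable both as new_col on the modality and as airr:new_col on the container, at the cost of an implicit update on every call.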
API considerations for multimodal data
(i.e. functions that consume both the airr and gex modalities)
I have a function that depends on a gene expression neighborhood graph and .obs annotations based on AIRR data.
API options
- pass both modalities (probably not)
ir.tl.clonotype_modularity(mdata['gex'], mdata['airr'], airr_col="clone_id")
- pass mdata and mod_keys
ir.tl.clonotype_modularity(mdata, gex_mod="gex", airr_col="airr:clone_id")
- Store the gene expression neighborhood graph in mudata
# is there something like mdata.update_obsm()?
mdata.obsp["connectivities"] = mdata["gex"].obsp["connectivities"]
ir.tl.clonotype_modularity(mdata, airr_col="airr:clone_id")
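The last two options rely on column specifiers of the form "airr:clone_id". A small helper along the following lines could resolve such a specifier into a modality and a column name. The name split_mod_col and the default_mod fallback are assumptions made for illustration; neither is part of scirpy or muon.

```python
def split_mod_col(spec, default_mod=None):
    """Split a "modality:column" specifier into its two parts.

    A specifier without a prefix falls back to `default_mod`.
    Only the first colon is treated as the separator.
    """
    mod, sep, col = spec.partition(":")
    if not sep:
        # No colon found: the whole string is the column name.
        return default_mod, spec
    return mod, col


print(split_mod_col("airr:clone_id"))
print(split_mod_col("clone_id", default_mod="airr"))
```

Splitting on the first colon only keeps column names that themselves contain a colon usable, as long as they are always given with an explicit modality prefix.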
Possible solution
I'm leaning towards having all functions operate on MuData directly, i.e.
ir.tl.something(mdata, airr_key="airr", col="airr:xxx")
with the option to also pass an AnnData object for backwards compatibility (in that case, airr_key will be ignored):
ir.tl.something(adata, col="xxx")
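One way this dual interface might be wired up is sketched below. The classes are again hypothetical stand-ins for AnnData and MuData, and get_airr and something are illustrative names rather than existing scirpy functions.

```python
class ToyAnnData:
    """Hypothetical stand-in for AnnData: `.obs` is a plain dict of columns."""

    def __init__(self, obs):
        self.obs = obs


class ToyMuData:
    """Hypothetical stand-in for MuData: a container of named modalities."""

    def __init__(self, mod):
        self.mod = mod

    def __getitem__(self, key):
        return self.mod[key]


def get_airr(data, airr_key="airr"):
    """Return the AIRR table, whichever container type was passed."""
    if isinstance(data, ToyMuData):
        return data[airr_key]
    # A single-object input already is the AIRR table; airr_key is ignored.
    return data


def something(data, airr_key="airr", col="xxx"):
    adata = get_airr(data, airr_key)
    # Accept both "airr:xxx" and "xxx" by dropping an optional prefix.
    _, _, bare_col = col.rpartition(":")
    return adata.obs[bare_col]


adata = ToyAnnData({"xxx": [1, 2]})
mdata = ToyMuData({"airr": adata})
print(something(mdata, airr_key="airr", col="airr:xxx"))
print(something(adata, col="xxx"))
```

Keeping the dispatch in one helper means each public function only needs a single extra line to support both input types.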