MuData API considerations #383

@grst

Description of feature

In the course of implementing the new data structure (#327), I plan to make MuData the default way
of interacting with paired single-cell gene expression/AIRR data.

I'm thinking about how the API should be adapted for this.

Data structure recap

We are talking about a MuData object that looks like this:

MuData object with n_obs × n_vars = 3000 × 30727
  2 modalities
    gex:	3000 x 30727
      obs:	'cluster_orig', 'patient', 'sample', 'source'
      uns:	'cluster_orig_colors'
      obsm:	'X_umap_orig'
    airr:	3000 x 0
      obs:	'high_confidence', 'is_cell', 'clonotype_orig'
      obsm:	'airr', 'chain_indices'

The gex modality contains the gene expression data, the airr modality the
receptor data. The airr modality has no .X; the relevant data are stored in .obsm.

  • Most scirpy functions only operate on the airr modality.
  • Some functions use both airr and gex data.
  • For visualization, it is useful to plot airr.obs on top of gex embeddings, or use columns from both gex.obs and airr.obs in a single plot.

Since the airr modality only has obs and obsm, it would also be conceivable to
support a single AnnData object with gene expression data in .X and receptor data in .obsm.

API consideration for unimodal data

(i.e. scirpy functions that only use the airr modality)

1. For a function that only operates on the AIRR data, what is the preferred way to interact with MuData?

ir.tl.chain_qc(mdata, airr_key="airr", **kwargs)

or

ir.tl.chain_qc(mdata['airr'], **kwargs)

2. Should a function that only operates on the AIRR data add columns to mdata or adata?

import numpy as np

def chain_qc(mdata, airr_key="airr", **kwargs):
    # operate on the AIRR modality only
    adata = mdata[airr_key]
    adata.obs["new_col"] = np.zeros(adata.n_obs)
    # should the function call this automatically to sync mdata.obs?
    mdata.update_obs()

3. Use muon for plotting or scanpy?

Is it preferable to call

mu.pl.umap(mdata, color="gex:cluster")

or

sc.pl.umap(mdata['gex'], color="cluster")

If the former, is there a recommended way to transfer .obsm from the GEX AnnData to MuData (similar to update_obs for .obs)?

API considerations for multimodal data

(i.e. functions that consume both the airr and gex modalities)

I have a function that depends on a gene expression neighborhood graph and .obs annotations based on AIRR data.

API options

  1. pass both modalities (probably not)
    ir.tl.clonotype_modularity(mdata['gex'], mdata['airr'], airr_col="clone_id")
  2. pass mdata and mod_keys
    ir.tl.clonotype_modularity(mdata, gex_mod="gex", airr_col="airr:clone_id")
  3. Store the gene expression neighborhood graph in mudata
    # is there something like mdata.update_obsm() ? 
    mdata.obsp["connectivities"] = mdata["gex"].obsp["connectivities"]
    ir.tl.clonotype_modularity(mdata, airr_col="airr:clone_id")
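
Options 2 and 3 both rely on keys of the form "airr:clone_id", i.e. a modality name and a column name joined by a colon. The following is only a rough sketch of how such a key could be resolved. The helper names `split_mod_key` and `resolve_obs_column` are hypothetical and not part of scirpy or mudata, and plain dicts stand in for the per-modality .obs tables so that the example runs without any dependencies.

```python
def split_mod_key(key, default_mod=None):
    """Split 'airr:clone_id' into ('airr', 'clone_id').

    Keys without a 'modality:' prefix are assigned to `default_mod`.
    (Hypothetical helper, for illustration only.)
    """
    mod, sep, col = key.partition(":")
    if not sep:
        # no colon in the key: fall back to the default modality
        return default_mod, key
    return mod, col


def resolve_obs_column(modalities, key, default_mod="airr"):
    """Look up a column in a dict of {modality name: {column name: values}}.

    The nested dicts are stand-ins for the .obs tables of each modality.
    """
    mod, col = split_mod_key(key, default_mod)
    try:
        return modalities[mod][col]
    except KeyError:
        raise KeyError(f"column {col!r} not found in modality {mod!r}") from None


# stand-in data: two modalities, each with one obs column
modalities = {
    "gex": {"cluster": ["a", "b"]},
    "airr": {"clone_id": ["c1", "c2"]},
}
clone_ids = resolve_obs_column(modalities, "airr:clone_id")
clusters = resolve_obs_column(modalities, "gex:cluster")
```

With a default modality, unprefixed keys such as "clone_id" would keep working, which may make it easier to support the same function signature for both MuData and plain AnnData input.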

Possible solution

I'm leaning towards having all functions operate on MuData directly,
i.e.

ir.tl.something(mdata, airr_key="airr", col="airr:xxx")

with the option to also pass an AnnData object for backwards compatibility (in that case, airr_key will be ignored).

ir.tl.something(adata, col="xxx")
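
A minimal sketch of how this dispatch might look, assuming the input type is detected via the presence of a `.mod` attribute (a real implementation could just as well use an `isinstance` check against MuData). `FakeAnnData`, `FakeMuData`, `get_airr` and the body of `something` are hypothetical stand-ins so that the example runs without anndata or mudata installed; they are not the actual classes or scirpy functions.

```python
class FakeAnnData:
    """Stand-in for an AnnData object: only carries an obs mapping."""

    def __init__(self, obs=None):
        self.obs = dict(obs or {})


class FakeMuData:
    """Stand-in for a MuData object: modalities in .mod, accessible as mdata[key]."""

    def __init__(self, mod):
        self.mod = dict(mod)

    def __getitem__(self, key):
        return self.mod[key]


def get_airr(data, airr_key="airr"):
    """Return the object that holds the receptor data.

    MuData-like input (has .mod): return data[airr_key].
    Anything else is treated as AnnData-like and returned as is,
    in which case airr_key is ignored.
    """
    if hasattr(data, "mod"):
        if airr_key not in data.mod:
            raise KeyError(f"modality {airr_key!r} not found")
        return data[airr_key]
    return data


def something(data, airr_key="airr"):
    """Toy function that writes a column to the AIRR modality's obs."""
    adata = get_airr(data, airr_key)
    adata.obs["new_col"] = 0
    return adata


airr = FakeAnnData()
mdata = FakeMuData({"gex": FakeAnnData(), "airr": airr})

from_mdata = something(mdata, airr_key="airr")  # multimodal input
from_adata = something(FakeAnnData())           # backwards-compatible input
```

Centralising the lookup in one helper would keep the backwards-compatibility logic out of the individual tool functions.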
