Additional modes (surface proteins, methylation, …) #237

@gokceneraslan


In cell hashing or CITE-seq datasets, an additional count matrix is produced that stores how many ADT/HTO barcodes a cell has (see https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-018-1603-1 for details). Depending on the experiment type, we then use these barcodes either to demultiplex cells into their original "sources" (HTO case) or to quantify protein expression (ADT case). Here is the PBMC HTO file from the Seurat tutorial (rows are sample names, i.e. HTO barcode IDs; columns are cells):

[image: PBMC HTO count matrix]

Link to file: https://www.dropbox.com/sh/c5gcjm35nglmvcv/AABGz9VO6gX9bVr5R2qahTZha?dl=0&preview=pbmc_hto_mtx.rds

Right now, we can store these counts in adata.obsm, as @wflynny suggested. However, adata.obsm has no place to store the barcode strings (or protein names or sample names, depending on what the barcodes represent). One hack would be to store them in adata.uns, but that'd be very ugly. Alternatively, one can store everything in adata.obs, but that'd pollute obs and ignore the multivariate nature of the barcodes.

What would be a good solution here? Would it make sense to add "column names" to adata.obsm entries? Since they're currently stored as numpy arrays this seems infeasible, but how much effort would it be to store them as dataframes (like obs) instead of matrices? Alternatively, sc.get might allow us to access a group of columns in adata.obs for convenience.
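As a rough illustration of the dataframe idea, here is a sketch in plain pandas, independent of AnnData, of what a labeled cell-by-barcode table could look like. The cell IDs, sample names, and counts are made up, and the argmax assignment is only a naive stand-in for real HTO demultiplexing.

```python
import numpy as np
import pandas as pd

# Made-up cell IDs and HTO sample names, for illustration only.
cell_ids = ["cell_0", "cell_1", "cell_2"]
hto_names = ["HTO_A", "HTO_B"]

counts = np.array([[120, 3],
                   [5, 98],
                   [2, 110]])

# A DataFrame keeps the barcode names attached to the counts as columns,
# with cells as the index (the same orientation as obs).
hto_df = pd.DataFrame(counts, index=cell_ids, columns=hto_names)

# Columns can be selected by name instead of by position.
hto_b = hto_df["HTO_B"]

# Naive demultiplexing: assign each cell to its most abundant HTO.
assigned = hto_df.idxmax(axis=1)
```

With labels attached, the multivariate block stays together as one object and can still be addressed by barcode name, without spreading individual columns across obs.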

Let me know what you think. (Btw, related discussion: scverse/scanpy#351)
