To accommodate datasets that exceed available memory and CPU
resources, VISION creates pooled cells using a custom micropooling
algorithm. In standard use cases, the users provide the entire
expression matrix and VISION will create pooled cells with the
user-defined cellsPerPartition argument as supplied to the
VISION object constructor. Analysis will resume on the pooled data
matrix, where each pooled cell has, on average,
cellsPerPartition cells per pool. This significantly
reduces the run time, producing results from 500K cells in just under
around an hour. We also find that the local autocorrelation scores run
in the latent space as computed from the pooled cells is consistent with
that when these scores are computed from the entire dataset.
The micropooling algorithm relies on iterative clustering of the
latent space. Specifically, given an expression matrix and a
cellsPerPartition target size, we use the following
procedure:
cellsPerPartition cells per
cluster.By default, VISION does not allow micropooling within discrete classes (e.g. cell types, or batch) as we find that unbiased clusterings are the most sensible and reflective of the data. Instead, in the output report, pooled cells report their percent consistency of each discrete meta data item (e.g. a pooled cell is 80% CD4+ T cell and 20% CD8+ T cell). In this way, users may develop an intuition of the phenotype of each pooled cell. To note, users may also elect to perform clusterings outside of the VISION analysis pipeline and provide it to the VISION object constructor as pooled data - this allows users to perform micorclustering within discrete classes if they so wish.
In this vignette, we’ll be analyzing a set of ~5,000 cells during haematopoiesis (Tusi et al, Nature 2018).
Here, we’ll demonstrate a common use case of micropooling on the data:
library(VISION)
counts = as.matrix(read.table("data/hemato_counts.csv.gz", sep=',', header=T, row.names=1))
# compute scaled counts
scale.factor = median(colSums(counts))
scaled.counts = t(t(counts) / colSums(counts)) * scale.factor
# read in meta data
meta = read.table("data/hemato_covariates.txt.gz", sep='\t', header=T, row.names=1)
meta = meta[colnames(scaled.counts), -1]
vis <- Vision(scaled.counts, meta = meta
signatures = c("data/h.all.v5.2.symbols.gmt"),
pool = T, cellsPerPartition = 5)
vis <- analyze(vis)
viewResults(vis)There may be reasons it is desired micropool outside of the main
VISION analyze() function. For example, there may be
discrete classes that users may not wish to pool across (e.g. not
wanting to merge disease and control cells). Or, you may want to only
use the micropooling function independently of the rest of the pipeline.
This can be accomplished using the applyMicroClustering()
utility function as demonstrated here.
## Create scaled.counts and meta as shown above
pools <- applyMicroClustering(scaled.counts, cellsPerPartition = 5)
# Create pooled versions of expression matrix and meta-data
pooledExpression <- poolMatrixCols(scaled.counts, pools)
pooledMeta <- poolMetaData(meta, pools)
# These could then be fed into the VISION constructor
vis <- Vision(pooledExpression, meta = pooledMeta
signatures = c("data/h.all.v5.2.symbols.gmt"))Alternately, if you already have a latent space computed and wish to
use this (instead of PCA) when pooling outside of the normal
Vision pipeline, you can provide it when calling
applyMicroClustering() and later pool the latent space.
This is demonstrated below:
## Create scaled.counts and meta as shown above
pools <- applyMicroClustering(scaled.counts,
cellsPerPartition = 5, latentSpace = latentPpace)
# Create pooled versions of expression matrix and meta-data
pooledExpression <- poolMatrixCols(scaled.counts, pools)
pooledMeta <- poolMetaData(meta, pools)
pooledLatentSpace <- poolMatrixRows(latentSpace, pools)
# These could then be fed into the VISION constructor
vis <- Vision(pooledExpression, meta = pooledMeta
signatures = c("data/h.all.v5.2.symbols.gmt"),
latentSpace = pooledLatentSpace)