Introduction

To accommodate datasets that exceed available memory and CPU resources, VISION creates pooled cells using a custom micropooling algorithm. In standard use cases, the users provide the entire expression matrix and VISION will create pooled cells with the user-defined cellsPerPartition argument as supplied to the VISION object constructor. Analysis will resume on the pooled data matrix, where each pooled cell has, on average, cellsPerPartition cells per pool. This significantly reduces the run time, producing results from 500K cells in just under around an hour. We also find that the local autocorrelation scores run in the latent space as computed from the pooled cells is consistent with that when these scores are computed from the entire dataset.

The Algorithm

The micropooling algorithm relies on iterative clustering of the latent space. Specifically, given an expression matrix and a cellsPerPartition target size, we use the following procedure:

  • Form a latent space using the first 30 components from PCA
    • Alternately, use the user-provided latent space or latent trajectory model
  • Compute a KNN graph, where edge weights are determined by applying a Gaussian kernel to the euclidean distances in the latent model
  • Perform Louvain Clustering on KNN graph
  • Iteratively perform K-means clustering within the louvain clusters until we have about cellsPerPartition cells per cluster.
  • Collapse cells into pooled cells by computing the gene-wise mean for all cells in each cluster.

By default, VISION does not allow micropooling within discrete classes (e.g. cell types, or batch) as we find that unbiased clusterings are the most sensible and reflective of the data. Instead, in the output report, pooled cells report their percent consistency of each discrete meta data item (e.g. a pooled cell is 80% CD4+ T cell and 20% CD8+ T cell). In this way, users may develop an intuition of the phenotype of each pooled cell. To note, users may also elect to perform clusterings outside of the VISION analysis pipeline and provide it to the VISION object constructor as pooled data - this allows users to perform micorclustering within discrete classes if they so wish.

Micropooling Example

Data

In this vignette, we’ll be analyzing a set of ~5,000 cells during haematopoiesis (Tusi et al, Nature 2018).

Workflow

Here, we’ll demonstrate a common use case of micropooling on the data:

library(VISION)

counts = as.matrix(read.table("data/hemato_counts.csv.gz", sep=',', header=T, row.names=1))

# compute scaled counts
scale.factor = median(colSums(counts))
scaled.counts = t(t(counts) / colSums(counts)) * scale.factor

# read in meta data
meta = read.table("data/hemato_covariates.txt.gz", sep='\t', header=T, row.names=1)
meta = meta[colnames(scaled.counts), -1]

vis <- Vision(scaled.counts, meta = meta
              signatures = c("data/h.all.v5.2.symbols.gmt"),
              pool = T, cellsPerPartition = 5)

vis <- analyze(vis)

viewResults(vis)

Micropooling independently of the main VISION analysis

There may be reasons it is desired micropool outside of the main VISION analyze() function. For example, there may be discrete classes that users may not wish to pool across (e.g. not wanting to merge disease and control cells). Or, you may want to only use the micropooling function independently of the rest of the pipeline. This can be accomplished using the applyMicroClustering() utility function as demonstrated here.

## Create scaled.counts and meta as shown above

pools <- applyMicroClustering(scaled.counts, cellsPerPartition = 5)

# Create pooled versions of expression matrix and meta-data
pooledExpression <- poolMatrixCols(scaled.counts, pools)
pooledMeta <- poolMetaData(meta, pools)

# These could then be fed into the VISION constructor
vis <- Vision(pooledExpression, meta = pooledMeta
              signatures = c("data/h.all.v5.2.symbols.gmt"))

Alternately, if you already have a latent space computed and wish to use this (instead of PCA) when pooling outside of the normal Vision pipeline, you can provide it when calling applyMicroClustering() and later pool the latent space.

This is demonstrated below:

## Create scaled.counts and meta as shown above

pools <- applyMicroClustering(scaled.counts,
    cellsPerPartition = 5, latentSpace = latentPpace)

# Create pooled versions of expression matrix and meta-data
pooledExpression <- poolMatrixCols(scaled.counts, pools)
pooledMeta <- poolMetaData(meta, pools)
pooledLatentSpace <- poolMatrixRows(latentSpace, pools)

# These could then be fed into the VISION constructor
vis <- Vision(pooledExpression, meta = pooledMeta
              signatures = c("data/h.all.v5.2.symbols.gmt"),
              latentSpace = pooledLatentSpace)