Demo: Lineage-recorded data from Chan et al, (Nature 2019)

This demo uses data from:

Molecular recording of mammalian embryogenesis

For this demo, you need the expression data (count matrix) for Embryo3 which can be found in the NCBI GEO database under accession GSM3302829

Specifically these three files:

  • GSM3302829_embryo3_SeuratBarcodes.tsv.gz
  • GSM3302829_embryo3_SeuratGenes.tsv.gz
  • GSM3302829_embryo3_SeuratNorm.mtx.gz

The cell-type kernels (for interpreting lineage groups) are located at GSE122187

  • GSE122187_CellStateKernels.xls.gz

Additionally, a reconstructed lineage tree is provided with this notebook (same directory) as made by Cassiopeia

  • 0726_E2-2_tree_greedy_priors.processed.txt

Covered Here

  • Data preprocessing and filtering
  • Computing Autocorrelation in Hotspot to identify lineage-related genes
  • Computing local correlations between lineage genes to identify heritable modules
  • Plotting modules, correlations, and module scores
[1]:
import warnings; warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
import hotspot
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

from scipy.io import mmread
from scipy.sparse import csr_matrix
[2]:
counts_raw = mmread("GSM3302829_embryo3_SeuratNorm.mtx.gz")
counts_raw = csr_matrix(counts_raw)
barcodes = pd.read_table("GSM3302829_embryo3_SeuratBarcodes.tsv.gz", header=None)[0]
barcodes = [x+'-1' for x in barcodes] # to match the newick file
genes = pd.read_table("GSM3302829_embryo3_SeuratGenes.tsv.gz", header=None)[0]
[3]:
# Load the tree and enumerate the leaves
from ete3 import Tree
tree = Tree("0726_E2-2_tree_greedy_priors.processed.txt", format=1)
leaves = set()
for tn in tree.traverse('postorder'):
    if tn.is_leaf():
        leaves.add(tn.name)
len(leaves)
[3]:
1756
[4]:
# Subset the count matrix to only the cells where the lineage was recoverable
is_valid = [x in leaves for x in barcodes]
is_valid_indices = np.nonzero(is_valid)[0]
valid_barcodes = [barcodes[i] for i in is_valid_indices]

counts = counts_raw[:, is_valid_indices].todense()
counts = pd.DataFrame(counts, index=genes, columns=valid_barcodes)

# Subset for only genes detected in at least 10 cells
counts = counts.loc[
    (counts > 0).sum(axis=1) >= 10
]

Creating the Hotspot object

To start an analysis, first create the hotspot object When creating the object, you need to specify:

  • The gene counts matrix
  • Which background model to use
  • The latent space we are using to compute our cell metric
    • Here we use the inferred cell tree
  • The per-cell scaling factor
    • Here we use the number of umi per barcode

In this case, only the log-normalized counts are made available. In the Hotspot publication, we used the raw counts and the negative binomial (‘danb’) model. However, to make this example easier to run, here we just supply the log-normalized counts and select the ‘normal’ model.

Once the object is created, the neighborhood is then computed with create_knn_graph

The two options that are specificied are n_neighbors which determines the size of the neighborhood, and weighted_graph.

Here we set weighted_graph=False to just use binary, 0-1 weights (only binary weights are supported when using a lineage tree) and n_neighbors=30 to create a local neighborhood size of the nearest 30 cells. Larger neighborhood sizes can result in more robust detection of correlations and autocorrelations at a cost of missing more fine-grained, smaller-scale patterns.

[5]:
# Create the Hotspot object and the neighborhood graph

hs = hotspot.Hotspot(counts, model='normal', tree=tree)

hs.create_knn_graph(
    weighted_graph=False, n_neighbors=30,
)
100%|██████████| 1756/1756 [00:16<00:00, 110.57it/s]

Determining genes with heritable variation

Now we compute autocorrelations for each gene, using the lineage metric, to determine which genes have the most heritable variation.

[6]:
hs_results = hs.compute_autocorrelations(jobs=1)

hs_results.head(15)
100%|██████████| 12440/12440 [00:04<00:00, 3046.79it/s]
[6]:
C Z Pval FDR
Gene
Rhox5 0.202858 39.835987 0.000000e+00 0.000000e+00
Trap1a 0.182587 35.933251 4.622486e-283 2.204320e-279
Mt1 0.184857 35.929364 5.315884e-283 2.204320e-279
Mt2 0.179544 35.885161 2.602714e-282 8.094441e-279
Lgals2 0.179358 34.875108 8.864925e-267 2.205593e-263
Apom 0.175625 34.854555 1.816081e-266 3.765341e-263
Ttr 0.179074 34.795900 1.402761e-265 2.492907e-262
S100g 0.169737 33.828606 3.745067e-251 5.823579e-248
Apoa1 0.173526 33.700644 2.828200e-249 3.909201e-246
Rbp4 0.172188 33.636521 2.454401e-248 3.053275e-245
Spink1 0.160698 31.947691 2.908280e-224 3.289000e-221
Amn 0.158660 31.168848 1.408731e-213 1.460384e-210
Afp 0.159919 30.329718 2.325988e-202 2.225792e-199
Dab2 0.155662 30.315884 3.539874e-202 3.145431e-199
Apoa2 0.155262 29.959592 1.650030e-197 1.368425e-194

Grouping genes into lineage-based modules

To get a better idea of what heritable expression patterns exist, it is helpful to group the genes into modules.

Hotspot does this using the concept of “local correlations” - that is, correlations that are computed between genes between cells in the same neighborhood.

Here we avoid running the calculation for all Genes x Genes pairs and instead only run this on genes that have significant lineage autocorrelation to begin with.

The method compute_local_correlations returns a Genes x Genes matrix of Z-scores for the significance of the correlation between genes. This object is also retained in the hs object and is used in the subsequent steps.

[7]:
# Select the genes with significant lineage autocorrelation
hs_genes = hs_results.index[hs_results.FDR < 0.05]

# Compute pair-wise local correlations between these genes
lcz = hs.compute_local_correlations(hs_genes, jobs=2)

 27%|██▋       | 450/1685 [00:00<00:00, 4491.33it/s]
Computing pair-wise local correlation on 1685 features...
100%|██████████| 1685/1685 [00:00<00:00, 4357.77it/s]
100%|██████████| 1418770/1418770 [01:17<00:00, 18344.96it/s]

Now that pair-wise local correlations are calculated, we can group genes into modules.

To do this, a convenience method is included create_modules which performs agglomerative clustering with two caveats:

  • If the FDR-adjusted p-value of the correlation between two branches exceeds fdr_threshold, then the branches are not merged.
  • If two branches are two be merged and they are both have at least min_gene_threshold genes, then the branches are not merged. Further genes that would join to the resulting merged module (and are therefore ambiguous) either remain unassigned (if core_only=True) or are assigned to the module with the smaller average correlations between genes, i.e. the least-dense module (if core_only=False)

The output is a Series that maps gene to module number. Unassigned genes are indicated with a module number of -1

This method was used to preserved substructure (nested modules) while still giving the analyst some control. However, since there are a lot of ways to do hierarchical clustering, you can also manually cluster using the gene-distances in hs.local_correlation_z

[8]:
modules = hs.create_modules(
    min_gene_threshold=50, core_only=True, fdr_threshold=0.05
)

modules.value_counts()
[8]:
 1    682
 2    566
-1    223
 3     84
 5     65
 4     65
Name: Module, dtype: int64

Plotting module correlations

A convenience method is supplied to plot the results of hs.create_modules

[9]:
hs.plot_local_correlations()
_images/Lineage_Tutorial_15_0.png

To explore individual genes, we can look at the genes with the top autocorrelation in a given module as these are likely the most informative.

[10]:
# Show the top genes for a module

module = 3

results = hs.results.join(hs.modules)
results = results.loc[results.Module == module]
results.sort_values('Z', ascending=False).head(10)
[10]:
C Z Pval FDR Module
Gene
Klf1 0.083767 15.732653 4.517215e-56 5.251790e-54 3.0
Hba-x 0.067472 12.765132 1.283581e-37 1.078902e-35 3.0
Hbb-bh1 0.061915 12.301353 4.453501e-35 3.551381e-33 3.0
Hba-a1 0.058662 10.890564 6.392324e-28 4.036574e-26 3.0
Hbb-y 0.058233 10.754297 2.828191e-27 1.733137e-25 3.0
Gypa 0.053739 10.322431 2.789620e-25 1.644686e-23 3.0
Blvrb 0.049597 9.756170 8.679374e-23 4.633966e-21 3.0
Anp32b 0.047844 9.041718 7.711397e-20 3.647520e-18 3.0
Alad 0.042718 8.496494 9.770124e-18 4.191046e-16 3.0
Hmbs 0.038978 7.810243 2.853899e-15 1.056622e-13 3.0

To get an idea of what lineage patterns the module is referencing, we can plot the kernel values

[11]:
# Load Kernel values
import gzip
with gzip.open("GSE122187_CellStateKernels.xls.gz") as fin:
    kernels = pd.read_excel(fin, skiprows=1).set_index('Gene')

kernels = kernels.loc[
    :, kernels.columns.sort_values()
]

kernel_name_map = {
    0: "trophoblast stem cells",
    1: "neural ectoderm anterior",
    2: "primitive streak late",
    3: "anterior primitive streak",
    4: "primitive/definitive endoderm",
    5: "allantois",
    6: "secondary heart field/splanchnic lateral plate",
    7: "gut endoderm",
    8: "ectoderm early 1",
    9: "primitive blood early",
    10: "preplacodal ectoderm",
    11: "neural ectoderm posterior",
    12: "posterior lateral plate mesoderm",
    13: "hematopoietic/endothelial progenitors",
    14: "parietal endoderm",
    15: "amnion mesoderm early",
    16: "surface ectoderm",
    17: "epiblast",
    18: "somites",
    19: "ectoderm early 2",
    20: "splanchnic lateral plate/anterior paraxial mesoderm",
    21: "primitive heart tube",
    22: "primitive blood late",
    23: "notochord",
    24: "fore/midbrain",
    25: "distal extraembryonic ectoderm",
    26: "neuromesodermal progenitor early",
    27: "primordial germ cells",
    28: "differentiated trophoblasts",
    29: "visceral endoderm early",
    30: "presomitic mesoderm",
    31: "neuromesodermal progenitor late",
    32: "angioblasts",
    33: "neural crest",
    34: "pharyngeal arch mesoderm",
    35: "similar to neural crest",
    36: "primitive blood progenitors",
    37: "primitive streak early",
    38: "node",
    39: "future spinal cord",
    40: "visceral endoderm late",
    41: "amnion mesoderm late",
}

kernel_name_map = {k: v.capitalize() for k, v in kernel_name_map.items()}

kernels.columns = [kernel_name_map[x] for x in kernels.columns]

# Standardize each gene
kernels_z = kernels \
    .subtract(kernels.mean(axis=1), axis=0) \
    .divide(kernels.std(axis=1), axis=0)


# Plot the kernel values for the genes in each module
mods = hs.modules.unique()
mods = [x for x in mods if x != -1 and not np.isnan(x)]
mods = sorted(mods)

sizes = [(hs.modules == x).sum() for x in mods]
sizes = [np.log10(x) for x in sizes]

fig, axs = plt.subplots(
    len(mods), 1, gridspec_kw=dict(height_ratios=sizes, right=.8),
    figsize=(9, 12.5)
)

cbax = fig.add_axes([.85, .15, .015, .1])

for ax, mod in zip(axs.ravel(), mods):

    plt.sca(ax)
    mod_genes = hs.modules.index[hs.modules == mod]
    mod_genes = pd.Index(mod_genes) & kernels.index

    last_plot = mod == max(mods)

    if last_plot:
        cbar=True
        cbar_ax=cbax
    else:
        cbar=False
        cbar_ax=None

    sns.heatmap(kernels_z.loc[mod_genes], vmin=-2, vmax=2,
                cmap="RdBu_r", ax=ax, xticklabels=True, rasterized=True,
                cbar=cbar, cbar_ax=cbar_ax)
    plt.ylabel(int(mod))

    plt.yticks([])

    if last_plot:
        ax.tick_params(labelsize=9)
    else:
        plt.xticks([])

plt.tight_layout()

_images/Lineage_Tutorial_19_0.png

Summary Module Scores

To aid in the recognition of the general behavior of a module, Hotspot can compute aggregate module scores.

[12]:
module_scores = hs.calculate_module_scores()

module_scores.head()
  0%|          | 0/5 [00:00<?, ?it/s]
Computing scores for 5 modules...
100%|██████████| 5/5 [00:00<00:00,  3.03it/s]
[12]:
1 2 3 4 5
AAACCTGAGAATAGGG-1 -6.993694 5.658538 1.238681 -0.743027 -0.372733
AAACCTGAGACCCACC-1 -2.767821 2.735488 -1.299466 0.587611 0.333043
AAACCTGAGCAGCGTA-1 1.070490 -0.881125 -0.715067 0.503891 0.083320
AAACCTGAGCCCAATT-1 2.906480 -1.453035 -1.730925 -0.576987 -0.339215
AAACCTGAGTACGACG-1 -0.778300 0.360248 0.019698 -0.247749 0.733445

Here we can visualize these module scores by plotting them over a UMAP of the cells

First we’ll compute the UMAP

[13]:
from sklearn.decomposition import PCA
from umap import UMAP

counts_z = counts \
    .subtract(counts.mean(axis=1), axis=0) \
    .divide(counts.std(axis=1), axis=0)

pca_data = PCA(n_components=20).fit_transform(counts_z.values.T)
umap_data = UMAP(n_neighbors=30, min_dist=.5).fit_transform(pca_data)
umap_data = pd.DataFrame(umap_data, index=counts.columns, columns=['UMAP1', 'UMAP2'])
/data/yosef2/users/david.detomaso/programs/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)

Next, let’s use the Kernels to assign every cell to its closest profile

[14]:
from scipy.spatial.distance import cdist

common = counts.index & kernels.index
dd = cdist(counts.loc[common].T, kernels.loc[common].T, metric='cosine')
dd = pd.DataFrame(dd, index=counts.columns, columns=kernels.columns)
assignment = dd.idxmin(axis=1).loc[counts.columns]

colors = sns.color_palette("husl", assignment.unique().size)
plt.figure(figsize=(9, 9))
for i, ct in enumerate(assignment.unique()):
    ii = assignment.index[assignment == ct]
    plt.plot(umap_data.loc[ii].UMAP1, umap_data.loc[ii].UMAP2,
            'o', ms=2, label=ct, color=colors[i])

plt.xticks([])
plt.yticks([])
plt.xlabel('UMAP-1')
plt.ylabel('UMAP-2')
plt.subplots_adjust(right=.6, bottom=.5)
plt.legend(bbox_to_anchor=[1, 1], loc='upper left', markerscale=3);
_images/Lineage_Tutorial_25_0.png

Finally, we plot the module scores on top of the UMAP for comparison

[15]:
fig, axs = plt.subplots(2, 3, figsize=(10, 7))
cax = fig.add_axes(
    [.72, .1, .007, .2]
)

for ax, mod in zip(axs.ravel(), hs.module_scores.columns):
    sc = hs.module_scores[mod]
    # vmin = np.percentile(sc, 5)
    # vmax = np.percentile(sc, 95)
    vmin = -1
    vmax = 1
    plt.sca(ax)
    sc = plt.scatter(
        umap_data.UMAP1, umap_data.UMAP2,
        s=3, c=sc, vmin=vmin, vmax=vmax,
        rasterized=True)
    plt.xticks([])
    plt.yticks([])
    plt.title("Module {}".format(mod))
    for sp in ax.spines.values():
        sp.set_visible(False)

axs[1, 2].remove()
plt.subplots_adjust(hspace=0.4)
plt.colorbar(sc, cax=cax, ticks=[vmin, vmax], label='Module\nScore')
plt.subplots_adjust(left=0.02, right=0.9, top=0.8)
_images/Lineage_Tutorial_27_0.png