Compass Settings
Compass allows users to customize various features:
usage: Compass [-h] [--data FILE] [--data-mtx FILE [FILE ...]] [--model MODEL] [--species SPECIES] [--media MEDIA] [--output-dir DIR]
[--temp-dir DIR] [--torque-queue QUEUE] [--num-processes N] [--lambda F] [--num-threads N] [--and-function FXN]
[--select-reactions FILE] [--select-subsystems FILE] [--num-neighbors N] [--symmetric-kernel] [--input-weights FILE]
[--penalty-diffusion MODE] [--no-reactions] [--calc-metabolites] [--precache] [--input-knn FILE] [--output-knn FILE]
[--latent-space FILE] [--only-penalties] [--example-inputs] [--microcluster-size C] [--list-genes FILE] [--list-reactions FILE]
Below we describe the features in more detail. For details on micropooling/microclustering specifically, see here
Input settings
The input gene expression matrix is specified in one of two ways:
- --data [FILE]
File with input gene expression data with rows as genes and columns as samples. The input should be a single tab-delimited file with row and column labels:
--data expression.tsv
- --data-mtx [FILE [FILE ...]]
File with input gene expression data with rows as genes and columns as samples in Matrix Market format (mtx). The mtx file must be followed by a tab-separated file with row names corresponding to genes. Optionally, that can be followed by a file with column names corresponding to samples.
--data-mtx expression.mtx genes.tsv sample_names.tsv
If the column names file is omitted, the samples will be labelled by index.
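The two input layouts described above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not part of Compass itself: the gene names, sample names, values, and helper function names are hypothetical, and leaving the top-left header cell of the TSV empty is an assumption about how the labelled matrix is laid out.

```python
import csv
import os
import tempfile


def write_expression_tsv(path, genes, samples, matrix):
    """Write a genes-by-samples matrix as a tab-delimited file with labels."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        # Header row: empty corner cell (assumption), then the sample names.
        writer.writerow([""] + list(samples))
        for gene, row in zip(genes, matrix):
            writer.writerow([gene] + list(row))


def write_expression_mtx(path, matrix):
    """Write the matrix in Matrix Market coordinate format (1-based indices)."""
    entries = [
        (i + 1, j + 1, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0  # only non-zero entries are stored
    ]
    with open(path, "w") as fh:
        fh.write("%%MatrixMarket matrix coordinate real general\n")
        fh.write(f"{len(matrix)} {len(matrix[0])} {len(entries)}\n")
        for i, j, value in entries:
            fh.write(f"{i} {j} {value}\n")


def write_labels(path, labels):
    """Write one label per line (gene names or sample names)."""
    with open(path, "w") as fh:
        fh.write("\n".join(labels) + "\n")


# Hypothetical toy data: 3 genes (rows) by 2 samples (columns).
genes = ["GENE_A", "GENE_B", "GENE_C"]
samples = ["cell_1", "cell_2"]
matrix = [[1.0, 0.0], [0.0, 2.5], [3.0, 4.0]]

out_dir = tempfile.mkdtemp()
write_expression_tsv(os.path.join(out_dir, "expression.tsv"), genes, samples, matrix)
write_expression_mtx(os.path.join(out_dir, "expression.mtx"), matrix)
write_labels(os.path.join(out_dir, "genes.tsv"), genes)
write_labels(os.path.join(out_dir, "sample_names.tsv"), samples)
```

The resulting files correspond to the arguments shown above: expression.tsv for --data, and expression.mtx, genes.tsv, and sample_names.tsv (in that order) for --data-mtx.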
To view example inputs, use:
- --example-inputs
Flag for Compass to list the directory where example inputs can be found.
Output settings
- --output-dir [DIR]
Directory for final output files (e.g., reactions.tsv). Defaults to ./ (the directory the command was run from).
- --temp-dir [DIR]
Directory to store partial results for completed samples in a dataset (used to resume interrupted runs). Defaults to ./_tmp.
- --list-genes [FILE]
File to output a list of metabolic genes needed for the selected metabolic model. This is useful if you’d like to subset the input matrix to include only the metabolic genes used by the algorithm (genes not included in the list are ignored). This list depends on the --species argument.
- --list-reactions [FILE]
File to output a list of reaction IDs and their associated subsystems. This is useful if you’d like to compute Compass scores for only a subset of the reactions in order to cut computation time (see --select-reactions and --select-subsystems below).
- --select-reactions [FILE]
Compute Compass scores only for the reactions listed in the given file. FILE is expected to be a text file with one reaction ID per line. The IDs are undirected: appending the suffix "_pos" or "_neg" to a line yields a valid directed reaction ID. Unrecognized reactions in FILE are ignored.
- --select-subsystems [FILE]
Compute Compass scores only for the subsystems listed in the given file. FILE is expected to be a text file with one subsystem per line. Unrecognized subsystems in FILE are ignored.
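As an illustration of the --select-reactions file format, the sketch below turns a list of directed reaction IDs into the undirected, one-per-line form described above. The reaction IDs and helper names are hypothetical, and the exact layout of the --list-reactions output is not assumed here; the input is simply a Python list of IDs.

```python
def to_undirected(reaction_id):
    """Strip a trailing "_pos" or "_neg" direction suffix, if present."""
    for suffix in ("_pos", "_neg"):
        if reaction_id.endswith(suffix):
            return reaction_id[: -len(suffix)]
    return reaction_id


def select_reactions_lines(reaction_ids):
    """Return unique undirected IDs in first-seen order, one per output line."""
    unique = []
    for reaction_id in reaction_ids:
        base = to_undirected(reaction_id)
        if base not in unique:
            unique.append(base)
    return unique


# Hypothetical directed IDs; real IDs come from the chosen metabolic model.
directed = ["RXN_A_pos", "RXN_A_neg", "RXN_B_pos"]
lines = select_reactions_lines(directed)

# Text to save as the FILE passed to --select-reactions.
file_text = "\n".join(lines) + "\n"
```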
Metabolic Model Settings
- --species [SPECIES]
Species to use to match gene names to model. Required parameter. Options:
homo_sapiens
mus_musculus
- --model [MODEL]
Metabolic model to use. Options:
RECON1_mat
RECON2_mat (default)
RECON2.2
- --media [MEDIA]
The media to simulate the model with. This is a placeholder for future algorithmic extensions.
- --and-function [FXN]
A numeric function that replaces AND relationships when translating the GSMM’s gene-protein associations into reaction penalties. Options:
min
median
mean (default)
- --calc-metabolites
Flag to enable calculation and output of uptake/secretion scores in addition to reaction scores.
- --no-reactions
Flag to disable calculation and output of reaction scores and compute only uptake/secretion scores.
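To make the --and-function options concrete, the sketch below applies min, median, and mean to the expression values of genes joined by an AND relationship. It shows only the arithmetic of the three choices under hypothetical values; how Compass converts the combined value into a reaction penalty is not shown, and the function and variable names are illustrative.

```python
from statistics import mean, median

# The three options listed for --and-function, mapped to plain Python functions.
AND_FUNCTIONS = {"min": min, "median": median, "mean": mean}


def combine_and(values, fxn="mean"):
    """Combine the expression of genes joined by AND into a single number."""
    return AND_FUNCTIONS[fxn](values)


# Hypothetical expression values for three subunits that are all required.
subunit_expression = [2.0, 4.0, 9.0]

combined = {name: combine_and(subunit_expression, name) for name in AND_FUNCTIONS}
```

With these values, min is driven entirely by the least expressed subunit, while mean lets a highly expressed subunit raise the combined value.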
Penalty Settings
- --penalty-diffusion [MODE]
Mode to use in information sharing of reaction penalty values between single cells. Options:
gaussian
knn (default)
- --lambda [F]
Smoothing factor for information sharing between single cells (Default is 0, no information sharing). Lambda should be set between 0 and 1. In the manuscript, where information sharing was appropriate, we used 0.25.
Note that there are two common scenarios where there is no need for information sharing and lambda should be set to 0:
1. Running Compass on bulk (i.e., not single-cell) RNA.
2. Using a cell pooling procedure (micropools or metacells) and running Compass on the pooled results.
Note
If lambda is 0, then the cells are processed independently of each other so you can divide up samples to run them separately and get the same results.
- --num-neighbors [K]
Either the effective number of neighbors for gaussian penalty diffusion or the exact number of neighbors for kNN penalty diffusion. Default is 30.
- --input-weights [FILE]
File to input custom weights for averaging of single-cell data. The column and row labels should be the same as the names of samples in expression data.
- --symmetric-kernel
Flag to enable symmetrizing the TSNE kernel, which takes longer to compute.
- --input-knn [FILE]
File to input a precomputed kNN graph for the samples. The file can be a tsv with one row per sample and (k+1) columns. The first column should be sample names, and the next k columns should be indices of the k nearest neighbors (by their order in column 1).
Alternatively, you can input the numpy array of values without a column of labels in npy format, but make sure that the order of samples is the same as in the input data.
- --input-knn-distances [FILE]
File to input precomputed distances for the kNN graph of the samples. The file can be a tsv with one row per sample and (k+1) columns. The first column should be sample names, and the next k columns should be distances to the k nearest neighbors of that sample.
Alternatively, you can input the numpy array of values without a column of labels in npy format, but make sure that the order of samples is the same as in the input data.
- --output-knn [FILE]
File to save kNN graph of the samples to. File will be a tsv with one row per sample and (k+1) columns. The first column will be sample names, and the next k columns will be indices of the k nearest neighbors (by their order in column 1).
Note
These kNN formats are the output of scikit-learn’s nearest neighbors algorithm, wrapped in a pandas DataFrame.
- --latent-space [FILE]
File with latent space representation of the samples on which to do the kNN clustering for information sharing and/or micropooling. Should be a tsv with one row per sample and one column per dimension of the latent space.
- --only-penalties
Flag for Compass to only compute the reaction penalties for the dataset. This is useful for load splitting when information sharing between cells is needed; only the penalty computation needs to be centrally run, and the subsequent score computations can be split across machines.
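The tsv layout described for --input-knn (sample name, then k neighbor indices) can be sketched as below, using a brute-force Euclidean search over a small latent space. Several details are assumptions not stated in the text: indices are 0-based positions in the sample order, a sample is not listed as its own neighbor, and no header row is written. The sample names, coordinates, and helper names are hypothetical, and Compass itself relies on scikit-learn rather than this toy search.

```python
import math


def knn_indices(points, k):
    """For each point, return the indices of its k nearest other points."""
    neighbors = []
    for i, point in enumerate(points):
        # Assumption: a sample is excluded from its own neighbor list.
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(point, points[j]))
        neighbors.append(others[:k])
    return neighbors


def knn_rows(names, points, k):
    """Build rows of (sample name, neighbor index 1, ..., neighbor index k)."""
    return [[name] + idx for name, idx in zip(names, knn_indices(points, k))]


# Hypothetical 2-dimensional latent space for four samples.
names = ["cell_1", "cell_2", "cell_3", "cell_4"]
latent = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]

rows = knn_rows(names, latent, k=2)

# One tab-separated line per sample, with (k+1) columns.
tsv_text = "\n".join("\t".join(str(v) for v in row) for row in rows) + "\n"
```

A distances file for --input-knn-distances would have the same shape, with the distance to each neighbor in place of its index.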
Computing Settings
- --num-processes [N]
Number of processes for Compass to use, each of which handles a single sample. Must be a positive integer and defaults to the number of processors on the machine (using Python’s multiprocessing.cpu_count()). Ignored when submitting a job to a queue.
- --num-threads [N]
Number of threads to use per sample for solving the flux balance optimization problems. Default is 1.
Note
It is generally better to increase the number of processes than the number of threads for better performance, unless the number of processes is greater than the number of samples. This is because it is generally better to have multiple optimization problems being solved at once rather than solving a single optimization problem with multiple threads.
- --torque-queue [QUEUE]
Name of the Torque queue to submit jobs to.
- --precache
A flag to force Compass to build the cache for the selected model and media. This will rebuild the cache even if one already exists.
- --microcluster-size [C]
A target number of cells per microcluster. Compass will aggregate similar cells into clusters and compute reaction penalties for the clusters (using the mean of the cluster).
- --microcluster-file [FILE]
File where a tsv of microclusters will be output. There will be one column where each entry has the label for what micropool/microcluster the sample is in. Defaults to micropools.tsv in the output directory.
- --microcluster-data-file [FILE]
File where a tsv of average gene expression per microcluster will be output. Defaults to micropooled_data.tsv in the output directory.
Note
When using microclusters, information sharing with lambda > 0 is generally unnecessary because the microclusters already serve the same purpose. If both are enabled, then information will be shared between microclusters as well.
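The averaging step behind micropooling (mean gene expression per microcluster, as written to the --microcluster-data-file output) can be sketched as follows. Only the averaging is shown: the cluster labels are given as input, and how Compass assigns cells to microclusters is not reproduced. All names and values are hypothetical.

```python
from collections import defaultdict


def micropool_means(expression, labels):
    """Average per-gene expression over the samples in each microcluster.

    expression: dict mapping sample name -> list of gene values (same gene order)
    labels:     dict mapping sample name -> microcluster label
    """
    groups = defaultdict(list)
    for sample, label in labels.items():
        groups[label].append(expression[sample])
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in groups.items()
    }


# Hypothetical data: three cells, two genes, two microclusters.
expression = {
    "cell_1": [1.0, 4.0],
    "cell_2": [3.0, 0.0],
    "cell_3": [10.0, 2.0],
}
labels = {"cell_1": 0, "cell_2": 0, "cell_3": 1}

pooled = micropool_means(expression, labels)
```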