emergene.tools package
Tools module for emergene analysis.
This module contains the main analysis functions for identifying emergent genes, marker genes, computing gene set scores, and identifying gene modules.
- emergene.tools.identifyGeneModule(adata, gene_list, use_rep: str = 'X_pca', resolution: float = 0.5, n_components: int = 30, verbosity: int = 0)
- emergene.tools.runEMERGENE(adata: AnnData, use_rep: str = 'X_pca', use_rep_acrossDataset: str = 'X_pca', layer: str | None = None, n_nearest_neighbors: int = 10, condition_key: str = 'Sample', random_seed: int = 27, n_repeats: int = 3, mu: float = 1.0, beta: float = 1.0, sigma: float = 100.0, n_cells_expressed_threshold: int = 50, n_top_EG_genes: int = 500, remove_lowly_expressed: bool = True, expressed_pct: float = 0.1, inplace: bool = False, gene_list_as_string: bool = False, verbose: int = 1) → Tuple[Dict[str, str | DataFrame], DataFrame] | Dict[str, str | DataFrame]
Identify emergent genes across different biological conditions.
emergene identifies genes that show coordinated local expression patterns within specific conditions by using graph-based diffusion and cosine similarity. The method compares gene expression patterns within a condition against random backgrounds and other conditions to identify truly emergent patterns.
- Parameters:
adata (AnnData) – Annotated data matrix containing preprocessed single-cell RNA-seq data. Should contain low-dimensional representations (e.g., PCA) in .obsm.
use_rep (str, default='X_pca') – Key in adata.obsm for the low-dimensional embedding used for condition-specific diffusion. Common choices include 'X_pca', 'X_umap', or other dimensionality reduction results.
use_rep_acrossDataset (str, default='X_pca') – Key in adata.obsm for computing the across-dataset connectivity matrix using bbknn. Can be the same as use_rep or a different representation.
layer (str or None, default=None) – Key in adata.layers for the gene expression matrix to use. If None, uses the default expression matrix stored in adata.X. Log-normalized counts (e.g., 'log1p') or infog-normalized data are recommended.
n_nearest_neighbors (int, default=10) – Number of nearest neighbors used when constructing adjacency matrices. Higher values increase smoothing but may dilute local patterns.
condition_key (str, default='Sample') – Column name in adata.obs that specifies the condition or batch label for each cell. Must be a categorical variable.
random_seed (int, default=27) – Seed for the random number generator to ensure reproducibility of randomization procedures.
n_repeats (int, default=3) – Number of randomizations to perform for background generation. Higher values provide more stable background estimates but increase computation. Minimum recommended: 3, typical: 10-30.
mu (float, default=1.0) – Weight for subtracting the random background specificity in the final emergene score. Set to 0 to disable random background correction.
beta (float, default=1.0) – Weight for subtracting the condition-wise background specificity in the final emergene score. Set to 0 to disable cross-condition correction.
sigma (float, default=100.0) – Scaling parameter for the exponential kernel in adjacency matrix construction. Larger values result in slower decay of edge weights.
n_cells_expressed_threshold (int, default=50) – Minimum number of cells (within the target condition) in which a gene must be expressed to be considered. Genes below this threshold receive minimum scores.
n_top_EG_genes (int, default=500) – Number of top emergent genes to select for each condition based on emergene scores.
remove_lowly_expressed (bool, default=True) – Flag indicating whether to filter lowly expressed genes. Currently implemented via n_cells_expressed_threshold.
expressed_pct (float, default=0.1) – Minimum percentage of cells in which a gene must be expressed. Note: not currently implemented; planned for a future version.
inplace (bool, default=False) – If True, saves emergene scores directly into adata.var and modifies the AnnData object in-place. If False, returns scores as a DataFrame.
gene_list_as_string (bool, default=False) – If True, saves top genes and scores as a comma-separated string in the format “gene1:score1,gene2:score2,…”. If False, saves as a pandas DataFrame with separate columns for genes and scores.
verbose (int, default=1) – Verbosity level. 0: silent, 1: progress messages, 2: detailed output.
- Returns:
If inplace=False –
- Tuple[Dict, pd.DataFrame]
Dictionary whose keys are f'EG_{condition}' and whose values are either strings (if gene_list_as_string=True) or DataFrames containing the top emergent genes and their scores for each condition.
DataFrame with columns f'EmerGene_{condition}' containing emergene scores for all genes across all conditions.
If inplace=True –
- Dict
Dictionary of top emergent genes. The AnnData object is modified in-place, with emergene scores added to adata.var and local fold changes added to adata.layers['localFC'].
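When gene_list_as_string=True, each dictionary value is a single string in the "gene:score" comma-separated format described under Parameters. A minimal sketch of turning such a string back into (gene, score) pairs follows. The helper name parse_gene_score_string and the gene names and scores in the example are hypothetical and not part of emergene; the sketch also assumes that scores are written as plain decimal numbers.

```python
def parse_gene_score_string(gene_string):
    """Parse 'gene1:score1,gene2:score2' into a list of (gene, score) tuples.

    Hypothetical helper, not part of emergene.
    """
    pairs = []
    for item in gene_string.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma or empty segment
        # Split on the last colon so gene names containing ':' stay intact.
        gene, _, score = item.rpartition(":")
        pairs.append((gene, float(score)))
    return pairs


# Illustrative input only; real values come from runEMERGENE output.
example = "CD3E:0.82,CD8A:0.75,GZMB:0.61"
parsed = parse_gene_score_string(example)
print(parsed)
```

An item without a colon makes float() raise ValueError, which surfaces malformed input early instead of silently dropping it.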
Notes
The method computes three components for each gene:
Target specificity (GSP): Cosine similarity between original expression and diffused expression within the condition
Random background: Average GSP from randomly permuted adjacency matrices
Condition-wise background: GSP between target condition and other conditions
The final emergene score is: GSP - μ × random_GSP - β × condition_GSP
The local fold change matrix is stored in adata.layers['localFC'] and represents log1p-transformed fold changes of each gene in each cell relative to the cross-condition background.
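The way the three components combine can be illustrated with a toy NumPy sketch. This is not the emergene implementation. It assumes that diffusion is a single multiplication by a row-normalized adjacency matrix, it uses one hand-written "permuted" adjacency in place of n_repeats randomizations, and the condition-wise background is a made-up placeholder value. The function names are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def row_normalize(adjacency):
    """Scale each row to sum to 1, leaving all-zero rows untouched."""
    row_sums = adjacency.sum(axis=1, keepdims=True)
    return np.divide(adjacency, row_sums,
                     out=np.zeros_like(adjacency), where=row_sums > 0)


def combine_scores(gsp, random_gsp, condition_gsp, mu=1.0, beta=1.0):
    """Final score as stated in the Notes: GSP - mu * random - beta * condition."""
    return gsp - mu * random_gsp - beta * condition_gsp


# One gene measured in four cells; cells 0 and 1 express it.
expr = np.array([1.0, 1.0, 0.0, 0.0])

# Toy graph: cells 0-1 are neighbors and cells 2-3 are neighbors.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 0, 0],
                      [0, 0, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
# Rewired graph standing in for a randomly permuted adjacency.
permuted = np.array([[0, 0, 1, 0],
                     [0, 0, 0, 1],
                     [1, 0, 0, 0],
                     [0, 1, 0, 0]], dtype=float)

gsp = cosine_similarity(expr, row_normalize(adjacency) @ expr)
random_gsp = cosine_similarity(expr, row_normalize(permuted) @ expr)
condition_gsp = 0.25  # placeholder value, chosen only for illustration

score = combine_scores(gsp, random_gsp, condition_gsp, mu=1.0, beta=1.0)
print(gsp, random_gsp, score)
```

In this toy case the gene is expressed only in neighboring cells, so diffusion over the true graph preserves its pattern while the rewired graph destroys it. That contrast is what the random-background term is meant to capture.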
- Raises:
ValueError – If use_rep or use_rep_acrossDataset is not found in adata.obsm, if condition_key is not found in adata.obs, if layer is specified but not found in adata.layers, or if numeric parameters are out of acceptable ranges.
ImportError – If required dependency bbknn is not installed.
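The documented ValueError conditions can be checked up front, before starting a long run. The sketch below is a hypothetical helper, not part of emergene. It covers only the key-lookup conditions listed above, since the acceptable numeric ranges are not documented here. A SimpleNamespace with plain dicts stands in for an AnnData object so that the example runs without any data file.

```python
from types import SimpleNamespace


def check_emergene_inputs(adata, use_rep="X_pca", use_rep_acrossDataset="X_pca",
                          layer=None, condition_key="Sample"):
    """Raise ValueError for the key-lookup problems listed under Raises.

    Hypothetical pre-flight helper; emergene performs its own validation.
    """
    for rep in (use_rep, use_rep_acrossDataset):
        if rep not in adata.obsm:
            raise ValueError(f"'{rep}' not found in adata.obsm")
    # For a pandas DataFrame, `in` tests column labels, matching adata.obs usage.
    if condition_key not in adata.obs:
        raise ValueError(f"'{condition_key}' not found in adata.obs")
    if layer is not None and layer not in adata.layers:
        raise ValueError(f"'{layer}' not found in adata.layers")


# Stand-in object exposing only the attributes the check touches.
toy = SimpleNamespace(obsm={"X_pca": None}, obs={"Sample": None}, layers={})

check_emergene_inputs(toy)  # passes silently with the defaults

try:
    check_emergene_inputs(toy, layer="log1p")
    layer_error = None
except ValueError as err:
    layer_error = str(err)
print(layer_error)
```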
Examples
Basic usage with default parameters:
>>> import scanpy as sc
>>> import emergene as eg
>>> adata = sc.read_h5ad("data.h5ad")
>>> gene_dict, scores = eg.tl.runEMERGENE(adata, condition_key='cell_type')
>>> print(gene_dict['EG_T_cell'].head())
Using custom parameters and saving in-place:
>>> gene_dict = eg.tl.runEMERGENE(
...     adata,
...     condition_key='treatment',
...     n_top_EG_genes=1000,
...     mu=1.5,
...     beta=0.5,
...     inplace=True
... )
>>> print(adata.var['EmerGene_treated'].head())
- emergene.tools.runMarkG(adata, use_rep: str = 'X_pca', layer: str = 'log1p', n_nearest_neighbors: int = 10, random_seed: int = 27, n_repeats: int = 3, mu: float = 1, sigma: float = 100, remove_lowly_expressed=True, expressed_pct=0.1)
- emergene.tools.score(adata, gene_list, gene_weights=None, n_nearest_neighbors: int = 30, leaf_size: int = 40, layer: str = 'infog', random_seed: int = 1927, n_ctrl_set: int = 100, key_added: str | None = None, verbosity: int = 0)
For a given gene set, compute gene expression enrichment scores and P values for all the cells.
- Parameters:
adata (AnnData) – The AnnData object for the gene expression matrix.
gene_list (list of str) – A list of gene names for which the score will be computed.
gene_weights (list of floats, optional) – A list of weights corresponding to the genes in gene_list. The length of gene_weights must match the length of gene_list. If None, all genes in gene_list are weighted equally. Default is None.
n_nearest_neighbors (int, optional) – Number of nearest neighbors to consider for randomly selecting control gene sets based on the similarity of genes’ mean and variance among all cells. Default is 30.
leaf_size (int, optional) – Leaf size for the KD-tree or Ball-tree used in nearest neighbor calculations. Default is 40.
layer (str, optional) – The name of the layer in adata.layers to use for gene expression values. Default is 'infog'.
random_seed (int, optional) – Random seed for reproducibility. Default is 1927.
n_ctrl_set (int, optional) – Number of control gene sets to be used for calculating P values. Default is 100.
key_added (str, optional) – Key under which results are stored: the computed scores go to adata.obs[key_added], and the scores and P values are also stored in adata.uns[key_added]. If None (the default), INFOG_score is used as the key.
verbosity (int, optional) – Level of verbosity for logging information. Default is 0.
- Returns:
Modifies the adata object in-place; see key_added for where the results are stored.
- Return type:
None
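The role of gene_weights and n_ctrl_set can be illustrated with a generic control-set comparison. The sketch below is an assumption about the general technique, not the emergene implementation. It scores each cell by a weighted mean over the gene set and computes a standard Monte Carlo empirical P value, (1 + number of control scores ≥ observed) / (1 + n_ctrl_set). Control genes are drawn uniformly at random here, whereas the documentation describes choosing them by similarity of mean and variance. All function names and the toy data are hypothetical.

```python
import numpy as np


def weighted_set_score(expr, gene_idx, weights=None):
    """Weighted mean expression of a gene set per cell (expr is cells x genes)."""
    if weights is None:
        weights = np.ones(len(gene_idx))  # equal weighting by default
    weights = np.asarray(weights, dtype=float)
    if len(weights) != len(gene_idx):
        raise ValueError("gene_weights must match gene_list in length")
    return expr[:, gene_idx] @ weights / weights.sum()


def empirical_p_values(observed, control):
    """Per-cell empirical P values; control has shape (n_ctrl_set, n_cells)."""
    exceed = (control >= observed).sum(axis=0)
    return (1 + exceed) / (1 + control.shape[0])


rng = np.random.default_rng(1927)
expr = rng.random((5, 20))   # toy matrix: 5 cells x 20 genes, values in [0, 1)
expr[0, :3] += 5.0           # cell 0 strongly expresses genes 0, 1 and 2

gene_idx = [0, 1, 2]
observed = weighted_set_score(expr, gene_idx)

# 100 control sets of the same size, drawn from the remaining genes.
# Simplification: uniform sampling instead of mean/variance matching.
n_ctrl_set = 100
control = np.stack([
    weighted_set_score(expr, rng.choice(np.arange(3, 20), size=3, replace=False))
    for _ in range(n_ctrl_set)
])

p_values = empirical_p_values(observed, control)
print(p_values)
```

With this estimator the smallest attainable P value is 1 / (1 + n_ctrl_set), so a larger n_ctrl_set gives finer resolution at the cost of more computation.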