emergene.tools package

Tools module for emergene analysis.

This module contains the main analysis functions for identifying emergent genes and marker genes, computing gene set scores, and identifying gene modules.

emergene.tools.identifyGeneModule(adata, gene_list, use_rep: str = 'X_pca', resolution: float = 0.5, n_components: int = 30, verbosity: int = 0)

emergene.tools.runEMERGENE(adata: AnnData, use_rep: str = 'X_pca', use_rep_acrossDataset: str = 'X_pca', layer: str | None = None, n_nearest_neighbors: int = 10, condition_key: str = 'Sample', random_seed: int = 27, n_repeats: int = 3, mu: float = 1.0, beta: float = 1.0, sigma: float = 100.0, n_cells_expressed_threshold: int = 50, n_top_EG_genes: int = 500, remove_lowly_expressed: bool = True, expressed_pct: float = 0.1, inplace: bool = False, gene_list_as_string: bool = False, verbose: int = 1) → Tuple[Dict[str, str | DataFrame], DataFrame] | Dict[str, str | DataFrame]

Identify emergent genes across different biological conditions.

emergene identifies genes that show coordinated local expression patterns within specific conditions by using graph-based diffusion and cosine similarity. The method compares gene expression patterns within a condition against random backgrounds and other conditions to identify truly emergent patterns.

Parameters:

adata (AnnData) – Annotated data matrix containing preprocessed single-cell RNA-seq data. Should contain low-dimensional representations (e.g., PCA) in .obsm.
use_rep (str, default='X_pca') – Key in adata.obsm for the low-dimensional embedding used for condition-specific diffusion. Common choices include 'X_pca', 'X_umap', or other dimensionality reduction results.
use_rep_acrossDataset (str, default='X_pca') – Key in adata.obsm for computing the across-dataset connectivity matrix using bbknn. Can be the same as use_rep or a different representation.
layer (str or None, default=None) – Key in adata.layers for the gene expression matrix to use. If None, the default expression matrix stored in adata.X is used. Log-normalized counts (e.g., 'log1p') or infog-normalized data are recommended.
n_nearest_neighbors (int, default=10) – Number of nearest neighbors used when constructing adjacency matrices. Higher values increase smoothing but may dilute local patterns.
condition_key (str, default='Sample') – Column name in adata.obs that specifies the condition or batch label for each cell. Must be a categorical variable.
random_seed (int, default=27) – Seed for the random number generator to ensure reproducibility of randomization procedures.
n_repeats (int, default=3) – Number of randomizations to perform for background generation. Higher values provide more stable background estimates but increase computation. Minimum recommended: 3, typical: 10-30.
mu (float, default=1.0) – Weight for subtracting the random background specificity in the final emergene score. Set to 0 to disable random background correction.
beta (float, default=1.0) – Weight for subtracting the condition-wise background specificity in the final emergene score. Set to 0 to disable cross-condition correction.
sigma (float, default=100.0) – Scaling parameter for the exponential kernel in adjacency matrix construction. Larger values result in slower decay of edge weights.
n_cells_expressed_threshold (int, default=50) – Minimum number of cells (within the target condition) in which a gene must be expressed to be considered. Genes below this threshold receive minimum scores.
n_top_EG_genes (int, default=500) – Number of top emergent genes to select for each condition based on emergene scores.
remove_lowly_expressed (bool, default=True) – Flag indicating whether to filter lowly expressed genes. Currently implemented via n_cells_expressed_threshold.
expressed_pct (float, default=0.1) – Minimum percentage of cells in which a gene must be expressed. Note: Currently not implemented, planned for future versions.
inplace (bool, default=False) – If True, saves emergene scores directly into adata.var and modifies the AnnData object in-place. If False, returns scores as a DataFrame.
gene_list_as_string (bool, default=False) – If True, saves top genes and scores as a comma-separated string in the format “gene1:score1,gene2:score2,…”. If False, saves as a pandas DataFrame with separate columns for genes and scores.
verbose (int, default=1) – Verbosity level. 0: silent, 1: progress messages, 2: detailed output.
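
To make the effect of sigma concrete, here is a minimal sketch of how an exponential kernel turns distances into edge weights. The exact kernel used by emergene is not stated here, so the form exp(-d / sigma) and the helper name edge_weight are assumptions made for illustration only.

```python
import math


def edge_weight(distance, sigma=100.0):
    """Hypothetical kernel: weight decays exponentially with distance.

    Assumed form exp(-d / sigma); the documentation only says
    "exponential kernel", so the real formula may differ.
    """
    return math.exp(-distance / sigma)


# Compare how quickly weights fall off for a few sigma values.
for sigma in (1.0, 10.0, 100.0):
    weights = [round(edge_weight(d, sigma), 3) for d in (0.0, 5.0, 20.0)]
    print(f"sigma={sigma:>5}: {weights}")
```

Under this assumed form, a larger sigma keeps distant neighbors at a weight closer to 1, which matches the documented behavior of slower decay.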

Returns:

If inplace=False –
Tuple[Dict, pd.DataFrame]
- Dictionary where keys are f'EG_{condition}' and values are either strings (if gene_list_as_string=True) or DataFrames containing the top emergent genes and their scores for each condition.
- DataFrame with columns f'EmerGene_{condition}' containing emergene scores for all genes across all conditions.

If inplace=True –
Dict
- Dictionary of top emergent genes. The AnnData object is modified in-place with emergene scores added to adata.var and local fold changes added to adata.layers['localFC'].
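
When gene_list_as_string=True, each dictionary value is a string in the "gene1:score1,gene2:score2" format described above. The sketch below shows one way to turn such a string back into (gene, score) pairs. The helper parse_gene_score_string is hypothetical and not part of emergene, and the gene names and scores are made up.

```python
def parse_gene_score_string(gene_string):
    """Parse 'gene1:score1,gene2:score2' into a list of (gene, score) tuples."""
    pairs = []
    for item in gene_string.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty input or a trailing comma
        # Split on the last colon so gene names containing ':' stay intact.
        gene, _, score = item.rpartition(":")
        pairs.append((gene, float(score)))
    return pairs


# Made-up genes and scores, for illustration only.
example = "CD3E:0.82,CD8A:0.75,GZMK:0.61"
top_genes = parse_gene_score_string(example)
print(top_genes[0])
```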

Notes

The method computes three components for each gene:

- Target specificity (GSP): Cosine similarity between original expression and diffused expression within the condition.
- Random background: Average GSP from randomly permuted adjacency matrices.
- Condition-wise background: GSP between target condition and other conditions.

The final emergene score is: GSP - μ × random_GSP - β × condition_GSP, where μ and β are the mu and beta parameters.

The local fold change matrix is stored in adata.layers['localFC'] and represents log1p-transformed fold changes of each gene in each cell relative to the cross-condition background.
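
The toy sketch below illustrates the scoring idea for a single gene: diffuse its expression over a small cell graph, take the cosine similarity with the original expression, then subtract the weighted backgrounds. It is not emergene's implementation. The one-step neighbor averaging, the four-cell chain graph, and the background values 0.3 and 0.2 are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def diffuse(expression, adjacency):
    """One diffusion step: each cell takes the weighted mean of its neighbors."""
    diffused = []
    for row in adjacency:
        total = sum(row)
        if total == 0.0:
            diffused.append(0.0)  # isolated cell: nothing to average
        else:
            diffused.append(sum(w * x for w, x in zip(row, expression)) / total)
    return diffused


def emergene_score(gsp, random_gsp, condition_gsp, mu=1.0, beta=1.0):
    """Final score as documented: GSP - mu * random_GSP - beta * condition_GSP."""
    return gsp - mu * random_gsp - beta * condition_gsp


# Four cells connected in a chain (0-1, 1-2, 2-3); the gene is on in cells 0 and 1.
adjacency = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
expression = [1.0, 1.0, 0.0, 0.0]

gsp = cosine_similarity(expression, diffuse(expression, adjacency))
# Background values are placeholders, not computed from real permutations.
score = emergene_score(gsp, random_gsp=0.3, condition_gsp=0.2)
print(round(gsp, 3), round(score, 3))
```

Setting mu=0 or beta=0 in emergene_score drops the corresponding background term, mirroring the parameter descriptions above.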

Raises:

ValueError – If use_rep or use_rep_acrossDataset is not found in adata.obsm. If condition_key is not found in adata.obs. If layer is specified but not found in adata.layers. If numeric parameters are out of acceptable ranges.
ImportError – If required dependency bbknn is not installed.
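
The documented error conditions can be checked up front, before starting a long run. The following is a minimal sketch of such a pre-flight check. The helpers check_emergene_inputs and bbknn_available are hypothetical and not part of emergene, and the SimpleNamespace object only stands in for a real AnnData so the example runs without extra dependencies.

```python
import importlib.util
from types import SimpleNamespace


def check_emergene_inputs(adata, use_rep="X_pca", use_rep_acrossDataset="X_pca",
                          condition_key="Sample", layer=None):
    """Return a list of problems mirroring the documented ValueError cases.

    Hypothetical helper; it only needs adata.obsm, adata.obs and
    adata.layers to support the `in` operator.
    """
    problems = []
    # dict.fromkeys de-duplicates while keeping order, so a shared key is reported once.
    for key in dict.fromkeys((use_rep, use_rep_acrossDataset)):
        if key not in adata.obsm:
            problems.append(f"'{key}' not found in adata.obsm")
    if condition_key not in adata.obs:
        problems.append(f"'{condition_key}' not found in adata.obs")
    if layer is not None and layer not in adata.layers:
        problems.append(f"'{layer}' not found in adata.layers")
    return problems


def bbknn_available():
    """True if the bbknn package can be imported in this environment."""
    return importlib.util.find_spec("bbknn") is not None


# Stand-in for an AnnData object, for illustration only.
fake_adata = SimpleNamespace(obsm={"X_pca": None}, obs={"Sample": None}, layers={})
print(check_emergene_inputs(fake_adata))
print(check_emergene_inputs(fake_adata, use_rep="X_umap", layer="log1p"))
```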

Examples

Basic usage with default parameters:

>>> import scanpy as sc
>>> import emergene as eg
>>> adata = sc.read_h5ad("data.h5ad")
>>> gene_dict, scores = eg.tl.runEMERGENE(adata, condition_key='cell_type')
>>> print(gene_dict['EG_T_cell'].head())

Using custom parameters and saving in-place:

>>> gene_dict = eg.tl.runEMERGENE(
...     adata,
...     condition_key='treatment',
...     n_top_EG_genes=1000,
...     mu=1.5,
...     beta=0.5,
...     inplace=True
... )
>>> print(adata.var['EmerGene_treated'].head())
