emergene.tools package

Tools module for emergene analysis.

This module contains the main analysis functions for identifying emergent genes, marker genes, computing gene set scores, and identifying gene modules.

emergene.tools.identifyGeneModule(adata, gene_list, use_rep: str = 'X_pca', resolution: float = 0.5, n_components: int = 30, verbosity: int = 0)
emergene.tools.runEMERGENE(adata: AnnData, use_rep: str = 'X_pca', use_rep_acrossDataset: str = 'X_pca', layer: str | None = None, n_nearest_neighbors: int = 10, condition_key: str = 'Sample', random_seed: int = 27, n_repeats: int = 3, mu: float = 1.0, beta: float = 1.0, sigma: float = 100.0, n_cells_expressed_threshold: int = 50, n_top_EG_genes: int = 500, remove_lowly_expressed: bool = True, expressed_pct: float = 0.1, inplace: bool = False, gene_list_as_string: bool = False, verbose: int = 1) → Tuple[Dict[str, str | DataFrame], DataFrame] | Dict[str, str | DataFrame]

Identify emergent genes across different biological conditions.

emergene identifies genes that show coordinated local expression patterns within specific conditions by using graph-based diffusion and cosine similarity. The method compares gene expression patterns within a condition against random backgrounds and other conditions to identify truly emergent patterns.

Parameters:
  • adata (AnnData) – Annotated data matrix containing preprocessed single-cell RNA-seq data. Should contain low-dimensional representations (e.g., PCA) in .obsm.

  • use_rep (str, default='X_pca') – Key in adata.obsm for the low-dimensional embedding used for condition-specific diffusion. Common choices include ‘X_pca’, ‘X_umap’, or other dimensionality reduction results.

  • use_rep_acrossDataset (str, default='X_pca') – Key in adata.obsm for computing the across-dataset connectivity matrix using bbknn. Can be the same as use_rep or a different representation.

  • layer (str or None, default=None) – Key in adata.layers for the gene expression matrix to use. If None, uses the default expression matrix stored in adata.X. Log-normalized counts (e.g., a ‘log1p’ layer) or INFOG-normalized data are recommended.

  • n_nearest_neighbors (int, default=10) – Number of nearest neighbors used when constructing adjacency matrices. Higher values increase smoothing but may dilute local patterns.

  • condition_key (str, default='Sample') – Column name in adata.obs that specifies the condition or batch label for each cell. Must be a categorical variable.

  • random_seed (int, default=27) – Seed for the random number generator to ensure reproducibility of randomization procedures.

  • n_repeats (int, default=3) – Number of randomizations to perform for background generation. Higher values provide more stable background estimates but increase computation. Minimum recommended: 3, typical: 10-30.

  • mu (float, default=1.0) – Weight for subtracting the random background specificity in the final emergene score. Set to 0 to disable random background correction.

  • beta (float, default=1.0) – Weight for subtracting the condition-wise background specificity in the final emergene score. Set to 0 to disable cross-condition correction.

  • sigma (float, default=100.0) – Scaling parameter for the exponential kernel in adjacency matrix construction. Larger values result in slower decay of edge weights.

  • n_cells_expressed_threshold (int, default=50) – Minimum number of cells (within the target condition) in which a gene must be expressed to be considered. Genes below this threshold receive minimum scores.

  • n_top_EG_genes (int, default=500) – Number of top emergent genes to select for each condition based on emergene scores.

  • remove_lowly_expressed (bool, default=True) – Flag indicating whether to filter lowly expressed genes. Currently implemented via n_cells_expressed_threshold.

  • expressed_pct (float, default=0.1) – Minimum percentage of cells in which a gene must be expressed. Note: Currently not implemented, planned for future versions.

  • inplace (bool, default=False) – If True, saves emergene scores directly into adata.var and modifies the AnnData object in-place. If False, returns scores as a DataFrame.

  • gene_list_as_string (bool, default=False) – If True, saves top genes and scores as a comma-separated string in the format “gene1:score1,gene2:score2,…”. If False, saves as a pandas DataFrame with separate columns for genes and scores.

  • verbose (int, default=1) – Verbosity level. 0: silent, 1: progress messages, 2: detailed output.
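
The effect of sigma described above can be illustrated with a small sketch. The exact kernel used by runEMERGENE is not stated in this reference, so the form exp(-d / sigma) and the helper name edge_weight are assumptions made purely for illustration.

```python
import math

def edge_weight(distance: float, sigma: float = 100.0) -> float:
    """Hypothetical exponential kernel: the weight decays with distance."""
    # Assumed form exp(-d / sigma); the library's actual kernel may differ.
    return math.exp(-distance / sigma)

distances = [1.0, 10.0, 50.0]
for sigma in (10.0, 100.0):
    # A larger sigma keeps distant neighbors at a higher weight.
    weights = [round(edge_weight(d, sigma), 3) for d in distances]
    print(f"sigma={sigma}: {weights}")
```

Under this assumed kernel, a neighbor at a given distance always receives a larger weight with sigma=100 than with sigma=10, which matches the "slower decay" behavior described for the parameter.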

Returns:

  • If inplace=False

    Tuple[Dict, pd.DataFrame]
    • Dictionary where keys are f'EG_{condition}' and values are either strings (if gene_list_as_string=True) or DataFrames containing the top emergent genes and their scores for each condition.

    • DataFrame with columns f'EmerGene_{condition}' containing emergene scores for all genes across all conditions.

  • If inplace=True

    Dict

    Dictionary of top emergent genes. The AnnData object is modified in-place with emergene scores added to adata.var and local fold changes added to adata.layers['localFC'].
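
When gene_list_as_string=True, each dictionary value is a string in the “gene:score” format described above. A minimal sketch of turning such a string back into pairs follows; parse_gene_string is a hypothetical helper, not part of emergene, and the example values are made up.

```python
def parse_gene_string(gene_string: str) -> list[tuple[str, float]]:
    """Parse 'gene1:score1,gene2:score2' into [(gene, score), ...]."""
    pairs = []
    for item in gene_string.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma or an empty string
        # Split on the last colon so gene names containing ':' stay intact.
        gene, score = item.rsplit(":", 1)
        pairs.append((gene, float(score)))
    return pairs

# Made-up values, for illustration only.
example = "CD3E:0.82,CD2:0.75,IL7R:0.61"
top_genes = parse_gene_string(example)
print(top_genes)
```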

Notes

The method computes three components for each gene:

  1. Target specificity (GSP): Cosine similarity between original expression and diffused expression within the condition

  2. Random background: Average GSP from randomly permuted adjacency matrices

  3. Condition-wise background: GSP between target condition and other conditions

The final emergene score is: GSP - μ × random_GSP - β × condition_GSP

The local fold change matrix is stored in adata.layers['localFC'] and represents log1p-transformed fold changes of each gene in each cell relative to the cross-condition background.
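
The score formula above can be sketched on toy data with NumPy. This is an illustration under stated assumptions, not the library's implementation: diffusion is modeled as a single multiplication by a row-normalized adjacency matrix, the random background permutes the nodes of that matrix, and the condition-wise term is passed in as a precomputed array because its construction is not detailed here. The names cosine_per_gene and toy_emergene_score are hypothetical.

```python
import numpy as np

def cosine_per_gene(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Column-wise cosine similarity between two (cells x genes) matrices."""
    num = (a * b).sum(axis=0)
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0)
    # Genes with zero norm get similarity 0 instead of NaN.
    return np.divide(num, denom, out=np.zeros_like(num, dtype=float), where=denom > 0)

def toy_emergene_score(X, A, A_random_list, gsp_condition, mu=1.0, beta=1.0):
    """GSP - mu * random_GSP - beta * condition_GSP, one value per gene."""
    gsp = cosine_per_gene(X, A @ X)  # target specificity (assumed one-step diffusion)
    gsp_random = np.mean([cosine_per_gene(X, A_r @ X) for A_r in A_random_list], axis=0)
    return gsp - mu * gsp_random - beta * gsp_condition

rng = np.random.default_rng(27)
n_cells, n_genes = 6, 4
X = rng.random((n_cells, n_genes))  # toy expression matrix

# Toy ring graph with self-loops, row-normalized.
eye = np.eye(n_cells)
A = eye + np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)
A /= A.sum(axis=1, keepdims=True)

# Three randomized graphs (mirrors n_repeats=3), made by relabeling nodes.
A_random_list = []
for _ in range(3):
    perm = rng.permutation(n_cells)
    A_random_list.append(A[perm][:, perm])

gsp_condition = np.zeros(n_genes)  # placeholder for the cross-condition term
scores = toy_emergene_score(X, A, A_random_list, gsp_condition)
print(scores)
```

Setting mu=0 and beta=0 in this sketch reduces the score to the plain target specificity, mirroring how those parameters are described as disabling the two corrections.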

Raises:
  • ValueError – If use_rep or use_rep_acrossDataset is not found in adata.obsm; if condition_key is not found in adata.obs; if layer is specified but not found in adata.layers; or if a numeric parameter is outside its acceptable range.

  • ImportError – If required dependency bbknn is not installed.
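
The key checks listed above can be mirrored in a small pre-flight helper, which may be handy before launching a long run. This is a sketch only: check_inputs is a hypothetical function, not part of emergene, and the stand-in object merely mimics the attribute names of AnnData so the example runs without real data.

```python
from types import SimpleNamespace

def check_inputs(adata, use_rep="X_pca", use_rep_acrossDataset="X_pca",
                 condition_key="Sample", layer=None):
    """Raise ValueError early if a required key is missing (hypothetical helper)."""
    for rep in (use_rep, use_rep_acrossDataset):
        if rep not in adata.obsm:
            raise ValueError(f"'{rep}' not found in adata.obsm")
    if condition_key not in adata.obs:
        raise ValueError(f"'{condition_key}' not found in adata.obs")
    if layer is not None and layer not in adata.layers:
        raise ValueError(f"'{layer}' not found in adata.layers")

# Stand-in with the same attribute names as AnnData, for illustration only.
fake_adata = SimpleNamespace(obsm={"X_pca": None}, obs={"Sample": None}, layers={})
check_inputs(fake_adata)  # passes silently

try:
    check_inputs(fake_adata, layer="log1p")
except ValueError as err:
    print(err)
```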

Examples

Basic usage with default parameters:

>>> import scanpy as sc
>>> import emergene as eg
>>> adata = sc.read_h5ad("data.h5ad")
>>> gene_dict, scores = eg.tl.runEMERGENE(adata, condition_key='cell_type')
>>> print(gene_dict['EG_T_cell'].head())

Using custom parameters and saving in-place:

>>> gene_dict = eg.tl.runEMERGENE(
...     adata,
...     condition_key='treatment',
...     n_top_EG_genes=1000,
...     mu=1.5,
...     beta=0.5,
...     inplace=True
... )
>>> print(adata.var['EmerGene_treated'].head())

See also

runMarkG

Marker gene identification without condition comparison

infog

INFOG normalization for preprocessing

score

Gene set enrichment scoring

emergene.tools.runMarkG(adata, use_rep: str = 'X_pca', layer: str = 'log1p', n_nearest_neighbors: int = 10, random_seed: int = 27, n_repeats: int = 3, mu: float = 1, sigma: float = 100, remove_lowly_expressed=True, expressed_pct=0.1)
emergene.tools.score(adata, gene_list, gene_weights=None, n_nearest_neighbors: int = 30, leaf_size: int = 40, layer: str = 'infog', random_seed: int = 1927, n_ctrl_set: int = 100, key_added: str | None = None, verbosity: int = 0)

For a given gene set, compute gene expression enrichment scores and P values for all the cells.

Parameters:
  • adata (AnnData) – The AnnData object for the gene expression matrix.

  • gene_list (list of str) – A list of gene names for which the score will be computed.

  • gene_weights (list of floats, optional) – A list of weights corresponding to the genes in gene_list. The length of gene_weights must match the length of gene_list. If None, all genes in gene_list are weighted equally. Default is None.

  • n_nearest_neighbors (int, optional) – Number of nearest neighbors to consider for randomly selecting control gene sets based on the similarity of genes’ mean and variance among all cells. Default is 30.

  • leaf_size (int, optional) – Leaf size for the KD-tree or Ball-tree used in nearest neighbor calculations. Default is 40.

  • layer (str, optional) – The name of the layer in adata.layers to use for gene expression values. Default is ‘infog’.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

  • n_ctrl_set (int, optional) – Number of control gene sets to be used for calculating P values. Default is 100.

  • key_added (str, optional) – Key under which the results are stored: the computed scores go to adata.obs[key_added], and both the scores and the P values go to adata.uns[key_added]. If None (the default), the key INFOG_score is used.

  • verbosity (int, optional) – Level of verbosity for logging information. Default is 0.

Returns:

Modifies the adata object in place; see key_added for where the results are stored.

Return type:

None
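
How the P values are derived from the control gene sets is not spelled out in this reference. As a rough illustration of the general idea, the sketch below uses the common add-one empirical estimator, which compares each cell's observed score with its scores from the control sets. The function name and the estimator are assumptions, not the library's confirmed method.

```python
import numpy as np

def empirical_p_values(observed: np.ndarray, control: np.ndarray) -> np.ndarray:
    """One-sided empirical P value per cell.

    observed: shape (n_cells,), score of the gene set in each cell.
    control:  shape (n_ctrl_set, n_cells), scores of the control gene sets.
    """
    n_ctrl = control.shape[0]
    # Count control sets scoring at least as high as the observed gene set.
    n_as_extreme = (control >= observed).sum(axis=0)
    # The add-one correction keeps P values strictly above zero.
    return (1 + n_as_extreme) / (1 + n_ctrl)

rng = np.random.default_rng(1927)
control = rng.normal(size=(100, 5))  # 100 control sets (mirrors n_ctrl_set), 5 toy cells
observed = np.array([5.0, 0.0, -5.0, 1.0, 2.0])
p_values = empirical_p_values(observed, control)
print(p_values)
```

With this estimator the smallest attainable P value is 1 / (n_ctrl_set + 1), so the number of control sets bounds the resolution of the P values.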