cosg package#
- cosg.cosg(adata, groupby: str = 'CellTypes', groups: Literal['all'] | Iterable[str] = 'all', mu: float = 1, remove_lowly_expressed: bool = False, expressed_pct: float | None = 0.1, n_genes_user: int = 50, batch_key: str = None, batch_cell_number_threshold: int = 3, key_added: str | None = None, calculate_logfoldchanges: bool = False, use_raw: bool = False, layer: str | None = None, reference: str = 'rest', return_by_group: bool = True, column_delimiter: str = '::', verbosity: int = 0, copy: bool = False)#
Marker gene identification for single-cell sequencing data using COSG.
- Parameters:
adata – AnnData object with cell group information.
groupby (str, default 'CellTypes') – Key in adata.obs that defines the cell group labels. Defaults to ‘CellTypes’.
groups ({'all'} or iterable of str, default 'all') – Subset of cell groups, e.g. [‘g1’, ‘g2’, ‘g3’], to which comparison shall be restricted. The default value is ‘all’, and all groups will be compared.
mu (float, default 1) – The penalty parameter restricting marker genes expressing in non-target cell groups. Larger value represents more strict restrictions. mu should be non-negative, and by default, mu = 1.
remove_lowly_expressed (bool, default False) – If True, genes that express a percentage of target cells smaller than a specific value (expressed_pct) are not considered as marker genes for the target cells. The default value is False.
expressed_pct (float, optional, default 0.1) – When remove_lowly_expressed is set to True, genes that express a percentage of target cells smaller than a specific value (expressed_pct) are not considered as marker genes for the target cells. The default value for expressed_pct is 0.1 (10%).
n_genes_user (int, default 50) – The number of genes that appear in the returned tables. The default value is 50.
batch_key (str, optional, default None) – Used to indicate which batch info in adata.obs to use for calculate the cosine similarities separately for each batch and then average them. The default value is None.
batch_cell_number_threshold (int, default 3) – Minimum number of cells required in a batch for a given cluster to be considered for computing the cosine similarity score when batch_key is not None. If a cluster has fewer than this number of cells in a batch, the cosine similarity from that batch for the cluster will be ignored in the average.
key_added (str, optional, default None) – Key under which the COSG results will be stored in adata.uns. If None, the default key ‘cosg’ is used.
calculate_logfoldchanges (bool, default False) – If True, computes log fold-changes for the marker genes.
use_raw (bool, default False) – If True and adata.raw exists, raw UMI counts saved in adata.raw.X are used for calculations.
layer (str, optional, default None) – Key in adata.layers to specify an alternative gene expression layer for COSG.
reference (str, default 'rest') – Specifies the reference group for comparison. If ‘rest’, compare each group to the union of the rest of the group. If a group identifier, compare with respect to this group.
return_by_group (bool, default True) – If True, the COSG scores are also summarized and stored separately by group in adata.uns[key_added][‘COSG’]. This will output another extra copy of the results. Defaults to True.
column_delimiter (str, default "::") – Delimiter to use when combining multi-index column names for adata.write() compatibility. Default is “::”.
verbosity (int, default 0) – Controls the verbosity of logging messages, defaults to 0. Higher values produce more detailed messages.
copy (bool, default False) – If True, returns a copy of adata with the computed embeddings instead of modifying in place. Defaults to False.
- Returns:
The function stores the COSG marker gene identification results in adata.uns[key_added] (if copy is False) or returns a modified copy of adata (if copy is True). The results include structured arrays containing gene names and their corresponding COSG scores for each cell group: names : structured np.ndarray (.uns[‘cosg’])
Structured array to be indexed by group id storing the gene names. Ordered according to scores.
- scoresstructured np.ndarray (.uns[‘cosg’])
- Structured array to be indexed by group id storing COSG scores for each gene for each
group. Ordered according to scores.
The marker genes and their COSG scores are also summarized and stored separately by group in adata.uns[key_added][‘COSG’] if return_by_group = True.
- Return type:
None or AnnData
Examples
>>> import cosg as cosg >>> import scanpy as sc >>> adata = sc.datasets.pbmc68k_reduced() >>> cosg.cosg(adata, key_added='cosg', groupby='bulk_labels') >>> ### Visualize the marker genes in a dot plot >>> cosg.plotMarkerDotplot( ... adata, ... groupby='bulk_labels', ... top_n_genes=3, ... key_cosg='cosg', ... use_rep=None, ... swap_axes=False, ... standard_scale='var', ... cmap='Spectral_r' ... )
- cosg.indexByGene(df: DataFrame, gene_key: str = 'names', score_key: str = 'scores', column_delimiter: str = '::', set_nan_to_zero: bool = False, convert_negative_one_to_zero: bool = True)#
Reshapes a DataFrame with flattened column names (converted from MultiIndex columns) where gene names are under the specified key and scores are under the corresponding key. The resulting DataFrame will have gene names as the index and scores for each cell group in separate columns.
Note
This function is designed for reindexing COSG’s output stored in adata.uns[‘cosg’][‘COSG’]. It works with the flattened column format where attribute and cell group are joined by a delimiter in the format ‘attribute{delimiter}cell_group’. It is recommended to set n_genes_user=adata.n_vars when calling the cosg.cosg function to ensure that scores for all genes are returned.
- Parameters:
df (pd.DataFrame) – Input DataFrame with MultiIndex columns.
gene_key (str, optional, default="names") – The key used for gene names in the first level of MultiIndex.
score_key (str, optional, default="scores") – The key used for scores in the first level of MultiIndex.
column_delimiter (str, optional, default="::") – Delimiter used to separate attribute and cell group in the column names.
set_nan_to_zero (bool, optional, default=False) – If True, replaces all NaN values in the resulting DataFrame with 0. Set this parameter to True if n_genes_user=adata.n_vars is not set when calling the cosg.cosg function.
convert_negative_one_to_zero (bool, optional, default=True) – If True, replaces all occurrences of -1.0 in the resulting DataFrame with 0.
- Returns:
Reshaped DataFrame with genes as index and scores per cell type.
- Return type:
pd.DataFrame
Example
>>> cosg_df = indexByGene( ... adata.uns['cosg']['COSG'], ... gene_key="names", ... score_key="scores", ... column_delimiter="::", ... set_nan_to_zero=True, ... convert_negative_one_to_zero=True ... ) >>> ### Check the reindexed data frame >>> cosg_df.head()
- cosg.iqrLogNormalize(df: DataFrame, q_upper: float = 0.95, q_lower: float = 0.75)#
Applies an IQR-based scaling to a DataFrame followed by a log1p transformation.
The function computes the interquartile range (IQR) for each column as the difference between the q_upper and q_lower quantiles. Each column is then divided by its respective IQR.
If any IQR value is zero, it is replaced with the smallest nonzero IQR found among all columns. If all IQR values are zero, a replacement value of 1e-6 is used.
Optionally, the log1p-transformed result can be further scaled using min–max normalization to rescale the values to the [0, 1] range (applied per column).
- Parameters:
df (pd.DataFrame) – Input DataFrame containing numerical values.
q_upper (float, optional, default=0.95) – The upper quantile used for IQR calculation.
q_lower (float, optional, default=0.75) – The lower quantile used for IQR calculation.
- Returns:
A DataFrame with the same shape as the input containing the IQR-scaled and log1p-transformed values.
- Return type:
pd.DataFrame
- Raises:
TypeError – If the input df is not a pandas DataFrame.
ValueError – If q_lower is not in the range [0, q_upper) or if q_upper is not in (q_lower, 1].
Example
>>> ## cosg_df is a DataFrame with gene scores, output from the cosg.indexByGene function >>> cosg_df_transformed = iqr_log_normalize(cosg_df, q_upper=0.95, q_lower=0.75)
- cosg.plotMarkerDendrogram(adata, group_by: str, use_rep: str = 'X_pca', calculate_dendrogram_on_cosg_scores: bool = True, top_n_genes: int = 3, cosg_key: str = 'cosg', radius_step: float = 1.5, cmap: str = 'Purples', cell_type_label_offset: float = 0, gene_label_offset: float = 0.25, gene_label_color: str = None, linkage_method: str = 'ward', distance_metric: str = 'euclidean', hierarchy_merge_scale: float = None, collapse_scale: float = None, add_cluster_node_for_single_node_cluster: bool = True, palette=None, gene_color_min: float = 0, gene_color_max: float = None, font_outline: float = 2, figure_size: tuple = (10, 10), node_shape_cell_type: str = 'o', node_shape_gene: str = 's', node_shape_internal: str = 'o', colorbar_width: float = 0.01, layer: str = None, gene_size_scale: float = 300, map_cell_type_gene: dict = None, cell_type_selected: list = None, color_root_node: str = '#D6EFD5', color_internal_node: str = 'lightgray', color_edge: str = 'lightgray', edge_curved: float = 0.0, show_figure: bool = True, save: str = None)#
Visualizes a radial dendrogram of cell types with attached top marker genes.
Computes a dendrogram in two modes: - If calculate_dendrogram_on_cosg_scores is True, uses COSG scores from adata.uns[‘cosg’][‘COSG’],
processed with indexByGene() and iqrLogNormalize(), then computes the dendrogram on the transposed DataFrame with the specified distance_metric and linkage_method.
If False, aggregates adata.obsm[use_rep] by adata.obs[group_by], computing distances with pdist and linkage with the given distance_metric and linkage_method.
When collapse_scale (0 to 1) is set and yields multiple clusters, cell types are grouped with cluster nodes; if only one cluster, they attach directly to the root unless add_cluster_node_for_single_node_cluster is True, which adds a cluster node for single-member clusters. If collapse_scale is None, hierarchy_merge_scale (0 to 1) controls merging binary nodes into multi-child nodes based on distance similarity, with no merging if None. Top marker genes (from COSG data) are added as nodes to cell type leaves, with labels offset by gene_label_offset and colored by gene_label_color if provided.
Cell type node colors come from palette: - Dictionary: Maps cell types to colors. - List: Assigns colors by cell type order. - None: Uses adata.uns[f”{group_by}_colors”] if available, else defaults to “lightblue”. Marker gene node colors are scaled between gene_color_min and gene_color_max (max defaults to the highest score). Node sizes reflect expression percentage (fraction of cells with expression > 0) from adata.X or adata.layers[layer].
- Parameters:
adata (AnnData) – An AnnData object.
group_by (str) – The observation key in adata.obs to group cell types.
use_rep (str, optional, default='X_pca') – The representation to use when aggregating data from adata.obsm.
calculate_dendrogram_on_cosg_scores (bool, optional, default=True) – If True, compute the dendrogram on COSG scores derived using cosg.cosg, cosg.indexByGene and cosg.iqrLogNormalize. If False, compute the dendrogram on the aggregated representation from adata.obsm[use_rep].
top_n_genes (int, optional, default=3) – Number of top marker genes (per cell type) to attach.
cosg_key (str, optional, default='cosg') – The key used to access the COSG marker gene identification results. Defaults to “cosg”.
radius_step (float, optional, default=1.5) – Radial distance between successive levels in the layout.
cmap (str, optional, default="Purples") – The matplotlib colormap to use for gene nodes.
cell_type_label_offset (float, optional, default=0) – Fractional radial offset for cell type labels from the cell type node.
gene_label_offset (float, optional, default=0.25) – Fractional radial offset for gene labels from the marker node.
gene_label_color (str, optional, default=None) – If provided, this color is used for gene labels; otherwise, the gene node’s colormap color is used.
linkage_method (str, optional, default="ward") – Linkage method to use when computing the dendrogram.
distance_metric (str, optional, default="euclidean") – Distance metric to use when computing the dendrogram.
hierarchy_merge_scale (float or None, optional, default=None) – Controls the merging of binary nodes into multi-child nodes to simulate a non-binary hierarchy when collapse_scale is None. If provided, must be a float between 0 and 1, scaling the threshold relative to the range of linkage distances in Z. Nodes with distance differences below this scaled threshold are merged with their parent, allowing nodes to have more than two children. - 0: No merging (retains binary structure). - 1: Maximal merging (merges nodes if their distances differ by less than the full distance range). If None, no merging is performed, preserving the default binary dendrogram structure from Z. Raises ValueError if not between 0 and 1 when provided.
collapse_scale (float or None, optional, default=None) – Controls the level of clustering in the dendrogram. If None, builds a full hierarchical dendrogram where nodes may have more than two children based on distance similarity. If a float between 0 and 1, scales the threshold relative to the min and max linkage distances in Z, collapsing leaves and internal nodes with distances below this scaled threshold into cluster nodes. - 0: Maximal clustering (collapses at the minimum distance). - 1: Minimal clustering (collapses at the maximum distance). If only one cluster is found, no extra cluster node is added between the root and leaves. Raises ValueError if not between 0 and 1 when provided.
add_cluster_node_for_single_node_cluster (bool, optional, default=True) – Determines whether to create a cluster node for clusters containing only a single cell type when collapse_scale is provided. If True, a cluster node is added between the root and the single cell type node, maintaining a consistent hierarchy. If False, the single cell type node is connected directly to the root without an intermediate cluster node. Only applies when collapse_scale is not None and clustering results in single-member clusters.
palette (dict, list, or None, optional, default=None) – Colors for cell type nodes. If a dict, keys are cell type names and values are colors. If a list, colors are assigned in order of cell types. If None and if adata.uns contains f”{group_by}_colors”, that palette is used. Otherwise, cell type nodes default to “lightblue”.
gene_color_min (float, optional, default=0) – Minimum value for normalizing marker gene node colors.
gene_color_max (float or None, optional, default=None) – Maximum value for normalizing marker gene node colors. If None, the maximum among marker scores is used.
font_outline (float, optional, default=2) – Outline width for text labels.
figure_size (tuple, optional, default=(10, 10)) – Size of the figure.
node_shape_cell_type (str, optional, default='d') – Shape of the cell type nodes. Default is ‘d’ (diamond). Can be any valid NetworkX node shape. Specification is as matplotlib.scatter marker, one of ‘so^>v<dph8’. In detail: - ‘o’ : Circle - ‘s’ : Square - ‘d’ : Diamond - ‘v’ : Triangle Down - ‘^’ : Triangle Up - ‘<’ : Triangle Left - ‘>’ : Triangle Right - ‘p’ : Pentagon - ‘h’ : Hexagon - ‘8’ : Octagon
node_shape_gene (str, optional, default='o') – Shape of marker gene nodes. Default is ‘o’ (circle).
node_shape_internal (str, optional, default='o') – Shape of internal dendrogram nodes. Default is ‘o’ (circle).
colorbar_width (float, optional, default=0.01) – Width (in normalized figure coordinates) for the colorbar.
layer (str, optional, default=None) – If provided, use adata.layers[layer] to calculate expression; otherwise, use adata.X.
gene_size_scale (float, optional, default=300) – Base size for marker gene nodes; final size = gene_size_scale * (expression_percentage / 100).
map_cell_type_gene (dict, optional, default=None) – Custom mapping of cell types to marker genes. If provided, this will be used instead of the top marker genes. Should be a dictionary where keys are cell type names and values are lists of gene names. Only genes present in adata.var_names will be included. It’s okay if some cell types are not in the dict.
cell_type_selected (list, optional, default=None) – List of cell types to include in the visualization. If provided, only these cell types will be shown. If None, all cell types will be included. Raises ValueError if none of the provided cell types are valid.
color_root_node (str, optional, default='#D6EFD5') – Color for the root node. Default is a dark gray (#404040).
color_internal_node (str, optional, default='lightgray') – Color for internal nodes and cluster nodes. Default is lightgray.
color_edge (str, optional, default='lightgray') – Color for all edges in the dendrogram. Default is lightgray.
edge_curved (float, optional, default=0.0) – Controls the curvature of edges. 0.0 means straight lines, positive values increase curvature. Recommended range: 0.0 to 0.3. Default is 0.0 (straight lines).
show_figure (bool, optional (default=True)) – Whether to display the figure after plotting.
save (str or None, optional (default=None)) – File path to save the resulting figure. If None, the figure will not be saved.
- Returns:
Displays a matplotlib figure of the radial dendrogram if show_figure=True.
- Return type:
None
Example
>>> import cosg >>> cosg.plotMarkerDendrogram( ... adata, ... group_by="CellTypes", ... use_rep="X_pca", ... calculate_dendrogram_on_cosg_scores=False, ... top_n_genes=3, ... radius_step=1.5, ... cmap="Purples", ... gene_label_offset=0.25, ... gene_label_color="black", ... linkage_method="ward", ... distance_metric="correlation", ... collapse_threshold=0.3, ... palette=None, ... gene_color_min=0, ... gene_color_max=None, ... font_outline=2, ... figure_size=(10,10), ... colorbar_width=0.02, ... layer=None, ... gene_size_scale=300 ... )
- cosg.plotMarkerDotplot(adata, groupby, top_n_genes: int = 3, use_rep: str = 'X_pca', layer: str = None, key_cosg: str = 'cosg', swap_axes: bool = False, standard_scale: str = 'var', cmap: str = 'Spectral_r', save: str = None, **dotplot_kwargs)#
Generate a dot plot of top marker genes identified by COSG.
The function computes the cell cluster ordering using a dendrogram (if use_rep is provided) or derives it from adata.obs[groupby], extracts the top marker genes identified by COSG, and plots a dotplot using Scanpy’s sc.pl.dotplot.
- Parameters:
adata – Annotated data object that includes COSG results.
groupby (str) – The cell group key in adata.obs, should match with the groupby parameter used in COSG.
top_n_genes (int, optional (default: 3)) – The number of top marker genes to show for each group.
use_rep (str, optional (default: 'X_pca')) – The cell low-dimensional representation key (e.g., PCA, UMAP) in adata.obsm used to compute the dendrogram.
layer (str or None, optional) – The layer key to use for expression values in the dotplot (default: None).
key_cosg (str, optional (default: 'cosg')) – The key in adata.uns where COSG results are stored.
swap_axes (bool, optional (default: False)) – Whether to swap axes in the dot plot.
standard_scale (str or None, optional) – Whether to standardize expression values across ‘var’ (genes) or ‘group’ (cell groups or clusters). Can be ‘var’, ‘group’, or None (default: ‘var’).
cmap (str, optional (default: 'Spectral_r')) – The colormap used for the dot plot.
save (str or None, optional) – If provided, saves the plot to a file. The filename should include an extension (e.g., “cosg_markers.pdf”).
**dotplot_kwargs (dict) – Additional keyword arguments to pass to sc.pl.dotplot.
- Returns:
Displays the dot plot.
- Return type:
None
- Raises:
ValueError – If required COSG results or dendrogram information are missing, or if the provided groupby does not match the one stored in COSG parameters.
Example
>>> import scanpy as sc >>> import cosg # Assuming plotDotPlot is part of the cosg package >>> adata = sc.datasets.pbmc68k_reduced() >>> # Using a specific low-dimensional representation for dendrogram computation and cell type ordering: >>> cosg.plotMarkerDotplot( ... adata, ... groupby='bulk_labels', ... top_n_genes=3, ... key_cosg='cosg', ... use_rep='X_pca', ... swap_axes=False, ... standard_scale='var', ... cmap='Spectral_r' ... ) >>> # Deriving cell order from adata.obs when use_rep is None: >>> cosg.plotMarkerDotplot( ... adata, ... groupby='bulk_labels', ... top_n_genes=3, ... key_cosg='cosg', ... use_rep=None, ... swap_axes=False, ... standard_scale='var', ... cmap='Spectral_r' ... )