piaso.tools package

piaso.tools.runSVD(adata: AnnData, use_highly_variable: bool = True, n_components: int = 50, random_state: int | None = 10, scale_data: bool = False, n_iter: int = 7, key_added: str = 'X_svd', layer: str | None = None, verbosity: int = 0)

Performs Truncated Singular Value Decomposition (SVD) on the specified gene expression matrix (adata.X or a specified layer) within an AnnData object and stores the resulting low-dimensional representation in adata.obsm.

Parameters:
  • adata (AnnData) – An AnnData object.

  • use_highly_variable (bool, optional, default=True) – If True, the decomposition is performed only on highly variable genes/features.

  • n_components (int, optional, default=50) – The number of SVD components to retain.

  • random_state (int, optional, default=10) – A random seed to ensure reproducibility.

  • scale_data (bool, optional, default=False) – If True, standardizes the input data before performing SVD.

  • n_iter (int, optional, default=7) – Number of iterations for the randomized SVD solver. The default is larger than the default in randomized_svd, to handle sparse matrices that may have a large, slowly decaying spectrum. It is also larger than the default n_iter value (5) of the TruncatedSVD function.

  • key_added (str, optional, default='X_svd') – The key under which the resulting cell embeddings are stored in adata.obsm.

  • layer (str, optional, default=None) – Specifies which layer of adata to use for the transformation. If None, adata.X is used.

  • verbosity (int, optional, default=0) – Controls the verbosity of logging messages.

Returns:

The function modifies adata in place, storing the cell embeddings in adata.obsm[key_added].

Return type:

None
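
The role of n_iter can be illustrated with plain power iteration, the step that randomized SVD solvers repeat. The sketch below is not PIASO's or scikit-learn's implementation: top_singular_value is a hypothetical helper, written in plain Python so that it runs without dependencies. On a toy matrix whose two singular values are close (1.0 and 0.9), running more iterations can only bring the estimate closer to the true leading singular value.

```python
import math
import random


def top_singular_value(X, n_iter=7, seed=10):
    """Estimate the largest singular value of X by power iteration on X^T X."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    # Random unit-length starting direction in feature space.
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(n_iter):
        # u = X v, then v = X^T u, followed by renormalization.
        u = [sum(X[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(X[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    # The length of X v is the singular value estimate.
    u = [sum(X[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(c * c for c in u))


# Toy matrix with a slowly decaying spectrum: singular values 1.0 and 0.9.
X = [[1.0, 0.0], [0.0, 0.9]]
estimate_1 = top_singular_value(X, n_iter=1)
estimate_7 = top_singular_value(X, n_iter=7)
```

Both calls start from the same seeded direction, so the only difference between estimate_1 and estimate_7 is the number of iterations.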

Example

>>> import piaso
>>> piaso.tl.runSVD(adata, use_highly_variable=True, n_components=50, random_state=42,
...        scale_data=False, key_added='X_svd', layer=None)
>>>
>>> # Access the transformed data
>>> adata.obsm['X_svd']
piaso.tools.runSVDLazy(adata, copy: bool = False, n_components: int = 50, use_highly_variable: bool = True, n_top_genes: int = 3000, verbosity: int = 0, batch_key: str | None = None, random_state: int | None = 1927, scale_data: bool = False, n_iter: int = 7, infog_trim: bool = True, key_added: str = 'X_svd', layer: str | None = None, infog_layer: str | None = None)

Performs Truncated Singular Value Decomposition (SVD) in a “lazy” mode, based on the piaso.tl.runSVD function.

Compared to piaso.tl.runSVD, this function also performs highly variable gene selection. If layer is set to 'infog', both the highly variable genes and the normalized gene expression values are taken from the INFOG normalization outputs.

This function operates on the specified gene expression matrix (adata.X or a specified layer) within an AnnData object and stores the resulting low-dimensional representation in adata.obsm.

Parameters:
  • adata (AnnData) – An AnnData object.

  • copy (bool, optional, default=False) – If True, returns a copy of adata with the computed embeddings instead of modifying in place.

  • n_components (int, optional, default=50) – The number of singular value decomposition (SVD) components to retain.

  • use_highly_variable (bool, optional, default=True) – If True, uses only highly variable genes for the decomposition.

  • n_top_genes (int, optional, default=3000) – The number of top highly variable genes to retain before performing SVD.

  • verbosity (int, optional, default=0) – Controls the verbosity of logging messages.

  • batch_key (str, optional, default=None) – The key in adata.obs containing batch labels. It is passed as the batch_key argument to the sc.pp.highly_variable_genes function.

  • random_state (int, optional, default=1927) – A random seed to ensure reproducibility.

  • scale_data (bool, optional, default=False) – If True, standardizes the input data before performing SVD.

  • n_iter (int, optional, default=7) – Number of iterations for the randomized SVD solver. The default is larger than the default in randomized_svd, to handle sparse matrices that may have a large, slowly decaying spectrum. It is also larger than the default n_iter value (5) of the TruncatedSVD function.

  • infog_trim (bool, optional, default=True) – Passed as the trim parameter to the piaso.tl.infog function. Effective only when layer is set to 'infog'.

  • key_added (str, optional, default='X_svd') – The key under which the resulting cell embeddings are stored in adata.obsm.

  • layer (str, optional, default=None) – Specifies which layer of adata to use for the transformation. If None, adata.X is used.

  • infog_layer (str, optional, default=None) – Passed as the layer parameter to the piaso.tl.infog function. Effective only when layer is set to 'infog'.

Returns:

If copy is True, returns a modified AnnData object. Otherwise, modifies adata in place.

Return type:

AnnData or None

Example

>>> import piaso
>>> piaso.tl.runSVDLazy(
...     adata, n_components=50, n_top_genes=3000,
...     use_highly_variable=True, key_added="X_svd",
...     layer=None
... )
>>>
>>> # Access the cell embedding
>>> adata.obsm['X_svd']
piaso.tools.runGDR(adata, batch_key: str = None, groupby: str = None, n_gene: int = 30, mu: float = 1.0, layer: str = None, score_layer: str = None, infog_layer: str = None, use_highly_variable: bool = True, n_highly_variable_genes: int = 5000, n_svd_dims: int = 50, n_svd_iter: int = 7, resolution: float = 1.0, scoring_method: str = None, key_added: str = None, verbosity: int = 0, random_seed: int = 1927)

Run GDR (marker Gene-guided dimensionality reduction) on single-cell data.

GDR performs dimensionality reduction guided by marker genes to better preserve biological signals.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • batch_key (str, optional) – Key in adata.obs representing batch information. Defaults to None. If provided, marker gene identifications will be performed for each batch separately.

  • groupby (str, optional) – Key in adata.obs to specify which cell group information to use. Defaults to None. If None, de novo clustering will be performed.

  • n_gene (int, optional) – Number of genes, parameter used in COSG. Defaults to 30.

  • mu (float, optional) – Gene expression specificity parameter, used in COSG. Defaults to 1.0.

  • layer (str, optional) – Layer in adata.layers to use for the analysis. Defaults to None, which uses adata.X.

  • score_layer (str, optional) – If specified, the gene scoring will be calculated using this layer of adata.layers. Defaults to None.

  • infog_layer (str, optional) – If specified, INFOG normalization will be applied using this layer, which should contain the raw UMI count matrix. Defaults to None.

  • use_highly_variable (bool, optional) – Whether to use only highly variable genes when rerunning the dimensionality reduction. Defaults to True. Only effective when groupby=None.

  • n_highly_variable_genes (int, optional) – Number of highly variable genes to use when use_highly_variable is True. Defaults to 5000. Only effective when groupby=None.

  • n_svd_dims (int, optional) – Number of dimensions to use for SVD. Defaults to 50. Only effective when groupby=None.

  • n_svd_iter (int, optional, default=7) – Number of iterations for the randomized SVD solver. The default is larger than the default in randomized_svd, to handle sparse matrices that may have a large, slowly decaying spectrum. It is also larger than the default n_iter value (5) of the TruncatedSVD function.

  • resolution (float, optional) – Resolution parameter for de novo clustering. Defaults to 1.0. Only effective when groupby=None.

  • scoring_method (str, optional) – Specifies the gene set scoring method used to compute gene scores.

  • key_added (str, optional) – Key under which the GDR dimensionality reduction results will be stored in adata.obsm. If None, results will be saved to adata.obsm['X_gdr'].

  • verbosity (int, optional) – Verbosity level of the function. Higher values provide more detailed logs. Defaults to 0.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

Returns:

The function modifies adata in place by adding GDR dimensionality reduction result to adata.obsm[key_added].

Return type:

None

Examples

>>> import scanpy as sc
>>> import piaso
>>>
>>> adata = sc.read_h5ad("example.h5ad")
>>> piaso.tl.runGDR(
...     adata,
...     batch_key="batch",
...     groupby="CellTypes",
...     n_gene=30,
...     verbosity=0
... )
>>> print(adata.obsm["X_gdr"])
piaso.tools.calculateScoreParallel(adata, gene_set, score_method: Literal['scanpy', 'piaso'], random_seed: int = 1927, score_layer=None, max_workers=None, return_pvals: bool = False, verbosity: int = 0)

Compute gene set scores in parallel using shared memory for efficiency.

Parameters:
  • adata (AnnData) – The input AnnData object.

  • gene_set (dict, list of lists, or pandas.DataFrame) –

    A collection of gene sets, where each gene set is either:
    • A dictionary: Keys are gene set names, values are lists of genes.

    • A list of lists: Each sublist contains genes in a gene set.

    • A pandas.DataFrame: Each column represents a gene set, and column names are gene set names.

  • score_method ({'scanpy', 'piaso'}) – The method used for gene set scoring. Must be either ‘scanpy’ or ‘piaso’:
    • ‘scanpy’: Uses Scanpy’s built-in gene set scoring method.

    • ‘piaso’: Uses PIASO’s gene set scoring method, which is more robust to variation in sequencing depth.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

  • score_layer (str or None, optional) – Layer of the AnnData object to use. If None, adata.X is used.

  • max_workers (int or None, optional) – Number of parallel workers to use. Defaults to the number of CPUs.

  • return_pvals (bool, optional) – Whether to return -log10 p-values when using ‘piaso’ method. Default is False.

  • verbosity (int, optional) – Level of verbosity. Default is 0.
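
The accepted gene_set formats can be brought into one common shape before scoring. The following is a minimal sketch, not PIASO's internal code: gene_sets_to_dict is a hypothetical helper, and the placeholder names it generates for a list of lists (gene_set_0, gene_set_1, ...) are an assumption, since the naming PIASO uses for unnamed gene sets is not stated here.

```python
def gene_sets_to_dict(gene_set):
    """Return {name: [genes]} for a dict, a list of lists, or a DataFrame-like object."""
    if isinstance(gene_set, dict):
        # Dictionary: keys are gene set names, values are lists of genes.
        return {name: list(genes) for name, genes in gene_set.items()}
    if hasattr(gene_set, "columns"):
        # DataFrame-like: one column per gene set; non-string cells (e.g. NaN) are dropped.
        return {
            col: [g for g in gene_set[col] if isinstance(g, str)]
            for col in gene_set.columns
        }
    # List of lists: no names are given, so placeholder names are generated (assumption).
    return {f"gene_set_{i}": list(genes) for i, genes in enumerate(gene_set)}


from_dict = gene_sets_to_dict({"T cell": ["CD3D", "CD3E"]})
from_lists = gene_sets_to_dict([["CD3D"], ["MS4A1", "CD79A"]])
```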

Returns:

If score_method is ‘scanpy’:
  • np.ndarray: A 2D array where each column contains the scores for a gene set.

  • list: The names of the gene sets.

If score_method is ‘piaso’ and return_pvals is True:
  • np.ndarray: A 2D array where each column contains the scores for a gene set.

  • list: The names of the gene sets.

  • np.ndarray: A 2D array where each column contains the -log10(p-values) for a gene set.

If score_method is ‘piaso’ and return_pvals is False:
  • np.ndarray: A 2D array where each column contains the scores for a gene set.

  • list: The names of the gene sets.

Return type:

tuple
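
Because the returned tuple has two or three elements depending on score_method and return_pvals, calling code may find it convenient to normalize it to a fixed shape. A minimal sketch follows; unpack_scores is a hypothetical helper and the nested lists are stand-ins for the NumPy arrays described above.

```python
def unpack_scores(result):
    """Split the returned tuple into (scores, names, pvals); pvals is None if absent."""
    if len(result) == 3:
        scores, names, pvals = result
    else:
        scores, names = result
        pvals = None
    return scores, names, pvals


# Stand-ins for the two documented return shapes (2 cells x 2 gene sets).
without_pvals = ([[0.1, 0.4], [0.3, 0.2]], ["set_a", "set_b"])
with_pvals = ([[0.1, 0.4], [0.3, 0.2]], ["set_a", "set_b"], [[0.5, 2.0], [1.3, 0.7]])

scores_a, names_a, pvals_a = unpack_scores(without_pvals)
scores_b, names_b, pvals_b = unpack_scores(with_pvals)
```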

piaso.tools.calculateScoreParallel_multiBatch(adata: AnnData, batch_key: str, marker_gene: DataFrame, marker_gene_n_groups_indices: list, score_method: Literal['scanpy', 'piaso'], score_layer: str = None, max_workers: int = 8, random_seed: int = 1927)

Calculate gene set scores for each adata batch in parallel using shared memory. Different marker gene sets will be calculated in parallel as well.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • batch_key (str) – The key in adata.obs used to identify batches.

  • marker_gene (DataFrame) – The marker gene DataFrame.

  • marker_gene_n_groups_indices (list) – Indices specifying the marker gene set group boundaries, used for score normalization within each marker gene set group.

  • max_workers (int) – Maximum number of parallel workers to use.

  • score_layer (str) – The layer of adata to use for scoring.

  • score_method ({'scanpy', 'piaso'}) – The method used for gene set scoring. Must be either ‘scanpy’ or ‘piaso’:
    • ‘scanpy’: Uses Scanpy’s built-in gene set scoring method.

    • ‘piaso’: Uses PIASO’s gene set scoring method, which is more robust to variation in sequencing depth.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

Returns:

  • list: A list of normalized score arrays for each batch.

  • list: A list of cell barcodes for each batch.

  • list: A list of gene set names.

Return type:

tuple
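
Since the scores and barcodes come back as one entry per batch, they usually need to be stitched back into the cell order of the original object. The sketch below shows one way to do that. It assumes that each score array has one row per cell, in the same order as the matching barcode list; scores_in_cell_order is a hypothetical helper, and nested lists stand in for arrays.

```python
def scores_in_cell_order(score_list, cellbarcode_info, cell_order):
    """Collect per-batch score rows and reorder them to match cell_order."""
    row_by_barcode = {}
    for batch_scores, batch_barcodes in zip(score_list, cellbarcode_info):
        for barcode, row in zip(batch_barcodes, batch_scores):
            row_by_barcode[barcode] = row
    return [row_by_barcode[barcode] for barcode in cell_order]


# Toy output for two batches, with two gene sets scored per cell.
toy_score_list = [[[0.1, 0.9], [0.8, 0.2]], [[0.5, 0.5]]]
toy_barcodes = [["cell_a", "cell_c"], ["cell_b"]]
# Hypothetical original order, e.g. list(adata.obs_names).
original_order = ["cell_a", "cell_b", "cell_c"]

ordered = scores_in_cell_order(toy_score_list, toy_barcodes, original_order)
```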

Examples

>>> import scanpy as sc
>>> import piaso
>>> adata = sc.read_h5ad('example_data.h5ad')
>>> score_list, cellbarcode_info, gene_set_names = piaso.tl.calculateScoreParallel_multiBatch(
...     adata=adata,
...     batch_key='batch',
...     marker_gene=marker_gene,
...     marker_gene_n_groups_indices=marker_gene_n_groups_indices,
...     score_method='piaso',
...     score_layer='piaso',
...     max_workers=8
... )
>>> print(score_list)
>>> print(cellbarcode_info)
piaso.tools.runGDRParallel(adata, batch_key: str = None, groupby: str = None, n_gene: int = 30, mu: float = 1.0, layer: str = None, score_layer: str = None, infog_layer: str = None, use_highly_variable: bool = True, n_highly_variable_genes: int = 5000, n_svd_dims: int = 50, n_svd_iter: int = 7, resolution: float = 1.0, scoring_method: str = None, key_added: str = None, max_workers: int = 8, calculate_score_multiBatch: bool = False, verbosity: int = 0, random_seed: int = 1927)

Run GDR (marker Gene-guided dimensionality reduction) in parallel using multiple cores and shared memory.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • batch_key (str, optional) – Key in adata.obs representing batch information. Defaults to None. If specified, different batches will be processed separately and in parallel, otherwise, the input data will be processed as one batch.

  • groupby (str, optional) – Key in adata.obs to specify which cell group information to use. Defaults to None. If None, de novo clustering will be performed.

  • n_gene (int, optional) – Number of genes, parameter used in COSG. Defaults to 30.

  • mu (float, optional) – Gene expression specificity parameter, used in COSG. Defaults to 1.0.

  • layer (str, optional) – Layer in adata.layers to use for the analysis. Defaults to None, which uses adata.X.

  • score_layer (str, optional) – If specified, the gene scoring will be calculated using this layer of adata.layers. Defaults to None.

  • infog_layer (str, optional) – If specified, the INFOG normalization will be calculated using this layer of adata.layers, which is expected to contain the UMI count matrix. Defaults to None.

  • use_highly_variable (bool, optional) – Whether to use only highly variable genes when rerunning the dimensionality reduction. Defaults to True. Only effective when groupby=None.

  • n_highly_variable_genes (int, optional) – Number of highly variable genes to use when use_highly_variable is True. Defaults to 5000. Only effective when groupby=None.

  • n_svd_dims (int, optional) – Number of dimensions to use for SVD. Defaults to 50. Only effective when groupby=None.

  • n_svd_iter (int, optional, default=7) – Number of iterations for the randomized SVD solver. The default is larger than the default in randomized_svd, to handle sparse matrices that may have a large, slowly decaying spectrum. It is also larger than the default n_iter value (5) of the TruncatedSVD function.

  • resolution (float, optional) – Resolution parameter for de novo clustering. Defaults to 1.0. Only effective when groupby=None.

  • scoring_method (str, optional) – Specifies the gene set scoring method used to compute gene scores. If set to None, use PIASO’s scoring method as default.

  • key_added (str, optional) – Key under which the GDR dimensionality reduction results will be stored in adata.obsm. If None, results will be saved to adata.obsm['X_gdr'].

  • max_workers (int, optional) – Maximum number of workers to use for parallel computation. Defaults to 8.

  • calculate_score_multiBatch (bool, optional) – Whether to calculate gene scores across multiple adata batches (if batch_key is specified). Defaults to False.

  • verbosity (int, optional) – Verbosity level of the function. Higher values provide more detailed logs. Defaults to 0.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

Returns:

The function modifies adata in place by adding GDR dimensionality reduction result to adata.obsm[key_added].

Return type:

None

Examples

>>> import scanpy as sc
>>> import piaso
>>>
>>> adata = sc.read_h5ad("example.h5ad")
>>> piaso.tl.runGDRParallel(
...     adata,
...     batch_key="batch",
...     groupby="CellTypes",
...     n_gene=30,
...     max_workers=8,
...     verbosity=0
... )
>>> print(adata.obsm["X_gdr"])
piaso.tools.runCOSGParallel(adata: AnnData, batch_key: str, groupby: str = None, layer: str = None, infog_layer: str = None, n_svd_dims: int = 50, n_svd_iter: int = 7, n_highly_variable_genes: int = 5000, verbosity: int = 0, resolution: float = 1.0, mu: float = 1.0, n_gene: int = 30, use_highly_variable: bool = True, return_gene_names: bool = False, max_workers: int = 8, random_seed: int = 1927)

Run COSG on batches in parallel using shared memory and multiprocessing.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • batch_key (str) – The key in adata.obs used to identify batches.

  • groupby (str, optional (default: None)) – The key in adata.obs used to group observations for clustering. If None, clustering will be performed.

  • n_svd_dims (int, optional (default: 50)) – Number of SVD components to compute.

  • n_svd_iter (int, optional, default=7) – Number of iterations for the randomized SVD solver. The default is larger than the default in randomized_svd, to handle sparse matrices that may have a large, slowly decaying spectrum. It is also larger than the default n_iter value (5) of the TruncatedSVD function.

  • n_highly_variable_genes (int, optional (default: 5000)) – Number of highly variable genes to use for SVD.

  • verbosity (int, optional (default: 0)) – Level of verbosity for logging information.

  • resolution (float, optional (default: 1.0)) – Resolution parameter for clustering.

  • layer (str, optional (default: None)) – Layer of the adata object to use for COSG.

  • infog_layer (str, optional (default: None)) – If specified, the INFOG normalization will be calculated using this layer of adata.layers, which is expected to contain the UMI count matrix. Defaults to None.

  • mu (float, optional (default: 1.0)) – COSG parameter to control regularization.

  • n_gene (int, optional (default: 30)) – Number of marker genes to compute for each cluster.

  • use_highly_variable (bool, optional (default: True)) – Whether to use highly variable genes for SVD.

  • return_gene_names (bool, optional (default: False)) – Whether to return gene names instead of indices in the marker gene DataFrame.

  • max_workers (int, optional (default: 8)) – Maximum number of parallel workers to use. If None, defaults to the number of available CPU cores.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

Returns:

Combined marker gene DataFrame with batch-specific suffixes.

Return type:

DataFrame

Examples

>>> import scanpy as sc
>>> import piaso
>>> adata = sc.read_h5ad('example_data.h5ad')
>>> marker_genes = piaso.tl.runCOSGParallel(
...     adata=adata,
...     batch_key='batch',
...     groupby=None,
...     n_svd_dims=50,
...     n_highly_variable_genes=5000,
...     verbosity=1,
...     resolution=1.0,
...     layer='log1p',
...     mu=1.0,
...     n_gene=30,
...     use_highly_variable=True,
...     return_gene_names=True,
...     max_workers=4
... )
>>> print(marker_genes.head())
piaso.tools.leiden_local(adata, clustering_type: str = 'each', groupby: str = 'Leiden', groups: Sequence[str] | None = None, resolution: float = 0.25, batch_key: Sequence[str] | None = None, key_added: str = 'Leiden_local', dr_method: str = 'X_pca', gdr_resolution: float = 1.0, copy: bool = False)

Perform Leiden clustering locally, i.e., on selected group(s), on an AnnData object. This function enables flexible clustering within specified groups, supports batch effect handling, and stores results in the AnnData object.

Parameters:
  • adata (AnnData) – An AnnData object.

  • clustering_type (str, optional (default: 'each')) – Specifies the clustering approach:
    • ‘each’: Perform clustering independently within each group.

    • ‘all’: Perform clustering across all selected groups.

  • groupby (str, optional (default: 'Leiden')) – The key in adata.obs specifying the cell labels to be used for selecting groups.

  • groups (Sequence[str], optional (default: None)) – A list of specific group(s) to be clustered. If None, all groups in the groupby category will be used.

  • resolution (float, optional (default: 0.25)) – Resolution parameter for the Leiden algorithm, controlling clustering granularity. Higher values result in more clusters.

  • batch_key (Sequence[str], optional (default: None)) – Key in adata.obs specifying batch labels. If provided, it handles batch effects during clustering. If None, batch effects are ignored.

  • key_added (str, optional (default: 'Leiden_local')) – The name of the key under which the local Leiden clustering results will be stored in adata.obs.

  • dr_method (str, optional (default: 'X_pca')) – Dimensionality reduction method to be used for local clustering. Allowed values are: ‘X_pca’, ‘X_gdr’, ‘X_pca_harmony’, ‘X_svd_full’, ‘X_svd_full_harmony’.

  • gdr_resolution (float, optional (default: 1.0)) – Resolution parameter for the GDR dimensionality reduction method if ‘dr_method’ is set to ‘X_gdr’.

  • copy (bool, optional (default: False)) – If False, the operation is performed in-place. If True, a copy of the adata object is returned with the clustering results added.

Returns:

  • If copy=True: Returns a new AnnData object with clustering results added to adata.obs[key_added].

  • If copy=False: Modifies the input adata object in-place by adding clustering results to adata.obs[key_added].

Return type:

AnnData or None
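
The difference between the two clustering_type settings comes down to which cells are clustered together. The sketch below only illustrates that selection step and is not PIASO's implementation: select_cells is a hypothetical helper that returns the index sets a clustering step would receive.

```python
def select_cells(labels, groups=None, clustering_type="each"):
    """Return the lists of cell indices that would be clustered together."""
    if groups is None:
        # No groups given: use every group present in the groupby column.
        groups = sorted(set(labels))
    if clustering_type == "each":
        # One separate index set per selected group.
        return [[i for i, lab in enumerate(labels) if lab == g] for g in groups]
    if clustering_type == "all":
        # A single pooled index set covering all selected groups.
        wanted = set(groups)
        return [[i for i, lab in enumerate(labels) if lab in wanted]]
    raise ValueError("clustering_type must be 'each' or 'all'")


# Toy stand-in for adata.obs['Leiden'].
toy_labels = ["0", "1", "0", "2", "1"]
per_group = select_cells(toy_labels, groups=["0", "1"], clustering_type="each")
pooled = select_cells(toy_labels, groups=["0", "1"], clustering_type="all")
```

With ‘each’, groups '0' and '1' are handled as two separate subsets; with ‘all’, their cells are pooled into one subset. Cells in group '2' are left out in both cases.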

Example

>>> import piaso
>>> piaso.tl.leiden_local(
...     adata,
...     clustering_type='each',
...     groupby='Leiden',
...     groups=['0', '1'],
...     resolution=0.2,
...     batch_key=None,
...     key_added='Leiden_local',
...     dr_method='X_pca',
...     copy=False
... )
piaso.tools.infog(adata, copy: bool = False, inplace: bool = False, n_top_genes: int = 3000, key_added: str = 'infog', key_added_highly_variable_gene: str = 'highly_variable', trim: bool = True, verbosity: int = 1, layer: str | None = None)

Performs INFOG normalization of single-cell RNA sequencing data based on “biological information”.

This function outputs the selected highly variable genes and normalized gene expression values based on the raw UMI counts.

Parameters:
  • adata (AnnData) – An AnnData object.

  • copy (bool, optional, default=False) – If True, returns a new AnnData object with the normalized data instead of modifying adata in place.

  • inplace (bool, optional, default=False) – If True, the normalized data is stored in adata.X rather than in adata.layers[key_added].

  • n_top_genes (int, optional, default=3000) – The number of top highly variable genes to select.

  • key_added (str, optional, default='infog') – The key under which the normalized gene expression matrix is stored in adata.layers.

  • key_added_highly_variable_gene (str, optional, default='highly_variable') – The key under which the selection of highly variable genes is stored in adata.var.

  • trim (bool, optional, default=True) – If True, trim the normalized gene expression values.

  • verbosity (int, optional, default=1) – Controls the level of logging and output messages.

  • layer (str, optional, default=None) – Specifies which layer of adata to use for INFOG normalization. If None, adata.X is used. Note: raw UMI counts should be used.

Returns:

  • If copy is True, returns a modified AnnData object with the normalized expression matrix.

  • Otherwise, modifies adata in place.

  • The normalized gene expression values are saved in adata.X if inplace is True, or in adata.layers[key_added] if inplace is False.

Example

>>> import piaso
>>> piaso.tl.infog(
...     adata, n_top_genes=3000, key_added="infog",
...     trim=True, layer="raw"
... )
>>>
>>> # Access the normalized data
>>> adata.layers['infog']
>>> # Access the highly variable genes
>>> adata.var['highly_variable']
piaso.tools.score(adata, gene_list, gene_weights=None, n_nearest_neighbors: int = 30, leaf_size: int = 40, layer: str = 'infog', random_seed: int = 1927, n_ctrl_set: int = 100, key_added: str = None, verbosity: int = 0)

For a given gene set, compute gene expression enrichment scores and P values for all the cells.

Parameters:
  • adata (AnnData) – The AnnData object for the gene expression matrix.

  • gene_list (list of str) – A list of gene names for which the score will be computed.

  • gene_weights (list of floats, optional) – A list of weights corresponding to the genes in gene_list. The length of gene_weights must match the length of gene_list. If None, all genes in gene_list are weighted equally. Default is None.

  • n_nearest_neighbors (int, optional) – Number of nearest neighbors to consider for randomly selecting control gene sets based on the similarity of genes’ mean and variance among all cells. Default is 30.

  • leaf_size (int, optional) – Leaf size for the KD-tree or Ball-tree used in nearest neighbor calculations. Default is 40.

  • layer (str, optional) – The name of the layer in adata.layers to use for gene expression values. Default is ‘infog’.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

  • n_ctrl_set (int, optional) – Number of control gene sets to be used for calculating P values. Default is 100.

  • key_added (str, optional) – If provided, the computed scores will be stored in adata.obs[key_added]. The scores and P values will be stored in adata.uns[key_added] as well. Default is None, in which case 'INFOG_score' is used as the key.

  • verbosity (int, optional (default: 0)) – Level of verbosity for logging information.

Returns:

Modifies the adata object in-place, see key_added.

Return type:

None
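
How PIASO turns the n_ctrl_set control gene sets into P values is not stated here. One common choice, assumed in the sketch below, is an empirical P value: the fraction of control scores at least as large as the observed score, with a +1 correction so the result is never exactly zero. empirical_pvalue is a hypothetical helper and the numbers are toy values for a single cell.

```python
def empirical_pvalue(observed, control_scores):
    """Fraction of control scores >= observed, with a +1 correction (assumed formula)."""
    n_at_least = sum(1 for c in control_scores if c >= observed)
    return (n_at_least + 1) / (len(control_scores) + 1)


# Observed score of the gene set in one cell, and the scores of four control gene sets.
observed_score = 2.0
control_scores = [0.1, 0.5, 2.5, 1.0]
p_value = empirical_pvalue(observed_score, control_scores)
```

With such a formula the smallest attainable P value is 1 / (n_ctrl_set + 1), so the number of control sets bounds the resolution of the P values.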

piaso.tools.predictCellTypeByGDR(adata, adata_ref, layer: str = 'log1p', layer_reference: str = 'log1p', reference_groupby: str = 'CellTypes', query_groupby: str = 'Leiden', mu: float = 10.0, n_genes: int = 15, return_integration: bool = False, use_highly_variable: bool = True, n_highly_variable_genes: int = 5000, n_svd_dims: int = 50, resolution: float = 1.0, scoring_method: str = None, key_added: str = None, verbosity: int = 0)

Predicts cell types in a query dataset (adata) using the GDR dimensionality reduction method based on a reference dataset (adata_ref). To use GDR for dimensionality reduction, please refer to piaso.tl.runGDR or piaso.tl.runGDRParallel.

Parameters:
  • adata (AnnData) – The query single-cell AnnData object for which cell types are to be predicted.

  • adata_ref (AnnData) – The reference single-cell AnnData object with known cell type annotations.

  • layer (str, optional (default: 'log1p')) – The layer in adata to use for gene expression data. If None, uses the .X matrix.

  • layer_reference (str, optional (default: 'log1p')) – The layer in adata_ref to use for reference gene expression data. If None, uses the .X matrix.

  • reference_groupby (str, optional (default: 'CellTypes')) – The column in adata_ref.obs used to define reference cell type groupings.

  • query_groupby (str, optional (default: 'Leiden')) – The column in adata.obs used for GDR dimensionality reduction, such as clusters identified using the Leiden or Louvain algorithms.

  • mu (float, optional (default: 10.0)) – A regularization parameter for controlling the gene expression specificity, used in COSG (marker gene identification) and GDR.

  • n_genes (int, optional (default: 15)) – The number of top specific genes per group, used in COSG and GDR.

  • return_integration (bool, optional (default: False)) – If True, the function will return the integrated low-dimensional cell embeddings of the query dataset and reference dataset.

  • use_highly_variable (bool, optional (default: True)) – Whether to use highly variable genes, used in GDR.

  • n_highly_variable_genes (int, optional (default: 5000)) – The number of highly variable genes to select, if use_highly_variable is True, used in GDR.

  • n_svd_dims (int, optional (default: 50)) – The number of dimensions to retain during SVD, used in GDR.

  • resolution (float, optional (default: 1.0)) – Resolution parameter for clustering, used in GDR.

  • scoring_method (str, optional (default: None)) – The method used for gene set scoring, used in GDR.

  • key_added (str, optional (default: None)) – A key to add the predicted cell types or integration results to adata.obs. If None, CellTypes_gdr will be used.

  • verbosity (int, optional (default: 0)) – The level of logging output. Higher values produce more detailed logs for debugging and monitoring progress.

Returns:

If return_integration is True, returns an AnnData object of merged reference and query datasets with integrated cell embeddings and predicted cell types. Otherwise, updates adata in place with the predicted cell types.

Return type:

None or AnnData

Example

>>> import scanpy as sc
>>> import piaso
>>> # Load query dataset
>>> adata = sc.read_h5ad("query_data.h5ad")
>>>
>>> # Load reference dataset with known cell type annotations
>>> adata_ref = sc.read_h5ad("reference_data.h5ad")
>>>
>>> # Predict cell types for the query dataset
>>> piaso.tl.predictCellTypeByGDR(
...     adata=adata,
...     adata_ref=adata_ref,
...     layer='log1p',
...     layer_reference='log1p',
...     reference_groupby='CellTypes',
...     query_groupby='Leiden',
...     mu=10.0,
...     n_genes=20,
...     return_integration=False,
...     use_highly_variable=True,
...     n_highly_variable_genes=3000,
...     n_svd_dims=50,
...     resolution=0.8,
...     key_added='CellTypes_gdr',
...     verbosity=0
... )
>>>
>>> # Access the predicted cell types in the query dataset
>>> print(adata.obs['CellTypes_gdr'])
piaso.tools.smoothCellTypePrediction(adata, groupby: str, use_rep: str = 'X_pca', k_nearest_neighbors: int = 5, return_confidence: bool = False, inplace: bool = True, use_existing_adjacency_graph: bool = True, use_faiss: bool = False, key_added: str = None, verbosity: int = 1, n_jobs: int = -1)

Smooth cell type predictions using k-nearest neighbors in a low-dimensional embedding.

Parameters:
  • adata (AnnData) – AnnData object containing single-cell data

  • groupby (str) – Key in adata.obs containing the cell type predictions to smooth

  • use_rep (str, default='X_pca') – Key in adata.obsm containing the low-dimensional embedding to use for finding neighbors

  • k_nearest_neighbors (int, default=5) – Number of neighbors to consider (including the cell itself)

  • return_confidence (bool, default=False) – Whether to return confidence scores (proportion of neighbors with the majority label)

  • inplace (bool, default=True) – Whether to modify adata inplace or return a copy

  • use_existing_adjacency_graph (bool, default=True) – Whether to use existing neighborhood graph (adata.obsp[‘connectivities’]) if available

  • use_faiss (bool, default=False) – Whether to use FAISS for faster neighbor search (requires faiss package)

  • key_added (str or None, default=None) – If provided, use this key as the output key in adata.obs instead of ‘{groupby}_smoothed’

  • verbosity (int, default=1) – Level of verbosity (0=no output, 1=basic info, 2=detailed info)

  • n_jobs (int, default=-1) – Number of jobs for parallel processing. -1 means using all processors.

Returns:

  • If inplace=True – None. Adds ‘{groupby}_smoothed’ (or key_added) to adata.obs. If return_confidence=True, also adds ‘{groupby}_confidence’ (or ‘{key_added}_confidence’) to adata.obs.

  • If inplace=False – Copy of adata with added columns
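
The smoothing idea, a majority vote among each cell's nearest neighbors with the confidence defined as the proportion of neighbors carrying the winning label, can be sketched in a few lines. This is a brute-force illustration with Euclidean distances in plain Python, not PIASO's implementation, which may instead use an existing neighborhood graph or FAISS. smooth_labels is a hypothetical helper.

```python
import math
from collections import Counter


def smooth_labels(embedding, labels, k=5):
    """Majority-vote smoothing over the k nearest cells (each cell counts as its own neighbor)."""
    smoothed, confidence = [], []
    for point in embedding:
        # Indices of the k closest cells; the cell itself is at distance 0.
        nearest = sorted(
            range(len(embedding)), key=lambda j: math.dist(point, embedding[j])
        )[:k]
        votes = Counter(labels[j] for j in nearest)
        label, count = votes.most_common(1)[0]
        smoothed.append(label)
        # Confidence: proportion of neighbors that carry the majority label.
        confidence.append(count / len(nearest))
    return smoothed, confidence


# Two well-separated groups of cells; the third cell carries a stray 'B' label.
toy_embedding = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels = ["T", "T", "B", "B", "B", "B"]
smoothed, confidence = smooth_labels(toy_embedding, toy_labels, k=3)
```

In this toy case the stray label in the first group is outvoted by its two 'T' neighbors, and its confidence is 2/3 rather than 1.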

Examples

>>> import scanpy as sc
>>> import piaso
>>>
>>> # Basic usage
>>> piaso.tl.smoothCellTypePrediction(
...     adata,
...     groupby='CellTypes_pred',
...     use_rep='X_pca',
...     key_added='CellTypes_pred_smoothed'
... )
>>>
>>> # With confidence scores
>>> piaso.tl.smoothCellTypePrediction(
...     adata,
...     groupby='CellTypes_pred',
...     k_nearest_neighbors=15,
...     return_confidence=True,
...     key_added='CellTypes_pred_smoothed'
... )
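
The smoothing idea described above (a majority vote over each cell's nearest neighbors, with confidence defined as the proportion of neighbors carrying the majority label) can be sketched in plain Python. This is an illustrative sketch only, not piaso's implementation: the function name `smooth_labels` and the brute-force neighbor search are assumptions made here for clarity.

```python
import math
from collections import Counter


def smooth_labels(embedding, labels, k_nearest_neighbors=5):
    """Hypothetical helper: majority-vote smoothing over k nearest neighbors."""
    smoothed, confidence = [], []
    for point in embedding:
        # Brute-force neighbor search. The cell itself has distance 0,
        # so it is counted among its own neighbors (exact duplicates aside).
        order = sorted(
            range(len(embedding)),
            key=lambda j: math.dist(point, embedding[j]),
        )
        neighbors = order[:k_nearest_neighbors]
        votes = Counter(labels[j] for j in neighbors)
        label, count = votes.most_common(1)[0]
        smoothed.append(label)
        # Confidence: share of neighbors that carry the majority label.
        confidence.append(count / len(neighbors))
    return smoothed, confidence


# Two well-separated groups; the fourth cell carries an outlier label.
embedding = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
             (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1)]
labels = ["A", "A", "A", "B", "B", "B", "B", "B"]
smoothed, confidence = smooth_labels(embedding, labels, k_nearest_neighbors=3)
```

In this toy input the outlier label is outvoted by its two nearest neighbors, and its confidence drops below 1 accordingly.
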
piaso.tools.predictCellTypeByMarker(adata, marker_gene_set: List | Dict | DataFrame, score_method: Literal['scanpy', 'piaso'] = 'piaso', score_layer: str | None = 'infog', use_score: bool = True, max_workers: int | None = None, smooth_prediction: bool = True, use_rep: str = 'X_gdr', k_nearest_neighbors: int = 7, return_confidence: bool = True, use_existing_adjacency_graph: bool = False, use_faiss: bool = False, key_added: str = 'CellTypes_predicted', extract_cell_type: bool = False, delimiter_cell_type: str = '-', inplace: bool = True, random_seed: int = 1927, verbosity: int = 1, n_jobs: int = -1)

Predict cell types using marker genes and optionally smooth predictions.

This function performs cell type prediction using marker genes in two steps:

  1. Calculate gene set scores for the marker genes.

  2. Optionally smooth the predictions using k-nearest neighbors.

Parameters:
  • adata (AnnData) – AnnData object containing single-cell data

  • marker_gene_set (list, dict, or pandas.DataFrame) – Collection of marker genes for different cell types (pre-filtered/prepared)

  • score_method ({'scanpy', 'piaso'}, default='piaso') – Method to use for scoring marker genes

  • score_layer (str or None, default='infog') – Layer of the AnnData object to use for scoring

  • use_score (bool, default=True) – Whether to use scores (True) or p-values (False) for cell type prediction

  • max_workers (int or None, default=None) – Number of parallel workers for score calculation

  • smooth_prediction (bool, default=True) – Whether to smooth predictions using k-nearest neighbors

  • use_rep (str, default='X_gdr') – Key in adata.obsm containing the low-dimensional embedding to use for neighbor search

  • k_nearest_neighbors (int, default=7) – Number of neighbors to consider for smoothing

  • return_confidence (bool, default=True) – Whether to return confidence scores for smoothed predictions

  • use_existing_adjacency_graph (bool, default=False) – Whether to use existing neighborhood graph if available

  • use_faiss (bool, default=False) – Whether to use FAISS for faster neighbor search

  • key_added (str, default='CellTypes_predicted') – Key to use for storing cell type predictions in adata.obs

  • extract_cell_type (bool, default=False) – Whether to extract cell type name by removing suffix after delimiter

  • delimiter_cell_type (str, default='-') – Delimiter to use when extracting cell type names (only used if extract_cell_type=True)

  • inplace (bool, default=True) – Whether to modify adata in place or return a copy

  • random_seed (int, default=1927) – Random seed for reproducibility

  • verbosity (int, default=1) – Level of verbosity (0=quiet, 1=basic info, 2=detailed)

  • n_jobs (int, default=-1) – Number of jobs for parallel processing during smoothing

Returns:

  • If inplace=False – AnnData: Copy of adata with cell type predictions added

  • If inplace=True – None, but adata is modified in place

Examples

>>> import scanpy as sc
>>> import piaso
>>>
>>> # Basic usage
>>> piaso.tl.predictCellTypeByMarker(
...     adata,
...     marker_gene_set=cosgMarkerDB,
...     score_method='piaso',
...     use_score=False,
...     smooth_prediction=True,
...     inplace=True
... )
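
The first step (score each marker gene set per cell, then assign the best-scoring cell type) can be illustrated with a deliberately simplified sketch. Everything here is an assumption for illustration: the mean-expression score is a stand-in and does not reproduce the 'piaso' or 'scanpy' scoring methods, the helper name `predict_by_marker` is hypothetical, and stripping the text after the last delimiter is only one possible reading of extract_cell_type.

```python
def predict_by_marker(expression, marker_gene_set,
                      extract_cell_type=False, delimiter_cell_type="-"):
    """Hypothetical helper: pick the marker set with the highest mean expression.

    expression: dict mapping cell id -> {gene: expression value}
    marker_gene_set: dict mapping cell type name -> list of marker genes
    """
    predictions = {}
    for cell, profile in expression.items():
        # Stand-in score: mean expression of the markers (missing genes count as 0).
        scores = {
            name: sum(profile.get(gene, 0.0) for gene in genes) / len(genes)
            for name, genes in marker_gene_set.items()
        }
        best = max(scores, key=scores.get)
        if extract_cell_type:
            # Assumption: drop everything after the last delimiter.
            best = best.rsplit(delimiter_cell_type, 1)[0]
        predictions[cell] = best
    return predictions


expression = {
    "cell1": {"CD3E": 5.0, "CD19": 0.0},
    "cell2": {"CD3E": 0.1, "CD19": 4.0, "MS4A1": 3.0},
}
markers = {"T-1": ["CD3E"], "B-1": ["CD19", "MS4A1"]}
predicted = predict_by_marker(expression, markers, extract_cell_type=True)
```
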
piaso.tools.stitchSpace(adata: AnnData, batch_key: str, use_rep: str = 'X_pca', key_added: str = 'X_stitch', filter_cluster_key_added: str | None = None, filter_pruned_graph_key: str | None = None, filter_use_global_markers: bool = False, filter_leiden_resolution: float = 0.5, filter_leiden_n_neighbors: int = 15, filter_n_markers: int = 50, filter_marker_overlap_threshold: float = 0.1, filter_cosg_layer: str | None = None, filter_cosg_mu: float = 100.0, filter_cosg_expressed_pct: float = 0.1, filter_cosg_remove_lowly_expressed: bool = True, filter_bbknn_neighbors_within_batch: int = 3, filter_bbknn_trim: int | None = None, random_state: int | None = 1927, correction_smooth_within_batch: bool = True, correction_use_mutual_sqrt_weights: bool = False, copy: bool = False, verbosity: int = 0) → AnnData | None

Performs batch correction using a BBKNN graph that has been pruned based on marker gene overlap between batch-specific clusters. The overlap check uses local markers and, optionally, global markers (controlled by filter_use_global_markers).

Clusters are identified internally using Leiden and stored in adata.obs[filter_cluster_key_added]. Markers are identified by COSG. Intermediate results such as the compatibility ‘hypergraph’, the marker gene lists, and the pruned graph structure are stored in adata.uns and adata.obsp.

The correction moves each cell towards the average position of its neighbors in the pruned graph.
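
As a rough illustration of this correction step, the sketch below shifts each point toward the weighted mean of its graph neighbors. It is a minimal sketch under stated assumptions, not the library's code: the helper name, the dictionary-based graph, and the `step` parameter are hypothetical, and the within-batch smoothing and sqrt weighting options are not modeled.

```python
def move_toward_neighbor_mean(embedding, neighbors, step=1.0):
    """Hypothetical helper: shift each cell toward its neighbors' weighted mean.

    embedding: list of coordinate lists, one per cell
    neighbors: dict mapping cell index -> {neighbor index: edge weight}
    step: fraction of the distance to the neighbor mean to move (assumed knob)
    """
    corrected = []
    for i, point in enumerate(embedding):
        weights = neighbors.get(i, {})
        total = sum(weights.values())
        if total == 0:
            # Cells without neighbors in the pruned graph stay where they are.
            corrected.append(list(point))
            continue
        mean = [
            sum(w * embedding[j][d] for j, w in weights.items()) / total
            for d in range(len(point))
        ]
        corrected.append([x + step * (m - x) for x, m in zip(point, mean)])
    return corrected


embedding = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
pruned_graph = {0: {1: 1.0}, 1: {0: 1.0}}  # cell 2 has no neighbors
corrected = move_toward_neighbor_mean(embedding, pruned_graph, step=0.5)
```

With `step=0.5`, the two connected cells meet halfway, while the isolated cell is left unchanged.
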

Parameters:
  • adata – Annotated data matrix. Needs expression data for COSG (in .X or specified layer).

  • batch_key – Key in adata.obs for batch information.

  • use_rep – Representation in adata.obsm for BBKNN, clustering, and correction (e.g., 'X_pca').

  • key_added – Base key for storing results. Corrected embedding will be in adata.obsm[key_added]. Intermediate results stored in adata.uns.

  • filter_cluster_key_added – Key in adata.obs where generated batch-cluster labels will be stored. If None, a default key is generated (e.g., f"{batch_key}@leiden@res{res}").

  • filter_pruned_graph_key – Base key for storing the pruned graph structure in adata.obsp and adata.uns. If None, defaults to "pruned_markers". Connectivities/distances will be stored as {filter_pruned_graph_key}_connectivities/_distances.

  • filter_use_global_markers – If True, run global COSG and require BOTH local AND global marker overlap for inter-batch cluster compatibility. If False (default), only local overlap is used.

  • filter_leiden_resolution – Resolution parameter for internal within-batch Leiden clustering.

  • filter_leiden_n_neighbors – KNN parameter for internal within-batch Leiden clustering’s graph.

  • filter_n_markers – Number of top COSG markers to compare between clusters.

  • filter_marker_overlap_threshold – Minimum Jaccard index for marker overlap to consider clusters compatible.

  • filter_cosg_layer – Layer in adata.layers to use for COSG marker identification. If None (default), uses adata.X.

  • filter_cosg_mu – mu parameter for COSG (default: 100.0). Higher values increase sparsity.

  • filter_cosg_expressed_pct – expressed_pct parameter for COSG (default: 0.1). Minimum expression pct for a gene.

  • filter_cosg_remove_lowly_expressed – remove_lowly_expressed parameter for COSG (default: True). Filter lowly expressed genes.

  • filter_bbknn_neighbors_within_batch – neighbors_within_batch parameter for the initial bbknn.bbknn call.

  • filter_bbknn_trim – Optional trim parameter passed to the initial bbknn.bbknn call.

  • random_state – Seed for the random number generator used in Leiden clustering for reproducibility. Default: 1927.

  • correction_smooth_within_batch – If True, smooth the correction vector within batches using the pruned graph structure.

  • correction_use_mutual_sqrt_weights – If True, applies symmetrization and sqrt weighting to the pruned graph before the correction step.

  • copy – If True, return a modified copy of adata. Otherwise, modify adata inplace.

  • verbosity – Level of detail to print: 0 (minimal), 1 or higher (more progress messages and intermediate storage locations). Default: 0. Controls BBKNN logging level.

Returns:

If copy=True, returns the modified AnnData object. Otherwise, modifies the input adata object in place and returns None. Adds/updates:

  • adata.obsm[key_added]: The corrected embedding.

  • adata.obs[filter_cluster_key_added]: Generated batch-cluster labels (using '@' delimiter).

  • adata.uns[f'{key_added}_hypergraph_compatibility']: Compatibility dict.

  • adata.uns[f'{key_added}_local_markers']: Local marker dict.

  • adata.uns[f'{key_added}_global_markers']: Global marker dict.

  • adata.obsp[f'{filter_pruned_graph_key}_connectivities']: Pruned graph connectivities.

  • adata.obsp[f'{filter_pruned_graph_key}_distances']: Pruned graph dummy distances.

  • adata.uns[filter_pruned_graph_key]: Neighbors dictionary for pruned graph.

  • adata.uns[f'{key_added}_params']: Dictionary of parameters used.

Return type:

AnnData or None

Example

>>> import scanpy as sc
>>> import piaso
>>> adata = sc.datasets.pbmc68k_reduced()
>>> # Simulate batches (replace with actual batch info)
>>> adata.obs['batch'] = ['A' if i % 2 == 0 else 'B' for i in range(adata.n_obs)]
>>> # Assume normalized data is in adata.layers['log1p']
>>> adata.layers['log1p'] = adata.X.copy()
>>> # Precompute PCA if not done
>>> sc.tl.pca(adata)
>>> # Run correction using log1p layer for COSG, increased verbosity
>>> piaso.tl.stitchSpace(
...     adata,
...     batch_key='batch',
...     use_rep='X_pca',
...     key_added='X_stitch_corrected',
...     filter_cluster_key_added='batch@cluster_stitch', # Optional: specify key
...     filter_cosg_layer='log1p', # Specify layer for COSG
...     random_state=1927,
...     verbosity=1
... )
>>> # Visualize results
>>> sc.pp.neighbors(adata, use_rep='X_stitch_corrected') # Compute neighbors on corrected embedding
>>> sc.tl.umap(adata)
>>> sc.pl.umap(adata, color=['batch', 'batch@cluster_stitch'])
>>> # Visualize UMAP based on the pruned graph itself
>>> sc.tl.umap(adata, neighbors_key='pruned_markers') # Use default key or specify if changed
>>> sc.pl.umap(adata, color=['batch', 'batch@cluster_stitch'], title="UMAP on Pruned Graph")
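
The marker-overlap criterion behind the pruning (a Jaccard index compared against filter_marker_overlap_threshold) can be sketched as follows. This is an illustrative sketch, not the library's code: the helper names are hypothetical, the marker lists are toy data, cluster labels are assumed to follow the '@' convention with the batch name first, and treating a value equal to the threshold as compatible is an assumption.

```python
def jaccard_index(markers_a, markers_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(markers_a), set(markers_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compatible_cluster_pairs(markers_by_cluster, threshold=0.1):
    """Hypothetical helper: list cross-batch cluster pairs with enough overlap.

    markers_by_cluster: dict mapping 'batch@cluster' label -> marker gene list
    """
    clusters = sorted(markers_by_cluster)
    pairs = []
    for i, first in enumerate(clusters):
        for second in clusters[i + 1:]:
            # Only clusters from different batches are compared.
            if first.split("@")[0] == second.split("@")[0]:
                continue
            overlap = jaccard_index(markers_by_cluster[first],
                                    markers_by_cluster[second])
            if overlap >= threshold:  # assumption: inclusive threshold
                pairs.append((first, second))
    return pairs


markers = {
    "A@0": ["CD3E", "CD3D", "IL7R"],
    "A@1": ["MS4A1", "CD79A", "CD19"],
    "B@0": ["CD3E", "IL7R", "CCR7"],
    "B@1": ["LYZ", "CD14", "S100A8"],
}
pairs = compatible_cluster_pairs(markers, threshold=0.1)
```
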
piaso.tools.runSCALAR(adata: AnnData, specificity_matrix: DataFrame, lr_pairs: DataFrame, ligand_col: str = 'ligand', receptor_col: str = 'receptor', annotation_col: str | None = None, sender_cell_types: List[str] | None = None, receiver_cell_types: List[str] | None = None, n_permutations: int = 1000, n_nearest_neighbors: int = 30, layer: str = None, random_seed: int = 42, rank_by_score: bool = True, chunk_size: int = 50000, prefilter_fdr: bool = True, prefilter_threshold: float = 0.0) → DataFrame

Calculates ligand-receptor interaction scores, computes permutation-based p-values using a vectorized approach, and corrects for multiple testing using FDR for each cell type-cell type pair independently.

Parameters:
  • adata – AnnData object with gene expression data.

  • specificity_matrix – DataFrame with genes as rows, cell types as columns, and specificity scores as values.

  • lr_pairs – DataFrame listing interacting gene pairs.

  • ligand_col – Column name for ligands in lr_pairs.

  • receptor_col – Column name for receptors in lr_pairs.

  • annotation_col – Optional column in lr_pairs to carry over.

  • sender_cell_types – List of cell types to use as senders. If None, all are used.

  • receiver_cell_types – List of cell types to use as receivers. If None, all are used.

  • n_permutations – Number of permutations for the null distribution.

  • n_nearest_neighbors – Number of control genes to sample from.

  • layer – Layer in adata to use for expression.

  • random_seed – Seed for reproducibility.

  • rank_by_score – If True, sorts the final output by interaction_score.

  • chunk_size – The number of interactions to process in each vectorized chunk to manage memory usage.

  • prefilter_fdr – If True, interactions with scores <= prefilter_threshold are excluded from FDR calculation within each group and assigned an FDR of 1.0.

  • prefilter_threshold – The score threshold used for pre-filtering before FDR calculation.

Returns:

A pandas DataFrame with interaction scores, p-values, and FDR-corrected p-values.
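
The statistical steps named above (a permutation-based p-value, followed by FDR correction with optional pre-filtering inside one sender-receiver group) can be sketched in plain Python. This is a sketch under stated assumptions rather than the function's actual code: the helper names are hypothetical, the add-one p-value convention and the use of the Benjamini-Hochberg procedure are assumptions, and the null scores would in practice come from permuted control genes.

```python
def permutation_p_value(observed, null_scores):
    """Empirical p-value with the add-one convention (an assumption here)."""
    exceed = sum(1 for score in null_scores if score >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [1.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def fdr_with_prefilter(scores, p_values, prefilter_threshold=0.0):
    """FDR within one group; scores <= threshold are skipped and get FDR 1.0."""
    keep = [i for i, score in enumerate(scores) if score > prefilter_threshold]
    fdr = [1.0] * len(scores)
    kept_fdr = benjamini_hochberg([p_values[i] for i in keep])
    for i, q in zip(keep, kept_fdr):
        fdr[i] = q
    return fdr


# One hypothetical sender-receiver group with four interactions.
scores = [2.0, 0.0, 1.5, 3.0]
p_values = [0.01, 0.001, 0.03, 0.5]
fdr = fdr_with_prefilter(scores, p_values, prefilter_threshold=0.0)
```

Excluding low-scoring interactions before the correction reduces the number of tests in the group, so the remaining interactions are penalized less.
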