piaso.tools package

piaso.tools.leiden_local(adata, clustering_type: str = 'each', groupby: str = 'Leiden', groups: Sequence[str] | None = None, resolution: float = 0.25, batch_key: Sequence[str] | None = None, key_added: str = 'Leiden_local', dr_method: str = 'X_pca', gdr_resolution: float = 1.0, copy: bool = False)

Perform Leiden clustering locally, i.e., on selected group(s), on an AnnData object. This function enables flexible clustering within specified groups, supports batch effect handling, and stores results in the AnnData object.

Parameters:
  • adata (AnnData) – An AnnData object.

  • clustering_type (str, optional (default: 'each')) – Specifies the clustering approach:
    • ‘each’: Perform clustering independently within each group.

    • ‘all’: Perform clustering across all selected groups.

  • groupby (str, optional (default: 'Leiden')) – The key in adata.obs specifying the cell labels to be used for selecting groups.

  • groups (Sequence[str], optional (default: None)) – A list of specific group(s) to be clustered. If None, all groups in the groupby category will be used.

  • resolution (float, optional (default: 0.25)) – Resolution parameter for the Leiden algorithm, controlling clustering granularity. Higher values result in more clusters.

  • batch_key (Sequence[str], optional (default: None)) – Key in adata.obs specifying batch labels. If provided, batch effects are accounted for during clustering. If None, batch effects are ignored.

  • key_added (str, optional (default: 'Leiden_local')) – The name of the key under which the local Leiden clustering results will be stored in adata.obs.

  • dr_method (str, optional (default: 'X_pca')) – Dimensionality reduction method to be used for local clustering. Allowed values are: ‘X_pca’, ‘X_gdr’, ‘X_pca_harmony’, ‘X_svd_full’, ‘X_svd_full_harmony’.

  • gdr_resolution (float, optional (default: 1.0)) – Resolution parameter for the GDR dimensionality reduction method if ‘dr_method’ is set to ‘X_gdr’.

  • copy (bool, optional (default: False)) – If False, the operation is performed in-place. If True, a copy of the adata object is returned with the clustering results added.

Returns:

  • If copy=True: Returns a new AnnData object with clustering results added to adata.obs[key_added].

  • If copy=False: Modifies the input adata object in-place by adding clustering results to adata.obs[key_added].

Return type:

AnnData or None
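
The difference between the two clustering_type modes can be illustrated with a small, library-free sketch. It is not PIASO's implementation: select_cell_subsets is a hypothetical helper that only shows which cells would be clustered together under ‘each’ versus ‘all’, using plain lists in place of adata.obs[groupby].

```python
from typing import List, Optional, Sequence


def select_cell_subsets(
    labels: Sequence[str],
    groups: Optional[Sequence[str]] = None,
    clustering_type: str = "each",
) -> List[List[int]]:
    """Return the lists of cell indices that would each be clustered separately.

    Hypothetical helper for illustration only; not part of piaso.
    """
    if clustering_type not in ("each", "all"):
        raise ValueError("clustering_type must be 'each' or 'all'")
    # groups=None means every group present in the labels is used.
    if groups is None:
        groups = sorted(set(labels))
    if clustering_type == "each":
        # One subset per group: every group is clustered on its own.
        return [[i for i, lab in enumerate(labels) if lab == g] for g in groups]
    # 'all': pool the selected groups and cluster them together.
    selected = set(groups)
    return [[i for i, lab in enumerate(labels) if lab in selected]]


labels = ["0", "1", "0", "2", "1"]  # stand-in for adata.obs[groupby]
subsets_each = select_cell_subsets(labels, groups=["0", "1"], clustering_type="each")
subsets_all = select_cell_subsets(labels, groups=["0", "1"], clustering_type="all")
```

With ‘each’, groups ‘0’ and ‘1’ yield two separate index lists; with ‘all’, they are pooled into one list, and cells of the unselected group ‘2’ are left out in both cases.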

Example

>>> import piaso
>>> # Example usage
>>> piaso.tl.leiden_local(
...     adata,
...     clustering_type='each',
...     groupby='Leiden',
...     groups=['0', '1'],
...     resolution=0.2,
...     batch_key=None,
...     key_added='Leiden_local',
...     dr_method='X_pca',
...     copy=False
... )

piaso.tools.infog(adata, copy: bool = False, inplace: bool = False, n_top_genes: int = 3000, key_added: str = 'infog', key_added_highly_variable_gene: str = 'highly_variable', trim: bool = True, verbosity: int = 1, layer: str | None = None)

Performs INFOG normalization of single-cell RNA sequencing data based on “biological information”.

This function outputs the selected highly variable genes and normalized gene expression values based on the raw UMI counts.

Parameters:
  • adata (AnnData) – An AnnData object.

  • copy (bool, optional, default=False) – If True, returns a new AnnData object with the normalized data instead of modifying adata in place.

  • inplace (bool, optional, default=False) – If True, the normalized data is stored in adata.X rather than in adata.layers[key_added].

  • n_top_genes (int, optional, default=3000) – The number of top highly variable genes to select.

  • key_added (str, optional, default='infog') – The key under which the normalized gene expression matrix is stored in adata.layers.

  • key_added_highly_variable_gene (str, optional, default='highly_variable') – The key under which the selection of highly variable genes is stored in adata.var.

  • trim (bool, optional, default=True) – If True, trim the normalized gene expression values.

  • verbosity (int, optional, default=1) – Controls the level of logging and output messages.

  • layer (str, optional, default=None) – Specifies which layer of adata to use for INFOG normalization. If None, adata.X is used. Note: the raw UMI counts should be used.

Returns:

  • If copy is True, returns a modified AnnData object with the normalized expression matrix.

  • Otherwise, modifies adata in place.

  • The normalized gene expression values are saved in adata.X if inplace is True, or in adata.layers[key_added] if inplace is False (the default).

Example

>>> import piaso
>>> piaso.tl.infog(
...     adata, n_top_genes=3000, key_added="infog",
...     trim=True, layer="raw"
... )
>>>
>>> # Access the normalized data
>>> adata.layers['infog']
>>> # Access the highly variable genes
>>> adata.var['highly_variable']

piaso.tools.score(adata, gene_list, gene_weights=None, n_nearest_neighbors: int = 30, leaf_size: int = 40, layer: str = 'infog', random_seed: int = 1927, n_ctrl_set: int = 100, key_added: str = None, verbosity: int = 0)

For a given gene set, compute gene expression enrichment scores and P values for all the cells.

Parameters:
  • adata (AnnData) – The AnnData object for the gene expression matrix.

  • gene_list (list of str) – A list of gene names for which the score will be computed.

  • gene_weights (list of floats, optional) – A list of weights corresponding to the genes in gene_list. The length of gene_weights must match the length of gene_list. If None, all genes in gene_list are weighted equally. Default is None.

  • n_nearest_neighbors (int, optional) – Number of nearest neighbors to consider for randomly selecting control gene sets based on the similarity of genes’ mean and variance among all cells. Default is 30.

  • leaf_size (int, optional) – Leaf size for the KD-tree or Ball-tree used in nearest neighbor calculations. Default is 40.

  • layer (str, optional) – The name of the layer in adata.layers to use for gene expression values. Default is ‘infog’.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

  • n_ctrl_set (int, optional) – Number of control gene sets to be used for calculating P values. Default is 100.

  • key_added (str, optional) – Key under which the computed scores are stored in adata.obs[key_added]. The scores and P values are also stored in adata.uns[key_added]. If None (default), ‘INFOG_score’ is used as the key.

  • verbosity (int, optional (default: 0)) – Level of verbosity for logging information.

Returns:

Modifies the adata object in-place, see key_added.

Return type:

None
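
As a rough illustration of how control gene sets can yield a P value, the sketch below computes an empirical P value by comparing an observed score with scores from control sets. This is one common estimator, shown under the assumption that larger scores mean stronger enrichment. It is not necessarily the computation PIASO performs; empirical_p_value and the numbers are made up for the example.

```python
import math
from typing import Sequence


def empirical_p_value(observed: float, control_scores: Sequence[float]) -> float:
    """Fraction of control scores at least as large as the observed score.

    The +1 terms keep the estimate from being exactly zero.
    Hypothetical helper for illustration only; not part of piaso.
    """
    n_at_least = sum(1 for c in control_scores if c >= observed)
    return (n_at_least + 1) / (len(control_scores) + 1)


# Made-up scores of one cell for nine control gene sets.
control_scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.0, 0.6, 0.7]
p = empirical_p_value(0.8, control_scores)  # only 0.9 is >= 0.8
neg_log10_p = -math.log10(p)
```

Under this estimator, the smallest attainable P value is 1 / (n_ctrl_set + 1), so the number of control sets bounds the resolution of the P values.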

piaso.tools.calculateScoreParallel(adata, gene_set, score_method: Literal['scanpy', 'piaso'], random_seed: int = 1927, score_layer=None, max_workers=None, return_pvals: bool = False, verbosity: int = 0)

Compute gene set scores in parallel using shared memory for efficiency.

Parameters:
  • adata (AnnData) – The input AnnData object.

  • gene_set (dict, list of lists, or pandas.DataFrame) –

    A collection of gene sets, where each gene set is either:
    • A dictionary: Keys are gene set names, values are lists of genes.

    • A list of lists: Each sublist contains genes in a gene set.

    • A pandas.DataFrame: Each column represents a gene set, and column names are gene set names.

  • score_method ({'scanpy', 'piaso'}, optional) – The method used for gene set scoring. Must be either ‘scanpy’ (default) or ‘piaso’.
    • ‘scanpy’: Uses Scanpy’s built-in gene set scoring method.

    • ‘piaso’: Uses PIASO’s gene set scoring method, which is more robust to sequencing depth variations.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

  • score_layer (str or None, optional) – Layer of the AnnData object to use. If None, adata.X is used.

  • max_workers (int or None, optional) – Number of parallel workers to use. Defaults to the number of CPUs.

  • return_pvals (bool, optional) – Whether to return -log10 p-values when using ‘piaso’ method. Default is False.

  • verbosity (int, optional) – Level of verbosity. Default is 0.

Returns:

If score_method is ‘scanpy’:
  • np.ndarray: A 2D array where each column contains the scores for a gene set.

  • list: The names of the gene sets.

If score_method is ‘piaso’ and return_pvals is True:
  • np.ndarray: A 2D array where each column contains the scores for a gene set.

  • list: The names of the gene sets.

  • np.ndarray: A 2D array where each column contains the -log10(p-values) for a gene set.

If score_method is ‘piaso’ and return_pvals is False:
  • np.ndarray: A 2D array where each column contains the scores for a gene set.

  • list: The names of the gene sets.

Return type:

tuple
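
The accepted gene_set formats all reduce to a list of names plus a list of gene lists. The sketch below shows that reduction for the dictionary and list-of-lists cases, using only the standard library. normalize_gene_sets is a hypothetical helper, and the placeholder names generated for unnamed sets (gene_set_0, gene_set_1, ...) are an assumption; PIASO may name them differently.

```python
from typing import Dict, List, Sequence, Tuple, Union

GeneSets = Union[Dict[str, Sequence[str]], Sequence[Sequence[str]]]


def normalize_gene_sets(gene_set: GeneSets) -> Tuple[List[str], List[List[str]]]:
    """Convert a dict or a list of lists into (names, gene lists).

    Hypothetical helper for illustration only; not part of piaso.
    """
    if isinstance(gene_set, dict):
        # Keys are gene set names, values are the genes.
        names = list(gene_set.keys())
        sets = [list(genes) for genes in gene_set.values()]
    else:
        # A list of lists carries no names, so placeholders are generated
        # (the naming scheme here is an assumption).
        sets = [list(genes) for genes in gene_set]
        names = [f"gene_set_{i}" for i in range(len(sets))]
    return names, sets


names_from_dict, sets_from_dict = normalize_gene_sets(
    {"T_cell": ["CD3D", "CD3E"], "B_cell": ["CD79A", "MS4A1"]}
)
names_from_list, sets_from_list = normalize_gene_sets(
    [["CD3D", "CD3E"], ["CD79A", "MS4A1"]]
)
```

A pandas.DataFrame input could be brought into the same shape with a dictionary comprehension over its columns, for example {col: df[col].dropna().tolist() for col in df.columns}.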

piaso.tools.calculateScoreParallel_multiBatch(adata: AnnData, batch_key: str, marker_gene: DataFrame, marker_gene_n_groups_indices: list, score_method: Literal['scanpy', 'piaso'], score_layer: str = None, max_workers: int = 8, random_seed: int = 1927)

Calculate gene set scores for each adata batch in parallel using shared memory. Different marker gene sets will be calculated in parallel as well.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • batch_key (str) – The key in adata.obs used to identify batches.

  • marker_gene (DataFrame) – The marker gene DataFrame.

  • marker_gene_n_groups_indices (list) – Indices specifying the marker gene set group boundaries, used for score normalization within each marker gene set group.

  • max_workers (int) – Maximum number of parallel workers to use.

  • score_layer (str) – The layer of adata to use for scoring.

  • score_method ({'scanpy', 'piaso'}, optional) – The method used for gene set scoring. Must be either ‘scanpy’ (default) or ‘piaso’.
    • ‘scanpy’: Uses Scanpy’s built-in gene set scoring method.

    • ‘piaso’: Uses PIASO’s gene set scoring method, which is more robust to sequencing depth variations.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

Returns:

  • list: A list of normalized score arrays for each batch.

  • list: A list of cell barcodes for each batch.

  • list: A list of gene set names.

Return type:

tuple

Examples

>>> import scanpy as sc
>>> import piaso
>>> adata = sc.read_h5ad('example_data.h5ad')
>>> score_list, cellbarcode_info, gene_set_names = piaso.tl.calculateScoreParallel_multiBatch(
...     adata=adata,
...     batch_key='batch',
...     marker_gene=marker_gene,
...     marker_gene_n_groups_indices=marker_gene_n_groups_indices,
...     score_method='piaso',  # no default in the signature; 'scanpy' is the other option
...     score_layer='piaso',
...     max_workers=8
... )
>>> print(score_list)
>>> print(cellbarcode_info)

piaso.tools.predictCellTypeByGDR(adata, adata_ref, layer: str = 'log1p', layer_reference: str = 'log1p', reference_groupby: str = 'CellTypes', query_groupby: str = 'Leiden', mu: float = 10.0, n_genes: int = 15, return_integration: bool = False, use_highly_variable: bool = True, n_highly_variable_genes: int = 5000, n_svd_dims: int = 50, resolution: float = 1.0, scoring_method: str = None, key_added: str = None, verbosity: int = 0)

Predicts cell types in a query dataset (adata) using the GDR dimensionality reduction method based on a reference dataset (adata_ref). To use GDR for dimensionality reduction, please refer to piaso.tl.runGDR or piaso.tl.runGDRParallel.

Parameters:
  • adata (AnnData) – The query single-cell AnnData object for which cell types are to be predicted.

  • adata_ref (AnnData) – The reference single-cell AnnData object with known cell type annotations.

  • layer (str, optional (default: 'log1p')) – The layer in adata to use for gene expression data. If None, uses the .X matrix.

  • layer_reference (str, optional (default: 'log1p')) – The layer in adata_ref to use for reference gene expression data. If None, uses the .X matrix.

  • reference_groupby (str, optional (default: 'CellTypes')) – The column in adata_ref.obs used to define reference cell type groupings.

  • query_groupby (str, optional (default: 'Leiden')) – The column in adata.obs used for GDR dimensionality reduction, such as clusters identified using the Leiden or Louvain algorithms.

  • mu (float, optional (default: 10.0)) – A regularization parameter for controlling the gene expression specificity, used in COSG (marker gene identification) and GDR.

  • n_genes (int, optional (default: 15)) – The number of top specific genes per group, used in COSG and GDR.

  • return_integration (bool, optional (default: False)) – If True, the function will return the integrated low-dimensional cell embeddings of the query dataset and reference dataset.

  • use_highly_variable (bool, optional (default: True)) – Whether to use highly variable genes, used in GDR.

  • n_highly_variable_genes (int, optional (default: 5000)) – The number of highly variable genes to select, if use_highly_variable is True, used in GDR.

  • n_svd_dims (int, optional (default: 50)) – The number of dimensions to retain during SVD, used in GDR.

  • resolution (float, optional (default: 1.0)) – Resolution parameter for clustering, used in GDR.

  • scoring_method (str, optional (default: None)) – The method used for gene set scoring, used in GDR.

  • key_added (str, optional (default: None)) – A key to add the predicted cell types or integration results to adata.obs. If None, CellTypes_gdr will be used.

  • verbosity (int, optional (default: 0)) – The level of logging output. Higher values produce more detailed logs for debugging and monitoring progress.

Returns:

If return_integration is True, returns an AnnData object of merged reference and query datasets with integrated cell embeddings and predicted cell types. Otherwise, updates adata in place with the predicted cell types.

Return type:

None or AnnData

Example

>>> import scanpy as sc
>>> import piaso
>>> # Load query dataset
>>> adata = sc.read_h5ad("query_data.h5ad")
>>>
>>> # Load reference dataset with known cell type annotations
>>> adata_ref = sc.read_h5ad("reference_data.h5ad")
>>>
>>> # Predict cell types for the query dataset
>>> piaso.tl.predictCellTypeByGDR(
...     adata=adata,
...     adata_ref=adata_ref,
...     layer='log1p',
...     layer_reference='log1p',
...     reference_groupby='CellTypes',
...     query_groupby='Leiden',
...     mu=10.0,
...     n_genes=20,
...     return_integration=False,
...     use_highly_variable=True,
...     n_highly_variable_genes=3000,
...     n_svd_dims=50,
...     resolution=0.8,
...     key_added='CellTypes_gdr',
...     verbosity=0
... )
>>>
>>> # Access the predicted cell types in the query dataset
>>> print(adata.obs['CellTypes_gdr'])

piaso.tools.runCOSGParallel(adata: AnnData, batch_key: str, groupby: str = None, layer: str = None, infog_layer: str = None, n_svd_dims: int = 50, n_highly_variable_genes: int = 5000, verbosity: int = 0, resolution: float = 1.0, mu: float = 1.0, n_gene: int = 30, use_highly_variable: bool = True, return_gene_names: bool = False, max_workers: int = 8, random_seed: int = 1927)

Run COSG on batches in parallel using shared memory and multiprocessing.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • batch_key (str) – The key in adata.obs used to identify batches.

  • groupby (str, optional (default: None)) – The key in adata.obs used to group observations. If None, de novo clustering will be performed.

  • n_svd_dims (int, optional (default: 50)) – Number of SVD components to compute.

  • n_highly_variable_genes (int, optional (default: 5000)) – Number of highly variable genes to use for SVD.

  • verbosity (int, optional (default: 0)) – Level of verbosity for logging information.

  • resolution (float, optional (default: 1.0)) – Resolution parameter for clustering.

  • layer (str, optional (default: None)) – Layer of the adata object to use for COSG.

  • infog_layer (str, optional (default: None)) – If specified, the INFOG normalization will be calculated using this layer of adata.layers, which is expected to contain the UMI count matrix. Defaults to None.

  • mu (float, optional (default: 1.0)) – COSG parameter to control regularization.

  • n_gene (int, optional (default: 30)) – Number of marker genes to compute for each cluster.

  • use_highly_variable (bool, optional (default: True)) – Whether to use highly variable genes for SVD.

  • return_gene_names (bool, optional (default: False)) – Whether to return gene names instead of indices in the marker gene DataFrame.

  • max_workers (int, optional (default: 8)) – Maximum number of parallel workers to use. If None, defaults to the number of available CPU cores.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

Returns:

Combined marker gene DataFrame with batch-specific suffixes.

Return type:

DataFrame

Examples

>>> import scanpy as sc
>>> import piaso
>>> adata = sc.read_h5ad('example_data.h5ad')
>>> marker_genes = piaso.tl.runCOSGParallel(
...     adata=adata,
...     batch_key='batch',
...     groupby=None,
...     n_svd_dims=50,
...     n_highly_variable_genes=5000,
...     verbosity=1,
...     resolution=1.0,
...     layer='log1p',
...     mu=1.0,
...     n_gene=30,
...     use_highly_variable=True,
...     return_gene_names=True,
...     max_workers=4
... )
>>> print(marker_genes.head())

piaso.tools.runGDR(adata, batch_key: str = None, groupby: str = None, n_gene: int = 30, mu: float = 1.0, layer: str = None, score_layer: str = None, infog_layer: str = None, use_highly_variable: bool = True, n_highly_variable_genes: int = 5000, n_svd_dims: int = 50, resolution: float = 1.0, scoring_method: str = None, key_added: str = None, verbosity: int = 0, random_seed: int = 1927)

Run GDR (marker Gene-guided dimensionality reduction) on single-cell data.

GDR performs dimensionality reduction guided by marker genes to better preserve biological signals.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • batch_key (str, optional) – Key in adata.obs representing batch information. Defaults to None. If provided, marker gene identifications will be performed for each batch separately.

  • groupby (str, optional) – Key in adata.obs to specify which cell group information to use. Defaults to None. If None, de novo clustering will be performed.

  • n_gene (int, optional) – Number of genes, parameter used in COSG. Defaults to 30.

  • mu (float, optional) – Gene expression specificity parameter, used in COSG. Defaults to 1.0.

  • layer (str, optional) – Layer in adata.layers to use for the analysis. Defaults to None, which uses adata.X.

  • score_layer (str, optional) – If specified, the gene scoring will be calculated using this layer of adata.layers. Defaults to None.

  • infog_layer (str, optional) – If specified, INFOG normalization will be applied using this layer, which should contain the raw UMI count matrix. Defaults to None.

  • use_highly_variable (bool, optional) – Whether to use only highly variable genes when rerunning the dimensionality reduction. Defaults to True. Only effective when groupby=None.

  • n_highly_variable_genes (int, optional) – Number of highly variable genes to use when use_highly_variable is True. Defaults to 5000. Only effective when groupby=None.

  • n_svd_dims (int, optional) – Number of dimensions to use for SVD. Defaults to 50. Only effective when groupby=None.

  • resolution (float, optional) – Resolution parameter for de novo clustering. Defaults to 1.0. Only effective when groupby=None.

  • scoring_method (str, optional) – Specifies the gene set scoring method used to compute gene scores.

  • key_added (str, optional) – Key under which the GDR dimensionality reduction results will be stored in adata.obsm. If None, results will be saved to adata.obsm[‘X_gdr’].

  • verbosity (int, optional) – Verbosity level of the function. Higher values provide more detailed logs. Defaults to 0.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

Returns:

The function modifies adata in place by adding GDR dimensionality reduction result to adata.obsm[key_added].

Return type:

None

Examples

>>> import scanpy as sc
>>> import piaso
>>>
>>> adata = sc.read_h5ad("example.h5ad")
>>> piaso.tl.runGDR(
...     adata,
...     batch_key="batch",
...     groupby="CellTypes",
...     n_gene=30,
...     verbosity=0
... )
>>> print(adata.obsm["X_gdr"])

piaso.tools.runGDRParallel(adata, batch_key: str = None, groupby: str = None, n_gene: int = 30, mu: float = 1.0, layer: str = None, score_layer: str = None, infog_layer: str = None, use_highly_variable: bool = True, n_highly_variable_genes: int = 5000, n_svd_dims: int = 50, resolution: float = 1.0, scoring_method: str = None, key_added: str = None, max_workers: int = 8, calculate_score_multiBatch: bool = False, verbosity: int = 0, random_seed: int = 1927)

Run GDR (marker Gene-guided dimensionality reduction) in parallel using multiple cores and shared memory.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • batch_key (str, optional) – Key in adata.obs representing batch information. Defaults to None. If specified, different batches will be processed separately and in parallel, otherwise, the input data will be processed as one batch.

  • groupby (str, optional) – Key in adata.obs to specify which cell group information to use. Defaults to None. If None, de novo clustering will be performed.

  • n_gene (int, optional) – Number of genes, parameter used in COSG. Defaults to 30.

  • mu (float, optional) – Gene expression specificity parameter, used in COSG. Defaults to 1.0.

  • layer (str, optional) – Layer in adata.layers to use for the analysis. Defaults to None, which uses adata.X.

  • score_layer (str, optional) – If specified, the gene scoring will be calculated using this layer of adata.layers. Defaults to None.

  • infog_layer (str, optional) – If specified, the INFOG normalization will be calculated using this layer of adata.layers, which is expected to contain the UMI count matrix. Defaults to None.

  • use_highly_variable (bool, optional) – Whether to use only highly variable genes when rerunning the dimensionality reduction. Defaults to True. Only effective when groupby=None.

  • n_highly_variable_genes (int, optional) – Number of highly variable genes to use when use_highly_variable is True. Defaults to 5000. Only effective when groupby=None.

  • n_svd_dims (int, optional) – Number of dimensions to use for SVD. Defaults to 50. Only effective when groupby=None.

  • resolution (float, optional) – Resolution parameter for de novo clustering. Defaults to 1.0. Only effective when groupby=None.

  • scoring_method (str, optional) – Specifies the gene set scoring method used to compute gene scores.

  • key_added (str, optional) – Key under which the GDR dimensionality reduction results will be stored in adata.obsm. If None, results will be saved to adata.obsm[‘X_gdr’].

  • max_workers (int, optional) – Maximum number of workers to use for parallel computation. Defaults to 8.

  • calculate_score_multiBatch (bool, optional) – Whether to calculate gene scores across multiple adata batches (if batch_key is specified). Defaults to False.

  • verbosity (int, optional) – Verbosity level of the function. Higher values provide more detailed logs. Defaults to 0.

  • random_seed (int, optional) – Random seed for reproducibility. Default is 1927.

Returns:

The function modifies adata in place by adding GDR dimensionality reduction result to adata.obsm[key_added].

Return type:

None

Examples

>>> import scanpy as sc
>>> import piaso
>>>
>>> adata = sc.read_h5ad("example.h5ad")
>>> piaso.tl.runGDRParallel(
...     adata,
...     batch_key="batch",
...     groupby="CellTypes",
...     n_gene=30,
...     max_workers=8,
...     verbosity=0
... )
>>> print(adata.obsm["X_gdr"])

piaso.tools.runSVD(adata: AnnData, use_highly_variable: bool = True, n_components: int = 50, random_state: int | None = 10, scale_data: bool = False, key_added: str = 'X_svd', layer: str | None = None, verbosity: int = 0)

Performs Truncated Singular Value Decomposition (SVD) on the specified gene expression matrix (adata.X or a specified layer) within an AnnData object and stores the resulting low-dimensional representation in adata.obsm.

Parameters:
  • adata (AnnData) – An AnnData object.

  • use_highly_variable (bool, optional, default=True) – If True, the decomposition is performed only on highly variable genes/features.

  • n_components (int, optional, default=50) – The number of SVD components to retain.

  • random_state (int, optional, default=10) – A random seed to ensure reproducibility.

  • scale_data (bool, optional, default=False) – If True, standardizes the input data before performing SVD.

  • key_added (str, optional, default='X_svd') – The key under which the resulting cell embeddings are stored in adata.obsm.

  • layer (str, optional, default=None) – Specifies which layer of adata to use for the transformation. If None, adata.X is used.

  • verbosity (int, optional, default=0) – Controls the verbosity of logging messages.

Returns:

The function modifies adata in place, storing the cell embeddings in adata.obsm[key_added].

Return type:

None

Example

>>> import piaso
>>> piaso.tl.runSVD(adata, use_highly_variable=True, n_components=50, random_state=42,
...        scale_data=False, key_added='X_svd', layer=None)
>>>
>>> # Access the transformed data
>>> adata.obsm['X_svd']

piaso.tools.runSVDLazy(adata, copy: bool = False, n_components: int = 50, use_highly_variable: bool = True, n_top_genes: int = 3000, verbosity: int = 0, batch_key: str | None = None, random_state: int | None = 1927, scale_data: bool = False, infog_trim: bool = True, key_added: str = 'X_svd', layer: str | None = None, infog_layer: str | None = None)

Performs Truncated Singular Value Decomposition (SVD) in a “lazy” mode, based on the piaso.tl.runSVD function.

Compared to piaso.tl.runSVD, this function also includes the highly variable gene selection step. If layer is set to ‘infog’, both the highly variable genes and the normalized gene expression values are taken from the INFOG normalization outputs.

This function operates on the specified gene expression matrix (adata.X or a specified layer) within an AnnData object and stores the resulting low-dimensional representation in adata.obsm.

Parameters:
  • adata (AnnData) – An AnnData object.

  • copy (bool, optional, default=False) – If True, returns a copy of adata with the computed embeddings instead of modifying in place.

  • n_components (int, optional, default=50) – The number of singular value decomposition (SVD) components to retain.

  • use_highly_variable (bool, optional, default=True) – If True, uses only highly variable genes for the decomposition.

  • n_top_genes (int, optional, default=3000) – The number of top highly variable genes to retain before performing SVD.

  • verbosity (int, optional, default=0) – Controls the verbosity of logging messages.

  • batch_key (str, optional, default=None) – The key in adata.obs containing batch labels. It is passed as the batch_key argument to the sc.pp.highly_variable_genes function.

  • random_state (int, optional, default=1927) – A random seed to ensure reproducibility.

  • scale_data (bool, optional, default=False) – If True, standardizes the input data before performing SVD.

  • infog_trim (bool, optional, default=True) – Passed as the trim parameter of the piaso.tl.infog function. Effective only when layer is set to ‘infog’.

  • key_added (str, optional, default='X_svd') – The key under which the resulting cell embeddings are stored in adata.obsm.

  • layer (str, optional, default=None) – Specifies which layer of adata to use for the transformation. If None, adata.X is used.

  • infog_layer (str, optional, default=None) – Passed as the layer parameter of the piaso.tl.infog function. Effective only when layer is set to ‘infog’.

Returns:

If copy is True, returns a modified AnnData object. Otherwise, modifies adata in place.

Example

>>> import piaso
>>> piaso.tl.runSVDLazy(
...     adata, n_components=50, n_top_genes=3000,
...     use_highly_variable=True, key_added="X_svd",
...     layer=None
... )
>>>
>>> # Access the cell embedding
>>> adata.obsm['X_svd']

piaso.tools.smoothCellTypePrediction(adata, groupby: str, use_rep: str = 'X_pca', k_nearest_neighbors: int = 5, return_confidence: bool = False, inplace: bool = True, use_existing_adjacency_graph: bool = True, use_faiss: bool = False, key_added: str = None, verbosity: int = 1, n_jobs: int = -1)

Smooth cell type predictions using k-nearest neighbors in a low-dimensional embedding.

Parameters:
  • adata (AnnData) – AnnData object containing single-cell data

  • groupby (str) – Key in adata.obs containing the cell type predictions to smooth

  • use_rep (str, default='X_pca') – Key in adata.obsm containing the low-dimensional embedding to use for finding neighbors

  • k_nearest_neighbors (int, default=5) – Number of neighbors to consider (including the cell itself)

  • return_confidence (bool, default=False) – Whether to return confidence scores (proportion of neighbors with the majority label)

  • inplace (bool, default=True) – Whether to modify adata inplace or return a copy

  • use_existing_adjacency_graph (bool, default=True) – Whether to use existing neighborhood graph (adata.obsp[‘connectivities’]) if available

  • use_faiss (bool, default=False) – Whether to use FAISS for faster neighbor search (requires faiss package)

  • key_added (str or None, default=None) – If provided, use this key as the output key in adata.obs instead of ‘{groupby}_smoothed’

  • verbosity (int, default=1) – Level of verbosity (0=no output, 1=basic info, 2=detailed info)

  • n_jobs (int, default=-1) – Number of jobs for parallel processing. -1 means using all processors.

Returns:

  • If inplace=True – None, but adds ‘{groupby}_smoothed’ (or key_added) to adata.obs. If return_confidence=True, also adds ‘{groupby}_confidence’ (or ‘{key_added}_confidence’) to adata.obs.

  • If inplace=False – Copy of adata with added columns
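
The smoothing idea described above (majority vote among the k nearest neighbors, with confidence defined as the proportion of neighbors carrying the majority label) can be sketched in pure Python. This is a brute-force illustration on plain lists, not PIASO's implementation, which may use an existing neighbor graph or FAISS; smooth_labels is a hypothetical helper.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def smooth_labels(
    embedding: Sequence[Sequence[float]],
    labels: Sequence[str],
    k: int = 5,
) -> Tuple[List[str], List[float]]:
    """Majority-vote smoothing over the k nearest cells (self included).

    Hypothetical helper for illustration only; not part of piaso.
    """
    smoothed, confidence = [], []
    for point in embedding:
        # Squared Euclidean distance to every cell; the cell itself has
        # distance 0, so it is always among its own nearest neighbors.
        dists = [
            (sum((a - b) ** 2 for a, b in zip(point, other)), j)
            for j, other in enumerate(embedding)
        ]
        neighbors = [j for _, j in sorted(dists)[:k]]
        votes = Counter(labels[j] for j in neighbors)
        label, count = votes.most_common(1)[0]
        smoothed.append(label)
        # Confidence: proportion of neighbors with the majority label.
        confidence.append(count / len(neighbors))
    return smoothed, confidence


# Two well-separated groups of cells; cell 1 carries an outlier label.
embedding = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 0.0), (5.1, 0.0), (5.2, 0.0)]
labels = ["A", "B", "A", "B", "B", "B"]
smoothed, confidence = smooth_labels(embedding, labels, k=3)
```

Here the outlier ‘B’ label of cell 1 is replaced by ‘A’, and the first three cells receive a confidence of 2/3 because one of their three neighbors disagrees with the majority.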

Examples

>>> import scanpy as sc
>>> import piaso
>>>
>>> # Basic usage
>>> piaso.tl.smoothCellTypePrediction(
...     adata,
...     groupby='CellTypes_pred',
...     use_rep='X_pca',
...     key_added='CellTypes_pred_smoothed'
... )
>>>
>>> # With confidence scores
>>> piaso.tl.smoothCellTypePrediction(
...     adata,
...     groupby='CellTypes_pred',
...     k_nearest_neighbors=15,
...     return_confidence=True,
...     key_added='CellTypes_pred_smoothed'
... )

piaso.tools.predictCellTypeByMarker(adata, marker_gene_set: List | Dict | DataFrame, score_method: Literal['scanpy', 'piaso'] = 'piaso', score_layer: str | None = 'infog', use_score: bool = True, max_workers: int | None = None, smooth_prediction: bool = True, use_rep: str = 'X_gdr', k_nearest_neighbors: int = 7, return_confidence: bool = True, use_existing_adjacency_graph: bool = False, use_faiss: bool = False, key_added: str = 'CellTypes_predicted', extract_cell_type: bool = False, delimiter_cell_type: str = '-', inplace: bool = True, random_seed: int = 1927, verbosity: int = 1, n_jobs: int = -1)

Predict cell types using marker genes and optionally smooth predictions.

This function performs cell type prediction using marker genes in two steps:
  1. Calculate gene set scores for marker genes.
  2. Optionally smooth the predictions using k-nearest neighbors.

Parameters:
  • adata (AnnData) – AnnData object containing single-cell data

  • marker_gene_set (list, dict, or pandas.DataFrame) – Collection of marker genes for different cell types (pre-filtered/prepared)

  • score_method ({'scanpy', 'piaso'}, default='piaso') – Method to use for scoring marker genes

  • score_layer (str or None, default='infog') – Layer of the AnnData object to use for scoring

  • use_score (bool, default=True) – Whether to use scores (True) or p-values (False) for cell type prediction

  • max_workers (int or None, default=None) – Number of parallel workers for score calculation

  • smooth_prediction (bool, default=True) – Whether to smooth predictions using k-nearest neighbors

  • use_rep (str, default='X_gdr') – Key in adata.obsm containing the low-dimensional embedding to use for neighbor search

  • k_nearest_neighbors (int, default=7) – Number of neighbors to consider for smoothing

  • return_confidence (bool, default=True) – Whether to return confidence scores for smoothed predictions

  • use_existing_adjacency_graph (bool, default=False) – Whether to use existing neighborhood graph if available

  • use_faiss (bool, default=False) – Whether to use FAISS for faster neighbor search

  • key_added (str, default='CellTypes_predicted') – Key to use for storing cell type predictions in adata.obs

  • extract_cell_type (bool, default=False) – Whether to extract cell type name by removing suffix after delimiter

  • delimiter_cell_type (str, default='-') – Delimiter to use when extracting cell type names (only used if extract_cell_type=True)

  • inplace (bool, default=True) – Whether to modify adata in place or return a copy

  • random_seed (int, default=1927) – Random seed for reproducibility

  • verbosity (int, default=1) – Level of verbosity (0=quiet, 1=basic info, 2=detailed)

  • n_jobs (int, default=-1) – Number of jobs for parallel processing during smoothing

Returns:

  • If inplace=False – AnnData: Copy of adata with cell type predictions added

  • If inplace=True – None, but adata is modified in place
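
The first step, turning per-cell gene set scores into labels, can be sketched as follows. Two assumptions are made that the documentation above does not confirm: each cell is assigned the gene set with the highest value (score or -log10 P value), and extract_cell_type keeps the text before the first delimiter. assign_cell_types and the example names are hypothetical.

```python
from typing import List, Sequence


def assign_cell_types(
    scores: Sequence[Sequence[float]],
    gene_set_names: Sequence[str],
    extract_cell_type: bool = False,
    delimiter: str = "-",
) -> List[str]:
    """Pick, for every cell (row), the gene set (column) with the highest value.

    Hypothetical helper for illustration only; not part of piaso.
    """
    predictions = []
    for row in scores:
        # Assumption: the prediction is the arg-max over gene sets.
        best = max(range(len(row)), key=lambda j: row[j])
        name = gene_set_names[best]
        if extract_cell_type:
            # Assumption: keep only the part before the first delimiter.
            name = name.split(delimiter)[0]
        predictions.append(name)
    return predictions


gene_set_names = ["Astro-batch1", "Neuron-batch1", "Astro-batch2"]
scores = [
    [0.9, 0.1, 0.3],  # cell 0
    [0.2, 0.8, 0.1],  # cell 1
    [0.1, 0.2, 0.7],  # cell 2
]
raw_labels = assign_cell_types(scores, gene_set_names)
cell_types = assign_cell_types(scores, gene_set_names, extract_cell_type=True)
```

With extract_cell_type enabled, the batch-specific names ‘Astro-batch1’ and ‘Astro-batch2’ collapse to the same label, ‘Astro’, which is the motivation for stripping the suffix when marker sets come from several batches.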

Examples

>>> import scanpy as sc
>>> import piaso
>>>
>>> # Basic usage
>>> piaso.tl.predictCellTypeByMarker(
...     adata,
...     marker_gene_set=cosgMarkerDB,
...     score_method='piaso',
...     use_score=False,
...     smooth_prediction=True,
...     inplace=True
... )