Pre-processing CellRanger outputs (single sample, human PBMCs)#

This notebook demonstrates the pipeline for processing scRNA-seq data, covering metric computation, normalization, clustering, cell type prediction, and the identification/removal of doublets and low-quality clusters. The final output is pre-processed, annotated scRNA-seq data.

[1]:

import piaso
import cosg
import numpy as np
import pandas as pd
import scanpy as sc
import logging
from sklearn.preprocessing import StandardScaler
import warnings

from matplotlib import rcParams
sc.settings.set_figure_params(scanpy=True, dpi=80, dpi_save=300, frameon=True, vector_friendly=False, fontsize=14)
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42

/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

[2]:

data_dir = "/n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023"
save_dir = "/n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023"
prefix = "HumanPBMCs_Multiome_RNA"

Download the data#

Human PBMC snMultiome data (SAN2) from: De Rop, F.V., Hulselmans, G., Flerin, C. et al. Systematic benchmarking of single-cell ATAC-sequencing protocols. Nat Biotechnol 42, 916–926 (2024). https://doi.org/10.1038/s41587-023-01881-x

Also available via https://drive.google.com/file/d/1g1V-VTy2qHK7PJIbW-t9hRFzHCqFCe4T/view?usp=drive_link

[3]:

# !/home/vas744/Software/gdrive files download --destination /n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023 1g1V-VTy2qHK7PJIbW-t9hRFzHCqFCe4T

Load the data#

[4]:

data_path = data_dir + "/filtered_feature_bc_matrix.h5"
adata=sc.read_10x_h5(data_path)
adata

/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

[4]:

AnnData object with n_obs × n_vars = 4360 × 36601
    var: 'gene_ids', 'feature_types', 'genome', 'interval'

[5]:

adata.var_names_make_unique()

This dataset contains a single sample, but we will add a Sample column to adata.obs and assign a sample name to facilitate downstream analysis.

[6]:

adata.obs['Sample']="HumamPBMCs"

[7]:

adata.X.data

[7]:

array([ 2.,  1.,  1., ..., 43., 18., 56.], shape=(6669764,), dtype=float32)

Create the ‘raw’ counts layer based on the counts data from the adata object.

[8]:

adata.layers['raw'] = adata.X
adata

[8]:

AnnData object with n_obs × n_vars = 4360 × 36601
    obs: 'Sample'
    var: 'gene_ids', 'feature_types', 'genome', 'interval'
    layers: 'raw'

[9]:

adata.var

[9]:

	gene_ids	feature_types	genome	interval
MIR1302-2HG	ENSG00000243485	Gene Expression	GRCh38	chr1:29553-30267
FAM138A	ENSG00000237613	Gene Expression	GRCh38	chr1:36080-36081
OR4F5	ENSG00000186092	Gene Expression	GRCh38	chr1:65418-69055
AL627309.1	ENSG00000238009	Gene Expression	GRCh38	chr1:120931-133723
AL627309.3	ENSG00000239945	Gene Expression	GRCh38	chr1:91104-91105
...	...	...	...	...
AC141272.1	ENSG00000277836	Gene Expression	GRCh38	KI270728.1:1270983-1270984
AC023491.2	ENSG00000278633	Gene Expression	GRCh38	KI270731.1:13000-13001
AC007325.1	ENSG00000276017	Gene Expression	GRCh38	KI270734.1:72410-72411
AC007325.4	ENSG00000278817	Gene Expression	GRCh38	KI270734.1:131493-131494
AC007325.2	ENSG00000277196	Gene Expression	GRCh38	KI270734.1:161749-161852

36601 rows × 4 columns

Quality Control#

Filter cells#

Filter out cells with fewer than 200 detected genes.

[10]:

sc.pp.filter_cells(adata, min_genes=200)

Calculate cell quality metrics#

We identify mitochondrial and ribosomal protein genes and compute their proportion in each cell’s total read count. A high proportion of these reads often indicates low-quality cells. Note: In human data, the gene names have uppercase letters (MT-, RPS-, RPL-). For mouse data, these gene names will start with Mt-, Rps-, Rpl-

[11]:

adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

[12]:

adata.var['ribo'] = adata.var_names.str.startswith('RPS','RPL')
sc.pp.calculate_qc_metrics(adata, qc_vars=['ribo'], percent_top=None, log1p=False, inplace=True)

/tmp/ipykernel_2039358/3834852883.py:1: FutureWarning: Allowing a non-bool 'na' in obj.str.startswith is deprecated and will raise in a future version.
  adata.var['ribo'] = adata.var_names.str.startswith('RPS','RPL')

[13]:

piaso.pl.plot_features_violin(adata,
                              ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
                              width_single=2.0)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_22_0.png

Doublet prediction#

Next, we compute the scrublet score to identify and predict potential doublets.

[14]:

experiments=np.unique(adata.obs['Sample'])
adata.obs['scrublet_score']=np.repeat(0,adata.n_obs)
adata.obs['predicted_doublets']=np.repeat(False,adata.n_obs)

[15]:

import scrublet as scr

for experiment in experiments:
    print(experiment)
    adatai=adata[adata.obs['Sample']==experiment]

    scrub = scr.Scrublet(adatai.X.todense(),random_state=10)
    doublet_scores, predicted_doublets = scrub.scrub_doublets()

    adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets

    adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores

HumamPBMCs
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.14
Detected doublet rate = 17.0%
Estimated detectable doublet fraction = 79.6%
Overall doublet rate:
        Expected   = 10.0%
        Estimated  = 21.4%
Elapsed time: 6.7 seconds

/tmp/ipykernel_2039358/3464195482.py:10: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
/tmp/ipykernel_2039358/3464195482.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
/tmp/ipykernel_2039358/3464195482.py:12: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
/tmp/ipykernel_2039358/3464195482.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
/tmp/ipykernel_2039358/3464195482.py:12: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[0.07518797 0.29180328 0.03955842 ... 0.13812155 0.07239264 0.06466513]' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
  adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores

[16]:

piaso.pl.plot_features_violin(adata,
                              ['scrublet_score'],
                              groupby='Sample',
                              width_single=2)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_26_0.png

Remove doublets based on a scrublet score threshold.

[17]:

adata = adata[adata.obs['scrublet_score']<0.3].copy()
adata

[17]:

AnnData object with n_obs × n_vars = 4026 × 36601
    obs: 'Sample', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'scrublet_score', 'predicted_doublets'
    var: 'gene_ids', 'feature_types', 'genome', 'interval', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'ribo'
    uns: 'Sample_colors'
    layers: 'raw'

INFOG Normalization#

We use PIASO’s INFOG to normalize the data and identify a highly variable set of genes.

[18]:

%%time
piaso.tl.infog(adata,
               copy=False,
               inplace=False,
               n_top_genes=3000,
               key_added='infog',
               key_added_highly_variable_gene='highly_variable',
               layer='raw')

The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 361 ms, sys: 67.9 ms, total: 429 ms
Wall time: 429 ms

Dimensionality reduction#

[19]:

%%time
piaso.tl.runSVD(adata,
                use_highly_variable=True,
                n_components=30,
                random_state=10,
                key_added='X_svd',
                layer='infog')

CPU times: user 2.71 s, sys: 342 μs, total: 2.71 s
Wall time: 438 ms

[20]:

%%time
sc.pp.neighbors(adata,
                use_rep='X_svd',
                n_neighbors=15,
                random_state=10,
                knn=True,
                method="umap")

sc.tl.umap(adata)

CPU times: user 25.8 s, sys: 285 ms, total: 26.1 s
Wall time: 23.7 s

[21]:

sc.pl.umap(adata,
           color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
           cmap='Spectral_r',
           palette=piaso.pl.color.d_color1,
           ncols=2,
           size=10,
           frameon=False)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_34_0.png

Save the data#

[22]:

save_dir+'/'+prefix+'_raw_QC.h5ad'

[22]:

'/n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023/HumanPBMCs_Multiome_RNA_raw_QC.h5ad'

[23]:

adata.write(save_dir+'/'+prefix+'_raw_QC.h5ad')

[24]:

# adata = sc.read(save_dir+'/'+prefix+'_raw_QC.h5ad')

Filtering cells based on gene counts#

Outlier cells of these thresholds are likely low quality cells, so we filter them out before further downstream analysis.

[25]:

adata=adata[adata.obs['n_genes_by_counts']>500].copy()

[26]:

adata=adata[adata.obs['n_genes_by_counts']<10000].copy()

[27]:

sc.pl.umap(adata,
           color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
           cmap='Spectral_r',
           palette=piaso.pl.color.d_color1,
           ncols=2,
           size=10,
           frameon=False)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_42_0.png

Annotation#

We used the precomputed Human PBMC marker gene sets from the AllenHumanImmuneHealthAtlas_L2 study in PIASOmarkerDB. The original dataset comes from https://www.nature.com/articles/s41586-025-09686-5.

[28]:

marker_gene_df = pd.read_csv("/n/scratch/users/v/vas744/Data/Public/10x_Human_PBMC/PIASOmarkerDB_AllenHumanImmuneHealthAtlas_L2_251219.csv")
marker_gene_df

[28]:

	Cell_Type	Gene	Specificity_Score	Study_Publication	Species	Tissue	Condition
0	Platelet	GP9	16.455748	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
1	Erythrocyte	ALAS2	16.422475	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
2	Platelet	CMTM5	16.400534	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
3	Platelet	TMEM40	16.373334	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
4	Erythrocyte	AHSP	16.338337	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
...	...	...	...	...	...	...	...
1445	Memory CD4 T cell	FXYD7	1.783520	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
1446	Memory CD4 T cell	CRIP1	1.779265	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
1447	Memory CD4 T cell	CCDC167	1.775660	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
1448	Memory CD4 T cell	CD6	1.775110	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
1449	Memory CD4 T cell	TMEM238	1.769377	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal

1450 rows × 7 columns

Preprocessing cell type-specific marker gene sets#

PIASO’s predictCellTypeByMarker expects the marker_gene_list in DataFrame or dict format. So here, we will create a dictionary with each cell type and their respective marker gene lists.

[29]:

marker_gene_dict = {}
cell_types = np.unique(marker_gene_df['Cell_Type'])
for cell_type in cell_types:
    marker_gene_dict[cell_type] = list(marker_gene_df[marker_gene_df['Cell_Type']==cell_type]['Gene'])

1. INFOG Normalization#

[30]:

%%time
piaso.tl.infog(adata,
               copy=False,
               inplace=False,
               n_top_genes=3000,
               key_added='infog',
               key_added_highly_variable_gene='highly_variable',
               layer='raw')

The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 355 ms, sys: 43.8 ms, total: 399 ms
Wall time: 400 ms

Dimensionality reduction#

[31]:

%%time
piaso.tl.runSVD(adata,
                use_highly_variable=True,
                n_components=30,
                random_state=10,
                key_added='X_svd',
                layer='infog')

CPU times: user 2.6 s, sys: 1.61 ms, total: 2.6 s
Wall time: 452 ms

[32]:

%%time
sc.pp.neighbors(adata,
                use_rep='X_svd',
                n_neighbors=15,
                random_state=10,
                knn=True,
                method="umap")

sc.tl.umap(adata)

CPU times: user 18 s, sys: 25.9 ms, total: 18.1 s
Wall time: 14.4 s

Predict cell types based on marker gene sets#

Here, we use the use_score = False to make the predictions based on p-values, but you can also use the gene set scores with use_score = True.

[33]:

%%time
piaso.tl.predictCellTypeByMarker(adata,
                                 marker_gene_set=marker_gene_dict,
                                 score_method='piaso',
                                 use_rep='X_svd',
                                 use_score=False,
                                 smooth_prediction=True,
                                 inplace=True)

Calculating gene set scores using piaso method...

Scoring gene sets: 100%|██████████| 29/29 [00:18<00:00,  1.59set/s]

Predicting cell types based on marker gene p-values...
Smoothing cell type predictions...
Smoothing cell type predictions from 'CellTypes_predicted_raw' using 7-nearest neighbors
Smoothed predictions stored in adata.obs['CellTypes_predicted_smoothed']
Confidence scores stored in adata.obs['CellTypes_predicted_smoothed_confidence']
Modified 934 cell labels (23.42% of total)
Cell type prediction completed. Results saved to:
  - adata.obs['CellTypes_predicted']: predicted cell types
  - adata.obsm['CellTypes_predicted_score']: full score matrix
  - adata.obsm['CellTypes_predicted_nlog10pvals']: full -log10(p-value) matrix
  - adata.obs['CellTypes_predicted_nlog10pvals']: maximum -log10(p-values)
  - adata.obs['CellTypes_predicted_raw']: original unsmoothed predictions
  - adata.obs['CellTypes_predicted_confidence_smoothed']: smoothing confidence scores
CPU times: user 4.05 s, sys: 141 ms, total: 4.19 s
Wall time: 23.1 s

[34]:

sc.pl.umap(adata,
           color=['CellTypes_predicted'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=8,
           legend_fontoutline=1,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_56_0.png

Clustering#

[35]:

%%time
sc.tl.leiden(adata,resolution=2.5,key_added='Leiden')

<timed eval>:1: FutureWarning: In the future, the default backend for leiden will be igraph instead of leidenalg.

 To achieve the future defaults please pass: flavor="igraph" and n_iterations=2.  directed must also be False to work with igraph's implementation.

CPU times: user 1.71 s, sys: 15 ms, total: 1.73 s
Wall time: 1.73 s

[36]:

sc.pl.umap(adata,
           color=['CellTypes_predicted','Leiden'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=12,
           legend_fontoutline=2,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_59_0.png

Identify marker genes with COSG#

[37]:

%%time
n_gene=30
cosg.cosg(adata,
          key_added='cosg',
          use_raw=False,
          layer='infog',
          mu=100,
          expressed_pct=0.1,
          remove_lowly_expressed=True,
          n_genes_user=n_gene,
          groupby='Leiden')

CPU times: user 571 ms, sys: 69.8 ms, total: 640 ms
Wall time: 641 ms

[38]:

sc.tl.dendrogram(adata,groupby='Leiden',use_rep='X_svd')
df_tmp=pd.DataFrame(adata.uns['cosg']['names'][:3,]).T
df_tmp=df_tmp.reindex(adata.uns['dendrogram_'+'Leiden']['categories_ordered'])
marker_genes_list={idx: list(row.values) for idx, row in df_tmp.iterrows()}
marker_genes_list = {k: v for k, v in marker_genes_list.items() if not any(isinstance(x, float) for x in v)}

sc.pl.dotplot(adata,
              marker_genes_list,
              groupby='Leiden',
              layer='infog',
              dendrogram=True,
              swap_axes=False,
              standard_scale='var',
              cmap='Spectral_r')

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_62_0.png

[39]:

marker_gene=pd.DataFrame(adata.uns['cosg']['names'])

Removal of low-quality cells#

Here, we identify and remove some low quality clusters based on cell quality metrics and marker gene sets.

[40]:

piaso.pl.plot_features_violin(adata,
                              ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
                              groupby='Leiden')

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_65_0.png

Remove low quality cells#

Some of the cells seems to have higher percentage of mitochondrial and ribosomal gene expression, indicating that they are low quality.

[41]:

adata = adata[adata.obs['pct_counts_mt']<20].copy()

[42]:

adata = adata[adata.obs['pct_counts_ribo']<5].copy()

[43]:

sc.pl.umap(adata,
           color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
           cmap='Spectral_r',
           palette=piaso.pl.color.d_color1,
           ncols=2,
           size=10,
           frameon=False)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_69_0.png

[44]:

sc.pl.umap(adata,
           color=['Leiden'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=12,
           legend_fontoutline=2,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_70_0.png

Checking potential low quality clusters#

[45]:

cluster_check = '12'
marker_gene[cluster_check].values

[45]:

array(['FP236383.3', 'SIGLEC9', 'MTRNR2L8', 'SIRPA', 'GABARAPL1',
       'MT-ND1', 'FAM157C', 'SMIM13', 'ADAMTSL4-AS1', 'OLR1', 'EPB41L3',
       'ATG16L2', 'CSF3R', 'ADAM15', 'NAMPT', 'CCL3L1', 'NACC2', 'S100A9',
       'S100A12', 'YBX3', 'TYMP', 'STAB1', 'PLAUR', 'CD14', 'PTPRE',
       'KDM6B', 'ACSS2', 'NR4A3', 'FRMD4B', 'NEAT1'], dtype=object)

[46]:

sc.pl.umap(adata,
           color=['FP236383.3', 'SIGLEC9', 'MTRNR2L8', 'SIRPA', 'GABARAPL1',
                  'MT-ND1', 'FAM157C', 'SMIM13', 'ADAMTSL4-AS1', 'OLR1', 'EPB41L3'],
           cmap=piaso.pl.color.c_color1,
           palette=piaso.pl.color.d_color1,
           ncols=3,
           size=10,
           frameon=False)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_73_0.png

[47]:

sc.pl.dotplot(adata,
              marker_gene[cluster_check].values[:30],
              groupby='Leiden',
              dendrogram=False,
              swap_axes=True,
              standard_scale='var',
              cmap='Spectral_r')

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_74_0.png

[48]:

sc.pl.umap(adata,
           color=['Leiden'],
           groups=['12','13','23','25'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=12,
           legend_fontoutline=2,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_75_0.png

[49]:

adata=adata[~adata.obs['Leiden'].isin(['12','13','23','25'])].copy()

2. INFOG Normalization#

It is important to re-run INFOG normalization and dimensionality reduction after filtering out cells, to accurately capture the current cell population.

[50]:

%%time
piaso.tl.infog(adata,
               copy=False,
               inplace=False,
               n_top_genes=3000,
               key_added='infog',
               key_added_highly_variable_gene='highly_variable',
               layer='raw')

The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 321 ms, sys: 54.8 ms, total: 376 ms
Wall time: 375 ms

Dimensionality reduction#

[51]:

%%time
piaso.tl.runSVD(adata,
                use_highly_variable=True,
                n_components=30,
                random_state=10,
                key_added='X_svd',
                layer='infog')

CPU times: user 2.11 s, sys: 1.75 ms, total: 2.11 s
Wall time: 366 ms

[52]:

%%time
sc.pp.neighbors(adata,
                use_rep='X_svd',
                n_neighbors=15,
                random_state=10,
                knn=True,
                method="umap")

sc.tl.umap(adata)

CPU times: user 16.4 s, sys: 4.84 ms, total: 16.4 s
Wall time: 13.2 s

Predict cell types based on marker gene sets#

[53]:

%%time
piaso.tl.predictCellTypeByMarker(adata,
                                 marker_gene_set=marker_gene_dict,
                                 score_method='piaso',
                                 use_rep='X_svd',
                                 use_score=False,
                                 key_added = 'CellTypes_predicted',
                                 smooth_prediction=True,
                                 inplace=True)

Calculating gene set scores using piaso method...

Scoring gene sets: 100%|██████████| 29/29 [00:21<00:00,  1.33set/s]

Predicting cell types based on marker gene p-values...
Smoothing cell type predictions...
Smoothing cell type predictions from 'CellTypes_predicted_raw' using 7-nearest neighbors
Smoothed predictions stored in adata.obs['CellTypes_predicted_smoothed']
Confidence scores stored in adata.obs['CellTypes_predicted_smoothed_confidence']
Modified 766 cell labels (22.07% of total)
Cell type prediction completed. Results saved to:
  - adata.obs['CellTypes_predicted']: predicted cell types
  - adata.obsm['CellTypes_predicted_score']: full score matrix
  - adata.obsm['CellTypes_predicted_nlog10pvals']: full -log10(p-value) matrix
  - adata.obs['CellTypes_predicted_nlog10pvals']: maximum -log10(p-values)
  - adata.obs['CellTypes_predicted_raw']: original unsmoothed predictions
  - adata.obs['CellTypes_predicted_confidence_smoothed']: smoothing confidence scores
CPU times: user 3.62 s, sys: 138 ms, total: 3.75 s
Wall time: 26.1 s

[54]:

sc.pl.umap(adata,
           color=['CellTypes_predicted'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=8,
           legend_fontoutline=1,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_84_0.png

Identify marker genes with COSG#

[55]:

%%time
n_gene=30
cosg.cosg(adata,
          key_added='cosg',
          use_raw=False,
          layer='infog',
          mu=100,
          expressed_pct=0.1,
          remove_lowly_expressed=True,
          n_genes_user=n_gene,
          groupby='CellTypes_predicted')

CPU times: user 471 ms, sys: 5.97 ms, total: 477 ms
Wall time: 477 ms

[56]:

marker_gene=pd.DataFrame(adata.uns['cosg']['names'])
marker_gene

[56]:

	ASDC	CD8aa	CD14 monocyte	CD16 monocyte	CD56bright NK cell	CD56dim NK cell	DN T cell	Effector B cell	Erythrocyte	Intermediate monocyte	...	Naive B cell	Naive CD4 T cell	Naive CD8 T cell	Plasma cell	Proliferating NK cell	Transitional B cell	Treg	cDC1	cDC2	gdT
0	AC097375.1	PDGFB	DYSF	AC104809.2	SPTSSB	BNC2	LINC01726	SOX5	IGLV5-52	SASH1	...	AL139020.1	ANKRD55	LINC02446	IGHA1	AC130456.5	LINC01266	AC022929.2	AC092828.1	FCER1A	TRDC
1	LINC01478	STX1A	CRISPLD2	CDKN1C	LINC01023	MYOM2	TRBJ1-2	FCRL5	TRBV5-5	PID1	...	COL19A1	AL109930.1	NELL2	IGHJ5	AC104073.4	JAM2	LINC02694	AL355312.4	ENHO	TRGC1
2	EPHB1	BICRA-AS1	LIN7A	LINC02085	CABLES1	GRIK4	LINC02252	CNFN	SLC24A5	SRC	...	SLC38A11	FHIT	CD8B	MIXL1	LINC02537	MME	FOXP3	TKTL2	MRC1	TRGV9
3	LRRC26	SLC35C1	VCAN	LYPD2	TNFRSF11A	SPON2	OOSP1	WSCD2	LONRF2	CPVL	...	IGHD	TSHZ2	NRCAM	JCHAIN	ACRV1	PYCR3	FANK1	HOXA10	CD1C	KLRC1
4	AC018643.1	MLST8	PADI4	GPR20	XCL1	LINGO2	LINC01169	TLCD4	CR382287.2	AC009093.2	...	LIX1-AS1	KANK1	GLIS3	SDC1	AC022517.1	LINC00494	AC013652.1	TACSTD2	CLEC10A	PTMS
5	CUX2	ZSCAN26	NLRP12	AC020651.2	AC002429.2	MIR181A2HG	AC016727.3	UQCRHL	LINC00664	EPB41L3	...	AKAP6	LEF1	AL590006.1	IGHV3-15	AC007179.1	NEIL1	HACD1	AC109347.1	PKIB	IL12RB2
6	AC023590.1	MAMLD1	WLS	PPP1R17	PPP1R9A	CD160	AC023154.1	SIGLEC6	RND2	IGF2BP2	...	PCDH9	PLCL1	CA6	VPREB1	LINC02289	AC103831.1	DUSP4	ZBTB46-AS1	LRP1B	CCL5
7	KCNK1	AGMAT	CXCL8	CKB	ZMAT4	FGFBP2	AC026765.3	PTCHD1-AS	AC023669.2	RTN1	...	LINC00926	AK5	LRRN3	IGKV3-15	AC005703.7	AL365272.1	RTKN2	ARHGEF3-AS1	NCKAP5	AC010175.1
8	AC104819.3	MRPS2	FCAR	HES4	SERPINE1	LDB2	SLC14A2	CLECL1	AC011477.9	PSAP	...	LIX1	EDAR	PDE3B	IGHV3-30	SNHG18	TIMP3	CDHR3	AC069410.1	CD1E	S100B
9	AP000146.1	AL137779.1	LUCAT1	LYNX1	AL590807.1	LINC01505	AC092683.2	WNT16	AL133343.2	IFI30	...	ABCB4	FAM153A	THEMIS	MZB1	Z92544.2	STIL	WEE1	AL356056.1	ST18	LINC00612
10	AC097375.2	PPP2R1B	AL163541.1	CROCC2	RAMP1	IGFBP7	C17orf98	SLCO4A1	AC067838.1	PTGIR	...	FCER2	LEF1-AS1	CACHD1	ITGA8	NADK2-AS1	TCL6	EVC2	AC141586.3	HTR7	KLRG1
11	CYYR1	AC004918.1	PRICKLE1	VMO1	EFHC2	LINC00298	ZNF843	GALNTL6	AC114971.1	CD300E	...	KHDRBS2	ITGA6	SCML1	GLDC	LINC02721	AC139795.2	CCR10	CLNK	DTNA	A2M
12	LAMP5	CRTAM	GLT1D1	FMNL2	KLRC2	MMP23B	AC004223.2	AC116534.1	TESC-AS1	AC020656.1	...	STEAP1B	MAN1C1	APBA2	TXNDC5	AC022296.4	SMIM15	STAM	CLEC9A	FLT3	PZP
13	NRP1	AC025917.1	CSF3R	PAPSS2	NCAM1	PRF1	OSGEPL1-AS1	AL592429.2	AC003988.1	MARCO	...	FCRL1	CHRM3-AS2	SNTG2	TNFRSF17	PPP1R14B-AS1	PPAN	FGFR1	BCHE	SLAMF8	CACNA2D2
14	LINC01724	CDC14B	VCAN-AS1	NEURL1	KLRC3	PRSS23	AC008914.1	TNFRSF13B	AKR1C4	ZC3H12C	...	SCN3A	PRKCA	CRTAM	AL357873.1	AL138759.1	ZNF486	GMNN	HLA-DQB1-AS1	CST3	TNFSF14
15	LINC00865	STMN3	NRG1	SMIM25	C1orf141	GZMB	AKR1C1	LINC02669	AC134407.1	LYZ	...	IL4R	TRABD2A	CD8A	IGLV2-14	AC002563.1	AC078980.1	ZC2HC1A	CLSTN2	DEPTOR	A2M-AS1
16	PHEX	KIF21A	ACSL1	C1QA	C6orf226	CLIC3	AC243965.1	GPM6A	C20orf144	LILRB4	...	LARGE1	EPHX2	LRRC7	PLAAT2	SRARP	MAP2	SNX21	LRRC49	HLA-DRB1	COL6A2
17	LINC01226	AL645728.1	AQP9	ICAM4	TSPAN2	RHOBTB3	RAB3C	AL031728.1	SPACA3	FADS1	...	AFF3	MAL	EPHA1-AS1	CNKSR1	CR381670.2	AC061958.1	ICA1	SMO	ZNF366	HOPX
18	TCF4-AS1	IVD	SPOCK1	TCF7L2	TNFRSF18	AKR1C3	LINC00536	AC022509.2	AC112512.1	FCGRT	...	PAX5	RPL34-AS1	PRKCA-AS1	IGLC2	GRIK1-AS1	AC098829.1	MCF2L2	MIR924HG	HLA-DPA1	SH3BGRL3
19	PTPRS	GCLM	CYP1B1	LINC02345	LINC01146	KLRF1	EPPK1	AC090541.1	CYB5D1	FGL2	...	AC108879.1	MLLT3	PRKAR1B	IGHG2	FAM215A	C14orf119	TTN	AC022217.2	IL1R2	LINC00987
20	PTCRA	MGAT2	LINC00937	SFTPD	KLHL23	PDGFD	AL591684.1	SYT1	GRIA3	EMILIN2	...	SYN3	IL6ST	AC009041.2	GPRC5D	S100A2	ZNF350-AS1	OSBP2	AC104964.1	CLIC2	GNLY
21	LILRA4	SMYD2	PLBD1	CEACAM3	XCL2	TTC38	BNIPL	AC109446.3	SPDYE2B	MRAS	...	AC120193.1	AL163932.1	NR3C2	IGKC	DUSP13	CD180	IL2RA	LINC01134	HLA-DPB1	TGFBR3
22	ARPP21	PBDC1	TREM1	KNDC1	ROR1	NCR1	IGSF22	WDR11-AS1	AC005498.2	SECTM1	...	ADAM28	CCR7	LEF1	IGLV2-11	AL138976.2	ZNF669	SPSB1	SHC3	ITGB5	LYSMD2
23	TDRD1	C7orf26	CD36	LST1	IL12RB2	SH2D1B	GINS2	GEN1	MOBP	TSPAN4	...	KCNH8	CAMK4	ABLIM1	KCNMA1	AP005209.2	FAM3C	IKZF2	KCND3	NEGR1	GZMM
24	PLCB4	CORO2A	LPAR1	MS4A7	ATP8B4	SORBS2	AL033528.2	MS4A1	ZSCAN23	LGALS3	...	IGHM	IGF1R	CCR7	AC009570.2	TROAP	ADGRF3	MIAT	LAMB2	ADGRG6	C1orf21
25	SLC35F3	NIPSNAP3A	RBM47	SIGLEC10	GPR68	RNF165	LINC01949	PLPP3	LINC01182	CFP	...	SLC9A7	OXNAD1	PSMA1	EML6	LINC02354	MIR762HG	ZNF365	STEAP3	ZBTB46	ZDHHC12
26	NTM	SAP30L	TNFAIP6	CLEC4F	IL18R1	CCL4	PRMT6	GRAPL	VWCE	TTYH3	...	GLYATL1	SERINC5	OXNAD1	IGHG3	USP6	HAUS3	AC110611.1	TMEM201	FZD2	SH2D1A
27	COL24A1	CDKL3	FAM198B-AS1	FCGR3A	KLRC1	S1PR5	AC093227.1	EFNA4	PLEKHA4	TNFAIP2	...	LINC01800	RNF157	MAML2	IGHA2	PTPN3	OXSM	CTLA4	DZIP1L	CYP2S1	NDUFS3
28	AR	AC112722.1	AC020916.1	CTSL	KXD1	IL2RB	AC015911.6	AC025569.1	AC074366.1	NHSL1	...	AC019197.1	PRKCQ-AS1	NDFIP1	IGF1	AC091153.3	NCAM2	AK6	ENOX1	HLA-DRA	DUSP2
29	COL26A1	SEC23IP	AL034397.3	SPRED1	LINC01684	ANKRD20A4	PMP22	CD70	P3H3	CTSH	...	AP001636.3	TCF7	ACTN1	TSHR	TRAV18	AHCYL1	TOX	AC092794.1	LHFPL6	GZMH

30 rows × 24 columns

We can use a dendrogram dot plot to visualize the expression of the top three marker genes of each predicted cell type.

[57]:

sc.tl.dendrogram(adata,groupby='CellTypes_predicted',use_rep='X_svd')
df_tmp=pd.DataFrame(adata.uns['cosg']['names'][:3,]).T
df_tmp=df_tmp.reindex(adata.uns['dendrogram_'+'CellTypes_predicted']['categories_ordered'])
marker_genes_list={idx: list(row.values) for idx, row in df_tmp.iterrows()}
marker_genes_list = {k: v for k, v in marker_genes_list.items() if not any(isinstance(x, float) for x in v)}

sc.pl.dotplot(adata,
              marker_genes_list,
              groupby='CellTypes_predicted',
              layer='infog',
              dendrogram=True,
              swap_axes=False,
              standard_scale='var',
              cmap='Spectral_r')

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_89_0.png

Marker genes for CD16 monocyte#

[58]:

cluster_check = 'CD16 monocyte'
marker_gene[cluster_check].values

[58]:

array(['AC104809.2', 'CDKN1C', 'LINC02085', 'LYPD2', 'GPR20',
       'AC020651.2', 'PPP1R17', 'CKB', 'HES4', 'LYNX1', 'CROCC2', 'VMO1',
       'FMNL2', 'PAPSS2', 'NEURL1', 'SMIM25', 'C1QA', 'ICAM4', 'TCF7L2',
       'LINC02345', 'SFTPD', 'CEACAM3', 'KNDC1', 'LST1', 'MS4A7',
       'SIGLEC10', 'CLEC4F', 'FCGR3A', 'CTSL', 'SPRED1'], dtype=object)

[59]:

sc.pl.umap(adata,
           color=['AC104809.2', 'GPR20', 'CDKN1C', 'AC020651.2', 'C1QA', 'LINC02085',
                  'LYPD2', 'PPP1R17', 'CKB', 'HES4', 'CROCC2', 'FMNL2', 'VMO1'],
           cmap=piaso.pl.color.c_color1,
           palette=piaso.pl.color.d_color1,
           ncols=3,
           size=10,
           frameon=False)

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_92_0.png

[60]:

sc.pl.dotplot(adata,
              marker_gene[cluster_check].values[:30],
              groupby='CellTypes_predicted',
              dendrogram=False,
              swap_axes=True,
              standard_scale='var',
              cmap='Spectral_r')

../_images/notebooks_scRNAseqProcessingTutorial_SAN2_93_0.png

Save the data#

[61]:

adata.write(save_dir + '/' + prefix + '_QC.h5ad')

[62]:

marker_gene.to_csv(save_dir + '/' + prefix + 'CellType_markerGenes.csv')

Table of Contents

Pre-processing CellRanger outputs (single sample, human PBMCs)#

Download the data#

Load the data#

Quality Control#

Filter cells#

Calculate cell quality metrics#

Doublet prediction#

INFOG Normalization#

Dimensionality reduction#

Save the data#

Filtering cells based on gene counts#

Annotation#

Preprocessing cell type-specific marker gene sets#

1. INFOG Normalization#

Dimensionality reduction#

Predict cell types based on marker gene sets#

Clustering#

Identify marker genes with COSG#

Removal of low-quality cells#

Remove low quality cells#

Checking potential low quality clusters#

2. INFOG Normalization#

Dimensionality reduction#

Predict cell types based on marker gene sets#

Identify marker genes with COSG#

Marker genes for CD16 monocyte#

Save the data#