Pre-processing CellRanger outputs (single sample)#

This notebook demonstrates the pipeline for processing scRNA-seq data, covering metric computation, normalization, clustering, cell type prediction, and the identification/removal of doublets and low-quality clusters. The final output is pre-processed, annotated scRNA-seq data.

[1]:
import sys
path = '/home/vas744/Analysis/Python/Packages/PIASO'
sys.path.append(path)
path = '/home/vas744/Analysis/Python/Packages/COSG'
sys.path.append(path)
import piaso
import cosg
/home/vas744/.local/lib/python3.9/site-packages/networkx/utils/backends.py:135: RuntimeWarning: networkx backend defined more than once: nx-loopback
  backends.update(_get_backends("networkx.backends"))
[2]:
import numpy as np
import pandas as pd
import scanpy as sc
import logging
from matplotlib import rcParams
from sklearn.preprocessing import StandardScaler
import warnings

# To modify the default figure size, use rcParams.
rcParams['figure.figsize'] = 4, 4
rcParams['font.sans-serif'] = "Arial"
rcParams['font.family'] = "Arial"
sc.settings.verbosity = 3
sc.logging.print_header()
sc.set_figure_params(dpi=80,dpi_save=300, color_map='viridis',facecolor='white')
scanpy==1.10.3 anndata==0.10.8 umap==0.5.7 numpy==1.26.4 scipy==1.13.0 pandas==2.2.3 scikit-learn==1.5.2 statsmodels==0.14.4 igraph==0.11.5 louvain==0.8.2 pynndescent==0.5.13
[3]:
warnings.simplefilter(action='ignore', category=FutureWarning)

Load the data#

The dataset used in this tutorial was obtained from: https://www.10xgenomics.com/datasets/10k-Mouse-Brain-CNIK-3p-gemx

Download the required dataset from Google Drive: https://drive.google.com/file/d/1nsaOC-__jUjXUjGqaUMaGBT2LWyu_L5_/view?usp=drive_link

[4]:
!/home/vas744/Software/gdrive files download --overwrite --destination /n/scratch/users/v/vas744/Data/Public/PIASO 1nsaOC-__jUjXUjGqaUMaGBT2LWyu_L5_
Downloading 10k_Mouse_Brain_CNIK_3p_gemx_10k_Mouse_Brain_CNIK_3p_gemx_count_sample_filtered_feature_bc_matrix.h5
Successfully downloaded 10k_Mouse_Brain_CNIK_3p_gemx_10k_Mouse_Brain_CNIK_3p_gemx_count_sample_filtered_feature_bc_matrix.h5
[5]:
data_path = "/n/scratch/users/v/vas744/Data/Public/PIASO/10k_Mouse_Brain_CNIK_3p_gemx_10k_Mouse_Brain_CNIK_3p_gemx_count_sample_filtered_feature_bc_matrix.h5"

adata=sc.read_10x_h5(data_path)
reading /n/scratch/users/v/vas744/Data/Public/PIASO/10k_Mouse_Brain_CNIK_3p_gemx_10k_Mouse_Brain_CNIK_3p_gemx_count_sample_filtered_feature_bc_matrix.h5
 (0:00:02)
/home/vas744/.local/lib/python3.9/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/home/vas744/.local/lib/python3.9/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
[6]:
adata
[6]:
AnnData object with n_obs × n_vars = 11357 × 33696
    var: 'gene_ids', 'feature_types', 'genome'
[8]:
adata.var_names_make_unique()

This dataset contains a single sample, but we will add a Sample column to adata.obs and assign a sample name to facilitate downstream analysis.

[9]:
adata.obs['Sample'] = 'MouseBrain3primeV4'

Next, we filter out cells with fewer than 200 detected genes.

[10]:
sc.pp.filter_cells(adata, min_genes=200)

We identify mitochondrial and ribosomal protein genes and compute their proportion in each cell’s total read count. A high proportion of these reads often indicates low-quality cells.

[11]:
adata.var['mt'] = adata.var_names.str.startswith('mt-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
[12]:
ribo_cells = adata.var_names.str.startswith('Rps','Rpl')
adata.obs['pct_counts_ribo'] = np.ravel(100*np.sum(adata[:, ribo_cells].X, axis = 1) / np.sum(adata.X, axis = 1))
[13]:
sc.pl.violin(adata,
             ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
             groupby='Sample',
             rotation=75,
             jitter=0.35,
             multi_panel=True,
             size = 0.75)
../_images/notebooks_rnaSeqPipelineTutorial_17_0.png

Doublet prediction#

Next, we compute the Scrublet score to identify and predict potential doublets.

[14]:
experiments=np.unique(adata.obs['Sample'])
adata.obs['scrublet_score']=np.repeat(0,adata.n_obs)
adata.obs['predicted_doublets']=np.repeat(False,adata.n_obs)
[15]:
import scrublet as scr

for experiment in experiments:
    print(experiment)
    adatai=adata[adata.obs['Sample']==experiment]

    scrub = scr.Scrublet(adatai.X.todense(),random_state=10)
    doublet_scores, predicted_doublets = scrub.scrub_doublets()

    adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets

    adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
MouseBrain3primeV4
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.30
Detected doublet rate = 5.8%
Estimated detectable doublet fraction = 62.5%
Overall doublet rate:
        Expected   = 10.0%
        Estimated  = 9.2%
Elapsed time: 70.9 seconds
/tmp/ipykernel_16745/2312138584.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
/tmp/ipykernel_16745/2312138584.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
[16]:
piaso.pl.plot_features_violin(adata,
                              ['scrublet_score'],
                              groupby='Sample',
                              width_single=3,
                              height_single=3)
../_images/notebooks_rnaSeqPipelineTutorial_21_0.png
[17]:
tmp=np.repeat(False, adata.n_obs)
tmp[adata.obs['predicted_doublets'].values==True]=True
adata.obs['predicted_doublets']=tmp
[18]:
print(f"# of cells with scrublet score >= 0.2: {np.sum(adata.obs['scrublet_score']>=0.2)} \n# of predicted doublets: {np.sum(adata.obs['predicted_doublets'])}")
# of cells with scrublet score >= 0.2: 874
# of predicted doublets: 657

Normalization#

[19]:
adata.layers['raw']=adata.X.copy()

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

adata.layers['log1p']=adata.X.copy()
normalizing counts per cell
    finished (0:00:00)

INFOG normalization#

We use PIASO’s infog to normalize the data and identify a highly variable set of genes.

[20]:
%%time
piaso.tl.infog(adata,
               layer='raw',
               n_top_genes=3000,)
The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 3.74 s, sys: 3.86 s, total: 7.6 s
Wall time: 7.59 s

SVD Dimensionality reduction and visualization#

[21]:
piaso.tl.runSVD(adata,
                use_highly_variable=True,
                n_components=50,
                random_state=10,
                key_added='X_svd',
                layer='infog')
[22]:
%%time
sc.pp.neighbors(adata,
                use_rep='X_svd',
                n_neighbors=15,
                random_state=10,
                knn=True,
                method="umap")

sc.tl.umap(adata)
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:36)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:00:14)
CPU times: user 1min 15s, sys: 267 ms, total: 1min 16s
Wall time: 51.2 s
[23]:
sc.pl.umap(adata,
           color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
           cmap='Spectral_r',
           palette=piaso.pl.color.d_color1,
           ncols=3,
           size=10,
           frameon=False)
../_images/notebooks_rnaSeqPipelineTutorial_31_0.png

Leiden clustering#

[24]:
%%time
sc.tl.leiden(adata,resolution=0.5,key_added='Leiden')
running Leiden clustering
    finished: found 32 clusters and added
    'Leiden', the cluster labels (adata.obs, categorical) (0:00:00)
CPU times: user 877 ms, sys: 23 ms, total: 900 ms
Wall time: 894 ms
[25]:
sc.pl.umap(adata,
           color=['Leiden'],
           palette=piaso.pl.color.d_color1,
           legend_fontsize=12,
           legend_fontoutline=2,
           legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)
WARNING: Length of palette colors is smaller than the number of categories (palette length: 19, categories length: 32. Some categories will have the same color.
../_images/notebooks_rnaSeqPipelineTutorial_34_1.png
[26]:
piaso.pl.plot_features_violin(adata,
                              ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
                              groupby='Leiden')
../_images/notebooks_rnaSeqPipelineTutorial_35_0.png

Identify marker genes with COSG#

[27]:
%%time
n_gene=30
cosg.cosg(adata,
          key_added='cosg',
          use_raw=False,
          layer='log1p',
          mu=100,
          expressed_pct=0.1,
          remove_lowly_expressed=True,
          n_genes_user=100,
          groupby='Leiden')
CPU times: user 3.01 s, sys: 825 ms, total: 3.83 s
Wall time: 3.83 s

We can use a dendrogram dot plot to visualize the expression of the top three marker genes of each Leiden cluster.

[28]:
sc.tl.dendrogram(adata,groupby='Leiden',use_rep='X_svd')
df_tmp=pd.DataFrame(adata.uns['cosg']['names'][:3,]).T
df_tmp=df_tmp.reindex(adata.uns['dendrogram_'+'Leiden']['categories_ordered'])
marker_genes_list={idx: list(row.values) for idx, row in df_tmp.iterrows()}
marker_genes_list = {k: v for k, v in marker_genes_list.items() if not any(isinstance(x, float) for x in v)}

sc.pl.dotplot(adata,
              marker_genes_list,
              groupby='Leiden',
              layer='log1p',
              dendrogram=True,
              swap_axes=False,
              standard_scale='var',
              cmap='Spectral_r')
Storing dendrogram info using `.uns['dendrogram_Leiden']`
../_images/notebooks_rnaSeqPipelineTutorial_39_1.png

Another way to visualize the expression of the top three marker genes from each Leiden cluster is by using COSG’s plotMarkerDendrogram method to create a circular dendrogram.

[29]:
cosg.plotMarkerDendrogram(
     adata,
     group_by="Leiden",
     use_rep="X_svd",
     calculate_dendrogram_on_cosg_scores=True,
     top_n_genes=3,
     radius_step=4.5,
     cmap="Purples",
     gene_label_offset=0.25,
     gene_label_color="black",
     linkage_method="ward",
     distance_metric="correlation",
     hierarchy_merge_scale=0,
     collapse_scale=0.5,
     add_cluster_node_for_single_node_cluster=True,
     palette=None,
     figure_size= (10, 10),
     colorbar_width=0.01,
     gene_color_min=0,
     gene_color_max=None,
     show_figure=True,
)
../_images/notebooks_rnaSeqPipelineTutorial_41_0.png

Marker genes of an individual cluster#

We can use dotplots and UMAPs to visualize the expression of the top marker genes in a selected cluster, which can be used to evaluate cluster quality.

[30]:
marker_gene=pd.DataFrame(adata.uns['cosg']['names'])
[31]:
cluster_check='8'
marker_gene[cluster_check].values
[31]:
array(['ENSMUSG00000120124', 'Nxph3', 'A630023P12Rik', 'Trpv6', 'Gm27040',
       'Chrna5', 'Nlrp6', 'Fezf2', '5330416C01Rik', 'Tbata',
       '4930551E15Rik', 'Hs3st4', 'Krt80', 'Abi3bp', 'Gm35161', 'Gm15942',
       'B4galnt3', 'Gm17171', 'Rxfp1', 'Htr1f', 'Cwh43', 'Tmem178',
       'Prss12', 'Gm49422', 'Prss35', 'Serinc2', 'Gm11762', 'Gm13335',
       'Gm36736', 'Gm31308', 'Rprm', 'Myl4', 'Hs3st2', 'Ipcef1', 'Hmga2',
       'Ccn4', 'Galnt9', 'Hcrtr2', 'Col5a1', 'Sdk2', 'Nptx1', 'Sel1l3',
       'Slc16a10', 'Gm13391', 'Sv2b', 'Ephb6', 'Rspo2', 'A830018L16Rik',
       'Scube2', 'Trbc2', 'Mirt1', 'Pcsk5', 'Adgra1', 'Gm15270',
       'Gm27234', 'Cpa6', 'Gm20878', 'Kcnmb4', 'Cdh18', 'Diras2',
       'Hcrtr1', 'Khdrbs3', 'Gm5468', 'Zmiz1os1', 'Etl4', 'Rgsl1',
       'Serpinb8', 'Gm17167', 'Gm9899', 'Garnl3', 'Gm12394', 'Vwc2l',
       'Ptpru', 'Slc17a7', 'Ano3', 'Zdhhc23', 'Tbr1', 'Grik3', 'Gm2824',
       'Pamr1', 'Gm42056', 'Ccl27a', 'Kcnmb4os2', 'Col24a1', 'Gm32679',
       'Sigmar1', 'Gm35853', 'Nrip3', 'Grp', 'Chgb', 'Gm47715', 'Ttc9b',
       '2900026A02Rik', 'Gramd2', 'Gm14120', 'Smim43', 'Grm8', 'Hcn1',
       'Rph3a', 'Gm5468-1'], dtype=object)

The dot plot below displays the expression of the top marker genes from cluster 8 across all clusters. For cluster 8 to be considered high quality, its top marker genes should be primarily specific to cluster 8, showing little to no expression in other clusters.

[32]:
sc.pl.dotplot(adata,
              marker_gene[cluster_check].values[:50],
              groupby='Leiden',
              dendrogram=False,
              swap_axes=True,
              standard_scale='var',
              cmap='Spectral_r')
../_images/notebooks_rnaSeqPipelineTutorial_46_0.png

We can visualize the expression of the top 12 genes from cluster 8 on UMAPs to assess whether their expression is specific to the cluster 8 region. This involves comparing the location of cluster 8 on the Leiden cluster UMAP with the gene expression UMAPs. In a high-quality cluster, marker gene expression should be predominantly localized to the cluster 8 area.

[33]:
sc.pl.umap(adata,
           color=marker_gene[cluster_check][:12],
           palette=piaso.pl.color.d_color1,
           cmap=piaso.pl.color.c_color1,
           layer='log1p',
           legend_fontsize=12,
           legend_fontoutline=2,
           legend_loc='on data',
           ncols=3,
           size=30,
           frameon=False)
../_images/notebooks_rnaSeqPipelineTutorial_48_0.png
[34]:
sc.pl.umap(adata,
           color=['Leiden'],
           palette=piaso.pl.color.d_color1,
           cmap=piaso.pl.color.c_color1,
           legend_fontsize=12,
           legend_fontoutline=2,
           legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)
WARNING: Length of palette colors is smaller than the number of categories (palette length: 19, categories length: 32. Some categories will have the same color.
../_images/notebooks_rnaSeqPipelineTutorial_49_1.png

This pipeline is still in development. The next steps include using a reference dataset to annotate cell types with PIASO’s predictCellTypesByGDR, followed by multiple iterations of low-quality cluster removal and re-clustering until well-defined clusters are obtained.