Running INFOG and GDR on one million cells

Load basic settings

[1]:

import numpy as np
import pandas as pd
import scanpy as sc
sc.set_figure_params(dpi=80,dpi_save=300, color_map='viridis',facecolor='white')
from matplotlib import rcParams
# To modify the default figure size, use rcParams.
rcParams['figure.figsize'] = 4, 4
rcParams['font.sans-serif'] = "Arial"
rcParams['font.family'] = "Arial"
sc.settings.verbosity = 3
sc.logging.print_header()

/tmp/ipykernel_884065/1353975569.py:11: RuntimeWarning: Failed to import dependencies for application/vnd.jupyter.widget-view+json representation. (ModuleNotFoundError: No module named 'ipywidgets')
  sc.logging.print_header()

[1]:

Component	Info
Python	3.10.18 (main, Jun 5 2025, 13:14:17) [GCC 11.2.0]
OS	Linux-5.14.0-611.11.1.el9_7.x86_64-x86_64-with-glibc2.34
CPU	96 logical CPU cores, x86_64
GPU	No GPU found
Updated	2025-12-16 04:14

Dependencies

Dependency	Version
cycler	0.12.1
matplotlib-inline	0.1.7
jedi	0.19.2
natsort	8.4.0
ipython	8.37.0
pillow	11.2.1
executing	2.2.0
leidenalg	0.10.2
llvmlite	0.44.0
pytz	2025.2
joblib	1.5.1
prompt_toolkit	3.0.51
decorator	5.2.1
stack-data	0.6.3
wcwidth	0.2.13
igraph	0.11.9
parso	0.8.4
debugpy	1.8.14
python-dateutil	2.9.0.post0
six	1.17.0
texttable	1.7.0
psutil	7.0.0
asttokens	3.0.0
Cython	3.1.4
h5py	3.14.0
numba	0.61.2
pure_eval	0.2.3
kiwisolver	1.4.8
setuptools	78.1.1
tornado	6.5.1

Setting paths

[4]:

save_dir='/n/scratch/users/m/mid166/Result/single-cell/Methods/DataProcessing/AsianImmuneDiversityAtlasPhase1v2'
sc.settings.figdir = save_dir
prefix='AsianImmuneDiversityAtlasPhase1v2'
import os
# Create the output directory if it does not exist yet.
os.makedirs(save_dir, exist_ok=True)

sc.set_figure_params(dpi=80,dpi_save=300, color_map='viridis',facecolor='white')
rcParams['figure.figsize'] = 4, 4

Load the data

The data are from the study "Asian diversity in human immune cells" (Kock, Kian Hong et al., Cell, Volume 188, Issue 8, 2288-2306.e24). The h5ad file can be downloaded from CELLxGENE as follows:

cd /n/scratch/users/m/mid166/Result/single-cell/Methods/DataProcessing/AsianImmuneDiversityAtlasPhase1v2
wget https://datasets.cellxgene.cziscience.com/9deda9ad-6a71-401e-b909-5263919d85f9.h5ad
mv 9deda9ad-6a71-401e-b909-5263919d85f9.h5ad AsianImmuneDiversityAtlasPhase1v2.h5ad

[5]:

# Read the file downloaded above; save_dir and prefix are defined in the "Setting paths" cell.
adata=sc.read(os.path.join(save_dir, prefix+'.h5ad'))

[6]:

adata

[6]:

AnnData object with n_obs × n_vars = 1265624 × 35477
    obs: 'reference_genome', 'gene_annotation_version', 'alignment_software', 'intronic_reads_counted', 'library_id', 'assay_ontology_term_id', 'sequenced_fragment', 'cell_number_loaded', 'institute', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'sample_id', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'sample_collection_method', 'donor_BMI_at_collection', 'tissue_type', 'suspension_derivation_process', 'suspension_enriched_cell_types', 'cell_viability_percentage', 'suspension_uuid', 'suspension_type', 'donor_id', 'self_reported_ethnicity_ontology_term_id', 'donor_living_at_sample_collection', 'disease_ontology_term_id', 'sex_ontology_term_id', 'nCount_RNA', 'nFeature_RNA', 'pMito', 'NODG', 'nUMI', 'Country', 'Annotation_Level1', 'Annotation_Level2', 'Annotation_Level3', 'Annotation_Level4', 'Smoking Status', 'cell_type', 'assay', 'disease', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
    uns: 'citation', 'default_embedding', 'organism', 'organism_ontology_term_id', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

Change adata.var_names to gene symbols

[10]:

adata.var.head()

[10]:

	feature_is_filtered	feature_name	feature_reference	feature_biotype	feature_length	feature_type
ENSG00000000003	False	TSPAN6	NCBITaxon:9606	gene	2396	protein_coding
ENSG00000000005	False	TNMD	NCBITaxon:9606	gene	873	protein_coding
ENSG00000000419	False	DPM1	NCBITaxon:9606	gene	1262	protein_coding
ENSG00000000457	False	SCYL3	NCBITaxon:9606	gene	2916	protein_coding
ENSG00000000460	False	FIRRM	NCBITaxon:9606	gene	2661	protein_coding

[12]:

adata.var['Ensemble_ID']=adata.var_names.copy()

[13]:

adata.var_names=adata.var['feature_name'].values.astype(str).copy()

[14]:

adata.var.head()

[14]:

	feature_is_filtered	feature_name	feature_reference	feature_biotype	feature_length	feature_type	Ensemble_ID
TSPAN6	False	TSPAN6	NCBITaxon:9606	gene	2396	protein_coding	ENSG00000000003
TNMD	False	TNMD	NCBITaxon:9606	gene	873	protein_coding	ENSG00000000005
DPM1	False	DPM1	NCBITaxon:9606	gene	1262	protein_coding	ENSG00000000419
SCYL3	False	SCYL3	NCBITaxon:9606	gene	2916	protein_coding	ENSG00000000457
FIRRM	False	FIRRM	NCBITaxon:9606	gene	2661	protein_coding	ENSG00000000460

[ ]:

adata.var.tail(30)

	feature_is_filtered	feature_name	feature_reference	feature_biotype	feature_length	feature_type	Ensemble_ID
ENSG00000288096	False	ENSG00000288096	NCBITaxon:9606	gene	2075	lncRNA	ENSG00000288096
ENSG00000288097	False	ENSG00000288097	NCBITaxon:9606	gene	2264	lncRNA	ENSG00000288097
ENSG00000288098	False	ENSG00000288098	NCBITaxon:9606	gene	653	lncRNA	ENSG00000288098
ENSG00000288099	False	ENSG00000288099	NCBITaxon:9606	gene	611	lncRNA	ENSG00000288099
ENSG00000288100	False	ENSG00000288100	NCBITaxon:9606	gene	1676	lncRNA	ENSG00000288100
ENSG00000288102	False	ENSG00000288102	NCBITaxon:9606	gene	872	lncRNA	ENSG00000288102
ENSG00000288103	False	ENSG00000288103	NCBITaxon:9606	gene	795	lncRNA	ENSG00000288103
ENSG00000288104	False	ENSG00000288104	NCBITaxon:9606	gene	2196	lncRNA	ENSG00000288104
ENSG00000288105	False	ENSG00000288105	NCBITaxon:9606	gene	1131	lncRNA	ENSG00000288105
ENSG00000288106	False	ENSG00000288106	NCBITaxon:9606	gene	1301	lncRNA	ENSG00000288106
ENSG00000288107	False	ENSG00000288107	NCBITaxon:9606	gene	2800	lncRNA	ENSG00000288107
ENSG00000288108	False	ENSG00000288108	NCBITaxon:9606	gene	509	lncRNA	ENSG00000288108
ENSG00000288109	False	ENSG00000288109	NCBITaxon:9606	gene	669	lncRNA	ENSG00000288109
ENSG00000288110	False	ENSG00000288110	NCBITaxon:9606	gene	2417	lncRNA	ENSG00000288110
ENSG00000288156	False	ENSG00000288156	NCBITaxon:9606	gene	2609	lncRNA	ENSG00000288156
ENSG00000288162	False	ENSG00000288162	NCBITaxon:9606	gene	1636	lncRNA	ENSG00000288162
ENSG00000288172	False	ENSG00000288172	NCBITaxon:9606	gene	1438	lncRNA	ENSG00000288172
ENSG00000288187	False	ENSG00000288187	NCBITaxon:9606	gene	1408	lncRNA	ENSG00000288187
ENSG00000288234	False	ENSG00000288234	NCBITaxon:9606	gene	812	lncRNA	ENSG00000288234
FAM106C	False	FAM106C	NCBITaxon:9606	gene	1623	lncRNA	ENSG00000288235
ENSG00000288245	False	ENSG00000288245	NCBITaxon:9606	gene	1317	lncRNA	ENSG00000288245
ENSG00000288252	False	ENSG00000288252	NCBITaxon:9606	gene	512	lncRNA	ENSG00000288252
ENSG00000288253	False	ENSG00000288253	NCBITaxon:9606	gene	615	lncRNA	ENSG00000288253
ENSG00000288300	False	ENSG00000288300	NCBITaxon:9606	gene	1622	lncRNA	ENSG00000288300
ENSG00000288302	False	ENSG00000288302	NCBITaxon:9606	gene	629	lncRNA	ENSG00000288302
ENSG00000288321	False	ENSG00000288321	NCBITaxon:9606	gene	478	lncRNA	ENSG00000288321
ENSG00000288330	False	ENSG00000288330	NCBITaxon:9606	gene	1601	TEC	ENSG00000288330
ENSG00000288398	False	ENSG00000288398	NCBITaxon:9606	gene	2662	lncRNA	ENSG00000288398
ENSG00000288459	False	ENSG00000288459	NCBITaxon:9606	gene	944	lncRNA	ENSG00000288459
SMIM42	False	SMIM42	NCBITaxon:9606	gene	1015	protein_coding	ENSG00000288460

[20]:

# Strip any '_ENSG...' suffix from the feature names, keeping only the part before it.
adata.var_names=[name.split('_ENSG')[0] for name in adata.var_names]

[21]:

adata.var_names_make_unique()
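
The two cells above first strip any `_ENSG…` suffix from the feature names and then make the resulting names unique. A minimal pure-Python sketch of that logic on a toy list is given below. Both `strip_ensembl_suffix` and `make_unique` are hypothetical helpers written for illustration, the suffixed toy names are made up, and the `-1`, `-2` numbering only approximates what `var_names_make_unique` does; it is not AnnData's implementation.

```python
from collections import Counter

def strip_ensembl_suffix(name: str) -> str:
    """Drop a trailing '_ENSG...' tag, if present (hypothetical helper)."""
    return name.split('_ENSG')[0]

def make_unique(names, join='-'):
    """Append '-1', '-2', ... to repeated names; the first occurrence is kept as-is.

    Simplified illustration only: it does not handle the corner case where a
    generated name collides with a name that already exists.
    """
    seen = Counter()
    unique = []
    for name in names:
        if seen[name] == 0:
            unique.append(name)
        else:
            unique.append(f"{name}{join}{seen[name]}")
        seen[name] += 1
    return unique

# Toy feature names; the two suffixed entries are invented for this example.
toy_names = ['TSPAN6', 'GENEX_ENSG00000000001', 'GENEX_ENSG00000000002', 'ENSG00000288096']
stripped = [strip_ensembl_suffix(n) for n in toy_names]
deduplicated = make_unique(stripped)
print(stripped)
print(deduplicated)
```

Names without a `_ENSG` suffix, including plain Ensembl IDs such as `ENSG00000288096`, pass through the split unchanged, which is why the tail of `adata.var` still shows Ensembl IDs as names.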

[22]:

adata.var.head()

[22]:

	feature_is_filtered	feature_name	feature_reference	feature_biotype	feature_length	feature_type	Ensemble_ID
TSPAN6	False	TSPAN6	NCBITaxon:9606	gene	2396	protein_coding	ENSG00000000003
TNMD	False	TNMD	NCBITaxon:9606	gene	873	protein_coding	ENSG00000000005
DPM1	False	DPM1	NCBITaxon:9606	gene	1262	protein_coding	ENSG00000000419
SCYL3	False	SCYL3	NCBITaxon:9606	gene	2916	protein_coding	ENSG00000000457
FIRRM	False	FIRRM	NCBITaxon:9606	gene	2661	protein_coding	ENSG00000000460

Set up the raw UMI counts layer

[15]:

adata.raw.X.data

[15]:

array([1., 1., 1., ..., 2., 1., 1.], shape=(2545339623,), dtype=float32)
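
To get a feel for the scale, the arithmetic below estimates the size of this values array from the shape and dtype printed above. It is a rough sketch that covers only the stored values; the sparse index arrays add to the footprint, and their exact size depends on the index dtype, which is not shown in the output.

```python
# Number of stored values and cells, taken from the outputs above.
nnz_raw = 2_545_339_623
n_cells = 1_265_624

# The array is float32, i.e. 4 bytes per stored value.
bytes_per_float32 = 4
values_gib = nnz_raw * bytes_per_float32 / 1024**3

# The number of stored values is larger than a signed 32-bit integer can hold,
# which suggests that 64-bit offsets are needed to address them.
int32_max = 2**31 - 1
exceeds_int32 = nnz_raw > int32_max

mean_nnz_per_cell = nnz_raw / n_cells

print(f"values array: {values_gib:.1f} GiB")
print(f"more stored values than int32 can count: {exceeds_int32}")
print(f"mean stored values per cell: {mean_nnz_per_cell:.0f}")
```

Copying this matrix into `adata.layers['raw']`, as the next cell does, temporarily holds two copies in memory until `adata.raw` is deleted.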

[16]:

adata.layers['raw']=adata.raw.X.copy()

[17]:

del adata.raw

[18]:

adata.X.data

[18]:

array([1.21232039, 1.21232039, 1.21232039, ..., 2.99794589, 1.82356588,
       3.96914141], shape=(2544339517,))
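
The previews above suggest that `adata.raw.X` held whole-number UMI counts while `adata.X` holds normalized values. A small sanity check of that distinction is sketched below on the printed preview values only; `looks_like_counts` is a hypothetical helper, and on the full matrix a vectorized check over the `.data` array would be more practical.

```python
def looks_like_counts(values) -> bool:
    """Return True if every value is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for v in values)

# The first and last few values printed above for each matrix.
raw_preview = [1.0, 1.0, 1.0, 2.0, 1.0, 1.0]
x_preview = [1.21232039, 1.21232039, 1.21232039, 2.99794589, 1.82356588, 3.96914141]

raw_is_counts = looks_like_counts(raw_preview)
x_is_counts = looks_like_counts(x_preview)
print(raw_is_counts)
print(x_is_counts)
```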

[23]:

adata

[23]:

AnnData object with n_obs × n_vars = 1265624 × 35477
    obs: 'reference_genome', 'gene_annotation_version', 'alignment_software', 'intronic_reads_counted', 'library_id', 'assay_ontology_term_id', 'sequenced_fragment', 'cell_number_loaded', 'institute', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'sample_id', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'sample_collection_method', 'donor_BMI_at_collection', 'tissue_type', 'suspension_derivation_process', 'suspension_enriched_cell_types', 'cell_viability_percentage', 'suspension_uuid', 'suspension_type', 'donor_id', 'self_reported_ethnicity_ontology_term_id', 'donor_living_at_sample_collection', 'disease_ontology_term_id', 'sex_ontology_term_id', 'nCount_RNA', 'nFeature_RNA', 'pMito', 'NODG', 'nUMI', 'Country', 'Annotation_Level1', 'Annotation_Level2', 'Annotation_Level3', 'Annotation_Level4', 'Smoking Status', 'cell_type', 'assay', 'disease', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type', 'Ensemble_ID'
    uns: 'citation', 'default_embedding', 'organism', 'organism_ontology_term_id', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'
    layers: 'raw'

[25]:

adata.obs.head().T

[25]:

index	AAACCTGAGAACAATC-1-IN_NIB_B001_L001	AAACCTGAGAAGCCCA-1-IN_NIB_B001_L001	AAACCTGAGCAAATCA-1-IN_NIB_B001_L001	AAACCTGAGCTAGCCC-1-IN_NIB_B001_L001	AAACCTGAGTGTACCT-1-IN_NIB_B001_L001
reference_genome	GRCh38	GRCh38	GRCh38	GRCh38	GRCh38
gene_annotation_version	v98	v98	v98	v98	v98
alignment_software	Cell Ranger count v7.0.1	Cell Ranger count v7.0.1	Cell Ranger count v7.0.1	Cell Ranger count v7.0.1	Cell Ranger count v7.0.1
intronic_reads_counted	yes	yes	yes	yes	yes
library_id	c18f20cd-6317-4059-bc5a-5341fe134124	c18f20cd-6317-4059-bc5a-5341fe134124	c18f20cd-6317-4059-bc5a-5341fe134124	c18f20cd-6317-4059-bc5a-5341fe134124	c18f20cd-6317-4059-bc5a-5341fe134124
assay_ontology_term_id	EFO:0009900	EFO:0009900	EFO:0009900	EFO:0009900	EFO:0009900
sequenced_fragment	5 prime tag	5 prime tag	5 prime tag	5 prime tag	5 prime tag
cell_number_loaded	40000 cells	40000 cells	40000 cells	40000 cells	40000 cells
institute	National Institute of Biomedical Genetics	National Institute of Biomedical Genetics	National Institute of Biomedical Genetics	National Institute of Biomedical Genetics	National Institute of Biomedical Genetics
is_primary_data	True	True	True	True	True
cell_type_ontology_term_id	CL:0000084	CL:0000084	CL:0000084	CL:0000763	CL:0000084
author_cell_type	T_unknown	T_unknown	T_unknown	Myeloid_unknown	T_unknown
sample_id	737b4d87-a88d-4425-8c76-41ec721b42ca	69809569-be81-49f0-bc1f-0904e410de0d	5bf5934a-d137-476a-a0b0-d317fa774291	356a6078-60cb-40c5-88e8-e0632ba8ea92	796fdcb9-0d62-499a-9132-70a51f39d465
sample_preservation_method	other	other	other	other	other
tissue_ontology_term_id	UBERON:0000178	UBERON:0000178	UBERON:0000178	UBERON:0000178	UBERON:0000178
development_stage_ontology_term_id	HsapDv:0000116	HsapDv:0000125	HsapDv:0000119	HsapDv:0000116	HsapDv:0000118
sample_collection_method	blood draw	blood draw	blood draw	blood draw	blood draw
donor_BMI_at_collection	26.1	33.2	22.2	23.7	22.9
tissue_type	tissue	tissue	tissue	tissue	tissue
suspension_derivation_process	density gradient centrifugation	density gradient centrifugation	density gradient centrifugation	density gradient centrifugation	density gradient centrifugation
suspension_enriched_cell_types	peripheral blood mononuclear cell	peripheral blood mononuclear cell	peripheral blood mononuclear cell	peripheral blood mononuclear cell	peripheral blood mononuclear cell
cell_viability_percentage	98.0	97.5	99.0	95.3	97.8
suspension_uuid	9399b949-af6a-4766-8d8a-75022bfdbbd4	f54130cc-3610-444a-99e9-186271a937cc	4e83d0bf-e4a9-4f94-8ab6-91cde22c71a2	52e71bbf-9ed8-4ab6-9537-d11eb1c1e77f	9c4f48d5-b74e-400b-9945-b0d5187a0cad
suspension_type	cell	cell	cell	cell	cell
donor_id	IN_NIB_H031	IN_NIB_H028	IN_NIB_H019	IN_NIB_H033	IN_NIB_H026
self_reported_ethnicity_ontology_term_id	HANCESTRO:0487	HANCESTRO:0487	HANCESTRO:0487	HANCESTRO:0487	HANCESTRO:0487
donor_living_at_sample_collection	True	True	True	True	True
disease_ontology_term_id	PATO:0000461	PATO:0000461	PATO:0000461	PATO:0000461	PATO:0000461
sex_ontology_term_id	PATO:0000384	PATO:0000384	PATO:0000383	PATO:0000384	PATO:0000383
nCount_RNA	4235.0	3153.0	3642.0	2229.0	3125.0
nFeature_RNA	1571	1596	1454	1314	1368
pMito	0.020543	0.021884	0.032125	0.026918	0.01376
NODG	1571	1596	1454	1314	1368
nUMI	4235	3153	3642	2229	3125
Country	IN	IN	IN	IN	IN
Annotation_Level1	T	T	T	Myeloid	T
Annotation_Level2	T	T	T	Myeloid	T
Annotation_Level3	T	T	T	Myeloid	T
Annotation_Level4	T_unknown	T_unknown	T_unknown	Myeloid_unknown	T_unknown
Smoking Status	1	0	0	0	0
cell_type	T cell	T cell	T cell	myeloid cell	T cell
assay	10x 5' v2	10x 5' v2	10x 5' v2	10x 5' v2	10x 5' v2
disease	normal	normal	normal	normal	normal
sex	male	male	female	male	female
tissue	blood	blood	blood	blood	blood
self_reported_ethnicity	Indian	Indian	Indian	Indian	Indian
development_stage	22-year-old stage	31-year-old stage	25-year-old stage	22-year-old stage	24-year-old stage
observation_joinid	L#&-yLt?V)	TiEL`2U3~o	534@tl7g>n	ZR3T+u}`$6	k}Z5EqiqDM

Import PIASO

[27]:

import piaso

/n/data1/hms/neurobio/fishell/mindai/.conda/envs/scda5/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

[28]:

sc.pl.umap(adata,
           color=['cell_type'],
           palette=piaso.pl.color.d_color4,
           cmap=piaso.pl.color.c_color4,
           # size=10,
           ncols=1,
           frameon=False)

../_images/notebooks_1millionCells_AsianImmuneDiversityAtlasPhase1v2_28_0.png

[29]:

sc.pl.umap(adata,
           color=['Annotation_Level4'],
           palette=piaso.pl.color.d_color4,
           cmap=piaso.pl.color.c_color4,
           # size=10,
           ncols=1,
           frameon=False)

../_images/notebooks_1millionCells_AsianImmuneDiversityAtlasPhase1v2_29_0.png

[30]:

sc.pl.umap(adata,
           color=['Annotation_Level3'],
           palette=piaso.pl.color.d_color4,
           cmap=piaso.pl.color.c_color4,
           # size=10,
           ncols=1,
           frameon=False)

../_images/notebooks_1millionCells_AsianImmuneDiversityAtlasPhase1v2_30_0.png

Run INFOG

[36]:

%%time
piaso.tl.infog(adata,
               layer='raw',
               n_top_genes=3000,)

The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 3min 37s, sys: 2min 21s, total: 5min 58s
Wall time: 5min 59s

[37]:

piaso.pp.table(adata.obs['cell_type'])

[37]:

{'T cell': 54800,
 'myeloid cell': 26020,
 'natural killer cell': 23234,
 'B cell': 9144,
 'platelet': 14799,
 'plasma cell': 1672,
 'naive B cell': 36778,
 'memory B cell': 35114,
 'mature B cell': 4850,
 'hematopoietic stem cell': 135,
 'CD14-positive monocyte': 202124,
 'CD14-low, CD16-positive monocyte': 46628,
 'CD1c-positive myeloid dendritic cell': 13207,
 'CD141-positive myeloid dendritic cell': 554,
 'pre-conventional dendritic cell': 894,
 'plasmacytoid dendritic cell': 3796,
 'CD16-positive, CD56-dim natural killer cell, human': 151655,
 'CD8-positive, alpha-beta cytotoxic T cell': 46528,
 'CD8-positive, alpha-beta T cell': 53658,
 'CD16-negative, CD56-bright natural killer cell, human': 6152,
 'CD4-positive, alpha-beta cytotoxic T cell': 13594,
 'gamma-delta T cell': 30695,
 'CD8-positive, alpha-beta memory T cell': 28138,
 'CD4-positive, alpha-beta T cell': 40100,
 'innate lymphoid cell': 707,
 'naive thymus-derived CD8-positive, alpha-beta T cell': 89029,
 'mucosal invariant T cell': 23001,
 'naive thymus-derived CD4-positive, alpha-beta T cell': 157616,
 'central memory CD4-positive, alpha-beta T cell': 110004,
 'effector memory CD4-positive, alpha-beta T cell': 25012,
 'regulatory T cell': 15497,
 'double negative T regulatory cell': 489}

[38]:

piaso.pp.table(adata.obs['tissue'])

[38]:

{'blood': 1265624}

[39]:

piaso.pp.table(adata.obs['disease'])

[39]:

{'normal': 1265624}

[40]:

piaso.pp.table(adata.obs['assay'])

[40]:

{"10x 5' v2": 1265624}
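
Judging from the outputs above, `piaso.pp.table` returns a dictionary that maps each category to its number of cells. A rough standard-library equivalent, assuming only that behavior, is sketched below on a toy list of labels; `count_table` is a hypothetical name and is not part of PIASO.

```python
from collections import Counter

def count_table(labels):
    """Map each distinct label to its number of occurrences, in first-seen order."""
    return dict(Counter(labels))

# Toy labels standing in for a column such as adata.obs['cell_type'].
toy_labels = ['T cell', 'T cell', 'myeloid cell', 'T cell', 'B cell']
counts = count_table(toy_labels)
total = sum(counts.values())

print(counts)
print(total == len(toy_labels))
```

A single key whose count equals the number of cells, as for `tissue`, `disease` and `assay` above, means the column is constant across the whole dataset.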

[41]:

piaso.pp.table(adata.obs['library_id'])

[41]:

{'c18f20cd-6317-4059-bc5a-5341fe134124': 8341,
 '8c929df1-d96a-437e-94cf-795ba97ba226': 11777,
 'a539496c-999e-4935-865e-2c1c7506bbc9': 13902,
 '91f42a59-eeb0-4577-a6d0-52f17b4ea3b3': 12056,
 '25e6ef7a-298c-42d5-9be8-b09932e1fd9d': 12235,
 '10a42edf-8fd3-4e25-900c-5a8e4890bcc8': 16351,
 '3caa9824-a0cb-4d3c-a642-b2f374f690b5': 12486,
 'a6c4ebc3-beac-4ebe-8e27-85afbe0d55b2': 18336,
 '5205817f-44c5-468b-9d41-043ecacb7dbb': 17684,
 '163573e7-c093-4a02-8ce4-b38e61d4d78e': 15374,
 '1833343d-c74b-409a-a2ad-15bdfaddf876': 16129,
 'db0a8252-6f37-47d4-98d6-806b181c1520': 17231,
 '745201e8-508b-4484-b179-6a3dc332895a': 16823,
 '8eb9096b-f4be-4fbf-91a3-402621f6d7d2': 17066,
 '688adb38-2ca3-4efa-8183-a1130f6c7801': 18609,
 '354ba71c-599c-4cd0-8f31-07d7b5676ded': 16138,
 'fef1f5a0-b108-435e-a754-3cc995daaa1f': 15994,
 '7b848d92-f5d1-4b53-a681-889c581d5ec9': 17178,
 'd43d2448-bf59-4267-a458-1ceb83096c67': 17712,
 'bb48421f-5cb1-4f12-b725-dcfa41bc700b': 10244,
 '7da3c8a8-dfcf-4d98-bcdb-63281d8903f9': 12889,
 '1d7edd49-e8ab-494d-8d33-f452f138e945': 16113,
 '0ebbd2ac-4260-4342-af3b-ff24012e8681': 17663,
 '091b1af8-3dc5-494f-a485-4b647c9f2bfa': 17975,
 '57aa8942-f88d-4b83-9888-f6f5917a84a9': 17443,
 'acc5bec2-dd3f-44c3-bc54-6746e0295f2f': 18117,
 'fab2de88-119e-40ea-b5c1-e7b616c8483d': 16994,
 'f75482e2-85c9-414b-a2d3-fd6505fc3e0b': 18836,
 '5491dd9a-e8f1-4788-8495-3deaf7b3fa4d': 18710,
 '95b09160-b9c3-4035-8263-205a4a11e118': 19176,
 '537d845f-42e1-4498-b827-737cf9fd3df6': 17666,
 'dc6ba966-a025-4222-a671-ffb5519a0a24': 17200,
 '5b6184ee-60b3-4f4e-ad21-c8424f7a4a8b': 19120,
 '0b8994d6-7ead-4d27-913b-69cab52df357': 18837,
 '0b1175aa-350d-4e0a-8535-d667980cb949': 19320,
 'f2b4f75b-4de2-4894-bdaa-3cbae7efc951': 17506,
 '60d897bb-bd81-4794-a1bd-f89200faa1c1': 17851,
 '25561afb-5a09-4eb7-aebb-be2dc52258c1': 16646,
 '62e4c041-4c8d-4b07-808d-e888b97ceab0': 16768,
 'c2b6c7fa-e925-4715-a0b8-cd02bd1d2469': 18281,
 '6a2dc1c5-9a2c-4490-b720-b39c49c26d5b': 18051,
 'ba7dd120-1236-43ec-ae73-1961577a71f0': 17757,
 'ba4b2831-5211-4c45-a814-467012f6c356': 15092,
 '2afec21c-9b5c-43e6-9bb2-cfb41181758c': 13693,
 'f7b36e3c-39e1-4aac-88ce-12f31ef4bf33': 15753,
 '925ed908-cb3d-4eeb-8db9-cf8967de971d': 19209,
 'c8ee9b8a-793e-4712-bc91-0c9cd291a698': 16323,
 '49dc02c4-bd12-4148-a760-d5631154a019': 12727,
 '735e1302-a6a2-49f2-9c7e-117a8706aab0': 12765,
 'e1c3cd5d-70b7-4eb1-8028-218da445dc3e': 8608,
 'ea8c3957-1abc-43b2-b412-93c104e8acbb': 9506,
 'daf47ff8-958c-4aa1-b50e-acb15f88279c': 10040,
 'c818b1d0-12f3-4768-b395-daaf8f5fd70b': 10822,
 '40c0cad6-88ca-4e19-9ead-6f48dfca8823': 9890,
 '902e88a3-8007-4ea4-9520-299190672fdf': 9783,
 'b9119f90-58cd-42ae-83f6-2f7f82827972': 9667,
 '215f5ef8-70d9-4efd-8564-60b05a1a1a52': 8531,
 '3eba0346-ae6a-4a8b-9a33-d637004f4352': 11012,
 'e689732d-7e84-4467-ab20-b85678d81dd9': 8919,
 'bc3f63cf-404c-4f5c-8e9b-15f00d830d84': 8890,
 '155f18d1-1754-4353-ba9d-7107adc68223': 8191,
 '2361585a-ad3d-4236-af8e-e3188a71fa26': 8496,
 '87bba90e-4389-405c-adec-ca27e2c94219': 8947,
 '74224522-53fc-4a99-aff4-091f642f9c90': 7275,
 'ad6e9dda-40b1-4ae8-9b29-2a64ddb8ab9f': 8913,
 '5397ecce-4f99-4f7c-b833-000c8d57cdb1': 6823,
 'c0eb5cbc-7a22-4e4a-bc48-f15219ea2d0e': 6030,
 '91f8d6ee-8ed5-40c8-9597-f0d87a9f9922': 6991,
 'df470eeb-1e73-482a-a850-aa90c736f4cd': 7681,
 'ff9b156b-5ca7-4c1f-9a2d-32cc7d729713': 8369,
 '53539254-c074-4fc6-b902-ce2a795057ef': 9168,
 'bd5eaeae-ee8b-44c3-93d6-e639160df738': 9549,
 '9c9d26f4-487c-46b2-98c2-51c17f1972e5': 9416,
 'e6e2a3ba-84af-43b7-b913-0ad174393cae': 8900,
 '89156b35-22e6-42db-80ed-72ed15b9cffc': 7865,
 '38f769ce-583e-4e9b-9434-2042d3649f6e': 8805,
 'f22affcb-67e9-458e-81e8-4f7c6fc2bcf1': 8144,
 'de842ea6-cb26-47c3-8957-5e58c178c2a1': 7763,
 '57a6d9c9-ec4a-4591-b341-b50a4f99f8a3': 9188,
 '9cded2d7-90ec-4dcb-bb64-6e46ff06323a': 13011,
 'ed813075-c243-46c8-bcc9-bd2f855dd332': 9522,
 '82dcae80-03e7-4bbb-befc-67ac9b3c4aef': 8991,
 '39e88fce-97eb-4e7f-977e-68d23176c372': 9954,
 '5cfc2c07-b9c5-42bf-8307-cfb51e677c3e': 16719,
 '78bf1fed-bf59-4a2e-a791-f7bb5737ad56': 13120,
 'dbb06e58-a137-4a92-bbaa-053a936de27c': 17390,
 'bc30a108-b78f-4157-93e2-90b661a3d639': 20564,
 '2cd6ef50-de2b-450e-8c43-db151315302b': 15427,
 '3dfdc900-2c20-45fe-80a7-8d821f362ee5': 17156,
 '72c440dc-7d5c-494b-8240-6bd4fd260ea0': 16492,
 'ac74f903-c5cb-42cc-a2c4-6241904081c3': 16071,
 '3961d088-d775-4784-ab8e-e892d1f70166': 16566,
 '9f3fb493-13bd-4e80-9cd2-83948a26e01d': 16312}

[42]:

adata.X.data

[42]:

array([1.21232039, 1.21232039, 1.21232039, ..., 2.99794589, 1.82356588,
       3.96914141], shape=(2544339517,))

Run GDR

[43]:

%%time
piaso.tl.runGDRParallel(adata,
                        batch_key='library_id',
                        groupby=None,
                        n_gene=20,
                        mu=10,
                        resolution=3.0,
                        layer='infog',
                        infog_layer='raw',
                        score_layer='infog',
                        scoring_method='piaso',
                        use_highly_variable=True,
                        n_highly_variable_genes=5000,
                        n_svd_dims=50,
                        key_added='X_gdr',
                        max_workers=32,
                        calculate_score_multiBatch = False,
                        verbosity=0)

Calculating marker genes: 100%|██████████| 93/93 [11:38<00:00,  7.51s/batch]
Calculating cell embeddings: 100%|██████████| 93/93 [21:13:01<00:00, 821.31s/batch]

The cell embeddings calculated by GDR were saved as `X_gdr` in adata.obsm.
CPU times: user 14min 15s, sys: 5min 11s, total: 19min 27s
Wall time: 21h 26min 6s

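GDR took about 21.4 hours of wall time (21 h 26 min) to run on ~1.27 million cells from 93 libraries with `max_workers=32`. Nearly all of that time was spent calculating the cell embeddings; the marker gene step took under 12 minutes.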
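
The timing figures can be checked with a little arithmetic on the durations printed in the output above. The sketch below uses only those reported numbers; it does not measure anything itself.

```python
from datetime import timedelta

# Durations and batch count reported in the output above.
wall_time = timedelta(hours=21, minutes=26, seconds=6)
marker_stage = timedelta(minutes=11, seconds=38)
embedding_stage = timedelta(hours=21, minutes=13, seconds=1)
n_batches = 93

wall_hours = wall_time.total_seconds() / 3600
# Dividing two timedeltas gives a plain float ratio.
embedding_share = embedding_stage / wall_time
accounted_share = (marker_stage + embedding_stage) / wall_time
seconds_per_batch = embedding_stage.total_seconds() / n_batches

print(f"wall time: {wall_hours:.2f} h")
print(f"embedding stage share of wall time: {embedding_share:.1%}")
print(f"both progress bars together: {accounted_share:.1%}")
print(f"embedding time per batch: {seconds_per_batch:.1f} s")
```

The per-batch figure agrees with the rate shown by the progress bar, and the much smaller CPU time compared with wall time is consistent with the work being done in worker processes rather than in the notebook kernel itself, although the output alone does not confirm this.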

[52]:

adata

[52]:

AnnData object with n_obs × n_vars = 1265624 × 35477
    obs: 'reference_genome', 'gene_annotation_version', 'alignment_software', 'intronic_reads_counted', 'library_id', 'assay_ontology_term_id', 'sequenced_fragment', 'cell_number_loaded', 'institute', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'sample_id', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'sample_collection_method', 'donor_BMI_at_collection', 'tissue_type', 'suspension_derivation_process', 'suspension_enriched_cell_types', 'cell_viability_percentage', 'suspension_uuid', 'suspension_type', 'donor_id', 'self_reported_ethnicity_ontology_term_id', 'donor_living_at_sample_collection', 'disease_ontology_term_id', 'sex_ontology_term_id', 'nCount_RNA', 'nFeature_RNA', 'pMito', 'NODG', 'nUMI', 'Country', 'Annotation_Level1', 'Annotation_Level2', 'Annotation_Level3', 'Annotation_Level4', 'Smoking Status', 'cell_type', 'assay', 'disease', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type', 'Ensemble_ID', 'infog_var', 'highly_variable'
    uns: 'Annotation_Level3_colors', 'Annotation_Level4_colors', 'cell_type_colors', 'citation', 'default_embedding', 'organism', 'organism_ontology_term_id', 'schema_reference', 'schema_version', 'title', 'gdr', 'neighbors', 'umap', 'assay_colors'
    obsm: 'X_umap', 'X_gdr', 'X_umap_author'
    layers: 'raw', 'infog'
    obsp: 'distances', 'connectivities'

[ ]:

adata.obsm['X_umap_author']=adata.obsm['X_umap'].copy()

[ ]:

%%time
sc.pp.neighbors(adata,
                use_rep='X_gdr',
                n_neighbors=15,
                random_state=10,
                knn=True,
                method="umap")
sc.tl.umap(adata)

computing neighbors
    finished: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:35:54)
CPU times: user 2h 22min, sys: 25min 7s, total: 2h 47min 8s
Wall time: 2h 31min 48s

[48]:

sc.pl.umap(adata,
           color=['assay'],
           palette=piaso.pl.color.d_color4,
           cmap=piaso.pl.color.c_color4,
           # size=10,
           ncols=1,
           frameon=False)

../_images/notebooks_1millionCells_AsianImmuneDiversityAtlasPhase1v2_45_0.png

The UMAP from GDR:

[50]:

sc.pl.umap(adata,
           color=['cell_type'],
           palette=piaso.pl.color.d_color4,
           cmap=piaso.pl.color.c_color4,
           # size=10,
           ncols=1,
           frameon=False)

../_images/notebooks_1millionCells_AsianImmuneDiversityAtlasPhase1v2_47_0.png

For comparison, the UMAP from the publication:

[61]:

sc.pl.embedding(adata,
           basis='X_umap_author',
           color=['cell_type'],
           palette=piaso.pl.color.d_color4,
           cmap=piaso.pl.color.c_color4,
           # size=10,
           ncols=1,
           frameon=False)

../_images/notebooks_1millionCells_AsianImmuneDiversityAtlasPhase1v2_49_0.png

[53]:

adata.obsm['X_umap_gdr']=adata.obsm['X_umap'].copy()

[54]:

adata

[54]:

AnnData object with n_obs × n_vars = 1265624 × 35477
    obs: 'reference_genome', 'gene_annotation_version', 'alignment_software', 'intronic_reads_counted', 'library_id', 'assay_ontology_term_id', 'sequenced_fragment', 'cell_number_loaded', 'institute', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'sample_id', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'sample_collection_method', 'donor_BMI_at_collection', 'tissue_type', 'suspension_derivation_process', 'suspension_enriched_cell_types', 'cell_viability_percentage', 'suspension_uuid', 'suspension_type', 'donor_id', 'self_reported_ethnicity_ontology_term_id', 'donor_living_at_sample_collection', 'disease_ontology_term_id', 'sex_ontology_term_id', 'nCount_RNA', 'nFeature_RNA', 'pMito', 'NODG', 'nUMI', 'Country', 'Annotation_Level1', 'Annotation_Level2', 'Annotation_Level3', 'Annotation_Level4', 'Smoking Status', 'cell_type', 'assay', 'disease', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type', 'Ensemble_ID', 'infog_var', 'highly_variable'
    uns: 'Annotation_Level3_colors', 'Annotation_Level4_colors', 'cell_type_colors', 'citation', 'default_embedding', 'organism', 'organism_ontology_term_id', 'schema_reference', 'schema_version', 'title', 'gdr', 'neighbors', 'umap', 'assay_colors'
    obsm: 'X_umap', 'X_gdr', 'X_umap_author', 'X_umap_gdr'
    layers: 'raw', 'infog'
    obsp: 'distances', 'connectivities'

Save the data

[55]:

adata.write(save_dir+'/'+prefix+'_gdr.h5ad')

[56]:

save_dir+'/'+prefix+'_gdr.h5ad'

[56]:

'/n/scratch/users/m/mid166/Result/single-cell/Methods/DataProcessing/AsianImmuneDiversityAtlasPhase1v2/AsianImmuneDiversityAtlasPhase1v2_gdr.h5ad'

The h5ad file with GDR embeddings can be downloaded from https://drive.google.com/file/d/1eXbUzpZxzbKHhEbhoiCM2MUgpuYCfgtE/view.
