piaso module#
- piaso.addNonOverlappingPeaks(base_peak_file: str, additional_peak_file: str, output_peak_file: str, bedtools_path: str | None = None) None #
Adds peaks from the ‘additional_peak_file’ that don’t overlap with peaks in ‘base_peak_file’ and then merges the result to ensure a sorted non-overlapping peak list while retaining original peak names.
- Parameters:
base_peak_file – Path to the main peak file.
additional_peak_file – Path to the additional peak file.
output_peak_file – Path where the merged and sorted peak list will be saved.
bedtools_path – Optional path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.
- piaso.buildADJ(adata, use_rep: str = 'X_svd', n_nearest_neighbors: int = 10, include_self: bool = True)#
- piaso.calculateAdjacentPeakSimilarityChunkwindow(peaks1: List[str], mat1: csr_matrix, peaks2: List[str], mat2: csr_matrix, window_size: int = 100000, chunk_size: int = 100, return_df: bool = False)#
Calculate cosine similarities between each peak in mat1 and its adjacent peaks in mat2 within a specified window size. The peaks need to be pre-sorted.
- Parameters:
peaks1 – List of sorted peak strings associated with mat1.
mat1 – A csr_matrix where each row corresponds to a peak in the peaks1 list.
peaks2 – List of sorted peak strings associated with mat2.
mat2 – A csr_matrix where each row corresponds to a peak in the peaks2 list.
window_size – The genomic distance within which to calculate similarities.
chunk_size – The number of rows to be processed in each chunk.
return_df – Whether to return the result as a DataFrame. If False, returns a csr_matrix.
- Returns:
A csr_matrix or DataFrame representing the cosine similarity between peaks in mat1 and mat2.
- piaso.calculateCellMetrics(adata: AnnData, layer: str | None = None)#
Calculate the number of fragments overlapping peaks and the number of peaks for each cell and store them in adata.obs.
Parameters: adata (AnnData): An AnnData object with cells in rows and peaks in columns. layer (str, optional): The layer of AnnData to use. If None, the default .X layer is used.
- piaso.calculatePeakMetrics(adata: AnnData, layer: str | None = None)#
Calculate the number of cells in which each peak is active (non-zero counts) for an AnnData object, and then add this information to the adata.var[‘n_cells’] attribute.
- Parameters:
adata – AnnData object containing cell x peak matrix.
layer – Specify which layer of the AnnData object to use. If None, use the main matrix (.X attribute).
- piaso.check_software_availability()#
Checks the availability of the required software in the system’s PATH. Raises an EnvironmentError if any software is not available.
- piaso.chr_split_func(line: str)#
- piaso.convertBamToFragment(bam_path, output_gz_path, bgzip_path=None, tabix_path=None, bgzip_n_cores=20, buffer_size=1000000, barcode_tag: str = 'CB')#
Convert a BAM file to a fragment count file with cell barcodes and then create a tabix index.
Parameters: - bam_path (str): Path to the input BAM file. - output_gz_path (str): Path to the output bgzipped fragment count file. - bgzip_path (str, optional): Path to the bgzip executable. Defaults to None (system path). - tabix_path (str, optional): Path to the tabix executable. Defaults to None (system path). - bgzip_n_cores (int, optional): Number of cores for bgzip compression. Defaults to 20. - buffer_size (int, optional): Number of lines to buffer before writing to output. Helps control memory usage. Defaults to 1 million. - barcode_tag (str): Tag for cell barcodes in bam file. Default is ‘CB’.
Returns: - None. Writes results to output_gz_path and creates a .tbi index.
- piaso.createCOSGDataFrameLong(adata, key='cosg')#
Converts marker genes and their corresponding scores stored in adata.uns into a long-format DataFrame.
Parameters: - adata (anndata.AnnData): An annotated data matrix. - key (str): The key used to access the names and scores in adata.uns dictionary.
Returns: - cosg_long (pd.DataFrame): A long-format DataFrame with columns ‘Group’, ‘Name’, ‘Score’, and ‘Rank’.
- piaso.filterFragmentByBarcode(input_fragment_file: str, output_fragment_file: str, filtered_barcodes: ndarray, bgzip_n_cores: int = 20, bgzip_dir=None, tabix_dir=None, chunk_size: int = 10000000)#
Filters the input fragment file and keeps only those rows where the barcode is in the provided list of filtered barcodes.
- Parameters:
input_fragment_file – str, Path to the input gz compressed fragment file.
output_fragment_file – str, Path to the output file where the filtered fragments will be written.
filtered_barcodes – np.ndarray, Numpy array containing the list of filtered cell barcodes.
bgzip_n_cores – int, Optional; Number of cores to be used by bgzip. Default is 20.
chunk_size – int, Optional; Size of the chunk to be written at once to bgzip.stdin during the bgzipping process. Default is 10_000_000.
bgzip_dir – Optional; Directory path containing the bgzip executable. Default is None (uses system PATH).
tabix_dir – Optional; Directory path containing the tabix executable. Default is None (uses system PATH).
- piaso.filterFragmentByRegion(fragment_file: str, sorted_bed: str, output_fragment_file: str, bedtools_path: str | None = None, bgzip_path: str | None = None, tabix_path: str | None = None, bgzip_n_cores: int = 20) None #
Filter fragments using a sorted BED file and then compress and index the results.
- Parameters:
fragment_file – Path to the input fragment file.
sorted_bed – Path to the sorted BED file.
output_fragment_file – Path to the output compressed fragment file.
bedtools_path – Optional path to the directory containing the bedtools executable.
bgzip_path – Optional path to the directory containing the bgzip executable.
tabix_path – Optional path to the directory containing the tabix executable.
bgzip_n_cores – Number of cores for bgzip. Defaults to 20.
- piaso.filterFragments(input_fragment_file: str, output_fragment_file: str, fragment_count_file: str, min_fragments: int = 1000, bgzip_n_cores: int = 20, chunk_size: int = 10000000)#
Filters the fragments based on the minimum number of fragments in each cell.
- Parameters:
input_fragment_file – str, Path to the input gz compressed fragment file.
output_fragment_file – str, Path to the output file where the filtered fragments will be written.
fragment_count_file – str, Path to the csv file where the number of unique fragments per cell will be written.
min_fragments – int, Optional; Minimum number of fragments required for a cell barcode to be considered valid. Default is 1000.
bgzip_n_cores – int, Optional; Number of cores to be used by bgzip. Default is 1.
chunk_size – int, Optional; Size of the chunk to be written at once to bgzip.stdin during the bgzipping process. Default is 1_000_000.
- piaso.generateBigWigByCellType(input_fragment_file, chrom_sizes_file, celltype_barcode_df, output_dir, bedtools_exec_dir, samtools_exec_dir, bamCoverage_exec_dir, bamCoverage_ncores=10, bamCoverage_binSize=1, bamCoverage_normalizeMethod='BPM', samtools_ncores=20, sort_bam=False)#
Generates BigWig files for each cell type from the fragment file.
Parameters: - input_fragment_file (str): Path to the input fragment file (gzip format). - chrom_sizes_file (str): Path to the chromosome sizes file. - celltype_barcode_df (DataFrame): A DataFrame with cell types and their corresponding barcodes. - output_dir (str): Directory to save the generated BAM and BigWig files. - bedtools_exec_dir (str): Directory containing bedtools executable. - samtools_exec_dir (str): Directory containing samtools executable. - bamCoverage_exec_dir (str): Directory containing bamCoverage executable. - bamCoverage_ncores (int, optional): Number of processors for bamCoverage. Default is 10. - bamCoverage_binSize (int, optional): Bin size for bamCoverage. Default is 1. - bamCoverage_normalizeMethod (str, optional): Normalization method for bamCoverage. Default is “BPM”. - samtools_ncores (int, optional): Number of processors for samtools. Default is 20. - sort_bam (bool, optional): Whether to sort the BAM file by coordinate after creation. If True, the BAM file will be sorted by coordinate and the unsorted version will be deleted. Default is False.
- piaso.handler(signum, frame)#
- piaso.intersectPeak(peak_file: str, reference_file: str, output_file: str, bedtools_path: str | None = None) None #
Calculates the overlapping regions of two BED files using bedtools.
- Parameters:
peak_file – Path to the input peak file.
reference_file – Path to the reference gene file.
output_file – Path to the output file where overlapping regions will be saved.
bedtools_path – Optional path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.
- piaso.intersectPeakSlop(peak1_bed_file: str, peak2_bed_file: str, chrom_sizes_file: str, output_file: str, bedtools_path: str | None = None, slop_distance: int = 100000) None #
This function extends the regions of the second peak file (peak2) by the given slop_distance, then computes the intersections between peak1 and the slopped peak2 regions, and saves the results to the specified output file.
- Parameters:
peak1_bed_file – Path to the first peak BED file.
peak2_bed_file – Path to the second peak BED file.
chrom_sizes_file – Path to the chromosome sizes file.
output_file – Path to save the intersected results.
bedtools_path – Optional path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.
slop_distance – Distance to add on both sides of the regions from peak2. Default is 100,000 bases.
- piaso.loadFragmentAsTile(fragment_file, tile_size=500)#
Load the data from a fragment file and return a sparse matrix, a DataFrame containing cell barcodes and their indices, and a DataFrame containing tiles and their indices.
Parameters: fragment_file (str): The path to the fragment file. tile_size (int): The size of the tiles.
Returns: An AnnData object with a sparse matrix where rows represent cells, and columns represent tiles.
- piaso.match_and_retrieve_values(query_array: ndarray, reference_array: ndarray, reference_value_array: ndarray) ndarray #
This function receives three 1-D numpy arrays: query_array, reference_array, and reference_value_array. For each element in query_array, the function finds the corresponding index of its first occurrence in reference_array and returns the corresponding value from reference_value_array. If an element in query_array does not exist in reference_array, the corresponding value is np.nan.
- Parameters:
query_array – 1-D numpy array - the array whose elements’ indices are to be found in reference_array
reference_array – 1-D numpy array - the array in which to find the indices of elements of query_array
reference_value_array – 1-D numpy array - the array from which to retrieve the values corresponding to the indices found in reference_array
- Returns:
1-D numpy array of values from reference_value_array corresponding to the position of elements in query_array in reference_array
- piaso.mergeFragmentFiles(fragment_files: List[str], temp_dir: str, save_dir: str, merged_file_name: str, bgzip_n_cores: int = 20, chunk_size: int = 10000000)#
Merges fragment files from different data directories.
- Args:
fragment_file (List[str]): List of paths to the fragment files. temp_dir (str): Path to the temporary directory to store intermediate files. save_dir (str): Path to the directory to save the final output. merged_file_name (str): Name of the final merged file (without extension). bgzip_n_cores (int): Number of cores to be used by the bgzip command. Default is 20. chunk_size (int): Number of lines to process at once, A chunk size of around 23.8 million lines would occupy roughly 1 GB of memory. Around 40 bytes for a line
- piaso.mergePeakFile(input_dir: str, output_file: str, file_extension: str = '.narrowPeak', bedtools_path: str | None = None) None #
Merge all the peak files (either .narrowPeak or .bed) in the provided directory.
- Parameters:
input_dir – Directory containing peak files.
output_file – Path to the merged output file.
file_extension – Extension of the peak files, either ‘.narrowPeak’ or ‘.bed’. Default is ‘.narrowPeak’.
bedtools_path – Path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.
- piaso.minusPeak(peaks1_bed_file: str, peaks2_bed_file: str, output_peak_file: str, bedtools_path: str | None = None) None #
Retains only the peaks from ‘peaks1_bed_file’ that don’t overlap with peaks in ‘peaks2_bed_file’.
- Parameters:
peaks1_bed_file – Path to the peaks1 bed file.
peaks2_bed_file – Path to the peaks2 bed file.
output_peak_file – Path where the result peak list will be saved.
bedtools_path – Optional path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.
- piaso.parsePeak(peak: str)#
Parse peak string into a tuple containing chromosome, start, and end.
- Parameters:
peak – A string representing a peak, e.g. “chr:start-end” or “chr-start-end”.
- Returns:
A tuple containing chromosome as a string, start and end as integers.
- piaso.processTSSbed(bedtools_path, tss_bed_file, genome_size_file, slop_l, slop_r, shift, output_dir, prefix='tss')#
A function to process bed files: slop, left shift, and right shift operations are performed.
- Parameters:
bedtools_path – str, the path to the bedtools binaries.
tss_bed_file – str, the path to the input TSS bed file.
genome_size_file – str, the path to the genome size file.
slop_l – int, the amount to slop (extend) the features to the left.
slop_r – int, the amount to slop (extend) the features to the right.
shift – int, the amount to shift the features to the left and right.
output_dir – str, the directory where the output files will be saved.
prefix – str, a prefix for naming output files. Default is “tss”.
- piaso.process_celltype(args)#
Processes each cell type: filters fragments, converts to BAM, and then to BigWig.
Parameters: - args (tuple): - input_fragment_file (str): Path to the input fragment file (gzip format). - chrom_sizes_file (str): Path to the chromosome sizes file. - celltype (str): Name of the cell type currently being processed. - group (DataFrame): A DataFrame containing barcodes for the current cell type. - output_dir (str): Directory to save the generated BAM and BigWig files. - bedtools_exec_dir (str): Directory containing bedtools executable. - samtools_exec_dir (str): Directory containing samtools executable. - bamCoverage_exec_dir (str): Directory containing bamCoverage executable. - bamCoverage_ncores (int): Number of processors for bamCoverage. - bamCoverage_binSize (int): Bin size for bamCoverage. - bamCoverage_normalizeMethod (str): Normalization method for bamCoverage. - samtools_ncores (int): Number of processors for samtools.
- piaso.process_fragment_file(idx: int, atac_fragments_gz: str, temp_dir: str) str #
Processes the specified directory to create a prefix file. Args: -idx (int): Index to append to the prefix of each line in the output file. -atac_fragments_gz (str): Full path to the fragment file to process. -temp_dir (str): Path to the temporary directory to store intermediate files. Returns: str: Path to the created prefix file.
- piaso.quantifyPeakActivity(fragment_file: str, peak_file: str, bedtools_path: str | None = None)#
Process fragment and peak files and returns an AnnData object
Parameters: :param fragment_file: Path to the fragment file :param peak_file: Path to the peak file :param bedtools_path: Path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.
Returns: An AnnData object
Examples: adata=quantifyPeakActivity( fragment_file=’./atac_fragments.tsv.gz’, peak_file=’./macs2_peaks.bed’ )
- piaso.runMACS2(fragment_file: str, macs_name: str, output_dir: str, genome_size: str = '1.87e+9', keep_dup: str = '1', extsize: int = 200, shift: int = -100, file_format: str = 'BED', macs2_path: str | None = None, macs2_silent: bool = True) None #
Call the MACS2 to perform peak calling.
- Parameters:
fragment_file – str, path to the fragment file.
macs_name – str, name string of the experiment.
output_dir – str, path to the output directory.
genome_size – str, size of the genome or abbreviation of the organism: ‘hs’ for Homo sapiens (default), ‘mm’ for Mus musculus, ‘ce’ for Caenorhabditis elegans, ‘dm’ for Drosophila melanogaster. Any other input will be considered as a direct size value. Default is ‘1.87e+9’.
genome_size – str, whether to keep the duplicates at the same location. Default is ‘1’. Choose ‘all’ if you want to ask MACS2 to keep all.
extsize – int, the extsize parameter for macs2. Default is 200.
shift – int, the shift parameter for macs2. Default is -100.
file_format – str, format of the input file. Default is ‘BED’.
macs2_path – str, path to the directory containing macs2 executable. If None, ‘macs2’ command available in the PATH will be used. Default is None.
macs2_silent – bool, if set to True, suppresses the MACS2 command output. Default is True.
- piaso.runMACS2Parallel(input_dir: str, output_dir: str, genome_size: str = '1.87e+9', keep_dup: str = '1', extsize: int = 200, shift: int = -100, file_format: str = 'BAM', macs2_path: str | None = None, macs2_silent: bool = True, max_threads: int = 10) None #
Call MACS2 in parallel for multiple files in a given input directory.
- Parameters:
input_dir – str, path to the directory containing input files.
output_dir – str, path to the directory where MACS2 output will be saved.
genome_size – str, size of the genome or abbreviation of the organism: ‘hs’ for Homo sapiens (default), ‘mm’ for Mus musculus, ‘ce’ for Caenorhabditis elegans, ‘dm’ for Drosophila melanogaster. Any other input will be considered as a direct size value.
keep_dup – str, whether to keep the duplicates at the same location. Default is ‘1’. Choose ‘all’ if you want MACS2 to keep all.
extsize – int, the extsize parameter for MACS2. Default is 200.
shift – int, the shift parameter for MACS2. Default is -100.
file_format – str, expected format of the input files. Supports ‘BAM’, ‘BAMPE’, ‘BED’, ‘BEDPE’. Default is ‘BAM’.
macs2_path – str, path to the directory containing the macs2 executable. If None, ‘macs2’ command available in the PATH will be used. Default is None.
macs2_silent – bool, if set to True, suppresses the MACS2 command output. Default is True.
max_threads – int, maximum number of threads to use for parallel execution. Default is 10.
- Returns:
None. The results will be saved in the specified output directory.
- piaso.runMFE(adata, batch_key: str | None = None, groupby: str | None = None, n_gene: int = 30, mu: float = 1.0, resolution: float = 1.0, scoring_method: str | None = None, roeg_layer: str = 'raw')#
- piaso.runSVD(adata: AnnData, use_highly_variable: bool = True, n_components: int = 50, random_state: int | None = 10, scale_data: bool = False, key_added: str = 'X_svd', layer: str | None = None) None #
Run Truncated Singular Value Decomposition (SVD) on the AnnData object and stores the result in adata.obsm[key_added].
Parameters: adata (AnnData): The annotated data matrix. use_highly_variable (bool): Whether to use highly variable genes/features only. Default is True. n_components (int): Desired dimensionality of output data. Must be strictly less than the number of features. Default is 50. random_state (int, optional): Random seed for reproducibility. Default is 10. scale_data (bool): Whether to scale the data using StandardScaler. Default is False. key_added (str): Key in adata.obsm to store the result. Default is ‘X_svd’. layer (str, optional): Specify the layer to use. If None, use adata.X. Default is None.
Usage:
`python runSVD(adata, use_highly_variable=True, n_components=50, random_state=10, scale_data=False, key_added='X_svd', layer='raw') `
This will run SVD on the specified layer ‘raw’ of the adata object, and will store the result in adata.obsm[‘X_svd’].
- piaso.run_tfidf(adata: AnnData, layer: str | None = None, scale_factor: int = 10000.0)#
Compute the TF-IDF (Term Frequency - Inverse Document Frequency) matrix for the input peak count data.
Parameters#
- adataAnnData
An AnnData object containing the peak count data.
- scale_factorfloat, optional
A scaling factor to multiply the resulting TF-IDF matrix by. Default is 1e4.
Returns#
- None
The method modifies the input data object in place by setting its X attribute to the computed TF-IDF matrix.
- piaso.saveBEDFromArray(regions: ndarray, output_bed_file: str) None #
Saves a BED file derived from a numpy array of genomic regions.
- Parameters:
regions – numpy array containing the genomic regions.
output_bed_file – Full path including filename where the BED file will be saved.
- piaso.sortBED(input_bed: str, output_bed: str) None #
Sort a .bed file based on two keys: 1) The first field (as a string) 2) The second field (as a number)
- Parameters:
input_bed – Path to the input .bed file
output_bed – Path to the output .bed file
- piaso.sortBamByCBCoords(input_bam_path, output_bam_path, samtools_path=None, memory_per_thread='512M', n_threads=4, tmp_dir=None)#
Sort a BAM file first by the cell barcode (CB tag) and then by coordinates.
Parameters: :param input_bam_path: Path to the input BAM file. :param output_bam_path: Path to the output BAM file that will be sorted. :param samtools_path: Optional path to the directory containing the samtools executable. If None, samtools command available in the PATH will be used. Default is None. :param memory_per_thread: Maximum memory per thread to use. Default is “512M”. :param n_threads: Number of threads to use. Default is 4. :param tmp_dir: Path to the directory to be used for temporary files. If None, system default is used. Default is None.
- piaso.sortDictByKeys(my_dict, keys, ascending=False)#
Sorts a given list of keys based on their corresponding values in the provided dictionary and returns a DataFrame with sorted keys and their corresponding values.
- Parameters:
my_dict (dict) – Dictionary containing key-value pairs.
keys (list) – List of keys to be sorted.
ascending (bool) – Boolean to decide whether the sorting should be in ascending order. Default is False (descending).
- Returns:
DataFrame with one column containing the sorted keys and another containing the corresponding values.
- Return type:
pd.DataFrame