piaso module

piaso.addNonOverlappingPeaks(base_peak_file: str, additional_peak_file: str, output_peak_file: str, bedtools_path: str | None = None) -> None

Adds peaks from the ‘additional_peak_file’ that don’t overlap with peaks in ‘base_peak_file’ and then merges the result to ensure a sorted non-overlapping peak list while retaining original peak names.

Parameters:
  • base_peak_file – Path to the main peak file.

  • additional_peak_file – Path to the additional peak file.

  • output_peak_file – Path where the merged and sorted peak list will be saved.

  • bedtools_path – Optional path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.
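The overlap logic described above can be sketched in plain Python. This is only an illustration under stated assumptions: peaks are held in memory as (chrom, start, end, name) tuples with half-open coordinates, and the helper add_non_overlapping_sketch is hypothetical. piaso itself delegates this work to bedtools on files.

```python
def add_non_overlapping_sketch(base_peaks, additional_peaks):
    """Hypothetical helper: combine two peak lists, keeping every base peak
    and only those additional peaks that overlap no base peak."""

    def overlaps(a, b):
        # Same chromosome and intersecting half-open intervals [start, end).
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

    extra = [
        peak for peak in additional_peaks
        if not any(overlaps(peak, base) for base in base_peaks)
    ]
    # Sort by chromosome name, then start, then end; names travel with peaks.
    return sorted(list(base_peaks) + extra, key=lambda p: (p[0], p[1], p[2]))


base = [("chr1", 100, 200, "b1"), ("chr2", 50, 80, "b2")]
additional = [("chr1", 150, 250, "a1"), ("chr1", 300, 400, "a2")]
combined = add_non_overlapping_sketch(base, additional)
```

The quadratic overlap check keeps the sketch short; a sweep over sorted intervals would be the usual choice for real peak sets.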

piaso.buildADJ(adata, use_rep: str = 'X_svd', n_nearest_neighbors: int = 10, include_self: bool = True)

piaso.calculateAdjacentPeakSimilarityChunkwindow(peaks1: List[str], mat1: csr_matrix, peaks2: List[str], mat2: csr_matrix, window_size: int = 100000, chunk_size: int = 100, return_df: bool = False)

Calculate cosine similarities between each peak in mat1 and its adjacent peaks in mat2 within a specified window size. The peaks need to be pre-sorted.

Parameters:
  • peaks1 – List of sorted peak strings associated with mat1.

  • mat1 – A csr_matrix where each row corresponds to a peak in the peaks1 list.

  • peaks2 – List of sorted peak strings associated with mat2.

  • mat2 – A csr_matrix where each row corresponds to a peak in the peaks2 list.

  • window_size – The genomic distance within which to calculate similarities.

  • chunk_size – The number of rows to be processed in each chunk.

  • return_df – Whether to return the result as a DataFrame. If False, returns a csr_matrix.

Returns:

A csr_matrix or DataFrame representing the cosine similarity between peaks in mat1 and mat2.
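As a rough illustration of the idea, the sketch below computes cosine similarity only for peak pairs on the same chromosome whose gap is at most window_size. It is not piaso's implementation: it uses dense Python lists instead of csr_matrix, skips chunking, does not rely on sorting, and the gap definition (distance between the nearest edges) is an assumption. All function names are hypothetical.

```python
import math


def parse(peak):
    """Split a 'chr:start-end' string into (chrom, start, end)."""
    chrom, span = peak.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def windowed_similarity_sketch(peaks1, rows1, peaks2, rows2, window_size=100_000):
    """Return {(i, j): similarity} for peak pairs within window_size of each other."""
    result = {}
    for i, p1 in enumerate(peaks1):
        c1, s1, e1 = parse(p1)
        for j, p2 in enumerate(peaks2):
            c2, s2, e2 = parse(p2)
            gap = max(s1, s2) - min(e1, e2)  # negative when the peaks overlap
            if c1 == c2 and gap <= window_size:
                result[(i, j)] = cosine(rows1[i], rows2[j])
    return result


similarities = windowed_similarity_sketch(
    ["chr1:100-200"], [[1, 0, 1]],
    ["chr1:1000-1100", "chr1:500000-500100", "chr2:100-200"],
    [[1, 0, 1], [1, 1, 1], [1, 0, 1]],
)
```

Only the first pair falls inside the window here; the distant chr1 peak and the chr2 peak are skipped.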

piaso.calculateCellMetrics(adata: AnnData, layer: str | None = None)

Calculate the number of fragments overlapping peaks and the number of peaks for each cell and store them in adata.obs.

Parameters:
  • adata (AnnData) – An AnnData object with cells in rows and peaks in columns.

  • layer (str, optional) – The layer of AnnData to use. If None, the default .X layer is used.

piaso.calculatePeakMetrics(adata: AnnData, layer: str | None = None)

Calculate the number of cells in which each peak is active (non-zero counts) for an AnnData object, and then add this information to the adata.var[‘n_cells’] attribute.

Parameters:
  • adata – AnnData object containing cell x peak matrix.

  • layer – Specify which layer of the AnnData object to use. If None, use the main matrix (.X attribute).
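The per-peak count described above, together with the per-cell metrics of calculateCellMetrics, reduces to row and column summaries of the cell x peak matrix. The sketch below shows that arithmetic on a small nested list standing in for the matrix; the function names are hypothetical and the real functions operate on an AnnData object instead.

```python
def cell_metrics_sketch(matrix):
    """Return (total counts, number of non-zero peaks) for each cell (row)."""
    return [(sum(row), sum(1 for value in row if value != 0)) for row in matrix]


def peak_metrics_sketch(matrix):
    """Return the number of cells with a non-zero count for each peak (column)."""
    return [sum(1 for value in column if value != 0) for column in zip(*matrix)]


# Three cells (rows) by three peaks (columns).
counts = [
    [2, 0, 1],
    [0, 0, 3],
    [1, 0, 0],
]
per_cell = cell_metrics_sketch(counts)
per_peak = peak_metrics_sketch(counts)
```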

piaso.check_software_availability()

Checks the availability of the required software in the system’s PATH. Raises an EnvironmentError if any software is not available.
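A check of this kind is commonly written with shutil.which. The following is a minimal sketch, not piaso's code: the helper name is hypothetical, and which tools piaso actually checks is not stated here, so the caller passes the list.

```python
import shutil


def check_tools_sketch(tools):
    """Raise EnvironmentError if any of the named executables is not on PATH."""
    missing = [tool for tool in tools if shutil.which(tool) is None]
    if missing:
        raise EnvironmentError("Not found in PATH: " + ", ".join(missing))
```

A call such as check_tools_sketch(["bedtools", "bgzip", "tabix"]) would then fail early, before any pipeline step is started.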

piaso.chr_split_func(line: str)

piaso.convertBamToFragment(bam_path, output_gz_path, bgzip_path=None, tabix_path=None, bgzip_n_cores=20, buffer_size=1000000, barcode_tag: str = 'CB')

Convert a BAM file to a fragment count file with cell barcodes and then create a tabix index.

Parameters:
  • bam_path (str) – Path to the input BAM file.

  • output_gz_path (str) – Path to the output bgzipped fragment count file.

  • bgzip_path (str, optional) – Path to the bgzip executable. Defaults to None (system path).

  • tabix_path (str, optional) – Path to the tabix executable. Defaults to None (system path).

  • bgzip_n_cores (int, optional) – Number of cores for bgzip compression. Defaults to 20.

  • buffer_size (int, optional) – Number of lines to buffer before writing to output. Helps control memory usage. Defaults to 1 million.

  • barcode_tag (str) – Tag for cell barcodes in the BAM file. Default is ‘CB’.

Returns:

None. Writes results to output_gz_path and creates a .tbi index.

piaso.createCOSGDataFrameLong(adata, key='cosg')

Converts marker genes and their corresponding scores stored in adata.uns into a long-format DataFrame.

Parameters:
  • adata (anndata.AnnData) – An annotated data matrix.

  • key (str) – The key used to access the names and scores in the adata.uns dictionary.

Returns:

cosg_long (pd.DataFrame) – A long-format DataFrame with columns ‘Group’, ‘Name’, ‘Score’, and ‘Rank’.

piaso.filterFragmentByBarcode(input_fragment_file: str, output_fragment_file: str, filtered_barcodes: ndarray, bgzip_n_cores: int = 20, bgzip_dir=None, tabix_dir=None, chunk_size: int = 10000000)

Filters the input fragment file and keeps only those rows where the barcode is in the provided list of filtered barcodes.

Parameters:
  • input_fragment_file – str, Path to the input gz compressed fragment file.

  • output_fragment_file – str, Path to the output file where the filtered fragments will be written.

  • filtered_barcodes – np.ndarray, Numpy array containing the list of filtered cell barcodes.

  • bgzip_n_cores – int, Optional; Number of cores to be used by bgzip. Default is 20.

  • chunk_size – int, Optional; Size of the chunk to be written at once to bgzip.stdin during the bgzipping process. Default is 10_000_000.

  • bgzip_dir – Optional; Directory path containing the bgzip executable. Default is None (uses system PATH).

  • tabix_dir – Optional; Directory path containing the tabix executable. Default is None (uses system PATH).
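The core of such a filter is a streaming pass that keeps rows whose barcode is in a set. The sketch below assumes the common tab-separated fragment layout with the barcode in the fourth column, and writes plain gzip rather than bgzip, so it does not produce a tabix-indexable file. The function name is hypothetical and the code is not taken from piaso.

```python
import gzip


def filter_fragments_by_barcode_sketch(input_path, output_path, barcodes):
    """Copy rows whose 4th column is in `barcodes`; return the number kept."""
    keep = set(barcodes)  # set membership keeps the per-line test O(1)
    n_kept = 0
    with gzip.open(input_path, "rt") as src, gzip.open(output_path, "wt") as dst:
        for line in src:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 3 and fields[3] in keep:
                dst.write(line)
                n_kept += 1
    return n_kept
```

Because rows are written in input order, a coordinate-sorted input stays coordinate-sorted in the output.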

piaso.filterFragmentByRegion(fragment_file: str, sorted_bed: str, output_fragment_file: str, bedtools_path: str | None = None, bgzip_path: str | None = None, tabix_path: str | None = None, bgzip_n_cores: int = 20) -> None

Filter fragments using a sorted BED file and then compress and index the results.

Parameters:
  • fragment_file – Path to the input fragment file.

  • sorted_bed – Path to the sorted BED file.

  • output_fragment_file – Path to the output compressed fragment file.

  • bedtools_path – Optional path to the directory containing the bedtools executable.

  • bgzip_path – Optional path to the directory containing the bgzip executable.

  • tabix_path – Optional path to the directory containing the tabix executable.

  • bgzip_n_cores – Number of cores for bgzip. Defaults to 20.

piaso.filterFragments(input_fragment_file: str, output_fragment_file: str, fragment_count_file: str, min_fragments: int = 1000, bgzip_n_cores: int = 20, chunk_size: int = 10000000)

Filters the fragments based on the minimum number of fragments in each cell.

Parameters:
  • input_fragment_file – str, Path to the input gz compressed fragment file.

  • output_fragment_file – str, Path to the output file where the filtered fragments will be written.

  • fragment_count_file – str, Path to the csv file where the number of unique fragments per cell will be written.

  • min_fragments – int, Optional; Minimum number of fragments required for a cell barcode to be considered valid. Default is 1000.

  • bgzip_n_cores – int, Optional; Number of cores to be used by bgzip. Default is 20.

  • chunk_size – int, Optional; Size of the chunk to be written at once to bgzip.stdin during the bgzipping process. Default is 10_000_000.
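The selection step can be sketched as a count per barcode followed by a threshold. The code below is an illustration with assumptions: fragment rows are tab-separated with the barcode in the fourth column, each row counts as one fragment, and a barcode passes when its count is greater than or equal to min_fragments. The helper name is hypothetical.

```python
from collections import Counter


def select_barcodes_sketch(fragment_lines, min_fragments=1000):
    """Return (fragments per barcode, set of barcodes meeting the threshold)."""
    counts = Counter(
        line.split("\t")[3]
        for line in fragment_lines
        if line.strip() and not line.startswith("#")
    )
    passing = {barcode for barcode, n in counts.items() if n >= min_fragments}
    return counts, passing


lines = [
    "chr1\t1\t50\tAAA\t1",
    "chr1\t60\t90\tAAA\t1",
    "chr1\t95\t140\tCCC\t1",
    "chr2\t5\t40\tAAA\t2",
]
fragment_counts, valid_barcodes = select_barcodes_sketch(lines, min_fragments=2)
```

The resulting set could then drive a second pass over the file that writes only rows from passing barcodes.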

piaso.generateBigWigByCellType(input_fragment_file, chrom_sizes_file, celltype_barcode_df, output_dir, bedtools_exec_dir, samtools_exec_dir, bamCoverage_exec_dir, bamCoverage_ncores=10, bamCoverage_binSize=1, bamCoverage_normalizeMethod='BPM', samtools_ncores=20, sort_bam=False)

Generates BigWig files for each cell type from the fragment file.

Parameters:
  • input_fragment_file (str) – Path to the input fragment file (gzip format).

  • chrom_sizes_file (str) – Path to the chromosome sizes file.

  • celltype_barcode_df (DataFrame) – A DataFrame with cell types and their corresponding barcodes.

  • output_dir (str) – Directory to save the generated BAM and BigWig files.

  • bedtools_exec_dir (str) – Directory containing the bedtools executable.

  • samtools_exec_dir (str) – Directory containing the samtools executable.

  • bamCoverage_exec_dir (str) – Directory containing the bamCoverage executable.

  • bamCoverage_ncores (int, optional) – Number of processors for bamCoverage. Default is 10.

  • bamCoverage_binSize (int, optional) – Bin size for bamCoverage. Default is 1.

  • bamCoverage_normalizeMethod (str, optional) – Normalization method for bamCoverage. Default is “BPM”.

  • samtools_ncores (int, optional) – Number of processors for samtools. Default is 20.

  • sort_bam (bool, optional) – Whether to sort the BAM file by coordinate after creation. If True, the BAM file will be sorted by coordinate and the unsorted version will be deleted. Default is False.

piaso.handler(signum, frame)

piaso.intersectPeak(peak_file: str, reference_file: str, output_file: str, bedtools_path: str | None = None) -> None

Calculates the overlapping regions of two BED files using bedtools.

Parameters:
  • peak_file – Path to the input peak file.

  • reference_file – Path to the reference gene file.

  • output_file – Path to the output file where overlapping regions will be saved.

  • bedtools_path – Optional path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.

piaso.intersectPeakSlop(peak1_bed_file: str, peak2_bed_file: str, chrom_sizes_file: str, output_file: str, bedtools_path: str | None = None, slop_distance: int = 100000) -> None

This function extends the regions of the second peak file (peak2) by the given slop_distance, then computes the intersections between peak1 and the slopped peak2 regions, and saves the results to the specified output file.

Parameters:
  • peak1_bed_file – Path to the first peak BED file.

  • peak2_bed_file – Path to the second peak BED file.

  • chrom_sizes_file – Path to the chromosome sizes file.

  • output_file – Path to save the intersected results.

  • bedtools_path – Optional path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.

  • slop_distance – Distance to add on both sides of the regions from peak2. Default is 100,000 bases.
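The "slop, then intersect" idea can be illustrated in a few lines. In this sketch, intervals are extended on both sides and clamped to the chromosome bounds, which is how bedtools slop treats chromosome ends, and then tested for overlap with half-open coordinates. The helper names and the in-memory dictionary of chromosome sizes are assumptions for illustration; piaso runs bedtools on files instead.

```python
def slop_interval(chrom, start, end, chrom_sizes, slop_distance=100_000):
    """Extend an interval on both sides, clamped to [0, chromosome length]."""
    new_start = max(0, start - slop_distance)
    new_end = min(chrom_sizes[chrom], end + slop_distance)
    return chrom, new_start, new_end


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals intersect."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


chrom_sizes = {"chr1": 120_000}
slopped = slop_interval("chr1", 50_000, 60_000, chrom_sizes)
hit = overlaps(("chr1", 10, 20), slopped)
```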

piaso.loadFragmentAsTile(fragment_file, tile_size=500)

Load the data from a fragment file and return a sparse matrix, a DataFrame containing cell barcodes and their indices, and a DataFrame containing tiles and their indices.

Parameters:
  • fragment_file (str) – The path to the fragment file.

  • tile_size (int) – The size of the tiles.

Returns: An AnnData object with a sparse matrix where rows represent cells, and columns represent tiles.
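To make the tiling concrete, the sketch below bins fragments into fixed-size tiles and counts them per (barcode, tile) pair. How piaso assigns a fragment to a tile is not stated here; this sketch assumes assignment by the fragment start coordinate, and both the function name and the "chr:start-end" tile label are hypothetical.

```python
from collections import Counter


def tile_counts_sketch(fragments, tile_size=500):
    """Count fragments per (barcode, tile), binning by fragment start."""
    counts = Counter()
    for chrom, start, end, barcode in fragments:
        tile_start = (start // tile_size) * tile_size  # floor to tile boundary
        tile = f"{chrom}:{tile_start}-{tile_start + tile_size}"
        counts[(barcode, tile)] += 1
    return counts


fragments = [
    ("chr1", 10, 60, "AAA"),
    ("chr1", 480, 700, "AAA"),
    ("chr1", 510, 560, "CCC"),
]
tile_counts = tile_counts_sketch(fragments)
```

A mapping like this is the sparse (row, column, value) content from which a cell x tile matrix can be assembled.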

piaso.match_and_retrieve_values(query_array: ndarray, reference_array: ndarray, reference_value_array: ndarray) -> ndarray

This function receives three 1-D numpy arrays: query_array, reference_array, and reference_value_array. For each element in query_array, the function finds the corresponding index of its first occurrence in reference_array and returns the corresponding value from reference_value_array. If an element in query_array does not exist in reference_array, the corresponding value is np.nan.

Parameters:
  • query_array – 1-D numpy array - the array whose elements’ indices are to be found in reference_array

  • reference_array – 1-D numpy array - the array in which to find the indices of elements of query_array

  • reference_value_array – 1-D numpy array - the array from which to retrieve the values corresponding to the indices found in reference_array

Returns:

1-D numpy array of values from reference_value_array corresponding to the position of elements in query_array in reference_array
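The lookup semantics (first occurrence wins, missing elements become NaN) can be shown without NumPy. The sketch below uses plain lists and math.nan to stay dependency-free; the function name is hypothetical and the real function works on 1-D numpy arrays.

```python
import math


def match_and_retrieve_sketch(query, reference, reference_values):
    """For each query item, return the value at its first position in reference."""
    first_index = {}
    for idx, item in enumerate(reference):
        first_index.setdefault(item, idx)  # keep only the first occurrence
    return [
        reference_values[first_index[q]] if q in first_index else math.nan
        for q in query
    ]


values = match_and_retrieve_sketch(["b", "z", "a"], ["a", "b", "a"], [1.0, 2.0, 3.0])
```

Here "a" appears twice in the reference, so its first value (1.0) is returned, and the unknown "z" maps to NaN.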

piaso.mergeFragmentFiles(fragment_files: List[str], temp_dir: str, save_dir: str, merged_file_name: str, bgzip_n_cores: int = 20, chunk_size: int = 10000000)

Merges fragment files from different data directories.

Parameters:
  • fragment_files (List[str]) – List of paths to the fragment files.

  • temp_dir (str) – Path to the temporary directory to store intermediate files.

  • save_dir (str) – Path to the directory to save the final output.

  • merged_file_name (str) – Name of the final merged file (without extension).

  • bgzip_n_cores (int) – Number of cores to be used by the bgzip command. Default is 20.

  • chunk_size (int) – Number of lines to process at once. At around 40 bytes per line, a chunk of about 23.8 million lines occupies roughly 1 GB of memory. Default is 10000000.

piaso.mergePeakFile(input_dir: str, output_file: str, file_extension: str = '.narrowPeak', bedtools_path: str | None = None) -> None

Merge all the peak files (either .narrowPeak or .bed) in the provided directory.

Parameters:
  • input_dir – Directory containing peak files.

  • output_file – Path to the merged output file.

  • file_extension – Extension of the peak files, either ‘.narrowPeak’ or ‘.bed’. Default is ‘.narrowPeak’.

  • bedtools_path – Path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.

piaso.minusPeak(peaks1_bed_file: str, peaks2_bed_file: str, output_peak_file: str, bedtools_path: str | None = None) -> None

Retains only the peaks from ‘peaks1_bed_file’ that don’t overlap with peaks in ‘peaks2_bed_file’.

Parameters:
  • peaks1_bed_file – Path to the peaks1 bed file.

  • peaks2_bed_file – Path to the peaks2 bed file.

  • output_peak_file – Path where the result peak list will be saved.

  • bedtools_path – Optional path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.

piaso.parsePeak(peak: str)

Parse peak string into a tuple containing chromosome, start, and end.

Parameters:

peak – A string representing a peak, e.g. “chr:start-end” or “chr-start-end”.

Returns:

A tuple containing chromosome as a string, start and end as integers.
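Both accepted formats can be handled with a single regular expression. The following is a hypothetical re-implementation for illustration, not piaso's own code; it assumes start and end are plain non-negative integers.

```python
import re

# Chromosome name, then ':' or '-', then 'start-end'. The lazy chromosome
# group allows names that themselves contain '-' or ':'.
_PEAK_RE = re.compile(r"^(?P<chrom>.+?)[:-](?P<start>\d+)-(?P<end>\d+)$")


def parse_peak_sketch(peak: str):
    """Return (chromosome, start, end) parsed from a peak string."""
    match = _PEAK_RE.match(peak)
    if match is None:
        raise ValueError(f"Unrecognized peak string: {peak!r}")
    return match.group("chrom"), int(match.group("start")), int(match.group("end"))
```

For example, parse_peak_sketch("chr1:100-200") and parse_peak_sketch("chr1-100-200") both give ("chr1", 100, 200).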

piaso.processTSSbed(bedtools_path, tss_bed_file, genome_size_file, slop_l, slop_r, shift, output_dir, prefix='tss')

A function to process bed files: slop, left shift, and right shift operations are performed.

Parameters:
  • bedtools_path – str, the path to the bedtools binaries.

  • tss_bed_file – str, the path to the input TSS bed file.

  • genome_size_file – str, the path to the genome size file.

  • slop_l – int, the amount to slop (extend) the features to the left.

  • slop_r – int, the amount to slop (extend) the features to the right.

  • shift – int, the amount to shift the features to the left and right.

  • output_dir – str, the directory where the output files will be saved.

  • prefix – str, a prefix for naming output files. Default is “tss”.

piaso.process_celltype(args)

Processes each cell type: filters fragments, converts to BAM, and then to BigWig.

Parameters:
  • args (tuple) – Tuple containing:

    • input_fragment_file (str) – Path to the input fragment file (gzip format).

    • chrom_sizes_file (str) – Path to the chromosome sizes file.

    • celltype (str) – Name of the cell type currently being processed.

    • group (DataFrame) – A DataFrame containing barcodes for the current cell type.

    • output_dir (str) – Directory to save the generated BAM and BigWig files.

    • bedtools_exec_dir (str) – Directory containing the bedtools executable.

    • samtools_exec_dir (str) – Directory containing the samtools executable.

    • bamCoverage_exec_dir (str) – Directory containing the bamCoverage executable.

    • bamCoverage_ncores (int) – Number of processors for bamCoverage.

    • bamCoverage_binSize (int) – Bin size for bamCoverage.

    • bamCoverage_normalizeMethod (str) – Normalization method for bamCoverage.

    • samtools_ncores (int) – Number of processors for samtools.

piaso.process_fragment_file(idx: int, atac_fragments_gz: str, temp_dir: str) -> str

Processes the specified fragment file to create a prefix file.

Parameters:
  • idx (int) – Index to append to the prefix of each line in the output file.

  • atac_fragments_gz (str) – Full path to the fragment file to process.

  • temp_dir (str) – Path to the temporary directory to store intermediate files.

Returns:

str – Path to the created prefix file.

piaso.quantifyPeakActivity(fragment_file: str, peak_file: str, bedtools_path: str | None = None)

Processes fragment and peak files and returns an AnnData object.

Parameters:
  • fragment_file – Path to the fragment file.

  • peak_file – Path to the peak file.

  • bedtools_path – Path to the directory containing the bedtools executable. If None, bedtools command available in the PATH will be used. Default is None.

Returns:

An AnnData object.

Examples:

    adata = quantifyPeakActivity(
        fragment_file='./atac_fragments.tsv.gz',
        peak_file='./macs2_peaks.bed'
    )

piaso.runMACS2(fragment_file: str, macs_name: str, output_dir: str, genome_size: str = '1.87e+9', keep_dup: str = '1', extsize: int = 200, shift: int = -100, file_format: str = 'BED', macs2_path: str | None = None, macs2_silent: bool = True) -> None

Calls MACS2 to perform peak calling.

Parameters:
  • fragment_file – str, path to the fragment file.

  • macs_name – str, name string of the experiment.

  • output_dir – str, path to the output directory.

  • genome_size – str, size of the genome or abbreviation of the organism: ‘hs’ for Homo sapiens, ‘mm’ for Mus musculus, ‘ce’ for Caenorhabditis elegans, ‘dm’ for Drosophila melanogaster. Any other input will be considered as a direct size value. Default is ‘1.87e+9’.

  • keep_dup – str, whether to keep the duplicates at the same location. Default is ‘1’. Choose ‘all’ if you want MACS2 to keep all.

  • extsize – int, the extsize parameter for macs2. Default is 200.

  • shift – int, the shift parameter for macs2. Default is -100.

  • file_format – str, format of the input file. Default is ‘BED’.

  • macs2_path – str, path to the directory containing macs2 executable. If None, ‘macs2’ command available in the PATH will be used. Default is None.

  • macs2_silent – bool, if set to True, suppresses the MACS2 command output. Default is True.
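To show how these parameters might map onto a MACS2 command line, the sketch below only builds the argument list and does not run anything. The flags used (callpeak, -t, -f, -g, -n, --outdir, --keep-dup, --nomodel, --shift, --extsize) are standard MACS2 options, but the exact command piaso assembles is an assumption. In particular, --nomodel is included here because MACS2 applies --shift and --extsize only when model building is disabled. The function name is hypothetical.

```python
import os


def build_macs2_command_sketch(fragment_file, macs_name, output_dir,
                               genome_size="1.87e+9", keep_dup="1",
                               extsize=200, shift=-100, file_format="BED",
                               macs2_path=None):
    """Return a MACS2 'callpeak' argument list suitable for subprocess.run."""
    # Use the executable from the given directory, otherwise rely on PATH.
    executable = os.path.join(macs2_path, "macs2") if macs2_path else "macs2"
    return [
        executable, "callpeak",
        "-t", fragment_file,
        "-f", file_format,
        "-g", genome_size,
        "-n", macs_name,
        "--outdir", output_dir,
        "--keep-dup", keep_dup,
        "--nomodel",  # assumption: needed for --shift/--extsize to take effect
        "--shift", str(shift),
        "--extsize", str(extsize),
    ]


command = build_macs2_command_sketch("fragments.bed", "sample1", "./macs2_out")
```

Passing the list form to subprocess.run avoids shell quoting issues with paths that contain spaces.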

piaso.runMACS2Parallel(input_dir: str, output_dir: str, genome_size: str = '1.87e+9', keep_dup: str = '1', extsize: int = 200, shift: int = -100, file_format: str = 'BAM', macs2_path: str | None = None, macs2_silent: bool = True, max_threads: int = 10) -> None

Call MACS2 in parallel for multiple files in a given input directory.

Parameters:
  • input_dir – str, path to the directory containing input files.

  • output_dir – str, path to the directory where MACS2 output will be saved.

  • genome_size – str, size of the genome or abbreviation of the organism: ‘hs’ for Homo sapiens, ‘mm’ for Mus musculus, ‘ce’ for Caenorhabditis elegans, ‘dm’ for Drosophila melanogaster. Any other input will be considered as a direct size value. Default is ‘1.87e+9’.

  • keep_dup – str, whether to keep the duplicates at the same location. Default is ‘1’. Choose ‘all’ if you want MACS2 to keep all.

  • extsize – int, the extsize parameter for MACS2. Default is 200.

  • shift – int, the shift parameter for MACS2. Default is -100.

  • file_format – str, expected format of the input files. Supports ‘BAM’, ‘BAMPE’, ‘BED’, ‘BEDPE’. Default is ‘BAM’.

  • macs2_path – str, path to the directory containing the macs2 executable. If None, ‘macs2’ command available in the PATH will be used. Default is None.

  • macs2_silent – bool, if set to True, suppresses the MACS2 command output. Default is True.

  • max_threads – int, maximum number of threads to use for parallel execution. Default is 10.

Returns:

None. The results will be saved in the specified output directory.

piaso.runMFE(adata, batch_key: str | None = None, groupby: str | None = None, n_gene: int = 30, mu: float = 1.0, resolution: float = 1.0, scoring_method: str | None = None, roeg_layer: str = 'raw')

piaso.runSVD(adata: AnnData, use_highly_variable: bool = True, n_components: int = 50, random_state: int | None = 10, scale_data: bool = False, key_added: str = 'X_svd', layer: str | None = None) -> None

Run Truncated Singular Value Decomposition (SVD) on the AnnData object and store the result in adata.obsm[key_added].

Parameters:
  • adata (AnnData) – The annotated data matrix.

  • use_highly_variable (bool) – Whether to use highly variable genes/features only. Default is True.

  • n_components (int) – Desired dimensionality of output data. Must be strictly less than the number of features. Default is 50.

  • random_state (int, optional) – Random seed for reproducibility. Default is 10.

  • scale_data (bool) – Whether to scale the data using StandardScaler. Default is False.

  • key_added (str) – Key in adata.obsm to store the result. Default is ‘X_svd’.

  • layer (str, optional) – Specify the layer to use. If None, use adata.X. Default is None.

Usage:

`runSVD(adata, use_highly_variable=True, n_components=50, random_state=10, scale_data=False, key_added='X_svd', layer='raw')`

This will run SVD on the specified layer ‘raw’ of the adata object and store the result in adata.obsm[‘X_svd’].

piaso.run_tfidf(adata: AnnData, layer: str | None = None, scale_factor: int = 10000.0)

Compute the TF-IDF (Term Frequency - Inverse Document Frequency) matrix for the input peak count data.

Parameters:
  • adata (AnnData) – An AnnData object containing the peak count data.

  • scale_factor (float, optional) – A scaling factor to multiply the resulting TF-IDF matrix by. Default is 1e4.

Returns:

None. The method modifies the input data object in place by setting its X attribute to the computed TF-IDF matrix.
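Several TF-IDF variants are in use for peak count data, and the one implemented by run_tfidf is not specified here. The sketch below shows one common form as an assumption: term frequency is a peak's count divided by the cell's total count, inverse document frequency is log(1 + n_cells / n_cells_with_peak), and the product is multiplied by scale_factor. It works on a nested list (cells in rows, peaks in columns) and the function name is hypothetical.

```python
import math


def tfidf_sketch(counts, scale_factor=1e4):
    """Return a TF-IDF matrix (list of rows) for a cell x peak count matrix."""
    n_cells = len(counts)
    n_peaks = len(counts[0])
    cell_totals = [sum(row) for row in counts]
    # Number of cells in which each peak has a non-zero count.
    peak_cells = [sum(1 for row in counts if row[j] > 0) for j in range(n_peaks)]
    idf = [math.log(1 + n_cells / n) if n else 0.0 for n in peak_cells]
    return [
        [
            (row[j] / total) * idf[j] * scale_factor if total else 0.0
            for j in range(n_peaks)
        ]
        for row, total in zip(counts, cell_totals)
    ]


tfidf = tfidf_sketch([[2, 0], [1, 1]])
```

Peaks found in few cells receive a larger IDF weight, so rare peaks contribute more than ubiquitous ones after the transform.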

piaso.saveBEDFromArray(regions: ndarray, output_bed_file: str) -> None

Saves a BED file derived from a numpy array of genomic regions.

Parameters:
  • regions – numpy array containing the genomic regions.

  • output_bed_file – Full path including filename where the BED file will be saved.

piaso.sortBED(input_bed: str, output_bed: str) -> None

Sort a .bed file based on two keys: 1) The first field (as a string) 2) The second field (as a number)

Parameters:
  • input_bed – Path to the input .bed file

  • output_bed – Path to the output .bed file
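The two-key ordering can be sketched on in-memory lines as follows. This is an illustration only: the helper name is hypothetical, and it assumes tab-separated lines without header or track lines.

```python
def sort_bed_lines(lines):
    """Sort BED lines by chromosome (as a string), then start (as a number)."""
    records = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    records.sort(key=lambda fields: (fields[0], int(fields[1])))
    return ["\t".join(fields) for fields in records]


sorted_lines = sort_bed_lines([
    "chr2\t5\t10\n",
    "chr1\t300\t400\n",
    "chr1\t20\t30\n",
    "chr10\t1\t2\n",
])
```

Because the first key is compared as a string, chr10 sorts before chr2, which matches the lexicographic chromosome order that bedtools expects for sorted input.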

piaso.sortBamByCBCoords(input_bam_path, output_bam_path, samtools_path=None, memory_per_thread='512M', n_threads=4, tmp_dir=None)

Sort a BAM file first by the cell barcode (CB tag) and then by coordinates.

Parameters:
  • input_bam_path – Path to the input BAM file.

  • output_bam_path – Path to the output BAM file that will be sorted.

  • samtools_path – Optional path to the directory containing the samtools executable. If None, samtools command available in the PATH will be used. Default is None.

  • memory_per_thread – Maximum memory per thread to use. Default is “512M”.

  • n_threads – Number of threads to use. Default is 4.

  • tmp_dir – Path to the directory to be used for temporary files. If None, system default is used. Default is None.

piaso.sortDictByKeys(my_dict, keys, ascending=False)

Sorts a given list of keys based on their corresponding values in the provided dictionary and returns a DataFrame with sorted keys and their corresponding values.

Parameters:
  • my_dict (dict) – Dictionary containing key-value pairs.

  • keys (list) – List of keys to be sorted.

  • ascending (bool) – Boolean to decide whether the sorting should be in ascending order. Default is False (descending).

Returns:

DataFrame with one column containing the sorted keys and another containing the corresponding values.

Return type:

pd.DataFrame
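The ordering behaviour can be illustrated without pandas. The sketch below returns a list of (key, value) pairs instead of a DataFrame; the function name is hypothetical and every key is assumed to be present in the dictionary.

```python
def sort_keys_by_value(my_dict, keys, ascending=False):
    """Return (key, value) pairs for `keys`, ordered by their dictionary values."""
    return sorted(
        ((key, my_dict[key]) for key in keys),
        key=lambda pair: pair[1],
        reverse=not ascending,  # default is descending, as documented above
    )


scores = {"a": 3, "b": 1, "c": 2}
descending = sort_keys_by_value(scores, ["a", "b", "c"])
ascending = sort_keys_by_value(scores, ["a", "b", "c"], ascending=True)
```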