Technical (formatting, IO, etc.) functions

Most of these functions can be called as class methods for MutationCaller objects and/or PloidyEstimator objects. They are described here individually, so that subtasks can be more easily managed.

Formatting functions

isomut2py.format.generate_ploidy_info_file(filename=None, sample_names=None, bed_filepaths=None, ploidy_estimation_objects=None, sample_groups=None, group_bed_filepaths=None)
Generate ploidy info file for mutation detection in samples with nondefault ploidies. Make sure to supply one of the following arguments:
  • ploidy_estimation_objects
  • sample_names AND bed_filepaths
  • sample_groups AND group_bed_filepaths
Parameters:
  • filename – The desired path to the generated ploidy info file. If None, ploidy information is saved to ploidy_info_file.txt in the current directory. (default: None) (str)
  • sample_names – List of bam filenames for the samples, must be supplied together with bed_filepaths. (default: None) (list of str)
  • bed_filepaths – Must be supplied together with sample_names. The list of filepaths to each bed file describing the given sample in sample_names. (default: None) (list of str)
  • ploidy_estimation_objects – List of PloidyEstimation objects for each sample. (default: None) (list of isomut2py.PloidyEstimation)
  • sample_groups – List of lists of str. Each list in sample_groups must contain the name of bam files in that group. Must be supplied together with group_bed_filepaths. (default: None) (list of list of str, example: [[‘sample1.bam’, ‘sample2.bam’, ‘sample3.bam’], [‘sample4.bam’, ‘sample5.bam’], [‘sample6.bam’]])
  • group_bed_filepaths – List of filepaths to the bed files describing each sample group in sample_groups. Must be supplied together with sample_groups. (default: None) (list of str, example: [‘bedfile_of_samples123.txt’, ‘bedfile_of_samples45.txt’, ‘bedfile_of_samples6.txt’])
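The three accepted argument combinations can be illustrated with a small pre-flight check. This is only a sketch and not part of isomut2py: `pick_ploidy_source` is a hypothetical helper, and the requirement that paired lists have equal length is an assumption made for the example.

```python
def pick_ploidy_source(sample_names=None, bed_filepaths=None,
                       ploidy_estimation_objects=None,
                       sample_groups=None, group_bed_filepaths=None):
    """Report which documented argument combination was supplied.

    Hypothetical helper for illustration; isomut2py may validate differently.
    """
    if ploidy_estimation_objects is not None:
        return "objects"
    if sample_names is not None and bed_filepaths is not None:
        # Assumption: one bed file per sample, in the same order.
        if len(sample_names) != len(bed_filepaths):
            raise ValueError("sample_names and bed_filepaths differ in length")
        return "samples"
    if sample_groups is not None and group_bed_filepaths is not None:
        # Assumption: one bed file per group, in the same order.
        if len(sample_groups) != len(group_bed_filepaths):
            raise ValueError("sample_groups and group_bed_filepaths differ in length")
        return "groups"
    raise ValueError("supply objects, samples with bed files, or groups with bed files")


# The grouped form, using the example values from the parameter list.
groups = [["sample1.bam", "sample2.bam", "sample3.bam"],
          ["sample4.bam", "sample5.bam"],
          ["sample6.bam"]]
group_beds = ["bedfile_of_samples123.txt",
              "bedfile_of_samples45.txt",
              "bedfile_of_samples6.txt"]
source = pick_ploidy_source(sample_groups=groups, group_bed_filepaths=group_beds)
```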
isomut2py.format.get_bed_format_for_sample(chromosomes, chrom_length, output_dir, bam_filename=None, ownbed_filepath=None)

Creates a bed file of constant ploidies for a given sample from a file of positional ploidy data. If the ownbed_filepath attribute of the PloidyEstimation object is set, saves the bedfile to the path specified there. Otherwise, saves it to the output_dir with the “_ploidy.bed” suffix. Also sets the bed_dataframe attribute to the pandas.DataFrame containing the bed file.

Parameters:
  • chromosomes – list of chromosomes (array-like)
  • chrom_length – list of chromosome lengths in bp (array-like)
  • output_dir – path to the directory where the PE_fullchrom_* files are located. (str)
  • bam_filename – filename of the BAM file of the sample (default: None) (str)
  • ownbed_filepath – path to the bed file where results should be saved (default: None) (str)
Returns:

(ownbed_filepath, df)

  • ownbed_filepath: path to the bed file where results are saved
  • df: the bed file in a pandas.DataFrame
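Turning positional ploidy data into regions of constant ploidy amounts to merging runs of consecutive positions that share a ploidy value. The sketch below shows that idea in plain Python. `collapse_to_intervals` is a hypothetical name, and the row layout (chrom, start, end, ploidy) and the inclusive end coordinate are assumptions, not the file format isomut2py writes.

```python
def collapse_to_intervals(chrom, positions, ploidies):
    """Merge consecutive positions with equal ploidy into interval rows.

    Returns (chrom, start, end, ploidy) tuples; the end is the last position
    of the run (an assumption made for this illustration).
    """
    intervals = []
    start = prev = current = None
    for pos, ploidy in zip(positions, ploidies):
        if current is None:
            # First position opens the first run.
            start, current = pos, ploidy
        elif ploidy != current:
            # Ploidy changed: close the previous run and open a new one.
            intervals.append((chrom, start, prev, current))
            start, current = pos, ploidy
        prev = pos
    if current is not None:
        intervals.append((chrom, start, prev, current))
    return intervals


rows = collapse_to_intervals("chr1", [100, 200, 300, 400, 500], [2, 2, 3, 3, 2])
```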

IO functions

isomut2py.io.get_coverage_distribution(chromosomes, output_dir, cov_max, cov_min)

Sets the coverage_sample attribute of the PloidyEstimation object to the coverage distribution obtained from the temporary files created by __PE_prepare_temp_files(). Positions are filtered according to the attributes of the PloidyEstimation object. The number of positions in the final sample is decreased to 2000 for faster inference.

Parameters:
  • chromosomes – list of chromosomes in the samples (array-like)
  • output_dir – path to the directory where temporary files (PEtmp_fullchrom*) are located (str)
  • cov_max – the maximum value of the coverage for a position to be included in the analysis (int)
  • cov_min – the minimum value of the coverage for a position to be included in the analysis (int)
Returns:

a 2000-element sample from the coverage distribution
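The filtering and downsampling described above can be sketched as follows. This is an illustration under assumptions, not the library code: `sample_coverage` is a hypothetical helper, the bounds are treated as inclusive, and a seeded random draw stands in for whatever sampling isomut2py performs.

```python
import random


def sample_coverage(coverages, cov_min, cov_max, n=2000, seed=0):
    """Keep coverages within [cov_min, cov_max], then draw at most n of them."""
    kept = [c for c in coverages if cov_min <= c <= cov_max]
    if len(kept) <= n:
        return kept
    # Seeded generator so that the illustration is reproducible.
    rng = random.Random(seed)
    return rng.sample(kept, n)


coverages = list(range(0, 10000))  # made-up coverage values
cov_sample = sample_coverage(coverages, cov_min=5, cov_max=5000)
```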

isomut2py.io.load_bedfile_from_file(filename=None, output_dir=None, bam_filename=None)

Loads the bedfile containing previous ploidy estimates for the given sample from the path specified in filename.

Parameters:
  • filename – The path to the bedfile. If not supplied, the bed file is attempted to be loaded from [output_dir]/[bam_filename]_ploidy.bed. (default: None) (str)
  • output_dir – The path to the directory where the default bedfile is located. (default: None) (str)
  • bam_filename – the filename of the BAM file of the sample (default: None) (str)
Returns:

the pandas DataFrame of the bedfile

isomut2py.io.load_cov_distribution_parameters_from_file(filename=None, output_dir=None)

Loads the parameters of the seven Gaussians fitted to the coverage distribution of the sample from the specified filename (that was saved with pickle beforehand). If such a file is available, the computationally expensive ploidy estimation process can be skipped.

Parameters:
  • filename – The path to the file with the coverage distribution parameters. (default: None) (str)
  • output_dir – if filename is not supplied, distribution parameters are attempted to be loaded from [output_dir]/GaussDistParams.pkl (default: None) (str)
Returns:

dictionary containing the fitted coverage distribution parameters

isomut2py.io.load_mutations(output_dir, filename=None)

Loads mutations from a file or a list of files.

Parameters:
  • filename – The path to the file where mutations are stored. A list of paths can also be supplied; in this case, all of them will be loaded into a single dataframe. If None, the file [output_dir]/filtered_mutations.csv will be loaded. (default: None) (str or list of str)
  • output_dir – The path to the directory containing the file filtered_results.csv. (str)
Returns:

a pandas.DataFrame containing the mutations

isomut2py.io.load_obj(name)

Loads a python object from a file with the name (name).pkl.

Parameters:
  • name – filename (without extension)
Returns:

loaded object

isomut2py.io.save_obj(obj, name)

Saves a python object (obj) with the specified name as (name).pkl.

Parameters:
  • obj – object to save
  • name – filename (without extension)
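The (name).pkl convention shared by save_obj and load_obj can be sketched with the standard pickle module. The functions below are stand-ins with hypothetical names, written only to show the round trip; the actual isomut2py implementation may differ, for example in the pickle protocol it uses.

```python
import os
import pickle
import tempfile


def sketch_save_obj(obj, name):
    """Pickle obj to the file (name).pkl."""
    with open(name + ".pkl", "wb") as handle:
        pickle.dump(obj, handle, pickle.HIGHEST_PROTOCOL)


def sketch_load_obj(name):
    """Load a pickled object from the file (name).pkl."""
    with open(name + ".pkl", "rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary directory; the dictionary values are made up.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "GaussDistParams")
    sketch_save_obj({"mu": [30.0], "sigma": [5.0], "p": [1.0]}, base)
    restored = sketch_load_obj(base)
```

Note that the name is passed without the extension in both directions, matching the parameter descriptions above.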

Processing functions

(These functions mainly take care of parallelization.)

isomut2py.process.PE_prepare_temp_files(ref_fasta, input_dir, bam_filename, output_dir, genome_length, n_min_block, n_conc_blocks, chromosomes, chrom_length, windowsize=10000, shiftsize=3000, min_noise=0.1, print_every_nth=1000, base_quality_limit=0, samtools_fullpath='samtools', samtools_flags=None, bedfile=None, level=0)

Prepares temporary files for ploidy estimation, by averaging coverage in moving windows for the whole genome and collecting positions with reference allele frequencies in the [min_noise, 1-min_noise] range.

Parameters:
  • ref_fasta – path to the reference genome fasta file (str)
  • input_dir – path to the folder where bam files are located (str)
  • bam_filename – list of bam filenames to analyse (list of str)
  • output_dir – path to the directory where results should be saved (str)
  • genome_length – the total length of the genome in basepairs (int)
  • n_min_block – The approximate number of blocks to partition the analysed genome to for parallel computing. The actual number might be slightly larger than this. (default: 200) (int)
  • n_conc_blocks – The number of blocks to process at the same time. (default: 4) (int)
  • chromosomes – list of chromosomes in the genome (list of str)
  • chrom_length – list of chromosome lengths in the genome (list of int)
  • windowsize – windowsize for initial coverage smoothing with a moving average method (default: 10000) (int)
  • shiftsize – shiftsize for initial coverage smoothing with a moving average method (default: 3000) (int)
  • min_noise – The minimum frequency of non-reference or reference bases detected for a position to be considered for LOH detection. Setting it too small will result in poor noise filtering, setting it too large will result in a decreased number of measurement points. (default: 0.1) (float in range(0,1))
  • print_every_nth – Even though LOH detection is limited to the positions with a noise level larger than min_noise, ploidy estimation is based on all the genomic positions meeting the above set criteria. By setting the attribute print_every_nth, the number of positions used can be controlled. Setting it too large will result in overlooking ploidy variations in shorter genomic ranges, while setting it too small can cause an increase in both memory usage and computation time. Decrease only if a relatively short genome is analysed. (default: 1000) (int)
  • base_quality_limit – The base quality limit used by samtools in order to decide if a base should be included in the pileup file. (default: 0) (int)
  • samtools_fullpath – The path to samtools on the computer. (default: “samtools”) (str)
  • samtools_flags – The samtools flags to be used for pileup file generation. (default: “ -B -d 1000 ”) (str)
  • bedfile – path to the bedfile to limit ploidy estimation to a specific region of the genome (default: None) (str)
  • level – the level of indentation used in verbose output (default: 0) (int)
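The two ideas behind the temporary files, moving-window coverage averaging and keeping only positions whose reference allele frequency lies in [min_noise, 1-min_noise], can be sketched in a few lines. Both helpers are hypothetical, and the window here is counted in list entries rather than base pairs, which is a simplifying assumption.

```python
def window_means(values, windowsize, shiftsize):
    """Average values in windows of windowsize entries, advancing by shiftsize."""
    if not values:
        return []
    last_start = max(len(values) - windowsize, 0)
    means = []
    for start in range(0, last_start + 1, shiftsize):
        window = values[start:start + windowsize]
        means.append(sum(window) / len(window))
    return means


def in_noise_band(ref_freq, min_noise=0.1):
    """True if the reference allele frequency lies in [min_noise, 1 - min_noise]."""
    return min_noise <= ref_freq <= 1 - min_noise


coverage = [10, 10, 10, 10, 20, 20, 20, 20]  # made-up per-position coverage
smoothed = window_means(coverage, windowsize=4, shiftsize=2)

ref_freqs = [0.02, 0.5, 0.67, 0.99]  # made-up reference allele frequencies
informative = [f for f in ref_freqs if in_noise_band(f)]
```

With the toy input, the middle window straddles the coverage change, which is why a smaller shiftsize gives a finer view of where the coverage level shifts.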
isomut2py.process.double_run(mutcallObject, level=0)

Runs the mutation detection pipeline on the MutationCaller object when local realignment is used during the process.

Parameters:
  • mutcallObject – an isomut2py.MutationCaller object
  • level – the level of indentation used in verbose output (default: 0) (int)
isomut2py.process.single_run(mutcallObject, level=0)

Runs the mutation detection pipeline on the MutationCaller object when no local realignment is used during the process.

Parameters:
  • mutcallObject – an isomut2py.MutationCaller object
  • level – the level of indentation used in verbose output (default: 0) (int)

Bayesian inference functions

(These functions are called by the PloidyEstimator object to fit different theoretical distributions to the actual coverage distribution calculated from the data.)

isomut2py.bayesian.PE_on_chrom(chrom, output_dir, windowsize_PE, shiftsize_PE, distribution_dict)

Runs the whole ploidy estimation pipeline on a given chromosome, using the appropriate attributes of the PloidyEstimation object by running PE_on_range() multiple times. Prints the results to the file: [output_dir]/PE_fullchrom_[chrom].txt.

Parameters:
  • distribution_dict – a dictionary containing the fitted parameters of the Gaussian mixture model to the coverage distribution. (dict with keys ‘mu’, ‘sigma’ and ‘p’)
  • shiftsize_PE – shiftsize for moving average over regions (int)
  • windowsize_PE – windowsize for moving average over regions (int)
  • output_dir – the path to the directory where temporary files are located (str)
  • chrom – the name of the chromosome to analyse (str)
isomut2py.bayesian.PE_on_range(dataframe, rmin, rmax, all_mu, all_sigma, prior, cov_min=0, cov_max=100000)

Run the ploidy estimation on a given range of a chromosome.

Parameters:
  • dataframe – The dataframe read from the temporary files for a given chromosome, containing information about the genomic position, the measured coverage and the frequency of the non-reference bases aligned to the position. (pandas.DataFrame)
  • rmin – The lower bound of the genomic range considered for the analysis. (int)
  • rmax – The upper bound of the genomic range considered for the analysis. (int)
  • all_mu – The centers of the seven Gaussians fitted to the coverage distribution. (list of float)
  • all_sigma – The sigmas of the seven Gaussians fitted to the coverage distribution. (list of float)
  • prior – The weights of the seven Gaussians fitted to the coverage distribution. (list of float in the range(0,1))
Returns:

(most_probable_ploidy, most_probable_LOH)

  • most_probable_ploidy: the estimated ploidy for the genomic region (int, in range(1,8))
  • most_probable_LOH: the estimated LOH status for the genomic region (1: LOH, 0: no LOH)
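As a rough illustration of how all_mu, all_sigma and prior could be combined into a ploidy call, the sketch below weights each Gaussian density by its prior and picks the most probable component. It uses coverage only and ignores allele frequencies and LOH, so it is a simplification rather than the procedure PE_on_range implements. The function names, the mapping of component k to ploidy k, and all numbers are assumptions.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def most_probable_ploidy(coverage, all_mu, all_sigma, prior):
    """Return (ploidy, posterior) for one coverage value.

    Assumes the k-th Gaussian (1-based) corresponds to ploidy k.
    """
    scores = [p * gaussian_pdf(coverage, mu, sigma)
              for mu, sigma, p in zip(all_mu, all_sigma, prior)]
    total = sum(scores)
    if total == 0:
        raise ValueError("coverage is too far from every component")
    posterior = [s / total for s in scores]
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best + 1, posterior


# Made-up model: haploid coverage 15, equal widths, uniform weights.
all_mu = [15.0 * k for k in range(1, 8)]
all_sigma = [4.0] * 7
prior = [1 / 7] * 7
ploidy, posterior = most_probable_ploidy(31.0, all_mu, all_sigma, prior)
```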

isomut2py.bayesian.estimate_hapcov_infmix(cov_min=None, hc_percentile=None, chromosomes=None, output_dir=None, cov_max=None, level=0, cov_sample=None, user_defined_hapcov=None)

Estimates the haploid coverage of the sample. If the user_defined_hapcov attribute is set manually, it sets the value of estimated_hapcov to that. Otherwise, a many-component (20) Gaussian mixture model is fitted to the coverage histogram of the sample 10 times. Each time, the haploid coverage is estimated from the center of the component with the maximal weight in the model. The final estimate of the haploid coverage is calculated as the qth percentile of the 10 measurements, with q = hc_percentile.

Parameters:
  • user_defined_hapcov – if not None, its value is returned (default: None) (float)
  • cov_sample – a sample of the coverage distribution of the investigated sample, if None, it is loaded from the temporary files of the output_dir (default: None) (array-like)
  • cov_max – the maximum value of the coverage for a position to be considered in the estimation (default: None) (int)
  • output_dir – the path to the output directory of the PloidyEstimator object, where temporary files are located (default: None) (str)
  • chromosomes – list of chromosomes for the sample (default: None) (array-like)
  • hc_percentile – the percentile value to use for calculating the estimated haploid coverage from 10 subsequent estimations (default: None) (int)
  • cov_min – the minimum value of the coverage for a position to be considered in the estimation (default: None) (int)
  • level – the level of indentation used in verbose output (default: 0) (int)
Returns:

(estimated haploid coverage (float), sample from coverage distribution (array-like))
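The last step of the procedure, taking the qth percentile of the 10 repeated estimates, can be sketched as below. The `percentile` helper uses linear interpolation between order statistics, which is an assumption about the convention; the ten values are invented for the example.

```python
def percentile(values, q):
    """Percentile with linear interpolation, for q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Ten made-up haploid coverage estimates; two runs locked onto the diploid peak.
estimates = [14.8, 15.1, 15.0, 29.9, 15.2, 14.9, 15.0, 30.2, 15.1, 15.0]
hapcov = percentile(estimates, 50)
```

In this toy case a mid-range hc_percentile is not affected by the two outlying runs, whereas a value close to 100 would pick up the diploid peak instead.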

isomut2py.bayesian.fit_gaussians(estimated_hapcov, chromosomes=None, output_dir=None, cov_max=None, cov_min=None, level=0, cov_sample=None)

Fits a 7-component Gaussian mixture model to the coverage distribution of the sample, using the appropriate attributes of the PloidyEstimation object. The center of the first Gaussian is initialized from a narrow region around the value of the estimated_hapcov attribute. The centers of the other Gaussians are initialized in a region around the value of estimated_hapcov multiplied by consecutive whole numbers.

The parameters of the fitted model (center, sigma and weight) for all seven Gaussians are both saved to the GaussDistParams.pkl file (in output_dir, for later reuse) and set as the value of the distribution_dict attribute.

Parameters:
  • cov_sample – a sample of the coverage distribution of the investigated sample, if None, it is loaded from the temporary files of the output_dir (default: None) (array-like)
  • cov_min – the minimum value of the coverage for a position to be considered in the estimation (default: None) (int)
  • output_dir – the path to the output directory of the PloidyEstimator object, where temporary files are located. If not None, distribution parameters are saved there as GaussDistParams.pkl. (default: None) (str)
  • chromosomes – list of chromosomes for the sample (default: None) (array-like)
  • estimated_hapcov – the estimated value for the haploid coverage, used as prior (float)
  • level – the level of indentation used in verbose output (default: 0) (int)
Returns:

dictionary containing the fitted parameters of the 7 Gaussians
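The initialization described above places the component centers near whole-number multiples of the haploid coverage. A minimal sketch of those starting points follows; `initial_centers` is a hypothetical helper, the starting widths and weights are made-up placeholders, and only the dictionary keys ('mu', 'sigma', 'p') are taken from the description of distribution_dict elsewhere on this page.

```python
def initial_centers(estimated_hapcov, n_components=7):
    """Starting centers at 1x, 2x, ..., n_components x the haploid coverage."""
    return [estimated_hapcov * k for k in range(1, n_components + 1)]


centers = initial_centers(15.0)

# Shape of a parameter dictionary before fitting; sigma and p are placeholders.
start_params = {
    "mu": centers,
    "sigma": [5.0] * len(centers),
    "p": [1 / len(centers)] * len(centers),
}
```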

Functions for ploidy comparison

(These functions perform the comparison of ploidy estimates of two samples for different file formats.)

isomut2py.compare.check_interval_for_difference(chrom, chromStart, chromEnd, ploidy1, ploidy2, original_bamfile1, original_bamfile2, refgenome, prior_dist_dict1, prior_dist_dict2)

Checks if a genomic interval with different estimated ploidies is really different in the two samples.

Parameters:
  • chrom – The chromosome of the genomic interval. (str)
  • chromStart – The starting position of the genomic interval. (int)
  • chromEnd – The ending position of the genomic interval. (int)
  • ploidy1 – The ploidy estimated for the genomic interval for original_bamfile1. (int)
  • ploidy2 – The ploidy estimated for the genomic interval for original_bamfile2. (int)
  • original_bamfile1 – The path to the original bamfile containing alignment information for sample1. (str)
  • original_bamfile2 – The path to the original bamfile containing alignment information for sample2. (str)
  • refgenome – The path to the reference genome fasta file. (str)
  • prior_dist_dict1 – The parameters of the seven Gaussians fitted to the coverage distribution of sample1. (dict)
  • prior_dist_dict2 – The parameters of the seven Gaussians fitted to the coverage distribution of sample2. (dict)
Returns:

the ratio of the likelihood of the two samples having different ploidies in the region to the likelihood of them having the same ploidy (float). If the likelihood of them having different ploidies is smaller than the likelihood of them having the same one, the returned value is 0.
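The return convention can be written out as a small function. This sketch only mirrors the description of the returned value; how isomut2py computes the two likelihoods is not shown, and `difference_score` and its inputs are hypothetical.

```python
def difference_score(likelihood_different, likelihood_same):
    """Likelihood ratio, floored to 0 when the 'same ploidy' model is more likely."""
    if likelihood_same <= 0:
        raise ValueError("likelihood_same must be positive")
    if likelihood_different < likelihood_same:
        return 0.0
    return likelihood_different / likelihood_same


weak = difference_score(0.2, 0.4)    # same ploidy is more likely
strong = difference_score(0.6, 0.2)  # different ploidies are more likely
```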

isomut2py.compare.compare_with_bed(bed_dataframe, other_file, minLen)

Compares the results of ploidy estimation with a bed file defined in other_file.

Parameters:
  • bed_dataframe – a pandas.DataFrame of the bedfile of the sample (pandas.DataFrame)
  • other_file – The path to the bedfile of the other sample. (str)
  • minLen – The minimum length of a region to be considered different from the other_file. (int)
Returns:

df_joined: A pandas.DataFrame containing region information from both the PloidyEstimation object and the other_file.

isomut2py.compare.compare_with_other_PloidyEstimator(ob1, ob2, minLen, minQual)

Compares the estimated ploidies of one PloidyEstimator object (ob1) with the ploidies of another PloidyEstimator object (ob2).

Parameters:
  • ob2 – the other PloidyEstimator object (isomut2py.ploidyestimation.PloidyEstimator)
  • ob1 – the first PloidyEstimator object (isomut2py.ploidyestimation.PloidyEstimator)
  • minLen – The minimum length of a region to be considered different from the other object. (int)
  • minQual – The minimum quality of a region to be considered different from the other object. (float)
Returns:

df_intervals: The differing intervals meeting the above criteria. (pandas.DataFrame)
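The role of minLen and minQual can be pictured as a filter over candidate intervals. The sketch below uses plain dictionaries instead of a pandas.DataFrame. The field names, the definition of length as chromEnd minus chromStart, and the inclusive thresholds are all assumptions made for illustration.

```python
def filter_intervals(intervals, min_len, min_qual):
    """Keep intervals that are long enough and have a high enough quality score."""
    return [
        iv for iv in intervals
        if iv["chromEnd"] - iv["chromStart"] >= min_len and iv["quality"] >= min_qual
    ]


# Made-up candidate intervals with differing ploidy estimates.
candidates = [
    {"chrom": "chr1", "chromStart": 1000, "chromEnd": 60000, "quality": 4.2},
    {"chrom": "chr1", "chromStart": 70000, "chromEnd": 71000, "quality": 9.0},
    {"chrom": "chr2", "chromStart": 5000, "chromEnd": 90000, "quality": 0.5},
]
kept = filter_intervals(candidates, min_len=10000, min_qual=1.0)
```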

Functions for loading example parameter settings

(These functions download example datasets and help load the settings for processing these in a concise way.)

isomut2py.examples.download_example_data(path='.')

Download example data from http://genomics.hu/tools/isomut2py/isomut2py_exampleDataset.tar.gz to path.

Parameters:
  • path – where to download (default: ‘.’) (str)

isomut2py.examples.download_raw_example_data(path='.')

Download raw example data from either http://genomics.hu/tools/isomut2py/isomut2py_rawExampleDataset.tar.gz or http://genomics.hu/tools/isomut2py/isomut2py_rawExampleDataset_shortgenome.tar.gz to path.

Parameters:
  • path – where to download (default: ‘.’) (str)

isomut2py.examples.load_example_mutdet_settings(example_data_path='.', output_dir='.')

Loads example settings for mutation detection.

Parameters:
  • example_data_path – the path where example data is located (default: ‘.’) (str)
  • output_dir – the path where results should be saved (default: current working directory) (str)
Returns:

dictionary containing example parameters

isomut2py.examples.load_example_ploidyest_settings(example_data_path='.', output_dir='.')

Loads example settings for ploidy estimation.

Parameters:
  • example_data_path – the path where example data is located (default: ‘.’) (str)
  • output_dir – the path where results should be saved (default: current working directory) (str)
Returns:

dictionary containing example parameters

isomut2py.examples.load_preprocessed_example_ploidyest_settings(example_data_path='.')

Loads example settings for ploidy estimation with preprocessed files.

Parameters:
  • example_data_path – the path where example data is located (default: ‘.’) (str)
Returns:

dictionary containing example parameters