PloidyEstimator object

class isomut2py.ploidyestimation.PloidyEstimator(**kwargs)

The PloidyEstimator class keeps all parameter values, directories and file paths needed for the ploidy analysis of a single sample in one place.

  • List of basic parameters:
    • ref_fasta: The path to the fasta file of the reference genome. (str)
    • output_dir: The path to a directory that can be used for temporary files and output files. The user must have permission to write to the directory. (str)
    • input_dir: The path to the directory where the bam file(s) of the sample(s) is/are located. (str)
    • bam_filename: The name of the bam file of the sample. (Without path, e.g. “sample_1.bam”.) (str)
    • samtools_fullpath: The path to samtools on the computer. (default: “samtools”) (str)
  • Other parameters with default values:
    • n_min_block: The approximate number of blocks to partition the analysed genome into for parallel computing. The actual number might be slightly larger than this. (default: 200) (int)
    • n_conc_blocks: The number of blocks to process at the same time. (default: 4) (int)
    • chromosomes: The list of chromosomes to analyse. (default: all chromosomes included in the reference genome specified in ref_fasta) (list of str)
    • windowsize: The windowsize used for initial coverage smoothing of the bam file with a moving average method. Setting it too large might disguise CNV effects. (default: 10000) (int)
    • shiftsize: The shiftsize used for the moving average method of the initial coverage smoothing procedure. MUST be smaller than windowsize. (default: 3000) (int)
    • min_noise: The minimum frequency of non-reference or reference bases detected for a position to be considered for LOH detection. Setting it too small will result in poor noise filtering, setting it too large will result in a decreased number of measurement points. (default: 0.1) (float in range(0,1))
    • base_quality_limit: The base quality limit used by samtools in order to decide if a base should be included in the pileup file. (default: 0) (int)
    • print_every_nth: Even though LOH detection is limited to the positions with a noise level larger than min_noise, ploidy estimation is based on all the genomic positions meeting the criteria set above. By setting the attribute print_every_nth, the number of positions used can be controlled. Setting it too large will result in overlooking ploidy variations in shorter genomic ranges, while setting it too small can cause an increase in both memory usage and computation time. Decrease only if a relatively short genome is analysed. (default: 100) (int)
    • windowsize_PE: The windowsize used for actual ploidy estimation after the initial coverage smoothing. (default: 1000000) (int)
    • shiftsize_PE: The shiftsize used for actual ploidy estimation after the initial coverage smoothing. MUST be smaller than windowsize_PE. (default: 50000) (int)
    • cov_max: The maximum coverage in a genomic position that can be considered for ploidy estimation. Alignment errors might cause certain genomic positions to have an enormous coverage. These outliers are ignored when cov_max is set. The value must be set in agreement with the average sequencing depth. Using a low value for a deeply sequenced sample can result in a decreased number of positions to be analysed. (default: 200) (int)
    • cov_min: The minimum coverage in a genomic position that can be considered for ploidy estimation. Sequencing noise might cause certain genomic positions to have very low coverage, frequently merely from misaligned reads. These outliers are ignored when cov_min is set. The value must be set in agreement with the average sequencing depth. Using a high value for a shallowly sequenced sample can result in a decreased number of positions to be analysed. (default: 5) (int)
    • hc_percentile: The haploid coverage of the sequenced sample is estimated multiple times by fitting a mixture model to the coverage distribution of the sample. The actual value is chosen as a statistical measure of these multiple results, set by the value of hc_percentile. For example, setting hc_percentile to 50 results in using the median of the results. For more details on the suggested values, see Pipek et al. 2018. (default: 75) (int)
    • compare_to_bed: The path to a bed file to compare ploidy estimation results to. (default: None) (str)
    • samtools_flags: The samtools flags to be used for pileup file generation. (default: “ -B -d 1000 ”) (str)
    • user_defined_hapcov: During ploidy estimation, the haploid coverage is estimated from the coverage distribution of the sample. In some cases, the estimation might not find the real value of the haploid coverage. In these situations, supplying an estimate of the haploid coverage manually might improve the overall ploidy estimation results. If you have a generally diploid genome, half of the average coverage can be a good starting point. If user_defined_hapcov is set, hc_percentile is ignored. (default: None) (float)
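The constraints stated in the parameter list can be checked before any computation starts. The following is a minimal sketch and not part of isomut2py: check_params is a hypothetical helper, the file paths are placeholders, and the rule that cov_min must be smaller than cov_max is an assumption implied by, but not stated in, the descriptions above. It only assumes that the parameters are collected in a plain dict, which could then be passed to the constructor as keyword arguments.

```python
# Hypothetical helper, not part of isomut2py: checks the documented
# constraints on a plain parameter dict before it is used.

DEFAULTS = {
    "windowsize": 10000,
    "shiftsize": 3000,
    "windowsize_PE": 1000000,
    "shiftsize_PE": 50000,
    "min_noise": 0.1,
    "cov_min": 5,
    "cov_max": 200,
}


def check_params(params):
    """Return a list of human-readable problems (empty if none are found)."""
    p = {**DEFAULTS, **params}
    problems = []
    if not p["shiftsize"] < p["windowsize"]:
        problems.append("shiftsize must be smaller than windowsize")
    if not p["shiftsize_PE"] < p["windowsize_PE"]:
        problems.append("shiftsize_PE must be smaller than windowsize_PE")
    if not 0 < p["min_noise"] < 1:
        problems.append("min_noise must lie between 0 and 1")
    # Assumption: the two coverage limits have to leave a non-empty range.
    if not p["cov_min"] < p["cov_max"]:
        problems.append("cov_min must be smaller than cov_max")
    return problems


# Placeholder paths; replace them with real ones.
params = {
    "ref_fasta": "/path/to/reference.fa",
    "input_dir": "/path/to/bams/",
    "output_dir": "/path/to/output/",
    "bam_filename": "sample_1.bam",
}

print(check_params(params))                         # defaults pass
print(check_params({**params, "shiftsize": 20000}))  # violates the windowsize rule
```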
PE_on_chrom(chrom, **kwargs)

Runs the whole ploidy estimation pipeline on a given chromosome by running PE_on_range() multiple times, using the appropriate attributes of the PloidyEstimator object. Prints the results to the file [self.output_dir]/PE_fullchrom_[chrom].txt.

Parameters:
  • chrom – the name of the chromosome to analyse (str)
  • kwargs – keyword arguments for PloidyEstimator object
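How a chromosome is split into the ranges passed to PE_on_range() is not specified here. One plausible scheme, consistent with the description of n_min_block ("the actual number might be slightly larger"), is sketched below. The function partition_genome and the 0-based, half-open coordinates are assumptions made for illustration, not the library's implementation.

```python
def partition_genome(chrom_lengths, n_min_block=200):
    """Split chromosomes into blocks of equal target size.

    chrom_lengths maps chromosome name to length. The block size is chosen
    so that at least n_min_block blocks are produced; because every
    chromosome ends with a (usually shorter) remainder block, the total is
    typically slightly larger.
    """
    total = sum(chrom_lengths.values())
    block_size = max(1, total // n_min_block)
    blocks = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, block_size):
            blocks.append((chrom, start, min(start + block_size, length)))
    return blocks


# Toy chromosome lengths, chosen only to keep the output small.
blocks = partition_genome({"chr1": 1000, "chr2": 550}, n_min_block=10)
print(len(blocks))  # 11: seven blocks on chr1, four on chr2
```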
PE_on_whole_genome(**kwargs)

Runs the whole ploidy estimation pipeline on the whole genome by running PE_on_chrom() on all chromosomes.

Parameters:kwargs – keyword arguments for PloidyEstimator object
compare_with_other(other, minLen=2000, minQual=0.1)

Compares ploidy estimation results with another PloidyEstimator object or a bed file.

Parameters:
  • other – The other PloidyEstimator object or the path to the other bedfile. (isomut2py.PloidyEstimation or str)
  • minLen – The minimum length of a region to be considered different from the other object or file. (int)
  • minQual – The minimum quality of a region to be considered different from the other object or file. (float)
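The role of minLen can be illustrated with a minimal sketch that compares two ploidy segmentations of a single chromosome. The function differing_regions and the (start, end, ploidy) tuple layout are assumptions for this example only; the quality measure behind minQual is not described here and is therefore not modelled.

```python
def differing_regions(a, b, min_len=2000):
    """Report regions where two segmentations disagree on ploidy.

    a, b: lists of sorted, non-overlapping (start, end, ploidy) tuples
    with half-open coordinates. Returns (start, end, ploidy_a, ploidy_b)
    tuples that are at least min_len long.
    """
    # Every position where either segmentation may change ploidy.
    cuts = sorted({x for s, e, _ in a + b for x in (s, e)})

    def ploidy_at(segments, pos):
        for s, e, p in segments:
            if s <= pos < e:
                return p
        return None  # position not covered by this segmentation

    regions = []
    for left, right in zip(cuts, cuts[1:]):
        pa, pb = ploidy_at(a, left), ploidy_at(b, left)
        if pa is None or pb is None or pa == pb:
            continue
        # Extend the previous region if it is adjacent and differs the same way.
        if regions and regions[-1][1] == left and regions[-1][2:] == (pa, pb):
            regions[-1] = (regions[-1][0], right, pa, pb)
        else:
            regions.append((left, right, pa, pb))
    return [r for r in regions if r[1] - r[0] >= min_len]


own = [(0, 10000, 2), (10000, 15000, 3), (15000, 20000, 2),
       (20000, 20500, 1), (20500, 30000, 2)]
other = [(0, 30000, 2)]

# The 500 bp difference at 20000-20500 is shorter than min_len and is dropped.
print(differing_regions(own, other))  # [(10000, 15000, 3, 2)]
```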
estimate_hapcov_infmix(level=0, **kwargs)

Estimates the haploid coverage of the sample from the appropriate attributes of the PloidyEstimator object. If the user_defined_hapcov attribute is set manually, estimated_hapcov is set to that value. Otherwise, a many-component (20) Gaussian mixture model is fitted to the coverage histogram of the sample 10 times. Each time, the haploid coverage is estimated from the center of the component with the maximal weight in the model. The final estimate of the haploid coverage is calculated as the qth percentile of the 10 measurements, with q = hc_percentile. Sets the “estimated_hapcov” attribute to the calculated haploid coverage and the “coverage_sample” attribute to a 2000-element sample of the coverage distribution.

Parameters:
  • level – the level of indentation used in verbose output (default: 0) (int)
  • kwargs – keyword arguments for PloidyEstimator object
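The final aggregation step described above (taking the qth percentile of the repeated estimates, unless a user-defined value overrides it) can be sketched as follows. The mixture fitting itself is omitted; the ten values are made up, choose_hapcov is a hypothetical name, and linear interpolation between ranks is an assumption, since the percentile method is not specified here.

```python
import math


def percentile(values, q):
    """q-th percentile (0..100), linearly interpolated between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def choose_hapcov(estimates, hc_percentile=75, user_defined_hapcov=None):
    """Pick the haploid coverage from repeated estimates."""
    # A manually supplied value takes precedence; hc_percentile is ignored.
    if user_defined_hapcov is not None:
        return user_defined_hapcov
    return percentile(estimates, hc_percentile)


# Made-up results of 10 fits; two of them locked onto the diploid peak.
estimates = [15, 15, 16, 14, 30, 15, 16, 15, 14, 31]

print(choose_hapcov(estimates, hc_percentile=50))             # median
print(choose_hapcov(estimates))                               # default, q = 75
print(choose_hapcov(estimates, user_defined_hapcov=12.5))     # manual override
```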
fit_gaussians(level=0, **kwargs)

Fits a 7-component Gaussian mixture model to the coverage distribution of the sample, using the appropriate attributes of the PloidyEstimator object. The center of the first Gaussian is initialized from a narrow region around the value of the estimated_hapcov attribute. The centers of the other Gaussians are initialized in a region around the value of estimated_hapcov multiplied by consecutive whole numbers.

The parameters of the fitted model (center, sigma and weight) for all seven Gaussians are both saved to the GaussDistParams.pkl file (in output_dir, for later reuse) and set as the value of the distribution_dict attribute.

Parameters:
  • level – the level of indentation used in verbose output (default: 0) (int)
  • kwargs – keyword arguments for PloidyEstimator object
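The initialization scheme of fit_gaussians can be pictured with a short sketch that lists, for each of the seven components, the integer multiple of estimated_hapcov and a surrounding interval. The function name and both relative widths (narrow_width, wide_width) are illustrative assumptions; the widths actually used are not documented here.

```python
def initial_center_ranges(estimated_hapcov, n_components=7,
                          narrow_width=0.05, wide_width=0.25):
    """Return (center, low, high) for each Gaussian component.

    Component k (k = 1..n_components) is centered on k * estimated_hapcov.
    The first component gets a narrow interval, the others a wider one.
    Both widths are assumed values, relative to estimated_hapcov.
    """
    ranges = []
    for k in range(1, n_components + 1):
        center = k * estimated_hapcov
        rel = narrow_width if k == 1 else wide_width
        half = rel * estimated_hapcov
        ranges.append((center, center - half, center + half))
    return ranges


for center, low, high in initial_center_ranges(20):
    print(center, low, high)
```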
generate_HTML_report_for_ploidy_est(**kwargs)

Generates an HTML file with figures displaying the results of ploidy estimation and saves it to output_dir/PEreport.html.

Parameters:kwargs – keyword arguments for PloidyEstimator object
get_bed_format_for_sample(**kwargs)

Creates a bed file of constant ploidies for a given sample from a file of positional ploidy data. If the ownbed_filepath attribute of the PloidyEstimator object is set, saves the bedfile to the path specified there. Otherwise, saves it to the output_dir with the “_ploidy.bed” suffix. Also sets the bed_dataframe attribute to the pandas.DataFrame containing the bed file.

Parameters:kwargs – keyword arguments for PloidyEstimator object
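Collapsing positional ploidy data into regions of constant ploidy is essentially a run-length encoding. A minimal sketch, assuming the input is a sorted sequence of (chrom, pos, ploidy) rows; the row layout, the function name and the choice of the last observed position as region end are assumptions, not the library's file format.

```python
from itertools import groupby


def positions_to_bed(rows):
    """Merge consecutive positions with equal ploidy into regions.

    rows: iterable of (chrom, pos, ploidy), sorted by chrom and then pos.
    Returns a list of (chrom, start, end, ploidy) tuples.
    """
    bed = []
    # groupby only merges adjacent rows, so a ploidy that reappears later
    # on the same chromosome starts a new region.
    for (chrom, ploidy), group in groupby(rows, key=lambda r: (r[0], r[2])):
        positions = [pos for _, pos, _ in group]
        bed.append((chrom, positions[0], positions[-1], ploidy))
    return bed


rows = [
    ("chr1", 100, 2), ("chr1", 200, 2),
    ("chr1", 300, 3), ("chr1", 400, 3),
    ("chr1", 500, 2),
    ("chr2", 100, 2),
]
for region in positions_to_bed(rows):
    print(*region, sep="\t")
```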
get_coverage_distribution(**kwargs)

Sets the coverage_sample attribute of the PloidyEstimator object to the coverage distribution obtained from the temporary files created by __PE_prepare_temp_files(). Positions are filtered according to the attributes of the PloidyEstimator object. The number of positions in the final sample is decreased to 2000 for faster inference.

Parameters:kwargs – keyword arguments for PloidyEstimator object
Returns:A 2000-element sample of the coverage distribution.
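The filter-then-subsample idea can be sketched with the standard library alone. This is an illustration under assumptions: only the cov_min and cov_max filters are modelled, the bounds are treated as inclusive, and sample_coverage and its seed argument are hypothetical names.

```python
import random


def sample_coverage(coverages, cov_min=5, cov_max=200, n_sample=2000, seed=None):
    """Filter coverage values and draw at most n_sample of them."""
    # Assumption: both limits are inclusive.
    kept = [c for c in coverages if cov_min <= c <= cov_max]
    if len(kept) <= n_sample:
        return kept
    # Sampling without replacement keeps the shape of the distribution
    # while bounding the cost of the later mixture fits.
    return random.Random(seed).sample(kept, n_sample)


# Synthetic coverage values, including outliers below 5 and above 200.
coverages = list(range(0, 300)) * 20
sample = sample_coverage(coverages, seed=0)
print(len(sample), min(sample) >= 5, max(sample) <= 200)
```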
load_bedfile_from_file(filename=None, **kwargs)

Loads the bedfile containing previous ploidy estimates for the given sample from the path specified in filename. The dataframe will be stored in the “bed_dataframe” attribute.

Parameters:
  • filename – The path to the bedfile. (default: [output_dir]/[bam_filename]_ploidy.bed) (str)
  • kwargs – keyword arguments for PloidyEstimator object
load_cov_distribution_parameters_from_file(filename=None, **kwargs)

Loads the parameters of the seven Gaussians fitted to the coverage distribution of the sample from the specified file (saved with pickle beforehand). If such a file is available, the computationally expensive ploidy estimation process can be skipped. The parameter values will be stored in the attribute “distribution_dict” as a dictionary.

Parameters:
  • filename – The path to the file with the coverage distribution parameters. (default: [output_dir]/GaussDistParams.pkl) (str)
  • kwargs – keyword arguments for PloidyEstimator object
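The "reuse if present" pattern behind this method can be sketched with the standard pickle module. The helper load_or_none is hypothetical, and the dictionary keys in the demonstration are made up; the actual layout of distribution_dict is not documented here. Only the default file name GaussDistParams.pkl is taken from the description above.

```python
import os
import pickle
import tempfile


def load_or_none(output_dir, filename=None):
    """Load pickled distribution parameters, or return None if absent."""
    path = filename or os.path.join(output_dir, "GaussDistParams.pkl")
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Demonstration in a throwaway directory with a made-up dictionary layout.
with tempfile.TemporaryDirectory() as output_dir:
    missing = load_or_none(output_dir)  # nothing has been saved yet
    fake_params = {"mu": [15.0, 30.0], "sigma": [2.0, 3.0], "p": [0.2, 0.8]}
    with open(os.path.join(output_dir, "GaussDistParams.pkl"), "wb") as handle:
        pickle.dump(fake_params, handle)
    loaded = load_or_none(output_dir)

print(missing, loaded)
```

Because unpickling can execute arbitrary code, only files produced by your own earlier runs should be loaded this way.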
plot_coverage_distribution(**kwargs)

Plot the coverage distribution of the sample.

Parameters:kwargs – keyword arguments for PloidyEstimator object
Returns:a matplotlib figure of the coverage distribution
plot_karyotype_for_all_chroms(return_string=False, **kwargs)

Plots karyotype information (coverage, estimated ploidy, estimated LOH, reference base frequencies) about the sample for all analysed chromosomes.

Parameters:
  • return_string – If True, only a temporary plot is generated and its base64 code, which can be included in HTML files, is returned. (default: False) (bool)
  • kwargs – keyword arguments for PloidyEstimator object
Returns:

If the return_string value is True, a list of base64 encoded strings of the images. Otherwise, a list of matplotlib figures.
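A base64 encoded image can be embedded in an HTML page through a data URI. The sketch below shows the general technique with the standard library; whether the strings returned by this method already include the data URI prefix is not stated here, so the helper img_tag_from_png_bytes and the placeholder bytes are assumptions for illustration.

```python
import base64


def img_tag_from_png_bytes(png_bytes):
    """Wrap raw PNG bytes in an <img> tag using a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return '<img src="data:image/png;base64,{}">'.format(encoded)


# Placeholder: only the 8-byte PNG signature, not a complete image.
png_bytes = b"\x89PNG\r\n\x1a\n"
tag = img_tag_from_png_bytes(png_bytes)
print(tag)
```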

plot_karyotype_summary(**kwargs)

Plots a simple karyotype summary for the whole genome. (Details coming soon.)

Parameters:kwargs – keyword arguments for PloidyEstimator object
Returns:a matplotlib figure of the plot
run_ploidy_estimation(**kwargs)

Runs the whole ploidy estimation pipeline on the PloidyEstimator object.

Parameters:kwargs – keyword arguments for PloidyEstimator object