MutationCaller object

class isomut2py.mutationcalling.MutationCaller(**kwargs)

The MutationCaller class keeps all parameter values, directories and file paths needed for the mutation detection and postprocessing of one or more samples in a single place.

  • List of basic parameters:
    • ref_fasta: The path to the fasta file of the reference genome. (str)
    • output_dir: The path to a directory that can be used for temporary files and output files. The user must have write permission for this directory. (str)
    • input_dir: The path to the directory where the bam file(s) of the sample(s) is/are located. (str)
    • bam_filename: A list of the name(s) of the bam file(s) of the sample(s), without path, e.g. [“sample_1.bam”, “sample_2.bam”, “sample_3.bam”, …]. (list of str)
    • samtools_fullpath: The path to samtools on the computer. (default: “samtools”) (str)
  • Other parameters with default values:
    • n_min_block: The approximate number of blocks to partition the analysed genome into for parallel computing. The actual number might be slightly larger than this. (default: 200) (int)
    • n_conc_blocks: The number of blocks to process at the same time. (default: 4) (int)
    • chromosomes: The list of chromosomes to analyse. (default: all chromosomes included in the reference genome specified in ref_fasta) (list of str)
    • base_quality_limit: The base quality limit used by samtools in order to decide if a base should be included in the pileup file. (default: 30) (int)
    • samtools_flags: The samtools flags to be used for pileup file generation. (default: “ -B -d 1000 ”) (str)
    • unique_mutations_only: If True, only those mutations are sought that are unique to specific samples. Setting it to False greatly increases computation time, but allows for the detection of shared mutations and thus the analysis of phylogenetic connections between samples. If True, print_shared_by_all is ignored. (default: True) (boolean)
    • print_shared_by_all: If False, mutations that are present in all analysed samples are not printed to the output files. This decreases both memory usage and computation time. (default: False) (boolean)
    • min_sample_freq: The minimum frequency of the mutated base at a given position in the mutated sample(s). (default: 0.21) (float)
    • min_other_ref_freq: The minimum frequency of the reference base at a given position in the non-mutated sample(s). (default: 0.95) (float)
    • cov_limit: The minimum coverage at a given position in the mutated sample(s). (default: 5) (float)
    • min_gap_dist_snv: Minimum genomic distance from an identified SNV for a position to be considered as a potential mutation. (default: 0) (int)
    • min_gap_dist_indel: Minimum genomic distance from an identified indel for a position to be considered as a potential mutation. (default: 20) (int)
    • use_local_realignment: By default, mutation detection is run only once, with the samtools mpileup command with option -B. This turns off the probabilistic realignment of reads while creating the temporary pileup file, which might result in false positives due to alignment error. To filter out these mutations, setting the above parameter to True runs the whole mutation detection pipeline again for possibly mutated positions without the -B option as well, and only those mutations are kept that are still present with the probabilistic realignment turned on. Setting use_local_realignment = True increases runtime. (default: False) (boolean)
    • ploidy_info_filepath: Path to the file containing ploidy information (see below for details) of the samples. (default: None) (str)
    • constant_ploidy: The default ploidy to be used in regions not contained in ploidy_info_filepath. (default: 2) (int)
    • bedfile: Path to a bedfile containing a list of genomic regions, if mutations are only sought in a limited part of the genome. (default: None) (str)
    • control_samples: list of the bam filenames of the control samples. Control samples are not expected to have any unique mutations, thus they can be used to optimize the results. (default: None) (array-like)
    • FPs_per_genome: the maximum number of false positive mutations allowed in a control sample (default: None) (int)
    • mutations: a pandas DataFrame of mutations found in the sample set (default: None) (pandas.DataFrame)
    • chrom_length: the length of chromosomes (default: not set) (array-like)
    • genome_length: the total length of the genome (default: not set) (int)
    • IDspectra: a dictionary containing the indel spectra of all the samples in the dataset (default: not set) (dict with keys=bam_filename and values=array-like)
    • DNVspectra: a dictionary containing the dinucleotide variation spectra of all the samples in the dataset (default: not set) (dict with keys=bam_filename and values=array-like)
    • SNVspectra: a dictionary containing the single nucleotide variation spectra of all the samples in the dataset (default: not set) (dict with keys=bam_filename and values=array-like)
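
As a minimal sketch of how these parameters might be collected before creating the object: the helper build_caller_kwargs below is hypothetical (it is not part of isomut2py), and all paths are placeholders. It only assembles a keyword dictionary using the parameter names listed above; the commented lines show how such a dictionary would presumably be passed to the constructor.

```python
def build_caller_kwargs(ref_fasta, output_dir, input_dir, bam_filename, **overrides):
    """Collect MutationCaller keyword arguments in one dictionary.

    Hypothetical helper for illustration; key names follow the parameter list above.
    """
    if isinstance(bam_filename, str):
        # bam_filename is documented as a list of file names, so wrap a single name.
        bam_filename = [bam_filename]
    params = {
        "ref_fasta": ref_fasta,
        "output_dir": output_dir,
        "input_dir": input_dir,
        "bam_filename": list(bam_filename),
    }
    # Optional parameters (e.g. n_conc_blocks, min_sample_freq) replace the defaults.
    params.update(overrides)
    return params


params = build_caller_kwargs(
    ref_fasta="/path/to/reference.fa",   # placeholder path
    output_dir="/path/to/output",        # placeholder path
    input_dir="/path/to/bams",           # placeholder path
    bam_filename=["sample_1.bam", "sample_2.bam"],
    n_conc_blocks=8,
    unique_mutations_only=True,
)

# With isomut2py installed, the dictionary could then be used like this:
# import isomut2py
# caller = isomut2py.mutationcalling.MutationCaller(**params)
# caller.run_isomut2_mutdet()
```
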
calculate_DNV_matrix(unique_only=True, **kwargs)

Calculates the dinucleotide variation spectrum in matrix format from the dataframe of relevant mutations, using the fasta file of the reference genome. Results are stored in the DNVmatrice attribute of the object.

Parameters:
  • unique_only – If True, only unique mutations are plotted for each sample. (default: True) (boolean)
  • kwargs

    keyword arguments for MutationCaller attributes

    • mutations_dataframe: A pandas DataFrame where mutations are listed. (default: not set) (pandas.DataFrame)
    • sample_names: A list of sample names to plot results for, a subset of the list in “bam_filename” attribute. (default: not set) (list)
    • mutations_filename: The path to the file where mutations are stored. (default: not set) (str)
calculate_DNV_spectrum(unique_only=True, **kwargs)

Calculates the dinucleotide variation spectrum from the dataframe of relevant mutations, using the fasta file of the reference genome. Results are stored in the DNVspectra attribute of the object.

Parameters:
  • unique_only – If True, only unique mutations are plotted for each sample. (default: True) (boolean)
  • kwargs

    keyword arguments for MutationCaller attributes

    • mutations_dataframe: A pandas DataFrame where mutations are listed. (default: not set) (pandas.DataFrame)
    • sample_names: A list of sample names to plot results for, a subset of the list in “bam_filename” attribute. (default: not set) (list)
    • mutations_filename: The path to the file where mutations are stored. (default: not set) (str)
calculate_SNV_spectrum(unique_only=True, **kwargs)

Calculates the triplet spectrum from the dataframe of relevant mutations, using the fasta file of the reference genome. Results are stored in the SNVspectra attribute of the object.

Parameters:
  • unique_only – If True, only unique mutations are plotted for each sample. (default: True) (boolean)
  • kwargs

    keyword arguments for MutationCaller attributes

    • mutations_dataframe: A pandas DataFrame where mutations are listed. (default: not set) (pandas.DataFrame)
    • sample_names: A list of sample names to plot results for, a subset of the list in “bam_filename” attribute. (default: not set) (list)
    • mutations_filename: The path to the file where mutations are stored. (default: not set) (str)
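
To illustrate what a triplet spectrum counts, here is a minimal sketch that assumes the common 96-channel convention: each SNV is labelled by its substitution and its two flanking reference bases, reported on the strand where the reference base is a pyrimidine. The function names are hypothetical, and the channel labels and ordering used internally by isomut2py may differ.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def triplet_channel(ref_seq, pos, alt):
    """Return a label such as 'A[C>T]G' for an SNV at 0-based index pos.

    Positions at the very ends of ref_seq (no flanking base) are not handled.
    """
    context = ref_seq[pos - 1:pos + 2].upper()
    alt = alt.upper()
    if context[1] in "AG":
        # Report the change on the strand where the reference base is a pyrimidine.
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def snv_spectrum(ref_seq, snvs):
    """Count SNVs per triplet channel; snvs is an iterable of (pos, alt) pairs."""
    return Counter(triplet_channel(ref_seq, pos, alt) for pos, alt in snvs)


reference = "ACGTTGCA"  # toy reference sequence
spectrum = snv_spectrum(reference, [(1, "T"), (2, "A"), (5, "A")])
```

In this toy example the C>T change at index 1 and the G>A change at index 2 fall into the same channel, because the second one is reported on the opposite strand.
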
calculate_and_plot_DNV_heatmap(unique_only=True, return_string=False, **kwargs)

Calculates and plots the DNV spectra as a heatmap of the samples listed in attribute bam_filename.

Parameters:
  • unique_only – If True, only unique mutations are plotted for each sample. (default: True) (boolean)
  • return_string – If True, only a temporary plot is generated and its base64 code is returned, that can be included in HTML files. (default: False) (bool)
  • kwargs – keyword arguments for MutationCaller attributes
Returns:

If the return_string value is True, a list of base64 encoded strings of the images. Otherwise, a list of matplotlib figures.
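
A minimal sketch of how a base64 string of this kind could be embedded in an HTML page. It assumes that the returned strings hold bare base64 data of a PNG image (the image format and the absence of a data-URI prefix are assumptions, not stated above); the helper names and the placeholder bytes are hypothetical.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of any PNG file


def encode_image(image_bytes):
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def html_img_tag(encoded_png, alt_text="plot"):
    """Wrap a base64 PNG string in an <img> tag using a data URI."""
    return f'<img alt="{alt_text}" src="data:image/png;base64,{encoded_png}">'


# Placeholder bytes stand in for a real figure saved as PNG.
encoded = encode_image(PNG_SIGNATURE + b"placeholder image data")
tag = html_img_tag(encoded)
```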

calculate_and_plot_DNV_spectrum(unique_only=True, return_string=False, normalize_to_1=False, **kwargs)

Calculates and plots the DNV spectra of the samples listed in attribute bam_filename.

Parameters:
  • unique_only – If True, only unique mutations are plotted for each sample. (default: True) (boolean)
  • normalize_to_1 – If True, results are plotted as percentages, instead of counts. (default: False) (bool)
  • return_string – If True, only a temporary plot is generated and its base64 code is returned, that can be included in HTML files. (default: False) (bool)
  • kwargs – keyword arguments for MutationCaller attributes
Returns:

If the return_string value is True, a list of base64 encoded strings of the images. Otherwise, a list of matplotlib figures.
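
The effect of normalize_to_1 can be pictured with a small sketch: counts per mutation class are divided by their total, so that the values of one sample sum to 1. This is an illustration under that assumption only; the function name and the example classes are hypothetical, and the scaling applied by the plotting code may differ (for example by a factor of 100).

```python
def normalize_counts(counts):
    """Scale a {category: count} mapping so that its values sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # An empty spectrum stays all-zero instead of dividing by zero.
        return {category: 0.0 for category in counts}
    return {category: value / total for category, value in counts.items()}


dnv_counts = {"AC>CA": 3, "AC>CG": 1, "CC>TT": 4}  # toy counts
dnv_fractions = normalize_counts(dnv_counts)
```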

calculate_and_plot_SNV_spectrum(unique_only=True, normalize_to_1=False, return_string=False, **kwargs)

Calculates and plots the SNV spectra of the samples listed in attribute bam_filename.

Parameters:
  • unique_only – If True, only unique mutations are plotted for each sample. (default: True) (boolean)
  • normalize_to_1 – If True, results are plotted as percentages, instead of counts. (default: False) (bool)
  • return_string – If True, only a temporary plot is generated and its base64 code is returned, that can be included in HTML files. (default: False) (bool)
  • kwargs – keyword arguments for MutationCaller attributes
Returns:

If the return_string value is True, a list of base64 encoded strings of the images. Otherwise, a list of matplotlib figures.

calculate_and_plot_indel_spectrum(unique_only=True, normalize_to_1=False, return_string=False, **kwargs)

Calculates and plots the indel spectra of the samples listed in attribute bam_filename.

Parameters:
  • unique_only – If True, only unique mutations are plotted for each sample. (default: True) (boolean)
  • normalize_to_1 – If True, results are plotted as percentages, instead of counts. (default: False) (bool)
  • return_string – If True, only a temporary plot is generated and its base64 code is returned, that can be included in HTML files. (default: False) (bool)
  • kwargs – keyword arguments for MutationCaller attributes
Returns:

If the return_string value is True, a list of base64 encoded strings of the images. Otherwise, a list of matplotlib figures.

calculate_indel_spectrum(unique_only=True, **kwargs)

Calculates the indel spectrum from the dataframe of relevant mutations, using the fasta file of the reference genome. Results are stored in the IDspectra attribute of the object.

Parameters:
  • unique_only – If True, only unique mutations are plotted for each sample. (default: True) (boolean)
  • kwargs – keyword arguments for MutationCaller attributes
check_pileup(chrom_list, from_pos_list, to_pos_list, print_original=True, savetofile=None, **kwargs)

Loads pileup information for a list of genomic regions.

Parameters:
  • chrom_list – List of chromosomes for the regions. (list of str)
  • from_pos_list – List of starting positions for the regions. (list of int)
  • to_pos_list – List of ending positions for the regions. (list of int)
  • print_original – If True, prints the original string generated with samtools mpileup as well (default: True) (bool)
  • savetofile – If savetofile is specified, results will be written to that path. Otherwise, results are written to [output_dir]/checkPileup_tmp.csv (default: None) (str)
Returns:

df: The processed results of the pileup, containing information on all samples in a pandas.DataFrame.
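
Since check_pileup takes three parallel lists, regions that are stored as (chromosome, start, end) tuples need to be split first. The sketch below shows one way to do this; regions_to_lists is a hypothetical helper and the coordinates are made up. The commented call assumes an existing MutationCaller object named caller.

```python
def regions_to_lists(regions):
    """Split (chrom, start, end) tuples into the three lists check_pileup expects."""
    chrom_list, from_pos_list, to_pos_list = [], [], []
    for chrom, start, end in regions:
        if start > end:
            raise ValueError(f"start is after end for region {chrom}:{start}-{end}")
        chrom_list.append(chrom)
        from_pos_list.append(start)
        to_pos_list.append(end)
    return chrom_list, from_pos_list, to_pos_list


regions = [("chr1", 1000, 1050), ("chr2", 500, 520)]  # made-up regions
chrom_list, from_pos_list, to_pos_list = regions_to_lists(regions)

# With a configured MutationCaller object:
# df = caller.check_pileup(chrom_list, from_pos_list, to_pos_list, print_original=False)
```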

decompose_DNV_spectra(sample_names=None, unique_only=True, signatures_file=None, equal_initial_proportions=False, tol=0.0001, max_iter=1000, filter_percent=0, filter_count=0, keep_top_n=None, use_signatures=None, ignore_signatures=None, **kwargs)

Run the whole pipeline of decomposing DNV spectra for the samples specified in sample_names.

Parameters:
  • sample_names – The list of sample names analysed. (list of str)
  • unique_only – If True, only unique mutations are used to construct the original spectrum.
  • signatures_file – The path to the csv file containing the signature matrix. (str)
  • equal_initial_proportions – If True, initial weights are set to be equal for all reference signatures. Otherwise, they are initialized so that signatures with a large cosine similarity to the original spectrum get larger weights. (default: False) (bool)
  • tol – The maximal difference between the proportions for convergence to be True. (default: 0.0001) (float)
  • max_iter – The maximum number of iterations (default: 1000) (int)
  • filter_percent – Filter out signatures that contribute less than filter_percent of the mutations in the sample. (default: 0) (float)
  • filter_count – Filter out signatures that contribute fewer than filter_count mutations in the sample. (default: 0) (int)
  • keep_top_n – Only keep the keep_top_n signatures that contribute the most mutations to the mixture. (default: None) (int)
  • use_signatures – Use a specific subset of all signatures. A list of signature names. (default: None) (list of str)
  • ignore_signatures – Exclude a specific subset of all signatures from the analysis. A list of signature names. (default: None) (list of str)
  • kwargs – keyword arguments for MutationCaller object attributes
Returns:

The final proportions of all signatures in the mixture. (Filtered out or not used signatures appear with a proportion of zero.) (numpy.array)
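
The role of equal_initial_proportions can be sketched as follows. The text above only states that signatures with a large cosine similarity to the original spectrum get larger initial weights; weighting them proportionally to the similarity, as done here, is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally long numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def initial_proportions(spectrum, signatures, equal_initial_proportions=False):
    """Return one starting weight per signature; the weights sum to 1."""
    n = len(signatures)
    if equal_initial_proportions:
        return [1.0 / n] * n
    similarities = [cosine_similarity(spectrum, sig) for sig in signatures]
    total = sum(similarities)
    if total == 0:
        # No signature resembles the spectrum: fall back to equal weights.
        return [1.0 / n] * n
    # Assumption: weights proportional to cosine similarity.
    return [s / total for s in similarities]


spectrum = [4, 0, 0]                      # toy three-channel spectrum
signatures = [[1, 0, 0], [0, 1, 0]]       # toy reference signatures
weights = initial_proportions(spectrum, signatures)
```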

decompose_SNV_spectra(sample_names=None, unique_only=True, signatures_file=None, equal_initial_proportions=False, tol=0.0001, max_iter=1000, filter_percent=0, filter_count=0, keep_top_n=None, use_signatures=None, ignore_signatures=None, **kwargs)

Run the whole pipeline of decomposing SNV spectra for the samples specified in sample_names.

Parameters:
  • sample_names – The list of sample names analysed. (list of str)
  • unique_only – If True, only unique mutations are used to construct the original spectrum.
  • signatures_file – The path to the csv file containing the signature matrix. (str)
  • equal_initial_proportions – If True, initial weights are set to be equal for all reference signatures. Otherwise, they are initialized so that signatures with a large cosine similarity to the original spectrum get larger weights. (default: False) (bool)
  • tol – The maximal difference between the proportions for convergence to be True. (default: 0.0001) (float)
  • max_iter – The maximum number of iterations (default: 1000) (int)
  • filter_percent – Filter out signatures that contribute less than filter_percent of the mutations in the sample. (default: 0) (float)
  • filter_count – Filter out signatures that contribute fewer than filter_count mutations in the sample. (default: 0) (int)
  • keep_top_n – Only keep the keep_top_n signatures that contribute the most mutations to the mixture. (default: None) (int)
  • use_signatures – Use a specific subset of all signatures. A list of signature names. (default: None) (list of str)
  • ignore_signatures – Exclude a specific subset of all signatures from the analysis. A list of signature names. (default: None) (list of str)
  • kwargs – keyword arguments for MutationCaller object attributes
Returns:

The final proportions of all signatures in the mixture. (Filtered out or not used signatures appear with a proportion of zero.) (numpy.array)
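
How filter_percent, filter_count and keep_top_n might interact can be sketched on a finished set of proportions. Several points are assumptions: filter_percent is treated as a value between 0 and 100, the remaining proportions are simply rescaled instead of being re-fitted, and the function and signature names are hypothetical. Filtered signatures keep a proportion of zero, as described above.

```python
def filter_signatures(proportions, n_mutations, filter_percent=0,
                      filter_count=0, keep_top_n=None):
    """Zero out weak signatures in a {name: proportion} mapping and rescale."""
    kept = dict(proportions)
    for name, proportion in proportions.items():
        too_small_share = proportion * 100 < filter_percent
        too_few_mutations = proportion * n_mutations < filter_count
        if too_small_share or too_few_mutations:
            kept[name] = 0.0
    if keep_top_n is not None:
        # Rank by proportion and drop everything after the top keep_top_n.
        ranked = sorted(kept, key=kept.get, reverse=True)
        for name in ranked[keep_top_n:]:
            kept[name] = 0.0
    total = sum(kept.values())
    if total > 0:
        # Assumption: rescale so that the remaining proportions sum to 1.
        kept = {name: value / total for name, value in kept.items()}
    return kept


proportions = {"sig_A": 0.6, "sig_B": 0.38, "sig_C": 0.02}  # toy values
filtered = filter_signatures(proportions, n_mutations=200, filter_percent=5)
top_one = filter_signatures(proportions, n_mutations=200, keep_top_n=1)
```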

decompose_indel_spectra(sample_names=None, unique_only=True, signatures_file=None, equal_initial_proportions=False, tol=0.0001, max_iter=1000, filter_percent=0, filter_count=0, keep_top_n=None, use_signatures=None, ignore_signatures=None, **kwargs)

Run the whole pipeline of decomposing indel spectra for the samples specified in sample_names.

Parameters:
  • sample_names – The list of sample names analysed. (list of str)
  • unique_only – If True, only unique mutations are used to construct the original spectrum.
  • signatures_file – The path to the csv file containing the signature matrix. (str)
  • equal_initial_proportions – If True, initial weights are set to be equal for all reference signatures. Otherwise, they are initialized so that signatures with a large cosine similarity to the original spectrum get larger weights. (default: False) (bool)
  • tol – The maximal difference between the proportions for convergence to be True. (default: 0.0001) (float)
  • max_iter – The maximum number of iterations (default: 1000) (int)
  • filter_percent – Filter out signatures that contribute less than filter_percent of the mutations in the sample. (default: 0) (float)
  • filter_count – Filter out signatures that contribute fewer than filter_count mutations in the sample. (default: 0) (int)
  • keep_top_n – Only keep the keep_top_n signatures that contribute the most mutations to the mixture. (default: None) (int)
  • use_signatures – Use a specific subset of all signatures. A list of signature names. (default: None) (list of str)
  • ignore_signatures – Exclude a specific subset of all signatures from the analysis. A list of signature names. (default: None) (list of str)
  • kwargs – keyword arguments for MutationCaller object attributes
Returns:

The final proportions of all signatures in the mixture. (Filtered out or not used signatures appear with a proportion of zero.) (numpy.array)

get_details_for_mutations(mutations_filename=None, **kwargs)

Get detailed results for the list of mutations contained in the mutations attribute of the object.

Parameters:
  • mutations_filename – The path(s) to the file(s) where mutations are stored. (default: None) (list of str)
  • kwargs

    keyword arguments for MutationCaller object attributes

    • mutations_dataframe: A pandas DataFrame where mutations are listed. (default: not set) (pandas.DataFrame)
Returns:

df_joined: A dataframe containing the detailed results. (pandas.DataFrame)

load_mutations(filename=None)

Loads mutations from a file or a list of files into the mutations attribute.

Parameters:
  • filename – The path to the file where mutations are stored. A list of paths can also be supplied; in this case, all of them will be loaded into a single dataframe. The mutations attribute of the MutationCaller object will be set to the loaded dataframe. If None, the file [output_dir]/filtered_mutations.csv will be loaded. (default: None) (str or list of str)
optimize_results(control_samples, FPs_per_genome, **kwargs)

Optimizes the list of detected mutations according to the list of control samples and the level of false positives tolerated by the user. Filtered results will be loaded into the mutations attribute of the MutationCaller object as a pandas.DataFrame. Optimized values for the score are stored in the attribute optimized_score_values.

Parameters:
  • control_samples – List of sample names to be used as control samples, in the sense that no unique mutations are expected in them. (The sample names listed here must be a subset of the sample names listed in the attribute bam_filename.) (list of str)
  • FPs_per_genome – The total number of false positives tolerated in a control sample. (int)
  • kwargs

    possible keyword arguments besides MutationCaller attributes:

    • plot_roc_curve: If True, ROC curves will be plotted as a visual representation of the optimization process. (default: False) (boolean)
    • plot_tuning_curve: If True, tuning curves displaying the number of mutations found in different samples with different score filters will be plotted as a visual representation of the optimization process. (default: False) (boolean)
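
The idea behind this optimization can be pictured with a deliberately simplified sketch: pick the lowest score threshold at which no control sample keeps more than FPs_per_genome mutations. This is not the procedure implemented in isomut2py, which is presumably more involved; the function name, the sample names and the scores are all made up for illustration.

```python
def choose_score_threshold(scores_by_sample, control_samples, fps_per_genome):
    """Return the lowest score cutoff that leaves at most fps_per_genome
    mutations in every control sample (mutations with score >= cutoff are kept)."""
    candidates = sorted({s for scores in scores_by_sample.values() for s in scores})
    for threshold in candidates:
        worst_control = max(
            sum(score >= threshold for score in scores_by_sample[control])
            for control in control_samples
        )
        if worst_control <= fps_per_genome:
            return threshold
    # No observed score is strict enough: reject everything.
    return float("inf")


scores_by_sample = {
    "control.bam": [0.5, 1.2, 3.0],        # made-up mutation scores
    "treated.bam": [2.5, 4.0, 6.1, 7.3],
}
threshold = choose_score_threshold(scores_by_sample, ["control.bam"], fps_per_genome=1)
kept_in_treated = [s for s in scores_by_sample["treated.bam"] if s >= threshold]
```
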
plot_DNV_heatmap(return_string=False, **kwargs)

Plots the DNV spectra as a heatmap for the samples in attribute bam_filename.

Parameters:
  • return_string – If True, only a temporary plot is generated and its base64 code is returned, that can be included in HTML files. (default: False) (bool)
Returns:

If the return_string value is True, a list of base64 encoded strings of the images. Otherwise, a list of matplotlib figures.

plot_DNV_spectrum(return_string=False, normalize_to_1=False, **kwargs)

Plots the DNV spectra for the samples in attribute bam_filename.

Parameters:
  • normalize_to_1 – If True, results are plotted as percentages, instead of counts. (default: False) (bool)
  • return_string – If True, only a temporary plot is generated and its base64 code is returned, that can be included in HTML files. (default: False) (bool)
Returns:

If the return_string value is True, a list of base64 encoded strings of the images. Otherwise, a list of matplotlib figures.

plot_SNV_spectrum(return_string=False, normalize_to_1=False, **kwargs)

Plots the triplet spectra for the samples in attribute bam_filename.

Parameters:
  • normalize_to_1 – If True, results are plotted as percentages, instead of counts. (default: False) (bool)
  • return_string – If True, only a temporary plot is generated and its base64 code is returned, that can be included in HTML files. (default: False) (bool)
Returns:

If the return_string value is True, a list of base64 encoded strings of the images. Otherwise, a list of matplotlib figures.

plot_hierarchical_clustering(mutations_dataframe=None, return_string=False, mutations_filename=None, **kwargs)

Generates a heatmap based on the number of shared mutations found in all possible sample pairs. A dendrogram is also added that is the result of hierarchical clustering of the samples.

Parameters:
  • return_string – If True, only a temporary plot is generated and its base64 code is returned, that can be included in HTML files. (default: False) (bool)
  • mutations_filename – The path to the file where mutations are stored. If the mutations attribute of the object does not exist, its value will be set from the file defined here. (default: None) (str)
  • mutations_dataframe – If the mutations are not to be loaded from a file, but are contained in a pandas.DataFrame, the dataframe can be supplied by setting the mutations_dataframe parameter. (default: None) (pandas.DataFrame)
  • kwargs – keyword arguments for MutationCaller object attributes
Returns:

If the return_string value is True, a base64 encoded string of the image. Otherwise, a matplotlib figure.
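
The quantity shown in the heatmap, the number of mutations shared by each pair of samples, can be sketched as follows. Identifying a mutation by a (chromosome, position, alternative base) tuple and the function name are assumptions made for illustration; the clustering step itself is omitted.

```python
from itertools import combinations


def shared_mutation_counts(mutations_by_sample):
    """Count mutations shared by every pair of samples.

    mutations_by_sample maps a sample name to an iterable of hashable
    mutation identifiers, here (chrom, pos, alt) tuples.
    """
    counts = {}
    for first, second in combinations(sorted(mutations_by_sample), 2):
        shared = set(mutations_by_sample[first]) & set(mutations_by_sample[second])
        counts[(first, second)] = len(shared)
    return counts


mutations_by_sample = {  # toy data
    "sample_1.bam": [("chr1", 100, "T"), ("chr1", 250, "G"), ("chr2", 40, "A")],
    "sample_2.bam": [("chr1", 100, "T"), ("chr2", 40, "A")],
    "sample_3.bam": [("chr2", 40, "A"), ("chr3", 7, "C")],
}
pair_counts = shared_mutation_counts(mutations_by_sample)
```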

plot_indel_spectrum(return_string=False, normalize_to_1=False, **kwargs)

Plots the indel spectra for the samples in attribute bam_filename.

Parameters:
  • normalize_to_1 – If True, results are plotted as percentages, instead of counts. (default: False) (bool)
  • return_string – If True, only a temporary plot is generated and its base64 code is returned, that can be included in HTML files. (default: False) (bool)
Returns:

If the return_string value is True, a list of base64 encoded strings of the images. Otherwise, a list of matplotlib figures.

plot_mutation_counts(unique_only=False, return_string=False, mutations_filename=None, mutations_dataframe=None, **kwargs)

Plots the number of mutations found in all the samples in different ploidy regions.

Parameters:
  • unique_only – If True, only unique mutations are plotted for each sample. (default: False) (boolean)
  • return_string – If True, only a temporary plot is generated and its base64 code is returned, that can be included in HTML files. (default: False) (bool)
  • mutations_filename – The path to the file where mutations are stored. If the mutations attribute of the object does not exist, its value will be set from the file defined here. (default: None) (str)
  • mutations_dataframe – If the mutations are not to be loaded from a file, but are contained in a pandas.DataFrame, the dataframe can be supplied by setting the mutations_dataframe parameter. (default: None) (pandas.DataFrame)
  • kwargs

    possible keyword arguments besides MutationCaller attributes

    • control_samples: List of sample names to be used as control samples, in the sense that no unique mutations are expected in them. (The sample names listed here must be a subset of the sample names listed in bam_filename.) (list of str)
Returns:

If the return_string value is True, a base64 encoded string of the image. Otherwise, a matplotlib figure.

plot_rainfall(unique_only=True, return_string=False, **kwargs)

Plots the rainfall plot of mutations found in the samples listed in the attribute bam_filename. Displaying rainfall plots is a good practice to detect mutational clusters throughout the genome. The horizontal axis is the genomic position of each mutation, while the vertical shows the genomic distance between the mutation and the previous one. Thus mutations that are clustered together appear close to each other horizontally and on the lower part of the plot vertically.

Parameters:
  • unique_only – If True, only unique mutations are plotted for each sample. (default: True) (boolean)
  • return_string – If True, only a temporary plot is generated and its base64 code is returned, that can be included in HTML files. (default: False) (bool)
  • kwargs

    keyword arguments for MutationCaller attributes

    • mutations_dataframe: A pandas DataFrame where mutations are listed. (default: not set) (pandas.DataFrame)
    • sample_names: A list of sample names to plot results for, a subset of the list in “bam_filename” attribute. (default: not set) (list)
    • mut_types: list of mutation types to display (default: not set, displaying SNVs, insertions and deletions) (a list containing any combination of these items: [‘SNV’, ‘INS’, ‘DEL’])
    • plot_range: The genomic range to plot. (default: not set) (str, example: “chr9:2342-24124”)
Returns:

If the return_string value is True, a list of base64 encoded strings of the images. Otherwise, a list of matplotlib figures.
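
The coordinates of a rainfall plot, as described above, can be computed with a few lines. The sketch below handles the mutations of a single chromosome and also parses a plot_range string of the documented form; both function names are hypothetical, and details such as axis scaling or the treatment of the first mutation on a chromosome are not covered.

```python
def parse_plot_range(plot_range):
    """Split a string like 'chr9:2342-24124' into (chrom, start, end)."""
    chrom, span = plot_range.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)


def rainfall_points(positions):
    """Return (position, distance to previous mutation) pairs for one chromosome.

    The first mutation has no predecessor and is therefore left out.
    """
    ordered = sorted(positions)
    return [(current, current - previous)
            for previous, current in zip(ordered, ordered[1:])]


positions = [5010, 100, 5000, 90000, 5025]  # toy mutation positions
points = rainfall_points(positions)
chrom, start, end = parse_plot_range("chr9:2342-24124")
```

The three mutations around position 5000 get small vertical values (10 and 15), which is how a cluster shows up in the lower part of the plot.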

run_isomut2_mutdet(**kwargs)

Runs the IsoMut2 mutation detection pipeline on the MutationCaller object, using parameter values specified in the respective attributes of the object.

Parameters:
  • kwargs – keyword arguments for MutationCaller object attributes