Functions for postprocessing lists of mutations

isomut2py.postprocess.calculate_DNV_matrix(mutations_dataframe=None, sample_names=None, unique_only=True, output_dir=None, mutations_filename=None)

Calculates the DNV matrix from a dataframe of relevant mutations.

Parameters:
  • mutations_dataframe – Dataframe containing the list of mutations to be considered. (default: None) (pandas.DataFrame)
  • sample_names – the list of sample names included in the analysis (default: None) (list of str)
  • unique_only – if True, only unique mutations are considered for the spectrum (default: True) (bool)
  • output_dir – path to the directory where mutation tables are located (default: None) (str)
  • mutations_filename – path to the mutation table(s) (default: None) (list of str)
Returns:

A dictionary containing DNV matrices as values and sample names as keys. (str: numpy.array)

isomut2py.postprocess.calculate_DNV_spectrum(mutations_dataframe=None, sample_names=None, unique_only=True, output_dir=None, mutations_filename=None)

Calculates the DNV spectrum from a dataframe of relevant mutations.

Parameters:
  • mutations_dataframe – Dataframe containing the list of mutations to be considered. (default: None) (pandas.DataFrame)
  • sample_names – the list of sample names included in the analysis (default: None) (list of str)
  • unique_only – if True, only unique mutations are considered for the spectrum (default: True) (bool)
  • output_dir – path to the directory where mutation tables are located (default: None) (str)
  • mutations_filename – path to the mutation table(s) (default: None) (list of str)
Returns:

A dictionary containing DNV spectra arrays (the counts for each mutation type) as values and sample names as keys. (str: numpy.array)

isomut2py.postprocess.calculate_SNV_spectrum(ref_fasta, mutations_dataframe=None, sample_names=None, unique_only=True, chromosomes=None, output_dir=None, mutations_filename=None)

Calculates the triplet spectrum from a dataframe of relevant mutations, using the fasta file of the reference genome.

Parameters:
  • ref_fasta – the path to the reference genome fasta file (str)
  • mutations_dataframe – Dataframe containing the list of mutations to be considered. (default: None) (pandas.DataFrame)
  • sample_names – the list of sample names included in the analysis (default: None) (list of str)
  • unique_only – if True, only unique mutations are considered for the spectrum (default: True) (bool)
  • chromosomes – list of chromosomes in the analysis (default: None) (list of str)
  • output_dir – path to the directory where mutation tables are located (default: None) (str)
  • mutations_filename – path to the mutation table(s) (default: None) (list of str)
Returns:

A dictionary containing 96-element vectors (the counts for each mutation type) as values and sample names as keys. (str: numpy.array)
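
The 96 elements correspond to 6 pyrimidine-centred substitution types, each in 16 possible flanking-base contexts. The following is a minimal sketch of how such a vector could be indexed and counted. The channel order, the helper names (CHANNELS, channel_index, count_spectrum) and the input format are assumptions made for illustration, and the ordering used internally by isomut2py may differ.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# Assumed channel order: substitution type, then 5' base, then 3' base.
CHANNELS = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product("ACGT", repeat=2)
]


def channel_index(triplet, alt):
    """Return the position of a mutation in CHANNELS.

    triplet: reference bases at (pos - 1, pos, pos + 1)
    alt: the mutated base at the middle position
    """
    triplet, alt = triplet.upper(), alt.upper()
    if triplet[1] in "AG":
        # Purine reference base: switch to the reverse-complement strand
        # so that the mutated base is always C or T.
        triplet = "".join(COMPLEMENT[b] for b in reversed(triplet))
        alt = COMPLEMENT[alt]
    label = f"{triplet[0]}[{triplet[1]}>{alt}]{triplet[2]}"
    return CHANNELS.index(label)


def count_spectrum(mutations):
    """Count (triplet, alt) pairs into a 96-element list."""
    counts = [0] * len(CHANNELS)
    for triplet, alt in mutations:
        counts[channel_index(triplet, alt)] += 1
    return counts
```

Folding purine reference bases onto the opposite strand is what reduces the 12 possible base substitutions to 6, giving 6 × 16 = 96 channels.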

isomut2py.postprocess.calculate_indel_spectrum(ref_fasta, mutations_dataframe=None, sample_names=None, unique_only=True, chromosomes=None, output_dir=None, mutations_filename=None)

Calculates the indel spectrum from a dataframe of relevant mutations, using the fasta file of the reference genome.

Parameters:
  • ref_fasta – the path to the reference genome fasta file (str)
  • mutations_dataframe – Dataframe containing the list of mutations to be considered. (default: None) (pandas.DataFrame)
  • sample_names – the list of sample names included in the analysis (default: None) (list of str)
  • unique_only – if True, only unique mutations are considered for the spectrum (default: True) (bool)
  • chromosomes – list of chromosomes in the analysis (default: None) (list of str)
  • output_dir – path to the directory where mutation tables are located (default: None) (str)
  • mutations_filename – path to the mutation table(s) (default: None) (list of str)
Returns:

A dictionary containing 83-element vectors (the counts for each mutation type) as values and sample names as keys. (str: numpy.array)

isomut2py.postprocess.check_pileup(chrom_list, from_pos_list, to_pos_list, ref_fasta, input_dir, bam_filename, output_dir='', samtools_fullpath='samtools', base_quality_limit=30, samtools_flags=' -B -d 1000 ', print_original=True, filename=None)

Loads pileup information for a list of genomic regions.

Parameters:
  • chrom_list – List of chromosomes for the regions. (list of str)
  • from_pos_list – List of starting positions for the regions. (list of int)
  • to_pos_list – List of ending positions for the regions. (list of int)
  • ref_fasta – the path to the reference genome fasta file (str)
  • input_dir – the path to the directory where bam files are located (str)
  • bam_filename – the filenames for the bam files of the investigated samples (list of str)
  • output_dir – path to the directory where temporary files should be saved (default: '') (str)
  • samtools_fullpath – path to samtools on the computer (default: 'samtools') (str)
  • base_quality_limit – base quality limit for the pileup generation (default: 30) (int)
  • samtools_flags – additional flags for samtools (default: ' -B -d ' + str(SAMTOOLS_MAX_DEPTH) + ' ') (str)
  • print_original – If True, prints the original string generated with samtools mpileup as well (default: True) (bool)
  • filename – If filename is specified, results will be written to that path. Otherwise, results are written to [output_dir]/checkPileup_tmp.csv (default: None) (str)
Returns:

df: The processed results of the pileup, containing information on all samples in a pandas.DataFrame.
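
The three list arguments describe regions element by element: the i-th chromosome, start and end together define one region. The sketch below shows one way these inputs could be turned into samtools region strings and an mpileup command line. The helper names are hypothetical, and mapping base_quality_limit to the samtools -Q option is an assumption; how check_pileup actually invokes samtools is not documented here.

```python
import os
import shlex


def build_regions(chrom_list, from_pos_list, to_pos_list):
    """Pair up the three lists into 'chrom:start-end' region strings."""
    if not (len(chrom_list) == len(from_pos_list) == len(to_pos_list)):
        raise ValueError("all three lists must have the same length")
    regions = []
    for chrom, start, end in zip(chrom_list, from_pos_list, to_pos_list):
        if start > end:
            raise ValueError(f"start {start} is after end {end} on {chrom}")
        regions.append(f"{chrom}:{start}-{end}")
    return regions


def build_mpileup_command(region, ref_fasta, input_dir, bam_filename,
                          samtools_fullpath="samtools",
                          base_quality_limit=30,
                          samtools_flags=" -B -d 1000 "):
    """Assemble an argument list for 'samtools mpileup' (illustrative only)."""
    bams = [os.path.join(input_dir, name) for name in bam_filename]
    return (
        [samtools_fullpath, "mpileup"]
        + shlex.split(samtools_flags)       # e.g. ['-B', '-d', '1000']
        + ["-Q", str(base_quality_limit)]   # assumed mapping to -Q
        + ["-f", ref_fasta, "-r", region]
        + bams
    )
```

Validating the list lengths up front avoids silently dropping regions, which is what zip would do if one list were shorter than the others.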

isomut2py.postprocess.decompose_DNV_spectra(DNVspectrumDict, sample_names=None, signatures_file=None, equal_initial_proportions=False, tol=0.0001, max_iter=1000, filter_percent=0, filter_count=0, keep_top_n=None, use_signatures=None, ignore_signatures=None)

Run the whole pipeline of decomposing DNV spectra for the samples specified in sample_names.

Parameters:
  • DNVspectrumDict – dictionary containing DNV spectra as values and sample names as keys (dictionary)
  • sample_names – The list of sample names analysed. (list of str)
  • unique_only – If True, only unique mutations are used to construct the original spectrum.
  • signatures_file – The path to the csv file containing the signature matrix. (str)
  • equal_initial_proportions – If True, initial weights are initialized to be equal for all reference signatures. Otherwise, they are initialized so that signatures with a large cosine similarity to the original spectrum get larger weights. (default: False) (bool)
  • tol – The maximal difference between the proportions for convergence to be True. (default: 0.0001) (float)
  • max_iter – The maximum number of iterations (default: 1000) (int)
  • filter_percent – Filter out signatures that contribute less than filter_percent of the mutations in the sample. (default: 0) (float)
  • filter_count – Filter out signatures that contribute fewer than filter_count mutations in the sample. (default: 0) (int)
  • keep_top_n – Keep only the keep_top_n signatures that contribute the most mutations to the mixture. (default: None) (int)
  • use_signatures – Use a specific subset of all signatures. A list of signature names. (default: None) (list of str)
  • ignore_signatures – Exclude a specific subset of all signatures from the analysis. A list of signature names. (default: None) (list of str)
Returns:

The final proportions of all signatures in the mixture. (Filtered out or not used signatures appear with a proportion of zero.) (numpy.array)
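
The effect of equal_initial_proportions can be illustrated with a small sketch. It assumes that the similarity-based initial weights are simply the cosine similarities normalized to sum to one; the exact weighting used by the library is not specified in this documentation, and the function names below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def initial_proportions(spectrum, signatures, equal_initial_proportions=False):
    """Starting weights for the reference signatures (illustrative only)."""
    n = len(signatures)
    if equal_initial_proportions:
        return [1.0 / n] * n
    sims = [cosine_similarity(spectrum, sig) for sig in signatures]
    total = sum(sims)
    if total == 0:
        # No signature resembles the spectrum: fall back to equal weights.
        return [1.0 / n] * n
    return [s / total for s in sims]
```

Starting near signatures that already resemble the sample can reduce the number of iterations needed before the change in proportions falls below tol.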

isomut2py.postprocess.decompose_SNV_spectra(SNVspectrumDict, sample_names=None, signatures_file=None, equal_initial_proportions=False, tol=0.0001, max_iter=1000, filter_percent=0, filter_count=0, keep_top_n=None, use_signatures=None, ignore_signatures=None)

Run the whole pipeline of decomposing SNV spectra for the samples specified in sample_names.

Parameters:
  • SNVspectrumDict – dictionary containing SNV spectra as values and sample names as keys (dictionary)
  • sample_names – The list of sample names analysed. (list of str)
  • unique_only – If True, only unique mutations are used to construct the original spectrum.
  • signatures_file – The path to the csv file containing the signature matrix. (str)
  • equal_initial_proportions – If True, initial weights are initialized to be equal for all reference signatures. Otherwise, they are initialized so that signatures with a large cosine similarity to the original spectrum get larger weights. (default: False) (bool)
  • tol – The maximal difference between the proportions for convergence to be True. (default: 0.0001) (float)
  • max_iter – The maximum number of iterations (default: 1000) (int)
  • filter_percent – Filter out signatures that contribute less than filter_percent of the mutations in the sample. (default: 0) (float)
  • filter_count – Filter out signatures that contribute fewer than filter_count mutations in the sample. (default: 0) (int)
  • keep_top_n – Keep only the keep_top_n signatures that contribute the most mutations to the mixture. (default: None) (int)
  • use_signatures – Use a specific subset of all signatures. A list of signature names. (default: None) (list of str)
  • ignore_signatures – Exclude a specific subset of all signatures from the analysis. A list of signature names. (default: None) (list of str)
Returns:

The final proportions of all signatures in the mixture. (Filtered out or not used signatures appear with a proportion of zero.) (numpy.array)
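
How filter_percent, filter_count and keep_top_n interact can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: filter_percent is interpreted on a 0 to 100 scale, filtered signatures are set to zero, and the remaining proportions are simply renormalized (the library may refit instead). The function name and the n_mutations argument are hypothetical.

```python
def filter_proportions(proportions, n_mutations,
                       filter_percent=0, filter_count=0, keep_top_n=None):
    """Zero out weak signatures and renormalize the rest (illustrative only)."""
    kept = list(proportions)
    for i, p in enumerate(kept):
        too_small_share = p * 100 < filter_percent
        too_few_mutations = p * n_mutations < filter_count
        if too_small_share or too_few_mutations:
            kept[i] = 0.0
    if keep_top_n is not None:
        # Rank signatures by contribution and drop everything below the top n.
        ranked = sorted(range(len(kept)), key=lambda i: kept[i], reverse=True)
        for i in ranked[keep_top_n:]:
            kept[i] = 0.0
    total = sum(kept)
    return [p / total for p in kept] if total > 0 else kept
```

Filtered signatures stay in the output with a proportion of zero, which matches the shape of the return value described above.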

isomut2py.postprocess.decompose_indel_spectra(IDspectrumDict, sample_names=None, signatures_file=None, equal_initial_proportions=False, tol=0.0001, max_iter=1000, filter_percent=0, filter_count=0, keep_top_n=None, use_signatures=None, ignore_signatures=None)

Run the whole pipeline of decomposing indel spectra for the samples specified in sample_names.

Parameters:
  • IDspectrumDict – dictionary containing indel spectra as values and sample names as keys (dictionary)
  • sample_names – The list of sample names analysed. (list of str)
  • signatures_file – The path to the csv file containing the signature matrix. (str)
  • equal_initial_proportions – If True, initial weights are initialized to be equal for all reference signatures. Otherwise, they are initialized so that signatures with a large cosine similarity to the original spectrum get larger weights. (default: False) (bool)
  • tol – The maximal difference between the proportions for convergence to be True. (default: 0.0001) (float)
  • max_iter – The maximum number of iterations (default: 1000) (int)
  • filter_percent – Filter out signatures that contribute less than filter_percent of the mutations in the sample. (default: 0) (float)
  • filter_count – Filter out signatures that contribute fewer than filter_count mutations in the sample. (default: 0) (int)
  • keep_top_n – Keep only the keep_top_n signatures that contribute the most mutations to the mixture. (default: None) (int)
  • use_signatures – Use a specific subset of all signatures. A list of signature names. (default: None) (list of str)
  • ignore_signatures – Exclude a specific subset of all signatures from the analysis. A list of signature names. (default: None) (list of str)
Returns:

The final proportions of all signatures in the mixture. (Filtered out or not used signatures appear with a proportion of zero.) (numpy.array)
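
The use_signatures and ignore_signatures arguments both restrict the set of reference signatures considered. A minimal sketch of one plausible selection rule follows. It assumes that use_signatures is applied first and ignore_signatures second when both are given, and that unknown names are an error. Neither behaviour is confirmed by this documentation, and the function name and signature labels are hypothetical.

```python
def select_signatures(all_names, use_signatures=None, ignore_signatures=None):
    """Return the signature names to fit, in their original order."""
    known = set(all_names)
    for name in list(use_signatures or []) + list(ignore_signatures or []):
        if name not in known:
            raise ValueError(f"unknown signature name: {name}")
    wanted = known if use_signatures is None else set(use_signatures)
    ignored = set(ignore_signatures or [])
    return [n for n in all_names if n in wanted and n not in ignored]
```

Preserving the original order keeps the selected names aligned with the columns of the signature matrix, so excluded signatures can be reported with a proportion of zero at their usual positions.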

isomut2py.postprocess.get_details_for_mutations(ref_fasta, input_dir, bam_filename, mutations_dataframe=None, output_dir=None, mutations_filename=None, samtools_fullpath='samtools', base_quality_limit=30, samtools_flags=' -B -d 1000 ')

Get detailed results for the list of mutations contained in the mutations attribute of the object.

Parameters:
  • ref_fasta – the path to the reference genome fasta file (str)
  • input_dir – the path to the directory where bam files are located (str)
  • bam_filename – the filenames for the bam files of the investigated samples (list of str)
  • mutations_dataframe – a pandas.DataFrame where mutations are stored (default: None) (pandas.DataFrame)
  • output_dir – path to the directory where temporary files should be saved (default: '') (str)
  • mutations_filename – path to the file(s) where mutations are stored (default: None) (list of str)
  • samtools_fullpath – path to samtools on the computer (default: 'samtools') (str)
  • base_quality_limit – base quality limit for the pileup generation (default: 30) (int)
  • samtools_flags – additional flags for samtools (default: ' -B -d ' + str(SAMTOOLS_MAX_DEPTH) + ' ') (str)
Returns:

df_joined: A dataframe containing the detailed results. (pandas.DataFrame)

isomut2py.postprocess.optimize_results(sample_names, control_samples, FPs_per_genome, plot_roc=False, plot_tuning_curve=False, filtered_results_file=None, output_dir=None, mutations_dataframe=None)

Optimizes the list of detected mutations according to the list of control samples and desired level of false positives set by the user. Filtered results will be loaded to the mutations attribute of the MutationDetection object.

Parameters:
  • sample_names – The list of sample names included in the analysis. (list of str)
  • control_samples – List of sample names that should be used as control samples, in the sense that no unique mutations are expected in them. (The sample names listed here must match a subset of the sample names listed in bam_filename.) (list of str)
  • FPs_per_genome – The total number of false positives tolerated in a control sample. (int)
  • plot_roc – If True, ROC curves will be plotted as a visual representation of the optimization process. (default: False) (bool)
  • plot_tuning_curve – If True, tuning curves displaying the number of mutations found in different samples with different score filters will be plotted as a visual representation of the optimization process. (default: False) (bool)
  • filtered_results_file – The path to the file where filtered results should be saved. (default: [output_dir]/filtered_results.csv) (str)
  • output_dir – the path to the directory where raw mutation tables are located (default: None) (str)
  • mutations_dataframe – the pandas.DataFrame where mutations are located (default: None) (pandas.DataFrame)
Returns:

(score_lim_dict, filtered_results)

  • score_lim_dict: a dictionary containing the optimized score values for each ploidy separately
  • filtered_results: a pandas.DataFrame containing the filtered mutations

isomut2py.postprocess.optimize_score(mutation_dataframe, control_samples, FPs_per_genome, score0=0, unique_samples=None)

Optimizes score values for different mutation types (SNV, INS, DEL) and ploidies according to the list of control samples and the desired level of false positives in the genome. The results are stored in the score_lim_dict attribute of the MutationDetection object. If plot = True, plots ROC curves for all mutation types (SNV, INS, DEL) and all ploidies.

Parameters:
  • mutation_dataframe – The dataframe containing the mutations. (pandas.DataFrame)
  • control_samples – a subset of bam_filename (list of sample names) that should be considered as control samples. Control samples are defined as samples where no unique mutations are expected to be found. (list of str)
  • FPs_per_genome – the largest number of false positives tolerated in a control sample (int)
  • score0 – Score optimization starts with score0. If a larger score value is likely to be optimal, setting score0 to a number larger than 0 can decrease computation time. (default: 0) (float)
  • unique_samples – list of unique samples where at least one mutation is detected (default: None) (list of str)
Returns:

a dictionary containing the optimized score values for each ploidy
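
The idea behind the optimization can be sketched for a single mutation type and ploidy: find the lowest score threshold at which no control sample retains more than FPs_per_genome unique mutations. The code below is a simplified illustration and not the library's algorithm. The function name, the input format (a dictionary of per-sample score lists) and the strict "score greater than threshold" rule are assumptions.

```python
def optimize_threshold(control_scores, FPs_per_genome, score0=0):
    """Lowest threshold keeping every control sample within the FP budget.

    control_scores: dict mapping control sample name -> list of scores of
                    the unique mutations detected in that sample
    """
    all_scores = {s for scores in control_scores.values() for s in scores}
    # Only the starting value and the observed scores above it can change
    # the number of surviving mutations, so they are the only candidates.
    candidates = sorted({score0} | {s for s in all_scores if s > score0})
    for threshold in candidates:
        worst = max(
            (sum(1 for s in scores if s > threshold)
             for scores in control_scores.values()),
            default=0,
        )
        if worst <= FPs_per_genome:
            return threshold
    return candidates[-1]
```

Raising score0 shortens the candidate list, which mirrors the note above that a larger starting score can decrease computation time.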