Importing external mutations

For instructions on installation and basic exploration steps, see Getting started.

It is possible to post-process lists of mutations with isomut2py that were otherwise generated with an external variant caller tool. Here we demonstrate how external VCF files can be loaded and further analysed.

As isomut2py expects pandas DataFrames as inputs to handle mutation lists, we will be using the pandas package for the loading of VCF files.

[2]:
import isomut2py as im2
import pandas as pd
%matplotlib inline
Compiling C scripts...
Done.

Importing VCF files using pyvcf

VCF files can be easily processed in python with the pyvcf package. This can be installed with:

pip install pyvcf
[7]:
import vcf

To convert VCF files to pandas DataFrames with columns that can be parsed by isomut2py, we first need to make sure to find the values of sample_name, chr, pos, type, score, ref, mut, cov, mut_freq, cleanliness and ploidy. In order to perform downstream analysis of mutations, fields cov, mut_freq and cleanliness can be left empty, but nonetheless have to be defined in the dataframe.

Here we are importing VCF files generated by the tool MuTect2 (of GATK). If another tool was used for variant calling, make sure to modify parser function below. The example VCF files are located at [exampleDataDir]/isomut2py_example_dataset/ExternalMutations/mutect2. (For instructions on how to download the example datafiles, see Getting started.)

[91]:
def parse_VCF_record(record):
    if record.is_deletion:
        muttype = 'DEL'
    elif record.is_indel:
        muttype = 'INS'
    elif record.is_snp:
        muttype = 'SNV'

    return (record.CHROM, record.POS, muttype,
            record.INFO['TLOD'][0],
            record.REF, record.ALT[0],
            int(record.samples[0].data.DP),
            int(record.samples[0].data.AD[1])/int(record.samples[0].data.DP),
            int(record.samples[1].data.AD[1])/int(record.samples[1].data.DP))
[108]:
sample_dataframes = []
for i in range(1,7):
    vcf_reader = vcf.Reader(filename = exampleDataDir + 'isomut2py_example_dataset/ExternalMutations/mutect2/'+str(i)+'_somatic_m.vcf.gz')
    d = []
    for record in vcf_reader:
        d.append(parse_VCF_record(record))
    df = pd.DataFrame(d, columns=['chr', 'pos', 'type', 'score', 'ref', 'mut', 'cov', 'mut_freq', 'cleanliness'])
    df['sample_name'] = 'sample_'+str(i)
    df['ploidy'] = 2
    sample_dataframes.append(df[['sample_name','chr', 'pos', 'type', 'score', 'ref',
                                'mut', 'cov', 'mut_freq', 'cleanliness', 'ploidy']])

mutations_dataframe = pd.concat(sample_dataframes)

Now we have all the mutations found in 6 samples in a single dataframe. This can be further analysed with the functions described in Further analysis, visualization. For example the number of mutations found in each sample can be plotted with:

[112]:
f = im2.plot.plot_mutation_counts(mutations_dataframe=mutations_dataframe)
Warning: list of control samples not defined.
_images/external_mutations_9_1.png

Importing lists of mutations in arbitrary files

If your lists of mutations are stored in some kind of text files as tables, the easiest way to import them is to use the pandas python package. (As isomut2py heavily relies on pandas, it should already be installed on your computer.)

Here we merely include an example for such an import, make sure to modify the code and customize it to your table. (This data was generated with MuTect, that only detects SNPs, thus we don’t expect to see any indels.)

[4]:
exampleDataDir = '/nagyvinyok/adat83/sotejedlik/orsi/'
[27]:
sample_dataframes = []
for i in range(1,7):
    for sample_type in ['N', 'T']:
        df_sample = pd.read_csv(exampleDataDir + 'isomut2py_example_dataset/ExternalMutations/mutect/SV0'+
                                str(i)+sample_type+'_chr1.vcf', sep='\t', comment='#')

        df_sample = df_sample[df_sample['judgement'] != 'REJECT']
        df = pd.DataFrame()
        df['chr'] = df_sample['contig']
        df['pos'] = df_sample['position']
        df['type'] = 'SNV'
        df['score'] = df_sample['t_lod_fstar']
        df['ref'] = df_sample['ref_allele']
        df['mut'] = df_sample['alt_allele']
        df['cov'] = df_sample['t_ref_count'] + df_sample['t_alt_count']
        df['mut_freq'] = df_sample['t_alt_count']/df['cov']
        df['cleanliness'] = df_sample['n_ref_count']/(df_sample['n_ref_count'] + df_sample['n_alt_count'])
        df['ploidy'] = 2
        df['sample_name'] = 'SV0' + str(i) + sample_type
        df = df[['sample_name',
         'chr',
         'pos',
         'type',
         'score',
         'ref',
         'mut',
         'cov',
         'mut_freq',
         'cleanliness',
         'ploidy']]
        sample_dataframes.append(df)

mutations_dataframe = pd.concat(sample_dataframes)
[30]:
import warnings
warnings.filterwarnings("ignore")
[31]:
f = im2.plot.plot_mutation_counts(mutations_dataframe=mutations_dataframe)
Warning: list of control samples not defined.
_images/external_mutations_14_1.png