mgkit.workflow.pnps_gen module

Calculates pN/pS values

This script calculates pN/pS using the data produced by the script vcf command. The result table is a CSV file.

Parse VCF Files

The vcf command will parse a VCF file to produce the pickle file that is used to calculate the pN/pS.

Calculate Rank pN/pS

The rank command of the script reads SNPs information and calculate for each element of a specific taxonomic rank (species, genus, family, etc.) its pN/pS. Another option is the None rank, which makes the script use the taxonomic ID found in the annotations.

For example, choosing the rank genus a table will be produced, similar to:

Prevotella,0.0001,1,1.1,0.4
Methanobrevibacter,1,0.5,0.6,0.8

A pN/pS value for each genus and sample (4 in this case) will be calculated.

It is important to specify the taxonomic IDs to include in tha calculations. By default only bacteria are included. To get those values, the taxonomy can be queried using taxon-utils get.

Calculate Gene/Rank pN/pS

The full command create a gene/taxon table of pN/pS, internally is a pandas MultiIndex DataFrame, written in CSV format after script execution. The difference with the rank is the pN/pS calculation is for each gene/taxon and by default the gene_id from the original GFF file is used (which is stored in the file generated by snp_parser). If other gene IDs needs to be used, a table file can be provided, which can be passed in two column formats.

The default in MGKit is to use Uniprot gene IDs for the functions, but we may want to examine the Kegg Orthologs instead. A table can be passed where the first column in the gene_id stored in the GFF file and the second is the KO:

Q7N6F9  K05685
Q7N6F9  K01242
G7E4F2  K05625

The Q7N6F9 gene_id is repeated because it has multiple correspondences to KOs and this format needs to be selected using the -2 option of the command.

The default type of table expected by the command is a table with a gene ID as first column one or more tab separated columns with mappings. The previous table would look like this:

Q7N6F9  K05685  K01242
G7E4F2  K05625

These tables can be created from the original GFF file, assuming that mappings to KO, EC Numbers are included, with a command line like this:

edit-gff view -a gene_id -a map_KO final.contigs-a3.gff.gz | tr ',' '      '

Extracting the KOs (which are comma separated in a MGKit GFF file) and changing any comma to tab. This table can be passed to the script and will make it possible to calculate the pN/pS for the KOs associated to the genes. Only gene IDs present in this file have a calculated pN/pS.

Normally you combine the all isoforms with the same the gene_id to produce a single pN/pS, but if it’s needed, the -u option can be used to calculate a pN/pS for each line in the GFF file.

Changes

Changed in version 0.5.7: added vcf command to parse VCF files and generate data for the script

Changed in version 0.5.1: bug fix

    Changed in version 0.5.1:
  • added option to include the lineage as a string

  • added option to use the uids from the GFF instead of gene_id, this does not require the GFF file, they are embedded into the .pickle file

New in version 0.5.0.

mgkit.workflow.pnps_gen.check_snp_in_set(samples, snp_data, pos, change, annotations, seq)[source]

Used by parse_vcf() to check if a SNP

Parameters
  • samples (iterable) – list of samples that contain the SNP

  • snp_data (dict) – dictionary from init_count_set() with per sample SNPs information

mgkit.workflow.pnps_gen.get_lineage(taxonomy, taxon_id)[source]
mgkit.workflow.pnps_gen.init_count_set(annotations, seqs)[source]
mgkit.workflow.pnps_gen.init_count_set_sample_files(annotations, seqs, cov_files)[source]

New in version 0.5.7.

mgkit.workflow.pnps_gen.parse_vcf(vcf_handle, snp_data, annotations, seqs, min_qual, min_reads, min_freq, sample_ids, num_lines, verbose=True)[source]
mgkit.workflow.pnps_gen.read_gene_map_default(file_handle, separator)[source]
mgkit.workflow.pnps_gen.read_gene_map_two_columns(file_handle, separator)[source]
mgkit.workflow.pnps_gen.save_data(output_file, snp_data)[source]

Pickle data structures to the disk.

Parameters
  • output_file (str) – base name for pickle files

  • snp_data (dict) – dictionary from init_count_set() with per sample SNPs information