pnps-gen - pN/pS Table Generation

Overview

Calculates pN/pS values

This script calculates pN/pS using the data produced by the script's vcf command. The resulting table is written as a CSV file.
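As a rough illustration of the quantity being tabulated, pN/pS is commonly defined as the ratio between pN (observed over expected non-synonymous changes) and pS (observed over expected synonymous changes). The sketch below assumes that definition. The function name pnps_ratio and its arguments are hypothetical and are not part of the script or of the MGKit API, and the script itself may handle edge cases differently.

```python
import math


def pnps_ratio(obs_nonsyn, exp_nonsyn, obs_syn, exp_syn):
    """Return pN/pS, or NaN when the ratio cannot be computed.

    Hypothetical helper: assumes pN = obs_nonsyn / exp_nonsyn and
    pS = obs_syn / exp_syn.
    """
    # Avoid divisions by zero: no expected changes, or pS equal to zero
    if exp_nonsyn == 0 or exp_syn == 0 or obs_syn == 0:
        return math.nan
    pn = obs_nonsyn / exp_nonsyn
    ps = obs_syn / exp_syn
    return pn / ps


ratio = pnps_ratio(2, 10, 1, 5)
print(ratio)
```

Returning NaN for the undefined cases is a choice made for this sketch only.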

Parse VCF Files

The vcf command parses a VCF file to produce the pickle file that is used to calculate the pN/pS.

Calculate Rank pN/pS

The rank command of the script reads the SNP information and calculates a pN/pS for each element of a specific taxonomic rank (species, genus, family, etc.). Another option is the None rank, which makes the script use the taxonomic ID found in the annotations.

For example, choosing the rank genus produces a table similar to:

Prevotella,0.0001,1,1.1,0.4
Methanobrevibacter,1,0.5,0.6,0.8

A pN/pS value for each genus and sample (4 in this case) will be calculated.
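A table like the one above can be loaded with the Python standard library. This is a minimal sketch that assumes the layout shown in the example (a label followed by one value per sample, with no header line); a real output file may include a header with sample names, and the function name read_pnps_table is hypothetical.

```python
import csv
import io

# Rows copied from the example table; no header line is assumed here
table_text = """Prevotella,0.0001,1,1.1,0.4
Methanobrevibacter,1,0.5,0.6,0.8
"""


def read_pnps_table(handle):
    """Map each row label to its list of per-sample pN/pS values."""
    values = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        label, *samples = row
        # Empty cells are treated as missing values (None)
        values[label] = [float(x) if x else None for x in samples]
    return values


table = read_pnps_table(io.StringIO(table_text))
print(table["Prevotella"])
```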

It is important to specify the taxonomic IDs to include in the calculations. By default only bacteria are included. To get those values, the taxonomy can be queried using taxon-utils get.

Calculate Gene/Rank pN/pS

The full command creates a gene/taxon table of pN/pS values. Internally it is a pandas MultiIndex DataFrame, which is written in CSV format after script execution. The difference from the rank command is that the pN/pS is calculated for each gene/taxon pair, and by default the gene_id from the original GFF file is used (it is stored in the file generated by snp_parser). If other gene IDs need to be used, a table file can be provided, in one of two column formats.

The default in MGKit is to use Uniprot gene IDs for the functions, but we may want to examine the KEGG Orthologs (KOs) instead. A table can be passed where the first column is the gene_id stored in the GFF file and the second is the KO:

Q7N6F9  K05685
Q7N6F9  K01242
G7E4F2  K05625

The Q7N6F9 gene_id is repeated because it corresponds to multiple KOs. This format must be selected using the -2 option of the command.

By default the command expects a table with a gene ID in the first column, followed by one or more tab-separated columns with mappings. In that format the previous table would look like this:

Q7N6F9  K05685  K01242
G7E4F2  K05625
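The relationship between the two formats can be sketched in a few lines of Python. The example assumes tab-separated input in the two-column layout and collapses repeated gene IDs into the multi-column layout; collapse_gene_map is a hypothetical helper, not something provided by MGKit.

```python
from collections import defaultdict

# Two-column table with a repeated gene_id (tab separated)
two_columns = "Q7N6F9\tK05685\nQ7N6F9\tK01242\nG7E4F2\tK05625\n"


def collapse_gene_map(text, sep="\t"):
    """Turn a two-column table into one line per gene_id."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        gene_id, mapping = line.split(sep)
        grouped[gene_id].append(mapping)
    # Dictionaries keep insertion order, so gene IDs stay in file order
    return "\n".join(
        sep.join([gene_id, *mappings])
        for gene_id, mappings in grouped.items()
    )


print(collapse_gene_map(two_columns))
```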

These tables can be created from the original GFF file, assuming that mappings to KO, EC Numbers are included, with a command line like this:

edit-gff view -a gene_id -a map_KO final.contigs-a3.gff.gz | tr ',' '\t'

This extracts the KOs (which are comma separated in an MGKit GFF file) and changes every comma to a tab. The resulting table can be passed to the script, making it possible to calculate the pN/pS for the KOs associated with the genes. Only gene IDs present in this file will have a calculated pN/pS.

Normally all isoforms with the same gene_id are combined to produce a single pN/pS, but if needed, the -u option can be used to calculate a pN/pS for each line in the GFF file.
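The effect of restricting the calculation to the gene IDs in the table can be pictured as follows. This is only a sketch of the idea, assuming the default multi-column, tab-separated format; load_gene_map and the gene ID A0A000 are made up for illustration and do not reflect how the script is implemented.

```python
def load_gene_map(lines, sep="\t"):
    """Read a multi-column gene map into {gene_id: set of mappings}."""
    gene_map = {}
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # a gene_id without mappings is ignored
        gene_id, *mappings = fields
        gene_map.setdefault(gene_id, set()).update(m for m in mappings if m)
    return gene_map


gene_map = load_gene_map(["Q7N6F9\tK05685\tK01242\n", "G7E4F2\tK05625\n"])

# A0A000 is a hypothetical gene_id that is absent from the table
gene_ids = ["Q7N6F9", "A0A000", "G7E4F2"]
kept = [gene_id for gene_id in gene_ids if gene_id in gene_map]
print(kept)
```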

Changes

Changed in version 0.5.7: added vcf command to parse VCF files and generate data for the script

Changed in version 0.5.1: bug fix

Changed in version 0.5.1:

  • added option to include the lineage as a string

  • added option to use the uids from the GFF instead of gene_id; this does not require the GFF file, as they are embedded into the .pickle file

New in version 0.5.0.

Options

pnps-gen

Main function

pnps-gen [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

--cite

full

Calculates pN/pS

pnps-gen full [OPTIONS] OUTPUT_FILE

Options

-v, --verbose
-t, --taxonomy <taxonomy>

Required Taxonomy file

-s, --snp-data <snp_data>

Required SNP data, output of snp_parser

-r, --rank <rank>

Taxonomic rank

Options

superkingdom|kingdom|phylum|class|order|family|genus|species|None

-m, --min-num <min_num>

Minimum number of samples with a pN/pS to accept

Default

2

-c, --min-cov <min_cov>

Minimum coverage for SNPs to be accepted

Default

4

-i, --taxon-ids <taxon_ids>

Taxon IDs to include

-u, --use-uid

Use uids from the GFF file instead of gene_id as genes

Default

False

-g, --gene-map <gene_map>

Dictionary to map gene_id to another ID

-2, --two-columns

gene-map is a two columns table with repeated keys

-p, --separator <separator>

column separator for gene-map file

Default

tab

-l, --lineage

Use lineage string instead of taxon_id

Default

False

-e, --parquet

Output a Parquet file instead of CSV

Default

False

-ps, --only-ps

Only calculate pS values

Default

False

-pn, --only-pn

Only calculate pN values

Default

False

Arguments

OUTPUT_FILE

Required argument

rank

Calculates pN/pS for a taxonomic rank

pnps-gen rank [OPTIONS] [TXT_FILE]

Options

-v, --verbose
-t, --taxonomy <taxonomy>

Required Taxonomy file

-s, --snp-data <snp_data>

Required SNP data, output of snp_parser

-r, --rank <rank>

Taxonomic rank

Default

order

Options

superkingdom|kingdom|phylum|class|order|family|genus|species|None

-m, --min-num <min_num>

Minimum number of samples with a pN/pS to accept

Default

2

-c, --min-cov <min_cov>

Minimum coverage for SNPs to be accepted

Default

4

-i, --taxon_ids <taxon_ids>

Taxon IDs to include

Default

2

-u, --unstack

Samples are not in columns but as an array

Default

False

-l, --lineage

Use lineage string instead of taxon_id

Default

False

-ps, --only-ps

Only calculate pS values

Default

False

-pn, --only-pn

Only calculate pN values

Default

False

Arguments

TXT_FILE

Optional argument

vcf

Parses a VCF file and a GFF file to produce the data used for pnps-gen

pnps-gen vcf [OPTIONS] [VCF_FILE] OUTPUT_FILE

Options

-v, --verbose
-ft, --feature <feature>

Feature to use in the GFF file

Default

CDS

-g, --gff-file <gff_file>

Required GFF file to use

-a, --fasta-file <fasta_file>

Required Reference file (FASTA) for the GFF

-q, --min-qual <min_qual>

Minimum quality for SNPs (Phred score)

Default

30

-f, --min-freq <min_freq>

Minimum allele frequency

Default

0.01

-r, --min-reads <min_reads>

Minimum number of reads to accept the SNP

Default

4

-m, --sample-ids <sample_ids>

IDs of the samples used in the analysis; they must be the same as in the GFF file

-n, --num-lines <num_lines>

Number of VCF lines after which the status is printed

Default

100000

Arguments

VCF_FILE

Optional argument

OUTPUT_FILE

Required argument

vcf_alt

Parses a VCF file and a GFF file to produce the data used for pnps-gen; uses a file list for sample coverage instead of taking the information from the GFF file

pnps-gen vcf_alt [OPTIONS] [VCF_FILE] OUTPUT_FILE

Options

-v, --verbose
-ft, --feature <feature>

Feature to use in the GFF file

Default

CDS

-g, --gff-file <gff_file>

Required GFF file to use

-a, --fasta-file <fasta_file>

Required Reference file (FASTA) for the GFF

-q, --min-qual <min_qual>

Minimum quality for SNPs (Phred score)

Default

30

-f, --min-freq <min_freq>

Minimum allele frequency

Default

0.01

-r, --min-reads <min_reads>

Minimum number of reads to accept the SNP

Default

4

-n, --num-lines <num_lines>

Number of VCF lines after which the status is printed

Default

100000

-l, --sample-file <sample_file>

Required File with list of coverage files and sample names (TAB separated)

-s, --file-list <file_list>

File with list of VCF files (one per line)

-u, --uid-map <uid_map>

Only load annotations from a specific map file

Arguments

VCF_FILE

Optional argument

OUTPUT_FILE

Required argument