pnps-gen - pN/pS Table Generation¶

Overview¶

Calculates pN/pS values¶

This script calculates pN/pS using the data produced by its vcf command. The resulting table is written as a CSV file.

Parse VCF Files¶

The vcf command parses a VCF file and produces the pickle file that is used to calculate pN/pS.

Calculate Rank pN/pS¶

The rank command reads the SNP information and calculates pN/pS for each element of a specific taxonomic rank (species, genus, family, etc.). Alternatively, the None rank makes the script use the taxonomic ID found in the annotations.

For example, choosing the genus rank produces a table similar to:

Prevotella,0.0001,1,1.1,0.4
Methanobrevibacter,1,0.5,0.6,0.8

A pN/pS value is calculated for each genus and each sample (4 samples in this case).
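
To make the layout concrete, here is a minimal Python sketch that reads rows shaped like the example above into a dictionary keyed by genus. It assumes the rows carry no header line, as in the excerpt; the real output file may include one with the sample names, in which case it would need to be skipped or read separately. The name `read_rank_table` is hypothetical and not part of MGKit.

```python
import csv
import io


def read_rank_table(handle):
    """Map each taxon name to its list of per-sample pN/pS values."""
    table = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        taxon, *values = row
        # Empty cells (no value for that sample) are kept as None
        table[taxon] = [float(v) if v else None for v in values]
    return table


# The two rows from the example table, one value per sample
example = io.StringIO(
    "Prevotella,0.0001,1,1.1,0.4\n"
    "Methanobrevibacter,1,0.5,0.6,0.8\n"
)
pnps = read_rank_table(example)
```

Each entry of `pnps` then holds one value per sample, in the column order of the file.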

It is important to specify the taxonomic IDs to include in the calculations. By default only bacteria are included. The taxon IDs to pass can be found by querying the taxonomy with taxon-utils get.

Calculate Gene/Rank pN/pS¶

The full command creates a gene/taxon table of pN/pS. Internally this is a pandas MultiIndex DataFrame, which is written in CSV format at the end of the script's execution. The difference from the rank command is that pN/pS is calculated for each gene/taxon pair. By default the gene_id from the original GFF file is used (it is stored in the file generated by snp_parser). If other gene IDs need to be used, a table file can be provided in one of two column formats.

The default in MGKit is to use UniProt gene IDs for the functions, but we may want to examine the KEGG Orthologs (KOs) instead. A table can be passed where the first column is the gene_id stored in the GFF file and the second is the KO:

Q7N6F9  K05685
Q7N6F9  K01242
G7E4F2  K05625

The Q7N6F9 gene_id is repeated because it corresponds to multiple KOs. This format must be selected with the -2 option of the command.

By default the command expects a table with a gene ID as the first column, followed by one or more tab-separated columns with the mappings. In this format the previous table would look like this:

Q7N6F9  K05685  K01242
G7E4F2  K05625
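
The relationship between the two formats can be sketched in a few lines of Python: the two-column table is grouped by gene ID so that every gene ends up on a single line with all its mappings. This is only an illustration, assuming tab-separated input; `two_columns_to_multi` is a hypothetical helper, not an MGKit function.

```python
from collections import defaultdict


def two_columns_to_multi(lines, separator="\t"):
    """Group repeated gene IDs so each gene appears on a single line."""
    mappings = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        gene_id, mapped_id = line.split(separator)
        # Keep the first occurrence of each mapping, in input order
        if mapped_id not in mappings[gene_id]:
            mappings[gene_id].append(mapped_id)
    return [separator.join([gene_id, *ids]) for gene_id, ids in mappings.items()]


# The two-column table from the text (the format selected with -2)
two_column = ["Q7N6F9\tK05685", "Q7N6F9\tK01242", "G7E4F2\tK05625"]
multi_column = two_columns_to_multi(two_column)
```

The lines in `multi_column` correspond to the default format shown above, with one row per gene ID.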

These tables can be created from the original GFF file, assuming that mappings such as KO or EC numbers are included, with a command line like this:

edit-gff view -a gene_id -a map_KO final.contigs-a3.gff.gz | tr ',' '\t'

This extracts the KOs (which are comma-separated in an MGKit GFF file) and changes every comma to a tab. The resulting table can be passed to the script to calculate the pN/pS for the KOs associated with the genes. Only gene IDs present in this file have a calculated pN/pS.

Normally all isoforms with the same gene_id are combined to produce a single pN/pS but, if needed, the -u option can be used to calculate a pN/pS for each line in the GFF file.
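
As a rough illustration of how the options described in this section combine, the following Python sketch assembles an argument list for the full command. It only uses options listed in the Options section of this page; the helper `build_full_command` and all file names are hypothetical placeholders, and the command is built but not executed.

```python
import shlex


def build_full_command(taxonomy, snp_data, output, gene_map=None,
                       two_columns=False, use_uid=False):
    """Assemble the argument list for `pnps-gen full`."""
    args = ["pnps-gen", "full", "-t", taxonomy, "-s", snp_data]
    if gene_map is not None:
        args += ["-g", gene_map]
        if two_columns:
            args.append("-2")  # gene map has repeated keys, two columns
    if use_uid:
        args.append("-u")  # one pN/pS per GFF line instead of per gene_id
    args.append(output)
    return args


# Hypothetical file names, used only for illustration
command = build_full_command(
    "taxonomy.pickle", "snp_data.pickle", "pnps.csv",
    gene_map="gene_map.tsv", two_columns=True,
)
command_line = shlex.join(command)
```

The list in `command` could then be handed to `subprocess.run(command, check=True)`, while `command_line` is the same invocation as a single shell string.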

Changes¶

Changed in version 0.5.7: added vcf command to parse VCF files and generate data for the script

Changed in version 0.5.1: bug fix

Changed in version 0.5.1:

added option to include the lineage as a string
added option to use the uids from the GFF instead of gene_id, this does not require the GFF file, they are embedded into the .pickle file

New in version 0.5.0.

Options¶

pnps-gen¶

Main function

pnps-gen [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

--cite¶

full¶

Calculates pN/pS

pnps-gen full [OPTIONS] OUTPUT_FILE

Options

-v, --verbose¶

-t, --taxonomy <taxonomy>¶: Required Taxonomy file

-s, --snp-data <snp_data>¶: Required SNP data, output of snp_parser

-r, --rank <rank>¶

Taxonomic rank

Options: superkingdom|kingdom|phylum|class|order|family|genus|species|None

-m, --min-num <min_num>¶

Minimum number of samples with a pN/pS to accept

Default: 2

-c, --min-cov <min_cov>¶

Minimum coverage for SNPs to be accepted

Default: 4

-i, --taxon-ids <taxon_ids>¶: Taxon IDs to include

-u, --use-uid¶

Use uids from the GFF file instead of gene_id as genes

Default: False

-g, --gene-map <gene_map>¶: Dictionary to map gene_id to another ID

-2, --two-columns¶: gene-map is a two-column table with repeated keys

-p, --separator <separator>¶

column separator for gene-map file

Default

-l, --lineage¶

Use lineage string instead of taxon_id

Default: False

-e, --parquet¶

Output a Parquet file instead of CSV

Default: False

-ps, --only-ps¶

Only calculate pS values

Default: False

-pn, --only-pn¶

Only calculate pN values

Default: False

Arguments

OUTPUT_FILE¶: Required argument

rank¶

Calculates pN/pS for a taxonomic rank

pnps-gen rank [OPTIONS] [TXT_FILE]

Options

-v, --verbose¶

-t, --taxonomy <taxonomy>¶: Required Taxonomy file

-s, --snp-data <snp_data>¶: Required SNP data, output of snp_parser

-r, --rank <rank>¶

Taxonomic rank

Default: order
Options: superkingdom|kingdom|phylum|class|order|family|genus|species|None

-m, --min-num <min_num>¶

Minimum number of samples with a pN/pS to accept

Default: 2

-c, --min-cov <min_cov>¶

Minimum coverage for SNPs to be accepted

Default: 4

-i, --taxon_ids <taxon_ids>¶

Taxon IDs to include

Default: 2

-u, --unstack¶

Samples are not in columns but as an array

Default: False

-l, --lineage¶

Use lineage string instead of taxon_id

Default: False

-ps, --only-ps¶

Only calculate pS values

Default: False

-pn, --only-pn¶

Only calculate pN values

Default: False

Arguments

TXT_FILE¶: Optional argument

vcf¶

Parses a VCF file and a GFF file to produce the data used for pnps-gen

pnps-gen vcf [OPTIONS] [VCF_FILE] OUTPUT_FILE

Options

-v, --verbose¶

-ft, --feature <feature>¶

Feature to use in the GFF file

Default: CDS

-g, --gff-file <gff_file>¶: Required GFF file to use

-a, --fasta-file <fasta_file>¶: Required Reference file (FASTA) for the GFF

-q, --min-qual <min_qual>¶

Minimum quality for SNPs (Phred score)

Default: 30

-f, --min-freq <min_freq>¶

Minimum allele frequency

Default: 0.01

-r, --min-reads <min_reads>¶

Minimum number of reads to accept the SNP

Default: 4

-m, --sample-ids <sample_ids>¶: the ids of the samples used in the analysis, must be the same as in the GFF file

-n, --num-lines <num_lines>¶

Number of VCF lines after which the status is printed

Default: 100000

Arguments

VCF_FILE¶: Optional argument

OUTPUT_FILE¶: Required argument

vcf_alt¶

Parses a VCF file and a GFF file to produce the data used for pnps-gen; uses a file list for sample coverage instead of taking the information from the GFF file

pnps-gen vcf_alt [OPTIONS] [VCF_FILE] OUTPUT_FILE

Options

-v, --verbose¶

-ft, --feature <feature>¶

Feature to use in the GFF file

Default: CDS

-g, --gff-file <gff_file>¶: Required GFF file to use

-a, --fasta-file <fasta_file>¶: Required Reference file (FASTA) for the GFF

-q, --min-qual <min_qual>¶

Minimum quality for SNPs (Phred score)

Default: 30

-f, --min-freq <min_freq>¶

Minimum allele frequency

Default: 0.01

-r, --min-reads <min_reads>¶

Minimum number of reads to accept the SNP

Default: 4

-n, --num-lines <num_lines>¶

Number of VCF lines after which the status is printed

Default: 100000

-l, --sample-file <sample_file>¶: Required File with list of coverage files and sample names (TAB separated)

-s, --file-list <file_list>¶: File with list of VCF files (one per line)

-u, --uid-map <uid_map>¶: Only load annotations from a specific map file

Arguments

VCF_FILE¶: Optional argument

OUTPUT_FILE¶: Required argument
