pnps-gen - pN/pS Table Generation
Overview
Calculates pN/pS values
This script calculates pN/pS values from the data produced by its vcf command. The resulting table is a CSV file.
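For orientation, pN/pS is commonly defined as the ratio between the rate of nonsynonymous polymorphisms (observed over expected) and the rate of synonymous polymorphisms (observed over expected). The sketch below implements only that general definition. It is not taken from MGKit's source, the function name and arguments are hypothetical, and returning NaN when the ratio is undefined is an assumption rather than the script's documented behaviour.

```python
import math


def pn_ps(nonsyn, exp_nonsyn, syn, exp_syn):
    """General pN/pS definition (hypothetical helper, not MGKit's code).

    nonsyn / syn: observed nonsynonymous / synonymous SNP counts
    exp_nonsyn / exp_syn: expected counts of each kind
    """
    # Assumption: an undefined ratio is reported as NaN
    if exp_nonsyn == 0 or exp_syn == 0 or syn == 0:
        return math.nan
    p_n = nonsyn / exp_nonsyn
    p_s = syn / exp_syn
    return p_n / p_s


# 3 of 30 expected nonsynonymous vs 2 of 10 expected synonymous changes
print(pn_ps(3, 30, 2, 10))
```

Values below 1 are usually read as a sign of purifying selection, values above 1 as a sign of diversifying selection.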
Parse VCF Files
The vcf command parses a VCF file to produce the pickle file that is used to calculate the pN/pS.
Calculate Rank pN/pS
The rank command reads the SNP information and calculates a pN/pS value for each element of a specific taxonomic rank (species, genus, family, etc.). Alternatively, the None rank makes the script use the taxonomic ID found in the annotations.
For example, choosing the genus rank produces a table similar to:
Prevotella,0.0001,1,1.1,0.4
Methanobrevibacter,1,0.5,0.6,0.8
A pN/pS value is calculated for each genus and sample (4 samples in this case).
It is important to specify the taxonomic IDs to include in the calculations. By default only bacteria are included. To find those IDs, the taxonomy can be queried using taxon-utils get.
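A table like the one above can be loaded with the Python standard library. This is a minimal sketch that assumes the layout of the excerpt (taxon name first, one value per sample, no header row). The real output file may carry a header with sample names, which would need to be skipped first. The function name is hypothetical.

```python
import csv
import io

# The two example rows from the text, used as stand-in input
EXAMPLE = """Prevotella,0.0001,1,1.1,0.4
Methanobrevibacter,1,0.5,0.6,0.8
"""


def read_rank_table(handle):
    """Return {taxon: [pN/pS per sample]} from a CSV handle."""
    table = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        taxon, *values = row
        # Assumption: an empty cell means no pN/pS for that sample
        table[taxon] = [float(v) if v else float("nan") for v in values]
    return table


table = read_rank_table(io.StringIO(EXAMPLE))
for taxon, values in table.items():
    print(taxon, len(values), "samples, max pN/pS:", max(values))
```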
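As an illustration, the snippet below assembles a rank invocation from Python using only flags listed in the Options section (-t, -s, -r, -i). The file names are hypothetical, the positional argument is assumed to be the output table, and 2157 is the NCBI taxon ID for Archaea, chosen here because Methanobrevibacter would not be covered by the default. Treat it as a sketch, not a tested pipeline.

```python
import shlex


def rank_command(taxonomy, snp_data, output, rank="genus", taxon_id=2157):
    """Build the argument list for `pnps-gen rank` (hypothetical helper)."""
    return [
        "pnps-gen", "rank",
        "-t", taxonomy,        # taxonomy file (required)
        "-s", snp_data,        # SNP data file (required)
        "-r", rank,            # taxonomic rank
        "-i", str(taxon_id),   # taxon ID to include
        output,                # assumed to be the output table
    ]


cmd = rank_command("taxonomy.pickle", "snp_data.pickle", "pnps-genus.csv")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```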
Calculate Gene/Rank pN/pS
The full command creates a gene/taxon table of pN/pS values. Internally it is a pandas MultiIndex DataFrame, which is written in CSV format when the script finishes. Unlike the rank command, the pN/pS is calculated for each gene/taxon pair, and by default the gene_id from the original GFF file is used (it is stored in the file generated by snp_parser). If other gene IDs need to be used, a table file can be provided, in one of two column formats.
By default MGKit uses Uniprot gene IDs for the functions, but we may want to examine the KEGG Orthologs (KOs) instead. A table can be passed where the first column is the gene_id stored in the GFF file and the second is the KO:
Q7N6F9 K05685
Q7N6F9 K01242
G7E4F2 K05625
The Q7N6F9 gene_id is repeated because it corresponds to multiple KOs. This format must be selected with the -2 option of the command.
By default the command expects a table with a gene ID in the first column, followed by one or more tab-separated columns with the mappings. In this format the previous table would look like this:
Q7N6F9 K05685 K01242
G7E4F2 K05625
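The relationship between the two layouts can be sketched in Python. The helpers below are hypothetical and not part of MGKit. They collapse a two-column table with repeated keys into the default one-row-per-gene layout, assuming whitespace-separated input and tab-separated output.

```python
from collections import defaultdict

TWO_COLUMNS = [
    "Q7N6F9 K05685",
    "Q7N6F9 K01242",
    "G7E4F2 K05625",
]


def collapse_two_columns(lines, sep=None):
    """Group repeated gene IDs: {gene_id: [mapped IDs in input order]}."""
    mapping = defaultdict(list)
    for line in lines:
        fields = line.split(sep)  # sep=None splits on any whitespace
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene_id, mapped_id = fields[0], fields[1]
        if mapped_id not in mapping[gene_id]:
            mapping[gene_id].append(mapped_id)
    return dict(mapping)


def format_multi_column(mapping, sep="\t"):
    """One line per gene: the gene ID followed by all of its mappings."""
    return [sep.join([gene_id, *ids]) for gene_id, ids in mapping.items()]


mapping = collapse_two_columns(TWO_COLUMNS)
rows = format_multi_column(mapping)
print("\n".join(rows))
```

Either layout carries the same information. The two-column form only needs the -2 option so the command knows that keys repeat.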
These tables can be created from the original GFF file, assuming that mappings to KOs or EC numbers are included, with a command line like this:
edit-gff view -a gene_id -a map_KO final.contigs-a3.gff.gz | tr ',' '\t'
This extracts the KOs (which are comma separated in an MGKit GFF file) and changes every comma to a tab. The resulting table can be passed to the script to calculate the pN/pS for the KOs associated with the genes. A pN/pS is calculated only for gene IDs present in this file.
Normally all isoforms with the same gene_id are combined to produce a single pN/pS, but if needed, the -u option can be used to calculate a pN/pS for each line in the GFF file.
Changes
Changed in version 0.5.7: added vcf command to parse VCF files and generate data for the script
Changed in version 0.5.1: bug fix
Changed in version 0.5.1:
- added option to include the lineage as a string
- added option to use the uids from the GFF instead of gene_id, this does not require the GFF file, they are embedded into the .pickle file
New in version 0.5.0.
Options
pnps-gen
Main function
pnps-gen [OPTIONS] COMMAND [ARGS]...
Options
--version
    Show the version and exit.
--cite
full
Calculates pN/pS
pnps-gen full [OPTIONS] OUTPUT_FILE
Options
-v, --verbose
-t, --taxonomy <taxonomy>
    Required. Taxonomy file
-s, --snp-data <snp_data>
    Required. SNP data, output of snp_parser
-r, --rank <rank>
    Taxonomic rank
    Options: superkingdom|kingdom|phylum|class|order|family|genus|species|None
-m, --min-num <min_num>
    Minimum number of samples with a pN/pS to accept
    Default: 2
-c, --min-cov <min_cov>
    Minimum coverage for SNPs to be accepted
    Default: 4
-i, --taxon-ids <taxon_ids>
    Taxon IDs to include
-u, --use-uid
    Use uids from the GFF file instead of gene_id as genes
    Default: False
-g, --gene-map <gene_map>
    Dictionary to map gene_id to another ID
-2, --two-columns
    gene-map is a two columns table with repeated keys
-p, --separator <separator>
    column separator for gene-map file
    Default:
-l, --lineage
    Use lineage string instead of taxon_id
    Default: False
-e, --parquet
    Output a Parquet file instead of CSV
    Default: False
-ps, --only-ps
    Only calculate pS values
    Default: False
-pn, --only-pn
    Only calculate pN values
    Default: False
Arguments
OUTPUT_FILE
    Required argument
rank
Calculates pN/pS for a taxonomic rank
pnps-gen rank [OPTIONS] [TXT_FILE]
Options
-v, --verbose
-t, --taxonomy <taxonomy>
    Required. Taxonomy file
-s, --snp-data <snp_data>
    Required. SNP data, output of snp_parser
-r, --rank <rank>
    Taxonomic rank
    Default: order
    Options: superkingdom|kingdom|phylum|class|order|family|genus|species|None
-m, --min-num <min_num>
    Minimum number of samples with a pN/pS to accept
    Default: 2
-c, --min-cov <min_cov>
    Minimum coverage for SNPs to be accepted
    Default: 4
-i, --taxon_ids <taxon_ids>
    Taxon IDs to include
    Default: 2
-u, --unstack
    Samples are not in columns but as an array
    Default: False
-l, --lineage
    Use lineage string instead of taxon_id
    Default: False
-ps, --only-ps
    Only calculate pS values
    Default: False
-pn, --only-pn
    Only calculate pN values
    Default: False
Arguments
TXT_FILE
    Optional argument
vcf
parse a VCF file and a GFF file to produce the data used for pnps-gen
pnps-gen vcf [OPTIONS] [VCF_FILE] OUTPUT_FILE
Options
-v, --verbose
-ft, --feature <feature>
    Feature to use in the GFF file
    Default: CDS
-g, --gff-file <gff_file>
    Required. GFF file to use
-a, --fasta-file <fasta_file>
    Required. Reference file (FASTA) for the GFF
-q, --min-qual <min_qual>
    Minimum quality for SNPs (Phred score)
    Default: 30
-f, --min-freq <min_freq>
    Minimum allele frequency
    Default: 0.01
-r, --min-reads <min_reads>
    Minimum number of reads to accept the SNP
    Default: 4
-m, --sample-ids <sample_ids>
    the ids of the samples used in the analysis, must be the same as in the GFF file
-n, --num-lines <num_lines>
    Number of VCF lines after which printing status
    Default: 100000
Arguments
VCF_FILE
    Optional argument
OUTPUT_FILE
    Required argument
vcf_alt
parse a VCF file and a GFF file to produce the data used for pnps-gen, uses a list of files for sample coverage instead of taking information from the GFF file
pnps-gen vcf_alt [OPTIONS] [VCF_FILE] OUTPUT_FILE
Options
-v, --verbose
-ft, --feature <feature>
    Feature to use in the GFF file
    Default: CDS
-g, --gff-file <gff_file>
    Required. GFF file to use
-a, --fasta-file <fasta_file>
    Required. Reference file (FASTA) for the GFF
-q, --min-qual <min_qual>
    Minimum quality for SNPs (Phred score)
    Default: 30
-f, --min-freq <min_freq>
    Minimum allele frequency
    Default: 0.01
-r, --min-reads <min_reads>
    Minimum number of reads to accept the SNP
    Default: 4
-n, --num-lines <num_lines>
    Number of VCF lines after which printing status
    Default: 100000
-l, --sample-file <sample_file>
    Required. File with list of coverage files and sample names (TAB separated)
-s, --file-list <file_list>
    File with list of VCF files (one per line)
-u, --uid-map <uid_map>
    Only load annotations from a specific map file
Arguments
VCF_FILE
    Optional argument
OUTPUT_FILE
    Required argument