add-gff-info - Add informations to GFF annotations¶
Overview¶
Add more information to GFF annotations: gene mappings, coverage, taxonomy, etc..
Uniprot Command¶
If the gene_id of an annotation is a Uniprot ID, the script queries Uniprot for the requested information. At the moment the information that can be added is the taxon_id, taxon_name, lineage and mapping to EC, KO, eggNOG IDs.
It’s also possible to add mappings to other databases using the -m option with the correct identifier for the mapping, which can be found at this page; for example if it’s we want to add the mappings of uniprot IDs to BioCyc, in the abbreviation column of the mappings we find that it’s identifier is REACTOME_ID, so we pass -m REACTOME to the script (leaving _ID out). Mapped IDs are separated by commas.
The taxonomy IDs are not overwritten if they are found in the annotations, the -f is provided to force the overwriting of those values.
See also MGKit GFF Specifications for more informations about the GFF specifications used.
Note
As the script needs to query Uniprot a lot, it is recommended to split the GFF in several files, so an error in the connection doesn’t waste time.
However, a cache is kept to reduce the number of connections
Coverage Command¶
Adds coverage information from BAM alignment files to a GFF file, using the
function mgkit.align.add_coverage_info()
, the user needs to supply for
each sample a BAM file, using the -a option, whose parameter is in the form
sample,samplealg.bam. More samples can be supplied adding more -a
arguments.
Hint
As an example, to add coverage for sample1, sample2 the command line is:
add-gff-info coverage -a sample1,sample1.bam -a sample2,sample2.bam \
inputgff outputgff
A total coverage for the annotation is also calculated and stored in the cov attribute, while each sample coverage is stored into sample_cov as per MGKit GFF Specifications.
Adding Coverage from samtools depth¶
The cov_samtools allows the use of the output of samtools depth command. The command work similarly to coverage, accepting compressed depth files as well. If only one depth file is passed and no sample is passed, the attribute in the GFF will be cov, otherwise the attribute will be sample1_cov, sample2_cov, etc.
Note
From version 0.4.2 the behaviour is different in the following ways:
- the GFF file is loaded in memory to get information that helps reduce
the memory size
- the depth file is scanned and each time a sequence in the GFF file is
found, coverage data is added, this way memory is freed more often
- the GFF file is written at the very end, so chaining multiple commands
(one for each sample) doesn’t give any advantage now
- there’s no need to use the -aa option of samtools depth, since some
assumptions are made when scanning the fils (see next point)
- when the depth file is scanned completely, the sequences in the GFF file
with coverage information are set coverage 0.0, instead of raising an error
average coverage is only calculated if requested, with the -m option
The GFF file is first loaded in memory to get information about the maximum array size for each sequence, then the script proceeds for each depth file, to scan it until a) all sequences in the GFF are found or b) until the end of the file. If b) is the case, all sequences with not found in the depth file are set to a coverage of 0.0. A warning with the number of sequences lacking coverage is printed, for diagnostic purposes. If the the mean coverage is requested with the -m option, coverage is then calculated
To create samtools depth files, you can use this command:
$ samtools depth [bam_file] > [depth_file]
where [bam_file] is the alignment to be used and [depth_file] where the
file will be stored (by redirection). You could create a gzipped file, as
the file can be huge (and in my experience compressing reduces by ~10x the
file size)::
$ samtools depth [bam_file] | gzip > [depth_file.gz]
Warning
if version < 0.4.2:
The -aa options must be used to pass information about all base pairs and sequences coverage in the BAM/SAM file.
$ samtools depth -aa bam_file
Uniprot Offline Mappings¶
Similar to the uniprot command, it uses the idmapping file provided by Uniprot, which speeds up the process of adding mappings and taxonomy IDs from Uniprot gene IDs. It’s not possible tough to add EC mappings with this command, as those are not included in the file.
Kegg Information¶
The kegg command allows to add information to each annotation. Right now the information that can be added is restricted to the pathway(s) (reference KO) a KO is part of and both the KO and pathway(s) descriptions. This information is stored in keys starting with ko_.
Expected Aminoacidic Changes¶
Some scripts, like snp_parser - SNPs analysis, require information about the expected
number of synonymous and non-synonymous changes of an annotation. This can be
done using mgkit.io.gff.Annotation.add_exp_syn_count()
by the user of the
command exp_syn of this script. The attributes added to each annotation are
explained in the MGKit GFF Specifications
Adding Count Data¶
Count data on a per-sample basis can be added with the counts command. The accepted inputs are from HTSeq-count and featureCounts. The ouput produced by featureCounts, is the one from using its -f option must be used.
This script accept by default a tab separated file, with a uid in the first column and the other columns are the counts for each sample, in the same order as they are passed to the -s option. To use the featureCounts file format, this script -e option must be used.
The sample names must be provided in the same order as the columns in the input files. If the counts are FPKMS the -f option can be used.
Adding Taxonomy from a Table¶
There are cases where it may needed or preferred to add the taxonomy from a gene_id already provided in the GFF file. For such cases the addtaxa command can be used. It works in a similar way to the taxonomy command, only it expect three different type of inputs:
GI-Taxa table from NCBI (e.g. gi_taxid_nucl.dmp, )
tab separated table
dictionary
HDF5
The first two are tab separated files, where on each line, the first column is the gene_id that is found in the first column, while the second if the taxon_id.
The third option is a serialised Python dict/hash table, whose keys are the gene_id and the value is that gene corresponding taxon_id. The serialised formats accepted are msgpack, json and pickle. The msgpack module must be importable. The option to use json and msgpack allow to integrate this script with other languages without resorting to a text file.
The last option is a HDF5 created using the to_hdf command in taxon-utils - Taxonomy Utilities. This requires pandas installed and pytables and it provides faster lookup of IDs in the table.
While the default is to look for the gene_id attribute in the GFF annotation, another attribute can be specified, using the -gene-attr option.
Note
the dictionary content is loaded after the table files and its keys and corresponding values takes precedence over the text files.
Warning
from September 2016 NCBI will retire the GI. In that case the same
kind of table can be built from the nucl_gb.accession2taxid.gz file
The format is different, but some information can be found in
mgkit.io.blast.parse_accession_taxa_table()
Adding information from Pfam¶
Adds the Pfam description for the annotation, by downloading the list from Pfam.
The options allow to specify in which attribute the ID/ACCESSION is stored (defaults to gene_id) and which one between ID/ACCESSION is the value of that attribute (defaults to ID). if no description is found for the family, a warning message is logged.
Adding information about Bins¶
Warning
The taxonomy used must have been created using the taxon-utils import command on the same DB the results is based upon. At the moment, only PhyloPhlan is supported
The script needs to know to which Bin the annotation belongs to, which can be defined at the moment in 3 different ways (the priority given is in reverse order):
passing the Bin ID defined in the first column of PhyloPhlan results
pass the attribute in the GFF file that contains the Bin ID
the seq_id (FASTA headers) contain the Bin ID
The first way to operate relies on -i to indicate the exact bin_id to use. This can only be passed once, so the whole input GFF will set to that Bin ID.
The secondo option -a relies on an attribute being set in the GFF, this can be used if annotations in the GFF file comes from different Bins
The third option -h relies on the fact that the sequences were renames in a way that can help find the Bin ID. It relies on the sequences being renamed by fasta-utils rename in a form similar to NCBI. By default expects something like this as seq_id:
BinID.fa|contig_1221|xXAS
The script will split the header -s ‘|’, then picking the first element -x 0 and finally replacing -e .fa from it:
BinID.fa|contig_1221|xXAS -> BinID.fa -> BinID
Changes¶
Changed in version 0.5.7: added bins command and –manual option to script
Changed in version 0.4.2: added -m option for cov_samtools command, to calculate the average coverage for an annotation (cov attribute). Fixed loading compressed files
Changed in version 0.3.4: removed the taxonomy command, since a similar result can be obtained with taxon-utils lca and add-gff-info addtaxa. Removed eggnog command and added option to verbose the logging in cov_samtools (now is quiet), also changed the interface
Changed in version 0.3.3: changed how addtaxa -a works, to allow the use of seq_id as key to add the taxon_id
Changed in version 0.3.0: added cov_samtools command, –split option to exp_syn, -c option to addtaxa. kegg now does not skip annotations when the attribute is not found.
Changed in version 0.2.6: added skip-no-taxon option to addtaxa
Changed in version 0.2.5: if a dictionary is supplied to addtaxa, the GFF is not preloaded
Changed in version 0.2.3: added pfam command, renamed gitaxa to addtaxa and made it general
Changed in version 0.2.2: added eggnog, gitaxa and counts command
Changed in version 0.2.1.
added -d to uniprot command
added cache to uniprot command
added kegg command (cached)
Changed in version 0.1.16: added exp_syn command
Changed in version 0.1.15: taxonomy command -b option changed
Changed in version 0.1.13.
added –force-taxon-id option to the uniprot command
added coverage command
added taxonomy command
added unipfile command
New in version 0.1.12.
Options¶
add-gff-info¶
Main function
add-gff-info [OPTIONS] COMMAND [ARGS]...
Options
-
--version
¶
Show the version and exit.
-
--cite
¶
-
--manual
¶
addtaxa¶
Adds taxonomy information from a GI-Taxa, gene_id/taxon_id table or a dictionary serialised as a pickle/msgpack/json file, or a table in a HDF5 file
add-gff-info addtaxa [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-
-v
,
--verbose
¶
-
-t
,
--gene-taxon-table
<gene_taxon_table>
¶ GIDs taxonomy table (e.g. gi_taxid_nucl.dmp.gz) or a similar file where GENE/TAXON are tab separated and one per line
-
-a
,
--gene-attr
<gene_attr>
¶ In which attribute the GENEID in the table is stored (defaults to gene_id)
-
-f
,
--hdf-table
<hdf_table>
¶ HDF5 file and table name to use for taxon_id lookups. The format to pass is the file name, colon and the table file hf5:taxa-table. The index in the table is the accession_id, while the column taxon_id stores the taxon_id as int
-
-x
,
--taxonomy
<taxonomy>
¶ Taxonomy file - If given, both taxon_name and lineage attributes will be set
-
-d
,
--dictionary
<dictionary>
¶ A serialised dictionary, where the key is the GENEID and the value is TAXONID. It can be in json or msgpack format (can be a compressed file) Note: the dictionary values takes precedence over the table files
-
-e
,
--skip-no-taxon
¶
If used, annotations with no taxon_id won’t be outputted
-
-db
,
--taxon-db
<taxon_db>
¶ DB used to add the taxonomic information
-
-c
,
--cache-table
¶
If used, annotations are not preloaded, but the taxa table is cached, instead
-
--progress
¶
Shows Progress Bar
Arguments
-
INPUT_FILE
¶
Optional argument
-
OUTPUT_FILE
¶
Optional argument
bins¶
Adds information about bins
add-gff-info bins [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-
-v
,
--verbose
¶
-
-t
,
--taxonomy
<taxonomy>
¶ Required Taxonomy file
-
-r
,
--bins-results
<bins_results>
¶ Required Bins taxonomy results
-
-y
,
--results-type
<results_type>
¶ Software used
- Default
phylophlan
- Options
phylophlan
-
-i
,
--bin-id
<bin_id>
¶ Bin ID from PhyloPhlan results to use
-
-a
,
--bin-attribute
<bin_attribute>
¶ The Bin ID is stored as attribute in the annotation
-
-h
,
--header-bin
¶
The Bin ID is stored as part of the FASTA header
-
-e
,
--header-replace
<header_replace>
¶ String to replace in the FASTA header
- Default
.fa
-
-s
,
--header-separator
<header_separator>
¶ Separator for FASTA header items
- Default
-
-x
,
--header-index
<header_index>
¶ Index for item to use in FASTA header
- Default
0
-
-s
,
--ignore-missing-bin
¶
Bins not found will have an ID set
Arguments
-
INPUT_FILE
¶
Optional argument
-
OUTPUT_FILE
¶
Optional argument
counts¶
Adds counts data to the GFF file
add-gff-info counts [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-
-v
,
--verbose
¶
-
-s
,
--samples
<samples>
¶ Required Sample names, in the same order as the count files
-
-c
,
--count-files
<count_files>
¶ Required Count file(s)
-
-f
,
--fpkms
¶
If the counts are FPKMS
-
-e
,
--featureCounts
¶
If the counts files are from featureCounts
-
--progress
¶
Shows Progress Bar
Arguments
-
INPUT_FILE
¶
Optional argument
-
OUTPUT_FILE
¶
Optional argument
cov_samtools¶
Adds information from samtools_depth
add-gff-info cov_samtools [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-
-v
,
--verbose
¶
-
-m
,
--average
¶
if one or more samples are provided, the average coverage is calculated
-
-s
,
--samples
<samples>
¶ Sample name, will add a sample_cov in the GFF file. If not passed, the attribute will be cov
-
-d
,
--depths
<depths>
¶ Required samtools depth -aa file
-
-n
,
--num-seqs
<num_seqs>
¶ Number of sequences to update the log. If 0, no message is logged
- Default
0
-
--progress
¶
Shows Progress Bar
Arguments
-
INPUT_FILE
¶
Optional argument
-
OUTPUT_FILE
¶
Optional argument
coverage¶
Adds coverage information from BAM Alignment files
add-gff-info coverage [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-
-v
,
--verbose
¶
-
-a
,
--sample-alignment
<sample_alignment>
¶ Required sample name and correspondent alignment file separated by comma
Arguments
-
INPUT_FILE
¶
Optional argument
-
OUTPUT_FILE
¶
Optional argument
exp_syn¶
Adds expected synonymous and non-synonymous changes information
add-gff-info exp_syn [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-
-v
,
--verbose
¶
-
-r
,
--reference
<reference>
¶ Required reference sequence in fasta format
-
-s
,
--split
¶
Split the sequence header of the reference at the first space, to emulate BLAST behaviour
-
--progress
¶
Shows Progress Bar
Arguments
-
INPUT_FILE
¶
Optional argument
-
OUTPUT_FILE
¶
Optional argument
kegg¶
Adds information and mapping from Kegg
add-gff-info kegg [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-
-v
,
--verbose
¶
-
-c
,
--email
<email>
¶ Required Contact email
-
-d
,
--description
¶
Add Kegg description
-
-p
,
--pathways
¶
Add pathways ID involved
-
-m
,
--kegg-id
<kegg_id>
¶ In which attribute the Kegg ID is stored (defaults to gene_id)
Arguments
-
INPUT_FILE
¶
Optional argument
-
OUTPUT_FILE
¶
Optional argument
pfam¶
Adds information from Pfam
add-gff-info pfam [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-
-v
,
--verbose
¶
-
-i
,
--id-attr
<id_attr>
¶ In which attribute the Pfam ID/ACCESSION is stored (defaults to gene_id)
-
-a
,
--use-accession
¶
If used, the attribute value is the Pfam ACCESSION (e.g. PF06894), not ID (e.g. Phage_TAC_2)
Arguments
-
INPUT_FILE
¶
Optional argument
-
OUTPUT_FILE
¶
Optional argument
unipfile¶
Adds Uniprot gene information from a file
add-gff-info unipfile [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-
-v
,
--verbose
¶
-
-i
,
--mapping-file
<mapping_file>
¶ Required Uniprot mapping file
-
-f
,
--force-taxon-id
¶
Overwrite taxon_id if already present
-
-m
,
--mapping
<mapping>
¶ Required Mappings to add
- Options
NCBI_TaxID|eggNOG|KO|KEGG|BioCyc|UniPathway|EMBL|EMBL-CDS|GI|STRING
-
--progress
¶
Shows Progress Bar
Arguments
-
INPUT_FILE
¶
Optional argument
-
OUTPUT_FILE
¶
Optional argument
uniprot¶
Adds information from GFF whose gene_id is from Uniprot
add-gff-info uniprot [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-
-v
,
--verbose
¶
-
-c
,
--email
<email>
¶ Required Contact email
-
--buffer
<buffer>
¶ Number of annotations to keep in memory
- Default
50
-
-f
,
--force-taxon-id
¶
Overwrite taxon_id if already present
-
-t
,
--taxon-id
¶
Add taxonomic ids to annotations, if taxon_id is found, it won’t be Overwritten.
-
-l
,
--lineage
¶
Add taxonomic lineage to annotations
-
-e
,
--eggnog
¶
Add eggNOG mappings to annotations
-
-ec
,
--enzymes
¶
Add EC mappings to annotations
-
-ko
,
--kegg_orthologs
¶
Add KO mappings to annotations
-
-d
,
--protein-names
¶
Add Uniprot description
-
-m
,
--mapping
<mapping>
¶ Add any DB mappings to annotations
Arguments
-
INPUT_FILE
¶
Optional argument
-
OUTPUT_FILE
¶
Optional argument