MGKit GFF Specifications
The GFF files produced by MGKit follow the conventions of GFF/GTF files, but provide some additional fields in the 9th column. These fields translate to a Python dictionary when an annotation is loaded into an Annotation instance.
The 9th column is a list of key=value items separated by semicolons (;). Each value is expected to be quoted with double quotes and must not include a semicolon or other characters that can make parsing difficult. MGKit uses urllib.quote() to encode those characters and also " ()/". mgkit.io.gff.from_gff() uses urllib.unquote() to decode the values when they are read.
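The round trip described above can be sketched with the standard library alone. This is a minimal illustration, not MGKit's own code: it assumes Python 3, where the two functions live in urllib.parse, and the helper names encode_value and decode_value are hypothetical. Passing safe="" is also an assumption, chosen so that every reserved character is escaped; the arguments MGKit passes may differ.

```python
from urllib.parse import quote, unquote


def encode_value(value, safe=""):
    """Percent-encode a value so it can be stored in the 9th column.

    safe="" is an assumption: it escapes every reserved character,
    including the semicolon, space, parentheses and slash.
    """
    return quote(str(value), safe=safe)


def decode_value(text):
    """Reverse the percent-encoding when the value is read back."""
    return unquote(text)


raw = "ribosomal protein (L7/L12); putative"
encoded = encode_value(raw)
print(encoded)                      # no literal semicolon is left in the value
print(decode_value(encoded) == raw)
```

The safe argument of quote() lists characters that are left untouched, so it is the one place to adjust if some characters should stay readable in the file.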
Warning
As the last column translates to a dictionary in the data structures, duplicate keys are not allowed. mgkit.io.gff.from_gff() raises an exception if any are found.
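To make the duplicate-key rule concrete, the following sketch parses a 9th column into a dictionary and refuses repeated keys. It is an assumption-laden illustration: parse_attributes and DuplicateKeyError are hypothetical names, and the exception type actually raised by mgkit.io.gff.from_gff() is not stated in this document.

```python
from urllib.parse import unquote


class DuplicateKeyError(ValueError):
    """Hypothetical exception used only in this sketch."""


def parse_attributes(column):
    """Turn 'key="value";key2="value2"' into a dictionary."""
    attributes = {}
    for item in column.strip().split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("item without '=': {!r}".format(item))
        key = key.strip()
        if key in attributes:
            # a dictionary cannot hold the same key twice
            raise DuplicateKeyError(key)
        # drop the surrounding double quotes, then undo the encoding
        attributes[key] = unquote(value.strip().strip('"'))
    return attributes


print(parse_attributes('gene_id="K00001";taxon_id="838"'))
```

Values are returned as strings here; converting them to other types is left to the caller.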
Reserved Values
Any key can be added to a GFF annotation, but MGKit expects a few keys to be present, as summarised in the following tables.
Key | Value | Explanation |
---|---|---|
seq_id | string | the sequence of the annotation (header in FASTA files) |
gene_id | any string | used to identify the gene predicted |
db | any string, like UNIPROT-SP, UNIPROT-TR, NCBI-NT | identifies the database used to make the gene_id prediction |
taxon_db | any string, like UNIPROT-SP, UNIPROT-TR, NCBI-NT | identifies the database used to make the taxon_id prediction |
dbq | integer | identifies the quality of the database, used when filtering annotations |
taxon_id | integer | identifies the annotation taxon, NCBI taxonomy is used |
uid | string | unique identifier for the annotation, any string is accepted but a value is assigned by using |
cov and {any}_cov | integer | coverage for the annotation over all samples, keys ending with _cov indicate coverage for each sample |
exp_syn, exp_nonsyn | integer | used for expected number of synonymous and non-synonymous changes for the annotation |
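The value types in the table above matter once the 9th column has been read, because every value arrives as text. The sketch below shows one way to restore the integer-typed reserved keys. It assumes the attributes are already in a plain dictionary of strings; convert_reserved and INTEGER_KEYS are hypothetical names, not part of MGKit.

```python
# Reserved keys that the table lists as integers.
INTEGER_KEYS = {"dbq", "taxon_id", "cov", "exp_syn", "exp_nonsyn"}


def convert_reserved(attributes):
    """Return a copy of the attributes with integer keys converted."""
    converted = {}
    for key, value in attributes.items():
        # per-sample coverage keys end with "_cov"
        if key in INTEGER_KEYS or key.endswith("_cov"):
            converted[key] = int(value)
        else:
            converted[key] = value
    return converted


example = {
    "gene_id": "K00001",
    "db": "UNIPROT-SP",
    "dbq": "10",
    "taxon_id": "838",
    "Sample1_cov": "12",
}
print(convert_reserved(example))
```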
The following keys are added by different scripts and may be used in different scripts or annotation methods.
Key | Value | Explanation | Used |
---|---|---|---|
taxon_name | string | name of the taxon | not used |
lineage | string | taxon lineage | not used |
EC | comma separated values | list of EC numbers associated with the annotation | used by |
map_{any} | comma separated values | list of mappings to a specific db (e.g. eggNOG -> map_EGGNOG) | |
counts_{any} | float | stores the count data for a sample (e.g. counts_Sample1) | used by script add-gff-info |
fpkms_{any} | float | stores the count data for a sample (e.g. fpkms_Sample1) | used by script add-gff-info |
FC | string (no spaces) | each character is a Functional Category in eggNOG (e.g. FJO are 3 different categories) | used by |
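The optional keys in this table follow three patterns: comma separated lists, one key per sample sharing a prefix, and a string whose characters are separate categories. A minimal sketch of reading each pattern from a dictionary of string values follows. The helper names split_values and sample_values are hypothetical, and the sample data is invented for illustration.

```python
def split_values(value):
    """Split a comma separated value (EC, map_{any}) into a list."""
    return [item for item in value.split(",") if item]


def sample_values(attributes, prefix):
    """Collect per-sample floats, e.g. prefix 'counts_' or 'fpkms_'."""
    return {
        key[len(prefix):]: float(value)
        for key, value in attributes.items()
        if key.startswith(prefix)
    }


attributes = {
    "EC": "1.1.1.1,2.7.7.6",
    "map_EGGNOG": "COG0001,COG0002",
    "counts_Sample1": "10.0",
    "counts_Sample2": "3.5",
    "fpkms_Sample1": "1.25",
    "FC": "FJO",
}

print(split_values(attributes["EC"]))
print(sample_values(attributes, "counts_"))
print(list(attributes["FC"]))  # one entry per functional category
```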