.. _gff-specs:

MGKit GFF Specifications
========================

The GFF produced with MGKit follows the conventions of GFF/GTF files, but it
provides some additional fields in the 9th column, which translate to a Python
dictionary when an annotation is loaded into an :class:`Annotation` instance.

The 9th column is a list of **key=value** items, separated by a semicolon (;).
Each value is also expected to be quoted with double quotes and must not
include a semicolon or other characters that can make the parsing difficult.
MGKit uses :func:`urllib.quote` to encode those characters and also " ()/".
:func:`mgkit.io.gff.from_gff` uses :func:`urllib.unquote` to decode the values
when loading them.

.. warning::

    As the last column translates to a dictionary in the data structures,
    duplicate keys are not allowed. :func:`mgkit.io.gff.from_gff` raises an
    exception if any are found.

Reserved Values
---------------

Any key can be added to a GFF annotation, but MGKit expects a few keys to be
in the GFF annotation, as summarised in the following tables.

.. list-table:: Reserved values, used by the scripts
    :header-rows: 1
    :stub-columns: 1

    * - Key
      - Value
      - Explanation
    * - seq_id
      - string
      - the sequence of the annotation (header in FASTA files)
    * - gene_id
      - any string
      - used to identify the gene predicted
    * - db
      - any string, like UNIPROT-SP, UNIPROT-TR, NCBI-NT
      - identifies the database used to make the gene_id prediction
    * - taxon_db
      - any string, like UNIPROT-SP, UNIPROT-TR, NCBI-NT
      - identifies the database used to make the taxon_id prediction
    * - dbq
      - integer
      - identifies the quality of the database, used when filtering
        annotations
    * - taxon_id
      - integer
      - identifies the annotation taxon, NCBI taxonomy is used
    * - uid
      - string
      - unique identifier for the annotation, any string is accepted but a
        value is assigned by using :func:`uuid.uuid4`
    * - cov and {any}_cov
      - integer
      - coverage for the annotation over all samples, keys ending with *_cov*
        indicate coverage for each sample
    * - exp_syn, exp_nonsyn
      - integer
      - used for the expected number of synonymous and non-synonymous changes
        for the annotation

The following keys are added by different scripts and may be used in different
scripts or annotation methods.

.. list-table:: Interpreted Values
    :header-rows: 1
    :stub-columns: 1

    * - Key
      - Value
      - Explanation
      - Used
    * - taxon_name
      - string
      - name of the taxon
      - not used
    * - lineage
      - string
      - taxon lineage
      - not used
    * - EC
      - comma separated values
      - list of EC numbers associated to the annotation
      - used by :meth:`mgkit.io.gff.Annotation.get_ec`
    * - map_{any}
      - comma separated values
      - list of mappings to a specific db (e.g. eggNOG -> map_EGGNOG)
      - used by :meth:`mgkit.io.gff.Annotation.get_mapping`
    * - counts_{any}
      - float
      - Stores the count data for a sample (e.g. counts_Sample1)
      - used by script `add-gff-info`
    * - fpkms_{any}
      - float
      - Stores the count data for a sample (e.g. fpkms_Sample1)
      - used by script `add-gff-info`
    * - FC
      - string (no spaces)
      - each character is a Functional Category in eggNOG (e.g. FJO are 3
        different categories)
      - used by :meth:`mgkit.io.gff.Annotation.get_fc`
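
The quoting and duplicate-key rules for the 9th column can be illustrated with
a minimal sketch. This is not MGKit's implementation: ``encode_attributes`` and
``decode_attributes`` are hypothetical helpers written for this example, and
the set of characters that gets percent-encoded here (everything outside
letters, digits and ``_.-~``) is an assumption that may differ from what MGKit
does. On Python 3 the quoting functions live in :mod:`urllib.parse`.

```python
from urllib.parse import quote, unquote


def encode_attributes(attributes):
    """Build a 9th-column string from a dictionary (hypothetical helper).

    Every value is converted to a string, percent-encoded and wrapped in
    double quotes, so semicolons inside a value cannot break the parsing.
    """
    return ";".join(
        '{}="{}"'.format(key, quote(str(value), safe=""))
        for key, value in attributes.items()
    )


def decode_attributes(column):
    """Parse a 9th-column string back into a dictionary (hypothetical helper).

    Raises ValueError on duplicate keys, mirroring the rule that the column
    must map cleanly onto a dictionary.
    """
    attributes = {}
    for item in column.strip().split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        key = key.strip()
        if key in attributes:
            raise ValueError("duplicate key in attributes: {}".format(key))
        # Drop the surrounding double quotes, then undo the percent-encoding
        attributes[key] = unquote(value.strip().strip('"'))
    return attributes


example = {
    "gene_id": "K00001",
    "taxon_id": 9606,
    "EC": "1.1.1.1,1.1.1.2",
    "lineage": "Bacteria;Firmicutes (phylum)/x",
}
column = encode_attributes(example)
decoded = decode_attributes(column)
# Comma separated values, such as EC, can be split after decoding
ec_numbers = decoded["EC"].split(",")
```

Note that every value comes back as a string in this sketch, so keys documented
as integers or floats (for example taxon_id or counts_Sample1) would still need
an explicit conversion after decoding.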