mgkit.io.gff module¶
This modules define classes and function related to manipulation of GFF/GTF files.
-
class
mgkit.io.gff.
Annotation
(seq_id='None', start=1, end=1, strand='+', source='None', feat_type='None', score=0.0, phase=0, uid=None, **kwd)[source]¶ Bases:
mgkit.io.gff.GenomicRange
New in version 0.1.12.
Changed in version 0.2.1: using __slots__ for better memory usage
Alternative implementation for an Annotation. When initialised, If uid is None, a unique id is added using uuid.uuid4.
-
add_exp_syn_count
(seq, syn_matrix=None)[source]¶ New in version 0.1.13.
Adds expected synonymous/non-synonymous values for an annotation.
- Parameters
seq (str) – sequence corresponding to the annotation seq_id syn_matrix (None, dict): matrix that determines the return values. Defaults to the one defined in the called function
mgkit.utils.sequnce.get_seq_expected_syn_count()
.
-
add_gc_content
(seq)[source]¶ Adds GC content information for an annotation. The formula is:
(1)¶\[\frac {(G + C)}{(G + C + A + T)}\]Modifies the instances of the annotation. gc_ratio will be added to its attributes.
- Parameters
seq (str) – nucleotide sequence referred in the GFF
-
add_gc_ratio
(seq)[source]¶ Adds GC content information for an annotation. The formula is:
(2)¶\[\frac {(A + T)}{(G + C)}\]Modifies the instances of the annotation. gc_ratio will be added to its attributes.
- Parameters
seq (str) – nucleotide sequence referred in the GFF
-
attr
¶
-
property
bitscore
¶ bitscore of the annotation
-
property
counts
¶ New in version 0.2.2.
Returns the sample counts for the annotation
-
property
coverage
¶ New in version 0.1.13.
Return the total coverage for the annotation
- Return float
coverage
- Raises
AttributeNotFound – if no coverage attribute is found
-
property
db
¶ db used for the gene_id prediction
-
property
dbq
¶ db quality of the annotation
-
property
exp_nonsyn
¶ New in version 0.1.13.
Returns the expected number of non-synonymous changes
-
property
exp_syn
¶ New in version 0.1.13.
Returns the expected number of synonymous changes
-
feat_type
¶
-
property
fpkms
¶ New in version 0.2.2.
Returns the sample fpkms for the annotation
-
property
gene_id
¶ gene_id of the annotation, or ko if available
-
get_aa_seq
(seq, start=0, tbl=None, snp=None)[source]¶ New in version 0.1.16.
Returns a translated aminoacid sequence of the annotation. The snp parameter is passed to
Annotation.get_nuc_seq()
- Parameters
seq (seq) – chromosome/contig sequence
start (int) – position (0-based) from where the correct occurs (frame). If None, the phase attribute is used
tbl (dict) – dictionary with the translation for each codon, passed to
mgkit.utils.sequence.translate_sequence()
snp (tuple) – first element is the position of the SNP and the second element is the change
- Returns
aminoacid sequence
- Return type
-
get_attr
(attr, conv=<class 'str'>)[source]¶ Changed in version 0.3.4: any GFF attribute can be returned
Changed in version 0.3.3: added seq_id as special attribute, in addition do length
New in version 0.1.13.
Generic method to get an attribute and convert it to a specific datatype. The order for the lookup is:
length
self.attr (dictionary)
getattr(self) of the first 8 columns of a GFF (seq_id, source, …)
-
get_ec
(level=4)[source]¶ New in version 0.1.13.
Changed in version 0.2.0: returns a set instead of a list
Returns the EC values associated with the annotation, cutting them at the desired level.
-
get_mapping
(db)[source]¶ New in version 0.1.13.
Returns the mappings, to a particular db, associated with the annotation.
-
get_mappings
()[source]¶ New in version 0.2.1.
Return a dictionary where the keys are the mapping DBs (lowercase) and and the values are the mapping IDs for that DB
-
get_nuc_seq
(seq, reverse=False, snp=None)[source]¶ New in version 0.1.13.
Changed in version 0.1.16: added snp parameter
Returns the nucleotidic sequence that the annotation covers. if the annotation’s strand is -, and reverse is True, the reverse complement is returned.
- Parameters
- Returns
nucleotide sequence with requested transformations
- Return type
-
get_number_of_samples
(min_cov=4)[source]¶ New in version 0.1.13.
Returns the number of sample that have at least a minimum coverage of min_cov.
- Parameters
min_cov (int) – minimum coverage
- Return int
number of samples passing the filter
- Raises
AttributeNotFound – if no sample coverage attribute is found
-
is_syn
(seq, pos, change, tbl=None, abs_pos=True, start=0, strict=True)[source]¶ New in version 0.1.16.
Changed in version 0.4.4: added strict parameter
Return if a SNP is synonymous or non-synonymous.
- Parameters
seq (seq) – reference sequence of the annotation
pos (int) – position of the SNP on the reference (1-based index)
change (str) – nucleotidic change
tbl (dict) – dictionary with the translation table. Defaults to the universal genetic code
abs_pos (bool) – if True the pos is referred to the reference and not a position relative to the annotation
start (int or None) – phase to be used to get the start position of the codon. if None, the Annotation phase will be used
strict (bool) – if a variant codon is not found, a KeyError is raised, otherwise False is returned
- Returns
True if the SNP is synonymous, false if it’s non-synonymous. Behaviour in case of variant codons not found in the translation table changes based on strict
- Return type
- Raises
KeyError – if the variant codon is not found and strict is True
-
property
length
¶ Changed in version 0.2.0.
Length of the annotation, uses len(self)
-
phase
¶
-
property
region
¶ New in version 0.1.13.
Return the region covered by the annotation, to use in samtools
-
property
sample_coverage
¶ New in version 0.1.13.
Returns a dictionary with the coverage for each sample, the returned dictionary has the sample id (stripped of the _cov) suffix and as values the coverage (converted via
int()
).- Return dict
dictionary with the samples’ coverage
-
score
¶
-
set_attr
(attr, value)[source]¶ New in version 0.1.13.
Changed in version 0.4.4: a standard attribute can now be set
Generic method to set an attribute
-
set_mapping
(db, values)[source]¶ New in version 0.1.13.
Set mappings to a particular db, associated with the annotation.
- Parameters
db (str) – database to which the mappings come from
mappings (iterable) – iterable of mappings
-
source
¶
-
property
taxon_db
¶ db used for the taxon_id prediction
-
property
taxon_id
¶ Changed in version 0.3.1: if taxon_id is set to “None” as a string, it’s converted to None
taxon_id of the annotation
-
to_dict
(exclude_attr=None)[source]¶ New in version 0.3.1.
Return a dictionary representation of the Annotation.
-
to_gtf
(gene_id_attr='uid', sep=' ')[source]¶ New in version 0.1.15.
Changed in version 0.1.16: added gene_id_attr parameter
Changed in version 0.2.2: added sep argument, default to a space, now
Simple conversion to a valid GTF. gene_id and transcript_id are set to uid or the attribute specified using the gene_id_attr parameter. It’s written to be used with SNPDat.
-
to_json
()[source]¶ New in version 0.2.1.
Changed in version 0.3.1: now
Annotation.to_dict()
is usedReturns a json representation of the Annotation
-
to_mongodb
(lineage_func=None, indent=None, raw=False)[source]¶ New in version 0.2.1.
Changed in version 0.2.2: added handling of counts_ and fpkms_
Changed in version 0.2.6: added indent parameter
Changed in version 0.3.4: added raw
Returns a MongoDB document that represent the Annotation.
- Parameters
lineage (func) – function used to populate the lineage key, returns a list of taxon_id
indent (int) – the amount of indent to put in the record, None (the default) is for the most compact - one line for the record
raw (bool) – if True, the method returns a string, which is the json dump, if False, the value returned is the dictionary
- Returns
the MongoDB document, with Annotation.uid as _id, as a string if raw is True, a dictionary if it is False
- Return type
-
property
uid
¶ New in version 0.1.13.
uid of the annotation
-
-
exception
mgkit.io.gff.
AttributeNotFound
[source]¶ Bases:
Exception
Raised if an attribute is not found in a GFF file
-
exception
mgkit.io.gff.
DuplicateKeyError
[source]¶ Bases:
Exception
New in version 0.1.12.
Raised if a GFF annotation contains duplicate keys
-
class
mgkit.io.gff.
GenomicRange
(seq_id='None', start=1, end=1, strand='+')[source]¶ Bases:
object
Defines a genomic range
Changed in version 0.2.1: using __slots__ for better memory usage
-
__contains__
(pos)[source]¶ Changed in version 0.2.3: a range or a subclass are accepted
New in version 0.1.16.
Tests if the position is inside the range of the GenomicRange
Pos is 1-based as
GenomicRange.start
andGenomicRange.end
-
end
¶
-
expand_from_list
(others)[source]¶ Expand the
GenomicRange
range instance with a list ofGenomicRange
- Parameters
others (iterable) – iterable of
GenomicRange
-
get_relative_pos
(pos)[source]¶ New in version 0.1.16.
Given an absolute position (referred to the reference), convert the position to a coordinate relative to the GenomicRange
- Returns
the position relative to the GenomicRange
- Return type
- Raises
ValueError – if the position is not in the range
-
intersect
(other)[source]¶ Return an instance of
GenomicRange
that represent the intersection of the current instance and another.
-
seq_id
¶
-
start
¶
-
strand
¶
-
union
(other)[source]¶ Return the union of two
GenomicRange
-
-
mgkit.io.gff.
annotation_coverage
(annotations, seqs, strand=True)[source]¶ New in version 0.1.12.
Given a list of annotations and a dictionary where the keys are the sequence names referred in the annotations and the values are the sequences themselves, returns a number which indicated how much the sequence length is “covered” in annotations. If strand is True the coverage is strand specific.
- Parameters
annotations (iterable) – iterable of
Annotation
instancesseqs (dict) – dictionary in which the keys are the sequence names and the values are the sequences
strand (bool) – if True, the values are strand specific (the annotations) are grouped by (seq_id, strand) instead of seq_id
- Yields
tuple – the first element is the key, (seq_id, strand) if strand is True or seq_id if strand is False, and the coverage is the second value.
-
mgkit.io.gff.
annotation_coverage_sorted
(annotations, seqs, strand=True)[source]¶ New in version 0.3.1.
Given a list of annotations and a dictionary where the keys are the sequence names referred in the annotations and the values are the sequences themselves, returns a number which indicated how much the sequence length is “covered” in annotations. If strand is True the coverage is strand specific.
Note
It differs from
annotation_coverage()
because it assumes the annotations are correctly sorted and in the values yielded- Parameters
annotations (iterable) – iterable of
Annotation
instancesseqs (dict) – dictionary in which the keys are the sequence names and the values are the sequences
strand (bool) – if True, the values are strand specific (the annotations) are grouped by (seq_id, strand) instead of seq_id
- Yields
tuple – the first element is the seq_id, the second the strand (if strand is True, else it’s set to None), and the third element is the coverage.
-
mgkit.io.gff.
annotation_elongation
(ann1, annotations)[source]¶ New in version 0.1.12.
Given an
Annotation
instance and a list of the instances of the same class, returns the longest overlapping range that can be found and the annotations that are included in it.Warning
annotations are not checked for seq_id and strand
- Parameters
ann1 (Annotation) – annotation to elongate
annotations (iterable) – iterable of
Annotation
instances
- Returns
the first element is the longest range found, while the the second element is a set with the annotations used
- Return type
-
mgkit.io.gff.
convert_gff_to_gtf
(file_in, file_out, gene_id_attr='uid')[source]¶ New in version 0.1.16.
Function that uses
Annotation.to_gtf()
to convert a GFF into GTF.
-
mgkit.io.gff.
diff_gff
(files, key_func=None)[source]¶ New in version 0.1.12.
Returns a simple diff made between a list of gff files. The annotations are grouped using key_func, so it depends on it to find similar annotations.
- Parameters
files (iterable) – an iterable of file handles, pointing to GFF files
key_func (func) – function used to group annotations, defaults to this key: (x.seq_id, x.strand, x.start, x.end, x.gene_id, x.bitscore)
- Returns
the returned dictionary keys are determined by key_func and as values lists. The lists elements are tuple whose first element is the index of the file, relative to files and the second element is the line number in which the annotation is. Can be used with the
linecache
module.- Return type
-
mgkit.io.gff.
elongate_annotations
(annotations)[source]¶ New in version 0.1.12.
Given an iterable of
Annotation
instances, tries to find the all possible longest ranges and returns them.Warning
annotations are not checked for seq_id and strand
- Parameters
annotations (iterable) – iterable of
Annotation
instances- Returns
set with the all ranges found
- Return type
-
mgkit.io.gff.
extract_nuc_seqs
(annotations, seqs, name_func=<function <lambda>>, reverse=False)[source]¶ New in version 0.1.13.
Extract the nucleotidic sequences from a list of annotations. Internally uses the method
Annotation.get_nuc_seq()
.- Parameters
annotations (iterable) – iterable of
Annotation
instancesseqs (dict) – dictionary with the sequences referenced in the annotations
name_func (func) – function used to extract the sequence name to be used, defaults to the uid of the annotation
reverse (bool) – if True the annotations on the - strand are reverse complemented
- Yields
tuple – tuple whose first element is the sequence name and the second is the sequence to which the annotation refers.
-
mgkit.io.gff.
from_gff
(line, strict=True, encoding='ascii')[source]¶ New in version 0.1.12.
Changed in version 0.2.6: added strict parameter
Changed in version 0.4.0: added encoding parameter
Parse GFF line and returns an
Annotation
instance- Parameters
- Returns
instance of
Annotation
for the line- Return type
- Raises
DuplicateKeyError – if the attribute column has duplicate keys
-
mgkit.io.gff.
from_glimmer3
(header, line, feat_type='CDS')[source]¶ New in version 0.1.12.
Parses the line of a GLIMMER3 ouput and returns an instance of a GFF annotation.
- Parameters
- Returns
instance of annotation
- Return type
Example
Assuming a GLIMMER3 output like this:
>sequence0001 orf00001 66 611 +3 6.08
The code used is:
>>> header = 'sequence0001' >>> line = 'orf00001 66 611 +3 6.08' >>> from_glimmer3(header, line)
-
mgkit.io.gff.
from_hmmer
(line, aa_seqs, feat_type='gene', source='HMMER', db='CUSTOM', custom_profiles=True, noframe=False)[source]¶ New in version 0.1.15: first implementation to move old scripts to new GFF specs
Changed in version 0.2.1: removed compatibility with old scripts
Changed in version 0.2.2: taxon_id and taxon_name are not saved for non-custom profiles
Changed in version 0.3.1: added support for non mgkit-translated sequences (noframe)
Parse HMMER results (one line), it won’t parse commented lines (starting with #)
- Parameters
line (str) – HMMER domain table line
aa_seqs (dict) – dictionary with amino-acid sequences (name->seq), used to get the correct nucleotide positions
feat_type (str) – string to be used in the ‘feature type’ column
source (str) – string to be used in the ‘source’ column
custom_profiles (bool) – if True, the profile name contains gene, taxonomy and reviewed information in the form KOID_TAXONID_TAXON-NAME(-nr)
noframe (bool) – if True, the sequence is assumed to be in frame f0
- Returns
A
Annotation
instance
Note
if custom_profiles is False, gene_id, taxon_id and taxon_name will be equal to the profile name
-
mgkit.io.gff.
from_json
(line)[source]¶ New in version 0.2.1.
Returns an Annotation from a json representation
-
mgkit.io.gff.
from_mongodb
(record, lineage=True)[source]¶ New in version 0.2.1.
Changed in version 0.2.2: added handling of counts_ and fpkms_
Changed in version 0.2.6: better handling of missing attributes and added lineage parameter
Returns a
Annotation
instance from a MongoDB record (created) usingAnnotation.to_mongodb()
. The actual record returned by pymongo is a dictionary that is copied, manipulated and passed to theAnnotation.__init__()
.- Parameters
- Returns
instance of
Annotation
object- Return type
-
mgkit.io.gff.
from_nuc_blast
(hit, db, feat_type='CDS', seq_len=None, to_nuc=False, **kwd)[source]¶ New in version 0.1.12.
Changed in version 0.1.16: added to_nuc parameter
Changed in version 0.2.3: removed to_nuc, the hit can include the subject end/start and evalue
Returns an instance of
Annotation
- Parameters
hit (tuple) – a BLAST hit, from
mgkit.io.blast.parse_blast_tab()
db (str) – db used with BLAST
- Keyword Arguments
- Returns
instance of
Annotation
- Return type
-
mgkit.io.gff.
from_prodigal_frag
(main_gff, blast_gff, attr='ID', split_func=None)[source]¶ Changed in version 0.3.3: fixed a bug for the strand, also the code is tested
New in version 0.2.6: experimental
Reads the GFF given in output by PRODIGAL and the resulting GFF from using BLAST (or other software) on the aa or nucleotide file output by PRODIGAL.
It then integrates the two outputs, so to the PRODIGAL GFF is added the information from the the output of the gene prediction software used.
- Parameters
main_gff (file) – GFF file from PRODIGAL
blast_gff (file) – GFF with the returned annotations
attr (str) – attribute in the PRODIGAL GFF that is used to identify an annotation
split_func (func) – function to rename the headers from the predicted sequences back to their parent sequence
- Yields
annotation – annotation for each blast_gff back translated
-
mgkit.io.gff.
from_sequence
(name, seq, feat_type='SEQUENCE', **kwd)[source]¶ New in version 0.1.12.
Returns an instance of
Annotation
for the full length of a sequence- Parameters
- Keyword Arguments
feat_type (str) – feature type in the GFF
**kwd – any additional column
- Returns
instance of
Annotation
- Return type
-
mgkit.io.gff.
get_annotation_map
(annotations, key_func, value_func)[source]¶ New in version 0.1.15.
Applies two functions to an iterable of annotations with an iterator returned with the applied functions. Useful to build a dictionary
- Parameters
annotations (iterable) – iterable of annotations
key_func (func) – function that accept an annotation as argument and returns one value, the first of the returned tuple
value_func (func) – function that accept an annotation as argument and returns one value, the second of the returned tuple
- Yields
tuple – a tuple where the first value is the result of key_func on the passed annotation and the second is the value returned by value_func on the same annotation
-
mgkit.io.gff.
group_annotations
(annotations, key_func=<function <lambda>>)[source]¶ New in version 0.1.12.
Group
Annotation
instances in a dictionary by using a key function that returns the key to be used in the dictionary.- Parameters
annotations (iterable) – iterable with
Annotation
instanceskey_func (func) – function used to extract the key used in the dictionary, defaults to a function that returns (ann.seq_id, ann.strand)
- Returns
dictionary whose keys are returned by key_func and the values are lists of annotations
- Return type
Example
>>> ann = [Annotation(seq_id='seq1', strand='+', start=10, end=15), ... Annotation(seq_id='seq1', strand='+', start=1, end=5), ... Annotation(seq_id='seq1', strand='-', start=30, end=100)] >>> group_annotations(ann) {('seq1', '+'): [seq1(+):10-15, seq1(+):1-5], ('seq1', '-'): [seq1(-):30-100]}
-
mgkit.io.gff.
group_annotations_by_ancestor
(annotations, ancestors, taxonomy)[source]¶ New in version 0.1.13.
Group annotations by the ancestors provided.
- Parameters
annotations (iterable) – annotations to group
ancestors (iterable) – list of ancestors accepted
taxonomy – taxonomy class
- Returns
grouped annotations
- Return type
-
mgkit.io.gff.
group_annotations_sorted
(annotations, key_func=<function <lambda>>)[source]¶ New in version 0.1.13.
Group
Annotation
instances by using a key function that returns a key. Assumes that the annotations are already sorted to return an iterator and save memory. One way to sort them is using: sort -s -k 1,1 -k 7,7 on the file.- Parameters
annotations (iterable) – iterable with
Annotation
instanceskey_func (func) – function used to extract the key used in the dictionary, defaults to a function that returns (ann.seq_id, ann.strand)
- Yields
list – a list of the grouped annotations by key_func values
-
mgkit.io.gff.
load_gff_base_info
(files, taxonomy=None, exclude_ids=None, include_taxa=None, encoding='ascii')[source]¶ This function is useful if the number of annotations in a GFF is high or there are memory constraints on the system. It returns a dictionary that can be used with functions like
mgkit.counts.func.load_sample_counts()
.- Parameters
files (iterable, str) – file name or list of paths of GFF files
taxonomy – taxonomy pickle file, needed if include_taxa is not None
exclude_ids (set, list) – a list of gene_id to exclude from the dictionary
include_taxa (int, iterable) – a taxon_id or list thereof to be passed to
mgkit.taxon.taxonomy.is_ancestor()
, so only the taxa that have the those taxon_id(s) as ancestor(s) are keptencoding (str) – passed to
parse_gff()
- Returns
dictionary where the key is
Annotation.uid
and the value is a tuple (Annotation.gene_id
,Annotation.taxon_id
)- Return type
-
mgkit.io.gff.
load_gff_mappings
(files, map_db, taxonomy=None, exclude_ids=None, include_taxa=None, encoding='ascii')[source]¶ This function is useful if the number of annotations in a GFF is high or there are memory constraints on the system. It returns a dictionary that can be used with functions like
mgkit.counts.func.load_sample_counts()
.- Parameters
files (iterable, str) – file name or list of paths of GFF files
map_db (str) – any kind mapping in the GFF, as passed to
Annotation.get_mapping()
taxonomy – taxonomy pickle file, needed if include_taxa is not None
exclude_ids (set, list) – a list of gene_id to exclude from the dictionary
include_taxa (int, iterable) – a taxon_id or list thereof to be passed to
mgkit.taxon.taxonomy.is_ancestor()
, so only the taxa that have the those taxon_id(s) as ancestor(s) are keptencoding (str) – passed to
parse_gff()
- Returns
dictionary where the key is
Annotation.gene_id
and the value is a list of mappings, as returned byAnnotation.get_mapping()
- Return type
-
mgkit.io.gff.
parse_gff
(file_handle, gff_type=<function from_gff>, strict=False, encoding='ascii', filter_func=None)[source]¶ Changed in version 0.5.7: added filter_func
Changed in version 0.4.0: In some cases ASCII decoding is not enough, so it is parametrised now
Changed in version 0.3.4: added decoding from binary for compatibility with Python3
Changed in version 0.2.6: added strict parameter
Changed in version 0.2.3: correctly handling of GFF with comments of appended sequences
Changed in version 0.1.12: added gff_type parameter
Parse a GFF file and returns generator of
GFFKegg
instancesAccepts a file handle or a string with the file name
- Parameters
file_handle (str, file) – file name or file handle to read from
gff_type (class) – class/function used to parse a GFF annotation
strict (bool) – if True duplicate keys raise an exception
encoding (str) – encoding of the file, if ascii fails, use utf8
file_func (func) – a function used to filter, accepts an annotation as argument and returns True if the annotation is to be kept
- Yields
Annotation – an iterator of
Annotation
instances
-
mgkit.io.gff.
parse_gff_files
(files, strict=True)[source]¶ New in version 0.1.15.
Changed in version 0.2.6: added strict parameter
Function that returns an iterator of annotations from multiple GFF files.
- Parameters
- Yields
Annotation
– iterator of annotations
-
mgkit.io.gff.
split_gff_file
(file_handle, name_mask, num_files=2, encoding='ascii')[source]¶ New in version 0.1.14.
Changed in version 0.2.6: now accept a file object as sole input
Changed in version 0.4.0: added encoding parameter
Splits a GFF, or a list of them, into a number of files. It is assured that annotations for the same sequence are kept in the same file, which is useful for cases like filtering, even when the annotations are from different GFF files.
Internally, a structure is kept to check if a sequence ID is already been stored to a file, in which case the annotation is written to that file, otherwise a random file handles (among the open ones) is chosen.
- Parameters
Example
>>> import glob >>> files = glob.glob('*.gff') >>> name_mask = 'split-file-{0}.gff' >>> split_gff_file(files, name_mask, 5)
-
mgkit.io.gff.
write_gff
(annotations, file_handle, verbose=True)[source]¶ Changed in version 0.1.12: added verbose argument
Write a GFF to file
- Parameters
annotations (iterable) – iterable that returns
GFFKegg
orAnnotation
instancesfile_handle (str, file) – file name or file handle to write to
verbose (bool) – if True, a message is logged