mgkit.io.gff module

This modules define classes and function related to manipulation of GFF/GTF files.

class mgkit.io.gff.Annotation(seq_id='None', start=1, end=1, strand='+', source='None', feat_type='None', score=0.0, phase=0, uid=None, **kwd)[source]

Bases: mgkit.io.gff.GenomicRange

New in version 0.1.12.

Changed in version 0.2.1: using __slots__ for better memory usage

Alternative implementation for an Annotation. When initialised, If uid is None, a unique id is added using uuid.uuid4.

add_exp_syn_count(seq, syn_matrix=None)[source]

New in version 0.1.13.

Adds expected synonymous/non-synonymous values for an annotation.

Parameters

seq (str) – sequence corresponding to the annotation seq_id syn_matrix (None, dict): matrix that determines the return values. Defaults to the one defined in the called function mgkit.utils.sequnce.get_seq_expected_syn_count().

add_gc_content(seq)[source]

Adds GC content information for an annotation. The formula is:

(1)\[\frac {(G + C)}{(G + C + A + T)}\]

Modifies the instances of the annotation. gc_ratio will be added to its attributes.

Parameters

seq (str) – nucleotide sequence referred in the GFF

add_gc_ratio(seq)[source]

Adds GC content information for an annotation. The formula is:

(2)\[\frac {(A + T)}{(G + C)}\]

Modifies the instances of the annotation. gc_ratio will be added to its attributes.

Parameters

seq (str) – nucleotide sequence referred in the GFF

attr
property bitscore

bitscore of the annotation

property counts

New in version 0.2.2.

Returns the sample counts for the annotation

property coverage

New in version 0.1.13.

Return the total coverage for the annotation

Return float

coverage

Raises

AttributeNotFound – if no coverage attribute is found

property db

db used for the gene_id prediction

property dbq

db quality of the annotation

del_attr(attr)[source]

New in version 0.4.4.

Removes attributes from the Annotation

property exp_nonsyn

New in version 0.1.13.

Returns the expected number of non-synonymous changes

property exp_syn

New in version 0.1.13.

Returns the expected number of synonymous changes

feat_type
property fpkms

New in version 0.2.2.

Returns the sample fpkms for the annotation

property gene_id

gene_id of the annotation, or ko if available

get_aa_seq(seq, start=0, tbl=None, snp=None)[source]

New in version 0.1.16.

Returns a translated aminoacid sequence of the annotation. The snp parameter is passed to Annotation.get_nuc_seq()

Parameters
  • seq (seq) – chromosome/contig sequence

  • start (int) – position (0-based) from where the correct occurs (frame). If None, the phase attribute is used

  • tbl (dict) – dictionary with the translation for each codon, passed to mgkit.utils.sequence.translate_sequence()

  • snp (tuple) – first element is the position of the SNP and the second element is the change

Returns

aminoacid sequence

Return type

str

get_attr(attr, conv=<class 'str'>)[source]

Changed in version 0.3.4: any GFF attribute can be returned

Changed in version 0.3.3: added seq_id as special attribute, in addition do length

New in version 0.1.13.

Generic method to get an attribute and convert it to a specific datatype. The order for the lookup is:

  • length

  • self.attr (dictionary)

  • getattr(self) of the first 8 columns of a GFF (seq_id, source, …)

get_ec(level=4)[source]

New in version 0.1.13.

Changed in version 0.2.0: returns a set instead of a list

Returns the EC values associated with the annotation, cutting them at the desired level.

Parameters

level (int) – level of classification desired (between 1 and 4)

Returns

list of all EC numbers associated, at the desired level, if none are found an empty set is returned

Return type

set

get_fc(description=False)[source]

New in version 0.5.7.

Parameters

description (bool) – If True, instead of the one letter category, the description is returned

Returns

list of functional categories

Return type

list

get_mapping(db)[source]

New in version 0.1.13.

Returns the mappings, to a particular db, associated with the annotation.

Parameters

db (str) – database to which the mappings come from

Returns

list of all mappings associated, to the specified db, if none are found an empty list is returned

Return type

list

get_mappings()[source]

New in version 0.2.1.

Return a dictionary where the keys are the mapping DBs (lowercase) and and the values are the mapping IDs for that DB

get_nuc_seq(seq, reverse=False, snp=None)[source]

New in version 0.1.13.

Changed in version 0.1.16: added snp parameter

Returns the nucleotidic sequence that the annotation covers. if the annotation’s strand is -, and reverse is True, the reverse complement is returned.

Parameters
  • seq (seq) – chromosome/contig sequence

  • reverse (bool) – if True and the strand is ‘-‘, a reverse complement is returned

  • snp (tuple) – first element is the position of the SNP relative to the Annotation and the second element is the change

Returns

nucleotide sequence with requested transformations

Return type

str

get_number_of_samples(min_cov=4)[source]

New in version 0.1.13.

Returns the number of sample that have at least a minimum coverage of min_cov.

Parameters

min_cov (int) – minimum coverage

Return int

number of samples passing the filter

Raises

AttributeNotFound – if no sample coverage attribute is found

has_attr(attr)[source]

New in version 0.4.4.

Tests if an attributes is present in the Annotation

Parameters

attr (str) – attribute to test

Returns

True if the attributes is present

Return type

bool

is_syn(seq, pos, change, tbl=None, abs_pos=True, start=0, strict=True)[source]

New in version 0.1.16.

Changed in version 0.4.4: added strict parameter

Return if a SNP is synonymous or non-synonymous.

Parameters
  • seq (seq) – reference sequence of the annotation

  • pos (int) – position of the SNP on the reference (1-based index)

  • change (str) – nucleotidic change

  • tbl (dict) – dictionary with the translation table. Defaults to the universal genetic code

  • abs_pos (bool) – if True the pos is referred to the reference and not a position relative to the annotation

  • start (int or None) – phase to be used to get the start position of the codon. if None, the Annotation phase will be used

  • strict (bool) – if a variant codon is not found, a KeyError is raised, otherwise False is returned

Returns

True if the SNP is synonymous, false if it’s non-synonymous. Behaviour in case of variant codons not found in the translation table changes based on strict

Return type

bool

Raises

KeyError – if the variant codon is not found and strict is True

property length

Changed in version 0.2.0.

Length of the annotation, uses len(self)

phase
property region

New in version 0.1.13.

Return the region covered by the annotation, to use in samtools

property sample_coverage

New in version 0.1.13.

Returns a dictionary with the coverage for each sample, the returned dictionary has the sample id (stripped of the _cov) suffix and as values the coverage (converted via int()).

Return dict

dictionary with the samples’ coverage

score
set_attr(attr, value)[source]

New in version 0.1.13.

Changed in version 0.4.4: a standard attribute can now be set

Generic method to set an attribute

set_mapping(db, values)[source]

New in version 0.1.13.

Set mappings to a particular db, associated with the annotation.

Parameters
  • db (str) – database to which the mappings come from

  • mappings (iterable) – iterable of mappings

source
property taxon_db

db used for the taxon_id prediction

property taxon_id

Changed in version 0.3.1: if taxon_id is set to “None” as a string, it’s converted to None

taxon_id of the annotation

to_dict(exclude_attr=None)[source]

New in version 0.3.1.

Return a dictionary representation of the Annotation.

Parameters

exclude_attr (str,list) – attributes to exclude from the dictionary, can be either a single attribute (string) or a list of strings

Returns

dictionary with the annotation

Return type

dict

to_file(file_handle)[source]

Writes the GFF annotation to file_handle

to_gff(sep='=')[source]

Format the Annotation as a GFF string.

Parameters

sep (str) – separator key -> value

Returns

annotation formatted as GFF

Return type

str

to_gtf(gene_id_attr='uid', sep=' ')[source]

New in version 0.1.15.

Changed in version 0.1.16: added gene_id_attr parameter

Changed in version 0.2.2: added sep argument, default to a space, now

Simple conversion to a valid GTF. gene_id and transcript_id are set to uid or the attribute specified using the gene_id_attr parameter. It’s written to be used with SNPDat.

to_json()[source]

New in version 0.2.1.

Changed in version 0.3.1: now Annotation.to_dict() is used

Returns a json representation of the Annotation

to_mongodb(lineage_func=None, indent=None, raw=False)[source]

New in version 0.2.1.

Changed in version 0.2.2: added handling of counts_ and fpkms_

Changed in version 0.2.6: added indent parameter

Changed in version 0.3.4: added raw

Returns a MongoDB document that represent the Annotation.

Parameters
  • lineage (func) – function used to populate the lineage key, returns a list of taxon_id

  • indent (int) – the amount of indent to put in the record, None (the default) is for the most compact - one line for the record

  • raw (bool) – if True, the method returns a string, which is the json dump, if False, the value returned is the dictionary

Returns

the MongoDB document, with Annotation.uid as _id, as a string if raw is True, a dictionary if it is False

Return type

str or dict

property uid

New in version 0.1.13.

uid of the annotation

exception mgkit.io.gff.AttributeNotFound[source]

Bases: Exception

Raised if an attribute is not found in a GFF file

exception mgkit.io.gff.DuplicateKeyError[source]

Bases: Exception

New in version 0.1.12.

Raised if a GFF annotation contains duplicate keys

class mgkit.io.gff.GenomicRange(seq_id='None', start=1, end=1, strand='+')[source]

Bases: object

Defines a genomic range

Changed in version 0.2.1: using __slots__ for better memory usage

__contains__(pos)[source]

Changed in version 0.2.3: a range or a subclass are accepted

New in version 0.1.16.

Tests if the position is inside the range of the GenomicRange

Pos is 1-based as GenomicRange.start and GenomicRange.end

end
expand_from_list(others)[source]

Expand the GenomicRange range instance with a list of GenomicRange

Parameters

others (iterable) – iterable of GenomicRange

get_range()[source]

New in version 0.1.13.

Returns the start and end position as a tuple

get_relative_pos(pos)[source]

New in version 0.1.16.

Given an absolute position (referred to the reference), convert the position to a coordinate relative to the GenomicRange

Returns

the position relative to the GenomicRange

Return type

int

Raises

ValueError – if the position is not in the range

intersect(other)[source]

Return an instance of GenomicRange that represent the intersection of the current instance and another.

seq_id
start
strand
union(other)[source]

Return the union of two GenomicRange

mgkit.io.gff.annotate_sequence(name, seq, window=None)[source]
mgkit.io.gff.annotation_coverage(annotations, seqs, strand=True)[source]

New in version 0.1.12.

Given a list of annotations and a dictionary where the keys are the sequence names referred in the annotations and the values are the sequences themselves, returns a number which indicated how much the sequence length is “covered” in annotations. If strand is True the coverage is strand specific.

Parameters
  • annotations (iterable) – iterable of Annotation instances

  • seqs (dict) – dictionary in which the keys are the sequence names and the values are the sequences

  • strand (bool) – if True, the values are strand specific (the annotations) are grouped by (seq_id, strand) instead of seq_id

Yields

tuple – the first element is the key, (seq_id, strand) if strand is True or seq_id if strand is False, and the coverage is the second value.

mgkit.io.gff.annotation_coverage_sorted(annotations, seqs, strand=True)[source]

New in version 0.3.1.

Given a list of annotations and a dictionary where the keys are the sequence names referred in the annotations and the values are the sequences themselves, returns a number which indicated how much the sequence length is “covered” in annotations. If strand is True the coverage is strand specific.

Note

It differs from annotation_coverage() because it assumes the annotations are correctly sorted and in the values yielded

Parameters
  • annotations (iterable) – iterable of Annotation instances

  • seqs (dict) – dictionary in which the keys are the sequence names and the values are the sequences

  • strand (bool) – if True, the values are strand specific (the annotations) are grouped by (seq_id, strand) instead of seq_id

Yields

tuple – the first element is the seq_id, the second the strand (if strand is True, else it’s set to None), and the third element is the coverage.

mgkit.io.gff.annotation_elongation(ann1, annotations)[source]

New in version 0.1.12.

Given an Annotation instance and a list of the instances of the same class, returns the longest overlapping range that can be found and the annotations that are included in it.

Warning

annotations are not checked for seq_id and strand

Parameters
  • ann1 (Annotation) – annotation to elongate

  • annotations (iterable) – iterable of Annotation instances

Returns

the first element is the longest range found, while the the second element is a set with the annotations used

Return type

tuple

mgkit.io.gff.convert_gff_to_gtf(file_in, file_out, gene_id_attr='uid')[source]

New in version 0.1.16.

Function that uses Annotation.to_gtf() to convert a GFF into GTF.

Parameters
  • file_in (str, file) – either file name or file handle of a GFF file

  • file_out (str) – file name to which write the converted annotations

mgkit.io.gff.diff_gff(files, key_func=None)[source]

New in version 0.1.12.

Returns a simple diff made between a list of gff files. The annotations are grouped using key_func, so it depends on it to find similar annotations.

Parameters
  • files (iterable) – an iterable of file handles, pointing to GFF files

  • key_func (func) – function used to group annotations, defaults to this key: (x.seq_id, x.strand, x.start, x.end, x.gene_id, x.bitscore)

Returns

the returned dictionary keys are determined by key_func and as values lists. The lists elements are tuple whose first element is the index of the file, relative to files and the second element is the line number in which the annotation is. Can be used with the linecache module.

Return type

dict

mgkit.io.gff.elongate_annotations(annotations)[source]

New in version 0.1.12.

Given an iterable of Annotation instances, tries to find the all possible longest ranges and returns them.

Warning

annotations are not checked for seq_id and strand

Parameters

annotations (iterable) – iterable of Annotation instances

Returns

set with the all ranges found

Return type

set

mgkit.io.gff.extract_nuc_seqs(annotations, seqs, name_func=<function <lambda>>, reverse=False)[source]

New in version 0.1.13.

Extract the nucleotidic sequences from a list of annotations. Internally uses the method Annotation.get_nuc_seq().

Parameters
  • annotations (iterable) – iterable of Annotation instances

  • seqs (dict) – dictionary with the sequences referenced in the annotations

  • name_func (func) – function used to extract the sequence name to be used, defaults to the uid of the annotation

  • reverse (bool) – if True the annotations on the - strand are reverse complemented

Yields

tuple – tuple whose first element is the sequence name and the second is the sequence to which the annotation refers.

mgkit.io.gff.from_aa_blast_frag(hit, parent_ann, aa_seqs)[source]
mgkit.io.gff.from_gff(line, strict=True, encoding='ascii')[source]

New in version 0.1.12.

Changed in version 0.2.6: added strict parameter

Changed in version 0.4.0: added encoding parameter

Parse GFF line and returns an Annotation instance

Parameters
  • line (str) – GFF line

  • strict (bool) – if True duplicate keys raise an exception

Returns

instance of Annotation for the line

Return type

Annotation

Raises

DuplicateKeyError – if the attribute column has duplicate keys

mgkit.io.gff.from_glimmer3(header, line, feat_type='CDS')[source]

New in version 0.1.12.

Parses the line of a GLIMMER3 ouput and returns an instance of a GFF annotation.

Parameters
  • header (str) – the seq_id to which the ORF belongs

  • line (str) – the prediction line for the orf

  • feat_type (str) – the feature type to use

Returns

instance of annotation

Return type

Annotation

Example

Assuming a GLIMMER3 output like this:

>sequence0001
orf00001       66      611  +3     6.08

The code used is:

>>> header = 'sequence0001'
>>> line = 'orf00001       66      611  +3     6.08'
>>> from_glimmer3(header, line)
mgkit.io.gff.from_hmmer(line, aa_seqs, feat_type='gene', source='HMMER', db='CUSTOM', custom_profiles=True, noframe=False)[source]

New in version 0.1.15: first implementation to move old scripts to new GFF specs

Changed in version 0.2.1: removed compatibility with old scripts

Changed in version 0.2.2: taxon_id and taxon_name are not saved for non-custom profiles

Changed in version 0.3.1: added support for non mgkit-translated sequences (noframe)

Parse HMMER results (one line), it won’t parse commented lines (starting with #)

Parameters
  • line (str) – HMMER domain table line

  • aa_seqs (dict) – dictionary with amino-acid sequences (name->seq), used to get the correct nucleotide positions

  • feat_type (str) – string to be used in the ‘feature type’ column

  • source (str) – string to be used in the ‘source’ column

  • custom_profiles (bool) – if True, the profile name contains gene, taxonomy and reviewed information in the form KOID_TAXONID_TAXON-NAME(-nr)

  • noframe (bool) – if True, the sequence is assumed to be in frame f0

Returns

A Annotation instance

Note

if custom_profiles is False, gene_id, taxon_id and taxon_name will be equal to the profile name

mgkit.io.gff.from_json(line)[source]

New in version 0.2.1.

Returns an Annotation from a json representation

mgkit.io.gff.from_mongodb(record, lineage=True)[source]

New in version 0.2.1.

Changed in version 0.2.2: added handling of counts_ and fpkms_

Changed in version 0.2.6: better handling of missing attributes and added lineage parameter

Returns a Annotation instance from a MongoDB record (created) using Annotation.to_mongodb(). The actual record returned by pymongo is a dictionary that is copied, manipulated and passed to the Annotation.__init__().

Parameters
  • record (dict) – a dictionary with the full record from a MongoDB query

  • lineage (bool) – indicates if the lineage information in the record should be kept in the annotation

Returns

instance of Annotation object

Return type

Annotation

mgkit.io.gff.from_nuc_blast(hit, db, feat_type='CDS', seq_len=None, to_nuc=False, **kwd)[source]

New in version 0.1.12.

Changed in version 0.1.16: added to_nuc parameter

Changed in version 0.2.3: removed to_nuc, the hit can include the subject end/start and evalue

Returns an instance of Annotation

Parameters
Keyword Arguments
  • feat_type (str) – feature type in the GFF

  • seq_len (int) – sequence length, if supplied, the phase for strand ‘-‘ can be assigned, otherwise is assigned a 0 and the sequence coverage are added

  • **kwd – any additional column

Returns

instance of Annotation

Return type

Annotation

mgkit.io.gff.from_nuc_blast_frag(hit, parent_ann, db='NCBI-NT')[source]
mgkit.io.gff.from_prodigal_frag(main_gff, blast_gff, attr='ID', split_func=None)[source]

Changed in version 0.3.3: fixed a bug for the strand, also the code is tested

New in version 0.2.6: experimental

Reads the GFF given in output by PRODIGAL and the resulting GFF from using BLAST (or other software) on the aa or nucleotide file output by PRODIGAL.

It then integrates the two outputs, so to the PRODIGAL GFF is added the information from the the output of the gene prediction software used.

Parameters
  • main_gff (file) – GFF file from PRODIGAL

  • blast_gff (file) – GFF with the returned annotations

  • attr (str) – attribute in the PRODIGAL GFF that is used to identify an annotation

  • split_func (func) – function to rename the headers from the predicted sequences back to their parent sequence

Yields

annotation – annotation for each blast_gff back translated

mgkit.io.gff.from_sequence(name, seq, feat_type='SEQUENCE', **kwd)[source]

New in version 0.1.12.

Returns an instance of Annotation for the full length of a sequence

Parameters
  • name (str) – name of the sequence

  • seq (str) – sequence, to get the length of the annotation

Keyword Arguments
  • feat_type (str) – feature type in the GFF

  • **kwd – any additional column

Returns

instance of Annotation

Return type

Annotation

mgkit.io.gff.get_annotation_map(annotations, key_func, value_func)[source]

New in version 0.1.15.

Applies two functions to an iterable of annotations with an iterator returned with the applied functions. Useful to build a dictionary

Parameters
  • annotations (iterable) – iterable of annotations

  • key_func (func) – function that accept an annotation as argument and returns one value, the first of the returned tuple

  • value_func (func) – function that accept an annotation as argument and returns one value, the second of the returned tuple

Yields

tuple – a tuple where the first value is the result of key_func on the passed annotation and the second is the value returned by value_func on the same annotation

mgkit.io.gff.group_annotations(annotations, key_func=<function <lambda>>)[source]

New in version 0.1.12.

Group Annotation instances in a dictionary by using a key function that returns the key to be used in the dictionary.

Parameters
  • annotations (iterable) – iterable with Annotation instances

  • key_func (func) – function used to extract the key used in the dictionary, defaults to a function that returns (ann.seq_id, ann.strand)

Returns

dictionary whose keys are returned by key_func and the values are lists of annotations

Return type

dict

Example

>>> ann = [Annotation(seq_id='seq1', strand='+', start=10, end=15),
... Annotation(seq_id='seq1', strand='+', start=1, end=5),
... Annotation(seq_id='seq1', strand='-', start=30, end=100)]
>>> group_annotations(ann)
{('seq1', '+'): [seq1(+):10-15, seq1(+):1-5], ('seq1', '-'): [seq1(-):30-100]}
mgkit.io.gff.group_annotations_by_ancestor(annotations, ancestors, taxonomy)[source]

New in version 0.1.13.

Group annotations by the ancestors provided.

Parameters
  • annotations (iterable) – annotations to group

  • ancestors (iterable) – list of ancestors accepted

  • taxonomy – taxonomy class

Returns

grouped annotations

Return type

dict

mgkit.io.gff.group_annotations_sorted(annotations, key_func=<function <lambda>>)[source]

New in version 0.1.13.

Group Annotation instances by using a key function that returns a key. Assumes that the annotations are already sorted to return an iterator and save memory. One way to sort them is using: sort -s -k 1,1 -k 7,7 on the file.

Parameters
  • annotations (iterable) – iterable with Annotation instances

  • key_func (func) – function used to extract the key used in the dictionary, defaults to a function that returns (ann.seq_id, ann.strand)

Yields

list – a list of the grouped annotations by key_func values

mgkit.io.gff.load_gff_base_info(files, taxonomy=None, exclude_ids=None, include_taxa=None, encoding='ascii')[source]

This function is useful if the number of annotations in a GFF is high or there are memory constraints on the system. It returns a dictionary that can be used with functions like mgkit.counts.func.load_sample_counts().

Parameters
  • files (iterable, str) – file name or list of paths of GFF files

  • taxonomy – taxonomy pickle file, needed if include_taxa is not None

  • exclude_ids (set, list) – a list of gene_id to exclude from the dictionary

  • include_taxa (int, iterable) – a taxon_id or list thereof to be passed to mgkit.taxon.taxonomy.is_ancestor(), so only the taxa that have the those taxon_id(s) as ancestor(s) are kept

  • encoding (str) – passed to parse_gff()

Returns

dictionary where the key is Annotation.uid and the value is a tuple (Annotation.gene_id, Annotation.taxon_id)

Return type

dict

mgkit.io.gff.load_gff_mappings(files, map_db, taxonomy=None, exclude_ids=None, include_taxa=None, encoding='ascii')[source]

This function is useful if the number of annotations in a GFF is high or there are memory constraints on the system. It returns a dictionary that can be used with functions like mgkit.counts.func.load_sample_counts().

Parameters
  • files (iterable, str) – file name or list of paths of GFF files

  • map_db (str) – any kind mapping in the GFF, as passed to Annotation.get_mapping()

  • taxonomy – taxonomy pickle file, needed if include_taxa is not None

  • exclude_ids (set, list) – a list of gene_id to exclude from the dictionary

  • include_taxa (int, iterable) – a taxon_id or list thereof to be passed to mgkit.taxon.taxonomy.is_ancestor(), so only the taxa that have the those taxon_id(s) as ancestor(s) are kept

  • encoding (str) – passed to parse_gff()

Returns

dictionary where the key is Annotation.gene_id and the value is a list of mappings, as returned by Annotation.get_mapping()

Return type

dict

mgkit.io.gff.parse_gff(file_handle, gff_type=<function from_gff>, strict=False, encoding='ascii', filter_func=None)[source]

Changed in version 0.5.7: added filter_func

Changed in version 0.4.0: In some cases ASCII decoding is not enough, so it is parametrised now

Changed in version 0.3.4: added decoding from binary for compatibility with Python3

Changed in version 0.2.6: added strict parameter

Changed in version 0.2.3: correctly handling of GFF with comments of appended sequences

Changed in version 0.1.12: added gff_type parameter

Parse a GFF file and returns generator of GFFKegg instances

Accepts a file handle or a string with the file name

Parameters
  • file_handle (str, file) – file name or file handle to read from

  • gff_type (class) – class/function used to parse a GFF annotation

  • strict (bool) – if True duplicate keys raise an exception

  • encoding (str) – encoding of the file, if ascii fails, use utf8

  • file_func (func) – a function used to filter, accepts an annotation as argument and returns True if the annotation is to be kept

Yields

Annotation – an iterator of Annotation instances

mgkit.io.gff.parse_gff_files(files, strict=True)[source]

New in version 0.1.15.

Changed in version 0.2.6: added strict parameter

Function that returns an iterator of annotations from multiple GFF files.

Parameters
  • files (iterable, str) – iterable of file names of GFF files, or a single file name

  • strict (bool) – if True duplicate keys raise an exception

Yields

Annotation – iterator of annotations

mgkit.io.gff.split_gff_file(file_handle, name_mask, num_files=2, encoding='ascii')[source]

New in version 0.1.14.

Changed in version 0.2.6: now accept a file object as sole input

Changed in version 0.4.0: added encoding parameter

Splits a GFF, or a list of them, into a number of files. It is assured that annotations for the same sequence are kept in the same file, which is useful for cases like filtering, even when the annotations are from different GFF files.

Internally, a structure is kept to check if a sequence ID is already been stored to a file, in which case the annotation is written to that file, otherwise a random file handles (among the open ones) is chosen.

Parameters
  • file_handle (str, list) – a single or list of file handles (or file names), from which the GFF annotations are read

  • name_mask (str) – a string used as template for the output file names on which the function applies string.format()

  • num_files (int) – the number of files to split the records

Example

>>> import glob
>>> files = glob.glob('*.gff')
>>> name_mask = 'split-file-{0}.gff'
>>> split_gff_file(files, name_mask, 5)
mgkit.io.gff.write_gff(annotations, file_handle, verbose=True)[source]

Changed in version 0.1.12: added verbose argument

Write a GFF to file

Parameters
  • annotations (iterable) – iterable that returns GFFKegg or Annotation instances

  • file_handle (str, file) – file name or file handle to write to

  • verbose (bool) – if True, a message is logged