mgkit.taxon module¶

This module gives access to Uniprot taxonomy data. It also defines classes to filter, order and group data by taxa.

exception mgkit.taxon.NoLcaFound[source]¶

Bases: Exception

New in version 0.1.13.

Raised if no lowest common ancestor can be found in the taxonomy

mgkit.taxon.TAXON_RANKS = ('superkingdom', 'kingdom', 'phylum', 'class', 'subclass', 'order', 'family', 'genus', 'species')¶: Taxonomy ranks included in the pickled data

mgkit.taxon.TAXON_ROOTS = ('archaea', 'bacteria', 'fungi', 'metazoa', 'environmental samples', 'viruses', 'viroidseukaryota', 'other sequences', 'unidentified', 'alveolata', 'amoebozoa', 'apusozoa', 'breviatea', 'centroheliozoa', 'choanoflagellida', 'diplomonadida', 'euglenozoa', 'formicata', 'heterolobosea', 'jakobida', 'malawimonadidae', 'oxymonadida', 'parabasalia', 'rhizaria', 'streptophyta', 'haptophyceae', 'chlorophyta', 'stramenopiles', 'cryptophyta', 'rhodophyta')¶: Root taxa used in analysis and filtering

mgkit.taxon.TaxonTuple¶

A representation of a Uniprot Taxon

alias of mgkit.taxon.UniprotTaxonTuple

class mgkit.taxon.Taxonomy(fname=None)[source]¶

Bases: object

Class that contains the whole Uniprot taxonomy. Defines some methods to ease access to the taxonomy. Follows the conventions of NCBI Taxonomy.

Defines:

methods to load taxonomy from a pickle file or a generic file handle
can be iterated over, returning a generator of its UniprotTaxon instances
can be used as a dictionary, in which the key is a taxon_id and the value is its UniprotTaxon instance

__contains__(taxon)[source]¶

Returns True if the taxon is in the taxonomy

Accepts an int (check for taxon_id) or an instance of UniprotTaxon

__getitem__(taxon_id)[source]¶: Defines dictionary behavior. Key is a taxon_id, the returned value is a UniprotTaxon instance

__iter__()[source]¶: Defines iterable behavior. Returns a generator for UniprotTaxon instances

__len__()[source]¶: Returns the number of taxa contained

__repr__()[source]¶: New in version 0.2.5.

add_lineage(**lineage)[source]¶

New in version 0.3.1.

Adds a lineage to the taxonomy. It is passed as keyword arguments, where each key is one of the ranks in TAXON_RANKS and the value is the scientific name. Trailing underscores ‘_’ are stripped from the rank name, for cases such as class where the key is a reserved word in Python. One extra node can also be added, such as strain/cultivar/subspecies and so on, but only one is expected to be passed.

Parameters

lineage (dict) – the lineage as keyword arguments

Returns

the taxon_id of the last element in the lineage

Return type

int

Raises

ValueError – if more than one keyword argument is not contained in TAXON_RANKS
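The keyword handling described for add_lineage() can be sketched as follows. This is a minimal illustration, not mgkit's implementation: prepare_lineage is a hypothetical helper, TAXON_RANKS is copied from the module constant documented above, and the taxon names are made up.

```python
TAXON_RANKS = (
    "superkingdom", "kingdom", "phylum", "class", "subclass",
    "order", "family", "genus", "species",
)


def prepare_lineage(**lineage):
    """Hypothetical helper: normalise the keys passed as a lineage."""
    # Strip trailing underscores, so that `class_` becomes `class`
    cleaned = {rank.rstrip("_"): name for rank, name in lineage.items()}
    # At most one extra node (e.g. strain) may fall outside TAXON_RANKS
    extra = [rank for rank in cleaned if rank not in TAXON_RANKS]
    if len(extra) > 1:
        raise ValueError("more than one unranked key: {}".format(extra))
    return cleaned


# `class_` is used because `class` is a reserved word in Python
lineage = prepare_lineage(phylum="Firmicutes", class_="Clostridia", strain="X1")
```

Passing two keys outside TAXON_RANKS, for example both strain and cultivar, makes the sketch raise ValueError.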

add_taxon(taxon_name, common_name='', rank='no rank', parent_id=None, lineage=None)[source]¶

Changed in version 0.5.7: added lineage; new taxon_ids are now negative

New in version 0.3.1.

Adds a taxon to the taxonomy. If a taxon with the same name and rank is found, its taxon_id is returned, otherwise a new taxon_id is returned.

Parameters

taxon_name (str) – scientific name of the taxon
common_name (str) – common name
rank (str) – rank, defaults to ‘no rank’
parent_id (int) – taxon_id of the parent, defaults to None, which is the taxonomy root
lineage (tuple) – lineage attribute in TaxonTuple

Returns

the taxon_id of the added taxon (if new), or the taxon_id of the taxon with the same name and rank found in the taxonomy

Return type

int

Raises

KeyError – if more than one taxon already has the passed name and rank and the ambiguity can't be resolved by looking at the parent_id passed

drop_taxon(taxon_id)[source]¶

New in version 0.3.1.

Drops a taxon and all taxa below it in the taxonomy. Also resets the name map for consistency.

Parameters: taxon_id (int) – taxon_id to drop from the taxonomy

find_by_name(s_name, rank=None, strict=True)[source]¶

Changed in version 0.2.3: the search is now case insensitive

Changed in version 0.3.1: added rank and strict parameter

Returns the taxon IDs associated with the scientific name provided

Parameters

s_name (str) – the scientific name
rank (str, None) – return only a taxon_id of a specific rank
strict (bool) – if True and rank is not None, KeyError will be raised if multiple taxa have the same name and rank

Returns

a reference to the list of IDs that have s_name as their name, if rank is None. If rank is not None and one taxon is found, its taxon_id is returned, or None if no taxon is found. If strict is False, rank is not None and multiple taxa are found, the set of taxon_ids found is returned.

Return type

list

Raises

KeyError – if multiple taxa with the same name and rank are found and strict is True
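The lookup rules of find_by_name() can be illustrated with a toy version. This sketch does not use mgkit: name_map and ranks are hypothetical stand-in structures, all IDs are made up, and it assumes that multiple matches are returned as a set only when strict is False, since strict=True is documented to raise KeyError.

```python
def find_by_name_sketch(name_map, ranks, s_name, rank=None, strict=True):
    """Toy lookup: `name_map` maps lowercase names to lists of IDs and
    `ranks` maps IDs to rank strings (both hypothetical structures)."""
    taxon_ids = name_map[s_name.lower()]  # the search is case insensitive
    if rank is None:
        return taxon_ids
    found = {taxon_id for taxon_id in taxon_ids if ranks[taxon_id] == rank}
    if not found:
        return None
    if len(found) == 1:
        return found.pop()
    if strict:
        raise KeyError("multiple taxa named {!r} with rank {!r}".format(s_name, rank))
    return found


# Made-up IDs: one name shared by three taxa, two of them of rank genus
name_map = {"example": [10, 20, 30]}
ranks = {10: "phylum", 20: "genus", 30: "genus"}

all_ids = find_by_name_sketch(name_map, ranks, "Example")
phylum_id = find_by_name_sketch(name_map, ranks, "example", rank="phylum")
genus_ids = find_by_name_sketch(name_map, ranks, "example", rank="genus", strict=False)
```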

gen_alt_map()[source]¶

gen_name_map()[source]¶: Changed in version 0.2.3: names are stored in the mapping as lowercase

Generates a name map, in which each scientific name in the taxonomy is associated with an ID.

get_by_lineage(lineage)[source]¶: New in version 0.5.7.

Returns the taxon_id, given an ID stored in the lineage attribute of the taxon

get_lineage(taxon_id, names=False, only_ranked=True, with_last=True, **kwd)[source]¶

New in version 0.3.1.

Proxy for the module-level get_lineage() function, with changed defaults. Other keyword arguments are passed to get_lineage().

Parameters

taxon_id (int) – taxon_id whose lineage is returned
names (bool) – if the elements of the list are converted into the scientific names
only_ranked (bool) – only return the ranked taxa
with_last (bool) – include the taxon_id passed to the list

Returns

the lineage of the passed taxon_id as a list of IDs or names

Return type

list

get_lineage_string(taxon_id, only_ranked=True, with_last=True, sep=';', rank=None, **kwd)[source]¶

New in version 0.3.3.

Generates a lineage string, optionally cutting the lineage at another rank, such as phylum, by first getting the ranked taxon (via Taxonomy.get_ranked_taxon()). Other keyword arguments are passed to the called functions.

Parameters

taxon_id (int) – taxon_id whose lineage is returned
only_ranked (bool) – only return the ranked taxa
with_last (bool) – include the taxon_id passed to the list
sep (str) – separator used to join the lineage string
rank (str or None) – if None the full lineage is returned, otherwise the lineage will be cut to the specified rank

Returns

lineage string

Return type

str

get_name_map()[source]¶: Returns a taxon_id->s_name dictionary

get_ranked_id(taxon_id, rank=None, it=False, include_higher=True)[source]¶

New in version 0.3.4.

Gets the ranked taxon of another one. Useful when it’s better to get a taxon_id instead of an instance of TaxonTuple. Internally, it relies on Taxonomy.get_ranked_taxon().

Parameters

taxon_id (int) – taxon_id
rank (str or None) – passed over
it (bool) – determines the return value. If True, a list is returned
include_higher (bool) – if True, any rank higher than the requested may be returned. If False and the rank cannot be returned, None is returned

Returns

The type returned is based on the it parameter. If it is True, the return value is a list with the taxon_id of the ranked taxon as the sole value. If False, the returned value is the taxon_id. If include_higher is False and the exact rank is not found, None is returned.

Return type

int or list

get_ranked_taxon(taxon_id, rank=None, ranks=('superkingdom', 'kingdom', 'phylum', 'class', 'subclass', 'order', 'family', 'genus', 'species'), roots=False)[source]¶

Changed in version 0.1.13: added roots argument

Traverses backward the branch of which the taxon argument is the leaf, to get the taxon of the specific rank to which it belongs.

Warning

The roots option is kept for backward compatibility and should be set to False.

Parameters

taxon_id – id of the taxon or instance of UniprotTaxon
rank (str) – string that specifies the rank; if None, the first valid rank will be searched (i.e. the first with a value different from ‘’)
ranks – tuple of all taxonomy ranks, defaults to the module value (TAXON_RANKS)
roots (bool) – if True, uses TAXON_ROOTS to solve the root taxa

Returns

instance of TaxonTuple for the rank found.

is_ancestor(leaf_id, anc_ids)[source]¶

Changed in version 0.1.13: now uses is_ancestor() and changed behavior

Checks if a taxon is the leaf of another one, or a list of taxa.

Parameters

leaf_id (int) – leaf taxon id
anc_ids (int) – ancestor taxon id(s)

Returns

True if the ancestor taxon is in the leaf taxon lineage

Return type

bool

is_ranked_below(taxon_id, rank, equal=True)[source]¶

New in version 0.4.0.

Tests if a taxon_id is below the requested rank.

Parameters

taxon_id (int) – taxon_id to test
rank (str) – rank requested
equal (bool) – determines if the taxon_id tested may be of the requested rank

Returns

If the passed taxon_id is below the requested rank, it returns True. If taxon_id is of the rank requested and equal is True, the return value is True, if equal is False the return value is False. The return value is False otherwise.

Return type

bool

iter_ids()[source]¶: New in version 0.5.4.

Iterates over the taxon IDs

load_data(file_handle)[source]¶

Changed in version 0.2.3: can now read msgpack serialised files

Changed in version 0.1.13: now accepts file handles and compressed files (if file names)

Loads serialised data from the file name or file handle passed as file_handle. Compressed files are accepted if a file name is passed.

If the .msgpack string is found in the file name, the msgpack package is used instead of pickle.

Parameters: file_handle (str, file) – file name or handle from which the instance data is loaded

property max_id¶: New in version 0.5.7.

Gets the highest taxon_id in the taxonomy

property min_id¶: New in version 0.5.7.

Gets the lowest taxon_id in the taxonomy

static parse_gtdb_lineage(lineage, sep=';')[source]¶

New in version 0.3.3.

Parse a GTDB lineage, one that defines the rank as a single letter, followed by __ for each taxon name. Taxa are separated by semicolon by default. Also the domain rank is renamed into superkingdom to allow mixing of taxonomies.

Returns: dictionary with the parsed lineage, which can be passed to Taxonomy.add_lineage()
Return type: dict
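The GTDB lineage format described for parse_gtdb_lineage() can be illustrated with a small standalone parser. It is a sketch under assumptions, not mgkit's code: the GTDB_RANKS mapping, the function name and the choice to skip empty ranks are all hypothetical.

```python
# Assumed mapping from GTDB single-letter prefixes to rank names
GTDB_RANKS = {
    "d": "superkingdom",  # GTDB "domain", renamed as described above
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def parse_gtdb_lineage_sketch(lineage, sep=";"):
    """Toy parser for strings such as 'd__Bacteria;p__Firmicutes'."""
    parsed = {}
    for taxon in lineage.strip().split(sep):
        prefix, _, name = taxon.partition("__")
        if name:  # assumption: ranks left empty, e.g. 's__', are skipped
            parsed[GTDB_RANKS[prefix]] = name
    return parsed


parsed = parse_gtdb_lineage_sketch("d__Bacteria;p__Firmicutes;c__Clostridia;s__")
```

The resulting dictionary uses rank names as keys, which is the shape that Taxonomy.add_lineage() is documented to accept as keyword arguments.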

static parse_phylophlan_lineage(lineage, sep='|', field_sep='\t', id_col=10, name_col=9)[source]¶

New in version 0.5.7.

Parses a line from PhyloPhlan 3 taxonomy

Parameters

lineage (str) – line of PhyloPhlan 3 taxonomy
sep (str) – separator for the taxa string
field_sep (str) – field separator
id_col (int) – index of the column with NCBI IDs
name_col (int) – index of the column with the lineage

Returns

list of dictionaries, with values that can be used with TaxonTuple

Return type

list

read_from_gtdb_taxonomy(file_handle, use_gtdb_name=True, sep='\t')[source]¶

New in version 0.3.0.

Changed in version 0.3.1: replaced domain with superkingdom to support get_lineage

Reads a GTDB taxonomy file (tab separated genome_id/taxonomy) and populates the taxonomy instance. The method also returns a dictionary of genome_id -> taxon_id.

Parameters

file_handle (file) – file with the taxonomy
use_gtdb_name (bool) – if True, the names are kept as-is in the s_name attribute of TaxonTuple and the “cleaned” version in c_name (e.g. f__Ammonifexaceae -> Ammonifexaceae). If False, the values are switched
sep (str) – separator between the columns of the file

Returns

dictionary of genome_id -> taxon_id, reflecting the created taxonomy

Return type

dict

Note

The taxon_ids are generated, so there’s no guarantee they will be the same in a successive execution.

read_from_ncbi_dump(nodes_file, names_file=None, merged_file=None)[source]¶

New in version 0.2.3.

Uses the nodes.dmp and optionally names.dmp, merged.dmp files from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ to populate the taxonomy.

Parameters

nodes_file (str, file) – file name or handle to the file
names_file (str, file, None) – file name or handle to the file, if None, names won’t be added to the taxa
merged_file (str, file, None) – file name or handle to the file, if None, pointers to merged taxa won’t be added

read_from_phylophlan_taxonomy(file_name, field_sep='\t')[source]¶

New in version 0.5.7.

Parses a PhyloPhlan 3 taxonomy. NCBI IDs will be kept, new IDs will be negative to avoid confusion with NCBI IDs.

Parameters

file_name (str) – file name of the taxonomy
field_sep (str) – field separator in the file

read_taxonomy(f_handle, light=True)[source]¶

Changed in version 0.2.1: added light parameter

Deprecated since version 0.4.0: use Taxonomy.read_from_ncbi_dump()

Reads taxonomy from a file handle. The file needs to be in the tab separated format returned by a query on Uniprot. If light is True, lineage is not stored to decrease the memory usage. This is now the default.

New taxa will be added, duplicated taxa will be skipped.

Parameters: f_handle (handle) – file handle of the taxonomy file.

save_data(file_handle)[source]¶

Changed in version 0.2.3: can now use msgpack to serialise

Saves taxonomy data to a file handle or file name. Compressed data can be written if the file name ends with “.gz” or “.bz2”.

If the .msgpack string is found in the file name, the msgpack package is used instead of pickle.

Parameters: file_handle (str, file) – file name or handle to which the instance data is saved
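The extension-based handling of compressed files can be sketched with the standard library alone. The helper open_by_extension and the payload are hypothetical and only illustrate the idea; mgkit's own file handling may differ, and the msgpack branch is left out.

```python
import bz2
import gzip
import os
import pickle
import tempfile


def open_by_extension(file_name, mode):
    """Hypothetical helper: pick an opener from the file extension."""
    if file_name.endswith(".gz"):
        return gzip.open(file_name, mode)
    if file_name.endswith(".bz2"):
        return bz2.open(file_name, mode)
    return open(file_name, mode)


data = {2: ("bacteria", "superkingdom")}  # illustrative payload only
file_name = os.path.join(tempfile.mkdtemp(), "taxonomy.pickle.gz")

# Serialise with pickle through the gzip opener chosen by the extension
with open_by_extension(file_name, "wb") as handle:
    pickle.dump(data, handle)

# Read the data back through the same helper
with open_by_extension(file_name, "rb") as handle:
    restored = pickle.load(handle)
```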

class mgkit.taxon.UniprotTaxonTuple(taxon_id, s_name, c_name, rank, lineage, parent_id)¶

Bases: tuple

c_name¶: Alias for field number 2

lineage¶: Alias for field number 4

parent_id¶: Alias for field number 5

rank¶: Alias for field number 3

s_name¶: Alias for field number 1

taxon_id¶: Alias for field number 0
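The field order listed above can be reproduced with a plain named tuple. The sketch below builds a stand-in with collections.namedtuple instead of importing mgkit, and the example values are illustrative only.

```python
from collections import namedtuple

# Stand-in mirroring the documented field order; not imported from mgkit
TaxonSketch = namedtuple(
    "TaxonSketch",
    ["taxon_id", "s_name", "c_name", "rank", "lineage", "parent_id"],
)

# Illustrative values only
taxon = TaxonSketch(2, "bacteria", "", "superkingdom", (), None)

# Fields are reachable both by name and by position
same_name = taxon.s_name == taxon[1]
same_parent = taxon.parent_id == taxon[5]
```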

mgkit.taxon.UniprotTaxonomy¶: alias of mgkit.taxon.Taxonomy

mgkit.taxon.distance_taxa_ancestor(taxonomy, taxon_id, anc_id)[source]¶

New in version 0.1.16.

Function to calculate the distance between a taxon and the given ancestor

The distance is equal to the number of steps in the taxonomy taken to arrive at the ancestor.

Parameters

taxonomy – Taxonomy instance
taxon_id (int) – taxonomic identifier
anc_id (int) – taxonomic identifier of the ancestor

Returns: distance between taxon_id and its ancestor anc_id
Return type: int

mgkit.taxon.distance_two_taxa(taxonomy, taxon_id1, taxon_id2)[source]¶

New in version 0.1.16.

Calculates the distance between two taxa. The distance is equal to the sum of the steps it takes to traverse the taxonomy from each taxon up to their last common ancestor.

Parameters

taxonomy – Taxonomy instance
taxon_id1 (int) – taxonomic identifier of first taxon
taxon_id2 (int) – taxonomic identifier of second taxon

Returns: distance between taxon_id1 and taxon_id2
Return type: int
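The distance definition can be made concrete with a toy taxonomy. The sketch below works on a hypothetical child-to-parent dictionary with made-up IDs and does not call mgkit; it only shows how the steps from both taxa to their last common ancestor add up.

```python
# Toy taxonomy as a child -> parent map; None marks the root (all IDs made up)
PARENTS = {1: None, 2: 1, 3: 2, 4: 2, 5: 4}


def path_to_root(taxon_id):
    """List of IDs from taxon_id up to the root, taxon_id included."""
    path = []
    while taxon_id is not None:
        path.append(taxon_id)
        taxon_id = PARENTS[taxon_id]
    return path


def distance_sketch(taxon_id1, taxon_id2):
    """Steps from each taxon up to their last common ancestor, summed."""
    path1 = path_to_root(taxon_id1)
    path2 = path_to_root(taxon_id2)
    for steps1, anc_id in enumerate(path1):
        if anc_id in path2:
            return steps1 + path2.index(anc_id)
    raise ValueError("no common ancestor")


# 3 -> 2 is one step and 5 -> 4 -> 2 is two steps, so the distance is 3
distance = distance_sketch(3, 5)
```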

mgkit.taxon.get_ancestor_map(leaf_ids, anc_ids, taxonomy)[source]¶

This function returns a dictionary where every leaf taxon is associated with the right ancestors in anc_ids

ex. {clostridium: [bacteria, clostridia]}

mgkit.taxon.get_lineage(taxonomy, taxon_id, names=False, only_ranked=False, with_last=False, add_rank=False, use_cname=False)[source]¶

New in version 0.2.1.

Changed in version 0.2.5: added only_ranked

Changed in version 0.3.0: added with_last

Changed in version 0.5.7: added add_rank and use_cname

Returns the lineage of a taxon_id, as a list of taxon_id or taxa names

Parameters

taxonomy – a Taxonomy instance
taxon_id (int) – taxon_id whose lineage to return
names (bool) – if True, the returned list contains the names of the taxa instead of the taxon_id
only_ranked (bool) – if True, only taxonomic levels whose rank is in TAXON_RANKS will be returned
with_last (bool) – if True, the passed taxon_id is included in the lineage
add_rank (bool) – prepend the names in the lineage with the first letter of the rank and 2 underscores
use_cname (bool) – use the common name (c_name) instead of the scientific name (s_name)

Returns

lineage of the taxon_id, the elements are int if names is False, and str when names is True. If a taxon has no scientific name, the common name is used. If only_ranked is True, the returned list only contains ranked taxa (according to TAXON_RANKS).

Return type

list
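How names, only_ranked and with_last interact can be shown on a toy taxonomy. This is a sketch, not mgkit's implementation: the TAXA dictionary and its values are made up, and returning the lineage ordered from the root down is an assumption.

```python
TAXON_RANKS = (
    "superkingdom", "kingdom", "phylum", "class", "subclass",
    "order", "family", "genus", "species",
)

# Toy taxonomy: taxon_id -> (name, rank, parent_id); all values are made up
TAXA = {
    1: ("root", "no rank", None),
    2: ("bacteria", "superkingdom", 1),
    3: ("some clade", "no rank", 2),
    4: ("firmicutes", "phylum", 3),
}


def lineage_sketch(taxon_id, names=False, only_ranked=False, with_last=False):
    """Walk up to the root, then report the lineage from the root down."""
    lineage = []
    # with_last decides whether the walk starts at the taxon or its parent
    current = taxon_id if with_last else TAXA[taxon_id][2]
    while current is not None:
        lineage.append(current)
        current = TAXA[current][2]
    lineage.reverse()
    if only_ranked:
        lineage = [tid for tid in lineage if TAXA[tid][1] in TAXON_RANKS]
    if names:
        return [TAXA[tid][0] for tid in lineage]
    return lineage


full_ids = lineage_sketch(4)
ranked_names = lineage_sketch(4, names=True, only_ranked=True, with_last=True)
```

With the defaults of the module-level function, the passed taxon is left out and unranked nodes are kept. The second call uses the defaults documented for Taxonomy.get_lineage() plus names=True.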

mgkit.taxon.is_ancestor(taxonomy, taxon_id, anc_id)[source]¶

Changed in version 0.1.16: if a taxon_id raises a KeyError, False is returned

Determine if the given taxon id (taxon_id) has anc_id as ancestor.

Parameters

taxonomy (Taxonomy) – taxonomy used to test
taxon_id (int) – leaf taxon to test
anc_id (int) – ancestor taxon to test against

Returns

True if anc_id is an ancestor of taxon_id or they are the same

Return type

bool

mgkit.taxon.last_common_ancestor(taxonomy, taxon_id1, taxon_id2)[source]¶

New in version 0.1.13.

Finds the last common ancestor of two taxon IDs. An alias to this function is in the same module, called lowest_common_ancestor for compatibility.

Parameters

taxonomy – Taxonomy instance used to test
taxon_id1 (int) – first taxon ID
taxon_id2 (int) – second taxon ID

Returns: taxon ID of the lowest common ancestor
Return type: int

Raises: NoLcaFound – if no common ancestor can be found

mgkit.taxon.last_common_ancestor_multiple(taxonomy, taxon_ids)[source]¶

New in version 0.2.5.

Applies last_common_ancestor() to an iterable that yields taxon_id while removing any None values. If the list is of one element, that taxon_id is returned.

Parameters

taxonomy – instance of Taxonomy
taxon_ids (iterable) – an iterable that yields taxon_id

Returns

the taxon_id that is the last common ancestor of all taxon_ids passed

Return type

int

Raises

NoLcaFound – when no common ancestry is found or the number of taxon_ids is 0
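Reducing a pairwise last common ancestor over several IDs can be sketched on a toy taxonomy. Everything here is a stand-in: PARENTS is a hypothetical child-to-parent map with made-up IDs, NoLcaFoundSketch replaces the module's NoLcaFound, and the use of functools.reduce is an illustration rather than mgkit's own code.

```python
from functools import reduce


class NoLcaFoundSketch(Exception):
    """Stand-in for the NoLcaFound exception documented above."""


# Toy child -> parent map; None marks the root (all IDs made up)
PARENTS = {1: None, 2: 1, 3: 2, 4: 2, 5: 4}


def ancestors(taxon_id):
    """IDs from taxon_id up to the root, taxon_id included."""
    path = []
    while taxon_id is not None:
        path.append(taxon_id)
        taxon_id = PARENTS[taxon_id]
    return path


def lca_pair(taxon_id1, taxon_id2):
    """First ancestor of taxon_id1 that is also an ancestor of taxon_id2."""
    others = set(ancestors(taxon_id2))
    for anc_id in ancestors(taxon_id1):
        if anc_id in others:
            return anc_id
    raise NoLcaFoundSketch("no common ancestor")


def lca_multiple(taxon_ids):
    """Fold lca_pair over the IDs, ignoring None values."""
    taxon_ids = [taxon_id for taxon_id in taxon_ids if taxon_id is not None]
    if not taxon_ids:
        raise NoLcaFoundSketch("no taxon_id passed")
    return reduce(lca_pair, taxon_ids)


lca = lca_multiple([3, None, 5, 4])
```

With a single ID, reduce returns that ID unchanged, which matches the behaviour described for a list of one element.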

mgkit.taxon.lowest_common_ancestor(taxonomy, taxon_id1, taxon_id2)¶

New in version 0.1.13.

Finds the last common ancestor of two taxon IDs. An alias to this function is in the same module, called lowest_common_ancestor for compatibility.

Parameters

taxonomy – Taxonomy instance used to test
taxon_id1 (int) – first taxon ID
taxon_id2 (int) – second taxon ID

Returns: taxon ID of the lowest common ancestor
Return type: int

Raises: NoLcaFound – if no common ancestor can be found

mgkit.taxon.parse_ncbi_taxonomy_merged_file(file_handle)[source]¶

New in version 0.2.3.

Parses the merged.dmp file where the merged taxon_id are stored. Available at ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

Parameters: file_handle (str, file) – file name or handle to the file
Returns: dictionary with merged_id -> taxon_id
Return type: dict
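The shape of the returned dictionary can be illustrated with a toy parser for the NCBI dump layout, in which fields are separated by a tab, a pipe and a tab. The function name and the two sample lines are hypothetical, and the sketch reads from an in-memory handle instead of a downloaded merged.dmp.

```python
import io


def parse_merged_sketch(file_handle):
    """Toy parser for lines shaped like 'old_id<TAB>|<TAB>new_id<TAB>|'."""
    merged = {}
    for line in file_handle:
        if not line.strip():
            continue  # ignore blank lines
        merged_id, taxon_id = line.split("|")[:2]
        # int() ignores the tabs left around each field
        merged[int(merged_id)] = int(taxon_id)
    return merged


# Made-up sample in the dump layout
sample = io.StringIO("12\t|\t74109\t|\n30\t|\t29\t|\n")
merged_ids = parse_merged_sketch(sample)
```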

mgkit.taxon.parse_ncbi_taxonomy_names_file(file_handle, name_classes=('scientific name', 'common name'))[source]¶

New in version 0.2.3.

Parses the names.dmp file where the names associated to a taxon_id are stored. Available at ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

Parameters

file_handle (str, file) – file name or handle to the file
name_classes (tuple) – name classes to save, only the scientific and common name are stored

Returns

dictionary with taxon_id -> names, limited to the requested name classes

Return type

dict

mgkit.taxon.parse_ncbi_taxonomy_nodes_file(file_handle, taxa_names=None)[source]¶

New in version 0.2.3.

Parses the nodes.dmp file where the nodes of the taxonomy are stored. Available at ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/.

Parameters

file_handle (str, file) – file name or handle to the file
taxa_names (dict) – dictionary with the taxa names (returned from parse_ncbi_taxonomy_names_file())

Yields

TaxonTuple – TaxonTuple instance

mgkit.taxon.parse_uniprot_taxon(line, light=True)[source]¶: Changed in version 0.1.13: now accepts empty scientific names, for root taxa

Changed in version 0.2.1: added light parameter

Deprecated since version 0.4.0.

Parses a Uniprot taxonomy file (tab delimited) line and returns a UniprotTaxonTuple instance. If light is True, lineage is not stored to decrease the memory usage. This is now the default.

mgkit.taxon.taxa_distance_matrix(taxonomy, taxon_ids)[source]¶

New in version 0.1.16.

Given a list of taxonomic identifiers, returns a distance matrix in a pairwise manner by using distance_two_taxa() on all possible two element combinations of taxon_ids.

Parameters

taxonomy – Taxonomy instance
taxon_ids (iterable) – list of taxonomic identifiers

Returns

matrix with the pairwise distances of all taxon_ids

Return type

pandas.DataFrame
