taxon-utils - Taxonomy Utilities¶

Overview¶

The script contains commands used to access functionality related to taxonomy, without the need to write ad-hoc code for functionality that can be part of a workflow. One example is access to the last common ancestor function contained in mgkit.taxon.

Last Common Ancestor (lca and lca_line)¶

These commands expose the functionality of last_common_ancestor_multiple(), making it accessible via the command line. They differ in the input file format and the choice of output files.

The lca command can be used to define the last common ancestor of contigs from the annotations in a GFF file. The command uses the taxon_ids from all annotations belonging to a contig/sequence, provided their bitscore is greater than or equal to the one passed (the -b option, 0 by default). The default output of the command is a tab separated file whose columns are the contig/sequence name, the taxon_id of the last common ancestor, its scientific/common name and its lineage.

For example:

contig_21   172788  uncultured phototrophic eukaryote   cellular organisms,environmental samples
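The grouping and thresholding described above can be sketched in plain Python. This is a minimal illustration and not MGKit's implementation: the parent map, the taxon IDs and the helper names (lineage, last_common_ancestor, contig_lca) are all made up for the example.

```python
# Toy taxonomy: child taxon_id -> parent taxon_id (None marks the root).
# The IDs and structure are illustrative only, not real NCBI data.
PARENTS = {1: None, 2: 1, 3: 1, 10: 2, 11: 2, 100: 10, 101: 10}


def lineage(taxon_id, parents=PARENTS):
    """Return the path from the root down to taxon_id (inclusive)."""
    path = []
    while taxon_id is not None:
        path.append(taxon_id)
        taxon_id = parents[taxon_id]
    return path[::-1]


def last_common_ancestor(taxon_ids, parents=PARENTS):
    """Return the deepest taxon shared by all lineages, or None."""
    lineages = [lineage(t, parents) for t in set(taxon_ids)]
    lca = None
    # Walk the lineages level by level, stopping at the first disagreement.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


def contig_lca(annotations, min_bitscore=0.0, parents=PARENTS):
    """annotations: iterable of (seq_id, taxon_id, bitscore) tuples."""
    by_contig = {}
    for seq_id, taxon_id, bitscore in annotations:
        if bitscore >= min_bitscore:
            by_contig.setdefault(seq_id, []).append(taxon_id)
    return {
        seq_id: last_common_ancestor(ids, parents)
        for seq_id, ids in by_contig.items()
    }


annotations = [
    ("contig_1", 100, 80.0),
    ("contig_1", 101, 60.0),
    ("contig_1", 3, 20.0),  # dropped by the bitscore threshold below
    ("contig_2", 100, 90.0),
    ("contig_2", 11, 75.0),
]
result = contig_lca(annotations, min_bitscore=50.0)
print(result)  # {'contig_1': 10, 'contig_2': 2}
```

Lowering the threshold to 0 would let the annotation with taxon 3 through, pulling the ancestor of contig_1 up to the root of the toy tree.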

If the -r option is used to pass the FASTA file containing the nucleotide sequences, the output file is a GFF in which, for each contig, an annotation spanning the full contig length contains the same information as the tab separated file format.

The lca_line command accepts as input a file where each line consists of a list of taxon_ids. The separator for the list can be changed and it defaults to TAB. The last common ancestor for all taxa on a line is searched. The output of this command is the same as the tab separated file of the lca command, except for the first column, which in this command becomes the list of all taxon_ids that were used to find the last common ancestor for that line. The list of taxon_ids is separated by a semicolon “;”.
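As an illustration of the two formats, the sketch below writes a TAB separated input with one list of taxon_ids per line and shows what the semicolon separated first column would look like for each line. The helper name first_column is hypothetical, and whether the command preserves the order of the IDs is an assumption.

```python
import csv
import io

# One list of taxon_ids per input line (IDs taken from the examples in this page).
groups = [[838, 1485], [2172, 2147, 838]]

# Build the input text for lca_line using the default TAB separator.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(groups)
input_text = buffer.getvalue()


def first_column(line, separator="\t"):
    """Join the taxon_ids of one input line with ';' (assumed order-preserving)."""
    return ";".join(line.rstrip("\n").split(separator))


first_columns = [first_column(line) for line in input_text.splitlines()]
print(first_columns)  # ['838;1485', '2172;2147;838']
```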

Note

Both commands also accept the -n option, to report the contig/line and the taxon_ids that had no common ancestor. These are treated as errors and do not appear in the output file.

Krona Output¶

New in version 0.3.0.

The lca command supports the writing of a file compatible with Krona. The output file can be used with the ktImportText/ImportText.pl script included with KronaTools. Specifically, the output from taxon_utils will be a file with all the lineages found (tab separated), that can be used with:

$ ktImportText -q taxon_utils_output

Note the use of -q to make the script count the lineages. Sequences with no LCA found will be marked as No LCA in the graph; the -n option is not required.

If a fasta file is passed, the format is changed to add the number of base pairs for each contig, showing the number of bases for each taxonomic assignment. The option -kt is not needed in this case. To generate a Krona plot from this output:

$ ktImportText taxon_utils_output
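The difference between the two layouts can be sketched as follows. The exact lines written by taxon-utils are not shown in this page, so the layout here is an assumption based on the ktImportText convention of an optional leading quantity column followed by the lineage; the function name krona_lines and the sample data are hypothetical.

```python
def krona_lines(lineages, lengths=None):
    """Build tab separated lines for ktImportText.

    lineages: dict of seq_id -> tuple of lineage names (empty tuple: no LCA).
    lengths: optional dict of seq_id -> contig length in base pairs.
    """
    lines = []
    for seq_id, lineage in lineages.items():
        fields = list(lineage) or ["No LCA"]
        if lengths is not None:
            # With lengths, the first column carries the magnitude,
            # so ktImportText is run without -q.
            fields.insert(0, str(lengths[seq_id]))
        lines.append("\t".join(fields))
    return lines


lineages = {
    "contig_1": ("Bacteria", "Bacteroidetes"),
    "contig_2": (),
}
lengths = {"contig_1": 1200, "contig_2": 300}

counted = krona_lines(lineages)           # one lineage per line, use -q
weighted = krona_lines(lineages, lengths)  # leading base pair count, no -q
print(counted)
print(weighted)
```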

Note

Please note that the output won’t include any sequence that didn’t have a hit with the software used. If that’s important, the -kt option can be used to add a number of Unknown lines at the end, to reach the total supplied.

Filter by Taxon¶

The filter command of this script filters a GFF file using the taxon_id attribute, to include only some annotations or to exclude some. The filter is based on the mgkit.taxon.is_ancestor function and on mgkit.filter.taxon.filter_taxon_by_id_list. It can also filter a table (tab separated values) when the first element is an ID and the second is a taxon_id. An example of a table of this sort is the output of download-ncbi-taxa.sh and download-uniprot-taxa.sh, where each accession of a database is associated with a taxon_id.

Multiple taxon_ids can be passed, either for inclusion or exclusion. If both exclusion and inclusion are used, the first check is on the inclusion and then on the exclusion. As an alternative to passing taxon_ids, taxon_names can be passed; values such as ‘cellular organisms’ need to be quoted. Example:

$ taxon-utils filter -i 2 -in archaea -en prevotella -t taxonomy.pickle in.gff out.gff

This keeps only lines that are from Bacteria (taxon_id=2) or Archaea, and excludes those from the genus Prevotella.

Multiple inclusion and exclusion flags can be passed:

$ taxon-utils filter -i 2 -i 2172 -t taxonomy in.gff out.gff

In particular, the inclusion flag is tested first and then the exclusion is tested. So a line like this one:

printf "TEST\t838\nTEST\t1485" | taxon-utils filter -p -t taxonomy.pickle -i 2 -i 1485 -e 838

This produces TEST 1485: both Prevotella (838) and Clostridium (1485) pass the inclusion test, since they belong to Bacteria (2), but Prevotella is then removed by the exclusion option. This line also illustrates that a tab-separated file, where the second column contains taxon IDs, can be filtered. In particular it can be applied to files produced by download-ncbi-taxa.sh or download-uniprot-taxa.sh (see Download Taxonomy).
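The inclusion-then-exclusion order can be sketched in plain Python. This is only an illustration of the logic described above, not the code used by the script: the parent map is heavily simplified (intermediate ranks are omitted), and the is_ancestor and keep helpers below are stand-ins with their own signatures, written under the assumption that a taxon matches its own ID.

```python
# Simplified lineage table: taxon_id -> parent taxon_id (None: top of the toy tree).
# 2 = Bacteria, 838 = Prevotella, 1485 = Clostridium, 2157 = Archaea,
# 2172 = Methanobrevibacter. Intermediate ranks are left out on purpose.
PARENTS = {2: None, 838: 2, 1485: 2, 2157: None, 2172: 2157}


def is_ancestor(taxon_id, ancestor_id, parents=PARENTS):
    """True if ancestor_id is taxon_id itself or one of its ancestors."""
    while taxon_id is not None:
        if taxon_id == ancestor_id:
            return True
        taxon_id = parents.get(taxon_id)
    return False


def keep(taxon_id, include=(), exclude=()):
    """Apply the inclusion test first, then the exclusion test."""
    if include and not any(is_ancestor(taxon_id, i) for i in include):
        return False
    return not any(is_ancestor(taxon_id, e) for e in exclude)


rows = [("TEST", 838), ("TEST", 1485), ("TEST", 2172)]
kept = [row for row in rows if keep(row[1], include=(2, 1485), exclude=(838,))]
print(kept)  # [('TEST', 1485)]
```

The extra Methanobrevibacter row is dropped at the inclusion step, because it is neither under Bacteria nor Clostridium in the toy tree.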

Warning

Annotations with no taxon_id are not included in the output of either filter.

Convert Taxa Tables to HDF5¶

This command is used to convert the taxa tables downloaded from Uniprot and NCBI, using the scripts mentioned in download-data (download-uniprot-taxa.sh and download-ncbi-taxa.sh), into an HDF5 file that can be used with the addtaxa command in add-gff-info - Add informations to GFF annotations.

One advantage is a faster lookup of the IDs. Another is a smaller memory footprint when a great number of annotations are kept in memory.

Extract taxonomy¶

This command prints the taxonomy stored in a file created with MGKit. The default behaviour is to print the scientific name, taxon_id, rank and lineage (only ranked taxa) in a tab separated file (or to standard output). IDs can also be passed along with names:

taxon-utils get -i 2147 -o prevotella -i 3688 -i 1485 -o methanobrevibacter taxonomy.msgpack

INFO - mgkit.taxon: Loading taxonomy from file taxonomy.msgpack
Acholeplasma    2147    genus   Bacteria;Tenericutes;Mollicutes;Acholeplasmatales;Acholeplasmataceae;Acholeplasma
Prevotella      838     genus   Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella
Salicaceae      3688    family  Eukaryota;Viridiplantae;Streptophyta;Magnoliopsida;Malpighiales;Salicaceae
Methanobrevibacter      2172    genus   Archaea;Euryarchaeota;Methanobacteria;Methanobacteriales;Methanobacteriaceae;Methanobrevibacter
Clostridium     1485    genus   Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium

The taxa separator can be changed from ; with -x and common names instead of scientific names can be used with the -a option.
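For downstream use, one line of the output shown above can be split back into its four fields. The sketch below assumes the default TAB column separator and ";" taxa separator; the TaxonRow and parse_row names are hypothetical helpers, not part of MGKit.

```python
from typing import NamedTuple


class TaxonRow(NamedTuple):
    name: str
    taxon_id: int
    rank: str
    lineage: list


def parse_row(line, separator="\t", tax_sep=";"):
    """Split one output line into name, taxon_id, rank and lineage."""
    name, taxon_id, rank, lineage = line.rstrip("\n").split(separator)
    return TaxonRow(name, int(taxon_id), rank, lineage.split(tax_sep))


line = (
    "Prevotella\t838\tgenus\t"
    "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella"
)
row = parse_row(line)
print(row.taxon_id, row.rank, row.lineage[0], len(row.lineage))  # 838 genus Bacteria 6
```

If the separators are changed with -s or -x, the same values would need to be passed to parse_row.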

Match names¶

Besides changing the column separator, it is possible to print only specific taxa (the search is case insensitive, but the name must otherwise be correct) and to print headers for the output table. Partial and fuzzy searches are also performed, but their matches are only reported unless the -p option is passed.

The lineage is printed only for the taxa requested, but all children can be included by using the -c option.
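The three kinds of name matching can be illustrated with the standard library. This page does not say which fuzzy algorithm the get command uses, so difflib is swapped in here purely as an example; the name list, the cutoff and the match_name helper are assumptions.

```python
import difflib

# A small, illustrative list of scientific names.
NAMES = ["Prevotella", "Prevotellaceae", "Methanobrevibacter", "Clostridium"]


def match_name(query, names=NAMES, cutoff=0.8):
    """Return (exact, partial, fuzzy) matches for a case insensitive query."""
    by_lower = {name.lower(): name for name in names}
    q = query.lower()
    exact = [by_lower[q]] if q in by_lower else []
    # Partial: the query is contained in a longer name.
    partial = [name for key, name in by_lower.items() if q in key and key != q]
    # Fuzzy: similar spelling, excluding names already matched above.
    close = difflib.get_close_matches(q, list(by_lower), n=3, cutoff=cutoff)
    fuzzy = [by_lower[k] for k in close if by_lower[k] not in exact + partial]
    return exact, partial, fuzzy


exact, partial, fuzzy = match_name("PREVOTELLA")
print(exact, partial)  # ['Prevotella'] ['Prevotellaceae']

# A misspelled name has no exact or partial match, only a fuzzy one.
print(match_name("prevotela"))
```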

Import Taxonomy¶

The import command allows the use of taxonomies other than NCBI and writes a file that can be used with other scripts. At the moment only PhyloPhlan is supported.

PhyloPhlan¶

Tested with version 3 of the script; the taxonomy file name is similar to SGB.Sep20.txt.bz2. The file also contains NCBI IDs for known taxa and those will be kept. The IDs created for non-NCBI taxa will be negative integers, to distinguish them.
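The idea of keeping NCBI IDs and giving negative integers to everything else can be sketched as below. How the import command actually numbers the new taxa is not described here, so the counting scheme, the assign_ids helper and the SGB_A/SGB_B names are purely hypothetical.

```python
import itertools


def assign_ids(taxa):
    """taxa: iterable of (name, ncbi_id or None). Returns {name: taxon_id}."""
    # Hypothetical scheme: count down from -1 for taxa without an NCBI ID.
    new_ids = itertools.count(-1, -1)
    return {
        name: ncbi_id if ncbi_id is not None else next(new_ids)
        for name, ncbi_id in taxa
    }


ids = assign_ids([("Prevotella", 838), ("SGB_A", None), ("SGB_B", None)])
print(ids)  # {'Prevotella': 838, 'SGB_A': -1, 'SGB_B': -2}
```

With this convention a simple sign check is enough to tell imported taxa from NCBI ones.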

Changes¶

Changed in version 0.5.7: added --only-ids, -x and -a options to the get command. Added the import command

Changed in version 0.5.0: added get command to taxon-utils to print the taxonomy or search in it

Changed in version 0.3.4: changed interface and behaviour for filter, also now can filter tables; lca has changed the interface and allows the output of a 2 column table

Changed in version 0.3.1: added to_hdf command

Changed in version 0.3.1: added -j option to lca, which outputs a JSON file with the LCA results

Changed in version 0.3.0: added -k and -kt options for Krona output; the lineage now includes the LCA; also added the -a option to select lineages with only ranked taxa. It now defaults to all components.

Changed in version 0.2.6: added feat-type option to lca command, added phylum output to nolca

New in version 0.2.5.

Options¶

taxon-utils¶

Main function

taxon-utils [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

--cite¶

filter¶

Filter a GFF file or a table based on taxonomy

taxon-utils filter [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]

Options

-v, --verbose¶

-p, --table¶

-t, --taxonomy <taxonomy>¶: Required Taxonomy file

-i, --include-taxon-id <include_taxon_id>¶: Include only taxon_ids

-in, --include-taxon-name <include_taxon_name>¶: Include only taxon_names

-e, --exclude-taxon-id <exclude_taxon_id>¶: Exclude taxon_ids

-en, --exclude-taxon-name <exclude_taxon_name>¶: Exclude taxon_names

--progress¶: Shows Progress Bar

Arguments

INPUT_FILE¶: Optional argument

OUTPUT_FILE¶: Optional argument

get¶

Extract the taxonomy as a CSV file

taxon-utils get [OPTIONS] TAXONOMY_FILE [OUTPUT_FILE]

Options

-v, --verbose¶

-d, --header¶: Include header in the output

-a, --use-cname¶: Use the common name if present

-s, --separator <separator>¶: column separator

-x, --tax-sep <tax_sep>¶

taxa separator

Default: ;

-o, --only-names <only_names>¶: Only get matched taxon names

-i, --only-ids <only_ids>¶: Only get matched taxon IDs

--name-file <name_file>¶: File with names to search

--id-file <id_file>¶: File with IDs to search

-p, --partial¶: Use partial matches if any found (implies -o)

-z, --no-fuzzy¶: Avoid fuzzy name search

-c, --include-children¶: Include taxa that are children of the requested (implies -o)

Arguments

TAXONOMY_FILE¶: Required argument

OUTPUT_FILE¶: Optional argument

import¶

Create a MGKit taxonomy from an alternative taxonomy

taxon-utils import [OPTIONS] IMPORT_FILE TAXONOMY_FILE

Options

-v, --verbose¶

-t, --tax-type <tax_type>¶

Type of taxonomy to import

Default: phylophlan
Options: phylophlan

Arguments

IMPORT_FILE¶: Required argument

TAXONOMY_FILE¶: Required argument

lca¶

Finds the last common ancestor for each sequence in a GFF file

taxon-utils lca [OPTIONS] [GFF_FILE] [OUTPUT_FILE]

Options

-v, --verbose¶

-t, --taxonomy <taxonomy>¶: Required Taxonomy file

-n, --no-lca <no_lca>¶: File to which write records with no LCA

-a, --only-ranked¶: If set, only taxa that have a rank will be used in the lineage. This is not advised for lineages such as Viruses, where the top levels have no rank

-b, --bitscore <bitscore>¶

Minimum bitscore accepted

Default: 0

-m, --rename¶: Emulates BLAST behaviour for headers (keep left of first space)

-s, --sorted¶: If the GFF file is sorted (all annotations of a sequence are contiguous), less memory can be used; sort -s -k 1,1 can be used to sort the file

-ft, --feat-type <feat_type>¶

Feature type used if the output is a GFF (default is LCA)

Default: LCA

-g, --group-by-attr <group_by_attr>¶

Attribute to get the LCA for - default to sequence

Default: seq_id

-r, --reference <reference>¶: Required reference file for the GFF, if krona is the format, contig lengths are added

-p, --simple-table¶: Uses a 2 column table format (seq_id taxon_id) TAB separated

-kt, --krona-total <krona_total>¶: Total number of raw sequences (used to output correct percentages in Krona)

-f, --out-format <out_format>¶

Format of output file

Default: tab
Options: krona|json|tab|gff

--progress¶: Shows Progress Bar

Arguments

GFF_FILE¶: Optional argument

OUTPUT_FILE¶: Optional argument

lca_line¶

Finds the last common ancestor for all IDs in a text file line

taxon-utils lca_line [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]

Options

-v, --verbose¶

-t, --taxonomy <taxonomy>¶: Required Taxonomy file

-n, --no-lca <no_lca>¶: File to which write records with no LCA

-a, --only-ranked¶: If set, only taxa that have a rank will be used in the lineage. This is not advised for lineages such as Viruses, where the top levels have no rank

-s, --separator <separator>¶: separator for taxon_ids (defaults to TAB)

Arguments

INPUT_FILE¶: Optional argument

OUTPUT_FILE¶: Optional argument

to_hdf¶

Convert a taxa table to HDF5. The input is in tabular format and defaults to stdin; the output file defaults to taxa-table.hf5

taxon-utils to_hdf [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]

Options

-v, --verbose¶

-n, --table-name <table_name>¶

Name of the table/storage to use

Default: taxa

-w, --overwrite¶: Overwrite the file, instead of appending to it

-s, --index-size <index_size>¶

Maximum number of characters for the gene_id

Default: 12

-c, --chunk-size <chunk_size>¶

Chunk size to use when reading the input file

Default: 5000000

--progress¶: Shows Progress Bar

Arguments

INPUT_FILE¶: Optional argument

OUTPUT_FILE¶: Optional argument
