Download Taxonomy

A bash script called download-taxonomy.sh is installed along with MGKit. This script downloads the relevant files from NCBI using wget and saves the taxonomy, in a form that MGKit can use, to a file called taxonomy.pickle.

The script uses wget to download the file taxdump.tar.gz, so it fails if wget can’t be found. To avoid this, download the file some other way beforehand: the script detects that the file already exists and skips the call to wget.
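
As an illustration of the workaround above, here is a minimal sketch that fetches taxdump.tar.gz with curl when it is not already present. The helper name ensure_taxdump is hypothetical (it is not part of MGKit), the URL is an assumption about where NCBI currently hosts the dump, and the sketch also assumes that download-taxonomy.sh looks for the file in the directory it is run from.

```shell
# Assumed location of the NCBI taxonomy dump; verify it before relying on it.
TAXDUMP_URL="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz"

# Hypothetical helper (not part of MGKit): make sure taxdump.tar.gz exists
# in the given directory, fetching it with curl only when it is missing.
ensure_taxdump() {
    dir="${1:-.}"
    if [ -f "$dir/taxdump.tar.gz" ]; then
        echo "taxdump.tar.gz already present, nothing to download"
        return 0
    fi
    # -f: fail on HTTP errors, -L: follow redirects, -o: output file.
    # Download to a temporary name so a failed transfer never leaves
    # a truncated taxdump.tar.gz behind.
    curl -fL -o "$dir/taxdump.tar.gz.part" "$TAXDUMP_URL" &&
        mv "$dir/taxdump.tar.gz.part" "$dir/taxdump.tar.gz"
}

# Usage (not run here):
#   ensure_taxdump . && download-taxonomy.sh
```

Once the file is in place, download-taxonomy.sh can be run as usual and should no longer need wget.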

The script can also save the taxonomy under a different file name, if one is passed when the script is invoked. If the file extension contains .msgpack, the msgpack module is used to write the taxonomy; otherwise pickle is used.

msgpack offers faster reads and writes and a better compression ratio, but it requires an additional module (msgpack) that is not installed by default.
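
The rule for choosing the output format can be sketched as follows. This only mirrors the behaviour described above and is not the script's actual code; the function name taxonomy_format is hypothetical, and the usage lines assume that the file name is given as the first positional argument.

```shell
# Hypothetical helper: report which serializer would be used for a file name.
# Only the file name (not its directory) is inspected for ".msgpack".
taxonomy_format() {
    name="${1##*/}"
    case "$name" in
        *.msgpack*) echo "msgpack" ;;
        *)          echo "pickle" ;;
    esac
}

# Usage (not run here; argument position is an assumption):
#   python -c "import msgpack"            # check that the module is installed
#   download-taxonomy.sh taxonomy.msgpack # write the taxonomy with msgpack
```

With this rule, a name such as taxonomy.msgpack.gz would still select msgpack, because the extension only needs to contain .msgpack.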

Download Accession/TaxonID

There are 2 separate scripts to download these tables:

  • download-uniprot-taxa.sh will download a table for Uniprot databases

  • download-ncbi-taxa.sh will download a table for the BLAST DBs from NCBI: nt by default, or nr when invoked as download-ncbi-taxa.sh prot

In particular, nr refers to the protein database in NCBI, while nt refers to the nucleotide one. Both Uniprot Swissprot and TrEMBL are downloaded by the first script.
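
To summarise which invocation covers which database, the mapping described above can be written as a small lookup. The function name taxa_command and the database labels it accepts are hypothetical, chosen only for this sketch; the command lines themselves are the ones given in this section.

```shell
# Hypothetical lookup: print the command line that downloads the
# accession/taxon ID table for a given database.
taxa_command() {
    case "$1" in
        nt)      echo "download-ncbi-taxa.sh" ;;       # nucleotide (default)
        nr)      echo "download-ncbi-taxa.sh prot" ;;  # protein
        uniprot) echo "download-uniprot-taxa.sh" ;;    # Swissprot and TrEMBL
        *)       echo "unknown database: $1" >&2; return 1 ;;
    esac
}

# Usage (not run here):
#   $(taxa_command nr)    # runs: download-ncbi-taxa.sh prot
```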

Note

Since version 0.4.4, if a PROGBAR environment variable is set, the progress bar (the wget default) is used instead of the dot progress indicator; the bar is better suited to interactive use of the script.
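
One way such a switch could be implemented is sketched below. This is an assumption about the mechanism, not the scripts' actual code: the function name wget_progress_option is hypothetical, and the sketch treats any non-empty PROGBAR value as "set". The --progress=bar and --progress=dot options are standard wget options.

```shell
# Hypothetical helper: choose the wget progress style from PROGBAR.
wget_progress_option() {
    if [ -n "${PROGBAR:-}" ]; then
        echo "--progress=bar"   # interactive terminals
    else
        echo "--progress=dot"   # log-friendly output
    fi
}

# Usage (not run here):
#   PROGBAR=1 download-taxonomy.sh
```

Setting the variable only for a single invocation, as in the usage line, keeps the dot output for any non-interactive runs.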