Changes

0.5.7

This release includes some changes to the script to handle MAGs and alternative taxonomies (right now from PhyloPhlan). A few more scripts and commands were also introduced, for handling bigger datasets or streamlining some work. Additionally, I’ll try to update the Docker image (currently only available for version 0.5.6) so that an alternative to Conda is possible. The last time I tried to install pysam via pip I wasn’t able to, so a Docker image is a good compromise for performance, portability and ease of installation.

Several bugfixes were also made, so I want this version out first, while the next version (0.6.0) will be used to clean up the documentation and code and to introduce more tests. I also want to clean up and update the tutorials.

Added

Changed

Deprecated

  • snp_parser shouldn’t be used anymore; use pnps-gen vcf instead

0.5.6

  • mgkit.net.uniprot tests have problems with upstream API and are skipped for now

  • added option to taxon-utils lca to use the reference file to add base pair counts to the Krona output

  • several bugfixes

0.5.5

  • fasta-utils translate gained an option to translate the current frame of a sequence, assumes the sequences are ORFs

  • edit-gff table gained an option to skip comments; the user can indicate the string that starts a comment. For example, -c ‘#’ for comments starting with ‘#’
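As an illustration of what translating the current frame of an ORF means, here is a minimal sketch in plain Python. It is not the code used by fasta-utils translate: the function name translate_current_frame is hypothetical, the standard genetic code (NCBI table 1) is assumed, and trailing bases that do not complete a codon are simply dropped.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_current_frame(seq):
    """Translate a nucleotide sequence starting at its first base.

    Hypothetical helper: only the frame the sequence is already in is used.
    """
    seq = seq.upper().replace("U", "T")
    # Ignore trailing bases that do not form a complete codon
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # 'X' for codons with ambiguous bases
        for i in range(0, usable, 3)
    )


print(translate_current_frame("ATGGCCTAA"))  # prints MA*
```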

0.5.4

taxon-utils get

When using the -o option in taxon-utils get, the script will try an exact match, followed by a partial and finally a fuzzy search of the names passed. The alternative names will be reported but not used, unless the -p option is used.

The -c option will also output all the taxa that are children of the passed names.
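The order of the searches (exact, then partial, then fuzzy) can be sketched as follows. This is only an illustration with the standard library, not the implementation in taxon-utils: the function find_names, the case-insensitive comparison and the cutoff value are all assumptions.

```python
import difflib


def find_names(query, names, cutoff=0.8):
    """Return (match_type, matches), trying exact, partial, then fuzzy search.

    Hypothetical helper; names are compared in lower case (an assumption).
    """
    lowered = {name.lower(): name for name in names}
    query_l = query.lower()
    # 1) exact match
    if query_l in lowered:
        return "exact", [lowered[query_l]]
    # 2) partial match: the query is a substring of a name
    partial = sorted(name for key, name in lowered.items() if query_l in key)
    if partial:
        return "partial", partial
    # 3) fuzzy match, based on similarity ratio
    fuzzy = difflib.get_close_matches(query_l, list(lowered), n=3, cutoff=cutoff)
    return "fuzzy", [lowered[key] for key in fuzzy]


names = ["Prevotella", "Prevotella ruminicola", "Methanobrevibacter"]
print(find_names("prevotella", names))
print(find_names("rumini", names))
print(find_names("Methanobrevibactr", names))
```

In this sketch the later searches only run when the earlier ones find nothing, which mirrors the idea that alternative names are a fallback.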

0.5.3

0.5.2

  • Fixed a bug when using --rank None in pnps-gen

0.5.1

  • get_gene_taxon_dataframe: changed in version 0.5.1: gene_map can be None, use_uid can be passed to the underlying function

  • added option to include the lineage as a string in pnps-gen

  • added option to use the uids from the GFF instead of gene_id, this does not require the GFF file, they are embedded into the .pickle file

  • by default pnps-gen returns the taxon included in the GFF and not a ranked taxon

  • added option to make a different type of table in pnps-gen rank

0.5.0

Added

  • taxon-utils get command to query a taxonomy file

  • pnps-gen to generate a table of pN/pS values

0.4.4

Added

Changed

0.4.3

Fixes

mgkit.align.SamtoolsDepth in version 0.4.2 was using a weakref.WeakValueDictionary to speed up the recovery of memory from the internal dictionary. In the tests on MacOS the memory was mostly kept, but on Linux, when submitted as a job, it seems to be freed instantly. This also impacts the add-gff-info cov_samtools command, since it uses this class: it will run, but it reports that the number of sequences not found in the samtools depth file is the same as the number of sequences in the GFF file.
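The behaviour behind this bug can be reproduced with a few lines of standard library code. This is a minimal sketch, not the code of mgkit.align.SamtoolsDepth; the DepthArray class is a hypothetical stand-in for the per-sequence data.

```python
import gc
import weakref


class DepthArray:
    """Hypothetical stand-in for a per-sequence coverage array."""

    def __init__(self, seq_id):
        self.seq_id = seq_id


cache = weakref.WeakValueDictionary()

kept = DepthArray("contig1")  # a strong reference exists outside the cache
cache["contig1"] = kept
cache["contig2"] = DepthArray("contig2")  # no other reference to this object

gc.collect()  # make the collection explicit; CPython usually frees it at once

print("contig1" in cache)  # the entry survives while `kept` is alive
print("contig2" in cache)  # the entry disappears with its value
```

A value stored only in the WeakValueDictionary can vanish at any time, so a later lookup for that sequence fails even though the data was read.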

0.4.2

  • Fixed reading of Expasy files. The reading was not changed to adopt Python3 conventions like the rest of the routines. Included a test that downloads the Expasy file and parses it

  • Optimisations of add-gff-info cov_samtools and the mgkit.align routines used

Added

Changed

0.4.1

Sanity checks for several mistakes, including never having changed the programming language version in setup.py from 2.7. Tested installation under Python 3.6, with tox. Also removed the last bit of code using progressbar2.

0.4.0

This version was tested under Python 3.5, but the tests (with tox) were run also under Python 2.7. However, from the next release Python 2.7 will be removed gradually (as Python 2.7 won’t be supported/patched anymore from next year).

Changed

Requires pandas version >=0.24 because now a pandas.SparseArray is used for add-gff-info cov_samtools. Before, when reading the depth files from samtools the array for each sequence was kept in memory, while now only the ones in the GFF file are used.
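The idea of keeping only the sequences found in the GFF file can be sketched without pandas. The example below is an assumption-laden illustration, not the MGKit code: read_depth is a hypothetical function, a plain dictionary replaces the sparse array, and the input is assumed to be the three-column output of samtools depth for a single alignment file (sequence, 1-based position, depth).

```python
import io


def read_depth(handle, wanted_ids):
    """Collect per-position depth only for the sequence IDs in `wanted_ids`.

    Hypothetical helper; a dict of dicts stands in for a sparse array.
    """
    depths = {}
    for line in handle:
        seq_id, pos, depth = line.rstrip("\n").split("\t")
        if seq_id not in wanted_ids:
            continue  # sequences without annotations are never stored
        depths.setdefault(seq_id, {})[int(pos)] = int(depth)
    return depths


# contig2 has no annotation in the (imaginary) GFF, so it is skipped
data = io.StringIO("contig1\t1\t5\ncontig1\t2\t7\ncontig2\t1\t3\n")
print(read_depth(data, {"contig1"}))
```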

mgkit.align:

mgkit.io.gff

mgkit.mappings.eggnog:

mgkit.snps.funcs:

Deprecated

Tests

Removed the last portions that used nosetests and better integrated pytest with setup.py. Now uses AppVeyor for testing the build and running tests under Python 3.

In cases where the testing environment has no or limited internet connection, tests that require an internet connection can be skipped by setting the following environment variable before running the tests:

$ export MGKIT_TESTS_CONN_SKIP=T
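A test can honour such a variable along these lines. This sketch uses unittest from the standard library to stay self-contained, while the MGKit tests use pytest; the test class is hypothetical, and treating any non-empty value as "skip" is an assumption.

```python
import os
import unittest

# Assumption: any non-empty value of the variable means "skip"
SKIP_CONN = bool(os.environ.get("MGKIT_TESTS_CONN_SKIP"))


class TestDownload(unittest.TestCase):
    """Hypothetical test case for code that needs an internet connection."""

    @unittest.skipIf(SKIP_CONN, "MGKIT_TESTS_CONN_SKIP is set")
    def test_remote_resource(self):
        # a real test would contact a remote service here
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```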

0.3.4

General cleanup and testing release. Major changes:

  • general moving to Python2 (2.7) and Python3 (3.5+) support, using the future package and when convenient checks for the version of python installed

  • setup includes now all the optional dependencies, since this makes it easier to deal with conda environments

  • move to pytest from nose, since it allows some functionality that interests me, along with the reorganisation of the test modules and skips of tests that cannot be executed (like mongodb)

  • move from urllib to using requests, which also helps with python3 support

  • more careful with some dependencies, like the lzma module and msgpack

  • addition of more tests, to help the porting to python3, along with a tox configuration

  • matplotlib.pyplot is still in the mgkit.plots.unused, but it is no longer imported when the parent package is. It is still needed in the mgkit.plots.utils functions, so the import has been moved inside the function. This should help with virtual environments and test suites

  • renamed mgkit.taxon.UniprotTaxonomy to mgkit.taxon.Taxonomy, since it’s really NCBI taxonomy and it’s preferred to download the data from there. Same for mgkit.taxon.UniprotTaxonTuple to mgkit.taxon.TaxonTuple; an alias for the old name is kept, but it will be removed in a later version

  • download_data was removed. Taxonomy should be downloaded using download-taxonomy.sh, and the mgkit.mappings is in need of refactoring to remove old and now unused functionality

  • added mgkit.taxon.Taxonomy.get_ranked_id()

  • using a sphinx plugin to render the jupyter notebooks instead of old solution

  • reran most of the tutorial and updated commands for the newest available software (samtools/bcftools) and preferred the SNP calling from bcftools

Scripts

This is a summary of notable changes. It is advised to check the changes in the command line interface for several scripts

  • changed several scripts command line interface, to adapt to the use of click

  • taxon-utils lca has only one option to specify the output format, also adding the option to output a format that can be used by add-gff-info addtaxa

  • taxon-utils filter supports the filtering of table files, when they are in a 2-column format, such as those that are downloaded by download-ncbi-taxa.sh

  • removed the eggnog and taxonomy commands from add-gff-info, the former since it’s not that useful, the latter because it’s possible to achieve the same results using taxon-utils with the new output option

  • removed the rand command of fastq-utils since it was only for testing and the FastQ parser is the one from mgkit.io.fastq

  • substantial changes were made to the commands, values and sequence of the filter-gff script

  • sampling-utils rand_seq now can save the model used and reload it

  • removed download_data and download_profiles, since they are not going to be used in the next tutorial and it is preferred now to use BLAST and then find the LCA with taxon-utils

Python3

At the time of writing all tests pass on Python 3.5, but more tests are needed, along with some new ones for the blast parser and the scripts. Some important changes:

  • mgkit.io.gff.Annotation uses its uid to hash the instance. This allows the use in sets (mainly for filtering). However, hashing is not supported in mgkit.io.gff.GenomicRange.

  • mgkit.io.utils.open_file() now always opens and writes files in binary mode. This is one of the suggestions to keep compatibility between 2.x and 3.x. So if used directly, the output must be decoded from ascii, which is the format used in text files (at least in bioinformatics). However, this is not required for the parsers, like mgkit.io.gff.parse_gff(), mgkit.io.fasta.load_fasta() along with others (and the corresponding write_ functions): they return unicode strings when parsing and encode into ascii when writing.
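What handling a binary-mode file means in practice can be shown with an in-memory buffer. This is a minimal sketch that does not call mgkit.io.utils.open_file(); io.BytesIO is used as a stand-in for a file handle opened in binary mode.

```python
import io

# Stand-in for a file handle opened in binary mode
handle = io.BytesIO(b">seq1\nACTG\n")

# Reading: every line is a bytes object and must be decoded to get text
lines = [raw.decode("ascii").rstrip("\n") for raw in handle]
print(lines)

# Writing: text must be encoded back to bytes before it is written
out = io.BytesIO()
for line in lines:
    out.write((line + "\n").encode("ascii"))
```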

In general new projects will be worked on using Python 3.5 and the next releases will prioritise that (0.4.0 and later). If bugfixes are needed and Python 3 cannot be used, this version branch (0.3.x) will be used to fix bugs for users.
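The uid-based hashing of annotations mentioned in the list above is what makes them usable in sets. A simplified sketch follows; the class is a hypothetical reduction of an annotation, and comparing instances by uid in __eq__ is an assumption, not a statement about mgkit.io.gff.Annotation.

```python
import uuid


class Annotation:
    """Hypothetical, simplified annotation identified by its uid."""

    def __init__(self, seq_id, start, end, uid=None):
        self.seq_id = seq_id
        self.start = start
        self.end = end
        self.uid = uid or str(uuid.uuid4())

    def __hash__(self):
        # The uid alone decides the hash, so edits to other fields are safe
        return hash(self.uid)

    def __eq__(self, other):
        # Assumption: two annotations are equal when their uids are equal
        return isinstance(other, Annotation) and self.uid == other.uid


first = Annotation("contig1", 1, 90, uid="a1")
copy = Annotation("contig1", 1, 90, uid="a1")
other = Annotation("contig1", 100, 190, uid="b2")

# Duplicates collapse in a set, which is handy when filtering
print(len({first, copy, other}))  # prints 2
```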

0.3.3

Added

Changed

0.3.2

Removed deprecated code

0.3.1

This release adds several scripts and commands. Successive 0.3.x releases will be used to fix bugs and refine the APIs and CLI. Most importantly, since the publishing of the first paper using the framework, the releases will go toward the removal of as much deprecated code as possible. At the same time, a general review of the code to be able to run on Python3 (probably via the six package) will start. The general idea is to reach a full removal of legacy code in 0.4.0, while full Python3 compatibility is the aim of 0.5.0, which also means dropping dependencies that are not compatible with Python3.

Added

Changed

  • changed domain to superkingdom as for NCBI taxonomy in mgkit.taxon.UniprotTaxonomy.read_from_gtdb_taxonomy()

  • updated scripts documentation to include installed but non advertised scripts (like translate_seq)

  • mgkit.kegg.KeggReaction was reworked to only store the equation information

  • some commands in fastq-utils - Fastq Utilities did not support standard in/out, also added the script usage to the script details

  • translate_seq now supports standard in/out

  • added haplotypes parameter to mgkit.snps.funcs.combine_sample_snps()

  • an annotation from mgkit.db.mongo.GFFDB now doesn’t include the lineage, because it conflicts with the string used in a GFF file

  • mgkit.io.gff.Annotation.coverage() now returns a float instead of an int

  • moved code from package mgkit.io to mgkit.io.utils

  • changed behaviour of mgkit.utils.common.union_range()

  • removed mgkit.utils.common.range_substract_()

  • added progressbar2 as installation requirement

  • changed how mgkit.taxon.UniprotTaxonomy.find_by_name()

Fixed

Besides smaller fixes:

Deprecated

0.3.0

A lot of bugs were fixed in this release, especially for reading NCBI taxonomy and using the msgpack format to save a UniprotTaxonomy instance. Also added a tutorial for profiling a microbial community using MGKit and BLAST (Profile a Community with BLAST)

Added

Changed

0.2.5

Changed

Added

0.2.4

Changed

  • mgkit.utils.sequence.get_contigs_info() now accepts a dictionary name->seq or a list of sequences, besides a file name (r536)

  • add-gff-info counts command now removes trailing commas from the samples list

  • the axes are turned off after the dendrogram is plotted

Fixed

  • the snp_parser script requirements were set wrong in setup.py (r540)

  • uncommented lines to download sample data to build documentation (r533)

  • add-gff-info uniprot command now writes the lineage attribute correctly (r538)

0.2.3

The installation dependencies are more flexible, with only numpy being required. To install all the needed packages, you can use:

$ pip install mgkit[full]

Added

  • new option to pass the query sequences to blast2gff, which allows the correct frame of the annotation to be added to the GFF

  • added the attributes evalue, subject_start and subject_end to the output of blast2gff. The subject start and end positions make it possible to understand on which frame of the subject sequence the match was found

  • added the options to annotate the heatmap with the numbers. Also updated the relative example notebook

  • Added the option to read the taxonomy from NCBI dump files, using mgkit.taxon.UniprotTaxonomy.read_from_ncbi_dump(). This makes it faster to get the taxonomy file

  • added argument to return information from mgkit.net.embl.datawarehouse_search(), in the form of tab separated data. The argument fields can be used when display is set to report. An example on how to use it is in the function documentation

  • added a bash script download-taxonomy.sh that downloads the taxonomy

  • added script venv-docs.sh to build the documentation in HTML under a virtual environment. matplotlib on MacOS X raises a RuntimeError because of a bug in virtualenv; the documentation can first be built with this script, after the script create-apidoc.sh is used to create the API documentation. The rest of the documentation (e.g. the PDF) can be created with make as usual, afterwards

  • added mgkit.net.pfam, with only one function at the moment, that returns the descriptions of the families.

  • added pfam command to add-gff-info, using the mentioned function, it adds the description of the Pfam families in the GFF file

  • added a new exception, used internally when an additional dependency is needed

Changed

  • using the NCBI taxonomy dump has two side effects:

    • the scientific/common names are kept as is, not lower cased as was before

    • a merged file is provided for taxon_id values that changed. While the old taxon_id is kept in the taxonomy, it points to the new taxon, to keep backward compatibility

  • renamed the add-gff-info gitaxa command to addtaxa. It now accepts more data sources (dictionaries) and is more general

  • changed mgkit.net.embl.datawarehouse_search() to automatically set the limit at 100,000 records

  • the taxonomy can now be saved using msgpack, making it faster to read/write it. It’s also more compact and has a better compression ratio

  • the mgkit.plots.heatmap.grouped_spine() now accepts the rotation of the labels as an option

  • added option to use another attribute for the gene_id in the get-gff-info script gtf command

  • added a function to compare the version of MGKit used, throwing a warning, when it’s different (mgkit.check_version())

  • removed test for old SNPs structures and added the same tests for the new one

  • mgkit.snps.classes.GeneSNP now caches the number of synonymous and non-synonymous SNPs for better speed

  • mgkit.io.gff.GenomicRange.__contains__() now also accepts a tuple (start, end) or another GenomicRange instance
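The last item, a containment test that accepts different types, can be sketched as follows. The class below is a hypothetical, stripped-down range and not mgkit.io.gff.GenomicRange: it assumes inclusive coordinates, ignores sequence ID and strand, and treats "contained" as "fully inside".

```python
class GenomicRange:
    """Hypothetical, simplified range with inclusive start and end."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __contains__(self, other):
        # A single position
        if isinstance(other, int):
            return self.start <= other <= self.end
        # Another range, or a (start, end) tuple
        if isinstance(other, GenomicRange):
            start, end = other.start, other.end
        else:
            start, end = other
        # Assumption: the other interval must be fully inside this one
        return self.start <= start and end <= self.end


gene = GenomicRange(100, 200)
print(150 in gene)
print((120, 180) in gene)
print(GenomicRange(190, 250) in gene)
```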

Fixed

0.2.2

Added

Changed

Removed

  • deprecated code from the snps package

0.2.1

Added

Changed

Deprecated

  • mgkit.filter.taxon.filter_taxonomy_by_lineage()

  • mgkit.filter.taxon.filter_taxonomy_by_rank()

Removed

  • removed old filter_gff script

0.2.0

  • added creation of wheel distribution

  • changes to ensure compatibility with later pandas versions

  • mgkit.io.gff.Annotation.get_ec() now returns a set, reflected changes in tests

  • added a --cite option to scripts

  • fixes to tutorial

  • updated documentation for sphinx 1.3

  • changes to diagrams

  • added decoration to raise warnings for deprecated functions

  • added possibility for mgkit.counts.func.load_sample_counts() info_dict to be a function instead of a dictionary

  • consolidation of some eggNOG structures

  • added more spine options in mgkit.plots.heatmap.grouped_spine()

  • added a length property to mgkit.io.gff.Annotation

  • changed filter-gff script to customise the filtering function, from the default one, also updated the relative documentation

  • fixed a few plot functions
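A decorator that raises warnings for deprecated functions, as mentioned in the list above, can be written with the standard library alone. This is a generic sketch; the names deprecated and old_function are hypothetical and do not refer to the decorator shipped with MGKit.

```python
import functools
import warnings


def deprecated(func):
    """Wrap `func` so that every call emits a DeprecationWarning."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            "{} is deprecated".format(func.__name__),
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not the wrapper
        )
        return func(*args, **kwargs)

    return wrapper


@deprecated
def old_function(value):
    return value * 2


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered
    result = old_function(2)

print(result, caught[0].category.__name__)  # prints 4 DeprecationWarning
```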

0.1.16

0.1.15

0.1.14

0.1.13

0.1.12

0.1.11

  • removed rst2pdf for generating a PDF for documentation. Latex is preferred

  • corrections to documentation and example script

  • removed need for joblib library in translate_seq script: used only if available (for using multiple processors)

  • deprecated mgkit.snps.funcs.combine_snps_in_dataframe() and mgkit.snps.funcs.combine_snps_in_dataframe(): mgkit.snps.funcs.combine_sample_snps() should be used

  • refactored some tests and added more

  • added docs_req.txt to help build the documentation on readthedocs.org

  • renamed mgkit.snps.classes.GeneSyn gid and taxon attributes to gene_id and taxon_id. The old names are still available for use (via properties), but they will be taken out in later versions. Old pickle data should be loaded and saved again in this release

  • added a few convenience functions to ease the use of combine_sample_snps()

  • added function mgkit.snps.funcs.significance_test() to test the distributions of genes shared between two taxa.

  • fixed an issue with deinterleaving sequence data from khmer

  • added mgkit.snps.funcs.flat_sample_snps()

  • Added method to mgkit.kegg.KeggClientRest to get names for all ids of a certain type (more generic than the various get_*_names)

  • added first implementation of mgkit.kegg.KeggModule class to parse a Kegg module entry

  • mgkit.snps.conv_func.get_rank_dataframe(), mgkit.snps.conv_func.get_gene_map_dataframe()
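Keeping old attribute names available via properties, as described for GeneSyn in the list above, follows a common pattern. The class below is a hypothetical reduction with only the two renamed attributes, not the real mgkit.snps.classes.GeneSyn.

```python
class GeneSyn:
    """Hypothetical, reduced class showing old names kept as properties."""

    def __init__(self, gene_id, taxon_id):
        self.gene_id = gene_id
        self.taxon_id = taxon_id

    @property
    def gid(self):
        # Old name, forwards to the new attribute
        return self.gene_id

    @gid.setter
    def gid(self, value):
        self.gene_id = value

    @property
    def taxon(self):
        # Old name, forwards to the new attribute
        return self.taxon_id

    @taxon.setter
    def taxon(self, value):
        self.taxon_id = value


gene = GeneSyn(gene_id="K00001", taxon_id=839)
gene.gid = "K00002"  # code using the old name still works
print(gene.gene_id, gene.taxon)  # prints K00002 839
```

Since the properties store nothing themselves, instances saved again after the rename only carry the new attribute names.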