Changes¶

0.5.7¶

This release includes some changes to the script to handle MAGs and alternative taxonomies (right now from PhyloPhlan). A few more scripts and commands were also introduced, for handling bigger datasets or streamlining some work. Additionally, I’ll try to update a Docker image (currently only for version 0.5.6) so an alternative to Conda is possible. The last time I tried to install pysam via pip I wasn’t able to, so a Docker image is a good compromise for performance, portability and ease of installation.

Several bugfixes were also made, so I want this version out first, while the next version (0.6.0) will be used to clean up the documentation and code and to introduce more tests. I also want to clean up and update the tutorials.

Added¶

mgkit.io.gff.Annotation.get_fc() to access the functional categories
mgkit.taxon.Taxonomy.get_by_lineage() to interrogate by full lineage
mgkit.taxon.Taxonomy.max_id() and mgkit.taxon.Taxonomy.min_id() to find the last added taxon IDs
mgkit.taxon.Taxonomy.parse_phylophlan_lineage() parses a line from the taxonomy file
mgkit.taxon.Taxonomy.read_from_phylophlan_taxonomy() reads the PhyloPhlan taxonomy file
added bins command to add-gff-info
added script count-utils for manipulation of count data
added a rename command to edit-gff to rename attributes
added options in edit-gff table for using prodigal genes, strip Kegg tags and using a default value
added filter and info commands to fasta-utils
added vcf command to pnps-gen to parse the VCF file for pN/pS generation. snp_parser is deprecated now.
added more options to taxon-utils get to extract information from the taxonomy
added dict-utils to manipulate dictionary files

Changed¶

mgkit.io.gff.parse_gff() now accepts a filter_func to filter annotations
mgkit.taxon.Taxonomy.add_taxon() now adds the lineage property of a taxon, and when a taxon is new, the ID will be negative
mgkit.taxon.get_lineage() now accepts a rank and use_cname parameter
blast2gff blastdb will add the query coverage if the query fasta file is passed

Deprecated¶

snp_parser shouldn’t be used anymore, instead use pnps-gen vcf

0.5.6¶

mgkit.net.uniprot tests have problems with upstream API and are skipped for now
added option to taxon-utils lca to use the reference file to add base pair counts to the Krona output
several bugfixes

0.5.5¶

fasta-utils translate gained an option to translate the current frame of a sequence, assumes the sequences are ORFs
edit-gff table gained an option to skip comments; the user can indicate the string that starts a comment. For example, -c '#' for comments starting with '#'

Changed¶

mgkit.utils.dictionary.text_to_dict()

0.5.4¶

added mgkit.taxon.Taxonomy.iter_ids() to iterate over Taxonomy, yielding Taxon IDs
added options -p and -c to taxon-utils get

taxon-utils get¶

When using the -o option in taxon-utils get, the script will try an exact match, followed by a partial and finally a fuzzy search of the names passed. The alternative names will be reported but not used, unless the -p option is used.

The -c option will also output all the taxa that are children of the passed names.
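
The exact, partial and fuzzy search order described above can be illustrated with a minimal sketch. This is not the code used by taxon-utils: the function name find_taxon_names, the case-insensitive comparison and the use of difflib for the fuzzy step are all assumptions made for illustration, and the handling of alternative names (-p) is not modelled.

```python
import difflib


def find_taxon_names(query, names):
    """Hypothetical helper: try an exact, then a partial, then a fuzzy match."""
    query_l = query.lower()
    # map lower-cased names back to their original spelling
    lowered = {name.lower(): name for name in names}

    # 1. exact match (case-insensitive in this sketch)
    if query_l in lowered:
        return "exact", [lowered[query_l]]

    # 2. partial match: the query is a substring of a name
    partial = [name for key, name in lowered.items() if query_l in key]
    if partial:
        return "partial", sorted(partial)

    # 3. fuzzy match: closest names first, by similarity ratio
    fuzzy = difflib.get_close_matches(query_l, list(lowered), n=3, cutoff=0.6)
    return "fuzzy", [lowered[key] for key in fuzzy]


names = ["Prevotella ruminicola", "Prevotella bryantii", "Methanobrevibacter smithii"]
print(find_taxon_names("prevotella bryantii", names))
print(find_taxon_names("Prevotella", names))
print(find_taxon_names("Prevotela bryanti", names))
```

Each step is only attempted when the previous one finds nothing, so a misspelled name falls through to the fuzzy search while a correct one never reaches it.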

0.5.3¶

Added parameters to support the partial pN/pS calculations to mgkit.snps.funcs.combine_sample_snps()
Added options to the pnps-gen script to output only pS or pN

0.5.2¶

Fixed a bug when using --rank None in pnps-gen

0.5.1¶

get_gene_taxon_dataframe: gene_map can be None, and use_uid can be passed to the underlying function
added option to include the lineage as a string in pnps-gen
added option to use the uids from the GFF instead of gene_id; this does not require the GFF file, since the uids are embedded into the .pickle file
by default pnps-gen returns the taxon included in the GFF and not a ranked taxon
added option to make a different type of table in pnps-gen rank

0.5.0¶

Added¶

taxon-utils get command to query a taxonomy file
pnps-gen to generate a table of pN/pS values

0.4.4¶

Added¶

mgkit.utils.dictionary.dict_to_text() and mgkit.utils.dictionary.text_to_dict() to read/write simple dictionary files (tables)
filter-gff overlap command, added option to not use the strand information in filtering the overlaps and also to make multiple iterations (max 10) to better remove overlaps
mgkit.io.gff.Annotation.has_attr() and mgkit.io.gff.Annotation.del_attr()
a new script, edit-gff to view a GFF as table and perform general edits on it

Changed¶

download-ncbi-taxa.sh and download-uniprot-taxa.sh (Download Taxonomy): if a PROGBAR environment variable is set, the progress bar (default in wget) is used
changed mgkit.io.gff.Annotation.set_attr() to allow changing standard attributes
added some checks for unexpected lengths in add-gff-info exp_syn, check the log for these cases
mgkit.utils.sequence.get_seq_expected_syn_count() silently skips codons that contain N or are not of length 3

0.4.3¶

Fixes¶

mgkit.align.SamtoolsDepth in version 0.4.2 was using a weakref.WeakValueDictionary to speed up the recovery of memory from the internal dictionary. In the tests on MacOS the memory was mostly kept, but on Linux, when submitted as a job, it seems to be freed instantly. This also impacts the add-gff-info cov_samtools command, since it uses this class: it will run, but reports that the number of sequences not found in the samtools depth file is the same as the number of sequences in the GFF file.
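
As a rough illustration of how a weakref.WeakValueDictionary can lose entries, the sketch below uses only the standard library. The DepthArray class and the sequence names are hypothetical stand-ins and this is not the SamtoolsDepth code; it only shows that a value with no other strong reference may disappear from the dictionary.

```python
import gc
import weakref


class DepthArray:
    """Hypothetical stand-in for a per-sequence coverage container."""

    def __init__(self, values):
        self.values = values


cache = weakref.WeakValueDictionary()

# A value that is also referenced elsewhere stays in the dictionary
kept = DepthArray([1, 2, 3])
cache["contig1"] = kept

# A value referenced only by the dictionary can be collected at any time
cache["contig2"] = DepthArray([4, 5, 6])
gc.collect()

print("contig1" in cache)
print("contig2" in cache)
```

On CPython the second lookup fails because the object is freed as soon as its last strong reference goes away, which is the kind of behaviour that makes every sequence look as if it were missing.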

0.4.2¶

Fixed reading of Expasy files. The reading was not changed to adopt Python3 conventions like the rest of the routines. Included a test that downloads the Expasy file and parses it
Optimisations of add-gff-info cov_samtools and the mgkit.align routines used

Added¶

mgkit.mappings.enzyme.parse_expasy_dat()
mgkit.align.SamtoolsDepth.advance_file()
option -m to calculate average coverage in add-gff-info cov_samtools

Changed¶

fix for detection of compressed files mgkit.io.gff.parse_gff()
Fix for compressed files already opened in mgkit.io.utils.open_file()
mgkit.align.SamtoolsDepth: several optimisations and changes to support a scanning approach, instead of lookup table. No exception is raised when a sequence is not found in the file, instead assuming that the coverage is 0
mgkit.align.read_samtools_depth() was changed, and now it returns lists instead of numpy arrays - this increases the speed of reading to about 3-4x in some tests
mgkit.align.read_samtools_depth() also assumes that lines read have a '\n' at the end and avoids using strip; this should be a safe assumption under Python3
mgkit.align.SamtoolsDepth now uses a weakref.WeakValueDictionary for SamtoolsDepth.data to improve release of memory

0.4.1¶

Sanity checks for several mistakes, including never having changed the programming language version in setup.py from 2.7. Tested installation under Python 3.6, with tox. Also removed the last bit of code using progressbar2.

0.4.0¶

This version was tested under Python 3.5, but the tests (with tox) were run also under Python 2.7. However, from the next release Python 2.7 will be removed gradually (as Python 2.7 won’t be supported/patched anymore from next year).

Added¶

Added --progress option to several scripts

mgkit.counts.glm:

mgkit.graphs

mgkit.taxon:

mgkit.taxon.Taxonomy.is_ranked_below()

Changed¶

Requires pandas version >=0.24 because now a pandas.SparseArray is used for add-gff-info cov_samtools. Before, when reading the depth files from samtools the array for each sequence was kept in memory, while now only the ones in the GFF file are used.

mgkit.align:

mgkit.align.SamtoolsDepth: uses pandas.SparseArray now. It should use less memory, but needs pandas version >= 0.24
mgkit.align.read_samtools_depth(): now returns 3 arrays, instead of 2. Also added seq_ids to skip lines

mgkit.io.gff

mgkit.io.gff.from_gff: added encoding parameter
mgkit.io.gff.parse_gff: In some cases ASCII decoding is not enough, so it is parametrised now
mgkit.io.gff.split_gff_file: added encoding parameter

mgkit.mappings.eggnog:

mgkit.mappings.eggnog.NOGInfo: made file reading compatible with Python 3

mgkit.snps.funcs:

mgkit.snps.funcs.combine_sample_snps(): added store_uids

Deprecated¶

mgkit.io.blast.add_blast_result_to_annotation()
mgkit.taxon.Taxonomy.read_taxonomy(): use Taxonomy.read_from_ncbi_dump()
mgkit.taxon.Taxonomy.parse_uniprot_taxon()

Tests¶

Removed the last portions that used nosetests and better integrated pytest with setup.py. Now uses AppVeyor for testing the build and running tests under Python 3.

In cases where the testing environment has no or limited internet connection, tests that require an internet connection can be skipped by setting the following environment variable before running the tests:

$ export MGKIT_TESTS_CONN_SKIP=T

0.3.4¶

General cleanup and testing release. Major changes:

general moving to Python2 (2.7) and Python3 (3.5+) support, using the future package and when convenient checks for the version of python installed
setup includes now all the optional dependencies, since this makes it easier to deal with conda environments
move to pytest from nose, since it allows some functionality that interests me, along with the reorganisation of the test modules and skips of tests that cannot be executed (like mongodb)
move from urllib to using requests, which also helps with python3 support
more careful with some dependencies, like the lzma module and msgpack
addition of more tests, to help the porting to python3, along with a tox configuration
matplotlib.pyplot is still in the mgkit.plots.unused, but it is not imported when the parent package is, now. It is still needed in the mgkit.plots.utils functions, so the import has been moved inside the function. This should help with virtual environments and test suites
renamed mgkit.taxon.UniprotTaxonomy to mgkit.taxon.Taxonomy, since it’s really NCBI taxonomy and it’s preferred to download the data from there. Same for mgkit.taxon.UniprotTaxonTuple to mgkit.taxon.TaxonTuple, with an alias for old name there, but will be removed in a later version
download_data was removed. Taxonomy should be downloaded using download-taxonomy.sh, and the mgkit.mappings is in need of refactoring to remove old and now unused functionality
added mgkit.taxon.Taxonomy.get_ranked_id()
using a sphinx plugin to render the jupyter notebooks instead of old solution
reran most of the tutorial and updated commands for the newest available software (samtools/bcftools) and preferred the SNP calling from bcftools

Scripts¶

This is a summary of notable changes. It is advised to check the changes in the command line interface for several scripts

changed the command line interface of several scripts, to adapt to the use of click
taxon-utils lca has only one option to specify the output format, also adding the option to output a format that can be used by add-gff-info addtaxa
taxon-utils filter supports the filtering of table files, when they are in a 2-columns format, such as those that are downloaded by download-ncbi-taxa.sh
removed the eggnog and taxonomy commands from add-gff-info, the former since it’s not that useful, the latter because it’s possible to achieve the same results using taxon-utils with the new output option
removed the rand command of fastq-utils since it was only for testing and the FastQ parser is the one from mgkit.io.fastq
substantial changes were made to commands values and sequence of the filter-gff script
sampling-utils rand_seq now can save the model used and reload it
removed download_data and download_profiles, since they are not going to be used in the next tutorial and it is preferred now to use BLAST and then find the LCA with taxon-utils

Python3¶

At the time of writing all tests pass on Python 3.5, but more tests are needed, along with some new ones for the blast parser and the scripts. Some important changes:

mgkit.io.gff.Annotation uses its uid to hash the instance. This allows the use in sets (mainly for filtering). However, hashing is not supported in mgkit.io.gff.GenomicRange.
mgkit.io.utils.open_file() now always opens and writes files in binary mode. This is one of the suggestions to keep compatibility between 2.x and 3.x. So if used directly the output must be decoded from ascii, which is the format used in text files (at least in bioinformatics). However, this is not required for the parsers, like mgkit.io.gff.parse_gff(), mgkit.io.fasta.load_fasta() along with others (and the corresponding write_ functions): they return unicode strings when parsing and encode into ascii when writing.
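
A minimal sketch of what reading and writing in binary mode implies is shown below. It uses io.BytesIO as a stand-in for a handle opened in binary mode, so it makes no assumption about the signature of mgkit.io.utils.open_file(); the sequence data is made up.

```python
import io

# Stand-in for a file handle opened in binary mode (hypothetical data)
handle = io.BytesIO(b">seq1\nACTG\n")

lines = []
for raw_line in handle:
    # each line is a bytes object and must be decoded before use as text
    lines.append(raw_line.decode("ascii").rstrip("\n"))

print(lines)

# When writing, text must be encoded back into bytes
out = io.BytesIO()
out.write(">seq2\nGGCC\n".encode("ascii"))
```

The parsers mentioned above take care of this conversion, so the explicit decode and encode steps are only needed when a binary handle is used directly.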

In general new projects will be worked on using Python 3.5 and the next releases will prioritise that (0.4.0 and later). If bugfixes are needed and Python 3 cannot be used, this version branch (0.3.x) will be used to fix bugs for users.

0.3.3¶

Added¶

module mgkit.counts.glm, with functions used to help the fit of Generalised Linear Models (GLM)
mgkit.io.fastq.load_fastq_rename()
added sync, sample_stream and rand_seq commands to sampling-utils script
mgkit.utils.sequence.extrapolate_model()
mgkit.utils.sequence.qualities_model_constant()
mgkit.utils.sequence.qualities_model_decrease()
mgkit.utils.sequence.random_qualities()
mgkit.utils.sequence.random_sequences()
mgkit.utils.sequence.random_sequences_codon()
mgkit.taxon.UniprotTaxonomy.get_lineage_string()
mgkit.taxon.UniprotTaxonomy.parse_gtdb_lineage()
mgkit.net.uniprot.get_gene_info_iter()

Changed¶

mgkit.io.fastq.write_fastq_sequence()
added seq_id as a special attribute to mgkit.io.gff.Annotation.get_attr()
mgkit.io.gff.from_prodigal_frag() is tested and fixed
added cache in mgkit.utils.dictionary.HDFDict
mgkit.utils.sequence.sequence_gc_content() now returns 0.5 when denominator is 0
add-gff-info addtaxa -a now accepts seq_id as lookup, to use output from taxon-utils lca (after cutting output)
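
The change to mgkit.utils.sequence.sequence_gc_content() listed above can be pictured with the sketch below. The function gc_content is hypothetical, and the formula (G + C over A + T + G + C) is an assumption about how the content is computed; only the 0.5 fallback for a zero denominator comes from the change itself.

```python
def gc_content(sequence):
    """Hypothetical GC content: (G + C) / (A + T + G + C), 0.5 if undefined."""
    seq = sequence.upper()
    gc_count = seq.count("G") + seq.count("C")
    at_count = seq.count("A") + seq.count("T")
    total = gc_count + at_count
    if total == 0:
        # no informative bases (e.g. empty or all N): avoid a division by zero
        return 0.5
    return gc_count / total


print(gc_content("ATGC"))
print(gc_content("NNNN"))
```

Returning a neutral value keeps downstream calculations running on sequences with no informative bases, at the cost of hiding that the value was undefined.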

Deprecated¶

mgkit.io.fastq.convert_seqid_to_old()

0.3.2¶

Removed deprecated code

0.3.1¶

This release adds several scripts and commands. Successive 0.3.x releases will be used to fix bugs and refine the APIs and CLI. Most importantly, since the publishing of the first paper using the framework, the releases will go toward the removal of as much deprecated code as possible. At the same time, a general review of the code to be able to run on Python3 (probably via the six package) will start. The general idea is to reach a full removal of legacy code in 0.4.0, while full Python3 compatibility is the aim of 0.5.0, which also means dropping dependencies that are not compatible with Python3.

Added¶

mgkit.graphs.from_kgml() to make a graph from a KGML file (allows for directionality)
mgkit.graphs.add_module_compounds(): updates a graph with compounds information as needed
mgkit.kegg.parse_reaction(): parses a reaction equation from Kegg
added --no-frame option to hmmer2gff - Convert HMMER output to GFF, to use non translated protein sequences. Also changed the mgkit.io.gff.from_hmmer() function to enable this behaviour
added options --num-gt and --num-lt to the values command of filter-gff - Filter GFF annotations to filter based on > and < inequality, in addition to >= and <=
added uid as command in fasta-utils - Fasta Utilities to make unique fasta headers
methods to make mgkit.db.mongo.GFFDB behave like a dictionary (an annotation uid can be used as a key to retrieve it, instead of a query), this includes the possibility to iterate over it, but what is yielded are the values, not the keys (i.e. mgkit.io.gff.Annotation instances, not uid)
added mgkit.counts.func.from_gff() to load count data stored inside a GFF, as is the case when the counts command of add-gff-info - Add informations to GFF annotations is used
added mgkit.kegg.KeggClientRest.conv() and mgkit.kegg.KeggClientRest.find() operations to mgkit.kegg.KeggClientRest
mgkit.kegg.KeggClientRest now caches calls to several methods. The cache can be written to disk using mgkit.kegg.KeggClientRest.write_cache() or emptied via mgkit.kegg.KeggClientRest.empty_cache()
added mgkit.utils.dictionary.merge_dictionaries() to merge multiple dictionaries where the keys contain different values
added a Docker file to make a preconfigured mgkit/jupyter build
added C functions (using Cython) for tetramer/kmer counting. The C functions are the default, with the pure python implementation having a _ appended to their names. This is because the Cython functions cannot have docstrings
added mgkit.io.gff.annotation_coverage_sorted()
added mgkit.io.gff.Annotation.to_dict()
added mgkit.plots.utils.legend_patches() to create matplotlib patches, to be in legends
added scripts to download IDs to taxa tables from NCBI/Uniprot
added mgkit.io.utils.group_tuples_by_key()
added cov command to get-gff-info - Extract informations to GFF annotations and filter-gff - Filter GFF annotations
added mgkit.io.fasta.load_fasta_prodigal(), to load the fasta file from prodigal for called genes (tested on aminoacids)
added option to output a JSON file to the lca command in taxon-utils - Taxonomy Utilities and cov command in get-gff-info - Extract informations to GFF annotations
added a bash script, sort-gff.sh to help sort a GFF
added mgkit.taxon.UniprotTaxonomy.get_lineage() which simplifies the use of mgkit.taxon.get_lineage()
added mgkit.io.fastq.load_fastq() as a simple parser for fastq files
added a new script, sampling-utils - Resampling Utilities
added mgkit.utils.common.union_ranges() and mgkit.utils.common.complement_ranges()
added to_hdf command to taxon-utils - Taxonomy Utilities to create a HDF5 file to lookup taxa tables from NCBI/Uniprot
added --hdf-table option to addtaxa command in add-gff-info - Add informations to GFF annotations
mgkit.taxon.UniprotTaxonomy.add_taxon(), mgkit.taxon.UniprotTaxonomy.add_lineage() and mgkit.taxon.UniprotTaxonomy.drop_taxon()

Changed¶

changed domain to superkingdom as for NCBI taxonomy in mgkit.taxon.UniprotTaxonomy.read_from_gtdb_taxonomy()
updated scripts documentation to include installed but non advertised scripts (like translate_seq)
mgkit.kegg.KeggReaction was reworked to only store the equation information
some commands in fastq-utils - Fastq Utilities did not support standard in/out, also added the script usage to the script details
translate_seq now supports standard in/out
added haplotypes parameter to mgkit.snps.funcs.combine_sample_snps()
an annotation from mgkit.db.mongo.GFFDB now doesn’t include the lineage, because it conflicts with the string used in a GFF file
mgkit.io.gff.Annotation.coverage() now returns a float instead of an int
moved code from package mgkit.io to mgkit.io.utils
changed behaviour of mgkit.utils.common.union_range()
removed mgkit.utils.common.range_substract_()
added progressbar2 as installation requirement
changed how mgkit.taxon.UniprotTaxonomy.find_by_name()

Fixed¶

Besides smaller fixes:

mgkit.plots.abund.draw_circles() behaviour when sizescale doesn’t have the same shape as order
parser is now correct for taxon-utils - Taxonomy Utilities, to include the Krona options
condition when a blast output is empty, hence lineno is not initialised when a message is logged

Deprecated¶

translate_seq will be removed in version 0.4.0, instead use the similar command in fasta-utils - Fasta Utilities

0.3.0¶

A lot of bugs were fixed in this release, especially for reading NCBI taxonomy and using the msgpack format to save a UniprotTaxonomy instance. Also added a tutorial for profiling a microbial community using MGKit and BLAST (Profile a Community with BLAST)

Added¶

mgkit.align.read_samtools_depth() to read the samtools depth format iteratively (returns a generator)
mgkit.align.SamtoolsDepth, used to cache the samtools depth format, while requesting region coverage
mgkit.kegg.KeggModule.find_submodules(), mgkit.kegg.KeggModule.parse_entry2()
mgkit.mappings.enzyme.get_mapping_level()
mgkit.utils.dictionary.cache_dict_file() to cache a large dictionary file (tab separated file with 2 columns), an example of its usage is in the documentation
mgkit.taxon.UniprotTaxonomy.read_from_gtdb_taxonomy() to read a custom taxonomy from a tab separated file. The taxon_id are not guaranteed to be stable between runs
added cov_samtools to add-gff-info script
added mgkit.workflow.fasta_utils and correspondent script fasta-utils
added options -k and -kt to taxon_utils, which outputs a file that can be used with Krona ktImportText (needs to use -q with this script)

Changed¶

added no_zero parameter to mgkit.io.blast.parse_accession_taxa_table()
changed behaviour of mgkit.kegg.KeggModule and some of its methods.
added with_last parameter to mgkit.taxon.get_lineage()
added --split option to add-gff-info exp_syn and get-gff-info sequence scripts, to emulate BLAST behaviour in parsing sequence headers
added -c option to add-gff-info addtaxa

0.2.5¶

Changed¶

added the only_ranked argument to mgkit.taxon.get_lineage()
add-gff-info addtaxa (add-gff-info - Add informations to GFF annotations) doesn’t preload the GFF file if a dictionary is used instead of the taxa table
blast2gff blastdb (blast2gff - Convert BLAST output to GFF) offers more options to control the format of the header in the DB used
added the sequence command to filter-gff (filter-gff - Filter GFF annotations), to filter all annotations on a per-sequence base, based on mean bitscore or other comparisons

Added¶

added mgkit.counts.func.load_counts_from_gff()
added mgkit.io.blast.parse_accession_taxa_table()
added mgkit.plots.abund.draw_axis_internal_triangle()
added representation of mgkit.taxon.UniprotTaxonomy, it shows the number of taxa in the instance
added mgkit.taxon.last_common_ancestor_multiple()
added taxon_utils (taxon-utils - Taxonomy Utilities) to filter GFF based on their taxonomy and find the last common ancestor for a reference sequence based on either GFF annotations or a list of taxon_ids (in a text file)

0.2.4¶

Changed¶

mgkit.utils.sequence.get_contigs_info() now accepts a dictionary name->seq or a list of sequences, besides a file name (r536)
add-gff-info counts command now removes trailing commas from the samples list
the axes are turned off after the dendrogram is plotted

Fixed¶

the snp_parser script requirements were set wrong in setup.py (r540)
uncommented lines to download sample data to build documentation (r533)
add-gff-info uniprot command now writes the lineage attribute correctly (r538)

0.2.3¶

The installation dependencies are more flexible, with only numpy being required. To install all the needed packages, you can use:

$ pip install mgkit[full]

Added¶

new option to pass the query sequences to blast2gff, this allows adding the correct frame of the annotation in the GFF
added the attributes evalue, subject_start and subject_end to the output of blast2gff. The subject start and end position allow to understand on which frame of the subject sequence the match was found
added the options to annotate the heatmap with the numbers. Also updated the relative example notebook
Added the option to read the taxonomy from NCBI dump files, using mgkit.taxon.UniprotTaxonomy.read_from_ncbi_dump(). This makes it faster to get the taxonomy file
added argument to return information from mgkit.net.embl.datawarehouse_search(), in the form of tab separated data. The argument fields can be used when display is set to report. An example on how to use it is in the function documentation
added a bash script download-taxonomy.sh that downloads the taxonomy
added script venv-docs.sh to build the documentation in HTML under a virtual environment. matplotlib on MacOS X raises a RuntimeError, because of a bug in virtualenv; the documentation can first be built with this script, after the script create-apidoc.sh is used to create the API documentation. The rest of the documentation (e.g. the PDF) can be created with make as usual, afterwards
added mgkit.net.pfam, with only one function at the moment, that returns the descriptions of the families.
added pfam command to add-gff-info, using the mentioned function, it adds the description of the Pfam families in the GFF file
added a new exception, used internally when an additional dependency is needed

Changed¶

using the NCBI taxonomy dump has two side effects:
- the scientific/common names are kept as is, not lower cased as was before
- a merged file is provided for taxon_id that changed. While the old taxon_id is kept in the taxonomy, it points to the new taxon, to keep backward compatibility
renamed the add-gff-info gitaxa command to addtaxa. It now accepts more data sources (dictionaries) and is more general
changed mgkit.net.embl.datawarehouse_search() to automatically set the limit at 100,000 records
the taxonomy can now be saved using msgpack, making it faster to read/write it. It’s also more compact and has a better compression ratio
the mgkit.plots.heatmap.grouped_spine() now accepts the rotation of the labels as option
added option to use another attribute for the gene_id in the get-gff-info script gtf command
added a function to compare the version of MGKit used, throwing a warning, when it’s different (mgkit.check_version())
removed test for old SNPs structures and added the same tests for the new one
mgkit.snps.classes.GeneSNP now caches the number of synonymous and non-synonymous SNPs for better speed
mgkit.io.gff.GenomicRange.__contains__() now also accepts a tuple (start, end) or another GenomicRange instance
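
The extended containment test can be pictured with the sketch below. SimpleRange is a hypothetical class, not mgkit.io.gff.GenomicRange, and two details are assumptions: coordinates are treated as inclusive, and a tuple or range counts as contained only when it lies entirely inside the range.

```python
class SimpleRange:
    """Hypothetical range with inclusive start and end coordinates."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __contains__(self, other):
        # a single position
        if isinstance(other, int):
            return self.start <= other <= self.end
        # a (start, end) tuple or another range-like object
        if isinstance(other, tuple):
            start, end = other
        else:
            start, end = other.start, other.end
        return self.start <= start and end <= self.end


region = SimpleRange(1, 10)
print(5 in region)
print((2, 8) in region)
print((8, 12) in region)
print(SimpleRange(3, 4) in region)
```

Dispatching on the type of the tested value lets the same `in` operator cover positions, tuples and other ranges.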

Fixed¶

a bug in the gitaxa (now addtaxa) command: when a taxon_id was not found in the table, the wrong taxon_name and lineage was inserted
bug in mgkit.snps.classes.GeneSNP that prevented the correct addition of values
fixed bug in mgkit.snps.funcs.flat_sample_snps() with the new class
mgkit.io.gff.parse_gff() now correctly handles comment lines and stops parsing if the fasta file at the end of a GFF is found

0.2.2¶

Added¶

new commands for the add-gff-info script (add-gff-info - Add informations to GFF annotations):
- eggnog to add information from eggNOG HMMs (at the moment the 4.5 Viral)
- counts and fpkms to add count data (correctly exported to mongodb)
- gitaxa to add taxonomy information directly from GI identifiers from NCBI
added blastdb command to blast2gff script (blast2gff - Convert BLAST output to GFF)
updated MGKit GFF Specifications
added gtf command to get-gff-info script (get-gff-info - Extract informations to GFF annotations) to convert a GFF to GTF, that is accepted by featureCounts, in conjunction with the counts command of add-gff-info
added method mgkit.snps.classes.RatioMixIn.calc_ratio_flag to calculate special cases of pN/pS

Changed¶

added argument in functions of the mgkit.snps.conv_func to bypass the default filters
added use_uid argument to mgkit.snps.funcs.combine_sample_snps() to use the uid instead of the gene_id when calculating pN/pS
added flag_values argument to mgkit.snps.funcs.combine_sample_snps() to use mgkit.snps.classes.RatioMixIn.calc_ratio_flag instead of mgkit.snps.classes.RatioMixIn.calc_ratio

Removed¶

deprecated code from the snps package

0.2.1¶

Added¶

added mgkit.db.mongo
added mgkit.db.dbm
added mgkit.io.gff.Annotation.get_mappings()
added mgkit.io.gff.Annotation.to_json()
added mgkit.io.gff.Annotation.to_mongodb()
added mgkit.io.gff.from_json()
added mgkit.io.gff.from_mongodb()
added mgkit.taxon.get_lineage()
added mgkit.utils.sequence.get_contigs_info()
added mongodb and dbm commands to script get-gff-info
added kegg command to add-gff-info script, caching results and -d option to uniprot command
added -ft option to blast2gff script
added -ko option to download_profiles
added new HMMER tutorial
added another notebook to the plot examples, for misc. tips
added a script that downloads from figshare the tutorial data
added function to get an enzyme full name (mgkit.mappings.enzyme.get_enzyme_full_name())
added example notebook for using GFF annotations and the mgkit.db.dbm, mgkit.db.mongo modules

Changed¶

mgkit.io.blast.parse_uniprot_blast()
mgkit.io.gff.Annotation
mgkit.io.gff.GenomicRange
mgkit.io.gff.from_hmmer()
mgkit.taxon.UniprotTaxonomy.read_taxonomy()
mgkit.taxon.parse_uniprot_taxon()
changed behaviour of hmmer2gff script
changed tutorial notebook to specify the directory where the data is

Deprecated¶

mgkit.filter.taxon.filter_taxonomy_by_lineage()
mgkit.filter.taxon.filter_taxonomy_by_rank()

Removed¶

removed old filter_gff script

0.2.0¶

added creation of wheel distribution
changes to ensure compatibility with later pandas versions
mgkit.io.gff.Annotation.get_ec() now returns a set, reflected changes in tests
added a --cite option to scripts
fixes to tutorial
updated documentation for sphinx 1.3
changes to diagrams
added decoration to raise warnings for deprecated functions
added possibility for mgkit.counts.func.load_sample_counts() info_dict to be a function instead of a dictionary
consolidation of some eggNOG structures
added more spine options in mgkit.plots.heatmap.grouped_spine()
added a length property to mgkit.io.gff.Annotation
changed filter-gff script to customise the filtering function, from the default one, also updated the relative documentation
fixed a few plot functions

0.1.16¶

changed default parameter for mgkit.plots.boxplot.add_values_to_boxplot()
Added include_only filter option to the default snp filters mgkit.consts.DEFAULT_SNP_FILTER
the default filter for SNPs now uses an include only option, by default including only protozoa, archaea, fungi and bacteria in the matrix
added widths parameter to the mgkit.plots.boxplot.boxplot_dataframe() function, added function mgkit.plots.boxplot.add_significance_to_boxplot() and updated example boxplot notebook for new function example
added use_dist and dist_func parameters to the mgkit.plots.heatmap.dendrogram() function
added a few constants and functions to calculate the distance matrices of taxa: mgkit.taxon.taxa_distance_matrix(), mgkit.taxon.distance_taxa_ancestor() and mgkit.taxon.distance_two_taxa()
mgkit.kegg.KeggClientRest.link_ids() now accepts a dictionary as list of ids
if the conversion of an Annotation attribute (first 8 columns) raises a ValueError in mgkit.io.gff.from_gff(), by default the parser keeps the string version (as is the case for phase, where it is '.' instead of a number)
treat cases where an attribute is set with no value in mgkit.io.gff.from_gff()
added mgkit.plots.colors.palette_float_to_hex() to convert floating value palettes to string
forces vertical alignment of tick labels in heatmaps
added parameter to get a consensus sequence for an AA alignment, by adding the nucl parameter to mgkit.utils.sequence.Alignment.get_consensus()
added mgkit.utils.sequence.get_variant_sequence() to get variants of a sequence, essentially changing the sequence according to the SNPs passed
added method to get an aminoacid sequence from Annotation in mgkit.io.gff.Annotation.get_aa_seq() and added the possibility to pass a SNP to get the variant sequence of an Annotation in mgkit.io.gff.Annotation.get_nuc_seq().
added exp_syn command to add-gff-info script
changed GTF file conversion
changed behaviour of mgkit.taxon.is_ancestor(): if a taxon_id raises a KeyError, False is now returned. In other words, if the taxon_id is not found in the taxonomy, it’s not an ancestor
added mgkit.io.gff.GenomicRange.__contains__(). It tests if a position is inside the range
added mgkit.io.gff.GenomicRange.get_relative_pos(). It returns a position relative to the GenomicRange start
fixed documentation and bugs (Annotation.get_nuc_seq)
added mgkit.io.gff.Annotation.is_syn(). It returns True if a SNP is synonymous and False if non-synonymous
added to_nuc parameter to mgkit.io.gff.from_nuc_blast() function. If to_nuc is False, it is assumed that the hit was against an amino acidic DB, in which case the phase should always be set to 0
reworked internal of snp_parser script. It doesn’t use SNPDat anymore
updated tutorial
added ipython notebook as an example to explore data from the tutorial
cleaned deprecated code, fixed imports, added tests and documentation
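
The changed behaviour of mgkit.taxon.is_ancestor() listed above (a missing taxon_id gives False instead of raising a KeyError) can be sketched as follows. The function signature and the plain dictionary of parents are hypothetical stand-ins for the real taxonomy, and whether a taxon counts as its own ancestor is an assumption (here it does not).

```python
def is_ancestor(parents, taxon_id, anc_id):
    """Hypothetical check: parents maps a taxon_id to its parent (None at root)."""
    try:
        current = parents[taxon_id]
        while current is not None:
            if current == anc_id:
                return True
            current = parents[current]
    except KeyError:
        # a taxon_id missing from the taxonomy is treated as "not an ancestor"
        return False
    return False


parents = {1: None, 2: 1, 3: 2}
print(is_ancestor(parents, 3, 1))
print(is_ancestor(parents, 2, 3))
print(is_ancestor(parents, 99, 1))
```

Returning False keeps filters that loop over many annotations from failing on a single unknown taxon_id.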

0.1.15¶

changed name of mgkit.taxon.lowest_common_ancestor() to mgkit.taxon.last_common_ancestor(), the old function name points to the new one
added mgkit.counts.func.map_counts_to_category() to remap counts from one ID to another
added get-gff-info script to extract information from GFF files
script download_data can now download only taxonomy data
added more script documentation
added examples on gene prediction
added function mgkit.io.gff.from_hmmer() to parse HMMER results and return mgkit.io.gff.Annotation instances
added mgkit.io.gff.Annotation.to_gtf() to return a GTF line, mgkit.io.gff.Annotation.add_gc_content() and mgkit.io.gff.Annotation.add_gc_ratio() to calculate GC content and ratio respectively
added mgkit.io.gff.parse_gff_files() to parse multiple GFF files
added uid_used parameter to several functions in mgkit.counts.func
added mgkit.plots.abund to plot abundance plots
added example notebooks for plots
HTSeq is now required only by the scripts that use it, snp_parser and fastq_utils
added function to convert numbers when reading from htseq count files
changed behavior of -b option in add-gff-info taxonomy command
added mgkit.io.gff.get_annotation_map()
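
Remapping counts from one ID to another, as mgkit.counts.func.map_counts_to_category() is described above, can be sketched roughly as follows. The function name, the dictionary-based inputs and the choice to skip unmapped IDs are all assumptions made for illustration; MGKit's own function may take different arguments.

```python
from collections import defaultdict


def remap_counts_sketch(counts, id_map):
    """Sum the counts of the original IDs into the categories they map to.

    counts: hypothetical dict {gene_id: count}
    id_map: hypothetical dict {gene_id: iterable of category IDs}
    IDs without a mapping are skipped (an assumption of this sketch).
    """
    remapped = defaultdict(int)
    for gene_id, count in counts.items():
        for category in id_map.get(gene_id, ()):
            remapped[category] += count
    return dict(remapped)


gene_counts = {"K1": 5, "K2": 3, "K3": 2}
gene_to_category = {"K1": ["C"], "K2": ["C", "E"]}
print(remap_counts_sketch(gene_counts, gene_to_category))
```

Note that an ID mapped to several categories contributes its full count to each of them, so the remapped totals can exceed the original total.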

0.1.14¶

added IPython notebooks to the documentation. As of this version the included ones (in docs/source/examples) are for two plot modules. Also added a bash script to convert them into rst files to be included with the documentation. The .rst files are not versioned and must be rebuilt, meaning that one of the requirements for building the docs is to have IPython installed with the notebook extension
importing some packages now automatically imports their subpackages as well
refactored mgkit.plots into a package, with most of the original functions imported into it, for backward compatibility
added mgkit.graphs.build_weighted_graph()
added box_vert parameter in mgkit.plots.boxplot.add_values_to_boxplot(), the default will be changed in a later version (kept for compatibility with older scripts/notebooks)
added a heatmap module to the plots package. Examples are in the notebook
added mgkit.align.covered_annotation_bp() to find the number of bp covered by reads in annotations (as opposed to using the annotation length)
added documentation to mgkit.mappings.eggnog.NOGInfo and an additional method
added mgkit.net.uniprot.get_uniprot_ec_mappings() as it was used in a few scripts already
added mgkit.mappings.enzyme.change_mapping_level() and others to deal with EC numbers. Also improved documentation with some examples
added mgkit.counts.func.load_sample_counts_to_genes() and mgkit.counts.func.load_sample_counts_to_taxon(), for mapping counts to only genes or taxa. Also added index parameter in mgkit.counts.func.map_counts() to accommodate the changes
added mgkit.net.uniprot.get_ko_to_eggnog_mappings() to get mappings of KO identifiers to eggNOG
added mgkit.io.gff.split_gff_file() to split a GFF file into several ones, ensuring that all annotations for a sequence are in the same file; useful to split massive GFF files before filtering
added mgkit.counts.func.load_deseq2_results() to load DESeq2 results in CSV format
added mgkit.counts.scaling.scale_rpkm() to scale a count table with RPKM
added caching options to mgkit.counts.func.load_sample_counts() and others
fixes and improvements to documentation
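
For reference, the RPKM scaling mentioned for mgkit.counts.scaling.scale_rpkm() follows the standard formula: reads per kilobase of gene length per million mapped reads. The sketch below is a plain Python illustration, not MGKit's code; the function name and the dictionary inputs are hypothetical, and the total number of mapped reads is assumed to be the sum of the counts in the sample.

```python
def scale_rpkm_sketch(counts, gene_lengths):
    """Return RPKM values for one sample.

    counts: hypothetical dict {gene_id: read count}
    gene_lengths: hypothetical dict {gene_id: length in bp}
    RPKM = count * 10^9 / (gene length * total mapped reads)
    """
    total_reads = sum(counts.values())
    return {
        gene_id: count * 1e9 / (gene_lengths[gene_id] * total_reads)
        for gene_id, count in counts.items()
    }


sample_counts = {"g1": 100, "g2": 300}
lengths = {"g1": 1000, "g2": 2000}
print(scale_rpkm_sketch(sample_counts, lengths))
```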

0.1.13¶

added counts package, including functions to load HTSeq-counts results and scaling
added mgkit.filter.taxon.filter_by_ancestor(), as a convenience function
deprecated functions in mgkit.io.blast module, added more to parse blast outputs (some specific)
mgkit.io.fasta.load_fasta() returns uppercase sequences, added a function (mgkit.io.fasta.split_fasta_file()) to split fasta files
added more methods to mgkit.io.gff.Annotation to complete API from old annotations
fixed mgkit.io.gff.Annotation.dbq property to return an int (bug in filtering with filter-gff)
added function to extract the sequences covered by annotations, using the mgkit.io.gff.Annotation.get_nuc_seq() method
added mgkit.io.gff.correct_old_annotations() to update old annotated GFF to new conventions
added mgkit.io.gff.group_annotations_by_ancestor() and mgkit.io.gff.group_annotations_sorted()
moved deprecated GFF classes/modules in mgkit.io.gff_old
added mgkit.io.uniprot module to read/write Uniprot files
added mgkit.kegg.KeggClientRest.get_ids_names() to replace the old methods used to retrieve the names of specific classes (they are deprecated at the moment)
added mgkit.kegg.KeggModule to parse a Kegg module entry
added mgkit.net.embl.datawarehouse_search() to search EMBL resources
made mgkit.net.uniprot.query_uniprot() more flexible
added/changed plot function in mgkit.plots
added enum34 as a dependency for Python versions below 3.4
changed classes to hold SNPs data: deprecated mgkit.snps.classes.GeneSyn, replaced by mgkit.snps.classes.GeneSNP which uses the enum module for mgkit.snps.classes.SNPType
added mgkit.taxon.NoLcaFound
fixed behaviour of mgkit.taxon.UniprotTaxonomy.get_ranked_taxon() for newer taxonomies
changed behaviour of mgkit.taxon.UniprotTaxonomy.is_ancestor() to use the module function mgkit.taxon.is_ancestor() and accept multiple taxon IDs to test
mgkit.taxon.UniprotTaxonomy.load_data() now accepts compressed data and file handles
added mgkit.taxon.lowest_common_ancestor() to find the lowest common ancestor of two taxon IDs
changed behaviour of mgkit.taxon.parse_uniprot_taxon()
added functions to get the GC content, the GC ratio and the composition of a sequence to mgkit.utils.sequence
added more options to blast2gff script
added coverage, taxonomy and unipfile to add-gff-info
refactored snp_parser to use new classes
added possibility to use sorted GFF files as input for filter-gff to use less memory (the examples show how to use sort in Unix)
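
The memory saving from sorted GFF input comes from the fact that all annotations of a sequence are then contiguous, so they can be processed one group at a time instead of loading the whole file. The sketch below illustrates that general idea with itertools.groupby; it is not the code used by filter-gff, and the function name is hypothetical.

```python
from itertools import groupby


def iter_by_sequence(gff_lines):
    """Yield (seq_id, records) from GFF lines sorted by sequence ID.

    Only the records of the current sequence are held in memory.
    """
    records = (
        line.rstrip("\n").split("\t")
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    )
    # groupby only groups consecutive items, hence the need for sorted input
    for seq_id, group in groupby(records, key=lambda fields: fields[0]):
        yield seq_id, list(group)


example_lines = [
    "##gff-version 3\n",
    "contig1\tsketch\tCDS\t1\t90\t.\t+\t0\tgene_id=a\n",
    "contig1\tsketch\tCDS\t100\t190\t.\t+\t0\tgene_id=b\n",
    "contig2\tsketch\tCDS\t5\t64\t.\t-\t0\tgene_id=c\n",
]
for seq_id, annotations in iter_by_sequence(example_lines):
    print(seq_id, len(annotations))
```

With unsorted input the same sequence ID would show up in several separate groups, which is why the sort step is required first.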

0.1.12¶

added functions to elongate annotations, measure the coverage of them and diff GFF files in mgkit.io.gff
added ranges_length and union_ranges to mgkit.utils.common
added script filter-gff; filter_gff will be deprecated
added script blast2gff to convert blast output to a GFF
removed unneeded dependencies to build docs
added script add-gff-info to add more annotations to GFF files
added mgkit.io.blast.parse_blast_tab() to parse BLAST tabular format
added mgkit.io.blast.parse_uniprot_blast() to return annotations from a BLAST tabular file
added mgkit.graph module
added classes mgkit.io.gff.Annotation and mgkit.io.gff.GenomicRange and deprecated old classes to handle GFF annotations (API not stable)
added mgkit.io.gff.DuplicateKeyError raised in parsing GFF files
added functions used to return annotations from several sources
added option gff_type in mgkit.io.gff.load_gff()
added mgkit.net.embl.dbfetch()
added mgkit.net.uniprot.get_gene_info(), mgkit.net.uniprot.query_uniprot() and mgkit.net.uniprot.parse_uniprot_response()
added apply_func_to_values to mgkit.utils.dictionary
added mgkit.snps.conv_func.get_full_dataframe(), mgkit.snps.conv_func.get_gene_taxon_dataframe()
added more tests
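
The ranges_length and union_ranges helpers named in this release deal with merging intervals and measuring how many positions they cover. A rough sketch of that technique follows. The function names, the use of inclusive coordinates (as in GFF) and the merging of adjacent intervals are assumptions of this example, and the functions in mgkit.utils.common may behave differently.

```python
def union_ranges_sketch(intervals):
    """Merge overlapping or adjacent (start, end) intervals, both ends inclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def ranges_length_sketch(intervals):
    """Total number of positions covered by non-overlapping inclusive intervals."""
    return sum(end - start + 1 for start, end in intervals)


covered = union_ranges_sketch([(5, 20), (1, 10), (30, 40)])
print(covered)                        # [(1, 20), (30, 40)]
print(ranges_length_sketch(covered))  # 31
```

Merging first and measuring afterwards avoids counting positions twice when annotations overlap.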

0.1.11¶

removed rst2pdf for generating a PDF for documentation. LaTeX is preferred
corrections to documentation and example script
removed need for joblib library in translate_seq script: used only if available (for using multiple processors)
deprecated mgkit.snps.funcs.combine_snps_in_dataframe() and mgkit.snps.funcs.combine_snps_in_dataframe(): mgkit.snps.funcs.combine_sample_snps() should be used
refactored some tests and added more
added docs_req.txt to help build the documentation on readthedocs.org
renamed mgkit.snps.classes.GeneSyn gid and taxon attributes to gene_id and taxon_id. The old names are still available for use (via properties), but they will be taken out in later versions. Old pickle data should be loaded and saved again with this release before that happens
added a few convenience functions to ease the use of combine_sample_snps()
added function mgkit.snps.funcs.significance_test() to test the distributions of genes shared between two taxa
fixed an issue with deinterleaving sequence data from khmer
added mgkit.snps.funcs.flat_sample_snps()
added a method to mgkit.kegg.KeggClientRest to get names for all ids of a certain type (more generic than the various get_*_names)
added first implementation of mgkit.kegg.KeggModule class to parse a Kegg module entry
mgkit.snps.conv_func.get_rank_dataframe(), mgkit.snps.conv_func.get_gene_map_dataframe()