Changes¶
0.5.7¶
This release includes some changes to the script to handle MAGs and alternative taxonomies (right now from PhyloPhlan). A few more scripts and commands were also introduced, for handling bigger datasets or streamlining some work. Additionally, I’ll try to update a Docker image (currently only for version 0.5.6) so an alternative to Conda is possible. The last time I tried to install pysam via pip I wasn’t able to, so a Docker image is a good compromise for performance, portability and ease of installation.
Several bugfixes were also made, so I want this version out first, while the next version (0.6.0) will be used to clean up the documentation and code and to introduce more tests. I also want to clean up and update the tutorials.
Added¶
mgkit.io.gff.Annotation.get_fc() to access the functional categories
mgkit.taxon.Taxonomy.get_by_lineage() to interrogate by full lineage
mgkit.taxon.Taxonomy.max_id() and mgkit.taxon.Taxonomy.min_id() to find the last added taxon IDs
mgkit.taxon.Taxonomy.parse_phylophlan_lineage() parses a line from the taxonomy file
mgkit.taxon.Taxonomy.read_from_phylophlan_taxonomy() reads the PhyloPhlan taxonomy file
added bins command to add-gff-info
added script count-utils for manipulation of count data
added a rename command to edit-gff to rename attributes
added options in edit-gff table for using prodigal genes, strip Kegg tags and using a default value
added filter and info commands to fasta-utils
added vcf command to pnps-gen to parse the VCF file for pN/pS generation. snp_parser is deprecated now.
added more options to taxon-utils get to extract information from the taxonomy
added dict-utils for manipulating dictionary files
Changed¶
mgkit.io.gff.parse_gff() now accepts a filter_func to filter annotations
mgkit.taxon.Taxonomy.add_taxon() now adds the lineage property of a taxon, and when a taxon is new, the ID will be negative
mgkit.taxon.get_lineage() now accepts a rank and use_cname parameter
blast2gff blastdb will add the query coverage if the query fasta file is passed
Deprecated¶
snp_parser shouldn’t be used anymore, instead use pnps-gen vcf
0.5.6¶
mgkit.net.uniprot tests have problems with upstream API and are skipped for now
added option to taxon-utils lca to use the reference file to add base pair counts to the Krona output
several bugfixes
0.5.5¶
fasta-utils translate gained an option to translate the current frame of a sequence, assumes the sequences are ORFs
edit-gff table gained an option to skip comments; the user can indicate the string used for comments. For example -c ‘#’ for comments starting with ‘#’
0.5.4¶
added mgkit.taxon.Taxonomy.iter_ids() to iterate over Taxonomy, yielding Taxon IDs
added options -p and -c to taxon-utils get
taxon-utils get¶
When using the -o option in taxon-utils get, the script will try an exact match, followed by a partial and finally a fuzzy search of the names passed. The alternative names will be reported but not used, unless the -p option is used.
The -c option will also output all the taxa that are children of the passed names.
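The exact, then partial, then fuzzy lookup order described above can be sketched in plain Python. This is only an illustration of the fallback idea, not the code used by taxon-utils: the function name find_names, the case-insensitive comparison, the similarity cutoff and the use of difflib are all assumptions.

```python
import difflib


def find_names(query, names, cutoff=0.8):
    """Return (kind, matches) from the first strategy that finds something.

    Hypothetical helper: the strategies mirror the order described in the
    changelog, the details (case folding, cutoff) are assumptions.
    """
    lowered = {name.lower(): name for name in names}
    key = query.lower()
    # 1. exact match
    if key in lowered:
        return "exact", [lowered[key]]
    # 2. partial match: the query is contained in a name
    partial = [name for low, name in lowered.items() if key in low]
    if partial:
        return "partial", sorted(partial)
    # 3. fuzzy match, based on a similarity ratio
    fuzzy = difflib.get_close_matches(key, list(lowered), n=3, cutoff=cutoff)
    return "fuzzy", [lowered[low] for low in fuzzy]


names = ["Prevotella", "Prevotella ruminicola", "Methanobrevibacter"]
exact = find_names("prevotella", names)
partial = find_names("rumini", names)
fuzzy = find_names("Methanobrevibactr", names)
```

Stopping at the first strategy that returns something keeps exact names from being shadowed by looser matches.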
0.5.3¶
Added parameters to support the partial pN/pS calculations to mgkit.snps.funcs.combine_sample_snps()
Added options to the pnps-gen script to output only pS or pN
0.5.2¶
Fixed a bug when using --rank None in pnps-gen
0.5.1¶
get_gene_taxon_dataframe: gene_map can be None, and use_uid can be passed to the underlying function
added option to include the lineage as a string in pnps-gen
added option to use the uids from the GFF instead of gene_id, this does not require the GFF file, they are embedded into the .pickle file
by default pnps-gen returns the taxon included in the GFF and not a ranked taxon
added option to make a different type of table in pnps-gen rank
0.5.0¶
Added¶
taxon-utils get command to query a taxonomy file
pnps-gen to generate a table of pN/pS values
0.4.4¶
Added¶
mgkit.utils.dictionary.dict_to_text()
mgkit.utils.dictionary.text_to_dict() to read/write simple dictionary files (tables)
filter-gff overlap command, added option to not use the strand information in filtering the overlaps and also to make multiple iterations (max 10) to better remove overlaps
mgkit.io.gff.Annotation.has_attr() and mgkit.io.gff.Annotation.del_attr()
a new script, edit-gff to view a GFF as table and perform general edits on it
Changed¶
download-ncbi-taxa.sh and download-uniprot-taxa.sh (Download Taxonomy): if a PROGBAR environment variable is set, the progress bar (default in wget) is used
changed mgkit.io.gff.Annotation.set_attr() to allow changing standard attributes
added some checks for unexpected lengths in add-gff-info exp_syn, check the log for these cases
mgkit.utils.sequence.get_seq_expected_syn_count() silently skips codons that contain N or are not of length 3
0.4.3¶
Fixes¶
mgkit.align.SamtoolsDepth in version 0.4.2 was using a weakref.WeakValueDictionary to speed up the recovery of memory from the internal dictionary. In the tests on MacOS the memory was mostly kept, but on Linux when submitted as a job it seems to be freed instantly. This also impacts the add-gff-info cov_samtools command, since it uses this class - it will run, but reports that the number of sequences not found in the samtools depth file is the same as the number of sequences in the GFF file.
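The behaviour behind this bug can be reproduced with the standard library alone. The sketch below is not taken from mgkit: the Coverage class and the contig1 key are stand-ins, used only to show that an entry in a weakref.WeakValueDictionary disappears once nothing else holds a strong reference to the value.

```python
import gc
import weakref


class Coverage:
    """Stand-in for per-sequence coverage data (plain lists cannot be weakly referenced)."""

    def __init__(self, values):
        self.values = values


cache = weakref.WeakValueDictionary()
data = Coverage([1, 2, 3])
cache["contig1"] = data
# While 'data' is alive the entry can be found
present_while_referenced = "contig1" in cache

# Dropping the only strong reference lets the entry be discarded
del data
gc.collect()
present_after_release = "contig1" in cache
```

A cache built this way only works if the caller keeps its own reference to every value, which would explain why sequences could be reported as missing even though they had been read.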
0.4.2¶
Fixed reading of Expasy files. The reading was not changed to adopt Python3 conventions like the rest of the routines. Included a test that downloads the expasy file and parses it
Optimisations of add-gff-info cov_samtools and the mgkit.align routines used
Added¶
option -m to calculate average coverage in add-gff-info cov_samtools
Changed¶
fix for detection of compressed files
mgkit.io.gff.parse_gff()
Fix for compressed files already opened in mgkit.io.utils.open_file()
mgkit.align.SamtoolsDepth: several optimisations and changes to support a scanning approach, instead of lookup table. No exception is raised when a sequence is not found in the file, instead assuming that the coverage is 0
mgkit.align.read_samtools_depth() was changed, and now it returns lists instead of numpy arrays - this increases the speed of reading to about 3-4x in some tests
mgkit.align.read_samtools_depth() also assumes that lines read have a ‘\n’ at the end and avoids using strip; this should be a safe assumption under Python3
mgkit.align.SamtoolsDepth now uses a weakref.WeakValueDictionary for SamtoolsDepth.data to improve release of memory
0.4.1¶
Sanity checks for several mistakes, including never having changed the Programming language version in the setup.py from 2.7. Tested installation under Python 3.6, with tox. Also removed the last bit of code using progressbar2.
0.4.0¶
This version was tested under Python 3.5, but the tests (with tox) were run also under Python 2.7. However, from the next release Python 2.7 will be removed gradually (as Python 2.7 won’t be supported/patched anymore from next year).
Added¶
Added --progress option to several scripts
Changed¶
Requires pandas version >=0.24 because now a pandas.SparseArray is used for add-gff-info cov_samtools. Before, when reading the depth files from samtools the array for each sequence was kept in memory, while now only the ones in the GFF file are used.
mgkit.align.SamtoolsDepth: uses pandas.SparseArray now. It should use less memory, but needs pandas version > 0.24
mgkit.align.read_samtools_depth(): now returns 3 arrays, instead of 2. Also added seq_ids to skip lines
mgkit.io.gff.from_gff: added encoding parameter
mgkit.io.gff.parse_gff: In some cases ASCII decoding is not enough, so it is parametrised now
mgkit.io.gff.split_gff_file: added encoding parameter
mgkit.mappings.eggnog.NOGInfo: made file reading compatible with Python 3
mgkit.snps.funcs.combine_sample_snps(): added store_uids
Deprecated¶
mgkit.taxon.Taxonomy.read_taxonomy(): use Taxonomy.read_from_ncbi_dump()
mgkit.taxon.Taxonomy.parse_uniprot_taxon()
Tests¶
Removed the last portions that used nosetests and better integrated pytest with setup.py. Now uses AppVeyor for testing the build and running tests under Python 3.
In cases where the testing environment has no or limited internet connection, tests that require an internet connection can be skipped by setting the following environment variable before running the tests:
$ export MGKIT_TESTS_CONN_SKIP=T
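For readers writing their own tests in the same spirit, an environment variable can drive a skip as in the sketch below. How MGKit's test suite actually reads MGKIT_TESTS_CONN_SKIP is not documented here, so the helper conn_tests_skipped, the rule that any non-empty value counts as set, and the use of unittest are assumptions.

```python
import os
import unittest


def conn_tests_skipped(environ=os.environ):
    """Hypothetical check: True when MGKIT_TESTS_CONN_SKIP is set to a non-empty value."""
    return bool(environ.get("MGKIT_TESTS_CONN_SKIP"))


class ExampleNetworkTest(unittest.TestCase):
    # The condition is evaluated once, when the class is defined
    @unittest.skipIf(conn_tests_skipped(), "MGKIT_TESTS_CONN_SKIP is set")
    def test_remote_resource(self):
        # A real test would contact a remote service here
        self.assertTrue(True)
```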
0.3.4¶
General cleanup and testing release. Major changes:
general moving to Python2 (2.7) and Python3 (3.5+) support, using the future package and when convenient checks for the version of python installed
setup includes now all the optional dependencies, since this makes it easier to deal with conda environments
move to pytest from nose, since it allows some functionality that interests me, along with the reorganisation of the test modules and skips of tests that cannot be executed (like mongodb)
move from urllib to using requests, which also helps with python3 support
more careful with some dependencies, like the lzma module and msgpack
addition of more tests, to help the porting to python3, along with a tox configuration
matplotlib.pyplot is still in the mgkit.plots.unused, but it is not imported when the parent package is, now. It is still needed in the mgkit.plots.utils functions, so the import has been moved inside the function. This should help with virtual environments and test suites
renamed mgkit.taxon.UniprotTaxonomy to mgkit.taxon.Taxonomy, since it’s really NCBI taxonomy and it’s preferred to download the data from there. Same for mgkit.taxon.UniprotTaxonTuple to mgkit.taxon.TaxonTuple, with an alias for old name there, but will be removed in a later version
download_data was removed. Taxonomy should be downloaded using download-taxonomy.sh, and the mgkit.mappings is in need of refactoring to remove old and now unused functionality
using a sphinx plugin to render the jupyter notebooks instead of old solution
rerun most of the tutorial and updated commands for newest available software (samtools/bcftools) and preferred the SNP calling from bcftools
Scripts¶
This is a summary of notable changes, it is advised to check the changes in the command line interface for several scripts
changed several scripts command line interface, to adapt to the use of click
taxon-utils lca has only one option to specify the output format, also adding the option to output a format that can be used by add-gff-info addtaxa
taxon-utils filter supports the filtering of table files, when they are in a 2-columns format, such as those that are downloaded by download-ncbi-taxa.sh
removed the eggnog and taxonomy commands from add-gff-info, the former since it’s not that useful, the latter because it’s possible to achieve the same results using taxon-utils with the new output option
removed the rand command of fastq-utils since it was only for testing and the FastQ parser is the one from mgkit.io.fastq
substantial changes were made to commands values and sequence of the filter-gff script
sampling-utils rand_seq now can save the model used and reload it
removed download_data and download_profiles, since they are not going to be used in the next tutorial and it is preferred now to use BLAST and then find the LCA with taxon-utils
Python3¶
At the time of writing all tests pass on Python 3.5, but more tests are needed, along with some new ones for the blast parser and the scripts. Some important changes:
mgkit.io.gff.Annotation uses its uid to hash the instance. This allows the use in sets (mainly for filtering). However, hashing is not supported in mgkit.io.gff.GenomicRange.
mgkit.io.utils.open_file() now always opens and writes files in binary mode. This is one of the suggestions to keep compatibility between 2.x and 3.x. So if used directly the output must be decoded from ascii, which is the format used in text files (at least in bioinformatics). However, this is not required for the parsers, like mgkit.io.gff.parse_gff(), mgkit.io.fasta.load_fasta() along with others (and the corresponding write_ functions): they return unicode strings when parsing and encode into ascii when writing.
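As an illustration of what decoding by hand means when a file is opened in binary mode, the sketch below reads bytes and turns each line into text. It uses io.BytesIO in place of a real file, and the helper iter_text_lines is hypothetical, not part of mgkit.

```python
import io


def iter_text_lines(handle, encoding="ascii"):
    """Yield decoded lines (without the trailing newline) from a binary handle."""
    for raw in handle:
        yield raw.decode(encoding).rstrip("\n")


# Stand-in for a file opened in binary mode
binary_handle = io.BytesIO(b"seq1\t1\t10\nseq2\t5\t20\n")
lines = list(iter_text_lines(binary_handle))
```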
In general new projects will be worked on using Python 3.5 and the next releases will prioritise that (0.4.0 and later). If bugfixes are needed and Python 3 cannot be used, this version branch (0.3.x) will be used to fix bugs for users.
0.3.3¶
Added¶
module mgkit.counts.glm, with functions used to help the fit of Generalised Linear Models (GLM)
added sync, sample_stream and rand_seq commands to sampling-utils script
mgkit.taxon.UniprotTaxonomy.get_lineage_string()
mgkit.taxon.UniprotTaxonomy.parse_gtdb_lineage()
Changed¶
added seq_id as a special attribute to mgkit.io.gff.Annotation.get_attr()
mgkit.io.gff.from_prodigal_frag() is tested and fixed
added cache in mgkit.utils.dictionary.HDFDict
mgkit.utils.sequence.sequence_gc_content() now returns 0.5 when denominator is 0
add-gff-info addtaxa -a now accepts seq_id as lookup, to use output from taxon-utils lca (after cutting output)
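The zero-denominator case mentioned for the GC content can be pictured with a small stand-alone function. This is a sketch under the assumption that GC content is computed as (G + C) / (G + C + A + T); the function gc_content below is hypothetical and is not the mgkit implementation.

```python
def gc_content(seq, default=0.5):
    """Return the GC fraction of a sequence, or `default` if no A/C/G/T is present."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        # Nothing to count (e.g. an empty or all-N sequence): avoid dividing by zero
        return default
    return gc / total
```

Returning 0.5 treats a sequence with no information as neither GC- nor AT-rich, instead of raising a ZeroDivisionError.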
Deprecated¶
0.3.2¶
Removed deprecated code
0.3.1¶
This release adds several scripts and commands. Successive 0.3.x releases will be used to fix bugs and refine the APIs and CLI. Most importantly, since the publishing of the first paper using the framework, the releases will go toward the removal of as much deprecated code as possible. At the same time, a general review of the code to be able to run on Python3 (probably via the six package) will start. The general idea is to reach a full removal of legacy code in 0.4.0, while full Python3 compatibility is the aim of 0.5.0, which also means dropping dependencies that are not compatible with Python3.
Added¶
mgkit.graphs.from_kgml() to make a graph from a KGML file (allows for directionality)
mgkit.graphs.add_module_compounds(): updates a graph with compounds information as needed
mgkit.kegg.parse_reaction(): parses a reaction equation from Kegg
added --no-frame option to hmmer2gff - Convert HMMER output to GFF, to use non translated protein sequences. Also changed the mgkit.io.gff.from_hmmer() function to enable this behaviour
added options --num-gt and --num-lt to the values command of filter-gff - Filter GFF annotations to filter based on > and < inequality, in addition to >= and <=
added uid as command in fasta-utils - Fasta Utilities to make unique fasta headers
methods to make mgkit.db.mongo.GFFDB behave like a dictionary (an annotation uid can be used as a key to retrieve it, instead of a query), this includes the possibility to iterate over it, but what is yielded are the values, not the keys (i.e. mgkit.io.gff.Annotation instances, not uid)
added mgkit.counts.func.from_gff() to load count data stored inside a GFF, as is the case when the counts command of add-gff-info - Add informations to GFF annotations is used
added mgkit.kegg.KeggClientRest.conv() and mgkit.kegg.KeggClientRest.find() operations to mgkit.kegg.KeggClientRest
mgkit.kegg.KeggClientRest now caches calls to several methods. The cache can be written to disk using mgkit.kegg.KeggClientRest.write_cache() or emptied via mgkit.kegg.KeggClientRest.empty_cache()
added mgkit.utils.dictionary.merge_dictionaries() to merge multiple dictionaries where the keys contain different values
added a Docker file to make a preconfigured mgkit/jupyter build
added C functions (using Cython) for tetramer/kmer counting. The C functions are the default, with the pure python implementation having a _ appended to their names. This is because the Cython functions cannot have docstrings
added mgkit.plots.utils.legend_patches() to create matplotlib patches, to be in legends
added scripts to download IDs to taxa tables from NCBI/Uniprot
added cov command to get-gff-info - Extract informations to GFF annotations and filter-gff - Filter GFF annotations
added mgkit.io.fasta.load_fasta_prodigal(), to load the fasta file from prodigal for called genes (tested on aminoacids)
added option to output a JSON file to the lca command in taxon-utils and cov command in get-gff-info - Extract informations to GFF annotations
added a bash script, sort-gff.sh to help sort a GFF
added mgkit.taxon.UniprotTaxonomy.get_lineage() which simplifies the use of mgkit.taxon.get_lineage()
added mgkit.io.fastq.load_fastq() as a simple parser for fastq files
added a new script, sampling-utils - Resampling Utilities
added mgkit.utils.common.union_ranges() and mgkit.utils.common.complement_ranges()
added to_hdf command to taxon-utils - Taxonomy Utilities to create a HDF5 file to lookup taxa tables from NCBI/Uniprot
added --hdf-table option to addtaxa command in add-gff-info - Add informations to GFF annotations
mgkit.taxon.UniprotTaxonomy.add_taxon(), mgkit.taxon.UniprotTaxonomy.add_lineage() and mgkit.taxon.UniprotTaxonomy.drop_taxon()
Changed¶
changed domain to superkingdom as for NCBI taxonomy in mgkit.taxon.UniprotTaxonomy.read_from_gtdb_taxonomy()
updated scripts documentation to include installed but non advertised scripts (like translate_seq)
mgkit.kegg.KeggReaction was reworked to only store the equation information
some commands in fastq-utils - Fastq Utilities did not support standard in/out, also added the script usage to the script details
translate_seq now supports standard in/out
added haplotypes parameter to mgkit.snps.funcs.combine_sample_snps()
an annotation from mgkit.db.mongo.GFFDB now doesn’t include the lineage, because it conflicts with the string used in a GFF file
mgkit.io.gff.Annotation.coverage() now returns a float instead of an int
moved code from package mgkit.io to mgkit.io.utils
changed behaviour of mgkit.utils.common.union_range()
removed mgkit.utils.common.range_substract_()
added progressbar2 as installation requirement
changed how mgkit.taxon.UniprotTaxonomy.find_by_name()
Fixed¶
Besides smaller fixes:
mgkit.plots.abund.draw_circles() behaviour when sizescale doesn’t have the same shape as order
parser is now correct for taxon-utils - Taxonomy Utilities, to include the Krona options
condition when a blast output is empty, hence lineno is not initialised when a message is logged
Deprecated¶
translate_seq will be removed in version 0.4.0, instead use the similar command in fasta-utils - Fasta Utilities
0.3.0¶
A lot of bugs were fixed in this release, especially for reading NCBI taxonomy and using the msgpack format to save a UniprotTaxonomy instance. Also added a tutorial for profiling a microbial community using MGKit and BLAST (Profile a Community with BLAST)
Added¶
mgkit.align.read_samtools_depth() to read the samtools depth format iteratively (returns a generator)
mgkit.align.SamtoolsDepth, used to cache the samtools depth format, while requesting region coverage
mgkit.kegg.KeggModule.find_submodules(), mgkit.kegg.KeggModule.parse_entry2()
mgkit.utils.dictionary.cache_dict_file() to cache a large dictionary file (tab separated file with 2 columns), an example of its usage is in the documentation
mgkit.taxon.UniprotTaxonomy.read_from_gtdb_taxonomy() to read a custom taxonomy from a tab separated file. The taxon_id are not guaranteed to be stable between runs
added cov_samtools to add-gff-info script
added mgkit.workflow.fasta_utils and corresponding script fasta-utils
added options -k and -kt to taxon_utils, which outputs a file that can be used with Krona ktImportText (needs to use -q with this script)
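To give an idea of what reading the samtools depth format iteratively looks like, here is a minimal generator. It is not the mgkit function: the name iter_depth and its return values are hypothetical, and it assumes a depth file produced from a single alignment file, with three tab-separated columns (sequence, position, depth) and all lines of a sequence adjacent to each other.

```python
import io
from itertools import groupby


def iter_depth(handle):
    """Yield (seq_id, positions, depths) for each sequence found in the handle."""
    records = (line.rstrip("\n").split("\t") for line in handle if line.strip())
    # Lines for the same sequence are adjacent, so groupby can split them lazily
    for seq_id, group in groupby(records, key=lambda fields: fields[0]):
        positions = []
        depths = []
        for _, pos, depth in group:
            positions.append(int(pos))
            depths.append(int(depth))
        yield seq_id, positions, depths


example = io.StringIO("contig1\t1\t3\ncontig1\t2\t4\ncontig2\t10\t1\n")
result = list(iter_depth(example))
```

Because only one sequence is held in memory at a time, the same approach works for depth files that are much larger than the available memory.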
Changed¶
added no_zero parameter to mgkit.io.blast.parse_accession_taxa_table()
changed behaviour of mgkit.kegg.KeggModule and some of its methods.
added with_last parameter to mgkit.taxon.get_lineage()
added --split option to add-gff-info exp_syn and get-gff-info sequence scripts, to emulate BLAST behaviour in parsing sequence headers
added -c option to add-gff-info addtaxa
0.2.5¶
Changed¶
added the only_ranked argument to mgkit.taxon.get_lineage()
add-gff-info addtaxa (add-gff-info - Add informations to GFF annotations) doesn’t preload the GFF file if a dictionary is used instead of the taxa table
blast2gff blastdb (blast2gff - Convert BLAST output to GFF) offers more options to control the format of the header in the DB used
added the sequence command to filter-gff (filter-gff - Filter GFF annotations), to filter all annotations on a per-sequence base, based on mean bitscore or other comparisons
Added¶
added representation of mgkit.taxon.UniprotTaxonomy, it shows the number of taxa in the instance
added taxon_utils (taxon-utils - Taxonomy Utilities) to filter GFF based on their taxonomy and find the last common ancestor for a reference sequence based on either GFF annotations or a list of taxon_ids (in a text file)
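The idea of a last common ancestor can be illustrated without a taxonomy object, by comparing lineages written from the root down. The sketch below is a generic illustration and does not reproduce how taxon-utils or mgkit.taxon compute it; the function name and the example lineages are only illustrative.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages (root first), or None."""
    if not lineages:
        return None
    ancestor = None
    # Walk the lineages level by level, stopping at the first disagreement
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        ancestor = level[0]
    return ancestor


# Illustrative lineages, written from the root to the most specific rank
lineages = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Pseudomonadales"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
]
lca = last_common_ancestor(lineages)
```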
0.2.4¶
Changed¶
mgkit.utils.sequence.get_contigs_info() now accepts a dictionary name->seq or a list of sequences, besides a file name (r536)
add-gff-info counts command now removes trailing commas from the samples list
the axes are turned off after the dendrogram is plotted
Fixed¶
the snp_parser script requirements were set wrong in setup.py (r540)
uncommented lines to download sample data to build documentation (r533)
add-gff-info uniprot command now writes the lineage attribute correctly (r538)
0.2.3¶
The installation dependencies are more flexible, with only numpy being required. To install every needed package, you can use:
$ pip install mgkit[full]
Added¶
new option to pass the query sequences to blast2gff, this allows to add the correct frame of the annotation in the GFF
added the attributes evalue, subject_start and subject_end to the output of blast2gff. The subject start and end position allow to understand on which frame of the subject sequence the match was found
added the options to annotate the heatmap with the numbers. Also updated the relative example notebook
Added the option to read the taxonomy from NCBI dump files, using mgkit.taxon.UniprotTaxonomy.read_from_ncbi_dump(). This makes it faster to get the taxonomy file
added argument to return information from mgkit.net.embl.datawarehouse_search(), in the form of tab separated data. The argument fields can be used when display is set to report. An example on how to use it is in the function documentation
added a bash script download-taxonomy.sh that downloads the taxonomy
added script venv-docs.sh to build the documentation in HTML under a virtual environment. matplotlib on MacOS X raises a RuntimeError, because of a bug in virtualenv. The documentation can first be built with this script, after create-apidoc.sh is used to create the API documentation. The rest of the documentation (e.g. the PDF) can be created with make as usual, afterwards
added mgkit.net.pfam, with only one function at the moment, that returns the descriptions of the families.
added pfam command to add-gff-info, using the mentioned function, it adds the description of the Pfam families in the GFF file
added a new exception, used internally when an additional dependency is needed
Changed¶
using the NCBI taxonomy dump has two side effects:
the scientific/common names are kept as is, not lower cased as was before
a merged file is provided for taxon_id that changed. While the old taxon_id is kept in the taxonomy, this points to the new taxon, to keep backward compatibility
renamed the add-gff-info gitaxa command to addtaxa. It now accepts more data sources (dictionaries) and is more general
changed mgkit.net.embl.datawarehouse_search() to automatically set the limit at 100,000 records
the taxonomy can now be saved using msgpack, making it faster to read/write it. It’s also more compact and has a better compression ratio
the mgkit.plots.heatmap.grouped_spine() now accepts the rotation of the labels as option
added option to use another attribute for the gene_id in the get-gff-info script gtf command
added a function to compare the version of MGKit used, throwing a warning, when it’s different (mgkit.check_version())
removed test for old SNPs structures and added the same tests for the new one
mgkit.snps.classes.GeneSNP now caches the number of synonymous and non-synonymous SNPs for better speed
mgkit.io.gff.GenomicRange.__contains__() now also accepts a tuple (start, end) or another GenomicRange instance
Fixed¶
a bug in the gitaxa (now addtaxa) command: when a taxon_id was not found in the table, the wrong taxon_name and lineage was inserted
bug in mgkit.snps.classes.GeneSNP that prevented the correct addition of values
fixed bug in mgkit.snps.funcs.flat_sample_snps() with the new class
mgkit.io.gff.parse_gff() now correctly handles comment lines and stops parsing if the fasta file at the end of a GFF is found
0.2.2¶
Added¶
new commands for the add-gff-info script (add-gff-info - Add informations to GFF annotations):
eggnog to add information from eggNOG HMMs (at the moment the 4.5 Viral)
counts and fpkms to add count data (correctly exported to mongodb)
gitaxa to add taxonomy information directly from GI identifiers from NCBI
added blastdb command to blast2gff script (blast2gff - Convert BLAST output to GFF)
updated MGKit GFF Specifications
added gtf command to get-gff-info script (get-gff-info - Extract informations to GFF annotations) to convert a GFF to GTF, that is accepted by featureCounts, in conjunction with the counts command of add-gff-info
added method mgkit.snps.classes.RatioMixIn.calc_ratio_flag to calculate special cases of pN/pS
Changed¶
added argument in functions of the mgkit.snps.conv_func to bypass the default filters
added use_uid argument to mgkit.snps.funcs.combine_sample_snps() to use the uid instead of the gene_id when calculating pN/pS
added flag_values argument to mgkit.snps.funcs.combine_sample_snps() to use mgkit.snps.classes.RatioMixIn.calc_ratio_flag instead of mgkit.snps.classes.RatioMixIn.calc_ratio
Removed¶
deprecated code from the snps package
0.2.1¶
Added¶
added mgkit.db.mongo
added mgkit.db.dbm
added mgkit.io.gff.from_json()
added mongodb and dbm commands to script get-gff-info
added kegg command to add-gff-info script, caching results and -d option to uniprot command
added -ft option to blast2gff script
added -ko option to download_profiles
added new HMMER tutorial
added another notebook to the plot examples, for misc. tips
added a script that downloads from figshare the tutorial data
added function to get an enzyme full name (mgkit.mappings.enzyme.get_enzyme_full_name())
added example notebook for using GFF annotations and the mgkit.db.dbm, mgkit.db.mongo modules
Changed¶
mgkit.taxon.UniprotTaxonomy.read_taxonomy()
changed behaviour of hmmer2gff script
changed tutorial notebook to specify the directory where the data is
Deprecated¶
mgkit.filter.taxon.filter_taxonomy_by_lineage()
mgkit.filter.taxon.filter_taxonomy_by_rank()
Removed¶
removed old filter_gff script
0.2.0¶
added creation of wheel distribution
changes to ensure compatibility with later pandas versions
mgkit.io.gff.Annotation.get_ec() now returns a set, reflected changes in tests
added a --cite option to scripts
fixes to tutorial
updated documentation for sphinx 1.3
changes to diagrams
added decorator to raise warnings for deprecated functions
added possibility for mgkit.counts.func.load_sample_counts() info_dict to be a function instead of a dictionary
consolidation of some eggNOG structures
added more spine options in mgkit.plots.heatmap.grouped_spine()
added a length property to mgkit.io.gff.Annotation
changed filter-gff script to customise the filtering function, from the default one, also updated the related documentation
fixed a few plot functions
0.1.16¶
changed default parameter for mgkit.plots.boxplot.add_values_to_boxplot()
Added include_only filter option to the default snp filters mgkit.consts.DEFAULT_SNP_FILTER
the default filter for SNPs now uses an include only option, by default including only protozoa, archaea, fungi and bacteria in the matrix
added widths parameter to mgkit.plots.boxplot.boxplot_dataframe() function, added function mgkit.plots.boxplot.add_significance_to_boxplot() and updated example boxplot notebook for new function example
use_dist and dist_func parameters to the mgkit.plots.heatmap.dendrogram() function
added a few constants and functions to calculate the distance matrices of taxa: mgkit.taxon.taxa_distance_matrix(), mgkit.taxon.distance_taxa_ancestor() and mgkit.taxon.distance_two_taxa()
mgkit.kegg.KeggClientRest.link_ids() now accepts a dictionary as list of ids
if the conversion of an Annotation attribute (first 8 columns) raises a ValueError in mgkit.io.gff.from_gff(), by default the parser keeps the string version (cases for phase, where it is ‘.’ instead of a number)
treat cases where an attribute is set with no value in mgkit.io.gff.from_gff()
added mgkit.plots.colors.palette_float_to_hex() to convert floating value palettes to string
forces vertical alignment of tick labels in heatmaps
added parameter to get a consensus sequence for an AA alignment, by adding the nucl parameter to mgkit.utils.sequence.Alignment.get_consensus()
added mgkit.utils.sequence.get_variant_sequence() to get variants of a sequence, essentially changing the sequence according to the SNPs passed
added method to get an aminoacid sequence from Annotation in mgkit.io.gff.Annotation.get_aa_seq() and added the possibility to pass a SNP to get the variant sequence of an Annotation in mgkit.io.gff.Annotation.get_nuc_seq().
added exp_syn command to add-gff-info script
changed GTF file conversion
changed behaviour of mgkit.taxon.is_ancestor(): if a taxon_id raises a KeyError, False is now returned. In other words, if the taxon_id is not found in the taxonomy, it’s not an ancestor
added mgkit.io.gff.GenomicRange.__contains__(). It tests if a position is inside the range
added mgkit.io.gff.GenomicRange.get_relative_pos(). It returns a position relative to the GenomicRange start
fixed documentation and bugs (Annotation.get_nuc_seq)
added mgkit.io.gff.Annotation.is_syn(). It returns True if a SNP is synonymous and False if non-synonymous
added to_nuc parameter to mgkit.io.gff.from_nuc_blast() function. If to_nuc is False, it is assumed that the hit was against an amino acidic DB, in which case the phase should always be set to 0
reworked internals of snp_parser script. It doesn’t use SNPDat anymore
updated tutorial
added ipython notebook as an example to explore data from the tutorial
cleaned deprecated code, fixed imports, added tests and documentation
0.1.15¶
changed name of mgkit.taxon.lowest_common_ancestor() to mgkit.taxon.last_common_ancestor(), the old function name points to the new one
added mgkit.counts.func.map_counts_to_category() to remap counts from one ID to another
added get-gff-info script to extract information from GFF files
script download_data can now download only taxonomy data
added more script documentation
added examples on gene prediction
added function mgkit.io.gff.from_hmmer() to parse HMMER results and return mgkit.io.gff.Annotation instances
added mgkit.io.gff.Annotation.to_gtf() to return a GTF line, mgkit.io.gff.Annotation.add_gc_content() and mgkit.io.gff.Annotation.add_gc_ratio() to calculate GC content and ratio respectively
added mgkit.io.gff.parse_gff_files() to parse multiple GFF files
added uid_used parameter to several functions in mgkit.counts.func
added mgkit.plots.abund to plot abundance plots
added example notebooks for plots
HTSeq is now required only by the scripts that use it, snp_parser and fastq_utils
added function to convert numbers when reading from htseq count files
changed behavior of -b option in add-gff-info taxonomy command
0.1.14¶
added IPython notebooks to the documentation. As of this version the included ones (in docs/source/examples) are for two plot modules. Also added a bash script to convert them into rst files to be included with the documentation. The .rst files are not versioned, and they must be rebuilt, meaning that one of the requirements for building the docs is to have IPython installed with the notebook extension
now importing some packages automatically import the subpackages as well
refactored
mgkit.plots
into a package, with most of the original functions imported into it, for backward compatibilityadded box_vert parameter in
mgkit.plots.boxplot.add_values_to_boxplot()
, the default will be changed in a later version (kept for compatibility with older scripts/notebooks)added an heatmap module to the plots package. Examples are in the notebook
added mgkit.align.covered_annotation_bp() to find the number of bp covered by reads in annotations (as opposed to using the annotation length)
added documentation to mgkit.mappings.eggnog.NOGInfo and an additional method
added mgkit.net.uniprot.get_uniprot_ec_mappings() as it was used in a few scripts already
added mgkit.mappings.enzyme.change_mapping_level() and others to deal with EC numbers. Also improved documentation with some examples
added mgkit.counts.func.load_sample_counts_to_genes() and mgkit.counts.func.load_sample_counts_to_taxon(), for mapping counts to only genes or taxa. Also added index parameter in mgkit.counts.func.map_counts() to accommodate the changes
added mgkit.net.uniprot.get_ko_to_eggnog_mappings() to get mappings of KO identifiers to eggNOG
added mgkit.io.gff.split_gff_file() to split a GFF file into several ones, ensuring that all annotations for a sequence are in the same file; useful to split massive GFF files before filtering
added mgkit.counts.func.load_deseq2_results() to load DESeq2 results in CSV format
added mgkit.counts.scaling.scale_rpkm() to scale a count table with RPKM
added caching options to mgkit.counts.func.load_sample_counts() and others
fixes and improvements to documentation
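For readers unfamiliar with RPKM scaling, the following is a minimal sketch of the standard formula (reads per kilobase of gene per million mapped reads) on plain dictionaries. It is not the code behind scale_rpkm(); the function name and the input layout are assumptions made for the example.

```python
def rpkm(counts, lengths):
    """Scale raw read counts to RPKM.

    counts:  {gene_id: number of reads mapped to the gene}
    lengths: {gene_id: gene length in bp}
    """
    total_reads = sum(counts.values())
    if total_reads == 0:
        # No mapped reads at all: every gene scales to zero
        return {gene_id: 0.0 for gene_id in counts}
    # RPKM = reads * 10^9 / (total mapped reads * gene length in bp)
    return {
        gene_id: count * 1e9 / (total_reads * lengths[gene_id])
        for gene_id, count in counts.items()
    }
```

With this formula, two genes with the same count but different lengths get different values, the longer gene scoring lower.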
0.1.13
added counts package, including functions to load HTSeq-counts results and scaling
added mgkit.filter.taxon.filter_by_ancestor(), as a convenience function
deprecated functions in mgkit.io.blast module, added more to parse blast outputs (some specific)
mgkit.io.fasta.load_fasta() returns uppercase sequences, added a function (mgkit.io.fasta.split_fasta_file()) to split fasta files
added more methods to mgkit.io.gff.Annotation to complete API from old annotations
fixed mgkit.io.gff.Annotation.dbq property to return an int (bug in filtering with filter-gff)
added function to extract the sequences covered by annotations, using the mgkit.io.gff.Annotation.get_nuc_seq() method
added mgkit.io.gff.correct_old_annotations() to update old annotated GFF to new conventions
added mgkit.io.gff.group_annotations_by_ancestor() and mgkit.io.gff.group_annotations_sorted()
moved deprecated GFF classes/modules in mgkit.io.gff_old
added mgkit.io.uniprot module to read/write Uniprot files
added mgkit.kegg.KeggClientRest.get_ids_names() to replace the old methods used to retrieve specific class names (they are deprecated at the moment)
added mgkit.kegg.KeggModule to parse a Kegg module entry
added mgkit.net.embl.datawarehouse_search() to search EMBL resources
made mgkit.net.uniprot.query_uniprot() more flexible
added/changed plot function in mgkit.plots
added enum34 as a dependency for Python versions below 3.4
changed classes to hold SNPs data: deprecated mgkit.snps.classes.GeneSyn, replaced by mgkit.snps.classes.GeneSNP, which uses the enum module for mgkit.snps.classes.SNPType
added mgkit.taxon.NoLcaFound
fixed behaviour of mgkit.taxon.UniprotTaxonomy.get_ranked_taxon() for newer taxonomies
changed behaviour of mgkit.taxon.UniprotTaxonomy.is_ancestor() to use the module function mgkit.taxon.is_ancestor() and accept multiple taxon IDs to test
mgkit.taxon.UniprotTaxonomy.load_data() now accepts compressed data and file handles
added mgkit.taxon.lowest_common_ancestor() to find the lowest common ancestor of two taxon IDs
changed behaviour of mgkit.taxon.parse_uniprot_taxon()
added functions to get GC content, ratio of a sequence and its composition to mgkit.utils.sequence
added more options to blast2gff script
added coverage, taxonomy and unipfile to add-gff-info
refactored snp_parser to use new classes
added possibility to use sorted GFF files as input for filter-gff to use less memory (the examples show how to use sort in Unix)
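To make the lowest common ancestor idea from this release concrete, here is a small sketch on a toy taxonomy stored as a child-to-parent dictionary. It is only an illustration under my own assumptions: the data layout, the function names and the NoCommonAncestor exception are hypothetical and are not MGKit's API, and the taxon IDs are made up.

```python
class NoCommonAncestor(LookupError):
    """Raised when two taxa share no ancestor (hypothetical name)."""


def lineage(parents, taxon_id):
    """Return the path from taxon_id up to its root, taxon_id included."""
    path = [taxon_id]
    while parents.get(path[-1]) is not None:
        path.append(parents[path[-1]])
    return path


def lowest_common_ancestor(parents, taxon_id1, taxon_id2):
    """Return the deepest taxon found in the lineages of both IDs."""
    ancestors = set(lineage(parents, taxon_id1))
    # Walking up from the second taxon, the first hit is the deepest one
    for taxon_id in lineage(parents, taxon_id2):
        if taxon_id in ancestors:
            return taxon_id
    raise NoCommonAncestor((taxon_id1, taxon_id2))


# Toy tree: 1 is the root, 2 and 3 are its children, 4 and 5 are under 2
toy_parents = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}
```

On the toy tree, taxa 4 and 5 meet at 2, while 4 and 3 only meet at the root.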
0.1.12
added functions to elongate annotations, measure their coverage and diff GFF files in mgkit.io.gff
added ranges_length and union_ranges to mgkit.utils.common
added script filter-gff, filter_gff will be deprecated
added script blast2gff to convert blast output to a GFF
removed unneeded dependencies to build docs
added script add-gff-info to add more annotations to GFF files
added mgkit.io.blast.parse_blast_tab() to parse BLAST tabular format
added mgkit.io.blast.parse_uniprot_blast() to return annotations from a BLAST tabular file
added mgkit.graph module
added classes mgkit.io.gff.Annotation and mgkit.io.gff.GenomicRange and deprecated old classes to handle GFF annotations (API not stable)
added mgkit.io.gff.DuplicateKeyError raised in parsing GFF files
added functions used to return annotations from several sources
added option gff_type in mgkit.io.gff.load_gff()
added mgkit.net.embl.dbfetch()
added mgkit.net.uniprot.get_gene_info(), mgkit.net.uniprot.query_uniprot() and mgkit.net.uniprot.parse_uniprot_response()
added apply_func_to_values to mgkit.utils.dictionary
added mgkit.snps.conv_func.get_full_dataframe(), mgkit.snps.conv_func.get_gene_taxon_dataframe()
added more tests
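The range helpers added in this release deal with merging intervals and measuring them. A minimal sketch of that technique follows, assuming closed, 1-based intervals as used in GFF files. The names merge_ranges and total_length are hypothetical, chosen so they are not mistaken for the library functions, and the exact behaviour of ranges_length and union_ranges may differ.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent (start, end) intervals.

    Intervals are closed and 1-based, so (1, 5) and (6, 8) are adjacent
    and get merged into (1, 8).
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Extend the last interval instead of starting a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(ranges):
    """Sum the lengths of closed intervals (end - start + 1 each)."""
    return sum(end - start + 1 for start, end in ranges)
```

Merging before measuring avoids counting the same position twice, which is what makes this useful for measuring how much of a sequence is covered.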
0.1.11
removed rst2pdf for generating a PDF of the documentation. LaTeX is preferred
corrections to documentation and example script
removed need for joblib library in translate_seq script: used only if available (for using multiple processors)
deprecated mgkit.snps.funcs.combine_snps_in_dataframe() and mgkit.snps.funcs.combine_snps_in_dataframe(): mgkit.snps.funcs.combine_sample_snps() should be used
refactored some tests and added more
added docs_req.txt to help build the documentation on readthedocs.org
renamed mgkit.snps.classes.GeneSyn gid and taxon attributes to gene_id and taxon_id. The old names are still available for use (via properties), but they will be taken out in later versions. Old pickle data should be loaded and saved again in this release
added a few convenience functions to ease the use of combine_sample_snps()
added function mgkit.snps.funcs.significance_test() to test the distributions of genes shared between two taxa
fixed an issue with deinterleaving sequence data from khmer
added method to mgkit.kegg.KeggClientRest to get names for all ids of a certain type (more generic than the various get_*_names)
added first implementation of mgkit.kegg.KeggModule class to parse a Kegg module entry
mgkit.snps.conv_func.get_rank_dataframe(), mgkit.snps.conv_func.get_gene_map_dataframe()