{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Examples of the *mgkit.db* package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from mgkit.io import gff\n",
    "import mgkit.net\n",
    "from mgkit.db import dbm\n",
    "from mgkit.db import mongo\n",
    "import mgkit.taxon\n",
    "import gzip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download Example GFF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# This will just load the data from the repository and save it in the same directory as\n",
    "# this notebook\n",
    "data = mgkit.net.url_read('https://bitbucket.org/setsuna80/mgkit/downloads/assembly.gff.gz')\n",
    "# The data is compressed\n",
    "open('assembly.gff.gz', 'w').write(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GFF Annotations\n",
    "The are a few ways to load the GFF, but the result of *parse_gff* is a generator that yields one annotation at a time when it's iterate over. One way to keep the annotations in memory is building a dictionary, with the unique identifier for each annotation (called **uid**) used as key, while the annotation object is the value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# *mgkit.io.gff.parse_gff* can read compressed data, gzip, bz2 (also lzma on python3)\n",
    "annotations = {\n",
    "    annotation.uid: annotation\n",
    "    for annotation in gff.parse_gff('assembly.gff.gz')\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each annotation is parsed and an instance of the **mgkit.io.gff.Annotation** is created. The class contains several properties, like the unique identifier (uid), the gene identifier (gene_id) and the taxon identifier (taxon_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NODE_57290(-):1-87\n",
      "d002b31c-1d78-438c-b8f9-aba791807724 Q72QU2 2\n"
     ]
    }
   ],
   "source": [
    "annotation = annotations['d002b31c-1d78-438c-b8f9-aba791807724']\n",
    "print annotation\n",
    "\n",
    "print annotation.uid, annotation.gene_id, annotation.taxon_id"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other properties and methods can be accessed, like the *Annotation.get_mappings* to get a dictionary of all mappings, or using the *len* function on the instance to get it's length (or using the property *length*)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "87 87\n",
      "{'ko': ['K03695']}\n"
     ]
    }
   ],
   "source": [
    "print len(annotation), annotation.length\n",
    "print annotation.get_mappings()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Taxonomy and Annotations\n",
    "When using metagenomics, one of the problem is associated functionality to taxonomy. MGKit contains a class that can read the taxonomy from [Uniprot](www.uniprot.org), which is compatible with NCBI taxonomy. The **mgkit.taxon** contains the **UniprotTaxonomy** that is use to store and in part search the taxonomy. The module contains many more functions to resolve different levels of the taxonomy. A few examples applied to the annotations loaded follow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# This will just load the data from the repository and save it in the same directory as\n",
    "# this notebook\n",
    "data = mgkit.net.url_read(\n",
    "    \"https://bitbucket.org/setsuna80/mgkit/downloads/taxonomy.pickle.gz\"\n",
    ")\n",
    "open('taxonomy.pickle.gz', 'w').write(data)\n",
    "del data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Using compress taxonomy files makes it slower to load\n",
    "taxonomy = mgkit.taxon.UniprotTaxonomy('taxonomy.pickle.gz')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[171549]"
      ]
     },
     "execution_count": 103,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# to find the Bacteoidales taxon identifier\n",
    "taxonomy.find_by_name('bacteroidales')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7233587c-b80d-4908-8ead-92734deeec81 Q7MV19\n",
      "5322b316-46e5-44cf-9eb1-ef94355c7855 Q01VN6\n",
      "a7308f6f-7b17-4b00-8afa-92ebecef3dd3 Q8XP14\n",
      "195118b7-1236-48ad-8812-e0ec3100e7d9 Q7MV19\n",
      "14b3cc41-050a-4949-b085-75db0cda12ec Q8A294\n",
      "d1dad026-09ac-48e4-95fe-158e39d96a0d P49008\n",
      "01b819f8-1444-4f25-a3fa-93e160fa58c2 Q7MVL1\n",
      "4b3ce614-cc8a-47ea-a046-f9ca7c7ab16c Q5LI72\n",
      "65bae5c6-0d23-4a08-ae3f-aec2763f4621 Q7MV19\n",
      "3aef43ea-4e94-4940-bf85-743950e5ad8a Q9AGG3\n",
      "16794c3c-97b8-4453-8d14-a5e37c8969b4 A6LB11\n",
      "bd92adff-b8d9-411f-9488-7604eb580fd6 Q89YZ6\n",
      "3441f906-f63d-45fe-a4e5-e639439d19db A6LD25\n",
      "cd08ae89-1f1e-4875-851e-c0c55de8c764 A6LA51\n",
      "b3bf4054-4f31-4a8a-bf19-fd0e65c56867 A6LI30\n",
      "44bdfb77-1606-4194-b410-9a22c75b3b5b Q7MV19\n",
      "3b5e126e-25ec-460d-a439-2520bebe0a3d A6KZH6\n",
      "e908c5b1-9dec-4406-b952-009aab3fd778 A6LDS1\n",
      "b425ef29-0bc0-4de7-ad96-abe5c7b75f96 Q8A9M7\n",
      "2a3558c3-6f7b-49a4-a8d3-c2b0cef287d6 Q7MXZ1\n",
      "376d70e0-6591-4b2b-9a06-1d9fb7fdbc66 Q7M9Y2\n",
      "ff7fd5ef-9be2-404c-8137-89f368071a4e Q8A294\n",
      "2665ff2c-4e9a-4c7a-9604-8433fa2ae202 A6LHY5\n",
      "dd9a44d5-ed1e-4350-b05c-f0cfd510e669 A6L170\n",
      "255e75a1-a59c-43fd-9396-17a3566b3063 Q8A0F5\n",
      "49474358-7962-4b0c-b52a-5de935f17bfc A6LFA6\n",
      "27eb1efe-ff07-401c-93db-958a38e866bc Q7MWM7\n",
      "746805f5-0fdc-4499-953f-7be496b9c784 Q7MU65\n",
      "e3be2158-c013-4e58-a073-ab8e3c893094 Q8A8Y4\n",
      "e028b0e9-802f-4f1b-b055-f5ecca786170 Q8A1D3\n",
      "f2919fc6-d8e2-4fe7-ac9f-152c46d0ebbb Q7MV19\n",
      "b65468b2-d4e7-456b-871d-9cd96fa4dd48 Q02XT4\n",
      "e1643d1d-12c3-4397-a6a7-d2a24f203c4a Q8A294\n",
      "0d9cd52c-5969-49f7-866c-e8c5c9783b79 Q8A294\n",
      "cdd362ba-448f-475f-a638-d6473b471572 A6LD68\n",
      "7af092eb-20c5-46b4-b8bb-e9b0c99a8ce5 Q5LGH0\n",
      "2d9172a4-fe51-4baa-a8fb-66f020ba6452 Q7MVL1\n",
      "Number of annotation: 37\n"
     ]
    }
   ],
   "source": [
    "# to find all the annotations that belong to the Order Bacteroidales\n",
    "count = 0\n",
    "for annotation in annotations.itervalues():\n",
    "    if mgkit.taxon.is_ancestor(taxonomy, annotation.taxon_id, 171549):\n",
    "        count += 1\n",
    "        print annotation.uid, annotation.gene_id\n",
    "print \"Number of annotation:\", count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "set(['arthropoda', 'microsporidia', 'korarchaeota', 'viruses', 'nematoda', 'bacteroidetes', 'nanoarchaeota', 'tenericutes', 'thermotogae', 'chlorophyta', 'cellular organisms', 'fibrobacteres', 'bacteria', 'euryarchaeota', 'verrucomicrobia', 'annelida', 'eukaryota', 'aquificae', 'ascomycota', 'actinobacteria', 'chlorobi', 'deferribacteres', 'archaea', 'bacillariophyta', 'streptophyta', 'chlamydiae', 'apicomplexa', 'dictyoglomi', 'cloacimonetes', 'gemmatimonadetes', 'thaumarchaeota', 'proteobacteria', 'acidobacteria', 'spirochaetes', 'cyanobacteria', 'firmicutes', 'chloroflexi', 'planctomycetes', 'chordata', 'euglenida', 'elusimicrobia', 'basidiomycota', 'xanthophyceae', 'nitrospirae', 'fusobacteria', 'deinococcus-thermus', 'platyhelminthes', 'crenarchaeota'])\n"
     ]
    }
   ],
   "source": [
    "# to find out the Phyla represented in the annotations\n",
    "print set(\n",
    "    taxonomy.get_ranked_taxon(annotation.taxon_id, rank='phylum').s_name\n",
    "    for annotation in annotations.itervalues()\n",
    ")   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "escherichia coli (strain k12) [131567, 2, 1224, 1236, 91347, 543, 561, 562]\n"
     ]
    }
   ],
   "source": [
    "# to get the lineage of the first annotations\n",
    "annotation = annotations['b97ead95-81a7-4caf-8d25-349ee6e276c1']\n",
    "print taxonomy[annotation.taxon_id].s_name, mgkit.taxon.get_lineage(taxonomy, annotation.taxon_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "escherichia coli (strain k12) ['cellular organisms', 'bacteria', 'proteobacteria', 'gammaproteobacteria', 'enterobacteriales', 'enterobacteriaceae', 'escherichia', 'escherichia coli']\n"
     ]
    }
   ],
   "source": [
    "# to get the names, quickly\n",
    "annotation = annotations['b97ead95-81a7-4caf-8d25-349ee6e276c1']\n",
    "print taxonomy[annotation.taxon_id].s_name, mgkit.taxon.get_lineage(taxonomy, annotation.taxon_id, names=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Issues\n",
    "Keeping the annotations in memory can lead to a high memory usage, as well as a long time traversing all of them to specifically filter them. MGKit uses two solutions to interface with DBs, one is using a *dbm-like* database, *semidbm* and the other is using *MongoDB*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## semidbm\n",
    "Packages to use *dbm* database are included with Python, but they depend on the type of OS python is installed onto. A pure Python implementation of a dbm is [semidbm](https://github.com/jamesls/semidbm). As other *dbm*, it works in a similar way as a dictionary, while keeping the memory usage low. To create a *semidbm* DB from annotations, the **get-gff-info** can be used, using the **dbm** command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "assembly-db\r\n",
      "INFO - mgkit.db.dbm: DB \"assembly-db\" opened/created\r\n",
      "INFO - mgkit.io.gff: Loading GFF from file (assembly.gff.gz)\r\n"
     ]
    }
   ],
   "source": [
    "!get-gff-info dbm -d assembly-db assembly.gff.gz "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or interactively, using *mgkit.db.dbm.create_gff_dbm*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "assembly-db\n"
     ]
    }
   ],
   "source": [
    "db = dbm.create_gff_dbm(annotations.itervalues(), 'assembly-db')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which also return an instance of db. *semidbm* allows the use of only strings as keys and strings as values, so for the same annotation as before, you see what MGKit stores in it, the actual GFF line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'NODE_57290\\tBLAST\\tCDS\\t1\\t87\\t51.6\\t-\\t0\\tSRR001322_cov=\"0\";SRR001323_cov=\"0\";SRR001325_cov=\"3\";SRR001326_cov=\"0\";bitscore=\"51.6\";cov=\"3\";db=\"UNIPROT-SP\";dbq=\"10\";exp_nonsyn=\"200\";exp_syn=\"61\";gene_id=\"Q72QU2\";identity=\"75.9\";map_KO=\"K03695\";taxon_db=\"NCBI-NT\";taxon_id=\"2\";uid=\"d002b31c-1d78-438c-b8f9-aba791807724\"\\n'"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "db['d002b31c-1d78-438c-b8f9-aba791807724']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The GFF line must then be converted back into an **Annotation** instance. To automate the process, the **mgkit.db.dbm.GFFDB** class wraps the *semidbm*. The same example as the one above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "NODE_57290(-):1-87"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "db = dbm.GFFDB('assembly-db')\n",
    "db['d002b31c-1d78-438c-b8f9-aba791807724']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It can also be iterated over as a dictionary (for compatibility, both *iteritems* and *items* return an iterator)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "50dccb4d-3a49-41ed-bf8c-a1906172d8a5 NODE_49806(+):3-116\n"
     ]
    }
   ],
   "source": [
    "for uid in db.db:\n",
    "    print uid, db[uid]\n",
    "    break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "50dccb4d-3a49-41ed-bf8c-a1906172d8a5 NODE_49806(+):3-116\n"
     ]
    }
   ],
   "source": [
    "for uid, annotation in db.iteritems():\n",
    "    print uid, annotation\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using this class, it is possible to use a DB as a drop-in replacement for a dictionary in a script that used annotations stored in memory in MGKit. The [examples using the taxonomy](#Taxonomy-and-Annotations) will works in the same way, for example."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using MongoDB\n",
    "[MongoDB](https://www.mongodb.org/) is Document based DB that is not based on SQL. One of the advantage of it the absence of a schema, which makes it easy to insert annotations into it. Moreover, the data in a MongoDB is easily accessible from a variety of programming languages, as well as its own shell. Another advantage is the possiblity to query the annotations and index specific values to speed up them.\n",
    "\n",
    "In the same way as with *dbm*, the **get-gff-info** can help produce a file that can be directly loaded into a *mongod* instance.\n",
    "\n",
    "The following example uses **pymongo** (the official client library for Python) and requires a **mongod** instance running on the same machine. The annotations will be imported into the **test** database, into the **gff** collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2015-12-04T15:38:41.355+1000\tconnected to: localhost\n",
      "2015-12-04T15:38:41.355+1000\tdropping: test.gff\n",
      "INFO - mgkit.io.gff: Loading GFF from file (<stdin>)\n",
      "2015-12-04T15:38:43.830+1000\timported 9135 documents\n"
     ]
    }
   ],
   "source": [
    "!gunzip -c assembly.gff.gz | get-gff-info mongodb | mongoimport --db test --collection gff --drop"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use the **pymongo** module directly or just use the **mgkit.db.mongo.GFFDB** class to automate connection and conversion of the **JSON** documents back into **Annotation** objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "db = mongo.GFFDB('test', 'gff')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "303fbf1f-8140-4f9e-9c44-ae089e67bdc3 O93746\n"
     ]
    }
   ],
   "source": [
    "for annotation in db.find_annotation():\n",
    "    print annotation.uid, annotation.gene_id\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The DB can be queried by passing the **GFF.find_annotation** method the same query that are explained in [Pymongo documentation](https://docs.mongodb.org/getting-started/python/client/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NODE_22940(-):2-97\n",
      "NODE_8691(+):2-88\n",
      "NODE_8691(+):5-91\n",
      "NODE_30222(+):11-97\n",
      "NODE_30222(+):2-82\n",
      "NODE_30222(+):8-94\n",
      "NODE_30222(+):5-91\n",
      "NODE_36783(+):11-115\n",
      "NODE_2009(-):3-104\n",
      "NODE_2009(-):12-110\n",
      "NODE_19876(+):3-113\n",
      "NODE_35927(-):2-76\n",
      "NODE_35927(-):8-163\n",
      "NODE_31317(+):2-73\n",
      "NODE_31317(+):5-88\n",
      "NODE_29415(+):29-100\n",
      "NODE_45868(-):1-96\n",
      "NODE_1013(-):33-128\n",
      "NODE_39238(-):1-90\n",
      "NODE_39238(-):4-93\n",
      "NODE_6581(-):3-116\n",
      "NODE_40758(-):2-163\n",
      "NODE_7805(-):1-117\n",
      "NODE_28135(+):3-116\n",
      "NODE_8575(+):34-123\n",
      "NODE_8575(+):28-114\n",
      "NODE_6979(+):1-99\n",
      "NODE_35052(-):2-106\n",
      "NODE_13245(-):2-94\n",
      "NODE_13245(-):5-97\n",
      "NODE_30508(+):1-99\n",
      "NODE_19190(+):18-227\n",
      "NODE_19190(+):3-113\n",
      "NODE_16671(+):2-106\n"
     ]
    }
   ],
   "source": [
    "# To look for all annotations that have the KO mapping to K01883\n",
    "for annotation in db.find_annotation({'map.ko': 'K01883'}):\n",
    "    print annotation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NODE_22940(-):2-97\n",
      "NODE_30222(+):11-97\n",
      "NODE_30222(+):2-82\n",
      "NODE_30222(+):8-94\n",
      "NODE_30222(+):5-91\n",
      "NODE_40758(-):2-163\n"
     ]
    }
   ],
   "source": [
    "# To look for all annotations that have the KO mapping to K01883 *AND*\n",
    "# the taxonomy was inferred from a blast to NCBI (see refinement of\n",
    "# taxonomy in theTutorial - Gene Prediction)\n",
    "for annotation in db.find_annotation({'map.ko': 'K01883', 'taxon_db': 'NCBI-NT'}):\n",
    "    print annotation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NODE_36848(-):2-94\n",
      "NODE_58432(+):8-124\n",
      "NODE_48731(+):5-118\n",
      "NODE_13988(+):20-190\n",
      "NODE_10564(-):3-101\n",
      "NODE_61599(+):8-106\n",
      "NODE_58191(+):1-99\n",
      "NODE_36561(+):5-115\n",
      "NODE_33951(-):13-99\n",
      "NODE_20537(-):6-101\n",
      "NODE_72294(-):3-95\n"
     ]
    }
   ],
   "source": [
    "# Finding all annotation from a specific taxon\n",
    "for annotation in db.find_annotation({'taxon_id': 224911}):\n",
    "    print annotation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using Taxonomy\n",
    "The usual approach about the taxonomy is to traverse all the annotations (those returned by one of the previous queries, even) and use the functionality in the **mgkit.taxon** module. It is possible to repeat the example that search all annotations that belong to Order *Bacteroidales*, but the records must be loaded with the lineage into the DB. This can be done having a taxonomy file, *taxonomy.pickle.gz* in our case, with the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2015-12-04T16:32:13.785+1000\tconnected to: localhost\n",
      "2015-12-04T16:32:13.786+1000\tdropping: test.gff\n",
      "2015-12-04T16:32:16.783+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:19.783+1000\ttest.gff\t0.0 B\n",
      "INFO - mgkit.taxon: Loading taxonomy from file taxonomy.pickle.gz\n",
      "2015-12-04T16:32:22.785+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:25.782+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:28.784+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:31.780+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:34.783+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:37.780+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:40.782+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:43.782+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:46.785+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:49.785+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:52.783+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:55.783+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:32:58.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:01.780+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:04.783+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:07.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:10.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:13.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:16.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:19.783+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:22.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:25.782+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:28.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:31.783+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:34.785+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:37.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:40.780+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:43.782+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:46.780+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:49.780+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:52.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:55.782+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:33:58.785+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:01.784+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:04.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:07.782+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:10.785+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:13.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:16.784+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:19.783+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:22.780+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:25.785+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:28.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:31.780+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:34.783+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:37.780+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:40.782+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:43.781+1000\ttest.gff\t0.0 B\n",
      "2015-12-04T16:34:46.785+1000\ttest.gff\t0.0 B\n",
      "INFO - mgkit.workflow.extract_gff_info: Using cached calls to lineage\n",
      "INFO - mgkit.io.gff: Loading GFF from file (<stdin>)\n",
      "2015-12-04T16:34:49.783+1000\ttest.gff\t2.7 MB\n",
      "2015-12-04T16:34:51.874+1000\timported 9135 documents\n"
     ]
    }
   ],
   "source": [
    "!gunzip -c assembly.gff.gz | get-gff-info mongodb -t taxonomy.pickle.gz | mongoimport --db test --collection gff --drop"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The script will first load the taxonomy and add to each record in the database the **lineage** key. This contains an array of integers, that are the output of the **mgkit.taxon.lineage** function and can be searched using:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NODE_33533(-):2-64\n",
      "NODE_18827(+):2-127\n",
      "NODE_25363(+):3-95\n",
      "NODE_69486(+):1-111\n",
      "NODE_13380(-):3-95\n",
      "NODE_8404(+):3-176\n",
      "NODE_71367(+):2-106\n",
      "NODE_50779(-):1-102\n",
      "NODE_20694(+):129-221\n",
      "NODE_38976(+):4-102\n",
      "NODE_69904(+):9-110\n",
      "NODE_1963(-):2-94\n",
      "NODE_41194(-):18-98\n",
      "NODE_47622(+):1-99\n",
      "NODE_56590(+):2-103\n",
      "NODE_66803(+):23-169\n",
      "NODE_14043(+):4-96\n",
      "NODE_35099(+):18-122\n",
      "NODE_48598(-):20-97\n",
      "NODE_58511(+):1-96\n",
      "NODE_70185(+):2-103\n",
      "NODE_56348(-):4-93\n",
      "NODE_56348(-):13-102\n",
      "NODE_56348(-):10-99\n",
      "NODE_32336(-):1-114\n",
      "NODE_59685(+):3-107\n",
      "NODE_57945(+):12-134\n",
      "NODE_59259(-):1-108\n",
      "NODE_28794(-):5-133\n",
      "NODE_72312(-):1-96\n",
      "NODE_37438(+):3-107\n",
      "NODE_6370(+):123-224\n",
      "NODE_67647(+):2-100\n",
      "NODE_28480(-):1-93\n",
      "NODE_72226(+):8-103\n",
      "NODE_46503(+):3-104\n",
      "NODE_20236(+):1-90\n",
      "Number of annotation: 37\n"
     ]
    }
   ],
   "source": [
    "count = 0\n",
    "for annotation in db.find_annotation({'lineage': 171549}):\n",
    "    count += 1\n",
    "    print annotation\n",
    "print \"Number of annotation:\", count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And as you can see, the number of annotations is the same as the [example above](#Taxonomy-and-Annotations). The use of MongoDB to store the annotations can make it simplier to use richer queries, even from other languages."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}