mgkit.net.embl module¶
Access EMBL Services
-
mgkit.net.embl.
EMBL_DBID
= 'embl_cds'¶ Default database id
-
exception
mgkit.net.embl.
EntryNotFound
[source]¶ Bases:
Exception
Raised if at least one entry was not found by get_sequences_by_ids(). NOT_FOUND is used to check whether any entry wasn’t downloaded.
-
mgkit.net.embl.
LOG
= <Logger mgkit.net.embl (WARNING)>¶ Log instance for this module
-
mgkit.net.embl.
NONE_FOUND
= 'ERROR 12.+?.\\n?'¶ Regular expression to check if no entry was found, used by
NoEntryFound
-
mgkit.net.embl.
NOT_FOUND
= 'Entry: .+? not found.\\n'¶ Appears in the resulting fasta (not tried on other formats) in the case that at least one entry wasn’t found.
-
exception
mgkit.net.embl.
NoEntryFound
[source]¶ Bases:
Exception
Raised if no sequences were found by get_sequences_by_ids(); the check is based on the NONE_FOUND variable.
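A minimal sketch of how the two patterns behave, using only the standard re module. The pattern strings are copied from the NONE_FOUND and NOT_FOUND values shown on this page; the sample response texts (including the wording after "ERROR 12") are made up for illustration and are not taken from the EMBL service.

```python
import re

# Patterns copied from the module constants documented on this page.
NONE_FOUND = r'ERROR 12.+?.\n?'
NOT_FOUND = r'Entry: .+? not found.\n'

# Hypothetical response texts, invented for this example.
partial = ">SEQ1\nACGT\nEntry: XX000000 not found.\n>SEQ2\nTTGA\n"
empty = "ERROR 12 No entries found.\n"

# Lines reporting an entry that could not be retrieved.
missing_lines = re.findall(NOT_FOUND, partial)

# True when the whole request returned nothing.
none_found = re.search(NONE_FOUND, empty) is not None

# The fasta text with the error lines removed.
cleaned = re.sub(NOT_FOUND, "", partial)
```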
-
mgkit.net.embl.
URL_REST
= 'http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/'¶ Base URL for EMBL DBFetch REST API
-
mgkit.net.embl.
datawarehouse_search
(query, domain='sequence', result='sequence_release', display='fasta', offset=0, length=100000, contact=None, download='gzip', url='http://www.ebi.ac.uk/ena/data/warehouse/search?', fields=None)[source]¶ Changed in version 0.2.3: added fields parameter to retrieve tab separated information
New in version 0.1.13.
Performs a datawarehouse search on EMBL databases. Instructions on the query language used to query the datawarehouse are available at this page, with more details about the database domains at this page
- Parameters
query (str) – query for the search engine
domain (str) – database domain to search
result (str) – domain result requested
display (str) – display option (format to retrieve the entries)
offset (int) – the offset of the search results, defaults to the first
length (int) – number of results to retrieve at the specified offset; the limit is automatically set at 100,000 records per query
contact (str) – email of the user
download (str) – type of response. Gzip responses are automatically decompressed
url (str) – base URL for the resource
fields (None, iterable) – must be an iterable of fields to be returned if display is set to report
- Returns
the raw request
- Return type
Examples
Querying EMBL for all sequences of type rRNA of the Clostridium genus. Only from the EMBL release database in fasta format:
>>> query = 'tax_tree(1485) AND mol_type="rRNA"'
>>> result = 'sequence_release'
>>> display = 'fasta'
>>> data = embl.datawarehouse_search(query, result=result,
...                                  display=display)
>>> len(data)
35919
The taxon_id of each entry in the same data can be retrieved using report as the display option and passing fields as an iterable of the wanted fields, here ('accession', 'tax_id'):
>>> query = 'tax_tree(1485) AND mol_type="rRNA"'
>>> result = 'sequence_release'
>>> display = 'report'
>>> fields = ('accession', 'tax_id')
>>> data = embl.datawarehouse_search(query, result=result, display=display, fields=fields)
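Since the fields parameter returns tab separated information, the result of the second query could be turned into records along these lines. This is only a sketch: parse_report is a hypothetical helper, not part of mgkit, the sample text is invented, and it assumes the response has already been decoded to text and may start with a header line repeating the field names.

```python
import csv
import io


def parse_report(text, fields):
    """Turn tab separated report text into a list of dictionaries.

    Hypothetical helper, not part of mgkit.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [row for row in reader if row]
    # Assumption: a first line equal to the field names is a header.
    if rows and tuple(rows[0]) == tuple(fields):
        rows = rows[1:]
    return [dict(zip(fields, row)) for row in rows]


# Invented sample data, only to show the expected shape.
sample = "accession\ttax_id\nAB000001\t1491\nAB000002\t1502\n"
records = parse_report(sample, ("accession", "tax_id"))
```

Values are kept as strings; converting tax_id to int is left to the caller.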
-
mgkit.net.embl.
dbfetch
(embl_ids, db='embl', contact=None, out_format='seqxml', num_req=10)[source]¶ New in version 0.1.12.
Function for using the dbfetch service (REST). More information on the output formats and the databases available is at the service page
- Parameters
- Returns
a list with the results from each request sent. Each request sent has at most num_req ids, so the number of items in the list depends on the number of ids in embl_ids and the value of num_req.
- Return type
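The relation between embl_ids, num_req and the length of the returned list can be illustrated with a small batching sketch. The batches function below is a hypothetical stand-in written for this example; it is not the code mgkit uses, and the ids are invented.

```python
from itertools import islice


def batches(ids, num_req=10):
    """Yield lists of at most num_req ids (hypothetical helper)."""
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, num_req))
        if not chunk:
            break
        yield chunk


# 25 invented ids split into requests of at most 10 ids each.
ids = ["ID{:03d}".format(i) for i in range(25)]
chunks = list(batches(ids, num_req=10))
```

With 25 ids and num_req=10 this gives three batches, so under the same scheme three requests would be sent and three results returned.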
-
mgkit.net.embl.
get_sequences_by_ids
(embl_ids, contact=None, out_format='fasta', num_req=10, embl_db='embl_cds', strict=False)[source]¶ Changed in version 0.3.4: removed compress as it’s based on the requests package
Downloads entries using EBI REST API. It can download one entry at a time or accept an iterable and all sequences will be downloaded in batches of at most num_req.
It’s fairly general, so it can be customised, from the DB used to the output format: all batches are simply concatenated.
Note
There are some checks on errors that are reported by the EMBL API but not documented, in particular two errors, which are just reported as text lines in the fasta file (the only format tested at this time).
There are two possible cases:
if no entry was found, NoEntryFound will be raised.
if at least one entry wasn’t found:
if strict is False (the default) the error will just be logged as a debug message
if strict is True, EntryNotFound is raised
- Parameters
embl_ids (iterable, str) – list of ids to download
contact (str) – email address to be passed in the query
out_format (str) – format of the entry
num_req (int) – number of entries to download with each request
embl_db (str) – db to which the ids refer
strict (bool) – if True, a check on the number of entries retrieved is performed
- Returns
the entries requested
- Return type
- Raises
EntryNotFound – if at least one entry was not found
NoEntryFound – if no entry was found
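The behaviour described in the note and in the Raises field can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: check_response is a hypothetical function, the two exception classes are local stand-ins for the ones documented on this page, and the sample texts are invented.

```python
import logging
import re

LOG = logging.getLogger(__name__)

# Patterns copied from the module constants documented on this page.
NONE_FOUND = r'ERROR 12.+?.\n?'
NOT_FOUND = r'Entry: .+? not found.\n'


class NoEntryFound(Exception):
    """Local stand-in for mgkit.net.embl.NoEntryFound."""


class EntryNotFound(Exception):
    """Local stand-in for mgkit.net.embl.EntryNotFound."""


def check_response(text, strict=False):
    """Apply the two documented checks to a response (hypothetical)."""
    if re.search(NONE_FOUND, text):
        raise NoEntryFound("no entry was found")
    if re.search(NOT_FOUND, text):
        if strict:
            raise EntryNotFound("at least one entry was not found")
        # Non-strict mode only records the problem.
        LOG.debug("at least one entry was not found")
    return text


# Invented response with one missing entry.
partial = ">SEQ1\nACGT\nEntry: XX000000 not found.\n"
result = check_response(partial)  # strict=False: logged, text returned
```

Calling check_response(partial, strict=True) raises the EntryNotFound stand-in instead of returning the text.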
Warning
It seems that at most 11 sequences can be downloaded at a time, since each request returned at most 11 sequences. I didn’t find any mention of this in the API docs, and it may be a temporary restriction.