mgkit.io.blast module
Blast routines and parsers
-
mgkit.io.blast.add_blast_result_to_annotation(annotation, gi_taxa_dict, taxonomy, threshold=60)
Deprecated since version 0.4.0.
Adds blast information to a GFF annotation.
- Parameters
annotation – GFF annotation object
gi_taxa_dict (dict) – dictionary returned by parse_gi_taxa_table()
taxonomy – Uniprot taxonomy, used to add the taxon name to the annotation
-
mgkit.io.blast.parse_accession_taxa_table(file_handle, acc_ids=None, key=1, value=2, num_lines=1000000, no_zero=True)
New in version 0.2.5.
Changed in version 0.3.0: added no_zero
This function supersedes parse_gi_taxa_table(), since NCBI is deprecating GIDs in favor of accessions like X53318. The new file can be found on the NCBI FTP site at ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid; for DNA sequences (nt DB) the file is nucl_gb.accession2taxid.gz.
The file contains 4 columns: the first is the accession without its version, the second includes the version, the third is the taxonomic identifier and the fourth is either the old GID or na.
The column used as key by default is the second (index 1), since the fasta headers in NCBI DBs use the versioned identifier. To use the GID as key, set the key parameter to 3; lines where no GID is available (na, as per the file README) are then skipped.
- Parameters
file_handle (str, file) – file name or open file handle
acc_ids (None, list) – if it’s not None only the keys included in the passed acc_ids list will be returned
key (int) – 0-based index for the column to use as accession. Defaults to the versioned accession that is used in GenBank fasta files.
num_lines (None, int) – number of lines after which a message is logged. If None, no message is logged
no_zero (bool) – if True (default), a key with a taxon_id of 0 is not yielded
Note
GIDs are being phased out in September 2016: http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/
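The column layout and skipping rules described above can be illustrated with a minimal, self-contained sketch. It does not import mgkit: iter_accession_taxa is a hypothetical stand-in, the sample rows are invented, and the handling of the header line is an assumption about the file rather than documented behaviour of parse_accession_taxa_table().

```python
import io


def iter_accession_taxa(handle, acc_ids=None, key=1, value=2, no_zero=True):
    """Hypothetical stand-in, NOT mgkit's implementation.

    Yields (accession, taxon_id) pairs from a 4-column, tab-separated
    accession2taxid file.
    """
    wanted = None if acc_ids is None else set(acc_ids)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # malformed or empty line
        accession, taxon = fields[key], fields[value]
        if not taxon.isdigit():
            continue  # assumption: a non-numeric taxid marks the header line
        if accession == "na":
            continue  # no identifier available in the key column
        taxon_id = int(taxon)
        if no_zero and taxon_id == 0:
            continue
        if wanted is not None and accession not in wanted:
            continue
        yield accession, taxon_id


# Invented sample rows in the 4-column layout described above
SAMPLE = (
    "accession\taccession.version\ttaxid\tgi\n"
    "X53318\tX53318.1\t9606\t34567\n"
    "AB000001\tAB000001.1\t0\t12345\n"
    "AB000002\tAB000002.1\t562\tna\n"
)

by_accession = list(iter_accession_taxa(io.StringIO(SAMPLE)))
by_gid = list(iter_accession_taxa(io.StringIO(SAMPLE), key=3))
```

With the default key the versioned accession is used and the row with taxon_id 0 is dropped; with key=3 the row whose GID is na is dropped as well.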
-
mgkit.io.blast.parse_blast_tab(file_handle, seq_id=0, ret_col=(0, 1, 2, 6, 7, 11), key_func=None, value_funcs=None)
New in version 0.1.12.
Parses blast output tab format, returning for each line a key (the query id) and the columns requested in a tuple.
- Parameters
file_handle (file) – file name or file handle for the blast output
seq_id (int) – index for the column which has the query id
ret_col (list, None) – list of indexes for the columns to be returned or None if all columns must be returned
key_func (None, func) – function to transform the query id value into the key returned. If None, the query id is used
value_funcs (None, list) – list of functions to transform the values of the requested columns. If None, the values are not converted
- Yields
tuple – for each line, a tuple whose first element is the query id (after key_func is applied, if provided) and whose second element is a tuple with the columns requested in ret_col
column index    description
0               query name
1               subject name
2               percent identities
3               aligned length
4               number of mismatched positions
5               number of gap positions
6               query sequence start
7               query sequence end
8               subject sequence start
9               subject sequence end
10              e-value
11              bit score
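To make the roles of seq_id, ret_col, key_func and value_funcs concrete, here is a small self-contained sketch that mimics the documented behaviour on one invented line of tabular output. It does not call mgkit: iter_blast_tab is a hypothetical stand-in, and skipping comment and blank lines is an assumption.

```python
import io


def iter_blast_tab(handle, seq_id=0, ret_col=(0, 1, 2, 6, 7, 11),
                   key_func=None, value_funcs=None):
    """Hypothetical stand-in, NOT mgkit's implementation."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # assumption: comment and blank lines are ignored
        fields = line.rstrip("\n").split("\t")
        # The key is the query id, optionally transformed by key_func
        key = fields[seq_id] if key_func is None else key_func(fields[seq_id])
        # ret_col=None means every column is returned
        if ret_col is None:
            values = fields
        else:
            values = [fields[index] for index in ret_col]
        # One conversion function per returned column
        if value_funcs is not None:
            values = [func(val) for func, val in zip(value_funcs, values)]
        yield key, tuple(values)


# One invented hit with the 12 columns listed in the table above
SAMPLE = "read1\tsp|P12345|EXAMPLE\t98.5\t100\t1\t0\t1\t100\t5\t104\t1e-50\t200.0\n"

rows = list(
    iter_blast_tab(
        io.StringIO(SAMPLE),
        value_funcs=(str, str, float, int, int, float),
    )
)
```

With the default ret_col, each value tuple holds the query name, subject name, percent identity, query start, query end and bit score, in that order.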
-
mgkit.io.blast.parse_fragment_blast(file_handle, bitscore=40.0)
New in version 0.1.13.
Parses BLAST output where the sequences are the single annotations, so the sequence names are the uids of the annotations.
The only returned values are the best hits, maxed by bitscore and identity.
- Parameters
- Yields
tuple – a tuple whose first element is the uid (the sequence name) and whose second element is a list of tuples; in each of these, the first element is the GID (NCBI identifier), the second is the identity and the third is the bitscore of the hit.
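One way to read "best hits, maxed by bitscore and identity" is sketched below. This is an interpretation, not mgkit's code: best_hits is a hypothetical helper, the hits are invented, and both the inclusive bitscore threshold and the tie handling (keep every hit sharing the top bitscore and identity) are assumptions.

```python
from collections import defaultdict


def best_hits(hits, min_bitscore=40.0):
    """Hypothetical helper, NOT mgkit's implementation.

    hits: iterable of (uid, gid, identity, bitscore) tuples.
    Yields (uid, [(gid, identity, bitscore), ...]) with only the top hits.
    """
    grouped = defaultdict(list)
    for uid, gid, identity, bitscore in hits:
        if bitscore >= min_bitscore:  # assumption: threshold is inclusive
            grouped[uid].append((gid, identity, bitscore))
    for uid, uid_hits in grouped.items():
        # Rank by bitscore first, then identity
        top = max((bitscore, identity) for _, identity, bitscore in uid_hits)
        yield uid, [hit for hit in uid_hits if (hit[2], hit[1]) == top]


# Invented hits: (uid, GID, identity, bitscore)
HITS = [
    ("uid1", 111, 95.0, 120.0),
    ("uid1", 222, 99.0, 120.0),
    ("uid1", 333, 99.0, 80.0),
    ("uid2", 444, 90.0, 30.0),
]

result = dict(best_hits(HITS))
```

Here uid2 is dropped because its only hit falls below the bitscore threshold, and uid1 keeps the single hit with the highest bitscore and, among those, the highest identity.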
-
mgkit.io.blast.parse_uniprot_blast(file_handle, bitscore=40, db='UNIPROT-SP', dbq=10, name_func=None, feat_type='CDS', seq_lengths=None)
New in version 0.1.12.
Changed in version 0.1.13: added name_func argument
Changed in version 0.2.1: added feat_type
Changed in version 0.2.3: added seq_lengths and added subject start and end and e-value
Parses BLAST results in tabular format using parse_blast_tab(), applying a basic bitscore filter. Returns the annotations associated with each BLAST hit.
- Parameters
file_handle (str, file) – file name or open file handle
bitscore (int, float) – the minimum bitscore for an annotation to be accepted
db (str) – database used
dbq (int) – an index indicating the quality of the sequence database used; this value is used in the filtering of annotations
name_func (func) – function to convert the name of the database sequences. Defaults to lambda x: x.split('|')[1], which can be used with fasta files provided by Uniprot
feat_type (str) – feature type in the GFF
seq_lengths (dict) – dictionary with the sequence lengths, used to deduce the frame of the '-' strand
- Yields
Annotation – an mgkit.io.gff.Annotation instance for each BLAST hit.
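The default name_func and the bitscore filter can be illustrated without mgkit. In this sketch the hits are invented, uniprot_name simply restates the documented default lambda, and the inclusive comparison against the minimum bitscore is an assumption; no Annotation objects are built.

```python
def uniprot_name(header):
    """Same logic as the documented default: lambda x: x.split('|')[1]."""
    return header.split("|")[1]


# Invented hits: (query id, subject name as found in a Uniprot fasta, bitscore)
HITS = [
    ("contig1", "sp|P12345|EXAMPLE_A", 250.0),
    ("contig1", "sp|Q67890|EXAMPLE_B", 35.0),
]

MIN_BITSCORE = 40  # same value as the documented default for bitscore

# Keep hits at or above the threshold and shorten the subject name
accepted = [
    (query, uniprot_name(subject), bitscore)
    for query, subject, bitscore in HITS
    if bitscore >= MIN_BITSCORE  # assumption: threshold is inclusive
]
```

A Uniprot header such as sp|P12345|EXAMPLE_A is reduced to its second field, P12345, which is the part between the first two pipe characters.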