mgkit.io.blast module

Blast routines and parsers

mgkit.io.blast.add_blast_result_to_annotation(annotation, gi_taxa_dict, taxonomy, threshold=60)[source]

Deprecated since version 0.4.0.

Adds blast information to a GFF annotation.

Parameters
  • annotation – GFF annotation object

  • gi_taxa_dict (dict) – dictionary returned by parse_gi_taxa_table().

  • taxonomy – Uniprot taxonomy, used to add the taxon name to the annotation

mgkit.io.blast.parse_accession_taxa_table(file_handle, acc_ids=None, key=1, value=2, num_lines=1000000, no_zero=True)[source]

New in version 0.2.5.

Changed in version 0.3.0: added no_zero

This function supersedes parse_gi_taxa_table(), since NCBI is deprecating GIDs in favor of accessions like X53318. The new file can be found on the NCBI FTP site at ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid; for DNA sequences (nt DB) the file is nucl_gb.accession2taxid.gz.

The file contains 4 columns: the first is the accession without its version, the second includes the version, the third is the taxonomic identifier and the fourth is either the old GID or na.

The column used as key is the second, since by default the fasta headers used in NCBI DBs use the versioned identifier. To use the GID as key, the key parameter can be set to 3, but if no identifier is found (na as per the file README), the line is skipped.

Parameters
  • file_handle (str, file) – file name or open file handle

  • acc_ids (None, list) – if it’s not None only the keys included in the passed acc_ids list will be returned

  • key (int) – 0-based index for the column to use as accession. Defaults to the versioned accession that is used in GenBank fasta files.

  • num_lines (None, int) – number of lines after which a message is logged. If None, no message is logged

  • no_zero (bool) – if True (default) a key with taxon_id of 0 is not yielded

Note

GIDs are being phased out in September 2016: http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/
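The column layout and skipping rules described above can be illustrated with a minimal standalone sketch. It is not the mgkit implementation: the helper name iter_accession_taxa, the sample rows and their taxon identifiers are made up for illustration, and the sketch assumes the file starts with a header row whose first field is "accession".

```python
import io


def iter_accession_taxa(handle, acc_ids=None, key=1, value=2, no_zero=True):
    """Yield (accession, taxon_id) pairs from accession2taxid-style lines.

    Hypothetical helper that mirrors the documented parameters; it is an
    illustration only, not the library code.
    """
    wanted = None if acc_ids is None else set(acc_ids)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Assumption: the first line is a header starting with "accession"
        if fields[0] == "accession":
            continue
        acc = fields[key]
        # "na" marks a missing GID, so the line cannot be keyed on it
        if acc == "na":
            continue
        taxon_id = int(fields[value])
        # Drop entries without a usable taxon when no_zero is set
        if no_zero and taxon_id == 0:
            continue
        if wanted is not None and acc not in wanted:
            continue
        yield acc, taxon_id


# Made-up sample rows in the 4-column layout described above
sample = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "X53318\tX53318.1\t9606\t34627\n"
    "A00001\tA00001.1\t0\t58418\n"
    "B00002\tB00002.1\t10090\tna\n"
)
by_accession = dict(iter_accession_taxa(sample))
```

With the default key=1 the versioned accession is the key and the row with taxon 0 is dropped. Rewinding the sample and passing key=3 keys the result on the GID instead, which also drops the row whose GID is na.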

mgkit.io.blast.parse_blast_tab(file_handle, seq_id=0, ret_col=(0, 1, 2, 6, 7, 11), key_func=None, value_funcs=None)[source]

New in version 0.1.12.

Parses blast output tab format, returning for each line a key (the query id) and the columns requested in a tuple.

Parameters
  • file_handle (file) – file name or file handle for the blast output

  • seq_id (int) – index for the column which has the query id

  • ret_col (list, None) – list of indexes for the columns to be returned or None if all columns must be returned

  • key_func (None, func) – function to transform the query id value into the key returned. If None, the query id is used

  • value_funcs (None, list) – list of functions to transform the value of all the requested columns. If None the values are not converted

Yields

tuple – iterator of tuples; the first element is the query id (after key_func is applied, if requested) and the second element is a tuple with the columns requested in ret_col

BLAST+ used with -outfmt 6, default columns

column index    description
0               query name
1               subject name
2               percent identities
3               aligned length
4               number of mismatched positions
5               number of gap positions
6               query sequence start
7               query sequence end
8               subject sequence start
9               subject sequence end
10              e-value
11              bit score
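To show how seq_id, ret_col, key_func and value_funcs interact with the column indexes listed above, here is a minimal standalone sketch. It does not import mgkit and is not its implementation; the helper name iter_blast_tab and the sample hit line are hypothetical, and skipping blank and "#" lines is an assumption of the sketch.

```python
import io


def iter_blast_tab(handle, seq_id=0, ret_col=(0, 1, 2, 6, 7, 11),
                   key_func=None, value_funcs=None):
    """Yield (key, values) for each line of tab-separated BLAST output.

    Hypothetical helper that mirrors the documented parameters.
    """
    for line in handle:
        # Assumption: blank lines and comment lines carry no hits
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        key = fields[seq_id]
        if key_func is not None:
            key = key_func(key)
        # ret_col=None means every column is returned
        if ret_col is None:
            values = fields
        else:
            values = [fields[index] for index in ret_col]
        # One conversion function per requested column, in order
        if value_funcs is not None:
            values = [func(value) for func, value in zip(value_funcs, values)]
        yield key, tuple(values)


# Made-up hit with the 12 default -outfmt 6 columns
sample = io.StringIO(
    "read1\tsp|P12345|AATM_RABIT\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-50\t180.0\n"
)
hits = list(
    iter_blast_tab(sample, ret_col=(1, 2, 11), value_funcs=(str, float, float))
)
```

Here hits holds one pair: the query name read1 and a tuple with the subject name, the identity as a float and the bit score as a float.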

mgkit.io.blast.parse_fragment_blast(file_handle, bitscore=40.0)[source]

New in version 0.1.13.

Parses the output of a BLAST run where the sequences are the single annotations, so the sequence names are the uids of the annotations.

Only the best hits are returned, chosen by maximum bitscore and identity.

Parameters
  • file_handle (str, file) – file name or open file handle

  • bitscore (float) – minimum bitscore for accepting a hit

Yields

tuple – a tuple whose first element is the uid (the sequence name) and the second is a list of tuples whose first element is the GID (NCBI identifier), the second one is the identity and the third is the bitscore of the hit.
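One plausible reading of "best hits" is sketched below: hits under the minimum bitscore are dropped, then only the hits sharing the highest (bitscore, identity) pair are kept for each uid. This interpretation is an assumption, the function best_hits and the sample records are hypothetical, and the actual selection rule in mgkit may differ.

```python
from itertools import groupby
from operator import itemgetter


def best_hits(records, min_bitscore=40.0):
    """Yield (uid, [(gid, identity, bitscore), ...]) with the top hits.

    records is an iterable of (uid, gid, identity, bitscore) tuples in
    which hits for the same uid are adjacent, as in BLAST tabular output.
    Hypothetical helper, not the library code.
    """
    for uid, group in groupby(records, key=itemgetter(0)):
        hits = [
            (gid, identity, score)
            for _, gid, identity, score in group
            if score >= min_bitscore
        ]
        if not hits:
            continue
        # Highest bitscore first, identity as the tie-breaker
        top = max((score, identity) for _, identity, score in hits)
        yield uid, [hit for hit in hits if (hit[2], hit[1]) == top]


# Made-up records: (uid, gid, identity, bitscore)
records = [
    ("uid1", 111, 95.0, 150.0),
    ("uid1", 222, 97.0, 150.0),
    ("uid1", 333, 99.0, 35.0),
    ("uid2", 444, 90.0, 20.0),
]
result = list(best_hits(records))
```

In this sample uid2 yields nothing because its only hit is below the threshold, and uid1 keeps the hit with GID 222, which ties on bitscore with GID 111 but has the higher identity.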

mgkit.io.blast.parse_uniprot_blast(file_handle, bitscore=40, db='UNIPROT-SP', dbq=10, name_func=None, feat_type='CDS', seq_lengths=None)[source]

New in version 0.1.12.

Changed in version 0.1.13: added name_func argument

Changed in version 0.2.1: added feat_type

Changed in version 0.2.3: added seq_lengths and added subject start and end and e-value

Parses BLAST results in tabular format using parse_blast_tab(), applying a basic bitscore filter. Returns the annotations associated with each BLAST hit.

Parameters
  • file_handle (str, file) – file name or open file handle

  • bitscore (int, float) – the minimum bitscore for an annotation to be accepted

  • db (str) – database used

  • dbq (int) – an index indicating the quality of the sequence database used; this value is used in the filtering of annotations

  • name_func (func) – function to convert the name of the database sequences. Defaults to lambda x: x.split('|')[1], which can be used with fasta files provided by Uniprot

  • feat_type (str) – feature type in the GFF

  • seq_lengths (dict) – dictionary with the sequence lengths, used to deduce the frame of the '-' strand

Yields

Annotation – an mgkit.io.gff.Annotation instance for each BLAST hit.
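The default name_func and the bitscore filter can be illustrated without mgkit. The sketch below is only an approximation of the documented behaviour: filter_hits is a hypothetical helper, the subject names are example values (the second one is made up), and treating the threshold as inclusive is an assumption.

```python
def uniprot_accession(name):
    """Return the accession from a 'db|accession|entry' style name.

    Same expression as the documented default: x.split('|')[1]
    """
    return name.split('|')[1]


def filter_hits(hits, bitscore=40, name_func=uniprot_accession):
    """Yield (query, subject, score) for hits at or above the bitscore.

    Hypothetical helper; the inclusive comparison is an assumption.
    """
    for query, subject, score in hits:
        if score >= bitscore:
            yield query, name_func(subject), score


# Made-up hits: (query name, subject name, bit score)
hits = [
    ("contig1", "sp|P12345|AATM_RABIT", 182.0),
    ("contig1", "sp|Q00001|EXAMPLE_HYPO", 31.5),
]
kept = list(filter_hits(hits))
```

Only the first hit survives the filter, and its subject name is reduced to the accession P12345.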