blast2gff - Convert BLAST output to GFF
Overview
Converting BLAST output to GFF requires the BLAST+ tabular format, which can be
obtained with the -outfmt 6 option and the default columns, as
specified in mgkit.io.blast.parse_blast_tab(). By default, the script reads data
from standard input and writes GFF lines to standard output.
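As a rough illustration of the input described above, the sketch below splits one line of -outfmt 6 output into the twelve default BLAST+ columns and formats it as a GFF-style line. It does not use MGKit and is not how the script is implemented: the helper names parse_tab_line and to_gff_line, the BLAST source field, the attribute layout and the sample hit are all assumptions made for the example.

```python
# Default column order of BLAST+ tabular output (-outfmt 6)
BLAST_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def parse_tab_line(line):
    """Return a dict for one tab-separated BLAST hit (hypothetical helper)."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BLAST_COLUMNS):
        raise ValueError("expected 12 tab-separated columns")
    hit = dict(zip(BLAST_COLUMNS, values))
    # Convert only the columns used by to_gff_line
    for key in ("qstart", "qend"):
        hit[key] = int(hit[key])
    hit["bitscore"] = float(hit["bitscore"])
    return hit


def to_gff_line(hit, feat_type="CDS"):
    """Format a hit as a 9-column GFF line placed on the query sequence.

    Placing the feature on the query and the attribute names are
    assumptions of this sketch.
    """
    start, end = sorted((hit["qstart"], hit["qend"]))
    strand = "+" if hit["qstart"] <= hit["qend"] else "-"
    attributes = "gene_id={};bitscore={}".format(hit["sseqid"], hit["bitscore"])
    return "\t".join([
        hit["qseqid"], "BLAST", feat_type, str(start), str(end),
        ".", strand, ".", attributes,
    ])


# Made-up hit, for illustration only
sample = (
    "contig1\tgi|160361034|gb|CP000884.1\t98.5\t200\t3\t0"
    "\t1\t200\t500\t699\t1e-50\t350.0"
)
gff_line = to_gff_line(parse_tab_line(sample))
print(gff_line)
```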
Uniprot
The function mgkit.io.blast.parse_uniprot_blast() is used, which filters
BLAST hits based on bitscore and by default adds to the annotation a db
attribute with the value UNIPROT-SP, indicating that the SwissProt db is
used, and a dbq attribute with the value 10. The feature type used in the GFF
is CDS.
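The filtering and default attributes described above can be pictured with a small stand-alone sketch. It is not the code of parse_uniprot_blast(): the function name annotate_hits, the dictionary layout, the inclusive threshold and the sample hits are assumptions for illustration.

```python
def annotate_hits(hits, min_bitscore=0.0, db="UNIPROT-SP", dbq=10,
                  feat_type="CDS"):
    """Yield hits with bitscore >= min_bitscore, adding db, dbq and feat_type.

    Hypothetical helper; treating the threshold as inclusive is an
    assumption of this sketch.
    """
    for hit in hits:
        if hit["bitscore"] < min_bitscore:
            continue
        annotated = dict(hit)  # copy, so the input dicts are left untouched
        annotated["db"] = db
        annotated["dbq"] = dbq
        annotated["feat_type"] = feat_type
        yield annotated


# Made-up hits, for illustration only
hits = [
    {"gene_id": "Q9XYZ1", "bitscore": 35.0},
    {"gene_id": "P12345", "bitscore": 120.5},
]
kept = list(annotate_hits(hits, min_bitscore=40.0))
print(kept)
```

With a minimum bitscore of 40.0, only the second hit is kept, and it carries the db, dbq and feat_type values.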
BlastDB
If a BlastDB such as nt or nr was used, the blastdb command offers some quick defaults to parse the BLAST results.
It now includes options to control how the sequence header is parsed: the separator used and the column used as gene_id. These were added because at the moment the GI identifier (the second column in the header) is used, but it is being phased out in favour of the embl/gb/dbj identifier (right now the fourth column in the header). This should ease the transition to the new format and make it easier to adapt an older pipeline/blastdb to newer files (like the ID to TAXA files).
A header from ncbi-nt looks like this:
gi|160361034|gb|CP000884.1
This is the default format accepted by the blastdb command. The fields are separated by | (pipe) and the GI is used (--gene-index 1: internally the string is split by the separator and the second element is taken, since list indices are 0-based in Python). This corresponds to the following options:
--header-sep '|' --gene-index 1
Notice the single quotes around the pipe symbol: without them, bash would interpret it as piping to the next command. These are the default values.
If, for the same header, we want to use the gb identifier instead, the only option that needs to be specified is:
--gene-index 3
This will get the fourth element of the header (since we’re splitting by pipe).
As in the uniprot command, the gene_id can be set to the whole header with the -n option. This is useful when the BLAST db that was used was custom made. Since pipe is used in major databases, it was made the default, but if the db used follows different conventions the separator can be changed. There is also the option of changing the gene_id in the output GFF later, if necessary.
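The effect of --header-sep, --gene-index and -n on the example header can be reproduced with plain Python string splitting. This is a minimal sketch of the behaviour described in this section, not the script's actual code; the function name extract_gene_id and its parameters are hypothetical.

```python
def extract_gene_id(header, sep="|", index=1, no_split=False):
    """Pick the gene_id out of a sequence header (hypothetical helper).

    sep and index mirror --header-sep and --gene-index; no_split mirrors
    -n, where the whole header is used as gene_id.
    """
    if no_split:
        return header
    # List indices are 0-based, so index=1 is the second field
    return header.split(sep)[index]


header = "gi|160361034|gb|CP000884.1"

gi_id = extract_gene_id(header)                 # defaults: '|' and index 1
gb_id = extract_gene_id(header, index=3)        # like --gene-index 3
whole = extract_gene_id(header, no_split=True)  # like -n

print(gi_id)  # 160361034
print(gb_id)  # CP000884.1
print(whole)  # gi|160361034|gb|CP000884.1
```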
Changes
Changed in version 0.5.7: in the blastdb command, if the FASTA file is passed, the query coverage is calculated
Changed in version 0.3.4: using click instead of argparse
Changed in version 0.2.6: added -r option to blastdb
Changed in version 0.2.5: added more options to give the user control over the blastdb command
New in version 0.2.3: added --fasta-file option, added more data from a BLAST hit
New in version 0.2.2: added blastdb command
Changed in version 0.2.1: added -ft option
Changed in version 0.1.13: added -n and -k parameters to uniprot command
New in version 0.1.12.
Options
blast2gff
Main function
blast2gff [OPTIONS] COMMAND [ARGS]...
Options
--version
    Show the version and exit.
--cite
blastdb
Reads a BLAST output file [blast-file] in tabular format (using -outfmt 6) and outputs a GFF file [gff-file]
blast2gff blastdb [OPTIONS] [BLAST_FILE] [GFF_FILE]
Options
-v, --verbose
-db, --db-used <db_used>
    blastdb used
    Default: NCBI-NT
-n, --no-split
    if used, the script assumes that the sequence header will be used as gene_id
-s, --header-sep <header_sep>
    The separator for the header, defaults to ‘|’ (pipe)
    Default: |
-i, --gene-index <gene_index>
    Which of the header columns (0-based) to use as gene_id (defaults to 1 - the second column)
    Default: 1
-r, --remove-version
    if used, the script removes the version information from the gene_id
-a, --fasta-file <fasta_file>
    Optional FASTA file with the query sequences, if passed, query coverage is calculated
-dbq, --db-quality <db_quality>
    Quality of the DB used
    Default: 10
-b, --bitscore <bitscore>
    Minimum bitscore to keep the annotation
    Default: 0.0
-k, --attr-value <attr_value>
    Additional attribute and value to add to each annotation, in the form attr:value
-ft, --feat-type <feat_type>
    Feature type to use in the GFF
    Default: CDS
--progress
    Shows Progress Bar
Arguments
BLAST_FILE
    Optional argument
GFF_FILE
    Optional argument
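The query coverage mentioned for the --fasta-file option needs the length of each query sequence, which is why the FASTA file is required. The sketch below shows one common way to compute it, as the aligned span on the query divided by the query length. The exact formula used by the script is not stated here, so the definition, the helper names read_fasta_lengths and query_coverage, and the sample sequence are all assumptions.

```python
import io


def read_fasta_lengths(handle):
    """Return {sequence name: length} from a FASTA file handle.

    Hypothetical helper; the name is taken as the first word of the header.
    """
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def query_coverage(qstart, qend, query_length):
    """Percentage of the query covered by the hit (assumed definition)."""
    span = abs(qend - qstart) + 1  # BLAST coordinates are 1-based, inclusive
    return 100.0 * span / query_length


# Made-up 20 bp query, for illustration only
fasta = io.StringIO(">contig1 example\nACGTACGTAC\nACGTACGTAC\n")
lengths = read_fasta_lengths(fasta)
coverage = query_coverage(1, 10, lengths["contig1"])
print(coverage)  # 50.0
```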
uniprot
Reads a BLAST output file [blast-file] in tabular format (using -outfmt 6) from a Uniprot DB and outputs a GFF file [gff-file]
blast2gff uniprot [OPTIONS] [BLAST_FILE] [GFF_FILE]
Options
-v, --verbose
-db, --db-used <db_used>
    Uniprot database used with BLAST
    Default: UNIPROT-SP
-n, --no-split
    if used, the script assumes that the sequence header will be used as gene_id
-a, --fasta-file <fasta_file>
    Optional FASTA file with the query sequences
-dbq, --db-quality <db_quality>
    Quality of the DB used
    Default: 10
-b, --bitscore <bitscore>
    Minimum bitscore to keep the annotation
    Default: 0.0
-k, --attr-value <attr_value>
    Additional attribute and value to add to each annotation, in the form attr:value
-ft, --feat-type <feat_type>
    Feature type to use in the GFF
    Default: CDS
--progress
    Shows Progress Bar
Arguments
BLAST_FILE
    Optional argument
GFF_FILE
    Optional argument