filter-gff - Filter GFF annotations
Overview
Filters GFF annotations in different ways.
Value Filtering
Enables filtering of GFF annotations based on the values of their attributes. The filters test equality of numbers (internally converted into float) and of strings, and whether a string is contained in the value of an attribute; numeric less-than and greater-than comparisons are included as well. The length of an annotation is available as the attribute length and can be tested.
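The kinds of test described above can be sketched in plain Python. This is a minimal illustration, not the script's implementation: annotations are represented as plain dicts of string attributes, and make_filter and the kind names are hypothetical helpers chosen to mirror the option names of the values command.

```python
import operator

# Numeric comparisons, applied after converting both sides to float.
NUMERIC = {
    "num-eq": operator.eq,
    "num-ge": operator.ge,
    "num-le": operator.le,
    "num-gt": operator.gt,
    "num-lt": operator.lt,
}


def make_filter(key, value, kind):
    """Return a predicate that tests attribute `key` against `value`.

    Hypothetical helper: `kind` is one of the names in NUMERIC,
    or "str-eq" / "str-in" for the string tests.
    """
    if kind in NUMERIC:
        compare = NUMERIC[kind]
        threshold = float(value)
        return lambda attrs: key in attrs and compare(float(attrs[key]), threshold)
    if kind == "str-eq":
        return lambda attrs: attrs.get(key) == value
    if kind == "str-in":
        return lambda attrs: key in attrs and value in attrs[key]
    raise ValueError(f"unknown filter kind: {kind}")


# Attribute values are strings, as they would be in a GFF attribute column.
annotations = [
    {"gene_id": "K00001", "bitscore": "120.5", "length": "300"},
    {"gene_id": "K00002", "bitscore": "45.0", "length": "90"},
]

keep = make_filter("bitscore", "100", "num-ge")
kept = [ann for ann in annotations if keep(ann)]
```

Annotations that lack the tested attribute are dropped in this sketch; that choice is an assumption and is not stated by the text above.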
Overlap Filtering
Filters overlapping annotations using the functions mgkit.filter.gff.choose_annotation() and mgkit.filter.gff.filter_annotations(), after the annotations are grouped by both sequence and strand. If the GFF is sorted by sequence name and strand, the -t option can be used to make the filtering use less memory. The file can be sorted in Unix using sort -s -k 1,1 -k 7,7 gff_file, which applies a stable sort using the sequence name as the first key and the strand as the second key.
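Why sorted input saves memory can be illustrated with a short Python sketch: when all lines of a (sequence, strand) pair are contiguous, each group can be consumed and released before the next one is read. The function group_sorted_gff is a hypothetical helper written for this illustration and is not part of mgkit.

```python
import itertools


def group_sorted_gff(lines):
    """Yield ((seq_id, strand), records) from GFF lines sorted by columns 1 and 7.

    Only one group is held in memory at a time, which is the point of
    requiring sorted input.
    """
    records = (
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("#")
    )
    # Column 1 (index 0) is the sequence name, column 7 (index 6) the strand.
    for key, group in itertools.groupby(records, key=lambda f: (f[0], f[6])):
        yield key, list(group)


# Made-up lines, already in the order produced by the sort command above.
gff_lines = [
    "contig1\tsrc\tCDS\t1\t300\t.\t+\t0\tbitscore=120\n",
    "contig1\tsrc\tCDS\t50\t350\t.\t+\t0\tbitscore=80\n",
    "contig1\tsrc\tCDS\t10\t200\t.\t-\t0\tbitscore=60\n",
    "contig2\tsrc\tCDS\t5\t100\t.\t+\t0\tbitscore=90\n",
]

groups = [(key, len(records)) for key, records in group_sorted_gff(gff_lines)]
```

With unsorted input, itertools.groupby would emit the same key more than once, so all annotations would have to be collected in a dictionary first.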
Note
It is also recommended to use:
export LC_ALL=C
to speed up the sorting.
The above diagram describes the internals of the script.
The annotations first need to be grouped by seq_id and strand, forming a group that can then be passed to mgkit.filter.gff.filter_annotations().
This function:
sorts the annotations by bit score, from the highest to the lowest
loops over all combinations of N=2 annotations:
    chooses which of the two annotations to discard if they overlap for the required number of bp (defaults to 100bp)
    in that case, preference is given to the db quality first, then the bit score and finally the length of the annotation; the one with the highest values is kept
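The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code of mgkit.filter.gff.filter_annotations(): the Ann tuple and the helper functions are hypothetical, the overlap threshold is treated as inclusive, and pairs involving an already discarded annotation are skipped.

```python
import itertools
from collections import namedtuple

# Hypothetical stand-in for an annotation within one (seq_id, strand) group.
Ann = namedtuple("Ann", "name start end dbq bitscore")


def length(ann):
    """Length in bp of an annotation with inclusive coordinates."""
    return ann.end - ann.start + 1


def overlap_bp(a, b):
    """Number of bp shared by two annotations (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def priority(ann):
    """Default preference: db quality, then bit score, then length."""
    return (ann.dbq, ann.bitscore, length(ann))


def filter_group(annotations, size=100):
    # Sort by bit score, from the highest to the lowest.
    ranked = sorted(annotations, key=lambda a: a.bitscore, reverse=True)
    discarded = set()
    for a, b in itertools.combinations(ranked, 2):
        # Assumption: an annotation already discarded is not compared again.
        if a in discarded or b in discarded:
            continue
        # Assumption: the threshold is inclusive.
        if overlap_bp(a, b) >= size:
            # The annotation with the lowest values is the one discarded.
            discarded.add(min(a, b, key=priority))
    return [a for a in ranked if a not in discarded]


group = [
    Ann("a", 1, 300, 10, 120.0),
    Ann("b", 50, 350, 10, 80.0),
    Ann("c", 400, 600, 5, 60.0),
]
kept = filter_group(group)
```

Here a and b share 251 bp and have the same db quality, so b is discarded for its lower bit score, while c does not overlap anything and is kept.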
While the default behaviour is the same, it is now possible to decide the function used to discard one of the two annotations. The -c argument accepts a string that defines the function. The first character of the string selects the function: a + translates into the builtin function max, while any other first character (a - in the examples below) translates into min. From the second character on, any number of attributes can be used, separated by commas. The attributes, however, must either be one of the properties defined in mgkit.io.gff.Annotation, such as bitscore, which returns the value converted to a float, or hold a value that float can convert: internally the attributes are stored as strings, so for attributes that have no property in the class, such as evalue, the float builtin is applied.
A tuple of these attributes is built for each of the two annotations and both tuples are passed to the comparison function; the annotation it returns is the one discarded. The order of the elements in the string defines the priority given to each element in the comparison, and the leftmost one has the highest priority.
Examples of function strings:
-dbq,bitscore,length becomes min((ann1.dbq, ann1.bitscore, ann1.length), (ann2.dbq, ann2.bitscore, ann2.length)) - This is the default, and was previously the only choice
-bitscore,length,dbq uses the same elements but gives the lowest priority to dbq
+evalue will discard the annotation with the highest evalue
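How such a string could be turned into a function is sketched below. It follows the rule stated in the text (a leading + selects max, anything else selects min, and the returned annotation is the one discarded), but parse_choose_func is a hypothetical helper and annotations are modelled as dicts of strings rather than mgkit.io.gff.Annotation instances.

```python
def parse_choose_func(spec):
    """Build a function returning the annotation to discard.

    Hypothetical helper: the first character picks max ("+") or min
    (anything else); the rest is a comma separated list of attributes.
    """
    function = max if spec.startswith("+") else min
    attributes = spec[1:].split(",")

    def key(ann):
        # Values are stored as strings, so float() is applied to each one.
        return tuple(float(ann[name]) for name in attributes)

    def choose(ann1, ann2):
        return function(ann1, ann2, key=key)

    return choose


ann1 = {"dbq": "10", "bitscore": "120.0", "length": "300", "evalue": "1e-30"}
ann2 = {"dbq": "10", "bitscore": "80.0", "length": "301", "evalue": "1e-5"}

# Lowest (bitscore, length, dbq) tuple is discarded: ann2, bit score 80.
discard_by_score = parse_choose_func("-bitscore,length,dbq")(ann1, ann2)
# Highest evalue is discarded: ann2, evalue 1e-5.
discard_by_evalue = parse_choose_func("+evalue")(ann1, ann2)
# Shortest annotation is discarded: ann1, length 300.
discard_by_length = parse_choose_func("-length")(ann1, ann2)
```

Because tuples compare element by element from the left, the first attribute in the string decides the outcome unless the two annotations tie on it.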
Per Sequence Values
The sequence command filters on a per sequence basis, using functions such as the median, quantile and mean on attributes like evalue, bitscore and identity. The file can be declared as already sorted, which saves memory (as in the overlap command), but it does not need to be sorted by strand, only by the first column.
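One plausible reading of this command is sketched below: for the annotations of a single sequence, a summary value is computed from the chosen attribute and each annotation is kept if its own value passes the comparison against it. This interpretation is an assumption, and filter_sequence is a hypothetical helper; std and quantile are left out because the text does not say how their extra value is applied.

```python
import operator
import statistics

COMPARISONS = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
}

FUNCTIONS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "max": max,
    "min": min,
}


def filter_sequence(values, function="mean", comparison="ge"):
    """Keep the values of one sequence that pass the comparison.

    Hypothetical helper: `values` are the attribute values (for example
    bit scores) of all annotations on a single sequence.
    """
    threshold = FUNCTIONS[function](values)
    compare = COMPARISONS[comparison]
    return [value for value in values if compare(value, threshold)]


# Made-up bit scores for the annotations of one sequence.
bitscores = [50.0, 80.0, 120.0, 150.0]

kept_mean = filter_sequence(bitscores)
kept_max = filter_sequence(bitscores, function="max")
kept_below = filter_sequence(bitscores, function="median", comparison="lt")
```

The defaults used here, mean and ge, match the defaults listed for the sequence command.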
Coverage Filtering
The cov command calculates the coverage of the annotations as the percentage of each reference sequence length that they cover. A minimum coverage percentage can be set to keep only the annotations of sequences whose coverage is greater than or equal to it.
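The coverage measure can be illustrated with a small Python sketch. It assumes 1-based inclusive coordinates and counts overlapping regions only once; coverage_percent is a hypothetical helper, not the function used by the script.

```python
def coverage_percent(intervals, seq_length):
    """Percentage of a sequence covered by (start, end) intervals.

    Assumes 1-based inclusive coordinates; bases covered by more than
    one interval are counted once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        # Skip the part of this interval that was already counted.
        start = max(start, current_end + 1)
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100.0 * covered / seq_length


# Three made-up annotations on a 500 bp sequence; the first two overlap.
cov = coverage_percent([(1, 100), (51, 150), (301, 400)], 500)

# The sequence's annotations are kept if coverage reaches the minimum.
keep = cov >= 40.0
```

In this example 250 of the 500 bases are covered, so the sequence passes a 40% minimum.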
Changes
New in version 0.1.12.
Changed in version 0.1.13: added --sorted option
Changed in version 0.2.0: changed option -c to accept a string to filter overlap
Changed in version 0.2.5: added sequence command
Changed in version 0.2.6: added length as attribute and min/max, and ge is the default comparison for command sequence, --sort-attr to overlap
Changed in version 0.3.1: added --num-gt and --num-lt to values command, added cov command
Changed in version 0.3.4: moved to use click for argument parsing, reworked the values and sequence commands
Changed in version 0.4.4: overlap command: added option to not use the strand information and added an option to make multiple passes of overlap for each sequence
Options
filter-gff
Main function
filter-gff [OPTIONS] COMMAND [ARGS]...
Options
--version
    Show the version and exit.
--cite
cov
Filter on a per coverage basis
filter-gff cov [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-v, --verbose
-f, --reference <reference>
    Required. Reference FASTA file for the GFF
-s, --strand-specific
    If the coverage must be calculated on each strand
-t, --sorted
    Assumes the GFF to be correctly sorted
-c, --min-coverage <min_coverage>
    Minimum coverage for the contig/strand
-r, --rename
    Emulates BLAST in reading the FASTA file (keeps only the header before the first space)
--progress
    Shows Progress Bar
Arguments
INPUT_FILE
    Optional argument
OUTPUT_FILE
    Optional argument
overlap
Use overlapping filter
filter-gff overlap [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-v, --verbose
-s, --size <size>
    Size of the overlap that triggers the filter
    Default: 100
-t, --sorted
    If the GFF file is sorted (all annotations of a sequence are contiguous and sorted by strand), less memory can be used; sort -s -k 1,1 -k 7,7 can be used to sort it
-c, --choose-func <choose_func>
    Function to choose between two overlapping annotations
-a, --sort-attr <sort_attr>
    Attribute to sort annotations before filtering (default bitscore)
    Default: bitscore
    Options: bitscore|identity|length
-d, --no-strand
    Strand information is not used; if -t is used, sort the GFF file with sort -s -k 1,1
-n, --iterations <iterations>
    Maximum number of iterations over which to filter the overlaps
--progress
    Shows Progress Bar
Arguments
INPUT_FILE
    Optional argument
OUTPUT_FILE
    Optional argument
sequence
Filter on a per sequence basis
filter-gff sequence [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-v, --verbose
-t, --sorted
    If the GFF file is sorted (all annotations of a sequence are contiguous), less memory can be used; sort -s -k 1,1 can be used to sort it
-a, --attribute <attribute>
    Attribute on which to apply the filter
    Default: bitscore
    Options: evalue|bitscore|identity|length
-f, --function <function>
    Function for filtering
    Default: mean
    Options: mean|median|quantile|std|max|min
-l, --value <value>
    Value for the function (used for std and quantile)
-c, --comparison <comparison>
    Type of comparison (e.g. ge -> greater than or equal to)
    Default: ge
    Options: gt|ge|lt|le
--progress
    Shows Progress Bar
Arguments
INPUT_FILE
    Optional argument
OUTPUT_FILE
    Optional argument
values
Filter based on values
filter-gff values [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-v, --verbose
--str-eq <str_eq>
    Filter by custom key:value; if the argument is ‘key:value’, the annotation is kept if it contains an attribute ‘key’ whose value is exactly ‘value’ as a string
--str-in <str_in>
    Same as --str-eq, but ‘value’ must be contained in the attribute
--num-eq <num_eq>
    Same as --str-eq, but ‘value’ is a number and the attribute must be equal to it
--num-ge <num_ge>
    Same as --str-eq, but ‘value’ is a number and the attribute must be greater than or equal to it
--num-le <num_le>
    Same as --num-ge, but the attribute must be less than or equal to ‘value’
--num-gt <num_gt>
    Same as --str-eq, but ‘value’ is a number and the attribute must be greater than it
--num-lt <num_lt>
    Same as --num-ge, but the attribute must be less than ‘value’
--progress
    Shows Progress Bar
Arguments
INPUT_FILE
    Optional argument
OUTPUT_FILE
    Optional argument