mgkit.io.fastq module
Fastq utility functions
-
mgkit.io.fastq.
CASAVA_HEADER_NEW
= '(?P<machine>[\\w-]+):\n (?P<runid>\\d+):\n (?P<cellid>\\w+):\n (?P<lane>\\d):\n (?P<tile>\\d+):\n (?P<xcoord>\\d+):\n (?P<ycoord>\\d+)\n [_ ](?P<mate>\\d): # underscore for data from from www.ebi.ac.uk/ena/\n (?P<filter>[YN]):\n (?P<bits>\\d+):\n (?P<index>[ACTGN+]+)'
New Casava header regex, including indices for both forward and reverse
-
mgkit.io.fastq.
CASAVA_HEADER_OLD
= '(?P<machine>\\w+-\\w+):\n (?P<lane>\\d):\n (?P<tile>\\d+):\n (?P<xcoord>\\d+):\n (?P<ycoord>\\d+)\\#\n (?P<index>(\\d|[ACTGN]{6}))/\n (?P<mate>(1|2))'
Old Casava header regex
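The newlines, indentation and inline comment embedded in both patterns suggest they are meant to be compiled in verbose mode. A minimal sketch of matching a header, assuming `re.VERBOSE` and using a local copy of the old pattern (the name `OLD_HEADER` is illustrative, not part of mgkit):

```python
import re

# Local copy of the old Casava pattern listed above. Compiling it with
# re.VERBOSE is an assumption based on the whitespace inside the pattern.
OLD_HEADER = r"""(?P<machine>\w+-\w+):
    (?P<lane>\d):
    (?P<tile>\d+):
    (?P<xcoord>\d+):
    (?P<ycoord>\d+)\#
    (?P<index>(\d|[ACTGN]{6}))/
    (?P<mate>(1|2))"""

old_header_re = re.compile(OLD_HEADER, re.VERBOSE)

# Old-style header, stripped of the leading '@'
match = old_header_re.search("HWUSI-EAS100R:6:73:941:1973#0/1")
fields = match.groupdict()
print(fields["machine"], fields["lane"], fields["mate"])
```

In verbose mode the unescaped whitespace in the pattern is ignored, which is why `#` has to be escaped as `\#` to match literally.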
-
mgkit.io.fastq.
check_fastq_type
(qualities)
Tries to guess the type of quality string used in a Fastq file
- Parameters
qualities (str) – string with the quality scores as in the Fastq file
- Return str
a string with the guessed quality score type
Note
Possible values are the following, classified by the values usually used in other software:
ASCII33: sanger, illumina-1.8
ASCII64: illumina-1.3, illumina-1.5, solexa-old
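One common way to make this kind of guess is to look at the lowest character in the quality string. The sketch below is an illustration of that idea, not the mgkit implementation; the helper name `guess_quality_encoding` and the cut-off of 59 are assumptions (59 is the lowest character an offset of 64 can produce, since Solexa scores bottom out at -5).

```python
def guess_quality_encoding(qualities):
    """Hypothetical helper: guess the ASCII offset of a quality string."""
    lowest = min(ord(char) for char in qualities)
    # Characters below ASCII 59 (';') cannot occur with an offset of 64,
    # so their presence points to an offset of 33.
    if lowest < 59:
        return "ASCII33"
    # Ambiguous case: a uniformly high-quality ASCII33 string would also
    # end up here, so this answer is only a guess.
    return "ASCII64"


print(guess_quality_encoding("!''*((((***+))%%%++"))
print(guess_quality_encoding("hhhhgggfffeee"))
```

Checking a single read can be misleading, so a guess of this kind is more reliable when applied to the qualities of many reads.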
-
mgkit.io.fastq.
choose_header_type
(seq_id)
Return the guessed compiled regular expression
- Parameters
seq_id (str) – sequence header to test
- Returns
compiled regular expression object or None if no match found
-
mgkit.io.fastq.
convert_seqid_to_new
(seq_id)
Convert the old seq_id format for Illumina reads to the new format found in Casava 1.8+
- Parameters
seq_id (str) – seq_id of the sequence (stripped of ‘@’)
- Return str
the new format seq_id
Note
Example from Wikipedia:
old casava seq_id: @HWUSI-EAS100R:6:73:941:1973#0/1
new casava seq_id: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCAC
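The old format lacks the run id, flowcell id, filter flag and control bits of the new one, so any conversion has to fill them with placeholders. The following is a rough sketch of such a conversion, not the mgkit code: the function name `old_to_new_seqid` and the placeholder defaults are assumptions chosen for illustration.

```python
import re

# Compact, non-verbose equivalent of the old Casava pattern
OLD_SEQID = re.compile(
    r"(?P<machine>\w+-\w+):(?P<lane>\d):(?P<tile>\d+):"
    r"(?P<xcoord>\d+):(?P<ycoord>\d+)#(?P<index>\d|[ACTGN]{6})/(?P<mate>[12])"
)


def old_to_new_seqid(seq_id, runid="0", cellid="0", filtered="N", bits="0"):
    """Hypothetical converter; fields missing from the old format are
    filled with the placeholder defaults (an assumption of this sketch)."""
    match = OLD_SEQID.search(seq_id)
    if match is None:
        raise ValueError("Not an old-style Casava seq_id: %r" % seq_id)
    fields = match.groupdict()
    return (
        "{machine}:{runid}:{cellid}:{lane}:{tile}:{xcoord}:{ycoord} "
        "{mate}:{filtered}:{bits}:{index}"
    ).format(runid=runid, cellid=cellid, filtered=filtered, bits=bits, **fields)


new_id = old_to_new_seqid("HWUSI-EAS100R:6:73:941:1973#0/1")
print(new_id)
```

Note that a numeric index such as `0` is copied through unchanged in this sketch, even though the new header regex expects a nucleotide sequence in that position.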
-
mgkit.io.fastq.
convert_seqid_to_old
(seq_id, index_as_seq=True)
Deprecated since version 0.3.3.
Convert the new seq_id format for Illumina reads to the old format used by Casava until 1.8, which marks the new format.
-
mgkit.io.fastq.
load_fastq
(file_handle, num_qual=False)
New in version 0.3.1.
Loads a fastq file and returns a generator of tuples in which the first element is the name of the sequence, the second the sequence and the third the quality scores (converted to a numpy array if num_qual is True).
Note
This is a simple parser that assumes each sequence is on 4 lines: the 1st and 3rd for the headers, the 2nd for the sequence and the 4th for the quality scores.
- Parameters
- Yields
tuple – first element is the sequence name/header, the second element is the sequence, the third is the quality score. The quality scores are kept as a string if num_qual is False (default) and converted to a numpy array with correct values (0-41) if num_qual is True
- Raises
ValueError – if the headers in both sequence and quality scores are not valid. This implies that the sequence/qualities have carriage returns or the file is truncated.
TypeError – if the qualities are in a format different than sanger (min 0, max 40) or illumina-1.8 (0, 41)
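The 4-line assumption described above can be illustrated with a small stand-alone reader. This is a simplified sketch, not the mgkit implementation: `read_fastq` and `phred33` are hypothetical names, and the qualities are converted to a plain list of integers rather than a numpy array.

```python
import io


def read_fastq(file_handle):
    """Hypothetical 4-line FASTQ reader yielding (name, sequence, qualities)."""
    while True:
        header = file_handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = file_handle.readline().rstrip("\n")
        plus = file_handle.readline().rstrip("\n")
        qualities = file_handle.readline().rstrip("\n")
        # Both header lines must carry their marker, otherwise the record
        # is wrapped over more than 4 lines or the file is truncated.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("Malformed record near header: %r" % header)
        yield header[1:], sequence, qualities


def phred33(qualities):
    """Convert an ASCII33 quality string to a list of integer scores."""
    return [ord(char) - 33 for char in qualities]


handle = io.StringIO("@read1 extra\nACGT\n+\nIIII\n")
records = list(read_fastq(handle))
print(records, phred33(records[0][2]))
```

With mgkit itself, the equivalent loop would iterate over `load_fastq(file_handle)`, passing `num_qual=True` to receive numeric scores.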
-
mgkit.io.fastq.
load_fastq_rename
(file_handle, num_qual=False, name_func=None)
New in version 0.3.3.
Mirrors the same functionality in mgkit.io.fasta.load_fasta_rename(). Renames the header of the sequences using name_func, which is called on each header. By default, the behaviour is to keep the header to the left of the first space (BLAST behaviour).
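As an illustration of what a name_func can look like, the sketch below reproduces the default behaviour described above and adds a custom alternative. Both function names are hypothetical and are not defined by mgkit.

```python
def keep_first_word(header):
    """Hypothetical name_func: keep the text left of the first space."""
    return header.split(" ")[0]


def add_sample_prefix(header):
    """Hypothetical name_func: shorten the header and tag it with a sample name."""
    return "sampleA|" + keep_first_word(header)


header = "EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCAC"
print(keep_first_word(header))
print(add_sample_prefix(header))
```

A function like `add_sample_prefix` would be passed as the `name_func` argument of `load_fastq_rename`.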