mgkit.utils.sequence module¶
Module containing functions related to sequence data
Note
For those functions without a docstring, look at the same with a underscore (‘_’) prepended.
-
class
mgkit.utils.sequence.
Alignment
(seqs=None)[source]¶ Bases:
object
Simple alignment class
-
add_seqs
(seqs)[source]¶ Add sequences to the alignment
- Parameters
seqs (iterable) – iterable that returns (name, seq)
-
get_consensus
(nucl=True)[source]¶ Changed in version 0.1.16: added nucl parameter
The consensus sequence is constructed by checking the nucleotide that has the maximum number of counts for each position in the alignment.
-
get_position
(pos)[source]¶ Get all characters at a position
- Parameters
pos (int) – position to return (0-based)
- Return str
all characters occuring at the position
-
-
mgkit.utils.sequence.
REV_COMP
= {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}¶ Dictionary containing the complement of each nucleotide sequence
-
mgkit.utils.sequence.
TRANS_TABLE
= {'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAT': 'N', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGT': 'S', 'ATA': 'I', 'ATC': 'I', 'ATG': 'M', 'ATT': 'I', 'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAT': 'H', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAT': 'D', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'TAA': '*', 'TAC': 'Y', 'TAG': '*', 'TAT': 'Y', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TGA': '*', 'TGC': 'C', 'TGG': 'W', 'TGT': 'C', 'TTA': 'L', 'TTC': 'F', 'TTG': 'L', 'TTT': 'F'}¶ Translation table - Universal genetic code
-
mgkit.utils.sequence.
_get_kmers
(seq, k)[source]¶ New in version 0.2.6.
Returns a generator, with every iteration yielding a kmer of size k
-
mgkit.utils.sequence.
_sequence_signature
(seq, w_size, k_size=4, step=None)[source]¶ New in version 0.2.6.
Returns the signature of a sequence, based on a kmer length, over a sliding window. Each sliding window signature is placed in order into a list, with each element being a
collections.Counter
instance whose keys are the kmer found in that window.
-
mgkit.utils.sequence.
_signatures_matrix
(seqs, w_size, k_size=4, step=None)[source]¶ New in version 0.2.6.
Return a matrix (pandas.DataFrame) where the columns are the kmer found in all sequences seqs and the rows are the a MultiIndex with the first level being the sequnce name and the second the index of the sliding window for which a signature was computed.
- Parameters
seqs (iterable) – iterable that yields a tuple, with the first element being the sequence name and the second the sequence itself
w_size (int) – size of the sliding window size
k_size (int) – size of the kmer to use
get_kmers()
step (int) – step to use in
sliding_window()
, defaults to half of the window size
- Returns
a DataFrame where the columns are the kmers and the rows are the signatures of each contigs/windows.
- Return type
pandas.DataFrame
-
mgkit.utils.sequence.
_sliding_window
(seq, size, step=None)[source]¶ New in version 0.2.6.
Returns a generator, with every iteration yielding a subsequence of size size, with a step of step.
-
mgkit.utils.sequence.
calc_n50
(seq_lengths)[source]¶ Calculate the N50 statistics for a
numpy.array
of sequence lengths.The algorithm finds in the supplied array the element (contig length) for which the sum all contig lengths equal or greater than it is equal to half of all assembled base pairs.
- Parameters
seq_lengths (array) – an instance of a numpy array containing the sequence lengths
- Return int
the N50 statistics value
-
mgkit.utils.sequence.
check_snp_in_seq
(ref_seq, pos, change, start=0, trans_table=None)[source]¶ Check a SNP in a reference sequence if it is a synonymous or non-synonymous change.
- Parameters
- Return bool
True if it is a synonymous change, False if non-synonymous
-
mgkit.utils.sequence.
convert_aa_to_nuc_coord
(start, end, frame=0)[source]¶ Converts aa coordinates to nucleotidic ones. The coordinates must be from ‘+’ strand. For the ‘-‘ strand, use
reverse_aa_coord()
first.- Parameters
- Returns
the first element is the converted start and the second element is the converted end
- Return type
Note
the coordinates are assumed to be 1-based indices
-
mgkit.utils.sequence.
extrapolate_model
(quals, frac=0.5, scale_adj=0.5)[source]¶ New in version 0.3.3.
Changed in version 0.4.4: returns now the minimum and maximum quality scores found, along with some messages for the stages of the function
Extrapolate a quality model from a list of qualities. It uses internally a LOWESS as the base, which is used to estimate the noise as a normal distribution.
- Parameters
- Returns
the first element is the qualities fit with a LOWESS, the second element is the distribution, the third is the minimum quality score and the fourth element is the maximum quality score found in the sequences
- Return type
-
mgkit.utils.sequence.
get_contigs_info
(file_name, pp=False)[source]¶ Changed in version 0.2.4: file_name can be a dict name->seq or a list of sequences
New in version 0.2.1.
Given a file name for a fasta file with sequences, a dictionary of name->seq, or a list of sequences, returns the following information in a tuple, or a string if pp is True:
number of sequences
total base pairs
max length
min length
average length
N50 statistic
-
mgkit.utils.sequence.
get_seq_expected_syn_count
(seq, start=0, syn_matrix=None)[source]¶ Changed in version 0.4.4: if codon contains N or is not a multiple of 3, it is skipped
Calculate the expected number of synonymous and non-synonymous changes in a nucleotide sequence. Assumes that the sequence is already in the correct frame and its length is a multiple of 3.
- Parameters
seq (iterable) – nucleotide sequence (uppercase chars)
start (int) – frame of the sequence
syn_matrix (dict) – dictionary that contains the expected number of changes for a codon, as returned by
get_syn_matrix()
- Return tuple
tuple with counts of expected counts (syn, nonsyn)
-
mgkit.utils.sequence.
get_seq_number_of_syn
(ref_seq, snps, start=0, trans_table=None)[source]¶ Given a reference sequence and a list of SNPs, calculates the number of synonymous and non-synonymous SNP.
-
mgkit.utils.sequence.
get_syn_matrix
(trans_table=None, nuc_list=None)[source]¶ Returns a dictionary containing the expected count of synonymous and non-synonymous changes that a codon can have if one base is allowed to change at a time.
There are 9 possible changes per codon.
- Parameters
trans_table (dict) – a tranlation table, defaults to
seq_utils.TRANS_TABLE
nuc_list (iterable) – a list of nucleotides in which a base can change, default to the keys of
seq_utils.REV_COMP
- Return dict
returns a dictionary in which for each codon a dictionary {‘syn’: 0, ‘nonsyn’: 0} holds the number of expected changes
-
mgkit.utils.sequence.
get_syn_matrix_all
(trans_table=None)[source]¶ Same as
get_syn_matrix()
but a codon can change in any of the ones included in trans_table.There are 63 possible changes per codon.
-
mgkit.utils.sequence.
get_variant_sequence
(seq, *snps)[source]¶ New in version 0.1.16.
Return a sequence changed in the positions requested.
- Parameters
- Returns
string with the changed characters
- Return type
Example
>>> get_variant_sequence('ACTGATATATGCGCGCATCT', (1, 'C')) 'CCTGNTGTATGCGCGCATCT'
Note
It is used for nucleotide sequences, but it is valid to use any string
-
mgkit.utils.sequence.
make_reverse_table
(tbl=None)[source]¶ Makes table to reverse complement a sequence by
reverse_complement()
. The table used is the complement for each nucleotide, defaulting toREV_COMP
-
mgkit.utils.sequence.
put_gaps_in_nuc_seq
(nuc_seq, aa_seq, trim=True)[source]¶ Match the gaps in an amino-acid aligned sequence to its original nucleotide sequence. If the nucleotide sequence is not a multiple of 3, the trim option by default trim those bases from the output.
-
mgkit.utils.sequence.
qualities_model_constant
(length=150, scale=1, loc=35)[source]¶ New in version 0.3.3.
Model with constant quality
-
mgkit.utils.sequence.
qualities_model_decrease
(length=150, scale=None, loc=35)[source]¶ New in version 0.3.3.
The model is a decreasing one, from 35 and depends on the length of the sequence.
-
mgkit.utils.sequence.
random_qualities
(n=1, length=150, model=None, max_qual=60, min_qual=0)[source]¶ New in version 0.3.3.
- Parameters
n (int) – number of quality arrays to yield
length (int) – length of the quality array
model (tuple) – a tuple specifying the qualities and error distribution, if None
qualities_model_decrease()
is usedmax_qual (int) – max quality values allowed, since rarely the quality exceed 60 in sequencing data
- Yields
numpy.array – numpy array of qualities, with the maximum value of 60 by default, or max_qual, same with min_qual
-
mgkit.utils.sequence.
random_sequences
(n=1, length=150, p=None)[source]¶ New in version 0.3.3.
Returns an iterator of random squences, where each nucleotide probability can be customised in the order (A, C, T, G)
-
mgkit.utils.sequence.
random_sequences_codon
(n=1, length=150, codons=['TTT', 'TCT', 'TAT', 'TGT', 'TTC', 'TCC', 'TAC', 'TGC', 'TTA', 'TCA', 'TAA', 'TGA', 'TTG', 'TCG', 'TAG', 'TGG', 'CTT', 'CCT', 'CAT', 'CGT', 'CTC', 'CCC', 'CAC', 'CGC', 'CTA', 'CCA', 'CAA', 'CGA', 'CTG', 'CCG', 'CAG', 'CGG', 'ATT', 'ACT', 'AAT', 'AGT', 'ATC', 'ACC', 'AAC', 'AGC', 'ATA', 'ACA', 'AAA', 'AGA', 'ATG', 'ACG', 'AAG', 'AGG', 'GTT', 'GCT', 'GAT', 'GGT', 'GTC', 'GCC', 'GAC', 'GGC', 'GTA', 'GCA', 'GAA', 'GGA', 'GTG', 'GCG', 'GAG', 'GGG'], p=None, frame=None)[source]¶ New in version 0.3.3.
Returns an iterator of nucleotidic sequences, based on a defined genetic code (passed as parameter, defaults to the universal one). The sequence if first sampled with replacement from the codon list, with a number of codons that covers the length chosen plus an additional one to allow a frame shift as set by frame
Note
If the probability (for each codon) are supplied, the number of sequences required to match those probabilities within a 10% margin of error is of at least 10.000 sequences, for 5% at leas 100.000
- Parameters
n (int) – number of sequences to yield
length (int) – length of the sequences
codons (iterable) – codons used when generating the sequences
p (tuple) – probability of each codon occurence, in the same order as codons
frame (int or None) – used to define a specific frame shift occuring in the sequence (0 to 2) or a random one (if None)
- Yields
str – string representing a nucleotidic sequence
-
mgkit.utils.sequence.
reverse_aa_coord
(start, end, seq_len)[source]¶ Used to reverse amino-acid coordinates when parsing an AA annotation on the - strand. Used when the BLAST or HMMER annotations use AA sequences.
- Parameters
- Returns
reversed (from strand - to strand +) coordinates. The first element is the converted start and the second element is the converted end
- Return type
Note
start and end are 1-based indices
-
mgkit.utils.sequence.
reverse_complement
(seq, tbl={65: 'T', 67: 'G', 71: 'C', 84: 'A'})[source]¶ Returns the reverse complement of a nucleotide sequence
- Parameters
seq (str) – nucleotide sequence with uppercase characters
tbl (dict) – translation table returned by
make_reverse_table()
- Return str
returns the reverse complement of a nucleotide sequence
-
mgkit.utils.sequence.
reverse_complement_old
(seq, tbl=None)[source]¶ Returns the reverse complement of a nucleotide sequence
-
mgkit.utils.sequence.
sequence_composition
(sequence, chars='A', 'T', 'C', 'G')[source]¶ New in version 0.1.13.
Returns the number of occurences of each unique character in the sequence
-
mgkit.utils.sequence.
sequence_gc_content
(sequence)[source]¶ Changed in version 0.3.3: in case of ZeroDivisionError returns .5
New in version 0.1.13.
Calculate GC content information for an annotation. The formula is:
(1)¶\[\frac {(G + C)}{(G + C + A + T)}\]
-
mgkit.utils.sequence.
sequence_gc_ratio
(sequence)[source]¶ New in version 0.1.13.
Calculate GC ratio information for a sequence. The formula is:
(2)¶\[\frac {(A + T)}{(G + C)}\]
-
mgkit.utils.sequence.
translate_sequence
(sequence, start=0, tbl=None, reverse=False)[source]¶ Translate a nucleotide sequence in an amino acid one.
- Parameters
sequence (str) – sequence to translate, it’s expected to be all caps
start (int) – 0-based index for the translation to start
tbl (dict) – dictionary with the translation for each codon
reverse (bool) – if True,
reverse_complement()
will be called and the returned sequence translated
- Return str
the translated sequence