mgkit.utils.sequence module

Module containing functions related to sequence data

Note

For those functions without a docstring, look at the same with a underscore (‘_’) prepended.

class mgkit.utils.sequence.Alignment(seqs=None)[source]

Bases: object

Simple alignment class

add_seq(name, seq)[source]

Add a sequence to the alignment

Parameters
  • name (str) – name of the sequence

  • seq (str) – sequence

add_seqs(seqs)[source]

Add sequences to the alignment

Parameters

seqs (iterable) – iterable that returns (name, seq)

get_consensus(nucl=True)[source]

Changed in version 0.1.16: added nucl parameter

The consensus sequence is constructed by checking the nucleotide that has the maximum number of counts for each position in the alignment.

Parameters

nucl (bool) – specify if the alignment is nucleotidic

Returns

consensus sequence

Return type

str

get_position(pos)[source]

Get all characters at a position

Parameters

pos (int) – position to return (0-based)

Return str

all characters occuring at the position

get_seq_len()[source]

Get the length of the alignment

get_snps(ref_seq=None, full_size=False)[source]

A SNP is called for the nucleotide that has the most counts among the ones that differ in the each site of the alignment. If two nucleotides have the same maximum count, one is randomly chosen.

Parameters
  • ref_seq (str) – a reference sequence can be provided, if None, a consensus sequence is produced for the alignment

  • full_size (bool) – if True a tuple is returned for each position in the alignment. If there is no SNP at a position the value for the SNP is None

Return list

a list of tuples (position, SNP)

mgkit.utils.sequence.REV_COMP = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

Dictionary containing the complement of each nucleotide sequence

mgkit.utils.sequence.TRANS_TABLE = {'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAT': 'N', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGT': 'S', 'ATA': 'I', 'ATC': 'I', 'ATG': 'M', 'ATT': 'I', 'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAT': 'H', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAT': 'D', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'TAA': '*', 'TAC': 'Y', 'TAG': '*', 'TAT': 'Y', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TGA': '*', 'TGC': 'C', 'TGG': 'W', 'TGT': 'C', 'TTA': 'L', 'TTC': 'F', 'TTG': 'L', 'TTT': 'F'}

Translation table - Universal genetic code

mgkit.utils.sequence._get_kmers(seq, k)[source]

New in version 0.2.6.

Returns a generator, with every iteration yielding a kmer of size k

Parameters
  • seq (str) – sequence

  • k (int) – kmer size

Yields

str – a portion of seq, of size k with a step of 1

mgkit.utils.sequence._sequence_signature(seq, w_size, k_size=4, step=None)[source]

New in version 0.2.6.

Returns the signature of a sequence, based on a kmer length, over a sliding window. Each sliding window signature is placed in order into a list, with each element being a collections.Counter instance whose keys are the kmer found in that window.

Parameters
  • seq (str) – sequence for which to get the signature

  • w_size (int) – size of the sliding window size

  • k_size (int) – size of the kmer to use get_kmers()

  • step (int) – step to use in sliding_window()

Returns

a list of collections.Counter instances, for each window used

Return type

list

mgkit.utils.sequence._signatures_matrix(seqs, w_size, k_size=4, step=None)[source]

New in version 0.2.6.

Return a matrix (pandas.DataFrame) where the columns are the kmer found in all sequences seqs and the rows are the a MultiIndex with the first level being the sequnce name and the second the index of the sliding window for which a signature was computed.

Parameters
  • seqs (iterable) – iterable that yields a tuple, with the first element being the sequence name and the second the sequence itself

  • w_size (int) – size of the sliding window size

  • k_size (int) – size of the kmer to use get_kmers()

  • step (int) – step to use in sliding_window(), defaults to half of the window size

Returns

a DataFrame where the columns are the kmers and the rows are the signatures of each contigs/windows.

Return type

pandas.DataFrame

mgkit.utils.sequence._sliding_window(seq, size, step=None)[source]

New in version 0.2.6.

Returns a generator, with every iteration yielding a subsequence of size size, with a step of step.

Parameters
  • seq (str) – sequnece

  • size (int) – size of the sliding window

  • step (int, None) – the step to use in the sliding window. If None, half of the sequence length is used

Yields

str – a subsequence of size size and step step

mgkit.utils.sequence.calc_n50(seq_lengths)[source]

Calculate the N50 statistics for a numpy.array of sequence lengths.

The algorithm finds in the supplied array the element (contig length) for which the sum all contig lengths equal or greater than it is equal to half of all assembled base pairs.

Parameters

seq_lengths (array) – an instance of a numpy array containing the sequence lengths

Return int

the N50 statistics value

mgkit.utils.sequence.check_snp_in_seq(ref_seq, pos, change, start=0, trans_table=None)[source]

Check a SNP in a reference sequence if it is a synonymous or non-synonymous change.

Parameters
  • ref_seq (str) – reference sequence

  • pos (int) – SNP position - it is expected to be a 1 based index

  • change (str) – nucleotide change occuring at pos

  • start (int) – the starting position for the coding region - 0 based index

  • trans_table (dict) – translation table used - codon->AA

Return bool

True if it is a synonymous change, False if non-synonymous

mgkit.utils.sequence.convert_aa_to_nuc_coord(start, end, frame=0)[source]

Converts aa coordinates to nucleotidic ones. The coordinates must be from ‘+’ strand. For the ‘-‘ strand, use reverse_aa_coord() first.

Parameters
  • start (int) – start of the annotation (lowest number)

  • end (int) – end of the annotation (highest number)

  • frame (int) – frame of the AA translation (0, 1 or 2)

Returns

the first element is the converted start and the second element is the converted end

Return type

tuple

Note

the coordinates are assumed to be 1-based indices

mgkit.utils.sequence.extrapolate_model(quals, frac=0.5, scale_adj=0.5)[source]

New in version 0.3.3.

Changed in version 0.4.4: returns now the minimum and maximum quality scores found, along with some messages for the stages of the function

Extrapolate a quality model from a list of qualities. It uses internally a LOWESS as the base, which is used to estimate the noise as a normal distribution.

Parameters
  • quals (list) – list of arrays of qualities, sorted by position in the corresponding sequence

  • frac (float) – fraction of the data used for the LOWESS fit (uses statsmodels)

  • scale_adj (float) – value by which the scale of the normal distribution will be multiplied. Defaults to halving the scale

Returns

the first element is the qualities fit with a LOWESS, the second element is the distribution, the third is the minimum quality score and the fourth element is the maximum quality score found in the sequences

Return type

tuple

mgkit.utils.sequence.get_contigs_info(file_name, pp=False)[source]

Changed in version 0.2.4: file_name can be a dict name->seq or a list of sequences

New in version 0.2.1.

Given a file name for a fasta file with sequences, a dictionary of name->seq, or a list of sequences, returns the following information in a tuple, or a string if pp is True:

  • number of sequences

  • total base pairs

  • max length

  • min length

  • average length

  • N50 statistic

Parameters
  • file_name (str) – fasta file to open

  • pp (bool) – if True, a formatted string is returned

Returns

the returned value depends on the value of pp, if True a formatted string is returned, otherwise the tuple with all values is.

Return type

str, tuple

mgkit.utils.sequence.get_seq_expected_syn_count(seq, start=0, syn_matrix=None)[source]

Changed in version 0.4.4: if codon contains N or is not a multiple of 3, it is skipped

Calculate the expected number of synonymous and non-synonymous changes in a nucleotide sequence. Assumes that the sequence is already in the correct frame and its length is a multiple of 3.

Parameters
  • seq (iterable) – nucleotide sequence (uppercase chars)

  • start (int) – frame of the sequence

  • syn_matrix (dict) – dictionary that contains the expected number of changes for a codon, as returned by get_syn_matrix()

Return tuple

tuple with counts of expected counts (syn, nonsyn)

mgkit.utils.sequence.get_seq_number_of_syn(ref_seq, snps, start=0, trans_table=None)[source]

Given a reference sequence and a list of SNPs, calculates the number of synonymous and non-synonymous SNP.

Parameters
  • ref_seq (str) – reference sequence

  • snps (iterable) – list of tuples (position, SNP) - zero based index

  • start (int) – the frame used for the reference {0, 1, 2}

  • trans_table (dict) – translation table used - codon->AA

Return tuple

synonymous and non-synonymous counts

mgkit.utils.sequence.get_syn_matrix(trans_table=None, nuc_list=None)[source]

Returns a dictionary containing the expected count of synonymous and non-synonymous changes that a codon can have if one base is allowed to change at a time.

There are 9 possible changes per codon.

Parameters
  • trans_table (dict) – a tranlation table, defaults to seq_utils.TRANS_TABLE

  • nuc_list (iterable) – a list of nucleotides in which a base can change, default to the keys of seq_utils.REV_COMP

Return dict

returns a dictionary in which for each codon a dictionary {‘syn’: 0, ‘nonsyn’: 0} holds the number of expected changes

mgkit.utils.sequence.get_syn_matrix_all(trans_table=None)[source]

Same as get_syn_matrix() but a codon can change in any of the ones included in trans_table.

There are 63 possible changes per codon.

mgkit.utils.sequence.get_variant_sequence(seq, *snps)[source]

New in version 0.1.16.

Return a sequence changed in the positions requested.

Parameters
  • seq (str) – a sequence

  • *snps (tuple) – each argument passed is a tuple with the first element as a position in the sequence (1-based index) and the second element is the character to substitute in the sequence

Returns

string with the changed characters

Return type

str

Example

>>> get_variant_sequence('ACTGATATATGCGCGCATCT', (1, 'C'))
'CCTGNTGTATGCGCGCATCT'

Note

It is used for nucleotide sequences, but it is valid to use any string

mgkit.utils.sequence.make_reverse_table(tbl=None)[source]

Makes table to reverse complement a sequence by reverse_complement(). The table used is the complement for each nucleotide, defaulting to REV_COMP

mgkit.utils.sequence.put_gaps_in_nuc_seq(nuc_seq, aa_seq, trim=True)[source]

Match the gaps in an amino-acid aligned sequence to its original nucleotide sequence. If the nucleotide sequence is not a multiple of 3, the trim option by default trim those bases from the output.

Parameters
  • nuc_seq (str) – original nucleotide sequence

  • aa_seq (str) – aligned amino-acid sequence

  • trim (bool) – if True trim last nucleotide(s)

Return str

gapped nucleotide sequence

mgkit.utils.sequence.qualities_model_constant(length=150, scale=1, loc=35)[source]

New in version 0.3.3.

Model with constant quality

Parameters
  • length (int) – length of the qualities

  • scale (float) – base level of the qualities

  • loc (float) – loc parameter of the normal distribution

Returns

first element is sequence qualities, the second element contains the distribution used to randomise them

Return type

tuple

mgkit.utils.sequence.qualities_model_decrease(length=150, scale=None, loc=35)[source]

New in version 0.3.3.

The model is a decreasing one, from 35 and depends on the length of the sequence.

Parameters
  • length (int) – length of the qualities

  • scale (float) – base level of the qualities

  • loc (float) – loc parameter of the normal distribution

Returns

first element is sequence qualities, the second element contains the distribution used to randomise them

Return type

tuple

mgkit.utils.sequence.random_qualities(n=1, length=150, model=None, max_qual=60, min_qual=0)[source]

New in version 0.3.3.

Parameters
  • n (int) – number of quality arrays to yield

  • length (int) – length of the quality array

  • model (tuple) – a tuple specifying the qualities and error distribution, if None qualities_model_decrease() is used

  • max_qual (int) – max quality values allowed, since rarely the quality exceed 60 in sequencing data

Yields

numpy.array – numpy array of qualities, with the maximum value of 60 by default, or max_qual, same with min_qual

mgkit.utils.sequence.random_sequences(n=1, length=150, p=None)[source]

New in version 0.3.3.

Returns an iterator of random squences, where each nucleotide probability can be customised in the order (A, C, T, G)

Parameters
  • n (int) – number of sequences to yield

  • length (int) – length of each sequence

  • p (tuple) – tuple with the probability of a nucleotide to occur, in the order A, C, T, G

Yields

str – string representing a nucleotidic sequence

mgkit.utils.sequence.random_sequences_codon(n=1, length=150, codons=['TTT', 'TCT', 'TAT', 'TGT', 'TTC', 'TCC', 'TAC', 'TGC', 'TTA', 'TCA', 'TAA', 'TGA', 'TTG', 'TCG', 'TAG', 'TGG', 'CTT', 'CCT', 'CAT', 'CGT', 'CTC', 'CCC', 'CAC', 'CGC', 'CTA', 'CCA', 'CAA', 'CGA', 'CTG', 'CCG', 'CAG', 'CGG', 'ATT', 'ACT', 'AAT', 'AGT', 'ATC', 'ACC', 'AAC', 'AGC', 'ATA', 'ACA', 'AAA', 'AGA', 'ATG', 'ACG', 'AAG', 'AGG', 'GTT', 'GCT', 'GAT', 'GGT', 'GTC', 'GCC', 'GAC', 'GGC', 'GTA', 'GCA', 'GAA', 'GGA', 'GTG', 'GCG', 'GAG', 'GGG'], p=None, frame=None)[source]

New in version 0.3.3.

Returns an iterator of nucleotidic sequences, based on a defined genetic code (passed as parameter, defaults to the universal one). The sequence if first sampled with replacement from the codon list, with a number of codons that covers the length chosen plus an additional one to allow a frame shift as set by frame

Note

If the probability (for each codon) are supplied, the number of sequences required to match those probabilities within a 10% margin of error is of at least 10.000 sequences, for 5% at leas 100.000

Parameters
  • n (int) – number of sequences to yield

  • length (int) – length of the sequences

  • codons (iterable) – codons used when generating the sequences

  • p (tuple) – probability of each codon occurence, in the same order as codons

  • frame (int or None) – used to define a specific frame shift occuring in the sequence (0 to 2) or a random one (if None)

Yields

str – string representing a nucleotidic sequence

mgkit.utils.sequence.reverse_aa_coord(start, end, seq_len)[source]

Used to reverse amino-acid coordinates when parsing an AA annotation on the - strand. Used when the BLAST or HMMER annotations use AA sequences.

Parameters
  • start (int) – start of the annotation

  • end (int) – end of the annotation

  • seq_len (int) – aa sequence length

Returns

reversed (from strand - to strand +) coordinates. The first element is the converted start and the second element is the converted end

Return type

tuple

Note

  • start and end are 1-based indices

mgkit.utils.sequence.reverse_complement(seq, tbl={65: 'T', 67: 'G', 71: 'C', 84: 'A'})[source]

Returns the reverse complement of a nucleotide sequence

Parameters
Return str

returns the reverse complement of a nucleotide sequence

mgkit.utils.sequence.reverse_complement_old(seq, tbl=None)[source]

Returns the reverse complement of a nucleotide sequence

Parameters
  • seq (str) – nucleotide sequence with uppercase characters

  • tbl (dict) – dictionary of complement bases, like REV_COMP

Return str

returns the reverse complement of a nucleotide sequence

mgkit.utils.sequence.sequence_composition(sequence, chars='A', 'T', 'C', 'G')[source]

New in version 0.1.13.

Returns the number of occurences of each unique character in the sequence

Parameters
  • sequence (str) – sequence

  • chars (iterable, None) – iterable of the chars to test, default to (A, C, T, G). if None checks all unique characters in the sequencce

Yields

tuple – the first element is the nucleotide and the second is the number of occurences in sequence

mgkit.utils.sequence.sequence_gc_content(sequence)[source]

Changed in version 0.3.3: in case of ZeroDivisionError returns .5

New in version 0.1.13.

Calculate GC content information for an annotation. The formula is:

(1)\[\frac {(G + C)}{(G + C + A + T)}\]
Parameters

sequence (str) – sequence

Returns

GC content

Return type

float

mgkit.utils.sequence.sequence_gc_ratio(sequence)[source]

New in version 0.1.13.

Calculate GC ratio information for a sequence. The formula is:

(2)\[\frac {(A + T)}{(G + C)}\]
Parameters

sequence (str) – sequence

Returns

GC ratio, or numpy.nan if G = C = 0

Return type

float

mgkit.utils.sequence.translate_sequence(sequence, start=0, tbl=None, reverse=False)[source]

Translate a nucleotide sequence in an amino acid one.

Parameters
  • sequence (str) – sequence to translate, it’s expected to be all caps

  • start (int) – 0-based index for the translation to start

  • tbl (dict) – dictionary with the translation for each codon

  • reverse (bool) – if True, reverse_complement() will be called and the returned sequence translated

Return str

the translated sequence