mgkit.counts.func module

New in version 0.1.13.

Misc functions for count data

mgkit.counts.func.batch_load_htseq_counts(count_files, samples=None, cut_name=None)

Loads a list of htseq count result files and returns a DataFrame (IDxSAMPLE)

The sample names are the file names if samples and cut_name are None. Supplying a list of sample names with samples is the preferred way; cut_name is kept for backward compatibility and as an option in cases where a string replace is enough.

Parameters

count_files (file or str) – file handle or string with file name
samples (iterable) – list of sample names, in the same order as count_files
cut_name (str) – string to delete from the file names to get the sample names

Returns

with sample names as columns and gene_ids as index

Return type

pandas.DataFrame
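
The naming rules described above can be sketched in plain Python. This is a minimal illustration, not MGKit's implementation: resolve_sample_names is a hypothetical helper, the file names are made up, and it assumes that samples takes priority when both options are given and that cut_name is removed with a plain string replace.

```python
def resolve_sample_names(count_files, samples=None, cut_name=None):
    """Hypothetical helper mirroring the documented naming rules."""
    if samples is not None:
        # Preferred: explicit names, in the same order as count_files
        return list(samples)
    if cut_name is not None:
        # Backward-compatible option: delete a fixed string from each name
        return [name.replace(cut_name, "") for name in count_files]
    # Default: the file names themselves
    return list(count_files)


files = ["s1-counts.txt", "s2-counts.txt"]  # made-up file names
default_names = resolve_sample_names(files)
cut_names = resolve_sample_names(files, cut_name="-counts.txt")
explicit_names = resolve_sample_names(files, samples=["A", "B"])
```

With cut_name the columns would be named s1 and s2, while an explicit samples list gives full control over the names.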

mgkit.counts.func.filter_counts(counts_iter, info_func, gfilters=None, tfilters=None)

Returns counts that pass filters for each uid associated gene_id and taxon_id.

Parameters

counts_iter (iterable) – iterator that yields a tuple (uid, count)
info_func (func) – function accepting a uid that returns a tuple (gene_id, taxon_id)
gfilters (iterable) – list of filters to apply to each uid associated gene_id
tfilters (iterable) – list of filters to apply to each uid associated taxon_id

Yields

tuple – (uid, count) that pass filters
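
The filtering logic can be sketched as a small generator. This is an illustration under stated assumptions, not the library code: filter_counts_sketch is a stand-in name, each filter is assumed to be a callable that returns True to keep an ID, and all filters are assumed to have to pass. The uids, gene IDs and taxon IDs are toy values.

```python
def filter_counts_sketch(counts_iter, info_func, gfilters=None, tfilters=None):
    """Illustrative stand-in: yield (uid, count) pairs that pass every filter."""
    for uid, count in counts_iter:
        gene_id, taxon_id = info_func(uid)
        # Assumption: a filter is a callable returning True to keep the ID
        if gfilters is not None and not all(f(gene_id) for f in gfilters):
            continue
        if tfilters is not None and not all(f(taxon_id) for f in tfilters):
            continue
        yield uid, count


# Toy data: uid -> (gene_id, taxon_id)
info = {"u1": ("K00001", 839), "u2": ("K00002", 2157), "u3": ("K00001", 2157)}
counts = [("u1", 10), ("u2", 5), ("u3", 7)]

kept = list(
    filter_counts_sketch(
        counts,
        info.__getitem__,
        gfilters=[lambda gene_id: gene_id == "K00001"],
        tfilters=[lambda taxon_id: taxon_id != 839],
    )
)
```

Only u3 survives here: u1 fails the taxon filter and u2 fails the gene filter.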

mgkit.counts.func.from_gff(annotations, samples, ann_func=None, sample_func=None)

New in version 0.3.1.

Loads count data from a GFF file, only for the requested samples. By default the function returns a DataFrame where the index is the uid of each annotation and the columns the requested samples.

This can be customised by supplying ann_func and sample_func. sample_func is a function that accepts a sample name and is expected to return a string or a tuple, which is used to change the columns of the DataFrame. ann_func must accept an mgkit.io.gff.Annotation instance and return an iterable, with each iteration yielding either a single element or a tuple (for a MultiIndex DataFrame); the count of that annotation is added to each element yielded.

Parameters

annotations (iterable) – iterable yielding annotations
samples (iterable) – list of samples to keep
ann_func (func) – function used to customise the output
sample_func (func) – function to customise the column elements

Returns

dataframe with the count data, columns are the samples and rows the annotation counts (unless mapped with ann_func)

Return type

DataFrame

Examples:

Assuming we have a list of annotations and sample SAMPLE1 and SAMPLE2 we can obtain the count table for all annotations with this

>>> from_gff(annotations, ['SAMPLE1', 'SAMPLE2'])

Assuming we want to group the samples, for example treatment1, treatment2 and control1, control2 into a MultiIndex DataFrame column

>>> sample_func = lambda x: ('T' if x.startswith('t') else 'C', x)
>>> from_gff(annotations, ['treatment1', 'treatment2', 'control1',
...          'control2'], sample_func=sample_func)

Annotations can also be mapped to other levels. For example, instead of the default uid, each annotation can be mapped to the gene_id and taxon_id it carries, resulting in a MultiIndex for the rows with (gene_id, taxon_id) as key.

>>> ann_func = lambda x: [(x.gene_id, x.taxon_id)]
>>> from_gff(annotations, ['SAMPLE1', 'SAMPLE2'], ann_func=ann_func)

mgkit.counts.func.get_uid_info(info_dict, uid): Simple function to get a value from a dictionary of tuples (gene_id, taxon_id)

mgkit.counts.func.get_uid_info_ann(annotations, uid): Simple function to get a value from a dictionary of annotations
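
Both helpers take the lookup container first and the uid second, so one way to obtain the single-argument info_func expected by functions such as filter_counts() and map_counts() is functools.partial. The sketch below uses a stand-in function with the documented signature and toy data. Whether MGKit binds the helpers this way internally is an assumption.

```python
from functools import partial


def get_uid_info_sketch(info_dict, uid):
    """Stand-in with the documented signature: look up uid in the dictionary."""
    return info_dict[uid]


# Toy data: uid -> (gene_id, taxon_id)
info_dict = {"u1": ("K00001", 839), "u2": ("K00002", 2157)}

# Bind the dictionary so the result accepts a uid as its sole argument
info_func = partial(get_uid_info_sketch, info_dict)
gene_id, taxon_id = info_func("u2")
```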

mgkit.counts.func.load_counts_from_gff(annotations, elem_func=<function <lambda>>, sample_func=None, nozero=True)

New in version 0.2.5.

Loads the counts stored in the counts_ attributes of each annotation. Annotations with a total of 0 counts are skipped by default (nozero=True). The row index is set to the uid of the annotation and the column to the sample name. The functions used to transform the indices receive the annotation (for the row, elem_func) and the sample name (for the column, sample_func).

Parameters

annotations (iter) – iterable of annotations
elem_func (func) – function that accepts an annotation and returns a str/int for an Index or a tuple for a MultiIndex, defaults to returning the uid of the annotation
sample_func (func, None) – function that accepts the sample name and returns a tuple for a MultiIndex. Defaults to None, so no transformation is performed
nozero (bool) – if True, annotations with no counts are skipped
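
The interplay of elem_func, sample_func and nozero can be sketched with toy data. This is not the library code: load_counts_sketch is a stand-in, the annotations are plain dictionaries with a made-up layout (not mgkit.io.gff.Annotation instances), and the result is a nested dictionary rather than a DataFrame. In this sketch a later annotation overwrites an earlier one that maps to the same row key, which is an assumption.

```python
def load_counts_sketch(annotations, elem_func=lambda ann: ann["uid"],
                       sample_func=None, nozero=True):
    """Illustrative stand-in: build {column: {row: count}} from toy annotations."""
    table = {}
    for ann in annotations:
        counts = ann["counts"]  # toy layout: {sample name: count}
        if nozero and sum(counts.values()) == 0:
            continue  # skip annotations with a total of 0 counts
        row = elem_func(ann)
        for sample, count in counts.items():
            column = sample if sample_func is None else sample_func(sample)
            table.setdefault(column, {})[row] = count
    return table


toy_annotations = [
    {"uid": "u1", "counts": {"treatment1": 4, "control1": 0}},
    {"uid": "u2", "counts": {"treatment1": 0, "control1": 0}},
]

table = load_counts_sketch(
    toy_annotations,
    sample_func=lambda name: ("T" if name.startswith("t") else "C", name),
)
```

Annotation u2 is dropped because its counts sum to 0, and the tuple returned by sample_func becomes the column key.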

mgkit.counts.func.load_deseq2_results(file_name, taxon_id=None)

New in version 0.1.14.

Reads a CSV file output with DESeq2 results, adding a taxon_id to the index for concatenating multiple results from different taxonomic groups.

Parameters: file_name (str) – file name of the CSV
Returns: a MultiIndex DataFrame with the results
Return type: pandas.DataFrame

mgkit.counts.func.load_htseq_counts(file_handle, conv_func=<class 'int'>)

Changed in version 0.1.15: added conv_func parameter

Loads an HTSeq-count result file

Parameters

file_handle (file or str) – file handle or string with file name
conv_func (func) – function to convert the number from string, defaults to int, but float can be used as well

Yields

tuple – first element is the gene_id and the second is the count
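
The role of conv_func can be sketched with a small parser for tab-separated gene_id/count lines. This is an illustration, not the library code: parse_htseq_counts is a stand-in name, the input is made up, and skipping summary rows that start with a double underscore is an assumption about the desired behaviour.

```python
import io


def parse_htseq_counts(handle, conv_func=int):
    """Illustrative parser: yield (gene_id, count) from tab-separated lines."""
    for line in handle:
        line = line.rstrip("\n")
        # Assumption: summary rows such as __no_feature are not gene counts
        if not line or line.startswith("__"):
            continue
        gene_id, count = line.split("\t")
        yield gene_id, conv_func(count)


sample = io.StringIO("K00001\t12\nK00002\t0\n__no_feature\t340\n")
int_counts = list(parse_htseq_counts(sample))

sample.seek(0)  # rewind the in-memory file before reading it again
float_counts = dict(parse_htseq_counts(sample, conv_func=float))
```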

mgkit.counts.func.load_sample_counts(info_dict, counts_iter, taxonomy, inc_anc=None, rank=None, gene_map=None, ex_anc=None, include_higher=True, cached=True, uid_used=None)

Changed in version 0.1.14: added cached argument

Changed in version 0.1.15: added uid_used parameter

Changed in version 0.2.0: info_dict can be a function

Reads sample counts, filtering and mapping them if requested. It’s an example of the usage of the above functions.

Parameters

info_dict (dict) – dictionary that has uid as key and (gene_id, taxon_id) as value. Alternatively, a function that accepts a uid as sole argument and returns (gene_id, taxon_id)
counts_iter (iterable) – iterable that yields a (uid, count)
taxonomy – taxonomy instance
inc_anc (int, list) – ancestor taxa to include
rank (str) – rank to which the counts are mapped
gene_map (dict) – dictionary with the gene mappings
ex_anc (int, list) – ancestor taxa to exclude
include_higher (bool) – if False, any rank different than the requested one is discarded
cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved

Returns

array with MultiIndex (gene_id, taxon_id) with the filtered and mapped counts

Return type

pandas.Series

mgkit.counts.func.load_sample_counts_to_genes(info_func, counts_iter, taxonomy, inc_anc=None, gene_map=None, ex_anc=None, cached=True, uid_used=None)

New in version 0.1.14.

Changed in version 0.1.15: added uid_used parameter

Reads sample counts, filtering and mapping them if requested. It’s a variation of load_sample_counts(), with the counts being mapped only to each specific gene_id. Another difference is that no assumption is made about the type of the first parameter: it only needs to be a callable that returns a (gene_id, taxon_id) tuple.

Parameters

info_func (callable) – any callable that accepts a uid as the only parameter and returns (gene_id, taxon_id) as value
counts_iter (iterable) – iterable that yields a (uid, count)
taxonomy – taxonomy instance
inc_anc (int, list) – ancestor taxa to include
rank (str) – rank to which the counts are mapped
gene_map (dict) – dictionary with the gene mappings
ex_anc (int, list) – ancestor taxa to exclude
cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved

Returns

array with Index gene_id with the filtered and mapped counts

Return type

pandas.Series

mgkit.counts.func.load_sample_counts_to_taxon(info_func, counts_iter, taxonomy, inc_anc=None, rank=None, ex_anc=None, include_higher=True, cached=True, uid_used=None)

New in version 0.1.14.

Changed in version 0.1.15: added uid_used parameter

Reads sample counts, filtering and mapping them if requested. It’s a variation of load_sample_counts(), with the counts being mapped only to each specific taxon. Another difference is that no assumption is made about the type of the first parameter: it only needs to be a callable that returns a (gene_id, taxon_id) tuple.

Parameters

info_func (callable) – any callable that accepts a uid as the only parameter and returns (gene_id, taxon_id) as value
counts_iter (iterable) – iterable that yields a (uid, count)
taxonomy – taxonomy instance
inc_anc (int, list) – ancestor taxa to include
rank (str) – rank to which the counts are mapped
ex_anc (int, list) – ancestor taxa to exclude
include_higher (bool) – if False, any rank different than the requested one is discarded
cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved

Returns

array with Index taxon_id with the filtered and mapped counts

Return type

pandas.Series

mgkit.counts.func.map_counts(counts_iter, info_func, gmapper=None, tmapper=None, index=None, uid_used=None)

Changed in version 0.1.14: added index parameter

Changed in version 0.1.15: added uid_used parameter

Maps counts according to the gmapper and tmapper functions. Each mapped gene ID count is the sum of all uid that have the same ID(s). The same is true for the taxa.

Parameters

counts_iter (iterable) – iterator that yields a tuple (uid, count)
info_func (func) – function accepting a uid that returns a tuple (gene_id, taxon_id)
gmapper (func) – function that accepts a gene_id and returns a list of mapped IDs
tmapper (func) – function that accepts a taxon_id and returns a new taxon_id
index (None, str) – if None, the index of the Series is (gene_id, taxon_id); if a str, it can be either gene or taxon, to specify a single value
uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved

Returns

array with MultiIndex (gene_id, taxon_id) with the mapped counts

Return type

pandas.Series
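
The summing behaviour can be sketched with a dictionary in place of the returned Series. This is an illustration, not the library code: map_counts_sketch is a stand-in name, the index parameter is left out, all IDs are toy values, and storing the contributing uids as a set per key in uid_used is an assumption about its layout.

```python
from collections import defaultdict


def map_counts_sketch(counts_iter, info_func, gmapper=None, tmapper=None,
                      uid_used=None):
    """Illustrative stand-in: sum counts per (mapped gene ID, mapped taxon ID)."""
    totals = defaultdict(int)
    for uid, count in counts_iter:
        gene_id, taxon_id = info_func(uid)
        # gmapper returns a list of IDs, tmapper a single new taxon_id
        gene_ids = [gene_id] if gmapper is None else gmapper(gene_id)
        if tmapper is not None:
            taxon_id = tmapper(taxon_id)
        for mapped_id in gene_ids:
            key = (mapped_id, taxon_id)
            totals[key] += count
            if uid_used is not None:
                # Assumption: record the contributing uids as a set per key
                uid_used.setdefault(key, set()).add(uid)
    return dict(totals)


# Toy data: uid -> (gene_id, taxon_id), gene -> categories, taxon -> genus
info = {"u1": ("K00001", 839), "u2": ("K00002", 838), "u3": ("K00001", 838)}
gene_map = {"K00001": ["ko00010"], "K00002": ["ko00010", "ko00020"]}
genus_of = {839: 838, 838: 838}

uids = {}
mapped = map_counts_sketch(
    [("u1", 10), ("u2", 5), ("u3", 7)],
    info.__getitem__,
    gmapper=lambda gene_id: gene_map.get(gene_id, []),
    tmapper=genus_of.get,
    uid_used=uids,
)
```

The count of u2 is added to both of its mapped IDs, so the three uids collapse into two keys.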

mgkit.counts.func.map_counts_to_category(counts, gene_map, nomap=False, nomap_id='NOMAP')

Used to map the counts from a certain gene identifier to another. Genes with no mappings are not counted, unless nomap=True, in which case they are counted as nomap_id.

Parameters

counts (iterator) – an iterator that yields a tuple, with the first value being the gene_id and the second value the count for it
gene_map (dictionary) – a dictionary whose keys are the gene_id yielded by counts and whose values are iterables of mapping identifiers
nomap (bool) – if False, counts for genes with no mappings in gene_map are discarded; if True, they are counted as nomap_id
nomap_id (str) – name of the mapping for genes with no mappings

Returns

mapped counts

Return type

pandas.Series
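
The effect of nomap can be sketched with a Counter in place of the returned Series. This is an illustration, not the library code: map_to_category_sketch is a stand-in name, the identifiers are toy values, and treating an empty mapping list the same as a missing key is an assumption.

```python
from collections import Counter


def map_to_category_sketch(counts, gene_map, nomap=False, nomap_id="NOMAP"):
    """Illustrative stand-in: sum gene counts per mapping identifier."""
    mapped = Counter()
    for gene_id, count in counts:
        categories = gene_map.get(gene_id)
        if not categories:
            if nomap:
                mapped[nomap_id] += count  # keep unmapped genes under nomap_id
            continue
        for category in categories:
            mapped[category] += count
    return dict(mapped)


gene_counts = [("K00001", 10), ("K00002", 5), ("K99999", 3)]
gene_map = {"K00001": ["ko00010"], "K00002": ["ko00010", "ko00020"]}

discarded = map_to_category_sketch(gene_counts, gene_map)
retained = map_to_category_sketch(gene_counts, gene_map, nomap=True)
```

With nomap=True the 3 counts of the unmapped gene appear under NOMAP instead of being dropped.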

mgkit.counts.func.map_gene_id_to_map(gene_map, gene_id): Function that extracts a list of gene mappings from a dictionary and returns an empty list if the gene_id is not found.

mgkit.counts.func.map_taxon_id_to_rank(taxonomy, rank, taxon_id, include_higher=True)

Maps a taxon_id to the requested taxon rank. Returns None if include_higher is False and the found rank is not the one requested.

Internally uses mgkit.taxon.Taxonomy.get_ranked_taxon()

Parameters

taxonomy – taxonomy instance
rank (str) – taxonomic rank requested
taxon_id (int) – taxon_id to map
include_higher (bool) – if False, any rank different than the requested one is discarded

Returns

if the mapping is successful, the ranked taxon_id is returned, otherwise None is returned

Return type

(int, None)
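
The effect of include_higher can be sketched on a toy lineage. This is an illustration only: map_to_rank_sketch, the RANKS list and the dictionary-based taxonomy are made up for the example and do not use mgkit.taxon.Taxonomy. That get_ranked_taxon() resolves ranks in exactly this way is an assumption.

```python
RANKS = ["phylum", "class", "order", "family", "genus", "species"]

# Toy taxonomy: taxon_id -> (parent_id, rank); IDs are made up
TOY_TAXONOMY = {
    1: (None, "phylum"),
    2: (1, "family"),
    3: (2, "genus"),
    4: (3, "species"),
}


def map_to_rank_sketch(taxonomy, rank, taxon_id, include_higher=True):
    """Illustrative stand-in: climb the lineage until the requested rank."""
    current = taxon_id
    while current is not None:
        parent_id, current_rank = taxonomy[current]
        if current_rank == rank:
            return current
        if RANKS.index(current_rank) < RANKS.index(rank):
            # Already above the requested rank: keep it only if allowed
            return current if include_higher else None
        current = parent_id
    return None


species_to_genus = map_to_rank_sketch(TOY_TAXONOMY, "genus", 4)
family_kept = map_to_rank_sketch(TOY_TAXONOMY, "genus", 2)
family_dropped = map_to_rank_sketch(TOY_TAXONOMY, "genus", 2,
                                    include_higher=False)
```

A species maps to its genus, while a family-level taxon is either returned unchanged or discarded, depending on include_higher.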
