mgkit.counts.func module
New in version 0.1.13.
Misc functions for count data
-
mgkit.counts.func.batch_load_htseq_counts(count_files, samples=None, cut_name=None)
Loads a list of htseq count result files and returns a DataFrame (IDxSAMPLE)
The sample names are the file names if samples and cut_name are None. Supplying a list of sample names with samples is the preferred way; cut_name is kept for backward compatibility and as an option in cases where a string replace is enough.
- Parameters
- Returns
with sample names as columns and gene_ids as index
- Return type
pandas.DataFrame
-
mgkit.counts.func.filter_counts(counts_iter, info_func, gfilters=None, tfilters=None)
Returns the counts for which the gene_id and taxon_id associated with each uid pass the filters.
- Parameters
counts_iter (iterable) – iterator that yields a tuple (uid, count)
info_func (func) – function accepting a uid that returns a tuple (gene_id, taxon_id)
gfilters (iterable) – list of filters to apply to each uid associated gene_id
tfilters (iterable) – list of filters to apply to each uid associated taxon_id
- Yields
tuple – (uid, count) that pass filters
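The filtering described above can be sketched in plain Python. This is an illustrative stand-in, not the mgkit implementation: the name filter_counts_sketch and the sample data are hypothetical, and it assumes that each filter is a callable returning True when an ID should be kept and that all filters must pass.

```python
def filter_counts_sketch(counts_iter, info_func, gfilters=None, tfilters=None):
    """Yield (uid, count) pairs whose gene_id and taxon_id pass every filter."""
    for uid, count in counts_iter:
        gene_id, taxon_id = info_func(uid)
        # Assumption: a filter is a callable returning True to keep the ID
        if gfilters is not None and not all(f(gene_id) for f in gfilters):
            continue
        if tfilters is not None and not all(f(taxon_id) for f in tfilters):
            continue
        yield uid, count


# Hypothetical data: uid -> (gene_id, taxon_id)
info = {"u1": ("K00001", 2), "u2": ("K00002", 2157), "u3": ("K00001", 2157)}
counts = [("u1", 10), ("u2", 5), ("u3", 7)]

kept = list(
    filter_counts_sketch(
        counts,
        info.get,  # a dict lookup works as info_func here
        gfilters=[lambda gene_id: gene_id == "K00001"],
        tfilters=[lambda taxon_id: taxon_id == 2157],
    )
)
print(kept)
```

Only u3 satisfies both the gene and the taxon filter in this example, so it is the only pair yielded.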
-
mgkit.counts.func.from_gff(annotations, samples, ann_func=None, sample_func=None)
New in version 0.3.1.
Loads count data from a GFF file, only for the requested samples. By default the function returns a DataFrame where the index is the uid of each annotation and the columns are the requested samples.
This can be customised by supplying ann_func and sample_func. sample_func is a function that accepts a sample name and is expected to return a string or a tuple. This will be used to change the columns in the DataFrame. ann_func must accept an mgkit.io.gff.Annotation instance and return an iterable, with each iteration yielding either a single element or a tuple (for a MultiIndex DataFrame); the count of that annotation is added to each element yielded.
- Parameters
annotations (iterable) – iterable yielding annotations
samples (iterable) – list of samples to keep
ann_func (func) – function used to customise the output
sample_func (func) – function to customise the column elements
- Returns
dataframe with the count data, columns are the samples and rows the annotation counts (unless mapped with ann_func)
- Return type
DataFrame
- Examples:
Assuming we have a list of annotations and samples SAMPLE1 and SAMPLE2, we can obtain the count table for all annotations with this
>>> from_gff(annotations, ['SAMPLE1', 'SAMPLE2'])
Assuming we want to group the samples, for example treatment1, treatment2 and control1, control2, into a MultiIndex DataFrame column
>>> sample_func = lambda x: ('T' if x.startswith('t') else 'C', x)
>>> from_gff(annotations, ['treatment1', 'treatment2', 'control1', 'control2'], sample_func=sample_func)
Annotations can be mapped to other levels. For example, instead of using the uid that is the default, they can be mapped to the gene_id and taxon_id information that is included in the annotation, resulting in a MultiIndex index for the rows, with (gene_id, taxon_id) as key.
>>> ann_func = lambda x: [(x.gene_id, x.taxon_id)]
>>> from_gff(annotations, ['SAMPLE1', 'SAMPLE2'], ann_func=ann_func)
-
mgkit.counts.func.get_uid_info(info_dict, uid)
Simple function to get a value from a dictionary of tuples (gene_id, taxon_id)
-
mgkit.counts.func.get_uid_info_ann(annotations, uid)
Simple function to get a value from a dictionary of annotations
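Other functions in this module take an info_func that accepts only a uid, while the two helpers above take the dictionary as first argument. One way to bridge the two, sketched here, is to bind the dictionary with functools.partial. The stand-in get_uid_info_sketch is hypothetical and assumes the helper is a plain dictionary lookup; the data is made up.

```python
from functools import partial


def get_uid_info_sketch(info_dict, uid):
    # Assumption: the helper simply looks the uid up in the dictionary
    return info_dict[uid]


# Hypothetical data: uid -> (gene_id, taxon_id)
info_dict = {"uid1": ("K00001", 2), "uid2": ("K00002", 2157)}

# Bind the dictionary so the result accepts a uid as its only argument
info_func = partial(get_uid_info_sketch, info_dict)

gene_id, taxon_id = info_func("uid2")
```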
-
mgkit.counts.func.load_counts_from_gff(annotations, elem_func=<function <lambda>>, sample_func=None, nozero=True)
New in version 0.2.5.
Loads the counts for each annotation that are stored in the annotation's counts_ attributes. Annotations with a total of 0 counts are skipped by default (nozero=True). The row index is set to the uid of the annotation and the column to the sample name. The functions used to transform the indices expect the annotation (for the row, elem_func) and the sample name (for the column, sample_func).
- Parameters
annotations (iter) – iterable of annotations
elem_func (func) – function that accepts an annotation and returns a str/int for an Index or a tuple for a MultiIndex, defaults to returning the uid of the annotation
sample_func (func, None) – function that accepts the sample name and returns a tuple for a MultiIndex. Defaults to None so no transformation is performed
nozero (bool) – if True, annotations with no counts are skipped
-
mgkit.counts.func.load_deseq2_results(file_name, taxon_id=None)
New in version 0.1.14.
Reads a CSV file output with DESeq2 results, adding a taxon_id to the index for concatenating multiple results from different taxonomic groups.
- Parameters
file_name (str) – file name of the CSV
- Returns
a MultiIndex DataFrame with the results
- Return type
pandas.DataFrame
-
mgkit.counts.func.load_htseq_counts(file_handle, conv_func=<class 'int'>)
Changed in version 0.1.15: added conv_func parameter
Loads an HTSeq-count result file
- Parameters
file_handle (file or str) – file handle or string with file name
conv_func (func) – function to convert the number from string, defaults to int, but float can be used as well
- Yields
tuple – first element is the gene_id and the second is the count
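As a rough illustration of what such a loader does, the sketch below parses tab-separated HTSeq-count output from an open file handle. It is not the mgkit code: load_htseq_counts_sketch is a hypothetical name, it only accepts a file handle, and skipping the summary lines that htseq-count prefixes with a double underscore (for example __no_feature) is an assumption about the desired behaviour.

```python
import io


def load_htseq_counts_sketch(file_handle, conv_func=int):
    """Yield (gene_id, count) from tab-separated HTSeq-count output."""
    for line in file_handle:
        line = line.strip()
        # Assumption: blank lines and htseq-count summary rows are skipped
        if not line or line.startswith("__"):
            continue
        gene_id, count = line.split("\t")
        yield gene_id, conv_func(count)


# Made-up file content, wrapped in StringIO to stand in for a real file
example = io.StringIO("geneA\t12\ngeneB\t0\n__no_feature\t340\n__ambiguous\t8\n")
rows = list(load_htseq_counts_sketch(example))

# float can be passed instead of int, as the conv_func description notes
float_rows = list(load_htseq_counts_sketch(io.StringIO("geneA\t12\n"), conv_func=float))
```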
-
mgkit.counts.func.load_sample_counts(info_dict, counts_iter, taxonomy, inc_anc=None, rank=None, gene_map=None, ex_anc=None, include_higher=True, cached=True, uid_used=None)
Changed in version 0.1.14: added cached argument
Changed in version 0.1.15: added uid_used parameter
Changed in version 0.2.0: info_dict can be a function
Reads sample counts, filtering and mapping them if requested. It’s an example of the usage of the above functions.
- Parameters
info_dict (dict) – dictionary that has uid as key and (gene_id, taxon_id) as value. Alternatively, a function that accepts a uid as sole argument and returns (gene_id, taxon_id)
counts_iter (iterable) – iterable that yields a (uid, count)
taxonomy – taxonomy instance
rank (str) – rank to which the counts are mapped
gene_map (dict) – dictionary with the gene mappings
include_higher (bool) – if False, any rank different than the requested one is discarded
cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved
- Returns
array with MultiIndex (gene_id, taxon_id) with the filtered and mapped counts
- Return type
pandas.Series
-
mgkit.counts.func.load_sample_counts_to_genes(info_func, counts_iter, taxonomy, inc_anc=None, gene_map=None, ex_anc=None, cached=True, uid_used=None)
New in version 0.1.14.
Changed in version 0.1.15: added uid_used parameter
Reads sample counts, filtering and mapping them if requested. It’s a variation of load_sample_counts(), with the counts being mapped only to each specific gene_id. Another difference is the absence of any assumption on the first parameter. It is expected to return a (gene_id, taxon_id) tuple.
- Parameters
info_func (callable) – any callable that accepts a uid as the only parameter and returns (gene_id, taxon_id) as value
counts_iter (iterable) – iterable that yields a (uid, count)
taxonomy – taxonomy instance
rank (str) – rank to which the counts are mapped
gene_map (dict) – dictionary with the gene mappings
cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved
- Returns
array with Index gene_id with the filtered and mapped counts
- Return type
pandas.Series
-
mgkit.counts.func.load_sample_counts_to_taxon(info_func, counts_iter, taxonomy, inc_anc=None, rank=None, ex_anc=None, include_higher=True, cached=True, uid_used=None)
New in version 0.1.14.
Changed in version 0.1.15: added uid_used parameter
Reads sample counts, filtering and mapping them if requested. It’s a variation of load_sample_counts(), with the counts being mapped only to each specific taxon. Another difference is the absence of any assumption on the first parameter. It is expected to return a (gene_id, taxon_id) tuple.
- Parameters
info_func (callable) – any callable that accepts a uid as the only parameter and returns (gene_id, taxon_id) as value
counts_iter (iterable) – iterable that yields a (uid, count)
taxonomy – taxonomy instance
rank (str) – rank to which the counts are mapped
include_higher (bool) – if False, any rank different than the requested one is discarded
cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved
- Returns
array with Index taxon_id with the filtered and mapped counts
- Return type
pandas.Series
-
mgkit.counts.func.map_counts(counts_iter, info_func, gmapper=None, tmapper=None, index=None, uid_used=None)
Changed in version 0.1.14: added index parameter
Changed in version 0.1.15: added uid_used parameter
Maps counts according to the gmapper and tmapper functions. Each mapped gene ID count is the sum of all uid that have the same ID(s). The same is true for the taxa.
- Parameters
counts_iter (iterable) – iterator that yields a tuple (uid, count)
info_func (func) – function accepting a uid that returns a tuple (gene_id, taxon_id)
gmapper (func) – function that accepts a gene_id and returns a list of mapped IDs
tmapper (func) – function that accepts a taxon_id and returns a new taxon_id
index (None, str) – if None, the index of the Series is (gene_id, taxon_id); if a str, it can be either gene or taxon, to specify a single value
uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved
- Returns
array with MultiIndex (gene_id, taxon_id) with the mapped counts
- Return type
pandas.Series
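The summing behaviour described for map_counts can be sketched with a plain dictionary in place of a pandas.Series. This is an assumption-laden illustration, not the library code: map_counts_sketch and the sample data are hypothetical, the index and uid_used options are left out, and any special handling of unmapped IDs is omitted.

```python
from collections import defaultdict


def map_counts_sketch(counts_iter, info_func, gmapper=None, tmapper=None):
    """Sum counts per (mapped gene ID, mapped taxon_id) key."""
    mapped = defaultdict(int)
    for uid, count in counts_iter:
        gene_id, taxon_id = info_func(uid)
        # gmapper returns a list of mapped IDs; without it keep the gene_id
        gene_ids = gmapper(gene_id) if gmapper is not None else [gene_id]
        if tmapper is not None:
            taxon_id = tmapper(taxon_id)
        # Each mapped ID receives the full count of the uid
        for map_id in gene_ids:
            mapped[(map_id, taxon_id)] += count
    return dict(mapped)


# Hypothetical data
info = {"u1": ("g1", 10), "u2": ("g2", 11), "u3": ("g1", 12)}
gene_map = {"g1": ["A", "B"], "g2": ["A"]}
taxon_map = {10: 1, 11: 1, 12: 2}
counts = [("u1", 3), ("u2", 4), ("u3", 5)]

result = map_counts_sketch(counts, info.get, gmapper=gene_map.get, tmapper=taxon_map.get)
print(result)
```

Here u1 and u2 both map to gene A in taxon 1, so their counts are summed under the same key.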
-
mgkit.counts.func.map_counts_to_category(counts, gene_map, nomap=False, nomap_id='NOMAP')
Used to map the counts from a certain gene identifier to another. Genes with no mappings are not counted, unless nomap=True, in which case they are counted as nomap_id.
- Parameters
counts (iterator) – an iterator that yields a tuple, with the first value being the gene_id and the second value the count for it
gene_map (dictionary) – a dictionary whose keys are the gene_id yielded by counts and whose values are iterables of mapping identifiers
nomap (bool) – if False, counts for genes with no mappings in gene_map are discarded; if True, they are counted as nomap_id
nomap_id (str) – name of the mapping for genes with no mappings
- Returns
mapped counts
- Return type
pandas.Series
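The effect of nomap and nomap_id is easiest to see on a small example. The sketch below returns a dictionary rather than a pandas.Series and is not the mgkit implementation; map_counts_to_category_sketch and the data are hypothetical, and it assumes a gene's full count is added to every category it maps to.

```python
from collections import Counter


def map_counts_to_category_sketch(counts, gene_map, nomap=False, nomap_id="NOMAP"):
    """Sum gene counts per mapped category."""
    mapped = Counter()
    for gene_id, count in counts:
        map_ids = gene_map.get(gene_id)
        if not map_ids:
            # Unmapped genes are dropped unless nomap is True
            if nomap:
                mapped[nomap_id] += count
            continue
        for map_id in map_ids:
            mapped[map_id] += count
    return dict(mapped)


# Hypothetical data: g3 has no mapping
counts = [("g1", 3), ("g2", 4), ("g3", 5)]
gene_map = {"g1": ["catA", "catB"], "g2": ["catA"]}

without_nomap = map_counts_to_category_sketch(counts, gene_map)
with_nomap = map_counts_to_category_sketch(counts, gene_map, nomap=True)
```

With nomap=False the count for g3 is discarded; with nomap=True it appears under the NOMAP key.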
-
mgkit.counts.func.map_gene_id_to_map(gene_map, gene_id)
Function that extracts a list of gene mappings from a dictionary and returns an empty list if the gene_id is not found.
-
mgkit.counts.func.map_taxon_id_to_rank(taxonomy, rank, taxon_id, include_higher=True)
Maps a taxon_id to the requested taxon rank. Returns None if include_higher is False and the found rank is not the one requested.
Internally uses mgkit.taxon.Taxonomy.get_ranked_taxon()
- Parameters
- Returns
if the mapping is successful, the ranked taxon_id is returned, otherwise None is returned
- Return type