mgkit.snps.funcs module
Functions used in SNPs manipulation
-
mgkit.snps.funcs.build_rank_matrix(dataframe, taxonomy=None, taxon_rank=None)
Make a rank matrix from a pandas.Series with the pN/pS values of a dataset.
- Parameters
dataframe – pandas.Series instance with a MultiIndex (gene-taxon)
taxonomy – taxon.Taxonomy instance with the full taxonomy
taxon_rank (str) – taxon rank to limit the specificity of the taxa included
- Returns
pandas.DataFrame instance
-
mgkit.snps.funcs.combine_sample_snps(snps_data, min_num, filters, index_type=None, gene_func=None, taxon_func=None, use_uid=False, flag_values=False, haplotypes=True, store_uids=False, partial_calc=False, partial_syn=True)
Changed in version 0.2.2: added use_uid argument
Changed in version 0.3.1: added haplotypes
Changed in version 0.4.0: added store_uids
Changed in version 0.5.3: added partial_calc and partial_syn
Combine a dictionary sample->gene_index->GeneSNP into a pandas.DataFrame. The dictionary is first filtered with the functions in filters, then mapped to different taxa and genes using taxon_func and gene_func respectively. The returned DataFrame is also filtered to keep only the rows with at least min_num non-NaN values.
- Parameters
snps_data (dict) – dictionary with the GeneSNP instances
min_num (int) – the minimum number of non-NaN values a row needs in order to be returned
filters (iterable) – iterable containing filter functions; a list can be found in mgkit.snps.filter
index_type (str, None) – if None, each row index for the DataFrame will be a MultiIndex with gene and taxon as elements. If it equals ‘gene’, the row index will be gene based and if ‘taxon’ it will be taxon based
gene_func (func) – a function to map a gene_id to a gene_map. See mapper.map_gene_id() for an example
taxon_func (func) – a function to map a taxon_id to a list of IDs. See mapper.map_taxon_id_to_rank or mapper.map_taxon_id_to_ancestor for examples
use_uid (bool) – if True, uses the GeneSNP.uid instead of GeneSNP.gene_id
flag_values (bool) – if True, mgkit.snps.classes.GeneSNP.calc_ratio_flag() will be used instead of mgkit.snps.classes.GeneSNP.calc_ratio()
haplotypes (bool) – if flag_values is False, and haplotypes is True, the 0/0 case will be returned as 0 instead of NaN
store_uids (bool) – if True, a dictionary with the uid used for each cell (e.g. gene/taxon/sample) is also returned
partial_calc (bool) – if True, only pS or pN values will be calculated, depending on the value of partial_syn
partial_syn (bool) – if both partial_calc and this are True, only pS values will be calculated. If this parameter is False, pN values will be calculated
- Returns
pandas.DataFrame with the pN/pS values for the input SNPs, with the columns being the samples. If store_uids is True, the return value is a tuple (DataFrame, dict)
- Return type
DataFrame
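The min_num row filter described for combine_sample_snps can be illustrated without pandas. This is a minimal sketch, not mgkit's code: it assumes the combined table is held as a dictionary mapping a (gene, taxon) row key to per-sample values, and the helper name filter_rows_by_min_num, the taxon IDs and the values are all hypothetical.

```python
import math


def filter_rows_by_min_num(rows, min_num):
    """Keep only the rows with at least min_num values that are not NaN.

    Hypothetical helper for illustration; it is not part of mgkit.
    """
    kept = {}
    for row_key, sample_values in rows.items():
        # Count the samples that carry an actual (non-NaN) value
        valid = sum(1 for value in sample_values.values() if not math.isnan(value))
        if valid >= min_num:
            kept[row_key] = sample_values
    return kept


nan = float("nan")
# Made-up pN/pS values: rows are (gene, taxon) pairs, columns are samples
rows = {
    ("K00001", 839): {"S1": 0.5, "S2": 0.7, "S3": nan},
    ("K00002", 839): {"S1": nan, "S2": nan, "S3": 0.2},
    ("K00001", 2172): {"S1": 0.1, "S2": 0.3, "S3": 0.4},
}

# Only rows with two or more non-NaN values survive
filtered = filter_rows_by_min_num(rows, min_num=2)
```

On a pandas.DataFrame, a comparable filter is dataframe.dropna(thresh=min_num). Whether combine_sample_snps uses that call internally is not stated here.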
-
mgkit.snps.funcs.flat_sample_snps(snps_data, min_cov)
New in version 0.1.11.
Adds all the values of a gene across all samples into one instance of classes.GeneSNP, giving the average gene among all samples.
- Parameters
- Returns
the dictionary with only one key (all_samples), which can be used with combine_sample_snps()
- Return type
-
mgkit.snps.funcs.group_rank_matrix(dataframe, gene_map)
Group a rank matrix using a mapping, in the form map_id->ko_ids.
- Parameters
dataframe – instance of a rank matrix from build_rank_matrix()
gene_map (dict) – dictionary with the mapping
- Returns
pandas.DataFrame instance
-
mgkit.snps.funcs.order_ratios(ratios, aggr_func=<function median>, reverse=False, key_filter=None)
Given a dictionary of id->iterable, where each iterable contains the values of interest, the function applies aggr_func to each iterable, sorts the keys by the result (ascending by default) and returns a list of the keys in sorted order.
- Parameters
- Returns
iterable with the sort order
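The ordering performed by order_ratios can be sketched in plain Python. This is a minimal illustration under stated assumptions, not mgkit's implementation: the name order_ratios_sketch and the sample data are hypothetical, statistics.median stands in for the default aggregation function, and key_filter is left out because its behaviour is not described in this section.

```python
from statistics import median


def order_ratios_sketch(ratios, aggr_func=median, reverse=False):
    """Return the keys of ratios sorted by aggr_func of their values.

    Hypothetical stand-in for illustration; it is not mgkit's function.
    """
    # Aggregate each iterable into one number and sort the keys by it
    return sorted(ratios, key=lambda key: aggr_func(ratios[key]), reverse=reverse)


# Made-up pN/pS values per gene
ratios = {
    "K00001": [0.9, 1.1, 1.0],
    "K00002": [0.2, 0.4, 0.3],
    "K00003": [0.5, 0.7, 0.6],
}

ascending = order_ratios_sketch(ratios)
descending = order_ratios_sketch(ratios, reverse=True)
```

Passing a different aggr_func, such as max or min, changes the ranking criterion without touching the sorting logic.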
-
mgkit.snps.funcs.significance_test(dataframe, taxon_id1, taxon_id2, test_func=<function ks_2samp>)
New in version 0.1.11.
Perform a statistical test on each gene distribution in two different taxa.
For each gene common to the two taxa, the distributions of its values across all samples (columns) in the two specified taxa are tested against each other.
- Parameters
dataframe – pandas.DataFrame instance
taxon_id1 – the first taxon ID
taxon_id2 – the second taxon ID
test_func – function used to test, defaults to scipy.stats.ks_2samp()
- Returns
Series with all p-values from the tests
- Return type
pandas.Series
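The per-gene comparison described for significance_test can be sketched with the standard library alone. This is a hedged illustration, not mgkit's code: it assumes rows are keyed by (gene, taxon) pairs, and it swaps in an exact permutation test on the difference of means in place of scipy.stats.ks_2samp so that the block has no third-party dependencies. The names permutation_test and significance_test_sketch, the taxon IDs and the values are all hypothetical.

```python
from itertools import combinations


def permutation_test(sample1, sample2):
    """Exact two-sided permutation test on the difference of means.

    Returns (statistic, pvalue), the same shape scipy.stats.ks_2samp
    unpacks to. It is a stand-in, not the test mgkit uses by default.
    """
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    n2 = len(pooled) - n1
    total = sum(pooled)

    def mean_diff(indices):
        group1 = sum(pooled[i] for i in indices)
        return abs(group1 / n1 - (total - group1) / n2)

    observed = mean_diff(range(n1))
    splits = list(combinations(range(len(pooled)), n1))
    # Small tolerance so ties are not lost to floating point rounding
    extreme = sum(1 for split in splits if mean_diff(split) >= observed - 1e-12)
    return observed, extreme / len(splits)


def significance_test_sketch(rows, taxon_id1, taxon_id2, test_func=permutation_test):
    """Test every gene present in both taxa and return {gene: pvalue}."""
    genes1 = {gene for gene, taxon in rows if taxon == taxon_id1}
    genes2 = {gene for gene, taxon in rows if taxon == taxon_id2}
    pvalues = {}
    # Only genes found in both taxa can be compared
    for gene in sorted(genes1 & genes2):
        _, pvalue = test_func(rows[(gene, taxon_id1)], rows[(gene, taxon_id2)])
        pvalues[gene] = pvalue
    return pvalues


# Made-up pN/pS values: one list of per-sample values per (gene, taxon) row
rows = {
    ("K00001", 839): [0.1, 0.2, 0.3],
    ("K00001", 2172): [0.7, 0.8, 0.9],
    ("K00002", 839): [0.4, 0.5, 0.6],
    ("K00003", 2172): [0.2, 0.3, 0.4],
}

pvalues = significance_test_sketch(rows, 839, 2172)
```

Only K00001 appears in both taxa, so it is the only gene tested. Any callable that returns a (statistic, pvalue) pair can be passed as test_func in this sketch.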
-
mgkit.snps.funcs.write_sign_genes_table(out_file, dataframe, sign_genes, taxonomy, gene_names=None)
Write a table with the list of significant genes found in a dataframe. The significant gene list is the result of wilcoxon_pairwise_test_dataframe().
- Out_file
the file name or file object to write the table to
- Dataframe
the dataframe which was tested for significant genes
- Sign_genes
list of the significant genes
- Taxonomy
taxonomy object
- Gene_names
dictionary with the names of the genes. Optional