mgkit.snps.funcs module

Functions used in SNPs manipulation

mgkit.snps.funcs.build_rank_matrix(dataframe, taxonomy=None, taxon_rank=None)[source]

Make a rank matrix from a pandas.Series with the pN/pS values of a dataset.

Parameters
  • dataframe – pandas.Series instance with a MultiIndex (gene-taxon)

  • taxonomy – taxon.Taxonomy instance with the full taxonomy

  • taxon_rank (str) – taxon rank to limit the specificity of the taxa included

Returns

pandas.DataFrame instance

mgkit.snps.funcs.combine_sample_snps(snps_data, min_num, filters, index_type=None, gene_func=None, taxon_func=None, use_uid=False, flag_values=False, haplotypes=True, store_uids=False, partial_calc=False, partial_syn=True)[source]

Changed in version 0.2.2: added use_uid argument

Changed in version 0.3.1: added haplotypes

Changed in version 0.4.0: added store_uids

Changed in version 0.5.3: added partial_calc and partial_syn

Combine a dictionary sample->gene_index->GeneSNP into a pandas.DataFrame. The dictionary is first filtered with the functions in filters, then mapped to different taxa and genes using taxon_func and gene_func respectively. The returned DataFrame is also filtered to keep only the rows with at least min_num non-NaN values.

Parameters
  • snps_data (dict) – dictionary with the GeneSNP instances

  • min_num (int) – the minimum number of non-NaN values a row must have to be returned

  • filters (iterable) – iterable containing filter functions, a list can be found in mgkit.snps.filter

  • index_type (str, None) – if None, each row index for the DataFrame will be a MultiIndex with gene and taxon as elements. If it equals ‘gene’, the row index will be gene based, and if it equals ‘taxon’, it will be taxon based

  • gene_func (func) – a function to map a gene_id to a gene_map. See mapper.map_gene_id() for an example

  • taxon_func (func) – a function to map a taxon_id to a list of IDs. See mapper.map_taxon_id_to_rank or mapper.map_taxon_id_to_ancestor for examples

  • use_uid (bool) – if True, uses the GeneSNP.uid instead of GeneSNP.gene_id

  • flag_values (bool) – if True, mgkit.snps.classes.GeneSNP.calc_ratio_flag() will be used, instead of mgkit.snps.classes.GeneSNP.calc_ratio()

  • haplotypes (bool) – if flag_values is False, and haplotypes is True, the 0/0 case will be returned as 0 instead of NaN

  • store_uids (bool) – if True, a dictionary with the uid used for each cell (e.g. gene/taxon/sample) is also returned

  • partial_calc (bool) – if True, only pS or pN values will be calculated, depending on the value of partial_syn

  • partial_syn (bool) – if both partial_calc and this are True, only pS values will be calculated. If this parameter is False, pN values will be calculated

Returns

pandas.DataFrame with the pN/pS values for the input SNPs, with the columns being the samples. If store_uids is True, the return value is a tuple (DataFrame, dict)

Return type

DataFrame
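The min_num rule described above can be illustrated without mgkit. The following is a minimal sketch, assuming rows are held in a plain dictionary keyed by (gene, taxon) with one value per sample; the function name rows_with_min_values and the example IDs are hypothetical and are not part of the library.

```python
import math


def rows_with_min_values(rows, min_num):
    """Keep only the rows that have at least min_num non-NaN values.

    rows is a dictionary (gene, taxon) -> list of per-sample values.
    This is an illustration of the rule, not mgkit's implementation.
    """
    kept = {}
    for key, values in rows.items():
        # Count the samples for which a value is available
        valid = sum(1 for value in values if not math.isnan(value))
        if valid >= min_num:
            kept[key] = values
    return kept


nan = float("nan")
# Hypothetical gene and taxon IDs, three samples per row
rows = {
    ("K00001", 838): [0.5, nan, 0.7],
    ("K00002", 838): [nan, nan, 0.2],
    ("K00003", 839): [nan, nan, nan],
}
kept = rows_with_min_values(rows, min_num=2)
print(list(kept))
```

With a pandas.DataFrame, the same kind of row filter is available as dataframe.dropna(thresh=min_num).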

mgkit.snps.funcs.flat_sample_snps(snps_data, min_cov)[source]

New in version 0.1.11.

Adds all the values of a gene across all samples into one instance of classes.GeneSNP, giving the average gene among all samples.

Parameters
  • snps_data (dict) – dictionary with the instances of classes.GeneSNP

  • min_cov (int) – minimum coverage required for each instance to be added

Returns

a dictionary with only one key (all_samples), which can be used with combine_sample_snps()

Return type

dict

mgkit.snps.funcs.group_rank_matrix(dataframe, gene_map)[source]

Group a rank matrix using a mapping, in the form map_id->ko_ids.

Parameters
  • dataframe – the rank matrix to group

  • gene_map – the mapping, in the form map_id->ko_ids

Returns

pandas.DataFrame instance

mgkit.snps.funcs.order_ratios(ratios, aggr_func=<function median>, reverse=False, key_filter=None)[source]

Given a dictionary of id->iterable, where each iterable contains the values of interest, the function applies aggr_func to each iterable, sorts the keys by the result (ascending by default) and returns a list of the keys in sorted order.

Parameters
  • ratios (dict) – dictionary instance id->iterable

  • aggr_func (function) – any function returning a value that can be used as a key in sorting

  • reverse (bool) – the default is ascending sorting (False), set to True to reverse

  • key_filter – list of keys to use for ordering, if None, every key is used

Returns

iterable with the sort order
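The behaviour described for order_ratios can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not mgkit's implementation: order_ratios_sketch is a hypothetical name, statistics.median stands in for the default aggr_func, and the example keys are made up.

```python
from statistics import median


def order_ratios_sketch(ratios, aggr_func=median, reverse=False,
                        key_filter=None):
    """Return the keys of ratios sorted by aggr_func of their values."""
    # With no key_filter, every key in the dictionary is ordered
    keys = list(ratios) if key_filter is None else list(key_filter)
    return sorted(keys, key=lambda key: aggr_func(ratios[key]),
                  reverse=reverse)


# Hypothetical IDs, each with the values observed across samples
ratios = {
    "K00001": [0.2, 0.4, 0.9],
    "K00002": [0.1, 0.1, 0.3],
    "K00003": [1.2, 0.8, 1.0],
}

order = order_ratios_sketch(ratios)
order_desc = order_ratios_sketch(ratios, reverse=True)
order_subset = order_ratios_sketch(ratios, key_filter=["K00003", "K00001"])
print(order, order_desc, order_subset)
```

Any callable that returns a sortable value can be passed as aggr_func, for example max or statistics.mean.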

mgkit.snps.funcs.significance_test(dataframe, taxon_id1, taxon_id2, test_func=<function ks_2samp>)[source]

New in version 0.1.11.

Perform a statistical test on each gene distribution in two different taxa.

For each gene common to the two taxa, the distribution of values in all samples (columns) between the two specified taxa is tested.

Parameters
  • dataframe – pandas.DataFrame instance

  • taxon_id1 – the first taxon ID

  • taxon_id2 – the second taxon ID

  • test_func – function used to test, defaults to scipy.stats.ks_2samp()

Returns

a Series with all p-values from the tests

Return type

pandas.Series

mgkit.snps.funcs.write_sign_genes_table(out_file, dataframe, sign_genes, taxonomy, gene_names=None)[source]

Write a table with the list of significant genes found in a dataframe. The significant gene list is the result of wilcoxon_pairwise_test_dataframe().

Parameters
  • out_file – the file name or file object to write the table to

  • dataframe – the dataframe that was tested for significant genes

  • sign_genes – list of the genes that are significant

  • taxonomy – taxonomy object

  • gene_names – dictionary with the names of the genes. Optional