mgkit.align module¶
Module dealing with BAM/SAM files
-
class
mgkit.align.
SamtoolsDepth
(file_handle, num_seqs=10000, max_size=1000000, max_size_dict=None, calc_density=False, dtype='uint32')[source]¶ Bases:
object
Changed in version 0.4.2: several optimisations and changes to support a scanning approach, instead of lookup table. No exception is raised when a sequence is not found in the file, instead assuming that the coverage is 0
Changed in version 0.4.0: uses pandas.SparseArray now. It should use less memory, but needs pandas version > 0.24
New in version 0.3.0.
This a class that helps use the results from
read_samtools_depth()
.There are 2 modes of operations:
Request a region coverage via
SamtoolsDepth.region_coverage()
Advance the samtools depth file until a sequence coverage is read
The method 1) was the default in MGKit < 0.4.2, when the user would request a region coverage and this class would advance the reading until the requested region is found. While scanning, all the sequences encountered are kept as pandas Series with SparseArray declared, reindexed from 0 to the max_size_dict values if provided, or max_size. The advantage is that the it’s easier to use, but with bigger datasets will keep high memory usage. This is the case for case 2) which involves calling directly
SamtoolsDepth.advance_file()
returning the sequence ID read. Then the region coverage can be requested the same way as before, but won’t involve scanning the file.The use of a SparseArray in a pandas Series allows the use of samtools depth files that weren’t produced with samtools depth -aa and save memory.
Note
The amount of memory used is fairly high, especially with the increasing number of sequences in a GFF/Depth file. It is recoommended to use max_size_dict to 1) only creates arrays for sequences needed and 2) use arrays of smaller size
Warning
In MGKit 0.4.2, the internal dictionary to keep the SparseArray(s) is a
weakref.WeakValueDictionary
, to improve the release of memory. That on Linux seems to release the array too fast. Upgrade to MGKit version 0.4.3 if this class is needed-
advance_file
()[source]¶ New in version 0.4.2.
Requests the scan of the file and returns the sequence found, the coverage can then be retrivied with
SamtoolsDepth.region_coverage()
-
closed
= False¶
-
data
= None¶
-
density
= None¶
-
drop_sequence
(seq_id)[source]¶ New in version 0.4.2.
Remove the sequence passed from the internal dictionary
-
file_handle
= None¶
-
max_size
= None¶
-
max_size_dict
= None¶
-
not_found
= None¶
-
region_coverage
(seq_id, start, end)[source]¶ Changed in version 0.4.2: now using
SamtoolsDepth.advance_file()
to scan the fileReturns the mean coverage of a region. The start and end parameters are expected to be 1-based coordinates, like the correspondent attributes in
mgkit.io.gff.Annotation
ormgkit.io.gff.GenomicRange
.If the sequence for which the coverage is requested is not found, the depth file is read (and cached) until it is found.
-
mgkit.align.
add_coverage_info
(annotations, bam_files, samples, attr_suff='_cov')[source]¶ Changed in version 0.3.4: the coverage now is returned as floats instead of int
Adds coverage information to annotations, using BAM files.
The coverage information is added for each sample as a ‘sample_cov’ and the total coverage as as ‘cov’ attribute in the annotations.
Note
The bam_files and sample variables must have the same order
- Parameters
annotations (iterable) – iterable of annotations
bam_files (iterable) – iterable of
pysam.Samfile
instancessample (iterable) – names of the samples for the BAM files
-
mgkit.align.
covered_annotation_bp
(files, annotations, min_cov=1, progress=False)[source]¶ New in version 0.1.14.
Returns the number of base pairs covered of annotations over multiple samples.
- Parameters
- Returns
a dictionary whose keys are the uid and the values the number of bases that are covered by reads among all samples
- Return type
-
mgkit.align.
get_region_coverage
(bam_file, seq_id, feat_from, feat_to)[source]¶ Return coverage for an annotation.
Note
feat_from and feat_to are 1-based indexes
-
mgkit.align.
read_samtools_depth
(file_handle, num_seqs=10000, seq_ids=None)[source]¶ Changed in version 0.4.2: the function returns lists instead of numpy arrays for speed (at least in my tests it seems ~4x increase)
Changed in version 0.4.0: now returns 3 array, instead of 2. Also added seq_ids to skip lines
Changed in version 0.3.4: num_seqs can be None to avoid a log message
New in version 0.3.0.
Reads a samtools depth file, returning a generator that yields the array of each base coverage on a per-sequence base.
Note
There’s no need anymore to use samtools depth -aa, because the function returns the position array and this can be used to create a Pandas SparseArray which can be reindexed to include missing positions (with values of 0)
Valid for version < 0.4.0:
The information on position is not used, to use numpy and save memory. samtools depth should be called with the -aa option:
`samtools depth -aa bamfile`
This options will output both base position with 0 coverage and sequneces with no aligned reads
- Parameters
- Yields
tuple – the first element is the sequence identifier, the second one is the list with the positions, the third element is the list with the coverages