sampling-utils - Resampling Utilities
Overview
New in version 0.3.1.
Resampling Utilities
sample command
This command samples sequences from a Fasta or FastQ file. Each sequence is picked with a user-defined probability (-r parameter, 0.001 or 1 / 1000 by default), up to a maximum number of sequences (-x parameter, 100,000 by default). By default 1 sample is extracted, but as many as desired can be taken by using the -n parameter.
The input sequence file can be passed either on the standard input or as the last parameter on the command line. By default a Fasta file is expected, unless the -q parameter is passed.
The -p parameter specifies the prefix used for the output file names, and the output files can be gzipped by using the -z parameter.
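The sampling rule described above (a per-sequence probability, a cap on the number of sequences and several samples taken in one pass) can be sketched in a few lines of Python. This is only an illustration, not the code used by the command: the function name sample_many and its arguments are hypothetical, and drawing each sample independently for every record is an assumption.

```python
import random


def sample_many(records, number=1, prob=0.001, max_seq=100000, seed=None):
    """Take `number` samples from `records` in a single pass.

    Hypothetical helper: each record is tested independently for every
    sample (an assumption) and a sample stops growing at `max_seq` records.
    """
    rng = random.Random(seed)
    samples = [[] for _ in range(number)]
    for record in records:
        for sample in samples:
            if len(sample) < max_seq and rng.random() < prob:
                sample.append(record)
        # Stop reading the input once every sample is full
        if all(len(sample) >= max_seq for sample in samples):
            break
    return samples


# Toy input: (header, sequence) tuples instead of a real Fasta file
records = [("seq%d" % i, "ACGT") for i in range(10000)]
samples = sample_many(records, number=3, prob=0.01, max_seq=50, seed=42)
```

Records are kept in the order they are read, so each sample stays sorted in the same way as the input file.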
sample_stream command
It works in the same way as sample, but the file is sampled only once and the output is written to the standard output by default. This can be convenient if streams are the preferred way to sample the file.
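A single-pass, stream-to-stream version of the same idea might look like the sketch below. The names read_fasta and sample_stream are hypothetical Python functions written for this example, and the default values simply mirror those documented for sample, which is an assumption since no defaults are listed for sample_stream.

```python
import io
import random


def read_fasta(handle):
    """Lazily yield (header, sequence) tuples from a Fasta text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def sample_stream(in_handle, out_handle, prob=0.001, max_seq=100000, seed=None):
    """Write each record to out_handle with probability `prob`, once."""
    rng = random.Random(seed)
    written = 0
    for header, seq in read_fasta(in_handle):
        if written >= max_seq:
            break
        if rng.random() < prob:
            out_handle.write(">%s\n%s\n" % (header, seq))
            written += 1
    return written


# In-memory streams stand in for stdin/stdout; prob=1.0 keeps every record
source = io.StringIO(">a\nACGT\n>b\nGG\nCC\n")
target = io.StringIO()
count = sample_stream(source, target, prob=1.0)
```

Passing sys.stdin and sys.stdout as the two handles would give the pipe-friendly behaviour described above.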
sync command
Used to keep the forward and reverse read files of paired-end FASTQ data in sync. The scenario is that the sample command was used to resample one FASTQ file, usually the forward, but the reverse is needed as well. In this case the resampled file, called the master, is passed to the -m option and the input file is the file to be synced (the reverse). The input file is scanned until a header matching the current one in the master file is found, and when that happens the sequence is written. The next sequence is then read from the master file and the process is repeated until all sequences in the master file are found in the input file. This requires the 2 files to be sorted in the same way, which is what the sample command does.
Note
The old Casava format is not supported by this command at the moment, as it is unusual to find it in SRA or other repositories.
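The scan used by sync can be illustrated with a short Python sketch. It is not the command's implementation: read_fastq, read_id and sync are hypothetical names, the reader assumes exactly 4 lines per record, and comparing only the part of the header before the first space is an assumption about how paired headers are matched.

```python
import io


def read_fastq(handle):
    """Lazily yield (header, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def read_id(header):
    # Assumption: the text before the first space identifies the read pair
    return header.partition(" ")[0]


def sync(master_handle, in_handle, out_handle):
    """Write the records of in_handle whose IDs appear, in order, in the master."""
    masters = read_fastq(master_handle)
    current = next(masters, None)
    written = 0
    for header, seq, qual in read_fastq(in_handle):
        if current is None:
            break  # every master record has been matched
        if read_id(header) == read_id(current[0]):
            out_handle.write("@%s\n%s\n+\n%s\n" % (header, seq, qual))
            written += 1
            current = next(masters, None)
    return written


# Forward reads r2 and r4 were sampled; pull the same reads from the reverse file
master = io.StringIO("@r2 1:N:0\nAC\n+\nII\n@r4 1:N:0\nGT\n+\nII\n")
reverse = io.StringIO("".join("@r%d 2:N:0\nTT\n+\nII\n" % i for i in range(1, 6)))
synced = io.StringIO()
written = sync(master, reverse, synced)
```

Because both files are read strictly forward, a master record that appears out of order relative to the input file would never be matched, which is why the two files need the same ordering.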
rand_seq command
Generates random FastA/Q sequences, allowing the specification of GC content and the proportion of sequences that are coding rather than random. If the output format chosen is FastQ, qualities are generated using a decreasing model with added noise. A constant model can be selected instead with a switch. Parameters such as GC content, length and the type of model can be inferred by passing a FastA/Q file, with the quality model fitted using a LOWESS (using mgkit.utils.sequence.extrapolate_model()).
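As a rough illustration of how a target GC content can drive random sequence generation, the sketch below draws each base with weights derived from the GC fraction. The function random_sequence is hypothetical and is not taken from the package; splitting the GC and AT fractions evenly between the two bases of each pair is an assumption, and coding sequences are not covered.

```python
import random


def random_sequence(length=150, gc_content=0.5, rng=None):
    """Return a random DNA string whose expected GC fraction is `gc_content`."""
    rng = rng or random.Random()
    # Assumption: G and C share the GC fraction equally, as do A and T
    gc_weight = gc_content / 2
    at_weight = (1.0 - gc_content) / 2
    weights = [at_weight, gc_weight, gc_weight, at_weight]  # order: A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


seq = random_sequence(length=150, gc_content=0.6, rng=random.Random(1))
observed_gc = (seq.count("G") + seq.count("C")) / len(seq)
```

The observed GC fraction of a single short sequence fluctuates around the requested value, since every base is drawn independently.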
The noise in that case is modelled as a normal distribution fitted to the deviations of the qualities along the sequence from the fitted LOWESS, and scaled back by half to avoid too drastic changes in the qualities. The qualities are also clipped at 40 to avoid compatibility problems with FastQ readers. If inferred, the model can be saved (as a pickle file) and loaded back for analysis.
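The noise step can be sketched with the standard library alone. This is one reading of the description above rather than the package's code: fit_noise and noisy_qualities are hypothetical names, "scaled back by half" is interpreted as halving the standard deviation of the fitted normal, the fitted model values are passed in directly instead of being computed with a LOWESS, and the lower clip at 0 is an addition made for this example.

```python
import random
import statistics

MAX_QUAL = 40  # upper clip mentioned in the text


def fit_noise(observed, fitted):
    """Fit a normal to the residuals (observed - fitted) and halve its scale."""
    residuals = [obs - fit for obs, fit in zip(observed, fitted)]
    loc = statistics.mean(residuals)
    scale = statistics.pstdev(residuals) / 2  # assumption: "half" applies to the scale
    return loc, scale


def noisy_qualities(fitted, loc, scale, rng=None):
    """Add normal noise to the model values and clip the result to 0..40."""
    rng = rng or random.Random()
    qualities = []
    for value in fitted:
        noisy = round(value + rng.gauss(loc, scale))
        qualities.append(min(MAX_QUAL, max(0, noisy)))
    return qualities


# Toy example: a decreasing model and some observed qualities around it
model = [38.0, 36.0, 34.0, 32.0]
observed = [39, 35, 35, 31]
loc, scale = fit_noise(observed, model)
quals = noisy_qualities(model, loc, scale, rng=random.Random(0))
```

Clipping after adding the noise keeps every value inside the range that FastQ readers commonly accept, regardless of how large a single noise draw is.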
Changes
Changed in version 0.3.4: using click instead of argparse. Now rand_seq can save and reload models.
Changed in version 0.3.3: added sync, sample_stream and rand_seq commands.
Options
sampling-utils
Main function
sampling-utils [OPTIONS] COMMAND [ARGS]...
Options
--version
    Show the version and exit.
--cite
rand_seq
Generates random FastA/Q sequences
sampling-utils rand_seq [OPTIONS] [OUTPUT_FILE]
Options
-v, --verbose
-n, --num-seqs <num_seqs>
    Number of sequences to generate
    Default: 1000
-gc, --gc-content <gc_content>
    GC content (defaults to .5 out of 1)
    Default: 0.5
-i, --infer-params <infer_params>
    Infer parameters GC content and Quality model from file
-r, --coding-prop <coding_prop>
    Proportion of coding sequences
    Default: 0.0
-l, --length <length>
    Sequence length
    Default: 150
-d, --const-model
    Use a model with constant qualities + noise
-x, --dist-loc <dist_loc>
    Use as the starting point quality
    Default: 30.0
-q, --fastq
    The output file is a FastQ file
-m, --save-model <save_model>
    Save inferred qualities model to a pickle file
-a, --read-model <read_model>
    Load qualities model from a pickle file
--progress
    Shows Progress Bar
Arguments
OUTPUT_FILE
    Optional argument
sample
Sample a FastA/Q multiple times
sampling-utils sample [OPTIONS] [INPUT_FILE]
Options
-v, --verbose
-p, --prefix <prefix>
    Prefix for the file name(s) in output
    Default: sample
-n, --number <number>
    Number of samples to take
    Default: 1
-r, --prob <prob>
    Probability of picking a sequence
    Default: 0.001
-x, --max-seq <max_seq>
    Maximum number of sequences
    Default: 100000
-q, --fastq
    The input file is a fastq file
-z, --gzip
    gzip output files
Arguments
INPUT_FILE
    Optional argument
sample_stream
Samples a FastA/Q one time, alternative to sample if multiple sampling is not needed
sampling-utils sample_stream [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-v, --verbose
-r, --prob <prob>
    Probability of picking a sequence
-x, --max-seq <max_seq>
    Maximum number of sequences
-q, --fastq
    The input file is a fastq file
Arguments
INPUT_FILE
    Optional argument
OUTPUT_FILE
    Optional argument
sync
Syncs a FastQ file generated with sample with the original pair of files.
sampling-utils sync [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
Options
-v, --verbose
-m, --master-file <master_file>
    Required. Resampled FastQ file that is out of sync with the original pair
Arguments
INPUT_FILE
    Optional argument
OUTPUT_FILE
    Optional argument