Changed in version 0.3.4: updates
Tutorial¶
The aim of this tutorial is to show how to build a pipeline to analyse metagenomic samples. The SNP calling part is intended to show how diversity estimates can be calculated from metagenomic data; for a real analysis its parameters should be made more strict.
We’re going to use the Peru Margin Subseafloor Biosphere dataset as an example, which can be downloaded from the ENA website.
This tutorial is expected to be run on a UNIX system (Linux/macOS/Solaris) with the bash shell, because some of the loops are bash-specific (not tested with other shells).
Note
We assume that all scripts/commands are run in the same directory.
Warning
It is advised to run the tutorial on a cluster/server: the memory requirements of the programs used (external to the library) are quite high.
Initial setup¶
We will assume that the pipeline and its related packages are already installed on the system where the tutorial is run, either through a system-wide install or a virtual environment (advised). The details are in the Installation section of the documentation.
For the rest of the tutorial we also assume that the software used (velvet, BLAST or RAPSearch, bowtie2, samtools, bcftools and HTSeq) is installed and accessible system-wide.
Getting Sequence Data¶
The data is stored on the EBI FTP as well, and can be downloaded with the following commands (on Linux):
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001326/SRR001326.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001325/SRR001325.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001323/SRR001323.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001322/SRR001322.fastq.gz
On macOS you can replace wget with curl -O.
And then uncompress with:
$ gunzip *.fastq.gz
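Optionally, as a quick sanity check (not part of the pipeline itself), the number of reads in each file can be counted, assuming standard 4-line FASTQ records:
for file in *.fastq; do
    echo $file $(( $(wc -l < $file) / 4 ));
done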
Taxonomy Data¶
The taxonomy is only needed for an optional part of the gene prediction. It can be downloaded using the command:
$ download-taxonomy.sh
The data will be saved in the file taxonomy.pickle, to which we’ll refer from now on. More information can be found in Download Taxonomy.
Metagenome Assembly¶
We’re going to use velvet to assemble the metagenomic sample, using the following commands in bash:
$ velveth velvet_work 31 -fmtAuto *.fastq
$ velvetg velvet_work -min_contig_lgth 50
The contigs are in the velvet_work/contigs.fa file. We want to take out some of the information in each sequence header, to make it easier to identify them. We decided to keep only NODE_#, where # is a unique number in the file (e.g. from >NODE_27_length_157_cov_703.121033 we keep only >NODE_27). We used this command in bash:
$ cat velvet_work/contigs.fa | sed -E 's/(>NODE_[0-9]+)_.+/\1/g' > assembly.fa
Alternatively, fasta-utils uid can be used to avoid problems with spaces in the headers:
$ fasta-utils uid velvet_work/contigs.fa assembly.fa
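Either way, it’s worth verifying that the headers were renamed as expected, for example by inspecting the first few and counting the contigs:
$ grep '^>' assembly.fa | head -3
$ grep -c '^>' assembly.fa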
Gene Prediction¶
Gene prediction can be done with any software that outputs matches in the tab format used by BLAST. Besides BLAST, RAPSearch can be used as well.
Before that, a suitable DB must be downloaded. In this tutorial we’ll use the SwissProt portion of Uniprot (http://www.uniprot.org), which can be downloaded with the following commands:
$ wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
$ gunzip uniprot_sprot.fasta.gz
Using BLAST¶
BLAST needs the DB to be indexed using the following command:
$ makeblastdb -dbtype prot -in uniprot_sprot.fasta
After which BLAST can be run:
$ blastx -query assembly.fa -db uniprot_sprot.fasta -out \
assembly.uniprot.tab -outfmt 6
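blastx can take a long time on large assemblies; assuming a multi-core machine is available, the standard -num_threads option can be used to speed it up (a sketch, adjust the thread count to your system):
$ blastx -query assembly.fa -db uniprot_sprot.fasta -out \
    assembly.uniprot.tab -outfmt 6 -num_threads 8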
Using RAPSearch¶
RAPSearch is faster than BLAST, while giving similar results. As with BLAST, there is a command to be executed before it can predict genes:
$ prerapsearch -d uniprot_sprot.fasta -n uniprot_sprot
After this command completes, RAPSearch can be started:
$ rapsearch -q assembly.fa -d uniprot_sprot -o assembly.uniprot.tab
RAPSearch will produce two files, assembly.uniprot.tab.m8 and assembly.uniprot.tab.aln. assembly.uniprot.tab.m8 is the file in the correct format, so we can rename it and remove the other one:
$ rm assembly.uniprot.tab.aln
$ mv assembly.uniprot.tab.m8 assembly.uniprot.tab
Create the GFF¶
After BLAST or RAPSearch has finished, we can convert the predictions to a GFF file:
$ blast2gff uniprot -b 40 -db UNIPROT-SP -dbq 10 assembly.uniprot.tab \
assembly.uniprot.gff
And then, because the number of annotations is high, we filter them to reduce the number of overlapping annotations:
$ filter-gff overlap assembly.uniprot.gff assembly.uniprot-filt.gff
This will result in a smaller file. Both scripts support piping, so they can be chained, for example to avoid writing the intermediate file:
$ blast2gff uniprot -b 40 -db UNIPROT-SP -dbq 10 assembly.uniprot.tab | \
filter-gff overlap - assembly.uniprot-filt.gff
Finally, rename the filtered GFF file:
$ mv assembly.uniprot-filt.gff assembly.uniprot.gff
Warning
filter-gff may require a lot of memory, so it’s recommended to read its documentation for strategies to lower the memory requirements for big datasets (a small script to sort a GFF, sort-gff.sh, is included).
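If sort-gff.sh is not at hand, a minimal equivalent using GNU sort (assuming the GFF contains no header directives) sorts by sequence name and start position:
$ sort -s -t $'\t' -k 1,1 -k 4,4n assembly.uniprot.gff > assembly.uniprot-sorted.gff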
Taxonomic Refinement¶
This section is optional, as taxonomic identifiers are already assigned using Uniprot, but it can result in better identification. It requires the nt database from NCBI to be present on the system, in the ncbi-db directory.
If you don’t have the nt database installed, it can be downloaded (over 80GB uncompressed, about 30GB compressed) with the following commands (you’ll need to install ncftpget):
$ mkdir ncbi-db
$ cd ncbi-db
$ ncftpget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt*.gz
$ for file in *.tar.gz; do tar xzvf $file; done
$ cd ..
To do this, the nucleotide sequences of the annotations must first be extracted, and then matched against the nt database using blastn:
$ get-gff-info sequence -f assembly.fa assembly.uniprot.gff \
assembly.uniprot.frag.fasta
$ blastn -query assembly.uniprot.frag.fasta -db ncbi-db/nt -out \
assembly.uniprot.frag.tab -outfmt 6
After BLAST completes, we need to download a supporting file to associate the results with the taxonomic information:
$ download-ncbi-taxa.sh
$ gunzip -c ncbi-nucl-taxa.gz | taxon-utils to_hdf -n nt
We now need to run the taxon-utils script (taxon-utils - Taxonomy Utilities) to find the LCA for each annotation. BLAST outputs too many matches, so we also want to filter this file first, with filter-gff. First we convert the BLAST tab file into a GFF, then use filter-gff to keep only the hits in the 95% quantile of hit length, and then those in the 95% quantile of identity. Finally we run taxon-utils to get the LCA table:
$ blast2gff blastdb -i 3 -r assembly.uniprot.frag.tab | \
filter-gff sequence -t -a length -f quantile -l 0.95 -c gt | \
filter-gff sequence -t -a identity -f quantile -l 0.95 -c gt | \
add-gff-info addtaxa -f taxa-table.hf5:nt | \
taxon-utils lca -b 40 -t taxonomy.pickle -s -p - lca.tab
What we do is convert the BLAST results into a GFF file, removing the version information from the accessions. We then filter the GFF, keeping only the annotations in the top 5% of identity scores, use only annotations with a bit score of at least 40, and write the result as a two-column table.
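As an optional check, since lca.tab is a two-column table (sequence identifier and taxon identifier), the number of sequences that received an LCA assignment can be counted with:
$ cut -f 1 lca.tab | sort -u | wc -l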
We can now run the script to add the taxonomic information to the GFF file, with:
$ add-gff-info addtaxa -v -t lca.tab -a seq_id -db NCBI-NT \
assembly.uniprot.gff assembly.uniprot-taxa.gff
After it completes, it is safe to rename the output GFF:
$ mv assembly.uniprot-taxa.gff assembly.uniprot.gff
Complete GFF¶
To add the remaining information (mappings to KO and others, as well as taxonomic information), a script is provided that downloads it and adds it to the GFF file:
$ add-gff-info uniprot --buffer 500 -t -e -ec -ko \
assembly.uniprot.gff assembly.uniprot-final.gff
After which we can rename the GFF file:
$ mv assembly.uniprot-final.gff assembly.uniprot.gff
Alignment¶
The alignment of all reads to the assembly will be made with bowtie2. The first step is to build the index for the reference (our assembly) with the following command:
$ bowtie2-build assembly.fa assembly.fa
and subsequently start the alignment, using bowtie2 and piping the output SAM file to samtools to convert it into BAM files with this command:
for file in *.fastq; do
BASENAME=`basename $file .fastq`
bowtie2 -N 1 -x assembly.fa -U $file \
--very-sensitive-local \
--rg-id $BASENAME --rg PL:454 --rg PU:454 \
--rg SM:$BASENAME | samtools view -Sb - > $BASENAME.bam;
done
We’ll have BAM files which we need to sort and index:
for file in *.bam; do
samtools sort -o `basename $file .bam`-sort.bam $file;
mv `basename $file .bam`-sort.bam $file
samtools index $file;
done
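Optionally, the alignments can be inspected with samtools, for example to see how many reads mapped in each sample:
for file in *.bam; do
    echo $file;
    samtools flagstat $file;
done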
Coverage and SNP Info¶
Coverage information needs to be added to the GFF file for the later SNP analysis, together with the expected number of synonymous and non-synonymous changes for each annotation. The following lines do this, using one of the scripts included with the library:
$ export SAMPLES=$(for file in *.bam; do echo -a `basename $file .bam`,$file ;done)
$ add-gff-info coverage $SAMPLES assembly.uniprot.gff | add-gff-info \
exp_syn -r assembly.fa > assembly.uniprot-update.gff
$ mv assembly.uniprot-update.gff assembly.uniprot.gff
$ unset SAMPLES
The first line prepares part of the command line for the script and stores it in an environment variable, while the last command unsets the variable, as it’s no longer needed. The second command adds the coverage information and the expected number of synonymous and non-synonymous changes for each annotation.
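For reference, with only the four BAM files from this tutorial in the directory, the first line sets SAMPLES to:
-a SRR001322,SRR001322.bam -a SRR001323,SRR001323.bam \
-a SRR001325,SRR001325.bam -a SRR001326,SRR001326.bam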
A faster way to add the coverage to a GFF file is to use the cov_samtools command instead:
$ for x in *.bam; do samtools depth -aa $x > `basename $x .bam`.depth; done
$ add-gff-info cov_samtools $(for file in *.depth; do echo -s `basename $file .depth` -d $file ;done) assembly.uniprot.gff assembly.uniprot-update.gff
$ mv assembly.uniprot-update.gff assembly.uniprot.gff
This requires the creation of depth files with samtools, which can be fairly big. The script accepts files compressed with gzip, bzip2 (and xz if the module is available), but will be slower with them. For this tutorial, each uncompressed depth file is about 110MB.
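Since compressed depth files are accepted, they can be compressed once created to save disk space (the -d arguments above then need the .gz names):
$ gzip *.depth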
The coverage command’s memory footprint is tied to the GFF file, which is kept in memory. cov_samtools reads the depth information one line at a time and keeps a numpy array for each sequence (and each sample) in memory, while the GFF is streamed.
SNP Calling¶
bcftools¶
For calling SNPs, we can use bcftools (v1.8 was tested):
bcftools mpileup -f assembly.fa -Ou *.bam | bcftools call -m -v -O v --ploidy 1 -A -o assembly.vcf
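As a quick check, the number of variants called can be counted by excluding the VCF header lines:
$ grep -vc '^#' assembly.vcf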
Data Preparation¶
Diversity Analysis¶
To calculate diversity estimates (pN/pS) for the data, we first need to aggregate all SNP information from the VCF file into data structures that can be read and analysed by the library. This can be done using the included script snp_parser, with these lines of bash:
$ export SAMPLES=$(for file in *.bam; do echo -m `basename $file .bam`;done)
$ snp_parser -s -v -g assembly.uniprot.gff -p assembly.vcf -a assembly.fa $SAMPLES
$ unset SAMPLES
Note
The -s option must be added if the VCF file was created with bcftools.
Count Data¶
To evaluate the abundance of taxa and functional categories in the data we need to produce one file for each sample using htseq-count, from the HTSeq library.
for file in *.bam; do
htseq-count -f bam -r pos -s no -t CDS -i uid -a 8 \
-m intersection-nonempty $file assembly.uniprot.gff \
> `basename $file .bam`-counts.txt
done
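Each output file has one row per annotation uid, plus special counters (lines starting with __) at the end; as an optional check, the total number of reads assigned per sample can be computed by excluding those counters:
for file in *-counts.txt; do
    echo $file $(awk '$1 !~ /^__/ {sum += $2} END {print sum}' $file);
done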
And to add the counts to the GFF file:
$ add-gff-info counts `for x in *.bam; do echo -s $(basename $x .bam); done` \
`for x in *-counts.txt; do echo -c $x; done` assembly.uniprot.gff tmp.gff
$ mv tmp.gff assembly.uniprot.gff
Alternatively featureCounts from the subread package can be used:
$ featureCounts -a assembly.uniprot.gff -g uid -O -t CDS -o counts-featureCounts.txt *.bam
And adding it to the GFF is similar:
$ add-gff-info counts `for x in *.bam; do echo -s $(basename $x .bam); done` -c counts-featureCounts.txt -e assembly.uniprot.gff tmp.gff
$ mv tmp.gff assembly.uniprot.gff
Note, however, that featureCounts produces a single file for all samples; this is accepted by add-gff-info counts when the -e option is passed.
Additional Downloads¶
The following files need to be downloaded to analyse the functional categories later in the analysis:
$ wget http://eggnog.embl.de/version_3.0/data/downloads/COG.members.txt.gz
$ wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.members.txt.gz
$ wget http://eggnog.embl.de/version_3.0/data/downloads/COG.funccat.txt.gz
$ wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.funccat.txt.gz
and this for Enzyme Classification:
$ wget ftp://ftp.expasy.org/databases/enzyme/enzclass.txt
Full Bash Script¶
#!/bin/bash
#download data
#50 meters
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001326/SRR001326.fastq.gz
#1 meter
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001325/SRR001325.fastq.gz
#32 meters
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001323/SRR001323.fastq.gz
#16 meters
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001322/SRR001322.fastq.gz
#uncompress data
gunzip -v *.fastq.gz
#assembly - preparatory phase
velveth velvet_work 31 -fmtAuto *.fastq
#assembly
velvetg velvet_work -min_contig_lgth 50
#change sequence names
cat velvet_work/contigs.fa | sed -E 's/(>NODE_[0-9]+)_.+/\1/g' > assembly.fa
#alternative
#fasta-utils uid velvet_work/contigs.fa assembly.fa
#remove velvet working directory
rm -R velvet_work
#To use the LCA option and other analysis we need a taxonomy file
download-taxonomy.sh
#Uniprot SwissProt DB
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
#Uncompress it
gunzip uniprot_sprot.fasta.gz
########
#Gene prediction
###BLAST
#index Uniprot
makeblastdb -dbtype prot -in uniprot_sprot.fasta
#Run blastx
blastx -query assembly.fa -db uniprot_sprot.fasta -out \
assembly.uniprot.tab -outfmt 6
###RAPSearch
#Index
# prerapsearch -d uniprot_sprot.fasta -n uniprot_sprot
#Run
# rapsearch -q assembly.fa -d uniprot_sprot -o assembly.uniprot.tab
#rename .m8 file to assembly.uniprot.tab and delete .aln
#rm assembly.uniprot.tab.aln
#mv assembly.uniprot.tab.m8 assembly.uniprot.tab
########
#Converts gene prediction into GFF annotations
blast2gff uniprot -b 40 -db UNIPROT-SP -dbq 10 assembly.uniprot.tab \
assembly.uniprot.gff
filter-gff overlap assembly.uniprot.gff assembly.uniprot-filt.gff
#rename the new filtered file
mv assembly.uniprot-filt.gff assembly.uniprot.gff
########
#Taxonomic refinement - requires NCBI nt DB installed and indexed
export NCBINT_DIR=ncbi-db
if [ -d "$NCBINT_DIR" ]; then
echo "Taxonomic refinement";
#Extract annotations sequences
get-gff-info sequence -f assembly.fa assembly.uniprot.gff \
assembly.uniprot.frag.fasta
#Use blastn to match against NCBI NT
blastn -query assembly.uniprot.frag.fasta -db ncbi-db/nt -out \
assembly.uniprot.frag.tab -outfmt 6
#Download necessary data
download-ncbi-taxa.sh
gunzip -c ncbi-nucl-taxa.gz | taxon-utils to_hdf -n nt
# Get the LCA for the sequences
blast2gff blastdb -i 3 -r assembly.uniprot.frag.tab | \
filter-gff sequence -t -a length -f quantile -l 0.95 -c gt | \
filter-gff sequence -t -a identity -f quantile -l 0.95 -c gt | \
add-gff-info addtaxa -f taxa-table.hf5:nt | \
taxon-utils lca -b 40 -t taxonomy.pickle -s -p - lca.tab
# Add the LCA info to the GFF
add-gff-info addtaxa -v -t lca.tab -a seq_id -db NCBI-NT \
assembly.uniprot.gff assembly.uniprot-taxa.gff
#rename the file to continue the script
mv assembly.uniprot-taxa.gff assembly.uniprot.gff
fi
unset NCBINT_DIR
########
#Finalise information from Gene Prediction
#Adds remaining taxonomy, EC numbers, KO and eggNOG mappings
add-gff-info uniprot --buffer 500 -t -e -ec -ko \
assembly.uniprot.gff assembly.uniprot-final.gff
#Rename the GFF
mv assembly.uniprot-final.gff assembly.uniprot.gff
########
#Alignments
bowtie2-build assembly.fa assembly.fa
for file in *.fastq; do
BASENAME=`basename $file .fastq`;
bowtie2 -N 1 -x assembly.fa -U $file \
--very-sensitive-local \
--rg-id $BASENAME --rg PL:454 --rg PU:454 \
--rg SM:$BASENAME | samtools view -Sb - > $BASENAME.bam;
done
#sort and index BAM files with samtools
for file in *.bam; do
samtools sort -o `basename $file .bam`-sort.bam $file;
#removes the unsorted file, it's not needed
mv `basename $file .bam`-sort.bam $file
samtools index $file;
done
########
#Add coverage and expected changes to GFF file
#export SAMPLES=$(for file in *.bam; do echo -a `basename $file .bam`,$file ;done)
#Coverage info
#add-gff-info coverage $SAMPLES assembly.uniprot.gff | add-gff-info \
# exp_syn -r assembly.fa > assembly.uniprot-update.gff
#rename to continue the script
#mv assembly.uniprot-update.gff assembly.uniprot.gff
#unset SAMPLES
#Faster
for x in *.bam; do samtools depth -aa $x > `basename $x .bam`.depth; done
add-gff-info cov_samtools $(for file in *.depth; do echo -s `basename $file .depth` -d $file ;done) assembly.uniprot.gff assembly.uniprot-update.gff
mv assembly.uniprot-update.gff assembly.uniprot.gff
########
#SNP calling using bcftools
bcftools mpileup -f assembly.fa -Ou *.bam | bcftools call -m -v -O v --ploidy 1 -A -o assembly.vcf
#Index fasta with Picard tools - GATK requires it
#java -jar picard-tools/picard.jar CreateSequenceDictionary \
# R=assembly.fa O=assembly.dict
#merge vcf
#export SAMPLES=$(for file in *.bam.vcf; do echo -V:`basename $file .bam.vcf` $file ;done)
# java -Xmx10g -jar GATK/GenomeAnalysisTK.jar \
# -R assembly.fa -T CombineVariants -o assembly.vcf \
# -genotypeMergeOptions UNIQUIFY \
# $SAMPLES
# unset SAMPLES
########
#snp_parser
export SAMPLES=$(for file in *.bam; do echo -m `basename $file .bam`;done)
snp_parser -s -v -g assembly.uniprot.gff -p assembly.vcf -a assembly.fa $SAMPLES
unset SAMPLES
########
#Count reads
# Using HTSeq
for file in *.bam; do
htseq-count -f bam -r pos -s no -t CDS -i uid -a 8 \
-m intersection-nonempty $file assembly.uniprot.gff \
> `basename $file .bam`-counts.txt
done
# Adding counts to GFF
add-gff-info counts `for x in *.bam; do echo -s $(basename $x .bam); done` \
`for x in *-counts.txt; do echo -c $x; done` assembly.uniprot.gff tmp.gff
mv tmp.gff assembly.uniprot.gff
# Using featureCounts
#featureCounts -a assembly.uniprot.gff -g uid -O -t CDS -o counts-featureCounts.txt *.bam
#add-gff-info counts `for x in *.bam; do echo -s $(basename $x .bam); done` -c counts-featureCounts.txt -e assembly.uniprot.gff tmp.gff
#mv tmp.gff assembly.uniprot.gff
########
#eggNOG mappings
wget http://eggnog.embl.de/version_3.0/data/downloads/COG.members.txt.gz
wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.members.txt.gz
wget http://eggnog.embl.de/version_3.0/data/downloads/COG.funccat.txt.gz
wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.funccat.txt.gz
########
#EC names
wget ftp://ftp.expasy.org/databases/enzyme/enzclass.txt