This page outlines our standard pipeline for analysis of genetic variation using 2bRAD, a cost-effective and flexible
sequencing-based approach for SNP genotyping. This guide covers the most recently updated version of the 2bRAD utilities pipeline as of July 2018 (v3.0).
These tools are intended to be used on a high-performance computing cluster, and assume a basic knowledge of Linux and command-line
tools. To analyze large numbers of samples in parallel, the user will probably want to (a) combine some of the
following steps into shell scripts, and (b) submit each job to a job scheduler such as SGE. Since the details of
those steps may vary from one cluster to another, we have left those details up to the end user.
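As one illustration of that approach, the sketch below writes one job script per sample, ready for submission with a scheduler command such as SGE's qsub. The sample names and the processing command inside the script are placeholders for your own; adjust for your cluster.

```shell
# Stand-in inputs so the sketch is self-contained; in practice these are
# your demultiplexed per-sample FASTQ files.
touch sampleA.fastq sampleB.fastq

# Write one job script per sample; the command inside the heredoc is a
# placeholder for the processing steps described below.
for fq in sample*.fastq; do
  name="${fq%.fastq}"
  cat > "job_${name}.sh" <<EOF
#!/bin/bash
TruncateFastq.pl -i $fq -s 1 -e 36 -o ${name}_trunc.fastq
EOF
  # qsub "job_${name}.sh"    # uncomment to submit on an SGE cluster
done
```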
Our scripts rely on BioPerl modules. First, check whether you need to install BioPerl.
perl -MBio::SeqIO -e 0
If you get no feedback, BioPerl is available on your system. If instead you get an error message like the following:
Can't locate Bio/SeqIO.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.14.2 ...
this indicates that you need to install BioPerl.
To install BioPerl, go to the BioPerl Wiki and follow the instructions given there.
Our scripts for 2bRAD analysis are hosted on GitHub. You'll want to download scripts from this repository:
git clone git://github.com/Eli-Meyer/2bRAD_utilities.git
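After cloning, a typical way to grant execute permission and add the scripts to your PATH is sketched below (the stand-in directory and script are created here only so the example is self-contained; normally git clone creates them):

```shell
# Stand-in for the cloned repository, so this sketch runs on its own;
# normally "git clone" creates this directory and its scripts.
mkdir -p 2bRAD_utilities
touch 2bRAD_utilities/TruncateFastq.pl

chmod +x 2bRAD_utilities/*.pl             # grant execute permission
export PATH="$PATH:$PWD/2bRAD_utilities"  # add the scripts to your PATH

command -v TruncateFastq.pl               # should now print the script's path
```

Note that `export PATH=...` only affects the current shell session; add the line to your shell startup file to make it permanent.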
Make sure you have execute permission for these scripts, and that they are in your path ($PATH).

To check whether bbduk is installed on your system, run:
which bbduk.sh
If you get no feedback, this indicates you need to install bbduk. Go to this
link
and follow the instructions to unpack this collection of software, which requires that Java version 6 or higher be installed on your system.
If you get feedback similar to this:
/local/cluster/bbmap/bbduk.sh
This indicates bbduk is already installed and in your path. You can move on to the next step.
To check whether SHRiMP is installed on your system, run:
which gmapper
If you get no feedback, this indicates you need to install SHRiMP. Go to this
link
and download and install SHRiMP according to the directions on that site.
If you get feedback similar to this:
/local/cluster/SHRiMP/bin/gmapper
This indicates SHRiMP is already installed and in your path. You can move on to the next step.
To check whether cd-hit-est is installed on your system, run:
which cd-hit-est
If you get feedback similar to this:
/local/cluster/cd-hit/cd-hit-est
cd-hit-est is already installed and in your path; otherwise, download and install CD-HIT before continuing.
To check whether RAxML is installed on your system, run:
which raxmlHPC
If you get feedback similar to this:
/local/cluster/RAxML/bin/raxmlHPC
RAxML is already installed and in your path; otherwise, download and install RAxML before continuing.
To check whether the genotyping tools (BGC, GFE, and HGC) are installed on your system, run:
which BGC
which GFE
which HGC
If you get feedback similar to this:
/local/cluster/BGC
each tool is already installed and in your path. If not, compile each tool from its C++ source, e.g.:
g++ HGC.cpp -o HGC -lm
Since AlfI produces uniform 36-bp restriction fragments, we first truncate sequences to keep only
the inserts derived from these fragments. This can be accomplished by running:
TruncateFastq.pl -i input.fastq -s 1 -e 36 -o output.fastq
to read "input.fastq", trim away the sequences after 36 bp, and write the output to a file called "output.fastq".
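For illustration only (not part of the pipeline), the same fixed-length truncation can be sketched with awk, assuming a strictly four-line-per-record FASTQ file:

```shell
# Build a tiny demonstration FASTQ containing one 50-bp read
seq50=$(printf 'A%.0s' $(seq 1 50))    # 50-base sequence
qual50=$(printf 'I%.0s' $(seq 1 50))   # matching 50-character quality string
printf '@read1\n%s\n+\n%s\n' "$seq50" "$qual50" > input.fastq

# Keep header and '+' lines (odd lines) as-is;
# trim sequence and quality lines (even lines) to 36 bp
awk 'NR % 2 == 1 { print; next } { print substr($0, 1, 36) }' input.fastq > output.fastq
```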
To read instructions for any of the scripts described here, run the script with no arguments, e.g.:
TruncateFastq.pl
Next, we exclude low quality reads that might introduce errors in genotyping. The choice of thresholds is
somewhat arbitrary and ultimately up to the user. For an example using reasonable default settings, run:
QualFilterFastq.pl -i input.fastq -m 20 -x 5 -o output.fastq
to remove any reads from "input.fastq" having more than 5 bases with quality scores lower than 20, and write
the output to a file named "output.fastq".
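For illustration, the filtering logic can also be sketched in awk (again, an alternative illustration rather than the pipeline script; it assumes Phred+33 quality encoding, in which a quality of 20 corresponds to the ASCII character '5', so bases below Q20 are encoded as '!' through '4'):

```shell
# Two demonstration reads: read1 is high quality throughout;
# read2 has six bases below Q20 ('#' = Q2) and should be removed.
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nACGTACGTAC\n+\nII######II\n' > input.fastq

# Count bases below Q20 (ASCII '!' through '4') on each quality line;
# print the 4-line record only if at most 5 such bases are found.
awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{q=$0; n=gsub(/[!-4]/,"",q);
             if (n<=5) printf "%s\n%s\n%s\n%s\n", h, s, p, $0}' input.fastq > output.fastq
```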
Finally, adaptor removal. The artificial DNA sequences introduced during
library preparation may occupy unknown portions of the read (including the entire read). Removing
these is probably the most important task in read processing, and certainly the most
computationally intensive. To make the whole pipeline easier to run in parallel, we've recently
switched from our old script (AdaptorFilterFastq.pl) to a newer kmer-based tool
called BBDuk.
BBDuk outperforms the old process in all respects and, most importantly, can be run
directly on large sequence datasets without tedious parallelizing.
To run this tool on a set of reads called input.fastq, to eliminate any reads sharing at least
one 12-mer with sequences in adaptors.fasta, we can run:
bbduk.sh in=input.fastq ref=adaptors.fasta k=12 stats=stats.txt out=output.fastq
to remove any reads in "input.fastq" having valid alignments (at least one 12-bp kmer match)
to sequences in "adaptors.fasta", and write the output (sequences passing the filter) to "output.fastq".
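The three read-processing steps above can be chained into a single per-sample wrapper script. A minimal sketch is shown below; the derived file names and the adaptors.fasta path are placeholders, and the wrapper assumes the pipeline scripts and bbduk.sh are in your PATH:

```shell
# Write a reusable wrapper that runs truncation, quality filtering,
# and adaptor removal for one sample.
cat > process_sample.sh <<'EOF'
#!/bin/bash
# Usage: ./process_sample.sh input.fastq
set -e
in="$1"
base="${in%.fastq}"
TruncateFastq.pl -i "$in" -s 1 -e 36 -o "${base}_trunc.fastq"
QualFilterFastq.pl -i "${base}_trunc.fastq" -m 20 -x 5 -o "${base}_hq.fastq"
bbduk.sh in="${base}_hq.fastq" ref=adaptors.fasta k=12 \
         stats="${base}_stats.txt" out="${base}_clean.fastq"
EOF
chmod +x process_sample.sh
```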
If a genome assembly is available for your species, you can extract AlfI recognition sites from the assembly for use as a reference, as:
ExtractSites.pl -i assembly.fasta -e AlfI -o output.fasta
For de novo analysis (no genome available), first combine an equal subset of reads (here, the first 4,000,000 lines, i.e. 1 million reads) from each sample's processed file:
head -n 4000000 file1.fastq >> combined.fastq
head -n 4000000 file2.fastq >> combined.fastq
...
head -n 4000000 fileN.fastq >> combined.fastq
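Rather than listing each file by hand, the same subsampling can be written as a loop (the file*.fastq pattern below is a placeholder for your own file layout):

```shell
# Stand-in inputs so the sketch is self-contained (one short read each)
printf '@a\nAC\n+\nII\n' > file1.fastq
printf '@b\nGT\n+\nII\n' > file2.fastq

rm -f combined.fastq
for fq in file*.fastq; do
  head -n 4000000 "$fq" >> combined.fastq   # at most 1 million reads per file
done
```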
Next, construct a de novo reference by clustering these reads, using the BuildRef.pl script, as:
BuildRef.pl -i combined.fastq -o reference.fasta
We recommend the SHRiMP software package, which is both faster and more sensitive in our testing than
any other mapping software currently available. However, in principle another mapping program can be substituted
at this step, as long as the output is in SAM format and includes positive alignment scores in the "AS:" field.
To use SHRiMP, run the gmapper tool, as:
gmapper --qv-offset 33 -Q --strata -o 3 -N 1 reads.fastq reference.fasta >gmapper.sam
In this example, a set of trimmed and filtered reads ("reads.fastq") is mapped against a reference sequence ("reference.fasta")
using a single thread (processor). The raw alignments are written to a file called "gmapper.sam".
Next, filter the raw alignments before genotyping, using the SAMFilter.pl script, as:
SAMFilter.pl -i gmapper.sam -m 32 -o filtered.sam
In this example, alignments in "gmapper.sam" are filtered at a threshold of 32 and written to "filtered.sam" (run SAMFilter.pl with no arguments for a description of each option).
First, we parse the alignments to record the nucleotides observed at each position in each reference sequence.
This can be accomplished using the SAMBaseCounts.pl script, as:
SAMBaseCounts.pl -i filtered.sam -r reference.fasta -c 5 -o base_counts.tab
In this example, the reference is "reference.fasta", the alignments to be parsed are in "filtered.sam", and we've chosen a minimum coverage of 5 (i.e. loci covered by < 5 reads will be ignored). The output is written to a tab-delimited text file called "base_counts.tab".
Because some of the genotyping methods require population information, we combine the nucleotide counts from all samples prior to
determining genotypes. This can be accomplished using the script CombineBaseCounts.pl, as:
CombineBaseCounts.pl sample1/base_counts.tab sample2/base_counts.tab ... sampleN/base_counts.tab >combined.tab
In this example, we are combining files called "base_counts.tab", for each sample ("sample1", "sample2", etc.
up to sampleN). Depending on your sample names and directory structure, this may be easily accomplished with wildcards:
CombineBaseCounts.pl sample*/base_counts.tab >combined.tab
The output is written to a file named "combined.tab".
Finally, we determine the genotype at each position in each sample, based on nucleotide frequencies in that sample and in the population. Our script CallGenotypes.pl includes several options; to view them, run the script with no arguments:
CallGenotypes.pl
A typical run, reading "combined.tab" and writing genotypes to a file named "genotypes.tab":
CallGenotypes.pl -i combined.tab -o genotypes.tab -c 10
Typically we are only interested in polymorphic loci (SNPs). We apply this filter first, since most
loci are monomorphic and excluding them will greatly reduce file sizes. This is accomplished by running
the following script:
PolyFilter.pl -i combined.tab -n 2 -p y -o snps.tab
In this example, we selected all loci at which 2 or more genotypes were observed, writing them to a file called "snps.tab".
The choice of "y" for the print option indicates that we want the script to write the selected loci to
a file. Choosing "n" instead (the default) would only report the number of loci passing the filter, without writing those
genotypes to output.
Note that this script can also be used to filter out loci at which the minor allele is only present in a small number of
individuals (e.g. alleles found in only a single individual). See the -v and -s options for more details.
If a few samples are sequenced at lower depth than the others, they increase the total amount of missing data
with little benefit. This script excludes samples for which too few genotypes were determined. Run the following:
LowcovSampleFilter.pl -i snps.tab -n 1000
In this example, we have chosen to count the number of samples for which at least 1000 loci were genotyped. It's
generally a good idea to first explore a variety of thresholds, to evaluate whether a few samples have
substantially more missing data than the others. For example, if increasing the threshold eliminates a small number
of samples at a low value, after which no further samples are eliminated as the threshold continues to increase, this
indicates that those few samples contribute disproportionately to the overall missing data and should be excluded.
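A simple loop makes this kind of threshold survey easy. The sketch below only prints the commands that would be run (a dry run, since LowcovSampleFilter.pl and snps.tab may not be present on your system); remove the echo to actually run the filter at each threshold:

```shell
# Survey several genotyping-count thresholds (dry run: prints commands only)
for n in 500 1000 1500 2000 2500; do
  echo "LowcovSampleFilter.pl -i snps.tab -n $n"
done > threshold_survey.txt
```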
Once you've determined the threshold that best balances sample numbers and missing data, extract those loci as e.g.:
LowcovSampleFilter.pl -i snps.tab -n 2200 -p y -o samplefiltered.tab
(In this example, we've chosen to write all samples with at least 2200 SNPs genotyped to a file named
"samplefiltered.tab")
While the previous step excluded samples (columns) genotyped in too few loci, this step excludes loci (rows)
genotyped in too few samples to be informative. This is accomplished as:
MDFilter.pl -i samplefiltered.tab -n 40
Again, it's useful to first explore the effects of a variety of thresholds to identify a threshold that achieves
a good balance between the number of loci and the proportion of missing data. Once you've identified a threshold,
run the following:
MDFilter.pl -i samplefiltered.tab -n 45 -p y -o mdfiltered.tab
In this example, we've chosen to write all loci genotyped in at least 45 samples to a file named "mdfiltered.tab".
Repetitive sequences can introduce error into any sequencing-based genotyping method, including 2bRAD. Since
reads cannot be uniquely assigned to one of these loci over the others, this is less of a problem for systems with
a sequenced genome (the unique mapping requirement effectively removes these loci when filtering alignments). For
de novo analysis, or systems with imperfect genome assemblies, it can be useful to explicitly filter out these
sites at this stage.
The independent accumulation of SNPs at multiple loci, and their subsequent mapping back to a single reference sequence,
can be expected to produce sites with unusually high numbers of SNPs. These are excluded to guard against errors
resulting from repetitive sites. This is accomplished as:
RepTagFilter.pl -i mdfiltered.tab -n 2
In this example, we've chosen to count the number of sites bearing no more than 2 SNPs. Once you've determined
which threshold to use, you can write the results to a file as:
RepTagFilter.pl -i mdfiltered.tab -n 2 -p y -o nr.tab
Since multiple SNPs on a single tag (AlfI restriction site) are unlikely to be separated by recombination,
they can usually be expected to segregate as a single locus. For many analyses it's appropriate to remove these
redundant SNPs prior to analysis. This can be accomplished as:
OneSNPPerTag.pl -i nr.tab -p y -o selected_snps.tab
For any sites having multiple SNPs, this script selects the SNP with the least missing data and writes the output
to a file; in this example, named "selected_snps.tab".
This output constitutes the endpoint of the standard 2bRAD analysis pipeline. The next steps depend on the study
and the biological question, and are up to the end user. This output file is simple, tab-delimited text that can
be easily read into commonly used software, e.g. R or Microsoft Excel.
Last updated 14 July 2018, E. Meyer.