This page outlines our standard pipeline for analysis of genetic variation using 2bRAD, a cost-effective and flexible
sequencing-based approach for SNP genotyping. This guide covers the most recently updated version of the 2bRAD utilities pipeline as of July 2018 (v3.0).
These tools are intended to be used on a high-performance computing cluster, and assume a basic knowledge of Linux and command-line
tools. To analyze large numbers of samples in parallel, the user will probably want to (a) combine some of the
following steps into shell scripts, and (b) submit each job to a job scheduler such as SGE. Since the details of
those steps may vary from one cluster to another, we have left those details up to the end user.
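As one illustration of that approach, the sketch below writes one job script per sample, ready for submission with a scheduler command such as SGE's qsub. The sample names and the processing command inside the script are placeholders for your own; adjust for your cluster.

```shell
# Stand-in inputs so the sketch is self-contained; in practice these are
# your demultiplexed per-sample FASTQ files.
touch sampleA.fastq sampleB.fastq

# Write one job script per sample; the command inside the heredoc is a
# placeholder for the processing steps described below.
for fq in sample*.fastq; do
  name="${fq%.fastq}"
  cat > "job_${name}.sh" <<EOF
#!/bin/bash
TruncateFastq.pl -i $fq -s 1 -e 36 -o ${name}_trunc.fastq
EOF
  # qsub "job_${name}.sh"    # uncomment to submit on an SGE cluster
done
```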
Our scripts rely on BioPerl modules. First, check whether you need to install BioPerl.
perl -MBio::SeqIO -e 0
If you get no feedback, BioPerl is available on your system. If instead you get an error message like the following:
Can't locate Bio/SeqIO.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.14.2 ...
this indicates that you need to install BioPerl.
To install BioPerl, go to the BioPerl Wiki and follow the instructions given there.
Our scripts for 2bRAD analysis are hosted on GitHub. You'll want to download scripts from this repository:
git clone git://github.com/Eli-Meyer/2bRAD_utilities.git
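After cloning, a typical way to grant execute permission and add the scripts to your PATH is sketched below (the stand-in directory and script are created here only so the example is self-contained; normally git clone creates them):

```shell
# Stand-in for the cloned repository, so this sketch runs on its own;
# normally "git clone" creates this directory and its scripts.
mkdir -p 2bRAD_utilities
touch 2bRAD_utilities/TruncateFastq.pl

chmod +x 2bRAD_utilities/*.pl             # grant execute permission
export PATH="$PATH:$PWD/2bRAD_utilities"  # add the scripts to your PATH

command -v TruncateFastq.pl               # should now print the script's path
```

Note that `export PATH=...` only affects the current shell session; add the line to your shell startup file to make it permanent.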
Make sure you have execute permission for these scripts, and that they are in your path ($PATH).

To check whether bbduk is installed on your system, run:
which bbduk.sh
If you get no feedback, this indicates you need to install bbduk. Go to this
link
and follow the instructions to unpack this collection of software, which requires that Java version 6 or higher be installed on your system.
If you get feedback similar to this:
/local/cluster/bbmap/bbduk.sh
This indicates bbduk is already installed and in your path. You can move on to the next step.
To check whether SHRiMP is installed on your system, run:
which gmapper
If you get no feedback, this indicates you need to install SHRiMP. Go to this
link
and download and install SHRiMP according to the directions on that site.
If you get feedback similar to this:
/local/cluster/SHRiMP/bin/gmapper
This indicates SHRiMP is already installed and in your path. You can move on to the next step.
To check whether cd-hit-est is installed on your system, run:
which cd-hit-est
If you get feedback similar to this:
/local/cluster/cd-hit/cd-hit-est
cd-hit-est is already installed and in your path; otherwise, download and install CD-HIT before continuing.
To check whether RAxML is installed on your system, run:
which raxmlHPC
If you get feedback similar to this:
/local/cluster/RAxML/bin/raxmlHPC
RAxML is already installed and in your path; otherwise, download and install RAxML before continuing.
To check whether the genotyping tools (BGC, GFE, and HGC) are installed on your system, run:
which BGC
which GFE
which HGC
If you get feedback similar to this:
/local/cluster/BGC
each tool is already installed and in your path. If not, compile each tool from its C++ source, e.g.:
g++ HGC.cpp -o HGC -lm
Since AlfI produces uniform 36-bp restriction fragments, we first truncate sequences to keep only
the inserts derived from these fragments. This can be accomplished by running:
TruncateFastq.pl -i input.fastq -s 1 -e 36 -o output.fastq
to read "input.fastq", trim away the sequences after 36 bp, and write the output to a file called "output.fastq".
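For illustration only (not part of the pipeline), the same fixed-length truncation can be sketched with awk, assuming a strictly four-line-per-record FASTQ file:

```shell
# Build a tiny demonstration FASTQ containing one 50-bp read
seq50=$(printf 'A%.0s' $(seq 1 50))    # 50-base sequence
qual50=$(printf 'I%.0s' $(seq 1 50))   # matching 50-character quality string
printf '@read1\n%s\n+\n%s\n' "$seq50" "$qual50" > input.fastq

# Keep header and '+' lines (odd lines) as-is;
# trim sequence and quality lines (even lines) to 36 bp
awk 'NR % 2 == 1 { print; next } { print substr($0, 1, 36) }' input.fastq > output.fastq
```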
To read instructions for any of the scripts described here, run the script with no arguments, e.g.:
TruncateFastq.pl
Next, we exclude low quality reads that might introduce errors in genotyping. The choice of thresholds is
somewhat arbitrary and ultimately up to the user. For an example using reasonable default settings, run:
QualFilterFastq.pl -i input.fastq -m 20 -x 5 -o output.fastq
to remove any reads from "input.fastq" having more than 5 bases with quality scores lower than 20, and write
the output to a file named "output.fastq".
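For illustration, the filtering logic can also be sketched in awk (again, an alternative illustration rather than the pipeline script; it assumes Phred+33 quality encoding, in which a quality of 20 corresponds to the ASCII character '5', so bases below Q20 are encoded as '!' through '4'):

```shell
# Two demonstration reads: read1 is high quality throughout;
# read2 has six bases below Q20 ('#' = Q2) and should be removed.
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nACGTACGTAC\n+\nII######II\n' > input.fastq

# Count bases below Q20 (ASCII '!' through '4') on each quality line;
# print the 4-line record only if at most 5 such bases are found.
awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{q=$0; n=gsub(/[!-4]/,"",q);
             if (n<=5) printf "%s\n%s\n%s\n%s\n", h, s, p, $0}' input.fastq > output.fastq
```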
Finally, adaptor removal. The artificial DNA sequences introduced during
library preparation may occupy unknown portions of the read (including the entire read). Removing
these is probably the most important task in read processing, and certainly the most
computationally intensive. To make the whole pipeline easier to run in parallel, we've recently
switched from our old script (AdaptorFilterFastq.pl) to a newer kmer-based tool
called BBDuk.
BBDuk outperforms the old process in all respects and, most importantly, can be run
directly on large sequence datasets without tedious parallelizing.
To run this tool on a set of reads called input.fastq, to eliminate any reads sharing at least
one 12-mer with sequences in adaptors.fasta, we can run:
bbduk.sh in=input.fastq ref=adaptors.fasta k=12 stats=stats.txt out=output.fastq
to remove any reads in "input.fastq" having valid alignments (at least one 12-bp kmer match)
to sequences in "adaptors.fasta", and write the output (sequences passing the filter) to "output.fastq".
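The three read-processing steps above can be chained into a single per-sample wrapper script. A minimal sketch is shown below; the derived file names and the adaptors.fasta path are placeholders, and the wrapper assumes the pipeline scripts and bbduk.sh are in your PATH:

```shell
# Write a reusable wrapper that runs truncation, quality filtering,
# and adaptor removal for one sample.
cat > process_sample.sh <<'EOF'
#!/bin/bash
# Usage: ./process_sample.sh input.fastq
set -e
in="$1"
base="${in%.fastq}"
TruncateFastq.pl -i "$in" -s 1 -e 36 -o "${base}_trunc.fastq"
QualFilterFastq.pl -i "${base}_trunc.fastq" -m 20 -x 5 -o "${base}_hq.fastq"
bbduk.sh in="${base}_hq.fastq" ref=adaptors.fasta k=12 \
         stats="${base}_stats.txt" out="${base}_clean.fastq"
EOF
chmod +x process_sample.sh
```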
If a genome assembly is available for your species, you can extract AlfI recognition sites from the assembly for use as a reference, as:
ExtractSites.pl -i assembly.fasta -e AlfI -o output.fasta
For de novo analysis (no genome available), first combine an equal subset of reads (here, the first 4,000,000 lines, i.e. 1 million reads) from each sample's processed file:
head -n 4000000 file1.fastq >> combined.fastq
head -n 4000000 file2.fastq >> combined.fastq
...
head -n 4000000 fileN.fastq >> combined.fastq
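Rather than listing each file by hand, the same subsampling can be written as a loop (the file*.fastq pattern below is a placeholder for your own file layout):

```shell
# Stand-in inputs so the sketch is self-contained (one short read each)
printf '@a\nAC\n+\nII\n' > file1.fastq
printf '@b\nGT\n+\nII\n' > file2.fastq

rm -f combined.fastq
for fq in file*.fastq; do
  head -n 4000000 "$fq" >> combined.fastq   # at most 1 million reads per file
done
```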
Next, construct a de novo reference by clustering these reads, using the BuildRef.pl script, as:
BuildRef.pl -i combined.fastq -o reference.fasta
We recommend the SHRiMP software package, which is both faster and more sensitive in our testing than
any other mapping software currently available. However, in principle another mapping program can be substituted
at this step, as long as the output is in SAM format and includes positive alignment scores in the "AS:" field.
To use SHRiMP, run the gmapper tool, as:
gmapper --qv-offset 33 -Q --strata -o 3 -N 1 reads.fastq reference.fasta >gmapper.sam
In this example, a set of trimmed and filtered reads ("reads.fastq") is mapped against a reference sequence ("reference.fasta")
using a single thread (processor). The raw alignments are written to a file called "gmapper.sam".
Next, filter the raw alignments before genotyping, using the SAMFilter.pl script, as:
SAMFilter.pl -i gmapper.sam -m 32 -o filtered.sam
In this example, alignments in "gmapper.sam" are filtered at a threshold of 32 and written to "filtered.sam" (run SAMFilter.pl with no arguments for a description of each option).
First, we parse the alignments to record the nucleotides observed at each position in each reference sequence.
This can be accomplished using the SAMBaseCounts.pl script, as:
SAMBaseCounts.pl -i filtered.sam -r reference.fasta -c 5 -o base_counts.tab
In this example, the reference is "reference.fasta", the alignments to be parsed are in "filtered.sam", and we've chosen a minimum coverage of 5 (i.e. loci covered by < 5 reads will be ignored). The output is written to a tab-delimited text file called "base_counts.tab".
Because some of the genotyping methods require population information, we combine the nucleotide counts from all samples prior to
determining genotypes. This can be accomplished using the script CombineBaseCounts.pl, as:
CombineBaseCounts.pl sample1/base_counts.tab sample2/base_counts.tab ... sampleN/base_counts.tab >combined.tab
In this example, we are combining files called "base_counts.tab", for each sample ("sample1", "sample2", etc.
up to sampleN). Depending on your sample names and directory structure, this may be easily accomplished with wildcards:
CombineBaseCounts.pl sample*/base_counts.tab >combined.tab
The output is written to a file named "combined.tab".
Finally, we determine the genotype at each position in each sample, based on nucleotide frequencies in that sample and in the population. Our script CallGenotypes.pl includes several options; to view them, run the script with no arguments:
CallGenotypes.pl
A typical run, reading "combined.tab" and writing genotypes to a file named "genotypes.tab":
CallGenotypes.pl -i combined.tab -o genotypes.tab -c 10
Typically we are only interested in polymorphic loci (SNPs). We apply this filter first, since most
loci are monomorphic and excluding them will greatly reduce file sizes. This is accomplished by running
the following script:
PolyFilter.pl -i combined.tab -n 2 -p y -o snps.tab
In this example, we selected all loci at which 2 or more genotypes were observed, writing them to a file called "snps.tab".
The choice of "y" for the print option indicates that we want the script to write the selected loci to
a file. Choosing "n" instead (the default) would only report the number of loci passing the filter, without writing those
genotypes to output.
Note that this script can also be used to filter out loci at which the minor allele is only present in a small number of
individuals (e.g. alleles found in only a single individual). See the -v and -s options for more details.
If a few samples are sequenced at lower depth than the others, they increase the total amount of missing data
with little benefit. This script excludes samples for which too few genotypes were determined. Run the following:
LowcovSampleFilter.pl -i snps.tab -n 1000
In this example, we have chosen to count the number of samples for which at least 1000 loci were genotyped. It's
generally a good idea to first explore a variety of thresholds, to evaluate whether a few samples have
substantially more missing data than the others. For example, if increasing the threshold eliminates a small number
of samples at a low value, after which no further samples are eliminated as the threshold continues to increase, this
indicates that those few samples contribute disproportionately to the overall missing data and should be excluded.
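A simple loop makes this kind of threshold survey easy. The sketch below only prints the commands that would be run (a dry run, since LowcovSampleFilter.pl and snps.tab may not be present on your system); remove the echo to actually run the filter at each threshold:

```shell
# Survey several genotyping-count thresholds (dry run: prints commands only)
for n in 500 1000 1500 2000 2500; do
  echo "LowcovSampleFilter.pl -i snps.tab -n $n"
done > threshold_survey.txt
```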
Once you've determined the threshold that best balances sample numbers and missing data, extract those loci as e.g.:
LowcovSampleFilter.pl -i snps.tab -n 2200 -p y -o samplefiltered.tab
(In this example, we've chosen to write all samples with at least 2200 SNPs genotyped to a file named
"samplefiltered.tab")
While the previous step excluded samples (columns) genotyped in too few loci, this step excludes loci (rows)
genotyped in too few samples to be informative. This is accomplished as:
MDFilter.pl -i samplefiltered.tab -n 40
Again, it's useful to first explore the effects of a variety of thresholds to identify a threshold that achieves
a good balance between the number of loci and the proportion of missing data. Once you've identified a threshold,
run the following:
MDFilter.pl -i samplefiltered.tab -n 45 -p y -o mdfiltered.tab
In this example, we've chosen to write all loci genotyped in at least 45 samples to a file named "mdfiltered.tab".
Repetitive sequences can introduce error into any sequencing-based genotyping method, including 2bRAD. Since
reads cannot be uniquely assigned to one of these loci over the others, this is less of a problem for systems with
a sequenced genome (the unique mapping requirement effectively removes these loci when filtering alignments). For
de novo analysis, or systems with imperfect genome assemblies, it can be useful to explicitly filter out these
sites at this stage.
The independent accumulation of SNPs at multiple loci, and their subsequent mapping back to a single reference sequence,
can be expected to produce sites with unusually high numbers of SNPs. These are excluded to guard against errors
resulting from repetitive sites. This is accomplished as:
RepTagFilter.pl -i mdfiltered.tab -n 2
In this example, we've chosen to count the number of sites bearing no more than 2 SNPs. Once you've determined
which threshold to use, you can write the results to a file as:
RepTagFilter.pl -i mdfiltered.tab -n 2 -p y -o nr.tab
Since multiple SNPs on a single tag (AlfI restriction site) are unlikely to be separated by recombination,
they can usually be expected to segregate as a single locus. For many analyses it's appropriate to remove these
redundant SNPs prior to analysis. This can be accomplished as:
OneSNPPerTag.pl -i nr.tab -p y -o selected_snps.tab
For any sites having multiple SNPs, this script selects the SNP with the least missing data and writes the output
to a file; in this example, named "selected_snps.tab".
This output constitutes the endpoint of the standard 2bRAD analysis pipeline. The next steps depend on the study
and the biological question, and are up to the end user. This output file is simple, tab-delimited text that can
be easily read into commonly used software, e.g. R or Microsoft Excel.
Last updated 14 July 2018, E. Meyer.