This page outlines our standard pipeline for analysis of gene expression using Tag-Seq,
a cost-effective and accurate approach. This guide covers the most recently updated version of the Tag-Seq utilities pipeline as
of July 2018 (v2.0).
These tools are intended for use on a high-performance computing cluster, and this guide assumes a basic knowledge of Linux and command-line
tools. To analyze large numbers of samples in parallel, the user will probably want to (a) combine some of the
following steps into shell scripts, and (b) submit each job to a job scheduler such as SGE. Since the details of
those steps vary from one cluster to another, we have left them up to the end user.
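As an illustration only, submitting one job per sample to SGE might look something like the following, where the wrapper script (filter_one_sample.sh here, a hypothetical name) would contain the per-sample commands described in the steps below; adjust the qsub options for your cluster:
for fq in *.fastq; do
  qsub -cwd -N "tagseq_${fq%.fastq}" filter_one_sample.sh "$fq"   # one job per input file
done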
Our scripts rely on BioPerl modules. First, check whether you need to install BioPerl.
perl -MBio::SeqIO -e 0
If you get no feedback, BioPerl is available on your system. If you get an error message like the following:
Can't locate Bio/SeqIO.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.14.2 ...
this indicates that you need to install BioPerl.
To install BioPerl, go to the BioPerl Wiki and follow the instructions given there.
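One common approach, assuming the cpanminus tool is available and you have permission to install Perl modules (on a shared cluster you may instead need to ask your administrator), is:
cpanm Bio::SeqIO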
Our scripts for Tag-Seq analysis are hosted on GitHub. You'll want to download scripts from this repository:
git clone git://github.com/Eli-Meyer/TagSeq_utilities.git
Make sure you have execute permission to these scripts, and that they are in your path ($PATH).
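For example, assuming the repository was cloned into your home directory (adjust the path to match your system):
chmod +x ~/TagSeq_utilities/*.pl
export PATH=$PATH:~/TagSeq_utilities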
To check whether bbduk is installed on your system, run:
which bbduk.sh
If you get no feedback, this indicates you need to install bbduk. Go to this
link
and follow the instructions to unpack this collection of software, which requires that Java version 6 or higher be installed on your system.
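Once the archive is downloaded, unpacking it and adding the unpacked directory (typically named bbmap) to your PATH might look something like this; the archive name varies by release, so adjust the paths for your system:
tar -xzf BBMap_<version>.tar.gz
export PATH=$PATH:$PWD/bbmap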
If you get feedback similar to this:
/local/cluster/bbmap/bbduk.sh
This indicates bbduk is already installed and in your path. You can move on to the next step.
To check whether SHRiMP is installed on your system, run:
which gmapper
If you get no feedback, this indicates you need to install SHRiMP. Go to this
link
and download and install SHRiMP according to the directions on that site.
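For example, if you download one of the precompiled Linux releases, unpacking it and adding its bin directory to your PATH might look something like this (archive and directory names vary by release; adjust for your system):
tar -xzf SHRiMP_<version>.tar.gz
export PATH=$PATH:$PWD/SHRiMP_<version>/bin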
If you get feedback similar to this:
/local/cluster/SHRiMP/bin/gmapper
This indicates SHRiMP is already installed and in your path. You can move on to the next step.
First, we exclude low-quality reads likely to contain sequencing errors. The choice of thresholds is
somewhat arbitrary and ultimately up to the user. For an example using reasonable default settings, run:
QualFilterFastq.pl -i input.fastq -m 20 -x 10 -o output.fastq
to remove any reads from "input.fastq" having more than 10 bases with quality scores lower than 20, and write
the output to a file named "output.fastq".
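Because the thresholds are up to you, it can be useful to check how many reads survive filtering. Each record in an uncompressed FASTQ file occupies four lines, so a quick count of the reads retained is:
echo $(( $(wc -l < output.fastq) / 4 ))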
Some sequence datasets include lots of homopolymer repeats (long stretches of a single nucleotide, e.g. poly-A tails).
Generally these are a challenge for transcriptome assembly, less of one for gene expression analysis (RNA-Seq) and
genome assembly, and not a problem at all for SNP genotyping. To exclude all reads with homopolymer repeats more than 30 bases in length,
we can run:
HRFilterFastq.pl -i input.fastq -n 30 -o output.fastq
Next, we discard reads containing sequences derived from adaptors used during library preparation. These
artificial sequences may occupy unknown portions of the read
(including the entire read). Removing them is probably the most important task in read processing, and certainly the most
computationally intensive. To make the whole pipeline easier to run in parallel, we've recently
switched from the old version (AdaptorFilterFastq.pl) to a newer kmer-based tool
called BBDUK.
It has superior performance to the old process in every respect and, most importantly, can be run
directly on large sequence datasets without lots of tedious parallelizing.
To run this tool on a set of reads called input.fastq, eliminating any reads that share at least
one 12-mer with sequences in adaptors.fasta, we can run:
bbduk.sh in=input.fastq ref=adaptors.fasta k=12 stats=stats.txt out=output.fastq overwrite=t
This removes any reads in "input.fastq" having valid alignments (at least one 12-bp kmer match)
to sequences in "adaptors.fasta", and writes the output (sequences passing the filter) to "output.fastq".
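The adaptors file is simply a FASTA file listing the adaptor and primer sequences used in your library preparation. A minimal illustrative example of the format is shown below; the sequence shown is the start of a common Illumina adapter and is included only to illustrate the format, so substitute the sequences from your own protocol:
>adapter_example
AGATCGGAAGAGC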
In this optional step, the user can choose to discard PCR duplicates based on the sequence of the insert and the sequence
of an internal barcode incorporated during FS-cDNA synthesis. To remove duplicates from a library having barcodes at positions 1-4
and insert sequences beginning at position 9, the user would run:
RemovePCRDups.pl -i input.fastq -o output.fastq -s 1 -e 4 -j 9
Finally, we remove the non-template bases introduced at the 5' end of each tag during library preparation.
The region removed begins at the 5' end by default (position 1), and extends to the end of the 5'-most occurrence
of GGG (up to a maximum position defined by the user). This range is determined from the position of the GGG motif in
RNA oligos used during library preparation. To trim a set of reads where the GGG motif is expected to occur in the first 8 bases:
TagTrimmer.pl -i input.fastq -b 1 -e 8 -o output.fastq
The ideal reference for mapping Tag-Seq reads is a complete and non-redundant set of transcripts from the same species as the reads. De novo transcriptome assemblies and transcripts identified from genome assemblies are often used for this purpose.
We use the SHRiMP software package for mapping; in our testing it has been faster and more sensitive than
any other mapping software we have evaluated. However, in principle other mapping programs can be substituted
at this step, as long as the output is in SAM format and includes positive alignment scores in the "AS:" field.
To use SHRiMP, run the gmapper tool:
gmapper --qv-offset 33 -Q --strata -o 3 -N 1 reads.fastq reference.fasta >gmapper.sam
In this example, a set of trimmed and filtered reads ("reads.fastq") is mapped against a reference sequence ("reference.fasta")
using a single thread (processor). The raw alignments are written to a file called "gmapper.sam".
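If more processors are available to your job, the same command can use them by increasing the value of -N; for example, to run with 8 threads:
gmapper --qv-offset 33 -Q --strata -o 3 -N 8 reads.fastq reference.fasta >gmapper.sam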
These raw alignments are then filtered, and the alignments passing the filter are written to "filtered.sam":
SAMFilter.pl -i gmapper.sam -m 40 -o filtered.sam -c 1
Finally, we combine the counts data from all samples into a single output file, with samples as columns and genes as rows.
To combine data from three specific samples, you could run:
CombineExpression.pl sample1/counts.tab sample2/counts.tab sample3/counts.tab > combined.tab
Alternatively, to combine the counts files from all samples at once:
CombineExpression.pl */counts.tab > combined.tab
The combined file produced by this step is ready for statistical analysis in DESeq2, edgeR, or other R packages for
analysis of differential gene expression.
Last updated 14 July 2018, E. Meyer.