This page outlines our standard pipeline for analysis of gene expression using Tag-Seq,
a cost-effective and accurate approach. This guide covers the most recently updated version of the Tag-Seq utilities pipeline as
of July 2018 (v2.0).
These tools are intended for use on a high-performance computing cluster, and this guide assumes a basic knowledge of Linux and command-line
tools. To analyze large numbers of samples in parallel, the user will probably want to (a) combine some of the
following steps into shell scripts, and (b) submit each job to a job scheduler such as SGE. Since the details of
those steps vary from one cluster to another, we have left them up to the end user.
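As an illustration only, submitting one job per sample to SGE might look something like the following, where the wrapper script (filter_one_sample.sh here, a hypothetical name) would contain the per-sample commands described in the steps below; adjust the qsub options for your cluster:
for fq in *.fastq; do
  qsub -cwd -N "tagseq_${fq%.fastq}" filter_one_sample.sh "$fq"   # one job per input file
done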
Our scripts rely on BioPerl modules. First, check whether you need to install BioPerl.
perl -MBio::SeqIO -e 0
If you get no feedback, BioPerl is available on your system. If you get an error message like the following:
Can't locate Bio/SeqIO.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.14.2 ...
this indicates that you need to install BioPerl.
To install BioPerl, go to the BioPerl Wiki and follow the instructions given there.
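One common approach, assuming the cpanminus tool is available and you have permission to install Perl modules (on a shared cluster you may instead need to ask your administrator), is:
cpanm Bio::SeqIO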
Our scripts for Tag-Seq analysis are hosted on GitHub. You'll want to download scripts from this repository:
git clone git://github.com/Eli-Meyer/TagSeq_utilities.git
Make sure you have execute permission to these scripts, and that they are in your path ($PATH).
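For example, assuming the repository was cloned into your home directory (adjust the path to match your system):
chmod +x ~/TagSeq_utilities/*.pl
export PATH=$PATH:~/TagSeq_utilities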
To check whether bbduk is installed on your system, run:
which bbduk.sh
If you get no feedback, this indicates you need to install bbduk. Go to this
link
and follow the instructions to unpack this collection of software, which requires that Java version 6 or higher be installed on your system.
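Once the archive is downloaded, unpacking it and adding the unpacked directory (typically named bbmap) to your PATH might look something like this; the archive name varies by release, so adjust the paths for your system:
tar -xzf BBMap_<version>.tar.gz
export PATH=$PATH:$PWD/bbmap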
If you get feedback similar to this:
/local/cluster/bbmap/bbduk.sh
This indicates bbduk is already installed and in your path. You can move on to the next step.
To check whether SHRiMP is installed on your system, run:
which gmapper
If you get no feedback, this indicates you need to install SHRiMP. Go to this
link
and download and install SHRiMP according to the directions on that site.
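For example, if you download one of the precompiled Linux releases, unpacking it and adding its bin directory to your PATH might look something like this (archive and directory names vary by release; adjust for your system):
tar -xzf SHRiMP_<version>.tar.gz
export PATH=$PATH:$PWD/SHRiMP_<version>/bin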
If you get feedback similar to this:
/local/cluster/SHRiMP/bin/gmapper
This indicates SHRiMP is already installed and in your path. You can move on to the next step.
First, we exclude low-quality reads likely to contain sequencing errors. The choice of thresholds is
somewhat arbitrary and ultimately up to the user. For an example using reasonable default settings, run:
QualFilterFastq.pl -i input.fastq -m 20 -x 10 -o output.fastq
to remove any reads from "input.fastq" having more than 10 bases with quality scores lower than 20, and write
the output to a file named "output.fastq".
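Because the thresholds are up to you, it can be useful to check how many reads survive filtering. Each record in an uncompressed FASTQ file occupies four lines, so a quick count of the reads retained is:
echo $(( $(wc -l < output.fastq) / 4 ))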
Some sequence datasets include lots of homopolymer repeats (long stretches of a single nucleotide, e.g. poly-A tails).
Generally these are a challenge for transcriptome assembly, less of one for gene expression analysis (RNA-Seq) and
genome assembly, and not a problem at all for SNP genotyping. To exclude all reads with homopolymer repeats more than 30 bases in length,
we can run:
HRFilterFastq.pl -i input.fastq -n 30 -o output.fastq
Next, we discard reads containing sequences derived from adaptors used during library preparation. These
artificial sequences may occupy unknown portions of the read
(including the entire read). Removing them is probably the most important task in read processing, and certainly the most
computationally intensive. To make the whole pipeline easier to run in parallel, we've recently
switched from the old version (AdaptorFilterFastq.pl) to a newer kmer-based tool
called BBDUK.
It has superior performance to the old process in every respect and, most importantly, can be run
directly on large sequence datasets without lots of tedious parallelizing.
To run this tool on a set of reads called input.fastq, eliminating any reads that share at least
one 12-mer with sequences in adaptors.fasta, we can run:
bbduk.sh in=input.fastq ref=adaptors.fasta k=12 stats=stats.txt out=output.fastq overwrite=t
This removes any reads in "input.fastq" having valid alignments (at least one 12-bp kmer match)
to sequences in "adaptors.fasta", and writes the output (sequences passing the filter) to "output.fastq".
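The adaptors file is simply a FASTA file listing the adaptor and primer sequences used in your library preparation. A minimal illustrative example of the format is shown below; the sequence shown is the start of a common Illumina adapter and is included only to illustrate the format, so substitute the sequences from your own protocol:
>adapter_example
AGATCGGAAGAGC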
In this optional step, the user can choose to discard PCR duplicates based on the sequence of the insert and the sequence
of an internal barcode incorporated during FS-cDNA synthesis. To remove duplicates from a library having barcodes at positions 1-4
and insert sequences beginning at position 9, the user would run:
RemovePCRDups.pl -i input.fastq -o output.fastq -s 1 -e 4 -j 9
Finally, we remove the non-template bases introduced at the 5' end of each tag during library preparation.
The region removed begins at the 5' end by default (position 1), and extends to the end of the 5'-most occurrence
of GGG (up to a maximum position defined by the user). This range is determined from the position of the GGG motif in
RNA oligos used during library preparation. To trim a set of reads where the GGG motif is expected to occur in the first 8 bases:
TagTrimmer.pl -i input.fastq -b 1 -e 8 -o output.fastq
The ideal reference for mapping Tag-Seq reads is a complete and non-redundant set of transcripts from the same species as the reads. De novo transcriptome assemblies and transcripts identified from genome assemblies are often used for this purpose.
We use the SHRiMP software package for mapping; in our testing it has been faster and more sensitive than
any other mapping software we have evaluated. However, in principle other mapping programs can be substituted
at this step, as long as the output is in SAM format and includes positive alignment scores in the "AS:" field.
To use SHRiMP, run the gmapper tool:
gmapper --qv-offset 33 -Q --strata -o 3 -N 1 reads.fastq reference.fasta >gmapper.sam
In this example, a set of trimmed and filtered reads ("reads.fastq") is mapped against a reference sequence ("reference.fasta")
using a single thread (processor). The raw alignments are written to a file called "gmapper.sam".
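If more processors are available to your job, the same command can use them by increasing the value of -N; for example, to run with 8 threads:
gmapper --qv-offset 33 -Q --strata -o 3 -N 8 reads.fastq reference.fasta >gmapper.sam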
These raw alignments are then filtered, and the alignments passing the filter are written to "filtered.sam":
SAMFilter.pl -i gmapper.sam -m 40 -o filtered.sam -c 1
Finally, we combine the counts data from all samples into a single output file, with samples as columns and genes as rows.
To combine data from three specific samples, you could run:
CombineExpression.pl sample1/counts.tab sample2/counts.tab sample3/counts.tab > combined.tab
Alternatively, to combine the counts files from all samples at once:
CombineExpression.pl */counts.tab > combined.tab
The combined file produced by this step is ready for statistical analysis in DESeq2, edgeR, or other R packages for
analysis of differential gene expression.
Last updated 14 July 2018, E. Meyer.