Sequential trimming and mapping of short reads
Antonio Marco (c) 2011, 2012
What is SeqTrimMap?
SeqTrimMap is a bash script that maps reads from a fasta file into a reference genome using the sequential trimming and mapping strategy developed by Marco and Griffiths-Jones (2011). The actual mapping is done with Bowtie (Langmead et al. 2009), and you should have it installed in your system.
What do you need?
- Bowtie program (version 0.12), available at http://bowtie-bio.sourceforge.net. Please note that version 2 of Bowtie does not map in color space, hence, should not be used with this pipeline
- The bash script here described ‘SeqTrimMap’
- A UNIX-based operating system
- A reference genome file in fasta format
- A sequence reads file in fasta format, either in base-space (Illumina) or in color-space (SOLiD)
- A fasta file with sequences to filter out
Adding SetTrimMap into your PATH
PATH=/folder/with/seqtrimmap/script:$PATH export PATH
Before you start
Open the SeqTrimMap script with a text editor and modify the line:
so that it includes the full path to the Bowtie folder. For example:
Running SeqTrimMap the easy way
The basic command of SeqTrimMap is:
SeqTrimMap -o <OUTPUT_FILE_NAME> <INPUT_READS_FILE_CSFASTA> <REFERENCE_GENOME>
The program will map now the reads from <INPUT_READS_FILE_CSFASTA>, in color-space (from a SOLiD machine) to the reference genome, and print output the results in the file <OUTPUT_FILE_NAME>. If your input files are in base-space (e.g. Illumina), use the tag ‘-b’.
Filtering unwanted sequences
Quite often it is useful to discard sequences mapping to potential contaminants. For instance, while detecting microRNAs is common to remove reads mapping to tRNAs or rRNAs. You can provide a fasta file with sequences to be filtered out using the ‘-f’ tag:
SeqTrimMap -f <FILTER_FASTA> -o <OUTPUT_FILE_NAME> <INPUT_READS_FILE_CSFASTA> <REFERENCE_GENOME>
Getting a small RNA set
To detect potential microRNAs, only a subset of the mapped reads are of interest. You can add the tag ‘-s’ to create a file with the small RNA subset, and using the tags ‘-u’ and ‘-t’ define the size range:
SeqTrimMap -s -u 20 -t 26 -f <FILTER_FASTA> -o <OUTPUT_FILE_NAME> <INPUT_READS_FILE_CSFASTA> <REFERENCE_GENOME>
If you already indexed the genome in Bowtie format, you can save time by adding the tag ‘-i’, so the programs looks for an already indexed genome. For microRNA detection we often use the following set of parameters:
SeqTrimMap -zis -m 5 -f <FILTER_FASTA> -o <OUTPUT_FILE_NAME> <INPUT_READS_FILE_CSFASTA> <REFERENCE_GENOME>
where ‘-m 5′ indicates that reads mapped to more than 5 loci are saved apart (and not included in the final output file).
Save computational time when detecting microRNAs
Concerned that the program in too slow? There is a simple way to save computational time. By default, the program maps by trimming from the length of the read to a minimum of 20 color/nucleotides. However, the first trimming step can be done at a shorter length with the tag ‘-x’. Please note that this length should be a few units larger than the maximum expected length of a microRNA (that is, tag ‘-t’). On the other hand, the minimum length by default (tag ‘-l’), 20, can be changed to the minimum expected length of a short RNA (tag ‘-u’). For instance, piRNAs are slightly longer than microRNAs. In the following example, the script will do the trimming from length 28 to a minimum of 21:
SeqTrimMap -s -u 21 -t 26 -l 21 -x 28 -f <FILTER_FASTA> -o <OUTPUT_FILE_NAME> <INPUT_READS_FILE_CSFASTA> <REFERENCE_GENOME>
However, you must know that the number of reads mapped may be slightly different to the comprehensive trimming. We still recommend to start the trimming from the length of the read, and this heuristic approach should be used only when computational time is limited.
Dealing with the first color of the reads
The first color from SOLiD reads is problematic since it is also defined by the linker sequence. In the paper associated to this script (Marco and Griffiths-Jones 2012) we described why we should keep this first color. By default this pipeline removes the first character of each read, as a result the first color is kept (Ben Langmead, personal communication). However, Bowtie also offers a native option to deal with first colors. The option ‘–col-keepends’ allow the program to keep the first and last color of the read. We recommend to run our pipeline with the defaults settings. However, you can force the program to use ‘–col-keepends’ Bowtie option by adding the flag ‘-k’.
Note on the expected number of false positives
In our original paper we estimated the proportion of expected false positives by using a Poisson probability. However, we made two mistakes in our approach, and I thank Aaron Webber for pointing them out. First, there is a typo in the formula we used: a (1/i)! is missing after the summation symbol. Second, and more important, we did not account in the false positives estimation for the two color mismatches we allowed in our analyses. In that case a proper false positive estimation should account for overlapping between words and genome sequence content. This problem has been solved for expected mappabilities of a given sequence (e.g. a given transcription factor binding site), but (as far as I am aware) not for a generic sequence of a given length (like NNNNNNNNNNNNNNNNNNNNN) or at least its computation is not straightforward. In the particular case of microRNAs this is the least of the problems, since further filtering steps will clear the dataset from spurious sequences, as our results in the original paper demonstrate. Long story short, you don’t have to worry about it while detecting microRNAs.
Output files are in ‘bowtie’ format, as described at:
You can get the output alignments in SAM format as well by using the tag ‘-S':
If you prefer your output alignments to be in color-space, use the tag ‘-c’
Full list of program options
-z Gzip output files (Default=NO) -b Input file in in base-space (Default=NO) -S Output file in SAM format (Default: Bowtie format) -c Output sequence in color-space (Default: decoded nucleotides) -i Reference genome is already in bowtie index format (Default=NO) -k Use --col-keepends native Bowtie option -p Number of processors (Default=1) -m Save apart reads mapping to more than INT positions (Default=NO) -v Mismatches allowed between the read and the reference genome (Default=2) -o <FILE> Base name for output files (Default='mapped_reads') -f <FILE> Filter reads mapping to the specified fasta file (Default=NO) -l INT Minimum read length after trimming to be mapped (Default=19) -x INT Maximum read length before trimming (Default='length of the read') -s Create output file with reads mapped from length -u to -t' (Default=NO) -u INT see -s option (Default=20) -t INT see -s option (Default=26)
Sequence reads were mapped with Bowtie (Langmead et al. 2009), using the sequential trimming strategy described in Marco and Griffiths-Jones (2012).
Copyright 2011, 2012, Antonio Marco
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 of the License.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
A copy of the GNU General Public License in included along with this program.
For any question or suggestion, please contact me at email@example.com