L_RNA_scaffolder | SSPACE (2) | SOPRA | MIP scaffolder | Opera | SOAPdenovo | scarpa

Manual Reference Pages

NAME L_RNA_scaffolder - using long transcriptome reads to scaffold genome
CONTENTS
DESCRIPTION
SYSTEM REQUIREMENTS
INSTALLING
INPUT FILE
COMMANDS AND OPTIONS
OUTPUT FILE
SPEED

DESCRIPTION
L_RNA_scaffolder is a genome scaffolding tool with long trancriptome reads. The long transcriptome reads could be generated by 454/Sanger/Ion_Torrent sequencing, or de novo assembled with pair-end Illumina sequencing. The long reads are aligned to genome fragments using BLAT and alignment file (PSL format with no heading) is used as the input file of L_RNA_scaffolder. L_RNA_scaffolder searches "guider" reads, fragment of which were mapped to different genome fragments. Then the "guider" reads orientated and ordered the genome fragments into longer scaffolds.

SYSTEM REQUIREMENTS
The software, written with Shell script, consists of C++ programs and Perl programs. The C programs have been precompiled and therefore could be directly executed. To run Perl program, perl and Bioperl modules should be installed on the system. Further, the program required PSL file as input file. Thus, BLAT program should also be installed on the system. L_RNA_scaffolder has been tested and is supported on Linux.

INSTALLING
After downloading the sofware, simply type "tar -zxvf L_RNA_scaffolder.tar.gz" in the base installation directory. The software is either written in C++ and compiled, or written in Perl. Therefore, it should not require any special compilation and is already provided as portable precompiled software.

INPUT FILES
PSL file and genome fragment fasta file are necessary for scaffolding. The psl file was generated using BLAT program with "-noHead" option. The genome fragment file should be fasta format, consistent with the subject sequences when using BLAT program. Another file, named overlapping file, contain two columns. This file is not necessary but will avoid some interesting genome fragments not being scaffolding.

COMMANDS AND OPTIONS
L_RNA_scaffolder is run via the shell script: L_RNA_scaffolder.sh found in the base installation directory.

Usage info is as follows:

Required:
-d : the directory where the programs are in.
-i : the output of transcripts alignment with BLAT. SeePSL format
-j : the genome fragments fasta file which will be scaffolded and was used as the database when aligning transcript reads.
Optional:
-r : some fragments which you might be interesting and will not be scaffolded. The file has two columns per row. One row stand for that two fragments might be connected and should not be scaffolded.
-l : the threshold of alignment length coverage (default:0.95). If one read has a hit of which length coverage was over the threshold, then this read would be filtered out.
-p : the threshold of alignment identity (default: 0.9). If one alignment has an identity over the threshold, then the alignment is kept for the further analysis.
-o : the directory where the output file is stored. The default output directory is equal to the program directory.
-e : The maximal intron length between two exons (default: 100kb).
-f : the minimal number of supporting reads (default: 1). If the number of the supporting reads for the connection is over the frequency, then this connection is reliable.

Note: a typical L_RNA_scaffolder command might be:
# sh L_RNA_scaffolder.sh -d ./ -i input.psl -j genome.fasta

OUTPUT FILES
When L_RNA_scaffolder completes, it will create an L_RNA_scaffolder.fasta output file in the output_dir/ output directory.

SPEED
L_RNA_scaffolder spent 5 hours in scaffolding human genome fragments with a psl file of 43 GB generated from alignment of 8.3 millions of transcripts. However, the sequencing alignment is time-consuming when using BLAT to alignment nearly ten millions of reads. We recommend splitting the reads into multiple pieces and running the alignments simultaneously.

======================================================================================================================

在做基因组拼接的时候,有时候已经获得了一堆contig或者scaffold,然后需要利用另外的一个pair-end或者mate-pair文库进一步做scaffolding的话。用什么软件好呢?

最近使用了一款SSPACE貌似不错,这款软件可以利用Pair-End(PE)和Mate-Pair(MP)的二代数据将已有的contig或者scaffold连接成更大的scaffold。同时这个软件还能够利用PE或者MP的数据对已有的contig或者scaffold进行延伸(或者填补部分的gap)。

这个软件做的不错,但是还是要小小地complain一下:这个软件分basic版和premium版,前者免费使用,后者则要求用户捐 € 250给他们才能使用。基于GNU的软件还有这么做的?!!!

关于软件的相关论文:

SSPACE has shown excellent performance on various datasets. The results have been published in the high-impact journal Bioinformatics (2011, vol. 27(4), pag. 578-9).

这个软件使用方法很简单,大家可以下载basic版,然后阅读里面的User_Manual 和 Tutorial。前者告诉你每个参数的意义后者是软件使用的简单教程。另外里面还有一个README文件是介绍软件算法过程的。如果由国内用户无法 下载的可以联系我。我这现在有最新的1.2版本。
下面是附上SSPACE的参数介绍、参数含义、算法过程(工作原理):

SSPACE参数

01 Usage: perl SSPACE_v1-2.pl options
02  
03 -l  Library file containing two mate pate files with insert size, error and either mate pair or paired end indication.
04 -s  Fasta file containing contig sequences used for extension. Inserted pairs are mapped to extended and non-extended contigs (REQUIRED)
05 -x  Indicate whether to extend the contigs of -s using paired reads in -l. (-x 1=extension, -x 0=no extension, default -x 1)
06 -m  Minimum number of overlapping bases with the seed/contig during overhang consensus build up (default -m 32)
07 -o  Minimum number of reads needed to call a base during an extension (default -o 20)
08 -t  Trim up to -t base(s) on the contig end when all possibilities have been exhausted for an extension (default -t 0, optional)
09 -g  Maximum number of allowed gaps during mapping with Bowtie. Corresponds to the -v option in Bowtie. *higher number of allowed gaps can lead to least accurate scaffolding* (default -v 0, optional)
10 -b  Base name for your output files (optional)
11 -v  Runs in verbose mode (-v 1=yes, -v 0=no, default -v 0, optional)
12 ============ Options below only considered for scaffolding ============
13 -k  Minimum number of links (read pairs) to compute scaffold (default -k 5, optional)
14 -a  Maximum link ratio between two best contig pairs *higher values lead to least accurate scaffolding* (default -a 0.7, optional)
15 -n  Minimum overlap required between contigs to merge adjacent contigs in a scaffold (default -n 15, optional)
16 -u  Fasta/fastq file containing unpaired sequence reads (optional)
17 -p  Make .dot file for visualisation (-p 1=yes, -p 0=no, default -p 0, optional)

SSPCAE的参数含义:

01 The ‘-l’ library file:
02 ---------------
03 The library file contains information about each library. The library file contains six columns, each separated by a space. An example of a library file is;
04  
05 Lib1 file1.fasta file2.fasta 400 0.5 0
06 Lib1 file2.fasta file2.fasta 400 0.5 0
07  
08 Lib2 file3.fastq file3.fastq 4000 0.75 1
09  
10 Each column is explained in more detail below; Column 1:
11 ---------
12 Name of the library. A short name to keep track of the names of the libraries. All temporary files and summary statistics are named by this library name.  Libraries having same name are considered to be of same distance and devaition (column 4 and 5). Libraries should be sorted on distance, the first library is scaffolded first, followed by next libraries.
13  
14 Column 2 & 3:
15 ---------
16 Fasta or fastq files for both ends. For each paired read, one of the reads should be in the first file, and the other one in the second file. The paired reads are required to be on the same line. No naming convention of the reads is required, because names of the headers are not used in the protocol. Thus names of the headers shouldn’t be the same and do not require any overlap of names like (…).x and (…).y, which is commonly used in assembly programs.
17  
18 During the reading step, the sequences of both pairs are merged together and filtered for;
19 -Mapping. the filtered read pairs with only ACGT characters and no duplicates are used. The remaining merged read pairs are split to single reads, and are mapped to contigs.
20 -Extension. only unmapped single reads are used for extension of the pre-assembled contigs.
21 -Scaffolding. Besides the filtering of mate pairs containing “N” and non- ACGT character, duplicate mate pair sequences are also removed.
22  
23 Concluding, each read should be larger than 16 (or the ‘–m’ parameter if
24 -x 1). If they are shorter, the program will simply omit them from the assembly.
25  
26 Column 4 & 5:
27 ---------
28 The fourth column represents the expected/observed inserted size between paired reads. The fifth column represents the minimum allowed error. A combination of both means e.g. that with an expected insert size of 4000 and 0.5 error, the distance can have an error of 4000 * 0.5 =
29 2000 in either direction. Thus pairs between 2000 and 6000 distance are valid pairs.
30  
31 Column 6:
32 ---------
33 The final column indicates that the reads in both inserted fasta files should be reverse complemented or not. This column should either be set to 0 (no reverse complement) or 1 (reverse complement). SSPACE requires an A--><--B orientation of the reads. If the library has A--><--B orientation of the reads, typical paired-end library, no reverse complement should be used. For a library with A<---->B orientation, which is typically used for mate pair libraries, one should set this column to 1, to reverse complement both reads to A--><--B orientation
34  
35 The ‘-s’ contigs fasta file
36 ---------------
37 The ‘–s’ contigs file should be in a .fasta format. The headers are used to trace back the original contigs on the final scaffold fasta file. Therefore, names of the headers should not be too complex. A naming of “>contig11” or “>11”, should be fine. Otherwise, headers of the final scaffold fasta file will be too large and hard to read.
38 Contigs having a non-ACGT character like “.” or “N” are not discarded. They are used for extension, mapping and building scaffolds. However, contigs having such character at either end of the sequence, could fail for proper contig extension and alignment of the reads.
39  
40 The ‘-x’ contig extension option
41 ---------------
42 Indicate whether to do extension or not. If set to 1, contigs are tried to be extended using the unmapped sequences. If set to 0, no extension is performed.
43  
44 The ‘–m’ minimum overlap
45 ---------------
46 Minimum number of overlapping bases of the reads with the contig during overhang consensus build up. Higher ‘-m’ values lead to more accurate contigs at the cost of decreased contiguity. We suggest to take a value close to the largest read length. For example, for a library with
47 36bp reads, we suggest to use a -m value between 32 and 35 for reliable contig extension. For more information, see the SSPACE README file
48 or the SSAKE paper/poster.
49  
50 The -o number of reads
51 ---------------
52 Minimum number of reads needed to call a base during an extension, also known as base coverage. The higher the ‘-o’, the more reads are considered for an extension, increasing the reliability of the extension.
53  
54 The ‘-t’ trimming option
55 ---------------
56 Trims up to ‘-t’ base(s) on the contig end when all possibilities have been exhausted for an extension. See SSAKE help files for information.
57  
58 The ‘-k’ minimal links and ‘-a’ maximum link ratio
59 ---------------
60 Two parameters control scaffolding (-k and -a).  The -k option specifies the minimum number of links (read pairs) a valid contig pair must have to be considered.  The -a option specifies the maximum ratio between the best two contig pairs for a given contig being extended. For more information see the .readme file or the poster of SSAKE.
61  
62 The ‘-n’ contig overlap
63 ---------------
64 Minimum overlap required between contigs to merge adjacent contigs in a scaffold. Overlaps in the final output are shown in lower-case characters.
65  
66 The '-g' maximum gaps
67 ---------------
68 Maximum allowed gaps for Bowtie, this parameter is used both at mapping during extension and mapping during scaffolding. This option
69  
70 corresponds to the -v option in Bowtie. We strongly recommend using no gaps, since this will slow down the process and can decrease the reliability of the scaffolds. We only suggest to increase this parameter when large reads are used, e.g. Roche 454 data or Illumina 100bp.
71  
72 The ‘-u’ unpaired single reads
73 ---------------
74 Unpaired reads can be inserted for contig extension. These reads are not used for scaffolding.
75  
76 The ‘-p’ plot option
77 ---------------
78 Indicate whether to generate a .dot file for visualisation of the produced scaffolds.
79  
80 The ‘-b’ prefix base name
81 ---------------
82 All files start with the ‘-b’ prefix to allow for multiple runs on the same folder without overwriting the results.
83  
84 The ‘-v’ verbose option
85 ---------------
86 Indicate whether to run in verbose mode or not. If set, information about the process is printed on the screen.

SSPACE算法原理

001 The program consists of several steps, a short overview;
002  
003 The first steps are reading the data and filter them. The protocol is slightly different when -x is set to either 0 or 1. We treat them separately here;
004  
005 With -x 0 the steps are;
006 1) Read -l library file;
007    A) For each library in the -l library file;
008       -merge both reads of a pair in to a single string and store them in a file. Filter this data on reads containing N's.
009 2) Convert the inserted contig file to appropriate format.
010  
011 With -x 1 the steps are;
012  
013 1) Read -l library file;
014    A) For each library separately
015    - merge both reads of a pair in to a single file for mapping. Filter this data on reads containing N's.
016    B) For all libraries
017    - store the single reads to a new file. Only reads containing only ACGT characters are stored.
018 2) Extend the pre-assembled contigs
019    A) Map single reads of step 1B to (-s) contig file with Bowtie.
020    B) Read unmapped reads into memory.
021    C) Go through each contig in the (-s) contig file, and try to extend the contig. The new contigs are stored in a new file.
022  
023 After producing either a formatted or an extended contig file, the next step is to go through each library in the -l library file and map the filtered paired reads of step 1A to the new contigs;
024  
025 3) Use Bowtie to map single reads of 1A to either the formatted or extended contigs. Map only reads that are on the edges of the contigs. Only reads that map to only one contig are stored in a file. Position and orientation of each read is stored in the file.
026 4) Store the position information of each found read on a contig to an hash.
027 5) Go through the paired reads and use only reads for contig pairing if both pairs are found in the hash. If a pair is already used for contig pairing, it is not used again.
028 6) Pair contigs based on the number of links (-k) and link ratio (-a)
029 7) Merge, orient and order the contigs to produce scaffolds.
030  
031 8) If multiple libraries are in -l file, the produced scaffolds in fasta format are the input for the new library. Steps 3 till 8 are repeated for each library.
032  
033 A more detailed view of the six main steps are given below.
034  
035 Detailed view
036 ------------
037  
038 1. Reading and filtering libraries
039 Both fasta/fastq files inserted at the -l library file are read and merged together, forming a single string separated by a semicolon;
040  
041 >read1_A of file 1
042 ACGATGCTAT
043  
044 >read1_B of file 2
045 ACCGCGCCCC
046  
047 >merged_read
048 ACGATGCTAT:ACCGCGCCCC
049  
050 If the merged read contains only ACGT characters, it is stored in a new file.  Filtered pairs are used for contig pairing, see step 4.
051  
052 If -x 1 is set, for contig extension, single reads containing only ACGT characters are stored in a new file. The single reads are mapped to contigs at the next step.
053  
054 2. Mapping when -x 1
055 To extend contigs, only reads that are not already present on the contigs should be used. Otherwise, reads are re-used and cause erroneous contigs. To filter these reads out, Bowtie is used. Bowtie maps the produced single reads at step 1 to the (-s) pre-assembled contigs. A file is generated with reads that did not map to the contigs. The unmapped read file is read in memory, populating a hash table keyed by unique sequence reads with pairing values representing the number of sequence occurrences. The hash is used for contig extension at the next section.
056  
057 3. Extending when -x 1
058 Contigs are extended, when -x set to 1, using the unmapped reads with a method developed by SSAKE. With SSAKE, contigs extension is initiated by generating the longest 3'-most word (k-mer) from the unassembled read u that is shorter than the sequence read length l.  Every possible 3' most k-mers will be generated from u and used in turn for the search until the word length is smaller than a user-defined minimum, m.  Meanwhile, all perfectly overlapping reads will be collected in an array and further considered for 3' extension once the k-mer search is done.  At the same time, a hash table c will store every base along with a coverage count for every position of the overhang (or stretches of bases hanging off the seed sequence u).
059  
060 Once the search complete, a consensus sequence is derived from the hash table c, taking the most represented base at each position of the overhang.  To be considered for the consensus, each base has to be covered by user-defined -o (set to 2 by default).  If there's a tie (two bases at a specific position have the same coverage count), the prominent base is below a user-defined ratio r, the coverage -o is to low or the end of the overhang is reached, the consensus extension terminates and the consensus overhang joined to the contig.  All reads overlapping are searched against the newly formed sequence and, if found, are removed from the hash table and prefix tree. If they are not part of the consensus, they will be used to extend other contigs, if applicable.  If no overlapping reads match the newly formed contig, the extension is terminated from that end and SSAKE resumes with a new contig.  That prevents infinite looping through low complexity DNA sequences. In the former case, the extension resumes using the new [l-m] space to search for joining k-mers.
061  
062 The process of progressively cycling through 3'-most k-mer is repeated after every contig extension until nothing else can be done on that side.  Since only left-most searches are possible with a prefix tree, when all possibilities have been exhausted for the 3' extension, the complementary strand of the contiguous sequence generated is used to extend the contig on the 5' end.  The DNA prefix tree is used to limit the search space by segregating sequence reads and their reverse-complemented counterparts by their first eleven 5' end bases.
063  
064 There are two ways to control the stringency in SSPACE:
065 1) Disallow contig extension if the coverage is too low (-o). Higher -o values lead to shorter contigs, but minimizes sequence misassemblies.
066 2) Adjust the minimum overlap -m allowed between the contig and short sequence reads. Higher m values lead to more accurate contigs at the cost of decreased contiguity.
067  
068 After the sequence assembly, a file is generated with .extendedcontigs.fasta extension in the 'intermediate_results' folder. This file contains both extended and non-extended contigs.
069  
070 The next steps are looped through each library, present in the (-l) library file.
071  
072 4. Mapping unique paired reads
073  
074 At step 1, pairs of each library were filtered. Pairs containing N's are unable to correctly map to the contigs, therefore they are removed to save time. Of the remaining pairs, only single reads are extracted. Bowtie maps the single reads to the contigs, produced either after extending (if -x 1), or after formatting (if -x 0), or after step 5 if multiple libraries are inserted on -l.
075  
076 Before mapping, contigs are shortened, reducing the search space for Bowtie. Only edges of the contigs are considered for mapping. Cutting of edges is determined by taking the maximal allowed distance inserted by the user in the library file (insert size and insert standard deviation). The maximal distance is insert_size + (insert_size * insert_stdev). For example, with a insert size of 500 and a deviation of 0.5, the maximal distance is 750. First 750 bases and last 750 bases are subtracted from the contig sequence, in this case.
077  
078 ------------------------------------------
079            |                  |
080 ------------                  ------------
081    750bp                          750bp
082  
083 This step reduces the search space by merging the two sequences, divided by a <N> character.
084  
085 The algorithm of mapping goes through each pair and checks its occurrence on the edges of the contigs. If both reads are found, the reads of the pair is stored and contigs could be paired in the next step. Otherwise, it is not stored and the read pair is not used for contig pairing. If a pair is previously found and used for contig pairing, the pair is not considered again. Otherwise same links between contigs are found based on same read pair, which can generate misleading results.
086  
087 If either of the two reads of a read pair occur on multiple contigs, one can not tell which contig should be paired. For example, the left read occurs at contigs 1 and 3, and the right read at contig 2. For this situation it is impossible to tell if contigs 1 and 2 should be paired, or contigs 1 and 3. Therefore, reads that occur multiple times on contigs are not considered for contig pairing.
088  
089 5. Building scaffolds
090 The final step is scaffolding. SSPACE uses an updated version of the SSAKE scaffolder for this. For each read pairs, putative contig pairs (pre-scaffolding stage) are tallied based on the position/location of the paired reads on different contigs. Contig pairs are only considered if the calculated distance between them satisfy the mean distance specified (fourth column in -l file) while allowing for a deviation (fifth column in -l file), also defined by the user. Only contig pairs having a valid gap or overlap are allowed to proceed to the scaffolding stage.
091 Please note that this stage accepts redundancy of contig pairs (i.e. a given contig may link to multiple contigs, and the number of links (spanning pairs) between any given contig pair is recorded, along with a mean putative gap or overlap(-)).
092  
093 Once pairing between contigs is complete, the scaffolds are built using contigs as seeds. Every contig is used in turn until all have been incorporated into a scaffold.
094  
095 Consider the following contig pairs (AB, AC and rAD):
096  
097     A         B
098 ========= ========
099   ->       <-
100    ->        <-
101     ->      <-
102        ->       <-
103  
104     A       C
105 ========= ======
106   ->        <-
107     ->        <-
108  
109    rA        D           equivalent to rDA, in this order
110 ========= =======
111       ->   <-
112      ->   <-
113        ->   <-
114  
115 Two parameters control scaffolding (-k and -a).  The -k option specifies the minimum number of links (read pairs) a valid contig pair MUST have to be considered.  The -a option specifies the maximum ratio between the best two contig pairs for a given contig being extended.  For example, contig A shares 4 links with B and 2 links with C, in this orientation.  contig rA (reverse) also shares 3 links with D.   When it's time to extend contig A (with the options -k and -a set to 2 and 0.7, respectively), both contig pairs AB and AC are considered.  Since C (second-best) has 2 links and B (best) has 4 (2/4) = 0.5 below the maximum ratio of 0.7, A will be linked with B in the scaffold and C will be kept for another extension. If AC had 3 links the resulting ratio (0.75), above the user-defined maximum 0.7 would have caused the extension to terminate at A, with both B and C considered for a different scaffold.  A maximum links ratio of 1 (not recommended) means that the best two candidate contig pairs have the same number of links -- SSPACE will accept the first one since both have a valid gap/overlap. The above method was adopted from SSAKE. The SSPACE improved this method by introduing another method if a contig can link to more than one alternative. Both methods (original SSAKE method and our method) for handling alternatives are explained below;
116  
117 If a contig can be linked to more than one alternative, connections between these alternatives are searched and linked together if a connection is found. Otherwise a ratio is calculated between the two best alternatives. If this ratio is below a threshold (-a) a connection with the best scoring alternative is established. The two methods are shown below;
118  
119 The first method;
120 A has 10 links with B
121 A has 5 links with C
122 B has 10 links with C;
123  
124 Result is a scaffold containing A-B-C
125  
126 The second method adopted from SSAKE (only used if first method did not produce a scaffold);
127 A has 10 links with B
128 A has 5 links with C
129 B has 0 links with C;
130  
131 Since no connection is found between B and C, a ratio of the links is estimated. Here, this is 5/10 (C/B) = 0.5. If this is below the -a ratio threshold, the scaffold is A-B.
132  
133 When a scaffold extension is terminated on one side, the scaffold is extended on the "left", by looking for contig pairs that involve the reverse of the seed (in this example, rAD).  With AB and AC having 4 and 2 links, respectively and rAD being the only pair on the left, the final scaffolds outputted by SSPACE would be:
134  
135 1) rD-A-B
136 2) C
137  
138 SSPACE outputs a .scaffolds file with linkage information between contigs (see "Understanding the .scaffolds csv file" below)
139 Accurate scaffolding depends on many factors.  Number and nature of repeats in your target sequence, optimum adjustments of insert size, error, parameters -k and -a and data quality/size of sequence set (more doesn't mean better) will all affect SSPACE's ability to build scaffolds.
140  
141 6. Merging contigs
142 SSAKE scaffolder produces links between contigs and determines the possible gap between them. For a positive gap, m number of N's will be placed between them if a gap of size m is predicted to occur. When a negative gap is generated, a putative overlap is predicted to occur. The adjacent contigs are searched for overlap within a window given at -n option till 50 bp. If an overlap was found, contigs are merged and the region is marked with lowercase nucleotides. Otherwise, if no overlap was detected, a single "n" will be placed between the contigs. A short overview of this step with three examples;
143  
144 >contig_1
145 AGCTAGTCGTAGCTTGTAC
146 >contig_2
147 ACGTAGTGATATTATTGTC
148  
149 Example 1:
150 A link between contig_1 and contig_2 is found, with a putative gap of 10. In the final output, the gaps is indicated by 10 N's between the two contigs.
151  
152 Link = contig_1 with contig_2. Gap = 10;
153 AGCTAGTCGTAGCTTGTACNNNNNNNNNNACGTAGTGATATTATTGTC
154  
155 Example 2;
156 A link between contig_1 and contig_2 is found, with a putative gap of -10. When using the -n 10 option, no overlap was found and a small <n> is inserted between the two contigs.
157  
158 Link = contig_1 with contig_2. Gap = -10. -n = 10;
159 AGCTAGTCGTAGCTTGTACnACGTAGTGATATTATTGTC
160  
161 Example 3;
162 A link between contig_3 and contig_4 is found, with a putative gap of -10. When using the -n 10 option, an overlap of 13 nucleotides was found, indicated in lower case in the final output.
163  
164 >contig_3
165 AGTGTTAGATAGTTATAGA
166 >contig_4
167 AGATAGTTATAGAAGTAGT
168  
169 Link = contig_3 with contig_4. Gap = -10. -n = 10;
170 AGTGTTagatagttatagaAGTAGT