seqclean

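Before assembling ESTs with TGICL, any contamination in the ESTs (for example, stretches of vector sequence) must be removed first, because it puts correct assembly at risk. Seqclean is the tool used to remove these contaminating sequences (download: http://compbio.dfci.harvard.edu/tgi/software/).
To remove contaminants with Seqclean you need a contaminant database. You can build one to suit your own needs; in most cases the contaminant sequence database provided by NCBI's VecScreen is sufficient. It can be downloaded from ftp://ftp.ncbi.nih.gov/pub/UniVec/.
After downloading, the database must be formatted with formatdb before Seqclean can use it.
Using Seqclean
It removes contaminating sequences from ESTs (vector sequences, adapters, linkers, and primers commonly used in the process of cloning cDNA or genomic DNA), which is an important step toward correct EST assembly.
Usage: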
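The formatting step can be sketched roughly as follows. This assumes the UniVec FASTA file has already been downloaded into the current directory under the name UniVec (a placeholder path) and that the legacy NCBI formatdb program is on the PATH; the script only prints the command if either assumption does not hold.

```shell
#!/bin/sh
# Placeholder: path to the downloaded UniVec FASTA file.
UNIVEC="UniVec"

# -i: input FASTA file; -p F: build a nucleotide (not protein) database.
FORMAT_CMD="formatdb -i $UNIVEC -p F"

if command -v formatdb >/dev/null 2>&1 && [ -f "$UNIVEC" ]; then
    $FORMAT_CMD
    # A nucleotide database normally gets index files next to the input,
    # such as UniVec.nhr, UniVec.nin and UniVec.nsq.
    ls "$UNIVEC".n*
else
    # Tool or file not available here: just show what would be run.
    echo "would run: $FORMAT_CMD"
fi
```

The name passed to formatdb with -i is the same name later given to seqclean with -v.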
seqclean <seqfile> [-v <vecdbs>] [-s <screendbs>] [-r <reportfile>]
[-o <outfasta>] [-n slicesize] [-c {<num_CPUs>|<PVM_nodefile>}]
[-l <minlen>] [-N] [-A] [-L] [-x <min_pid>] [-y <min_vechitlen>]
[-m <e-mail>]
Parameters
<seqfile>: sequence file to be analyzed (multi-FASTA)
-c use the specified number of CPUs on local machine
(default 1) or a list of PVM nodes in <PVM_nodefile>
-n number of sequences taken at once in each
search slice (default 2000)
-v comma delimited list of sequence files
to use for end-trimming of <seqfile> sequences
(usually vector sequences)
-l during cleaning, consider invalid the sequences shorter
than <minlen> (default 100)
-s comma delimited list of sequence files to use for
screening <seqfile> sequences for contamination
(mito/ribo or different species contamination)
-r write the cleaning report into file <reportfile>
(default: <seqfile>.cln)
-o output the "cleaned" sequences to file <outfasta>
(default: <seqfile>.clean)
-x minimum percent identity for an alignment with
a contaminant (default 96)
-y minimum length of a terminal vector hit to be considered
(>11, default 11)
-N disable trimming of ends rich in Ns (undetermined bases)
-M disable trashing of low quality sequences
-A disable trimming of polyA/T tails 
-L disable low-complexity screening (dust)
-I do not rebuild the cdb index file
-m send e-mail notifications to <e-mail>
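As a rough illustration of how several of these options combine, the sketch below builds a single command line that trims against a vector database, uses 4 CPUs, treats sequences shorter than 150 bases as invalid, and writes the output and report to explicit names. The file names (seq, UniVec, seq.trimmed.fa, seq.report.cln) and the chosen values are placeholders, not recommendations; the command is only executed if seqclean is on the PATH and the input file exists.

```shell
#!/bin/sh
# Placeholder inputs: adjust to your own files.
SEQ="seq"          # multi-FASTA file to clean
VECDB="UniVec"     # vector database already formatted with formatdb

# Only options listed in the parameter table above are used.
CMD="seqclean $SEQ -v $VECDB -c 4 -l 150 -o ${SEQ}.trimmed.fa -r ${SEQ}.report.cln"

if command -v seqclean >/dev/null 2>&1 && [ -f "$SEQ" ]; then
    $CMD
else
    # Tool or input not available here: just show what would be run.
    echo "would run: $CMD"
fi
```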
example:
./seqclean seq -v /home/apple/Documents/UniVec
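Here seq is the sequence file to be cleaned; the sequences in it are in FASTA format.
The resulting seq.clean file contains the sequences after contamination has been removed according to the UniVec database.
For detailed usage, read the README file in the same directory.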
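One quick sanity check after a run is to compare the number of FASTA records before and after cleaning. The helper below is a minimal sketch, not part of seqclean; count_seqs is a hypothetical name, and it simply counts header lines that start with ">".

```shell
#!/bin/sh
# Count FASTA records in a file: every record starts with a ">" header line.
count_seqs() {
    grep -c '^>' "$1"
}

# Report counts for the input and the cleaned output, if they exist.
for f in seq seq.clean; do
    if [ -f "$f" ]; then
        echo "$f: $(count_seqs "$f") sequences"
    fi
done
```

If the -l description applies as written, sequences judged invalid are presumably left out of seq.clean, so a lower count there would be expected; the report file seq.cln is the place to look for the reason each sequence was trimmed or discarded.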


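seqclean is powerful: it can remove poly-A tails, adapters, and vector sequence, and it can also filter out mitochondrial sequences, ribosomal sequences, and so on.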
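Screening for mitochondrial or ribosomal sequences goes through the -s option, which takes a comma-delimited list of files. The sketch below shows one way to assemble such a list. The database names mito.fa and rRNA.fa are hypothetical, join_commas is a helper made up for this example, and the screening files are assumed to have been prepared the same way as the vector database.

```shell
#!/bin/sh
# Join the arguments with commas, the list format that -v and -s expect.
# The body runs in a subshell so the IFS change does not leak out.
join_commas() (
    IFS=,
    printf '%s\n' "$*"
)

# Hypothetical screening databases.
SCREEN=$(join_commas mito.fa rRNA.fa)

# Trim vector ends with -v and screen for contaminants with -s.
CMD="seqclean seq -v UniVec -s $SCREEN"

if command -v seqclean >/dev/null 2>&1 && [ -f seq ]; then
    $CMD
else
    # Tool or input not available here: just show what would be run.
    echo "would run: $CMD"
fi
```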

(Note: the file given after -v must be the name of a file that has been formatted with formatdb.)
