AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences.

1. Installing Augustus


$ wget  
$ tar zxf augustus.2.7.tar.gz  
$ cd augustus.2.7  
$ cd src  
$ make -j 8  
$ export AUGUSTUS_CONFIG_PATH=$PWD/../config/ (can be added to .bashrc; use the absolute path there instead of $PWD/..)
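
To keep the setting across shell sessions, the export line can be appended to a startup file. The following is a minimal sketch, assuming the source tree was unpacked under `$HOME/augustus.2.7` (a hypothetical location). `RC_FILE` defaults to a scratch file so the snippet can be tried without touching the real `~/.bashrc`.

```shell
# Hypothetical install location; adjust to where the tarball was unpacked.
AUGUSTUS_DIR="$HOME/augustus.2.7"
# Scratch file for a dry run; set RC_FILE="$HOME/.bashrc" for real use.
RC_FILE="${RC_FILE:-$(mktemp)}"

LINE="export AUGUSTUS_CONFIG_PATH=$AUGUSTUS_DIR/config/"

# Append the export line only if an identical line is not already present.
add_line_once() {
    grep -qxF "$LINE" "$RC_FILE" 2>/dev/null || printf '%s\n' "$LINE" >> "$RC_FILE"
}

# Calling the guard twice shows that it does not duplicate the line.
add_line_once
add_line_once

# Number of matching lines in the file.
grep -cxF "$LINE" "$RC_FILE"
```

Guarding the append with `grep -qxF` keeps the startup file clean when the setup steps are rerun.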

2. Using Augustus

2.1 Gene prediction examples

$ augustus --strand=both --genemodel=partial --singlestrand=false --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.cfg --protein=on --introns=on --start=on --stop=on --cds=on --codingseq=on --alternatives-from-evidence=true --gff3=on --UTR=on --outfile=out.gff --species=human genome.fa  
$ augustus --noprediction=true --species=SPECIES
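
The second command uses `--noprediction=true`, which is meant for GenBank-format input. When such a file is prepared for accuracy evaluation, it helps to know which records carry no CDS feature. The sketch below finds them by scanning the file directly with awk, which is a different route from running AUGUSTUS itself. The helper name `list_no_cds` and the toy records are made up for illustration, and the pattern assumes the standard GenBank layout in which feature keys are indented by five spaces.

```shell
# Toy GenBank-like input (made-up records): only the first one has a CDS feature.
GB="$(mktemp)"
cat > "$GB" <<'EOF'
LOCUS       seqA        120 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     source          1..120
     CDS             10..99
ORIGIN
//
LOCUS       seqB         90 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     source          1..90
ORIGIN
//
EOF

# list_no_cds FILE: print the LOCUS name of every record without a CDS feature line.
# Hypothetical helper, not part of AUGUSTUS.
list_no_cds() {
    awk '
        /^LOCUS/     { name = $2; has_cds = 0 }    # a new record starts
        /^     CDS / { has_cds = 1 }               # feature key in the FEATURES table
        /^\/\//      { if (name != "" && !has_cds) print name; name = "" }  # record ends
    ' "$1"
}

list_no_cds "$GB"    # prints: seqB
```

The printed names can then be used to drop those records before evaluating prediction accuracy.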

2.2 Augustus parameters


augustus [parameters] --species=SPECIES queryfilename
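
`SPECIES` must name a parameter set that AUGUSTUS can find. Assuming the usual layout in which each species has its own directory under `$AUGUSTUS_CONFIG_PATH/species/`, a small pre-flight check can catch a misspelled name or an unset variable before a long run. `check_species` is a hypothetical helper, not part of AUGUSTUS, and the demonstration uses a throwaway directory that only mimics that layout.

```shell
# check_species NAME: succeed if a parameter directory for NAME exists
# under $AUGUSTUS_CONFIG_PATH/species (hypothetical helper, not part of AUGUSTUS).
check_species() {
    [ -n "${AUGUSTUS_CONFIG_PATH:-}" ] || { echo "AUGUSTUS_CONFIG_PATH is not set" >&2; return 2; }
    [ -d "$AUGUSTUS_CONFIG_PATH/species/$1" ] || { echo "no parameters for species '$1'" >&2; return 1; }
}

# Demonstration against a temporary directory that mimics the assumed layout.
AUGUSTUS_CONFIG_PATH="$(mktemp -d)"
export AUGUSTUS_CONFIG_PATH
mkdir -p "$AUGUSTUS_CONFIG_PATH/species/human"

check_species human && echo "human: ok"
check_species unicorn || echo "unicorn: missing"
```

In real use the demonstration lines would be dropped and the check chained in front of the run, for example `check_species human && augustus --species=human genome.fa`.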


--strand=both, --strand=forward or --strand=backward      Report predicted genes on both strands, just the forward strand, or just the backward strand. Default is 'both'.  
--genemodel=partial, --genemodel=intronless, --genemodel=complete, --genemodel=atleastone or --genemodel=exactlyone  
      partial: allow prediction of incomplete genes at the sequence boundaries (default)  
      intronless: only predict single-exon genes like in prokaryotes and some eukaryotes  
      complete: only predict complete genes  
      atleastone: predict at least one complete gene  
      exactlyone: predict exactly one complete gene  
--singlestrand=true      Predict genes independently on each strand, allowing overlapping genes on opposite strands. This option is turned off by default.  
--hintsfile=hintsfilename      When this option is used, prediction that considers hints (extrinsic information) is turned on. hintsfilename contains the hints in GFF format.  
--extrinsicCfgFile=cfgfilename      Optional. This file contains the list of sources used for the hints and their boni and mali. If not specified, the file "extrinsic.cfg" in the config directory $AUGUSTUS_CONFIG_PATH is used.  
--maxDNAPieceSize=n      This value specifies the maximal length of the pieces that the sequence is cut into for the core algorithm (Viterbi) to be run. Default is --maxDNAPieceSize=200000. AUGUSTUS tries to place the boundaries of these pieces in the intergenic region, which is inferred by a preliminary prediction. GC-content dependent parameters are chosen for each piece of DNA if /Constant/decomp_num_steps > 1 for that species. This is why this value should not be set very large, even if you have plenty of memory.  
--protein=on/off, --introns=on/off, --start=on/off, --stop=on/off, --cds=on/off, --codingseq=on/off      Output options. Output the predicted protein sequence, introns, start codons, stop codons. Or use 'cds' in addition to 'initial', 'internal', 'terminal' and 'single' exon. The CDS excludes the stop codon (unless stopCodonExcludedFromCDS=false), whereas the terminal and single exon include the stop codon.  
--AUGUSTUS_CONFIG_PATH=path      Path to the config directory (if not specified as an environment variable).  
--alternatives-from-evidence=true/false      report alternative transcripts when they are suggested by hints    
--alternatives-from-sampling=true/false      report alternative transcripts generated through probabilistic   sampling    
--sample=n, --minexonintronprob=p, --minmeanexonintronprob=p, --maxtracks=n      For a description of these parameters, see section 4 of the AUGUSTUS README.  
--proteinprofile=filename      Read a protein profile from file filename. See section 7 of the AUGUSTUS README.  
--predictionStart=A, --predictionEnd=B      A and B define the range of the sequence for which predictions should be found. This is quicker if you need predictions only for a small part.  
--gff3=on/off      output in gff3 format.    
--UTR=on/off      Predict the untranslated regions in addition to the coding sequence. This currently works only for human, galdieria, toxoplasma and caenorhabditis.  
--outfile=filename      Print output to filename instead of standard output. This is useful for computing environments, e.g. parasol jobs, which do not allow shell redirection.  
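--noInFrameStop=true/false      Do not report transcripts with in-frame stop codons. Otherwise, intron-spanning stop codons could occur. Default: false  
--noprediction=true/false      If true and the input is in GenBank format, no prediction is made. Useful for getting the annotated protein sequences.  
Augustus can also take a GenBank-format file as input, predict genes, and compare the predictions with the GenBank annotation to produce accuracy statistics. However, some sequences in a GenBank file have no CDS annotation. This parameter can be used to detect them and obtain the IDs of the sequences without a CDS. Once those sequences have been removed by hand, prediction accuracy can be evaluated.  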
--contentmodels=on/off      If 'off', the content models are disabled (all emissions uniformly 1/4). The content models are: coding region Markov chain (emiprobs), initial k-mers in coding region (Pls), intron and intergenic region Markov chain. This option is intended for special applications that require judging gene structures from the signal models only, e.g. for predicting the effect of SNPs or mutations on splicing. For all typical gene predictions, this should be on. Default: on  
--paramlist      For a complete list of parameters, type "augustus --paramlist"
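
As an illustration of the `--hintsfile` input listed above: hints are given as GFF lines with nine tab-separated columns, and the last column names the hint source with `src=` (or `source=`), which should correspond to a source listed in the extrinsic configuration file. The sketch below writes a toy hints file and checks its shape with awk. The sequence name, coordinates, and the source keys `E` and `M` are made-up values, and whether those keys are defined depends on the `extrinsic.cfg` in use. `check_hints` is a hypothetical helper, not part of AUGUSTUS.

```shell
# Write a toy hints file (made-up sequence name and coordinates).
# Columns: seqname, source, feature, start, end, score, strand, frame, attributes.
HINTS="$(mktemp)"
printf 'chr1\tb2h\tintron\t1200\t1850\t0\t+\t.\tsrc=E;pri=4\n'   >  "$HINTS"
printf 'chr1\tb2h\texonpart\t1851\t1990\t0\t+\t.\tsrc=E;pri=4\n' >> "$HINTS"
printf 'chr1\tanchor\tstart\t950\t952\t0\t+\t0\tsrc=M\n'         >> "$HINTS"

# check_hints FILE: fail if a line lacks nine columns, has start > end,
# or carries no src=/source= key in the last column.
check_hints() {
    awk -F '\t' '
        /^#/ || NF == 0            { next }    # skip comments and blank lines
        NF != 9                    { printf "line %d: expected 9 columns, got %d\n", NR, NF; bad = 1; next }
        $4 + 0 > $5 + 0            { printf "line %d: start > end\n", NR; bad = 1 }
        $9 !~ /(^|;)(src|source)=/ { printf "line %d: no src= attribute\n", NR; bad = 1 }
        END                        { exit (bad ? 1 : 0) }
    ' "$1"
}

check_hints "$HINTS" && echo "hints file looks well-formed"
```

A check like this only validates the layout. Whether a hint influences the prediction still depends on the bonus and malus settings for its source.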