Cufflinks

Download

Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.

Cufflinks is a collaborative effort between the Laboratory for Mathematical and Computational Biology, led by Lior Pachter at UC Berkeley, Steven Salzberg's computational genomics group at the Institute of Genetic Medicine at Johns Hopkins University, and Barbara Wold's lab at Caltech.

Cufflinks is provided under the OSI-approved Boost License

==========================================================

2.1.1 release - 4/11/2013

This release fixes a bug introduced with version 2.1.0, which for some users caused Cufflinks to output empty transcript GTF files.


2.1.0 release - 4/10/2013

This release substantially improves the accuracy, speed, and memory footprint of Cufflinks and Cuffdiff. It is recommended for all users. Those who wish to see the impact of the accuracy improvements should look at the new benchmarking section. In addition to numerous bugfixes, the main changes are as follows:


New google group for TopHat, Cufflinks, and CummeRbund users - 4/11/2013

To better handle the many questions we receive from users, we have launched a new google group for discussion and questions about our tools. We are simply unable to handle the email we receive, and our hope is that the users of of the group will help each other. While we will continue to answer email sent to tophat.cufflinks@gmail.com as we have time, we will also answer questions posted to the group, and the latter questions will be prioritized. Please post questions or comments to the group (if your question hasn't already been asked and answered) before emailing tophat.cufflinks@gmail.com.


TransDecoder: calling novel ORFs in RNA-Seq assemblies- 4/11/2013

Our friend Brian Haas (of Trinity fame) has written a tool that might be helpful to Cufflinks users wishing to discover new coding genes and splice variants. TransDecoder takes as input an assembly of RNA-Seq reads from Trinity, Cufflinks, and other tools and produces predictions about likely open reading frames as output. http://transdecoder.sourceforge.net/


Cuffdiff 2 manuscript published at Nature Biotechnology - 12/11/2012

We are happy to announce a new manuscript describing Cuffdiff 2 has appeared in print at Nature Biotechnology. The manuscript contains performance assessments of Cuffdiff 2 and a detailed comparison of our methods against other tools for differential analysis.


2.0.2 release - 7/8/2012

This release fixes several bugs:


2.0.1 release - 6/15/2012

This release addresses several bugs, most of which were introduced as part of the 2.0.0 release:


Illumina iGenomes now include both Bowtie1 and Bowtie2 indexes - 5/23/2012


2.0.0 release - 5/4/2012

This release substantially improves the accuracy and robustness of differential analysis with Cuffdiff. The update also resolves several user-reported issues and bugs, and several requested features. Due to the large number of enchancements and fixes, users are encouraged to treat this as a beta release. A manuscript describing the algorithmic improvements to the software is in preparation. Changes include:


TopHat and Cufflinks protocol published at Nature Protocols - 3/12/2012

A complete bioinformatic protocol for analysis of RNA-Seq data using our tools has been published at Nature Protocols. The protocol covers read alignment with TopHat, gene and transcript discovery with Cufflinks, annotation analysis with Cuffmerge and Cuffcompare, differential expression analysis with Cuffdiff, and visualization with CummeRbund. Several variants of the protocol are included for those who wish to forgo certain analysis steps, such as gene discovery.


1.3.0 release - 1/2/2012

This release improves the accuracy of Cuffdiff's isoform switching tests and fixes several bugs.


1.2.0 release - 11/23/2011

This release fixes a number of bugs and includes some signficant accuracy and performance improvements:


CummeRbund released - 11/21/2011

Extracting biological insight from transcript-level RNA-Seq analysis can be very challenging. Due to the volume and complexity of output from RNA-Seq analysis pipelines, many users may choose to focus only on gene-level results, and thus miss crucial biological insights that a transcript-level analysis can provide. We are happy to present CummeRbund, an R/Bioconductor package that simplifies the organization, access, exploration, and visualization of the various output files of a Cuffdiff differential expression analysis. CummeRbund begins by re-organizing the Cuffdiff output files, and storing these data in a local SQLite database. During this process, CummeRbund indexes the data to speed up access to specific feature data (genes, isoforms, TSS, CDS, etc.), and preserves the various relationships between these features. Access to data elements is done through R via the RSQLite package and data are presented in appropriately structured R classes with various convenience functions designed to streamline your workflow.

CummeRbund simplifies the way in which you access and analyze your RNA-Seq data. Features include:

CummeRbund is being made freely available under the OSI approved Artistic License 2.0 and can be downloaded from http://compbio.mit.edu/cummeRbund/. CummeRbund has also been included as part of the new R/Bioconductor v2.9 release and can be installed in a similar manner to standard Bioconductor packages.

1.1.0 release - 9/8/2011

This is a fix release to address several issues reported by our users:

FILE FORMAT CHANGES:


iGenomes index and annotation packages available for download - 7/31/2011

Illumina has generously provided a set of freely downloadable packages that contain everything you need to get started working with TopHat and Cufflinks. These packages contain Bowtie indexes for the human, mouse, and fly genomes as well as many others. The packages also contain annotation files (in GTF format) from UCSC, Ensembl, NCBI, and other sources. These files are augmented with the special attributes Cufflinks needs to perform differential splicing and promoter analysis. We strongly encourage users to download and try these packages!


Cufflinks RABT assembly paper published - 6/26/2011

Our new paper on the Cufflinks reference annotation based transcript (RABT) assembly method introduced in v1.0.0 has appeared in Bioinformatics. The paper describes how the RABT assembler builds upon a known reference annotation to better identify novel transcripts. You can try the RABT assembler in Cufflinks by using the -g option as explained in the manual.

Please cite this paper as well as the original Cufflinks paper if using the RABT assembler in your work.


1.0.3 release - 6/1/2011

This is a fix release to address several issues reported by our users:


1.0.2 release - 5/22/2011

This release fixes several bugs and adds a few enhancements:


1.0.1 release - 5/6/2011

This fix release corrects several issues introduced with 1.0.0:


1.0.0 release - 5/5/2011

This release represents a huge leap for Cufflinks in terms of performance and features. It is highly recommended that all users upgrade to this version of Cufflinks. Updates and improvements include:


Cufflinks Bias Correction paper published - 3/16/2011

Our new paper on the Cufflinks bias correction method introduced in v0.9.0 has appeared in Genome Biology. The paper describes the details of the method and provides validations for the improvements in expression estimates it produces. As a reminder, bias correction is only activated when a reference sequence (fasta) is supplied with the -r option.

Please cite this paper if using Cufflinks bias correction in your work.


New "Getting Help" email address for TopHat and Cufflinks

In order to more effectively answer user help requests and improve usability and documentation, we have created an email address to which users can send messages for technical support. If you have questions about Cufflinks or TopHat, please send them to tophat.cufflinks@gmail.com. We will do our best to answer your question in a timely fashion, although please read the manual carefully before sending your email. We have very limited time to answer questions, and most questions require careful, technical answers.

If you believe you have found a bug in the software, please include a small package of test data with your email so that we can reproduce your problem locally. A test example makes it much easier to correct the issue.


TopHat and Cufflinks now supported through Galaxy

We are very pleased to announce that you can now run TopHat and Cufflinks through Galaxy. The Galaxy project aims to make informatics tools accessible through the web, and allows you to experiment with parameter settings and create sophisticated analysis workflows easily. Galaxy is developed by researchers at Emory University and Penn State in the Taylor and Nekrutenko labs, respectively. We are extremely grateful for the Galaxy team's work, and proud to have TopHat and Cufflinks offered through their platform.


0.9.3 release - 11/30/2010

This release fixes several issues that affect abundance estimation and differential expression accuracy, and is strongly recommended for all users. Additionally, there are some speed and threading improvements during bias modeling and correction. Finally, a bug which causes Cufflinks to crash on some BAM files has been fixed.


0.9.2 release - 10/26/2010

This release modifies the way library types are handled to add support for more strand-specific protocols as well as adaptability to future protocols that may be introduced. Note that the way library-types are specified in the input has changed as a result. See the Manual for more details.

Some portability issues, which resulted in segfaults on some systems, have been fixed in the precompiled binaries.


0.9.1 release - 10/3/2010

This release includes two bug fixes and some enhancements for strand-specific RNA-Seq libraries:


Update - fix for 0.9.0

The binaries and source have been updated to address a floating point exception when using a SAM file with no header and without a GTF


New developer - Adam Roberts

The Cufflinks team has been joined by Adam Roberts, a Ph.D. student from the UC Berkeley Department of Computer Science. Adam has made many improvements to Cufflinks (see the announcement below), and we are very fortunate to benefit from his talents and expertise.


0.9.0 release - 9/27/2010

This release includes significant bug fixes as well as some major new features. Enhancements include:

Note that building Cufflinks from source now requires the SAM tools. See the installation instructions for more details.


0.8.3 build update - 7/2/2010

The 0.8.3 release on 6/30 included a broken version of Cuffdiff. This and a few other minor issues have been resolved and the builds and source have been updated.


0.8.3 release - 6/30/2010

Fixes and accuracy improvements to the Cufflinks assembler. This update is strongly recommended for users interested in identifying novel transcripts. Other changes:


Cufflinks paper published - 5/2/2010

Our paper on Cufflinks has appeared at Nature Biotechnology. In a study of developing mouse muscle cells, we identified thousands of new transcripts and detected differential splicing and promoter use in hundreds of genes by examining changes in isoform abundance levels. The paper is aimed at a general audience, so much of the algorithmic and mathematical detail can be found in the supplemental methods (WARNING: 2 MB PDF).


0.8.2 release - 3/26/2010

Numerous bug fixes and significant performance improvements in Cufflinks and Cuffdiff. Note that some the formats of some files have changed. Other changes:


0.8.1 release - 2/13/2010

This is a minor fix release. Cufflinks 0.8.0 had some lingering references to "RPKM", which have been replaced with "FPKM". Reported expression values have always been in FPKM, this is simply a nomenclature change to better reflect what RNA-Seq actually measures. See the section on FPKM in the description of how Cufflinks works


0.8.0 release - 2/5/2010

We are happy to announce a major update to Cufflinks that introduces some powerful new features and includes a number of performance improvement and bug fixes. Highlights include:


0.7.0 release - 9/26/2009

The first public release of Cufflinks is now available for download. Cufflinks is a program for the comparative assembly of transcripts and the estimation of their abundances in an RNA-Seq experiment. It runs on Linux and OS X. It also comes with a tool to track transcripts across multiple samples, for example in a time course of RNA-Seq.

Cufflinks takes as input a file of alignments in SAM format, and reports transfrags in GTF format. You can use TopHat to align your reads, as TopHat reports its alignments in SAM format.

This software is a work in progress - it is a beta release, and new features will continue to be added over the next couple of weeks. To suggest a feature or report a bug, please email Cole Trapnell

=======================================================================

Getting started




Install quick-start


Installing a pre-compiled binary release

In order to make it easy to install Cufflinks, we provide a few binary packages to save users from occasionally frustrating process of building Cufflinks, which requires that you install the Boost libraries. To use the binary packages, simply download the appropriate one for your machine, untar it, and make sure the cufflinks,cuffdiff and cuffcompare binaries are in a directory in your PATH environment variable.


Building Cufflinks from source

In order to build Cufflinks, you must have the Boost C++ libraries (version 1.47 or higher) installed on your system. See below for instructions on installing Boost.


Installing Boost


  1. Download Boost and the bjam build engine.
  2. Unpack bjam and add it to your PATH.
  3. Unpack the Boost tarball and cd to the Boost source directory. This directory is called the BOOST_ROOT in some Boost installation instructions.
  4. Build Boost. Note that you can specify where to put Boost with the --prefix option. The default Boost installation directory is /usr/local. Take note of the boost installation directory, because you will need to tell the Cufflinks installer where to find Boost later on.

    If you are on Mac OS X, type (all on one line):

    bjam --prefix=<YOUR_BOOST_INSTALL_DIRECTORY> --toolset=darwin architecture=x86 address_model=32_64 link=static runtime-link=static --layout=versioned stage install
  5. If you are on a 32-bit Linux system, type (all on one line):

    bjam --prefix=<YOUR_BOOST_INSTALL_DIRECTORY> --toolset=gcc architecture=x86 address_model=32 link=static runtime-link=static stage install

    If you are on a 64-bit Linux system, type (all on one line):

    bjam --prefix=<YOUR_BOOST_INSTALL_DIRECTORY> --toolset=gcc architecture=x86 address_model=64 link=static runtime-link=static stage install

Installing the SAM tools


  1. Download the SAM tools
  2. Unpack the SAM tools tarball and cd to the SAM tools source directory.
  3. Build the SAM tools by typing make at the command line.
  4. Choose a directory into which you wish to copy the SAM tools binary, the included library libbam.a, and the library headers. A common choice is /usr/local/.
  5. Copy libbam.a to the lib/ directory in the folder you've chosen above (e.g. /usr/local/lib/)
  6. Create a directory called "bam" in the include/ directory (e.g. /usr/local/include/bam)
  7. Copy the headers (files ending in .h) to the include/bam directory you've created above (e.g. /usr/local/include/bam)
  8. Copy the samtools binary to some directory in your PATH.

Installing the Eigen libraries


  1. Download Eigen
  2. Unpack the Eigen tarball and cd to the Eigen source directory.
  3. Copy the Eigen/ subdirectory someplace on your system where you keep header files (e.g. /usr/local/include)

Building Cufflinks


  1. Unpack the Cufflinks source tarball:
    tar zxvf cufflinks-0.7.0.tar.gz
  2. Change to the Cufflinks directory:
    cd cufflinks-0.7.0
  3. Configure Cufflinks. If Boost is installed somewhere other than /usr/local, you will need to tell the installer where to find it using the --with-boost option. Specify where to install Cufflinks using the --prefix option.
    ./configure --prefix=/path/to/cufflinks/install --with-boost=/path/to/boost --with-eigen=/path/to/eigen
    If you see any errors during configuration, verify that you are using Boost version 1.47 or higher, and that the directory you specified via --with-boost contains the boost header files and libraries. See the Boost Getting started page for more details. If you copied the SAM tools binaries to someplace other than /usr/local/, you may need to supply the --with-bam configuration option.
  4. Finally, make and install Cufflinks.
    make
    make install

Testing the installation


  1. Download the test data
  2. In the directory where you placed the test file, type:
    cufflinks ./test_data.sam

    You should see the following output:
    [bam_header_read] EOF marker is absent. The input is probably truncated.  [bam_header_read] invalid BAM binary header (this is not a BAM file).  File ./test_data.sam doesn't appear to be a valid BAM file, trying SAM...  [13:23:15] Inspecting reads and determining fragment length distribution.  > Processed 1 loci.                            [*************************] 100%  Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.    It is recommended that correct paramaters (--frag-len-mean and --frag-len-std-dev) be provided.  > Map Properties:  >       Total Map Mass: 102.50  >       Read Type: 75bp x 75bp  >       Fragment Length Distribution: Truncated Gaussian (default)  >                     Estimated Mean: 200  >                  Estimated Std Dev: 80  [13:23:15] Assembling transcripts and estimating abundances.  > Processed 1 loci.                            [*************************] 100%  
  3. Verify that the file transcripts.gtf is in the current directory and looks like this (your file will have GTF attributes, omitted here for clarity)
    test_chromosome Cufflinks       exon    53      250     1000    +       .   test_chromosome Cufflinks       exon    351     400     1000    +       .   test_chromosome Cufflinks       exon    501     550     1000    +       .  					  				

Common uses of the Cufflinks package


Discovering novel genes and transcripts


RNA-Seq is a powerful technology for gene and splice variant discovery. You can use Cufflinks to help annotate a new genome or find new genes and splice isoforms of known genes in even well-annotated genomes. Annotating genomes is a complex and difficult process, but we outline a basic workflow that should get you started here. The workflow also excludes examples of the commands you'd run to implement each step in the workflow. Suppose we have RNA-Seq reads from human liver, brain, and heart.


  1. Map the reads for each tissue to the reference genome
  2. We recommend that you use TopHat to map your reads to the reference genome. For this example, we'll assume you have paired-end RNA-Seq data. You can map reads as follows:

    tophat -r 50 -o tophat_brain /seqdata/indexes/hg19 brain_1.fq brain_2.fq tophat -r 50 -o tophat_liver /seqdata/indexes/hg19 liver_1.fq liver_2.fq tophat -r 50 -o tophat_heart /seqdata/indexes/hg19 heart_1.fq heart_2.fq

    The commands above are just examples of how to map reads with TopHat. Please see the TopHat manual for more details on RNA-Seq read mapping.


  3. Run Cufflinks on each mapping file
  4. The next step is to assemble each tissue sample independently using Cufflinks. Assemble each tissue like so:

    cufflinks -o cufflinks_brain tophat_brain/accepted_hits.bam
    cufflinks -o cufflinks_liver tophat_liver/accepted_hits.bam
    cufflinks -o cufflinks_heart tophat_liver/accepted_hits.bam
  5. Merge the resulting assemblies
  6. assemblies.txt:
    cufflinks_brain/transcripts.gtf
    cufflinks_liver/transcripts.gtf
    cufflinks_heart/transcripts.gtf
    Now run the merge script:
    cuffmerge -s /seqdata/fastafiles/hg19/hg19.fa assemblies.txt

    The final, merged annotation will be in the file merged_asm/merged.gtf. At this point, you can use your favorite browser to explore the structure of your genes, or feed this file into downstream informatic analyses, such as a search for orthologs in other organisms. You can also explore your samples with Cuffdiff and identify genes that are significantly differentially expressed between the three conditions. See the workflows below for more details on how to do this.

  7. (optional) Compare the merged assembly with known or annotated genes
  8. If you want to discover new genes in a genome that has been annotated, you can use cuffcompare to sort out what is new in your assembly from what is already known. Run cuffcompare like this:
    cuffcompare -s /seqdata/fastafiles/hg19/hg19.fa -r known_annotation.gtf merged_asm/merged.gtf
    Cuffcompare will produce a number of output files that you can parse to select novel genes and isoforms.

Identifying differentially expressed and regulated genes


There are two workflows you can choose from when looking for differentially expressed and regulated genes using the Cufflinks package. The first workflow is simpler and is a good choice when you aren't looking for novel genes and transcripts. This workflow requires that you not only have a reference genome, but also a reference gene annotation in GFF format (GFF3 or GTF2 formats are accepted, see details here). The second workflow, which includes steps to discover new genes and new splice variants of known genes, is more complex and requires more computing power. The second workflow can use and augment a reference gene annotation GFF if one is available.


Differential analysis without gene and transcript discovery

  1. Map the reads for each condition to the reference genome
  2. We recommend that you use TopHat to map your reads to the reference genome. For this example, we'll assume you have paired-end RNA-Seq data. Suppose you have RNA-Seq from a knockdown experiment where you have two biological replicates of a mock condition as a control and two replicates of your knockdown.

    Note: Cuffdiff will work much better if you map your replicates independently, rather than pooling the replicates from one condition into a single set of reads.

    Note: While an GTF of known transcripts is not strictly required at this stage, providing one will improve alignment sensitivity, and ultimately, the accuracy of Cuffdiff's analysis.

    You can map reads as follows:

    tophat -r 50 -G annotation.gtf -o tophat_mock_rep1 /seqdata/indexes/hg19 \
    mock_rep1_1.fq mock_rep1_2.fq
    tophat -r 50 -G annotation.gtf -o tophat_mock_rep2 /seqdata/indexes/hg19 \
    mock_rep2_1.fq mock_rep2_2.fq
    tophat -r 50 -G annotation.gtf -o tophat_knockdown_rep1 /seqdata/indexes/hg19 \
    knockdown_rep1_1.fq knockdown_rep1_2.fq
    tophat -r 50 -G annotation.gtf -o tophat_knockdown_rep2 /seqdata/indexes/hg19 \
    knockdown_rep2_1.fq knockdown_rep2_2.fq
  3. Run Cuffdiff
  4. Take the annotated transcripts for your genome (as GFF or GTF) and provide them to cuffdiff along with the BAM files from TopHat for each replicate:
    cuffdiff annotation.gtf mock_rep1.bam,mock_rep2.bam \
    knockdown_rep1.bam,knockdown_rep2.bam

Differential analysis with gene and transcript discovery

  1. Complete steps 1-3 in "Discovering novel genes and transcripts", above
  2. Follow the protocol for gene and transcript discovery listed above. Be sure to provide TopHat and the assembly merging script with an reference annotation if one is available for your organism, to ensure the highest possible quality of differential expression analysis.

  3. Run Cuffdiff
  4. Take the merged assembly from produced in step 3 of the discovery protocol and provide it to cuffdiff along with the BAM files from TopHat:
    cuffdiff merged_asm/merged.gtf liver1.bam,liver2.bam brain1.bam,brain2.bam
    As shown above, replicate BAM files for each conditions must be given as a comma separated list. If you put spaces between replicate files instead of commas, cuffdiff will treat them as independent conditions.

====================================================================

Please Note If you have questions about how to use Cufflinks or would like more information about the software, please email tophat.cufflinks@gmail.com, though we ask you to have a look at the paper and the supplemental methods first, as your question be answered there.

Manual



Prerequisites


Cufflinks runs on intel-based computers running Linux or Mac OS X and that have GCC 4.0 or greater installed. You can install pre-compiled binaries or build Cufflinks from the source code. If you wish to build Cufflinks yourself, you will need to install the Boost C++ libraries. See Installing Boost, on the getting started page. You will also need to build and install the SAM tools, but you should take a look at the getting started page for detailed instructions, because the headers and libbam must be accessible to the Cufflinks build scripts.



Run cufflinks from the command line as follows:
Usage: cufflinks [options]* <aligned_reads.(sam/bam)>

The following is a detailed description of the options used to control Cufflinks:


Arguments:
<aligned_reads.(sam/bam)> A file of RNA-Seq read alignments in the SAM format. SAM is a standard short read alignment, that allows aligners to attach custom tags to individual alignments, and Cufflinks requires that the alignments you supply have some of these tags. Please see Input formats for more details.
General Options:
-h/--help Prints the help message and exits
-o/--output-dir <string> Sets the name of the directory in which Cufflinks will write all of its output. The default is "./".
-p/--num-threads <int> Use this many threads to align reads. The default is 1.
-G/--GTF <reference_annotation.(gtf/gff)> Tells Cufflinks to use the supplied reference annotation (a GFF file) to estimate isoform expression. It will not assemble novel transcripts, and the program will ignore alignments not structurally compatible with any reference transcript.
-g/--GTF-guide <reference_annotation.(gtf/gff)> Tells Cufflinks to use the supplied reference annotation (GFF) to guide RABT assembly. Reference transcripts will be tiled with faux-reads to provide additional information in assembly. Output will include all reference transcripts as well as any novel genes and isoforms that are assembled.
-M/--mask-file <mask.(gtf/gff)> Tells Cufflinks to ignore all reads that could have come from transcripts in this GTF file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates.
-b/--frag-bias-correct <genome.fa> Providing Cufflinks with a multifasta file via this option instructs it to run our new bias detection and correction algorithm which can significantly improve accuracy of transcript abundance estimates. See How Cufflinks Works for more details.
-u/--multi-read-correct Tells Cufflinks to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome. See How Cufflinks Works for more details.
--library-type See Library Types
--library-norm-method See Library Normalization Methods
Advanced Abundance Estimation Options:
-m/--frag-len-mean <int> This is the expected (mean) fragment length. The default is 200bp.
Note: Cufflinks now learns the fragment length mean for each SAM file, so using this option is no longer recommended with paired-end reads.
-s/--frag-len-std-dev <int> The standard deviation for the distribution on fragment lengths. The default is 80bp.
Note: Cufflinks now learns the fragment length standard deviation for each SAM file, so using this option is no longer recommended with paired-end reads.
-N/--upper-quartile-norm With this option, Cufflinks normalizes by the upper quartile of the number of fragments mapping to individual loci instead of the total number of sequenced fragments. This can improve robustness of differential expression calls for less abundant genes and transcripts.
--total-hits-norm With this option, Cufflinks counts all fragments, including those not compatible with any reference transcript, towards the number of mapped hits used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is active by default.
--compatible-hits-norm With this option, Cufflinks counts only those fragments compatible with some reference transcript towards the number of mapped hits used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is inactive by default, and can only be used in combination with --GTF. Use with either RABT or ab initio assembly is not supported
--max-mle-iterations <int> Sets the number of iterations allowed during maximum likelihood estimation of abundances. Default: 5000
--max-bundle-frags <int> Sets the maximum number of fragments a locus may have before being skipped. Skipped loci are listed in skipped.gtf. Default: 1000000
--no-effective-length-correction Cufflinks will not employ its "effective" length normalization to transcript FPKM.
--no-length-correction Cufflinks will not normalize fragment counts by transcript length at all. Use this option when fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Experimental option, use with caution.
Advanced Assembly Options:
-L/--label Cufflinks will report transfrags in GTF format, with a prefix given by this option. The default prefix is "CUFF".
-F/--min-isoform-fraction <0.0-1.0> After calculating isoform abundance for a gene, Cufflinks filters out transcripts that it believes are very low abundance, because isoforms expressed at extremely low levels often cannot reliably be assembled, and may even be artifacts of incompletely spliced precursors of processed transcripts. This parameter is also used to filter out introns that have far fewer spliced alignments supporting them. The default is 0.1, or 10% of the most abundant isoform (the major isoform) of the gene.
-j/--pre-mrna-fraction <0.0-1.0> Some RNA-Seq protocols produce a significant amount of reads that originate from incompletely spliced transcripts, and these reads can confound the assembly of fully spliced mRNAs. Cufflinks uses this parameter to filter out alignments that lie within the intronic intervals implied by the spliced alignments. The minimum depth of coverage in the intronic region covered by the alignment is divided by the number of spliced reads, and if the result is lower than this parameter value, the intronic alignments are ignored. The default is 15%.
-I/--max-intron-length <int> The maximum intron length. Cufflinks will not report transcripts with introns longer than this, and will ignore SAM alignments with REF_SKIP CIGAR operations longer than this. The default is 300,000.
-a/--junc-alpha <0.0-1.0> The alpha value for the binomial test used during false positive spliced alignment filtration. Default: 0.001
-A/--small-anchor-fraction <0.0-1.0> Spliced reads with less than this percent of their length on each side of the junction are considered suspicious and are candidates for filtering prior to assembly. Default: 0.09.
--min-frags-per-transfrag <int> Assembled transfrags supported by fewer than this many aligned RNA-Seq fragments are not reported. Default: 10.
--overhang-tolerance <int> The number of bp allowed to enter the intron of a transcript when determining if a read or another transcript is mappable to/compatible with it. The default is 8 bp based on the default bowtie/TopHat parameters.
--max-bundle-length <int> Maximum genomic length allowed for a given bundle. The default is 3,500,000 bp.
--min-intron-length <int> Minimum intron size allowed in genome. The default is 50 bp.
--trim-3-avgcov-thresh <int> Minimum average coverage required to attempt 3' trimming. The default is 10.
--trim-3-dropoff-frac <int> The fraction of average coverage below which to trim the 3' end of an assembled transcript. The default is 0.1.
--max-multiread-fraction <0.0-1.0> The fraction a transfrag's supporting reads that may be multiply mapped to the genome. A transcript composed of more than this fraction will not be reported by the assembler. Default: 0.75 (75% multireads or more is suppressed).
--overlap-radius <int> Transfrags that are separated by less than this distance get merged together, and the gap is filled. Default: 50 (in bp).
Advanced Reference Annotation Based Transcript (RABT) Assembly Options:
These options have an affect only when used in conjuction with -g/--GTF-guide.
--3-overhang-tolerance <int> The number of bp allowed to overhang the 3' end of a reference transcript when determining if an assembled transcript should be merged with it (ie, the assembled transcript is not novel). The default is 600 bp.
--intron-overhang-tolerance <int> The number of bp allowed to enter the intron of a reference transcript when determining if an assembled transcript should be merged with it (ie, the assembled transcript is not novel). The default is 50 bp.
--no-faux-reads This option disables tiling of the reference transcripts with faux reads. Use this if you only want to use sequencing reads in assembly but do not want to output assembled transcripts that lay within reference transcripts. All reference transcripts in the input annotation will also be included in the output.
Advanced Program Behavior Options:
-v/--verbose Print lots of status updates and other diagnostic information.
-q/--quiet Suppress messages other than serious warnings and errors.
--no-update-check Turns off the automatic routine that contacts the Cufflinks server to check for a more recent version.


Cufflinks takes a text file of SAM alignments, or a binary SAM (BAM) file as input. For more details on the SAM format, see the specification. The RNA-Seq read mapper TopHat produces output in this format, and is recommended for use with Cufflinks. However Cufflinks will accept SAM alignments generated by any read mapper. Here's an example of an alignment Cufflinks will accept:


s6.25mer.txt-913508	16	chr1 4482736 255 14M431N11M * 0 0 \     CAAGATGCTAGGCAAGTCTTGGAAG IIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 XS:A:-  	
Note the use of the custom tag XS. This attribute, which must have a value of "+" or "-", indicates which strand the RNA that produced this read came from. While this tag can be applied to any alignment, including unspliced ones, it must be present for all spliced alignment records (those with a 'N' operation in the CIGAR string).

The SAM file supplied to Cufflinks must be sorted by reference position. If you aligned your reads with TopHat, your alignments will be properly sorted already. If you used another tool, you may want to make sure they are properly sorted as follows:


sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted


Cufflinks produces three output files:


    1) transcripts.gtf

    This GTF file contains Cufflinks' assembled isoforms. The first 7 columns are standard GTF, and the last column contains attributes, some of which are also standardized ("gene_id", and "transcript_id"). There one GTF record per row, and each record represents either a transcript or an exon within a transcript. The columns are defined as follows:

    Column number Column name Example Description
    1 seqname chrX Chromosome or contig name
    2 source Cufflinks The name of the program that generated this file (always 'Cufflinks')
    3 feature exon The type of record (always either "transcript" or "exon".
    4 start 77696957 The leftmost coordinate of this record (where 1 is the leftmost possible coordinate)
    5 end 77712009 The rightmost coordinate of this record, inclusive.
    6 score 77712009 The most abundant isoform for each gene is assigned a score of 1000. Minor isoforms are scored by the ratio (minor FPKM/major FPKM)
    7 strand + Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "."
    7 frame . Cufflinks does not predict where the start and stop codons (if any) are located within each transcript, so this field is not used.
    8 attributes ... See below.
    Each GTF record is decorated with the following attributes:
    Attribute Example Description
    gene_id CUFF.1 Cufflinks gene id
    transcript_id CUFF.1.1 Cufflinks transcript id
    FPKM 101.267 Isoform-level relative abundance in Fragments Per Kilobase of exon model per Million mapped fragments
    frac 0.7647 Reserved. Please ignore, as this attribute may be deprecated in the future
    conf_lo 0.07 Lower bound of the 95% confidence interval of the abundance of this isoform, as a fraction of the isoform abundance. That is, lower bound = FPKM * (1.0 - conf_lo)
    conf_hi 0.1102 Upper bound of the 95% confidence interval of the abundance of this isoform, as a fraction of the isoform abundance. That is, upper bound = FPKM * (1.0 + conf_lo)
    cov 100.765 Estimate for the absolute depth of read coverage across the whole transcript
    full_read_support yes When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data.

    2) isoforms.fpkm_tracking

    This file contains the estimated isoform-level expression values in the generic FPKM Tracking Format. Note, however that as there is only one sample, the "q" format is not used.


    3) genes.fpkm_tracking

    This file contains the estimated gene-level expression values in the generic FPKM Tracking Format. Note, however that as there is only one sample, the "q" format is not used.


Running Cuffcompare


Cufflinks includes a program that you can use to help analyze the transfrags you assemble. The program cuffcompare helps you:

From the command line, run cuffcompare as follows:

cuffcompare [options]* <cuff1.gtf> [cuff2.gtf] ... [cuffN.gtf]

Cuffcompare Input


Cuffcompare takes Cufflinks' GTF output as input, and optionally can take a "reference" annotation (such as from Ensembl)
Arguments:
<cuff1.gtf> A GTF file produced by cufflinks.
Options:
-h Prints the help message and exits
-o <outprefix> All output files created by Cuffcompare will have this prefix (e.g. <outprefix>.loci, <outprefix>.tracking, etc.). If this option is not provided the default output prefix being used is: "cuffcmp"
-r An optional "reference" annotation GFF file. Each sample is matched against this file, and sample isoforms are tagged as overlapping, matching, or novel where appropriate. See the refmap and tmap output file descriptions below.
-R If -r was specified, this option causes cuffcompare to ignore reference transcripts that are not overlapped by any transcript in one of cuff1.gtf,...,cuffN.gtf. Useful for ignoring annotated transcripts that are not present in your RNA-Seq samples and thus adjusting the "sensitivity" calculation in the accuracy report written in the <outprefix> file
-s <seq_dir> Causes cuffcompare to look into for fasta files with the underlying genomic sequences (one file per contig) against which your reads were aligned for some optional classification functions. For example, Cufflinks transcripts consisting mostly of lower-case bases are classified as repeats. Note that <seq_dir> must contain one fasta file per reference chromosome, and each file must be named after the chromosome, and have a .fa or .fasta extension.
-C Enables the "contained" transcripts to be also written in the <outprefix>.combined.gtffile, with the attribute "contained_in" showing the first container transfrag found. By default, without this option, cuffcompare does not write in that file isoforms that were found to be fully contained/covered (with the same compatible intron structure) by other transfrags in the same locus.
-V Cuffcompare is a little more verbose about what it's doing, printing messages to stderr, and it will also show warning messages about any inconsistencies or potential issues found while reading the given GFF file(s).

Cuffcompare Output


Cuffcompare produces the following output files:


    1) <outprefix>.stats

    Cuffcompare reports various statistics related to the "accuracy" of the transcripts in each sample when compared to the reference annotation data. The typical gene finding measures of "sensitivity" and "specificity" (as defined in Burset, M., Guigó, R. : Evaluation of gene structure prediction programs (1996) Genomics, 34 (3), pp. 353-367. doi: 10.1006/geno.1996.0298) are calculated at various levels (nucleotide, exon, intron, transcript, gene) for each input file and reported in this file. The Sn and Sp columns show specificity and sensitivity values at each level, while the fSn and fSp columns are "fuzzy" variants of these same accuracy calculations, allowing for a very small variation in exon boundaries to still be counted as a "match". (If the -o option was not given the default prefix "cuffcmp" is used and these stats will be printed into a file named cuffcmp.stats in the current directory)

    2) <outprefix>.combined.gtf

    Cuffcompare reports a GTF file containing the "union" of all transfrags in each sample. If a transfrag is present in both samples, it is thus reported once in the combined gtf.

    3) <outprefix>.tracking

    This file matches transcripts up between samples. Each row contains a transcript structure that is present in one or more input GTF files. Because the transcripts will generally have different IDs (unless you assembled your RNA-Seq reads against a reference transcriptome), cuffcompare examines the structure of each the transcripts, matching transcripts that agree on the coordinates and order of all of their introns, as well as strand. Matching transcripts are allowed to differ on the length of the first and last exons, since these lengths will naturally vary from sample to sample due to the random nature of sequencing.

    If you ran cuffcompare with the -r option, the first and second columns contain the closest matching reference transcript to the one described by each row.


    Here's an example of a line from the tracking file:

    TCONS_00000045 XLOC_000023 Tcea|uc007afj.1	j	\       q1:exp.115|exp.115.0|100|3.061355|0.350242|0.350207 \       q2:60hr.292|60hr.292.0|100|4.094084|0.000000|0.000000  		
    In this example, a transcript present in the two input files, called exp.115.0 in the first and 60hr.292.0 in the second, doesn't match any reference transcript exactly, but shares exons with uc007afj.1, an isoform of the gene Tcea, as indicated by the class code j. The first three columns are as follows:
    Column number Column name Example Description
    1 Cufflinks transfrag id TCONS_00000045 A unique internal id for the transfrag
    2 Cufflinks locus id XLOC_000023 A unique internal id for the locus
    3 Reference gene id Tcea The gene_name attribute of the reference GTF record for this transcript, or '-' if no reference transcript overlaps this Cufflinks transcript
    4 Reference transcript id uc007afj.1 The transcript_id attribute of the reference GTF record for this transcript, or '-' if no reference transcript overlaps this Cufflinks transcript
    5 Class code c The type of match between the Cufflinks transcripts in column 6 and the reference transcript. See class codes
    Each of the columns after the fifth have the following format:
    qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>|<cov>|<len>

    A transcript need not be present in all samples to be reported in the tracking file. A sample not containing a transcript will have a "-" in its entry in the row for that transcript.

    (The following output files are created for each of the <cuff_in> file given, in the same directories where the <cuff_in> files reside)

    4) <cuff_in>.refmap

    This tab delimited file lists, for each reference transcript, which cufflinks transcripts either fully or partially match it. There is one row per reference transcript, and the columns are as follows:

    Column number Column name Example Description
    1 Reference gene name Myog The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used.
    2 Reference transcript id uc007crl.1 The transcript_id attribute of the reference GTF record for this transcript
    3 Class code c The type of match between the Cufflinks transcripts in column 4 and the reference transcript. One of either 'c' for partial match, or '=' for full match.
    4 Cufflinks matches CUFF.23567.0,CUFF.24689.0 A comma separated list of Cufflinks transcript ids matching the reference transcript

    5) <cuff_in>.tmap

    This tab delimited file lists the most closely matching reference transcript for each Cufflinks transcript. There is one row per Cufflinks transcript, and the columns are as follows:

    Column number Column name Example Description
    1 Reference gene name Myog The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used.
    2 Reference transcript id uc007crl.1 The transcript_id attribute of the reference GTF record for this transcript
    3 Class code c The type of relationship between the Cufflinks transcripts in column 4 and the reference transcript (as described in the Class Codes section below)
    4 Cufflinks gene id CUFF.23567 The Cufflinks internal gene id
    5 Cufflinks transcript id CUFF.23567.0 The Cufflinks internal transcript id
    6 Fraction of major isoform (FMI) 100 The expression of this transcript expressed as a fraction of the major isoform for the gene. Ranges from 1 to 100.
    7 FPKM 1.4567 The expression of this transcript expressed in FPKM
    8 FPKM_conf_lo 0.7778 The lower limit of the 95% FPKM confidence interval
    9 FPKM_conf_hi 1.9776 The upper limit of the 95% FPKM confidence interval
    10 Coverage 3.2687 The estimated average depth of read coverage across the transcript.
    11 Length 1426 The length of the transcript
    12 Major isoform ID CUFF.23567.0 The Cufflinks ID of the gene's major isoform


    Class Codes

    If you ran cuffcompare with the -r option, tracking rows will contain the following values. If you did not use -r, the rows will all contain "-" in their class code column.
    Priority Code Description
    1 = Complete match of intron chain
    2 c Contained
    3 j Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript
    4 e Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment.
    5 i A transfrag falling entirely within a reference intron
    6 o Generic exonic overlap with a reference transcript
    7 p Possible polymerase run-on fragment (within 2Kbases of a reference transcript)
    8 r Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case
    9 u Unknown, intergenic transcript
    10 x Exonic overlap with reference on the opposite strand
    11 s An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors)
    12 . (.tracking file only, indicates multiple classifications)

Merging assemblies with cuffmerge


Cufflinks includes a script called cuffmerge that you can use to merge together several Cufflinks assemblies. It handles also handles running Cuffcompare for you, and automatically filters a number of transfrags that are probably artfifacts. If you have a reference GTF file available, you can provide it to the script in order to gracefully merge novel isoforms and known isoforms and maximize overall assembly quality. The main purpose of this script is to make it easier to make an assembly GTF file suitable for use with Cuffdiff. From the command line, run cuffmerge as follows:

cuffmerge [options]* <assembly_GTF_list.txt>

cuffmerge Input


cuffmerge takes several assembly GTF files from Cufflinks' as input. Input GTF files must be specified in a "manifest" file listing full paths to the files.
Arguments:
<assembly_list.txt> Text file "manifest" with a list (one per line) of GTF files that you'd like to merge together into a single GTF file.
Options:
-h/--help Prints the help message and exits
-o <outprefix> Write the summary stats into the text output file <outprefix>(instead of stdout)
-g/--ref-gtf An optional "reference" annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.
-p/--num-threads <int> Use this many threads to align reads. The default is 1.
-s/--ref-sequence <seq_dir>/<seq_fasta> This argument should point to the genomic DNA sequences for the reference. If a directory, it should contain one fasta file per contig. If a multifasta file, all contigs should be present. The merge script will pass this option to cuffcompare, which will use the sequences to assist in classifying transfrags and excluding artifacts (e.g. repeats). For example, Cufflinks transcripts consisting mostly of lower-case bases are classified as repeats. Note that <seq_dir> must contain one fasta file per reference chromosome, and each file must be named after the chromosome, and have a .fa or .fasta extension.

cuffmerge Output


cuffmerge produces a GTF file that contains an assembly that merges together the input assemblies.


    <outprefix>/merged.gtf



Running Cuffdiff


Cufflinks includes a program, "Cuffdiff", that you can use to find significant changes in transcript expression, splicing, and promoter use. From the command line, run cuffdiff as follows:

cuffdiff [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]>... [sampleN.sam_replicate1.sam[,...,sample2_replicateM.sam]]

Cuffdiff Input


Cuffdiff takes a GTF2/GFF3 file of transcripts as input, along with two or more SAM files containing the fragment alignments for two or more samples. It produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes. It also tracks changes in the relative abundance of transcripts sharing a common transcription start site, and in the relative abundances of the primary transcripts of each gene. Tracking the former allows one to see changes in splicing, and the latter lets one see changes in relative promoter use within a gene. If you have more than one replicate for a sample, supply the SAM files for the sample as a single comma-separated list. It is not necessary to have the same number of replicates for each sample. Cuffdiff requires that transcripts in the input GTF be annotated with certain attributes in order to look for changes in primary transcript expression, splicing, coding output, and promoter use. These attributes are:
Attribute Description
tss_id The ID of this transcript's inferred start site. Determines which primary transcript this processed transcript is believed to come from. Cuffcompare appends this attribute to every transcript reported in the .combined.gtf file.
p_id The ID of the coding sequence this transcript contains. This attribute is attached by Cuffcompare to the .combined.gtf records only when it is run with a reference annotation that include CDS records. Further, differential CDS analysis is only performed when all isoforms of a gene have p_id attributes, because neither Cufflinks nor Cuffcompare attempt to assign an open reading frame to transcripts.

Arguments:
<transcripts.(gtf/gff)> A transcript annotation file produced by cufflinks, cuffcompare, or other source.
<sample1.sam> A SAM file of aligned RNA-Seq reads. If more than two are provided, Cuffdiff tests for differential expression and regulation between all pairs of samples.
Options:
-h/--help Prints the help message and exits
-o/--output-dir <string> Sets the name of the directory in which Cuffdiff will write all of its output. The default is "./".
-L/--labels <label1,label2,...,labelN> Specify a label for each sample, which will be included in various output files produced by Cuffdiff.
-p/--num-threads <int> Use this many threads to align reads. The default is 1.
-T/--time-series Instructs Cuffdiff to analyze the provided samples as a time series, rather than testing for differences between all pairs of samples. Samples should be provided in increasing time order at the command line (e.g first time point SAM, second timepoint SAM, etc.)
--total-hits-norm With this option, Cufflinks counts all fragments, including those not compatible with any reference transcript, towards the number of mapped fragments used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is inactive by default.
--compatible-hits-norm With this option, Cufflinks counts only those fragments compatible with some reference transcript towards the number of mapped fragments used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. Using this mode is generally recommended in Cuffdiff to reduce certain types of bias caused by differential amounts of ribosomal reads which can create the impression of falsely differentially expressed genes. It is active by default.
-b/--frag-bias-correct <genome.fa> Providing Cufflinks with the multifasta file your reads were mapped to via this option instructs it to run our bias detection and correction algorithm which can significantly improve accuracy of transcript abundance estimates. See How Cufflinks Works for more details.
-u/--multi-read-correct Tells Cufflinks to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome. See How Cufflinks Works for more details.
-c/--min-alignment-count <int> The minimum number of alignments in a locus for needed to conduct significance testing on changes in that locus observed between samples. If no testing is performed, changes in the locus are deemed not signficant, and the locus' observed changes don't contribute to correction for multiple testing. The default is 10 fragment alignments.
-M/--mask-file <mask.(gtf/gff)> Tells Cuffdiff to ignore all reads that could have come from transcripts in this GTF file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates.
--FDR <float> The allowed false discovery rate. The default is 0.05.
--library-type See Library Types
--library-norm-method See Library Normalization Methods
--dispersion-method See Cross-replicate dispersion estimation methods
Advanced Options:
-m/--frag-len-mean <int> This is the expected (mean) fragment length. The default is 200bp.
Note: Cufflinks now learns the fragment length mean for each SAM file, so using this option is no longer recommended with paired-end reads.
-s/--frag-len-std-dev <int> The standard deviation for the distribution on fragment lengths. The default is 80bp.
Note: Cufflinks now learns the fragment length standard deviation for each SAM file, so using this option is no longer recommended with paired-end reads.
--num-importance-samples <int> Deprecated
--max-mle-iterations <int> Sets the number of iterations allowed during maximum likelihood estimation of abundances. Default: 5000
-v/--verbose Print lots of status updates and other diagnostic information.
-q/--quiet Suppress messages other than serious warnings and errors.
--no-update-check Turns off the automatic routine that contacts the Cufflinks server to check for a more recent version.
--poisson-dispersion Use the Poisson fragment dispersion model instead of learning one in each condition.
--emit-count-tables Cuffdiff will output a file for each condition (called <sample>_counts.txt) containing the fragment counts, fragment count variances, and fitted variance model. For internal debugging only. This option will be removed in a future version of Cuffdiff.
-F/--min-isoform-fraction <0.0-1.0> Cuffdiff will round down to zero the abundance of alternative isoforms quantified at below the specified fraction of the major isoforms. This is done after MLE estimation but before MAP estimation to improve robustness of confidence interval generation and differential expression analysis. The default is 1e-5, and we recommend you not alter this parameter.
--max-bundle-frags <int> Sets the maximum number of fragments a locus may have before being skipped. Skipped loci are marked with status HIDATA. Default: 1000000
--max-frag-count-draws <int> Cuffdiff will make this many draws from each transcript's predicted negative binomial random numbder generator. Each draw is a number of fragments that will be probabilistically assigned to the transcripts in the transcriptome. Used to estimate the variance-covariance matrix on assigned fragment counts. Default: 100.
--max-frag-assign-draws <int> For each fragment drawn from a transcript, Cuffdiff will assign it this many times (probabilistically), thus estimating the assignment uncertainty for each transcript. Used to estimate the variance-covariance matrix on assigned fragment counts. Default: 50.
-F/--min-outlier-p <0.0-1.0> DEPRECATED
--min-reps-for-js-test <int> Cuffdiff won't test genes for differential regulation unless the conditions in question have at least this many replicates. Default: 3.
--no-effective-length-correction Cuffdiff will not employ its "effective" length normalization to transcript FPKM.
--no-length-correction Cuffdiff will not normalize fragment counts by transcript length at all. Use this option when fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Experimental option, use with caution.

Cuffdiff Output



    1) FPKM tracking files

    Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group. The results are output in FPKM tracking files in the format described here. There are four FPKM tracking files:

    isoforms.fpkm_tracking Transcript FPKMs
    genes.fpkm_tracking Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene_id
    cds.fpkm_tracking Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id
    tss_groups.fpkm_tracking Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss_id

    2) Count tracking files

    Cuffdiff estimates the number of fragments that originated from each transcript, primary transcript, and gene in each sample. Primary transcript and gene counts are computed by summing the counts of transcripts in each primary transcript group or gene group. The results are output in count tracking files in the format described here. There are four Count tracking files:

    isoforms.count_tracking Transcript counts
    genes.count_tracking Gene counts. Tracks the summed counts of transcripts sharing each gene_id
    cds.count_tracking Coding sequence counts. Tracks the summed counts of transcripts sharing each p_id, independent of tss_id
    tss_groups.count_tracking Primary transcript counts. Tracks the summed counts of transcripts sharing each tss_id

    3) Read group tracking files

    Cuffdiff calculates the expression and fragment count for each transcript, primary transcript, and gene in each replicate. The results are output in per-replicate tracking files in the format described here. There are four read group tracking files:

    isoforms.read_group_tracking Transcript read group tracking
    genes.read_group_tracking Gene read group tracking. Tracks the summed expression and counts of transcripts sharing each gene_id in each replicate
    cds.read_group_tracking Coding sequence FPKMs. Tracks the summed expression and counts of transcripts sharing each p_id, independent of tss_id in each replicate
    tss_groups.read_group_tracking Primary transcript FPKMs. Tracks the summed expression and counts of transcripts sharing each tss_id in each replicate

    4) Differential expression tests

    This tab delimited file lists the results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences. For each pair of samples x and y, four files are created

    isoform_exp.diff Transcript differential FPKM.
    gene_exp.diff Gene differential FPKM. Tests difference sin the summed FPKM of transcripts sharing each gene_id
    tss_group_exp.diff Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id
    cds_exp.diff Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id
    Each of the above files has the following format:
    Column number Column name Example Description
    1 Tested id XLOC_000001 A unique identifier describing the transcipt, gene, primary transcript, or CDS being tested
    2 gene Lypla1 The gene_name(s) or gene_id(s) being tested
    3 locus chr1:4797771-4835363 Genomic coordinates for easy browsing to the genes or transcripts being tested.
    4 sample 1 Liver Label (or number if no labels provided) of the first sample being tested
    5 sample 2 Brain Label (or number if no labels provided) of the second sample being tested
    6 Test status NOTEST Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
    7 FPKMx 8.01089 FPKM of the gene in sample x
    8 FPKMy 8.551545 FPKM of the gene in sample y
    9 log2(FPKMy/FPKMx) 0.06531 The (base 2) log of the fold change y/x
    10 test stat 0.860902 The value of the test statistic used to compute significance of the observed change in FPKM
    11 p value 0.389292 The uncorrected p-value of the test statistic
    12 q value 0.985216 The FDR-adjusted p-value of the test statistic
    13 significant no Can be either "yes" or "no", depending on whether p is greater then the FDR after Benjamini-Hochberg correction for multiple-testing

    5) Differential splicing tests - splicing.diff

    This tab delimited file lists, for each primary transcript, the amount of overloading detected among its isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript. Only primary transcripts from which two or more isoforms are spliced are listed in this file.

    Column number Column name Example Description
    1 Tested id TSS10015 A unique identifier describing the primary transcript being tested.
    2 gene name Rtkn The gene_name or gene_id that the primary transcript being tested belongs to
    3 locus chr6:83087311-83102572 Genomic coordinates for easy browsing to the genes or transcripts being tested.
    4 sample 1 Liver Label (or number if no labels provided) of the first sample being tested
    5 sample 2 Brain Label (or number if no labels provided) of the second sample being tested
    6 Test status OK Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
    7 Reserved 0
    8 Reserved 0
    9 √JS(x,y) 0.22115 The splice overloading of the primary transcript, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the splice variants
    10 test stat 0.22115 The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
    11 p value 0.000174982 The uncorrected p-value of the test statistic.
    12 q value 0.985216 The FDR-adjusted p-value of the test statistic
    13 significant yes Can be either "yes" or "no", depending on whether p is greater then the FDR after Benjamini-Hochberg correction for multiple-testing

    6) Differential coding output - cds.diff

    This tab delimited file lists, for each gene, the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples. Only genes producing two or more distinct CDS (i.e. multi-protein genes) are listed here.

    Column number Column name Example Description
    1 Tested id XLOC_000002-[chr1:5073200-5152501] A unique identifier describing the gene being tested.
    2 gene name Atp6v1h The gene_name or gene_id
    3 locus chr1:5073200-5152501 Genomic coordinates for easy browsing to the genes or transcripts being tested.
    4 sample 1 Liver Label (or number if no labels provided) of the first sample being tested
    5 sample 2 Brain Label (or number if no labels provided) of the second sample being tested
    6 Test status OK Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
    7 Reserved 0
    8 Reserved 0
    9 √JS(x,y) 0.0686517 The CDS overloading of the gene, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the coding sequences
    10 test stat 0.0686517 The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
    11 p value 0.00546783 The uncorrected p-value of the test statistic
    12 q value 0.985216 The FDR-adjusted p-value of the test statistic
    13 significant yes Can be either "yes" or "no", depending on whether p is greater then the FDR after Benjamini-Hochberg correction for multiple-testing

    7) Differential promoter use - promoters.diff

    This tab delimited file lists, for each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples. Only genes producing two or more distinct primary transcripts (i.e. multi-promoter genes) are listed here.

    Column number Column name Example Description
    1 Tested id XLOC_000019 A unique identifier describing the gene being tested.
    2 gene name Tmem70 The gene_name or gene_id
    3 locus chr1:16651657-16668357 Genomic coordinates for easy browsing to the genes or transcripts being tested.
    4 sample 1 Liver Label (or number if no labels provided) of the first sample being tested
    5 sample 2 Brain Label (or number if no labels provided) of the second sample being tested
    6 Test status OK Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
    7 Reserved 0
    8 Reserved 0
    9 √JS(x,y) 0.0124768 The promoter overloading of the gene, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the primary transcripts
    10 test stat 0.0124768 The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
    11 p value 0.394327 The uncorrected p-value of the test statistic
    12 q value 0.985216 The FDR-adjusted p-value of the test statistic
    13 significant no Can be either "yes" or "no", depending on whether p is greater then the FDR after Benjamini-Hochberg correction for multiple-testing

    8) Read group info - read_groups.info

    This tab delimited file lists, for each replicate, key properties used by Cuffdiff during quantification, such as library normalization factors. The read_groups.info file has the following format:

    Column number Column name Example Description
    1 file mCherry_rep_A/accepted_hits.bam BAM or SAM file containing the data for the read group
    2 condition mCherry Condition to which the read group belongs
    3 replicate_num 0 Replicate number of the read group
    4 total_mass 4.72517e+06 Total number of fragments for the read group
    5 norm_mass 4.72517e+06 Fragment normalization constant used during calculation of FPKMs.
    6 internal_scale 1.23916 Internal scaling factor, used to transform replicates of a single condition onto the "internal" common count scale.
    7 external_scale 0.96 External scaling factor, used to transform counts from different conditions onto an internal common count scale.

    8) Run info - run.info

    This tab delimited file lists various bits of information about a Cuffdiff run to help track what options were provided. For example:

    param   value  cmd_line        cuffdiff  base_ref.gtf mCherry/accepted_hits.bam R49/accepted_hits.bam   version 2.0.0  SVN_revision    3258  boost_version   104700  		

FPKM Tracking Files


FPKM tracking files use a generic format to output estimated expression values. Each FPKM tracking file has the following format:
Column number Column name Example Description
1 tracking_id TCONS_00000001 A unique identifier describing the object (gene, transcript, CDS, primary transcript)
2 class_code = The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't present
3 nearest_ref_id NM_008866.1 The reference transcript to which the class code refers, if any
4 gene_id NM_008866 The gene_id(s) associated with the object
5 gene_short_name Lypla1 The gene_short_name(s) associated with the object
6 tss_id TSS1 The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if tss_id isn't present
7 locus chr1:4797771-4835363 Genomic coordinates for easy browsing to the object
8 length 2447 The number of base pairs in the transcript, or '-' if not a transcript/primary transcript
9 coverage 43.4279 Estimate for the absolute depth of read coverage across the object
10 q0_FPKM 8.01089 FPKM of the object in sample 0
11 q0_FPKM_lo 7.03583 the lower bound of the 95% confidence interval on the FPKM of the object in sample 0
12 q0_FPKM_hi 8.98595 the upper bound of the 95% confidence interval on the FPKM of the object in sample 0
13 q0_status OK Quantification status for the object in sample 0. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.
14 q1_FPKM 8.55155 FPKM of the object in sample 1
15 q1_FPKM_lo 7.77692 the lower bound of the 95% confidence interval on the FPKM of the object in sample 0
16 q1_FPKM_hi 9.32617 the upper bound of the 95% confidence interval on the FPKM of the object in sample 1
17 q1_status 9.32617 the upper bound of the 95% confidence interval on the FPKM of the object in sample 1. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.
3N + 12 qN_FPKM 7.34115 FPKM of the object in sample N
3N + 13 qN_FPKM_lo 6.33394 the lower bound of the 95% confidence interval on the FPKM of the object in sample N
3N + 14 qN_FPKM_hi 8.34836 the upper bound of the 95% confidence interval on the FPKM of the object in sample N
3N + 15 qN_status OK Quantification status for the object in sample N. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.

Count Tracking Files


Count tracking files use a generic format to output estimated fragment count values. Each Count tracking file has the following format:
Column number Column name Example Description
1 tracking_id TCONS_00000001 A unique identifier describing the object (gene, transcript, CDS, primary transcript)
2 q0_count 201.334 Estimated (externally scaled) number of fragments generated by the object in sample 0
3 q0_count_variance 5988.24 Estimated variance in the number of fragments generated by the object in sample 0
4 q0_count_uncertainty_var 170.21 Estimated variance in the number of fragments generated by the object in sample 0 due to fragment assignment uncertainty.
5 q0_count_dispersion_var 4905.63 Estimated variance in the number of fragments generated by the object in sample 0 due to cross-replicate variability.
6 q0_status OK Quantification status for the object in sample 0. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.
7 q1_count 201.334 Estimated (externally scaled) number of fragments generated by the object in sample 1
8 q1_count_variance 5988.24 Estimated variance in the number of fragments generated by the object in sample 1
9 q1_count_uncertainty_var 170.21 Estimated variance in the number of fragments generated by the object in sample 1 due to fragment assignment uncertainty.
10 q1_count_dispersion_var 4905.63 Estimated variance in the number of fragments generated by the object in sample 1 due to cross-replicate variability.
11 q1_status OK Quantification status for the object in sample 1. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.
7 qN_count 201.334 Estimated (externally scaled) number of fragments generated by the object in sample N
4N + 6 qN_count_variance 7.34115 Estimated variance in the number of fragments generated by the object in sample N
4N + 7 qN_count_uncertainty_var 6.33394 Estimated variance in the number of fragments generated by the object in sample N due to fragment assignment uncertainty.
4N + 8 qN_count_dispersion_var 8.34836 Estimated variance in the number of fragments generated by the object in sample N due to cross-replicate variability.
4N + 9 qN_status OK Quantification status for the object in sample N. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.

Read Group Tracking Files


Read group tracking files record per-replicate expression and count data. Each Count tracking file has the following format:
Column number Column name Example Description
1 tracking_id TCONS_00000001 A unique identifier describing the object (gene, transcript, CDS, primary transcript)
2 condition Fibroblasts Name of the condition
3 replicate 1 Name of the replicate of the condition
4 raw_frags 170.21 The estimate number of (unscaled) fragments originating from the object in this replicate
5 internal_scaled_frags 4905.63 Estimated number of fragments originating from the object, after transforming to the internal common count scale (for comparison between replicates of this condition.)
6 external_scaled_frags 99.21 Estimated number of fragments originating from the object, after transforming to the external common count scale (for comparison between conditions)
7 FPKM 201.334 FPKM of this object in this replicate
8 effective_length 5988.24 The effective length of the object in this replicate. Currently a reserved, unreported field.
9 status OK Quantification status for the object. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.

Library Types


In cases where Cufflinks cannot determine the platform and protocol used to generate input reads, you can supply this information manually, which will allow Cufflinks to infer source strand information with certain protocols. The available options are listed below. For paired-end data, we currently only support protocols where reads are point towards each other.

Library Type Examples Description
fr-unstranded (default) Standard Illumina Reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand.
fr-firststrand dUTP, NSR, NNSR Same as above except we enforce the rule that the right-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during first strand synthesis is sequenced.
fr-secondstrand Directional Illumina (Ligation), Standard SOLiD Same as above except we enforce the rule that the left-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during second strand synthesis is sequenced.

Please contact tophat.cufflinks@gmail.com to request support for a new protocol.


Library Normalization Methods


You can control how library sizes (i.e. sequencing depths) are normalized in Cufflinks and Cuffdiff. Cuffdiff has several methods that require multiple libraries in order to work. Library normalization methods supported by Cufflinks work on one library at a time.

Normalization Method Supported by Cufflinks Supported by Cuffdiff Description
classic-fpkm Yes Yes Library size factor is set to 1 - no scaling applied to FPKM values or fragment counts. (default for Cufflinks)
geometric No Yes FPKMs and fragment counts are scaled via the median of the geometric means of fragment counts across all libraries, as described in Anders and Huber (Genome Biology, 2010). This policy identical to the one used by DESeq. (default for Cuffdiff)
quartile No Yes FPKMs and fragment counts are scaled via the ratio of the 75 quartile fragment counts to the average 75 quartile value across all libraries.

Cross-replicate dispersion estimation methods


Cuffdiff works by modeling the variance in fragment counts across replicates as a function of the mean fragment count across replicates. Strictly speaking, models a quantitity called dispersion - the variance present in a group of samples beyond what is expected from a simple Poisson model of RNA_Seq. You can control how Cuffdiff constructs its model of dispersion in locus fragment counts. Each condition that has replicates can receive its own model, or Cuffdiff can use a global model for all conditions. All of these policies are identical to those used by DESeq (Anders and Huber, Genome Biology, 2010)

Dispersion Method Description
pooled Each replicated condition is used to build a model, then these models are averaged to provide a single global model for all conditions in the experiment. (Default)
per-condition Each replicated condition receives its own model. Only available when all conditions have replicates.
blind All samples are treated as replicates of a single global "condition" and used to build one model.
poisson The Poisson model is used, where the variance in fragment count is predicted to equal the mean across replicates. Not recommended.

Which method you choose largely depends on whether you expect variability in each group of samples to be similar. For example, if you are comparing two groups, A and B, where A has low cross-replicate variability and B has high variability, it may be best to choose per-condition. However, if the conditions have similar levels of variability, you might stick with the default, which sometimes provides a more robust model, especially in cases where each group has few replicates. Finally, if you only have a single replicate in each condition, you must use blind, which treats all samples in the experiment as replicates of a single condition. This method works well when you expect the samples to have very few differentially expressed genes. If there are many differentially expressed genes, Cuffdiff will construct an overly conservative model and you may not get any significant calls. In this case, you will need more replicates in your experiment.

============================================================================

How Cufflinks works



What is Cufflinks?


Cufflinks is a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide. Cufflinks runs on Linux and OS X.

Cufflinks is described in our recent paper, and much of the algorithmic and mathematical material is presented in the supplemental methods


How does Cufflinks assemble transcripts?


Cufflinks constructs a parsimonious set of transcripts that "explain" the reads observed in an RNA-Seq experiment. It does so by reducing the comparative assembly problem to a problem in maximum matching in bipartite graphs. In essence, Cufflinks implements a constructive proof of Dilworth's Theorem by constructing a covering relation on the read alignments, and finding a minimum path cover on the directed acyclic graph for the relation.

While Cufflinks works well with unpaired RNA-Seq reads, it is designed with paired reads in mind. The assembly algorithm explicitly handles paired end reads by treating the alignment for a given pair as a single object in the covering relation. The proof of Dilworth's theorem finds a maximum cardinality matching on the bipartite graph of the transitive closure of the DAG. However, there is not necessarily a unique maximum cardinality matching, reflecting the fact that due to the limited size of RNA-Seq cDNA fragments, we may not know with certainty which outcomes of alternative splicing events go together in the same transcripts. Cufflinks tries to find the correct parsimonious set of transcripts by performing a minimum cost maximum matching. The cost of associating splicing events is based on the "percent-spliced-in" score introduced in

This matching is then extended to a minimum cost path cover of the DAG, with each path representing a different transcript.


The algorithm builds on ideas behind the ShoRAH algorithm for haplotype abundance estimation in viral populations, described in:

The assembler also borrows some ideas introduced with the PASA algorithm for annotating genomes from EST and full length mRNA evidence, described in:
Cufflinks is implemented in C++ and makes substantial use of the Boost Libraries as well as the LEMON Graph Library, which was launched by the Egerváry Research Group on Combinatorial Optimization (EGRES).

How does Cufflinks calculate transcript abundances?


In RNA-Seq experiments, cDNA fragments are sequenced and mapped back to genes and ideally, individual transcripts. Properly normalized, the RNA-Seq fragment counts can be used as a measure of relative abundance of transcripts, and Cufflinks measures transcript abundances in Fragments Per Kilobase of exon per Million fragments mapped (FPKM), which is analagous to single-read "RPKM", originally proposed in:



In paired-end RNA-Seq experiments, fragments are sequenced from both ends, providing two reads for each fragment. To estimate isoform-level abundances, one must assign fragments to individual transcripts, which may be difficult because a read may align to multiple isoforms of the same gene. Cufflinks uses a statistical model of paired-end sequencing experiments to derive a likelihood for the abundances of a set of transcripts given a set of fragments. This likelihood function can be shown to have a unique maximum, which Cufflinks finds using a numerical optimization algorithm. The program then multiplies these probabilities to compute the overall likelihood that one would observe the fragments in the experiment, given the proposed abundances on the transcripts. Because Cufflinks' statistical model is linear, the likelihood function has a unique maximum value, and Cufflinks finds it with a numerical optimization algorithm.

Using this statistical method, Cufflinks can estimate the abundances of the isoforms present in the sample, either using a known "reference" annotation, or after an ab-initio assembly of the transcripts using only the reference genome.


How does Cufflinks estimate the fragment length distribution?


The probability distribution on the length of fragments plays an important role in assembly, abundance estimation, and bias correction. The accuracy of this distribution will have a great impact on the accuracy of our results. Because of this, we now attempt to "learn" this distribution from the input data instead of relying on an approximate Gaussian distribution, whenever possible.


How does Cufflinks identify and correct for sequence bias?


Often in RNA-Seq experiments, a sequence-specific bias is introduced during the library preparation that challenges the assumption of uniform coverage. For example, a sequence-specific bias caused by the use of random hexamers was identified in



Because this bias is usually caused by primers used either in PCR or reverse transcription, it appears near the ends of the sequenced fragments. We have developed a method to correct this bias by “learning” what sequences are being selected for (or ignored) in a given experiment, and including these measurements in the abundance estimation.

The first step in the process is to generate an initial abundance estimation without using bias correction. Since different transcripts will have different sequences appear in them, we use this approximate abundance to weight reads by the expression level of the transcript from which they arise. This helps us avoid over-counting sequences that may be common in the mapping data due to high expression rather than bias.

We next revisit each fragment in the alignment file and apply the abundance weighting as we “learn” features of the sequence in a window surrounding the 5’ and 3’ end of the transcript using a graphical model of the statistical dependencies between bases in the window. We keep a separate model for each end of the read since the biases in the first and second strand synthesis of the fragment are not always the same.

Finally, we re-estimate the abundances using a new likelihood function that has been adjusted to take the sequence bias into account, based on the parameters of the graphical model we computed in the previous step. The result is a new set of FPKMs that are less affected by sequence-specific bias.

Note that since it must know which ends of reads are fragment ends, Cufflinks will not bias correct reads mapping to transcripts with unknown strandedness.

The full details of our method can be found in



How does Cufflinks handle multi-mapped reads?


Individual reads will sometimes be mapped to multiple positions in the genome due to sequence repeats and homology. By default, Cufflinks will uniformly divide each multi-mapped read to all of the positions it maps to. In other words, a read mapping to 10 positions will count as 10% of a read at each position.

If multi-mapped read correction is enabled (-u/--multi-read-correct), Cufflinks will improve its estimation in a manner inspired by (but using more information than) the 'rescue' method described in



Cufflinks will first calculate initial abundance estimates for all transcripts using the uniform dividing scheme. Cufflinks will then re-estimate the abundances dividing each multi-mapped read probabalistically based on the initial abundance estimation of the genes it maps to, the inferred fragment length, and fragment bias (if bias correction is enabled).


How does Reference Annotation Based Transcript (RABT) assembly work?


Reference annotation based assembly seeks to build upon available information about the transcriptome of an organism to find novel genes and isoforms. When a reference GTF is provided with the -g/--GTF-guide option, the reference transcripts are tiled with faux-reads that will aid in the assembly of novel isoforms. These faux reads are combined with the sequencing reads and are input into the regular Cufflinks assembler. The assembled transfrags are then compared to the reference transcripts to determine if they are sufficiently different to be considered novel. Those that are labeled novel by our criteria (see Cufflinks options to adjust the parameters) are output along with the transcripts from the annotation.

The use of faux-reads was inspired by the methods of



What is Cuffdiff?


Cuffdiff is a program that uses the Cufflinks transcript quantification engine to calculate gene and transcript expression levels in more than one condition and test them for signficant differences. You can use it to find differentially expressed genes and transcripts, as well as genes that are being differentially regulated at the transcriptional and post-transcriptional level.

Cuffdiff is described in detail in the manuscript below:


How does Cuffdiff 2 test for differentially expressed and regulated genes?


To identify a gene or transcript as DE, Cuffdiff 2 tests the observed log-fold-change in its expression against the null hypothesis of no change (i.e. the true log-fold-change is zero). Because measurement error, technical variability, and cross-replicate biological variability might result in an observed log-fold-change that is nonzero, Cuffdiff assesses significance using a model of variability in the log-fold-change under the null hypothesis. This model is described in detail in detail in Trapnell and Hendrickson et al. Briefly, Cuffdiff two constructs, for each condition, a table that predicts how much variance there is in the number of reads originating from a gene or transcript. The table is keyed by the average reads across replicates, so to look up the variance for a transcript using the table, Cuffdiff estimates how many reads originated from that transcript, and then queries the table to retrieve the variance for that number of reads. Cuffdiff 2 then accounts for read mapping and assignment uncertainty by simulating probabilistic assignment of the reads mapping to a locus to the splice isoforms for that locus. At the end of the estimation procedure, Cuffdiff 2 obtains an estimate of the number of reads that originated from each gene and transcript, along with variances in those estimates. The read counts are reported along with FPKM values and their variances. Change in expression is reported as the log fold change in FPKM, and the FPKM variances allow the program to estimate the variance in the log-fold-change itself. Naturally, a gene that has highly variable expression will have a highly variable log-fold-change between two conditions.


What's changed since the paper?


Numerous small changes, bugfixes, and minor features have appeared since version 2.0.2 of Cuffdiff (which was used for Trapnell and Hendrickson et al). There are also two more substantial changes:

The first change is that Cuffdiff now reports the FPKM for each condition as the average FPKM calculated for each individual replicate. Previous versions pooled the reads from all replicates of a condition together and computed FPKM for each gene and transcript from the pool. The two approaches yield extremely similar values, but the new method is faster and simpler, and will make implementing several planned features much easier.

The second change is that Cuffdiff 2.1 introduced a new testing method that substantially improves performance over previous releases, including the one used for the paper. The modifications made in Cuffdiff 2.1 improve sensitivity in calling differentially expressed (DE) genes and transcripts while maintaining a low false positive rate. They stem from the method used to calculate the variability in the log fold change in expression. In Trapnell et al, Cuffdiff 2 used the “delta method” to estimate the variance of the log fold change estimate for a gene or transcript. This method yields a simple equation that takes as input the mean and variance of the transcript’s expression in two conditions and produces a variance for the log fold change. However, the equation contains no explicit accounting for the number of replicates used to produce those estimates – they are assumed to be perfectly accurate. The equation also assumes that the distribution of log fold changes (after a particular transformation) is approximately normal. For most genes and transcripts, this approximation is a good one. However, for the remaining genes and transcripts, Cuffdiff 2 sometimes failed to detect a signficant change in expression.

The improved version of Cuffdiff 2 more accurately estimates the variance in the log-fold-change using simulated draws from the model of variance in expression for each of the two conditions. Imagine an experiment that has n replicates in condition A and m replicates in condition B. To estimate the distribution of the log-fold-change in expression for a gene G under the null hypothesis, Cuffdiff first draws n times from the distribution of expression of G according to the algorithm’s model of expression. Cuffdiff then takes the average of the n draws to obtain an expression “measurement”. Then, Cuffdiff draws m from the same distribution and again takes their average. Cuffdiff then takes the log ratio of these averages, places this value in a list, and then repeats the procedure until there are thousands of such log-fold-change samples in the list. The software then makes a similar list, this time using the expression model for condition B – the null hypothesis assumes both sets of replicates originate from the same condition, but we do not know whether A or B is the better representative of that condition, so we must draw samples from both and combine them. To calculate a p-value of observing the real log-fold-change under this null model, we simply sort all the samples and count how many of them are more extreme than the log fold change we actually saw in the real data. This number divided by the total number of draws is our estimate for the p-value.

Cuffdiff 2 reports not only genes and transcripts that are significantly DE between conditions, but also groups of transcripts (i.e. the isoforms of a gene) that show significant changes in expression relative to one another. The test for this is similar to what is described in Trapnell et al, but comparably modified along the lines described above for single genes or transcripts. Draws of expression are made for each transcript in a group according to the number of replicates in the experiment. These are averaged, and the shift in relative transcript abundance for the draw is made using the Jensen-Shannon metric. These draws are added to a list and used to calculate p-values for significance of observed shifts in relative abundance under the null hypothesis.

============================================================

Frequently Asked Questions



What's the difference between FPKM and RPKM?


They're almost the same thing. RPKM stands for Reads Per Kilobase of transcript per Million mapped reads. FPKM stands for Fragments Per Kilobase of transcript per Million mapped reads. In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. Paired-end RNA-Seq experiments produce two reads per fragment, but that doesn't necessarily mean that both reads will be mappable. For example, the second read is of poor quality. If we were to count reads rather than fragments, we might double-count some fragments but not others, leading to a skewed expression value. Thus, FPKM is calculated by counting fragments, not reads.


Can Cufflinks handle strand-specific RNA-Seq data?


Yes. If you mapped your reads with TopHat using the strand-specific mode or manually added XS tags, Cufflinks will automatically treat your data as strand-specific. If this is not the case, you should use the --library-type option along with the type of strand-specific protocol that was used to generate your data. More information can be found here.


Can I use Cufflinks with RNA-Seq data from Bacteria?


Sure, with one or two caveats. We don't recommend assembling bacterial transcripts using Cufflinks at first. If you're working on a new bacterial genome, consider using a computational gene finding application such as Glimmer. These tools are very mature and work well. With an annotation, you can use Cufflinks and Cuffdiff to profile expression and look for differences. Also, note that Cufflinks doesn't handle circular genomes in any special way, so make sure your genome and annotation are suitably linearized.


I want to find differentially expressed genes. Can I use Cufflinks in conjunction with count-based differential expression packages?


It's possible, but we strongly advise against this. Current count-based differential expression tools are poorly suited to differential expression analysis in genomes with alternatively spliced genes. The main reason for this is that when a gene has multiple isoforms, a change in the total number of reads or fragments from that gene doesn't always correspond to a change in expression for that gene. Conversely, a gene's expression may change, but the total number of fragments generated by its isoforms may be very similar. In order to detect changes accurately, it's necessary to estimate how many fragments came from each individual splice variant in each sample. Current count-based tools don't do this (to our knowledge - please send us email if you know of one!). Even if they did, fragments that come from parts of genes that are shared by more than one splice variant can't generally assigned to a single isoform, so the fragment counts for each isoform are only estimates, and there is some uncertainty in the counts. Isoforms that are very similar will have a great deal of uncertainty surrounding their fragment counts. This uncertainty needs to be accounted for when testing for differential expression. So while you could use Cufflinks to estimate isoform-level counts, you'd be throwing away Cufflinks' uncertainty, and thus have more confidence in the differences you see than you really should. This will probably lead to many false positives in your analysis. Furthermore, we do not normalize simply by the length to calculate FPKM but an effective length, as explained in our publications. Calculting counts from FPKM by multiplying by the length will give incorrect results. We strongly encourage you to consider using Cuffdiff to find differentially expressed genes and transcripts.


How can I find out how many fragments come from each transcript?


In response to user-demand and support the development of analysis methods downstream of abundance estimation, we have added this functionality in Cuffdiff 2.0. See the count tracking table description in the manual page. Support for count tracking will be coming to Cufflinks soon. However, people who have asked for this almost always want to use Cufflinks in conjunction with count-based differential expression packages, which is not a good idea.


How is Cuffdiff different from other differential expression tools?


To our knowledge, most other differential expression packages work by counting up the number of fragments that originate from each gene in each sample, looking for differences in the count for each gene across conditions, and testing the observed differences for statistical significance. This approach works well when genes have a single isoform and occur only once in the genome (i.e. aren't members of "gene families"). However, when dealing with genes with more than one isoform, it's generally not possible to tell with certainty which isoform generated each fragment. This greatly complicates the problem of looking for differentially expressed genes.

Most differential expression packages count reads at a gene level and look for differences in these counts directly. However, this approach can be very inaccurate for genes that have multiple isoforms, because a change in relative abundance of the isoforms can change overall gene abundance without generating a large change in the number of reads from that gene. Conversely, a change in overall gene counts does not necessarily imply a similar fold-change in gene expression. For example, suppose a gene has two isoforms, one which is twice as long as the other. At equal abundance, the longer one will produce twice as many reads. Consider two RNA-Seq samples, where only the shorter isoform is present in the first sample, and only the longer isoform is present in the second sample. If the abundance of each isoform is the same in the two conditions, the second sample will have twice as many reads from the gene! Thus, simply looking for differences in the overall number of reads from a gene may lead to the incorrect conclusion that it is differentially expessed.

Cuffdiff looks for differentially expressed genes by estimating how many fragments came from each isoform and then converting the counts into isoform expression levels. To find differentially expressed genes, Cuffdiff calculates expression levels in each condition for each gene by adding up the expression levels for each gene's splice isoforms. Cuffdiff tests for differences in each gene's expression level across conditions. See here for more details about how this works.


Can I use Cufflinks to find SNPs or RNA editing sites?


Not at the moment, because Cufflinks doesn't do any base calling of its own. We recommend that you check out the samtools for this purpose.


Does Cufflinks discover or quantify gene fusions or trans-spliced transcripts?


Not yet, but we will likely support this in a future release (we have no idea how far away that release will be).


Do Cufflinks and Cuffdiff support both BAM and SAM?


Yes. If a SAM is supplied, a message will be output that the file is not a valid BAM file. However, Cufflinks will recognize this and treat the file as a SAM. When using a SAM file, you should include a proper header or ensure that the reads are lexicographically by chromosome and then numerically by left position. You can accomplish this sorting with the command sort -k3,3 -k4,4n in.sam > out.sam.


How is Cuffdiff different from other differential expression tools?


Cuffdiff is able to estimate p-values for differential expression of individual transcripts. Like other tools, it uses a negative binomial model estimated from data to obtain variance estimates from which p-values are computed. More details will be available in a forthcoming publication.


What does "NOTEST" mean in Cuffdiff's output?


This status code is used by Cuffdiff to indicate that neither of the conditions contained enough reads in a locus to support a reliable calculation of expression level. Basically, Cuffdiff wasn't confident that it had enough data in that gene. You can control Cuffdiff's behavior in this regard by lowering or raising the -c option.


What does "LOWDATA" mean in Cuffdiff's output?


This status code is used by Cuffdiff to indicate that in both conditions, the gene being tested was either too complex or too shallowly sequenced to support a reliable calculation of abundance. Cuffdiff can often (though not always) tell via some simple linear algebra that it doesn't have enough reads to pick apart expression levels for all of a gene's isoforms, and without these, it can't calculate expression for the gene. When this happens in both conditions, Cuffdiff marks any observed difference in expression as not statistically signficant and sets the status code as LOWDATA. You should treat any such observed changes as highly unreliable.


What does "FAIL" mean in Cuffdiff's output?


This status code is used by Cuffdiff to indicate that in one or both conditions, Cuffdiff encountered a numerical exception while calculating expression levels in the locus. This can happen because of rounding errors and other "facts of life" when doing math on computers. Typically, this occurs in very complex genes with relatively low expression levels. When this happens in one or both conditions being tested, Cuffdiff marks any observed difference in expression as not statistically signficant and sets the status code as FAIL. You should treat any such observed changes as highly unreliable.


Can Cufflinks/Cuffdiff give me FPKM for each exon or splicing event?


For the foreseeable future, no.


How does cufflinks calculate its p values?


Run Cuffdiff (see here for an example) and look at the gene_exp.diff output file.


How do I find differentially expressed genes with Cuffdiff?


See here


I'm looking for a GTF file with annotated transcripts. Do you provide annotation files?


No. Curating these annotation files would be way too much work for us, and organizations like Ensembl and UCSC already do a great job. We suggest you check out either of these for a high quality annotation GTF for your organism.


I'm trying to assemble a sample. Cufflinks is almost done, but it seems to be hanging at "99% complete". What's going on?


Cufflinks spawns threads for each locus to assemble and quantitate the "bundle" of reads in that locus. Some loci may have more reads and more complicated alternative splicing than others, which requires more CPU cycles. These bundles can continue processing long after all others have completed, leading to this behavior. You may be able to decrease the number of such bundles by masking out ribosomal and mitochondrial RNA using the -M/--mask-file option described in the Manual.


Can I use Cuffdiff on time-series data?


Yes. Check out the -T option in Cuffdiff.


How can I visualize by Cufflinks and Cuffdiff results?


We've built CummeRbund for this purpose, which is based on the elegant, powerful ggplot2 library for this. It's written by Hadley Wickham in R and makes plotting very simple.


I'm having trouble installing Cufflinks on my system. Can you help me?


While binaries are provided for easy installation on Mac and Linux, Cufflinks can also be installed from source on numerous systems. If you are using a Mac we recommend MacPorts. For other systems, we encourage you to visit SeqAnswers for discussions about installation issues on various platforms. You can try sending email to our support address, but please be aware that we get a great deal of email, and requests for installation help are generally the lowest priority for us.

=========================================================

Illumina has provided the RNA-Seq user community with a set of genome sequence indexes (including Bowtie indexes) as well as GTF transcript annotation files. These files can be used with TopHat and Cufflinks to quickly perform expression analysis and gene discovery. The annotation files are augmented with the tss_id and p_id GTF attributes that Cufflinks needs to perform differential splicing, CDS output, and promoter user analysis. We recommend that you download your Bowtie indexes and annotation files from this page. More information about Illumina's iGenomes project can be found here.

安装与使用