Identifying novel genes in C. elegans using SAGE tags
© Nesbitt et al; licensee BioMed Central Ltd. 2010
Received: 20 April 2010
Accepted: 10 December 2010
Published: 10 December 2010
Despite extensive efforts devoted to predicting protein-coding genes in genome sequences, many bona fide genes have not been found and many existing gene models are not accurate in all sequenced eukaryote genomes. This situation is partly explained by the fact that gene prediction programs have been developed based on our incomplete understanding of gene feature information such as splicing and promoter characteristics. Additionally, full-length cDNAs of many genes and their isoforms are hard to obtain due to their low level or rare expression. In order to obtain full-length sequences of all protein-coding genes, alternative approaches are required.
In this project, we have developed a method of reconstructing full-length cDNA sequences from short expressed sequence tags, which we call sequence tag-based amplification of cDNA ends (STACE). Expressed tags are used as anchors for retrieving full-length transcripts in two rounds of PCR amplification. We have demonstrated the application of STACE in reconstructing full-length cDNA sequences using expressed tags mined from an array of serial analysis of gene expression (SAGE) libraries of C. elegans. We have successfully applied STACE to recover sequence information for 12 genes, for two of which we found isoforms. STACE was used to successfully recover full-length cDNA sequences for seven of these genes.
The STACE method can be used to effectively reconstruct full-length cDNA sequences of genes that are under-represented in cDNA sequencing projects and have been missed by existing gene prediction methods, but whose existence has been suggested by short sequence tags such as SAGE tags.
The nematode Caenorhabditis elegans, which is a well-established model organism for biomedical research, is the first metazoan whose genome was subject to whole-genome sequencing. Its gene models were first predicted using the gene prediction program Genefinder (P. Green, unpublished). Over the dozen years since the completion of the C. elegans genome sequencing project, the C. elegans gene set has been curated by the C. elegans research community and by WormBase curators [1, 3–5]. However, the C. elegans gene set is still far from complete for the following reasons. First, because Genefinder, like other gene prediction programs, was developed based on an incomplete understanding of gene structures, it suffers from both false positive and false negative predictions. Second, many bona fide genes, especially those of unknown character, have been missed. In WormBase (http://www.wormbase.org), the official database for the biology and genomics of C. elegans, fewer than 40% of the annotated gene models are fully confirmed. All others are either partially supported or not supported at all. Additional gene models have been revealed in transcriptome sequencing [6, 7], suggesting many gene models remain to be discovered. This situation is also true for other species. In the human genome, it has been estimated that the most accurate programs only correctly predict 40% of the annotated genes.
In this project, we explored how to reconstruct full-length gene models for genes that are not correctly represented in the current gene set, using expressed sequence tags obtained in large-scale gene expression projects. In particular, we attempted to reconstruct novel C. elegans gene models using SAGE (serial analysis of gene expression). The SAGE technique was originally developed for profiling gene expression [10, 11]. The expression profiles created with SAGE have a wide range of applications that include therapeutic target identification in cancerous tissues and others of biological and medical importance. Recently, SAGE was applied to probe gene expression in C. elegans by the C. elegans Gene Expression Consortium (http://elegans.bcgsc.bc.ca/home/ge_consortium.html). These SAGE libraries have been fundamental for the success of a variety of research projects [14–19]. While SAGE tags that correspond to existing gene models can be used to evaluate the abundance of gene expression, there are a large number of SAGE tags that do not correspond to existing gene models. These SAGE tags suggest the existence of additional coding exons, splice variants, or novel genes.
Tag based reconstruction of full-length cDNA sequence of novel genes
Expressed sequence tags that cannot be aligned to the C. elegans virtual transcriptome (i.e., cDNA sequences of all annotated transcripts) suggest the existence of yet unannotated genes [13, 21]. We have established a protocol, termed "sequence tag-based amplification of cDNA ends", or STACE, based on the RACE protocol, to identify potential novel genes. The method can be used to amplify full-length cDNA transcripts that have been reverse-transcribed from the mRNA sequence of novel genes. STACE uses three primer hybridization sites. The first site (the 5' site) is a sequence located at the extreme 5' end of the target transcript, the second site (the 3' site) is downstream of the polyadenylation sequence, and the third site (the gene-specific site) corresponds to the genomic span where the uncharacterized tag maps. The amplicons are then cloned, sequenced and mapped to the genome. As such, STACE not only confirms the existence of a novel gene, but also defines the full-length transcript sequence of the yet undefined gene.
In this project, in order to get a primer hybridization site at the extreme 5' end of the RNA transcripts, we took advantage of the trans-splice leader 1 (SL1) in C. elegans, and used its sequence as a primer for our 5' site. It is appropriate to design the 5' primer based on the SL1 sequence because SL1 is trans-spliced to the extreme 5' end of nearly 50% of all C. elegans mRNAs [23, 24]. For applications in which the sample transcriptome does not undergo trans-splicing of this nature, a common oligo anchoring sequence can be ligated to the 5' end of each transcript. An oligo sequence was attached to the polyadenylation tracts of mRNA through reverse transcription with a modified oligo d(T) primer that included a 3' common sequence (5' - CCAGACACTATGCTCATACGACGCAGT(16)VN - 3'). This provided us with a cDNA library containing transcripts that had a usable 3' site. Finally, we chose gene-specific sites by bioinformatically identifying qualified SAGE tags, that is, tags that do not overlap with existing gene models when aligned to the C. elegans genome. For each qualified SAGE tag, a primer was designed and used in conjunction with a primer complementary to the SL1 sequence to amplify the upstream amplicon. A second primer was designed and used in conjunction with the primer complementary to the 3' common sequence (above) to amplify the downstream amplicon. The potential template was amplified, and the amplicon sequences were mapped to the C. elegans genome using BLAT, which is available at WormBase (http://www.wormbase.org).
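The pairing of primers in the two amplification rounds can be illustrated with a minimal Python sketch. The SL1 and universal 3' primer sequences are the ones quoted in the Methods of this article; the example tag, the function names, and the use of the tag's sense strand as the downstream primer are illustrative assumptions (in the protocol the downstream primer is an internal primer designed from the sequenced upstream amplicon).

```python
# Primer sequences quoted in the Methods section of this article.
SL1_PRIMER = "GGTTTAATTACCCAAGTTTGAG"
UNIVERSAL_3P_PRIMER = "CACTATGCTCATACGACGCAGT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def stace_primer_pairs(sage_tag: str) -> dict:
    """Sketch the (forward, reverse) primer pair for each STACE round.

    Upstream round: SL1 primer plus the reverse complement of the tag.
    Downstream round: a sense-strand primer plus the universal 3' primer.
    Assumption: the tag itself stands in for the internal sense primer.
    """
    tag = sage_tag.upper()
    return {
        "upstream": (SL1_PRIMER, reverse_complement(tag)),
        "downstream": (tag, UNIVERSAL_3P_PRIMER),
    }


# Hypothetical tag sequence, not taken from the article's data.
example_tag = "CATGGATTACAGGCTTAACG"
pairs = stace_primer_pairs(example_tag)
for round_name, (forward, reverse) in pairs.items():
    print(round_name, forward, reverse)
```

In practice each candidate primer would still need to pass the primer-quality checks described in the next section before being ordered.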
Computational selection of SAGE tags that suggest novel genes
SAGE tags used in this study were selected from 33 SAGE libraries, which were sequenced from different tissues and developmental stages of C. elegans (http://tock.bcgsc.bc.ca/cgi-bin/sage160). There are altogether 220,770 unique SAGE tags in these libraries. Only SAGE tags that did not overlap with annotated protein-coding genes in the WS160 version of the C. elegans genome map were selected for this project.
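The non-overlap criterion amounts to an interval comparison between mapped tag positions and annotated gene spans. The following Python sketch shows one way to express it; the coordinates, the `Interval` type and the function names are hypothetical and are not taken from the WS160 annotation or from the authors' pipeline.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def tags_outside_genes(tags: List[Interval], genes: List[Interval]) -> List[Interval]:
    """Keep only tags that overlap none of the annotated gene spans."""
    return [t for t in tags if not any(overlaps(t, g) for g in genes)]


# Hypothetical gene spans and tag mappings for illustration only.
genes = [Interval("I", 1000, 5000), Interval("II", 200, 900)]
tags = [
    Interval("I", 4990, 5010),   # straddles the end of a gene: rejected
    Interval("I", 6000, 6020),   # intergenic: kept
    Interval("II", 901, 921),    # starts one base after a gene: kept
]
novel_candidates = tags_outside_genes(tags, genes)
print(novel_candidates)
```

The all-against-all comparison is quadratic and is used here only for clarity; for a tag set of the size reported above, a sorted-interval or interval-tree lookup would be the more practical choice.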
SAGE tag numbers for each set at each step in the identification of high-value candidate SAGE tags. The table rows correspond to the successive filtering steps: tags with frequency count >3; tags absent from gene boundaries and introns; tags with appropriate GC content; tags that can serve as primers; SAGE tag primers tested.
Primers based on SAGE tags were designed to reduce the possibility of secondary structure formation, which would inhibit proper annealing of the primers. In many cases, we trimmed sequences from either end of the SAGE tags to ensure primer quality. SAGE tag sequences that could not be used to guide proper primer design were not used. Primer design was done using the Primer3 program.
Two different cDNA libraries were created: one from a mixed-stage population of C. elegans and another from embryonic animals. In order to maximize the number of successful experiments, candidate SAGE tags were only screened against the developmental library that corresponded to the developmental stage in which the tags were originally observed.
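Two of the filtering steps named above, the frequency cut-off and the GC-content check, are simple enough to sketch in Python. The frequency threshold (count >3) comes from the table; the GC window of 40-60%, the example tags and the function names are assumptions made for illustration, since the article does not state the GC range that was considered appropriate.

```python
from typing import Dict, List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def select_candidate_tags(
    tag_counts: Dict[str, int],
    min_count: int = 4,     # "frequency count >3" from the table
    gc_min: float = 0.40,   # assumed lower bound, not from the article
    gc_max: float = 0.60,   # assumed upper bound, not from the article
) -> List[str]:
    """Return tags that pass both the frequency and the GC-content filter."""
    return [
        tag
        for tag, count in tag_counts.items()
        if count >= min_count and gc_min <= gc_fraction(tag) <= gc_max
    ]


# Hypothetical tags and counts for illustration only.
tag_counts = {
    "CATGACGTACGTTAGC": 10,  # GC = 0.50, count 10: kept
    "CATGAAAATTTTAAAA": 25,  # GC = 0.125: rejected on GC content
    "CATGGCTAGCTAGGAC": 2,   # count 2: rejected on frequency
}
print(select_candidate_tags(tag_counts))
```

Tags that pass such filters would still go through primer design (Primer3 in this study), where secondary-structure and trimming decisions are made.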
New transcripts and novel cDNAs
Result classifications for all sets of tested SAGE tag primers. Table columns include non-protein gene overlap and the number of candidate cDNAs per number of SAGE tag primers tested.
We found that one successful STACE result overlapped with a pseudogene. While this transcript may not be translated, using STACE we have clearly shown that it is processed, with introns removed and a polyadenylation tract added to the 3' end. We also found that one STACE result overlapped with an annotated ncRNA gene. This transcript was also processed, with a previously unknown intron excised and a polyadenylation tract added.
Identified cDNA sequences from Set 3 STACE experiments. Table columns: SAGE tag primer; SAGE tag location; sequence 5' mapping boundary; sequence 3' mapping boundary; GenBank accession number; status (as of WS200). Each row below gives the cDNA name followed by its status.
Full-length cDNA P.1: overlaps F07H5.4 (pseudogene); evidence for extension to annotated exon
Partial cDNA P.2: overlaps with C06C3.10
Full-length cDNA P.3: overlaps with tts-2 (ncRNA); evidence for new intron
Partial cDNA 1.1: evidence of a novel gene
Full-length cDNA 1.2: overlaps with C25F9.11; evidence for new 5' UTR
Overlaps with ZC250.4: evidence for extension to 3' UTR
Partial cDNA 1.4a: overlaps with T01B6.1; evidence for new coding sequence
Partial cDNA 1.4b: overlaps with T01B6.1; evidence for new transcriptional start site
Partial cDNA 2.1a: overlaps with Y46E12BL.4; evidence for new 3' UTR exon
Partial cDNA 2.1b: overlaps with Y46E12BL.4; evidence for new initial coding exon
Full-length cDNA 2.2: overlaps with Y24D9A.1; evidence for extension to 3' UTR
Full-length cDNA 3.1: overlaps with sox-3; evidence for new 3' UTR
Partial cDNA 3.2: evidence of a novel gene
Full-length cDNA 3.3: evidence of a novel gene
We compared novel cDNAs with C. elegans gene models predicted using AUGUSTUS, mGENE, TWINSCAN and FGENESH++, which are available at WormBase. When aligned to the C. elegans genome, all cDNAs detected using STACE overlap to a certain extent with predicted gene models. The novel full-length cDNA 3.3 aligned well with a prediction from TWINSCAN and with a prediction made by FGENESH++. The annotation extension result (full-length cDNA 3.1) was found to overlap with gene predictions from each of the utilized programs. However, a new 3' UTR exon was shown to be part of this gene model, and this exon did not overlap with the predictions made by any of the described programs. Additionally, the P.3 result overlapped with an existing ncRNA gene model. However, the novel intron suggested by this STACE result was not included in the WormBase gene model, although it overlaps with an AUGUSTUS prediction.
We have found that the STACE method can be used to recover accurate full-length gene models. This method is useful for reconstructing gene models for genes that have been missed in cDNA sequencing projects and were missed or mispredicted by gene finders. With the wide application of next-generation sequencing methods in the deep sequencing of transcriptomes, more expressed sequence tags that indicate the presence of novel genes will be uncovered. We expect that these tags will serve as input to the STACE protocol for further novel gene discovery and determination.
cDNA library production
Two C. elegans samples were prepared: a mixed-stage population and an embryonic sample. RNA was extracted from the samples using TRIzol (Invitrogen, SKU# 10296-028). The cDNA libraries used in this project were created with the Superscript III reverse transcriptase kit (Invitrogen, SKU# 18080-085), and the primer used to initiate reverse transcription was a modified oligo d(T) primer (5' - CCAGACACTATGCTCATACGACGCAGT(16)VN - 3') (Invitrogen). The protocol accompanying the kit was followed, and the samples were treated with Ribonuclease H (Invitrogen, SKU# 18021-014).
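The modified oligo d(T) primer ends in the degenerate bases VN, which in IUPAC notation stand for "A, C or G" and "any base" respectively, so the primer is a pool of 12 sequences that anchor at the junction between the transcript body and the poly(A) tract. The Python sketch below enumerates that pool. It assumes that T(16) denotes a run of 16 thymines following the common sequence; the constant and function names are hypothetical.

```python
from itertools import product
from typing import List

# IUPAC degenerate codes used at the 3' end of the primer.
IUPAC = {"V": "ACG", "N": "ACGT"}

# Common sequence of the reverse-transcription primer, up to the T run.
COMMON_SEQUENCE = "CCAGACACTATGCTCATACGACGCAG"


def expand_anchored_oligo_dt(common: str = COMMON_SEQUENCE, t_run: int = 16) -> List[str]:
    """List every concrete primer encoded by common + T(t_run) + V + N."""
    prefix = common + "T" * t_run
    return [prefix + v + n for v, n in product(IUPAC["V"], IUPAC["N"])]


primers = expand_anchored_oligo_dt()
print(len(primers))        # number of distinct primers in the pool
print(primers[0][-6:])     # last six bases of the first variant
```

Because V excludes T, no primer in the pool can anneal further inside the poly(A) tract, which is what fixes the 3' common sequence at a defined position in each cDNA.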
Amplification of tag ends
The reverse complement of each SAGE tag sequence was used to design the SAGE tag primers. These primers were used in conjunction with a primer based on the SL1 sequence (5' - GGTTTAATTACCCAAGTTTGAG - 3') in a PCR. The PCR was initiated with a 94°C melt step for 2 minutes, followed by 32 cycles of a 94°C melt step for 15 seconds, a 60°C annealing step for 45 seconds, and a 72°C extension step for 1 minute. This was followed by a final extension at 72°C for 5 minutes. A Taq polymerase provided by Dr. Harald Hutter was used in all of the PCRs. Amplicons produced by the PCRs were visualized by electrophoresis on a 1% gel and extracted with a QIAquick Gel Extraction kit (Qiagen, ID 28704). These amplicons were then cloned with the InsTAclone kit (Fermentas, #K1214). Cloned amplicons were submitted for sequencing (Macrogen, Seoul, Korea), and returned sequences were mapped back to the C. elegans genome with the BLAT tool on the WormBase website (http://www.wormbase.org/). We opted to use BLAT instead of other alignment tools because this program can take spliced mRNA sequences (i.e. STACE cloned sequences) and align them to the genome in a way that reflects intron-exon boundaries [25, 35]. Those amplicons whose sequence alignment indicated a true positive result were then studied further. The returned sequence was used to design an internal primer that would be compatible with the universal primer (5' - CACTATGCTCATACGACGCAGT - 3'). These primers were then used in a PCR with the same parameters described above to produce the downstream amplicons needed for full-length characterization. Internal primers were designed using the Primer3 program.
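As a quick check on the cycling program described above, the hold times can be written down and summed in a short Python sketch. The step durations and temperatures are those given in the text; the data structure and names are illustrative, and ramp times between temperatures are ignored because they depend on the thermocycler.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Step:
    name: str
    temp_c: int
    seconds: int


# Cycling program as described in the text.
INITIAL_MELT = Step("initial melt", 94, 2 * 60)
CYCLE: Tuple[Step, ...] = (
    Step("melt", 94, 15),
    Step("anneal", 60, 45),
    Step("extend", 72, 60),
)
FINAL_EXTENSION = Step("final extension", 72, 5 * 60)
N_CYCLES = 32


def total_hold_seconds() -> int:
    """Sum of programmed hold times, excluding temperature ramps."""
    per_cycle = sum(step.seconds for step in CYCLE)
    return INITIAL_MELT.seconds + N_CYCLES * per_cycle + FINAL_EXTENSION.seconds


total = total_hold_seconds()
print(f"{total} s = {total / 60:.0f} min")  # 120 + 32 * 120 + 300 = 4260 s = 71 min
```

The 1-minute extension step also bounds the amplicon length that can be recovered reliably in each round, although the exact limit depends on the polymerase used.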
We thank Drs. Harald Hutter, David Baillie and Robert Johnsen for their advice, technical assistance, and reagents. We also thank Dr. Johnsen for proofreading the manuscript. We thank members of the Chen, Hutter, and Baillie laboratories for their technical assistance. Lindsay McGhee helped with generating figures. This project is supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) to NC. NC is also a Michael Smith Foundation for Health Research (MSFHR) Scholar and a Canadian Institutes of Health Research (CIHR) New Investigator.
- Hillier LW, Coulson A, Murray JI, Bao Z, Sulston JE, Waterston RH: Genomics in C. elegans: so many genes, such a little worm. Genome Res. 2005, 15: 1651-1660. 10.1101/gr.3729105
- The C. elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998, 282: 2012-2018. 10.1126/science.282.5396.2012
- Chen N, Harris TW, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Bradnam K, Canaran P, Chan J, Chen CK: WormBase: a comprehensive data resource for Caenorhabditis biology and genomics. Nucleic Acids Res. 2005, 33: D383-389. 10.1093/nar/gki066
- Waterston R, Martin C, Craxton M, Huynh C, Coulson A, Hillier L, Durbin R, Green P, Shownkeen R, Halloran N: A survey of expressed genes in Caenorhabditis elegans. Nat Genet. 1992, 1: 114-123. 10.1038/ng0592-114
- Reboul J, Vaglio P, Rual JF, Lamesch P, Martinez M, Armstrong CM, Li S, Jacotot L, Bertin N, Janky R: C. elegans ORFeome version 1.1: experimental verification of the genome annotation and resource for proteome-scale protein expression. Nat Genet. 2003, 34: 35-41. 10.1038/ng1140
- Hillier LW, Reinke V, Green P, Hirst M, Marra MA, Waterston RH: Massively parallel sequencing of the polyadenylated transcriptome of C. elegans. Genome Res. 2009, 19: 657-666. 10.1101/gr.088112.108
- Shin H, Hirst M, Bainbridge MN, Magrini V, Mardis E, Moerman DG, Marra MA, Baillie DL, Jones SJ: Transcriptome analysis for Caenorhabditis elegans based on novel expressed sequence tags. BMC Biol. 2008, 6: 30. 10.1186/1741-7007-6-30
- Brent MR: Genome annotation past, present, and future: how to define an ORF at each locus. Genome Res. 2005, 15: 1777-1786. 10.1101/gr.3866105
- Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E: EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006, 7 (Suppl 1): S2 1-31. 10.1186/gb-2006-7-s1-s2
- Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science. 1995, 270: 484-487. 10.1126/science.270.5235.484
- Gnatenko DV, Dunn JJ, McCorkle SR, Weissmann D, Perrotta PL, Bahou WF: Transcript profiling of human platelets using microarray and serial analysis of gene expression. Blood. 2003, 101: 2285-2293. 10.1182/blood-2002-09-2797
- Porter D, Yao J, Polyak K: SAGE and related approaches for cancer target identification. Drug Discov Today. 2006, 11: 110-118. 10.1016/S1359-6446(05)03694-9
- Wang SM: Understanding SAGE data. Trends Genet. 2007, 23: 42-50. 10.1016/j.tig.2006.11.001
- Pleasance ED, Marra MA, Jones SJ: Assessment of SAGE in transcript identification. Genome Res. 2003, 13: 1203-1215. 10.1101/gr.873003
- Blacque OE, Perens EA, Boroevich KA, Inglis PN, Li C, Warner A, Khattra J, Holt RA, Ou G, Mah AK: Functional genomics of the cilium, a sensory organelle. Curr Biol. 2005, 15: 935-941. 10.1016/j.cub.2005.04.059
- Jones SJ, Riddle DL, Pouzyrev AT, Velculescu VE, Hillier L, Eddy SR, Stricklin SL, Baillie DL, Waterston R, Marra MA: Changes in gene expression associated with developmental arrest and longevity in Caenorhabditis elegans. Genome Res. 2001, 11: 1346-1352. 10.1101/gr.184401
- McGhee JD, Fukushige T, Krause MW, Minnema SE, Goszczynski B, Gaudet J, Kohara Y, Bossinger O, Zhao Y, Khattra J: ELT-2 is the predominant transcription factor controlling differentiation and function of the C. elegans intestine, from embryo to adult. Dev Biol. 2009, 327: 551-565. 10.1016/j.ydbio.2008.11.034
- McGhee JD, Sleumer MC, Bilenky M, Wong K, McKay SJ, Goszczynski B, Tian H, Krich ND, Khattra J, Holt RA: The ELT-2 GATA-factor and the global regulation of transcription in the C. elegans intestine. Dev Biol. 2007, 302: 627-645. 10.1016/j.ydbio.2006.10.024
- Wang X, Zhao Y, Wong K, Ehlers P, Kohara Y, Jones SJ, Marra MA, Holt RA, Moerman DG, Hansen D: Identification of genes expressed in the hermaphrodite germ line of C. elegans using SAGE. BMC Genomics. 2009, 10: 213. 10.1186/1471-2164-10-213
- Ruzanov P, Jones SJ, Riddle DL: Discovery of novel alternatively spliced C. elegans transcripts by computational analysis of SAGE data. BMC Genomics. 2007, 8: 447. 10.1186/1471-2164-8-447
- Chen J, Sun M, Lee S, Zhou G, Rowley JD, Wang SM: Identifying novel transcripts and novel genes in the human genome by using novel SAGE tags. Proc Natl Acad Sci USA. 2002, 99: 12257-12262. 10.1073/pnas.192436499
- Schaefer BC: Revolutions in rapid amplification of cDNA ends: new strategies for polymerase chain reaction cloning of full-length cDNA ends. Anal Biochem. 1995, 227: 255-273. 10.1006/abio.1995.1279
- Zorio DA, Cheng NN, Blumenthal T, Spieth J: Operons as a common form of chromosomal organization in C. elegans. Nature. 1994, 372: 270-272. 10.1038/372270a0
- Blumenthal T: Trans-splicing and operons. WormBook. 2005, 1-9.
- Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664.
- Bonetta L: Gene expression: an expression of interest. Nature. 2006, 440: 1233-1237. 10.1038/4401233a
- Gamper HB, Cimino GD, Hearst JE: Solution hybridization of crosslinkable DNA oligonucleotides to bacteriophage M13 DNA. Effect of secondary structure on hybridization kinetics and equilibria. J Mol Biol. 1987, 197: 349-362. 10.1016/0022-2836(87)90128-8
- Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 2000, 132: 365-386.
- Breathnach R, Benoist C, O'Hare K, Gannon F, Chambon P: Ovalbumin gene: evidence for a leader sequence in mRNA and DNA sequences at the exon-intron boundaries. Proc Natl Acad Sci USA. 1978, 75: 4853-4857. 10.1073/pnas.75.10.4853
- Breathnach R, Chambon P: Organization and expression of eucaryotic split genes coding for proteins. Annu Rev Biochem. 1981, 50: 349-383. 10.1146/annurev.bi.50.070181.002025
- Stanke M, Waack S: Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003, 19 (Suppl 2): ii215-225.
- Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L, Bohlen A: mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res. 2009, 19: 2133-2143. 10.1101/gr.090597.108
- Korf I, Flicek P, Duan D, Brent MR: Integrating genomic homology into gene structure prediction. Bioinformatics. 2001, 17 (Suppl 1): S140-148.
- Solovyev V, Kosarev P, Seledsov I, Vorobyev D: Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 2006, 7 (Suppl 1): S10 11-12. 10.1186/gb-2006-7-s1-s10
- Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F: The UCSC Genome Browser Database: update 2006. Nucleic Acids Res. 2006, 34: D590-598. 10.1093/nar/gkj144
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.