Funded by the Forest Health Initiative.


Data Usage Policy

The draft sequences of the whole genome and of the BACs spanning the QTL regions of Castanea mollissima are being made available by the project investigators as a public service. As outlined in the Fort Lauderdale principles, the investigators request and expect to retain the right to the first publications and presentations of any global analysis of this data. The QTL sequencing, despite spanning a small percentage of the total genome, is considered a full project in and of itself; the first publications and presentations of this data are therefore considered global in nature and are constrained under the same principles as the whole genome sequence. These restrictions will be lifted upon publication of the data by the investigators or after 12 months, whichever comes first. The data presented here is in a draft state, and the investigators make no guarantees as to its accuracy or completeness.

Questions or concerns may be directed to the project leader, Dr. John Carlson, through our contact form. Authors considering usage of the data in a publication are also requested to notify the project leader.


This project was initiated and led by Dr. John Carlson at Pennsylvania State University and was primarily funded by the Forest Health Initiative.
The Chinese chestnut is a member of the Fagales order, which includes a number of other important hardwood tree species such as oak, walnut and beech. The Vanuxem genotype, provided by The American Chestnut Foundation, was sequenced with a next-generation shotgun approach. This first draft of the genome assembly covers 724.4 Mbp of the estimated 800 Mb chestnut genome in 41,270 scaffolds. The three blight resistance QTL were sequenced to greater depth through next-generation sequencing of pools of bacterial artificial chromosomes (BACs).


Whole Genome Sequence (v1.1)

Over 60 Gb of genomic DNA sequence was generated by next-generation sequencing technologies, including 14 Gb of Roche 454 shotgun sequence (26.24 M reads; 17X depth) and 46 Gb of Illumina MiSeq paired-end reads (150. M reads; 57.5X depth). In addition, 43,143 BAC paired-end Sanger sequences (1.5X tiling path) were produced to associate the genome sequence with the physical map (Fang et al., 2012). The assembly was created with the software Newbler v2.8, using the heterozygosity option (Margulies et al., 2005). The gene models and annotation are the consensus of predictions by GMOD’s software tool Maker (Holt and Yandell, 2011) and the Augustus program (Stanke et al., 2004), using the chestnut transcriptome sequences, Prunus persica protein sequences and Arabidopsis thaliana protein sequences. After contamination removal, the draft genome has been submitted to NCBI.
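The depth figures quoted above follow from dividing the total sequenced bases by the genome size. The sketch below is only an illustration of that arithmetic: it assumes the 800 Mb genome size estimate mentioned in the introduction (an assumption, since the statistics elsewhere on this page use 794 Mb), and the function name is hypothetical.

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Average depth = total sequenced bases / genome size."""
    return total_bases / genome_size


# Assumption: the quoted depths were computed against an 800 Mb genome.
GENOME_SIZE_BP = 800e6

roche_454_depth = sequencing_depth(14e9, GENOME_SIZE_BP)  # 14 Gb of 454 reads
illumina_depth = sequencing_depth(46e9, GENOME_SIZE_BP)   # 46 Gb of MiSeq reads

print(f"454: {roche_454_depth:.1f}X, Illumina: {illumina_depth:.1f}X")
# prints "454: 17.5X, Illumina: 57.5X"
```

Under this assumption the Illumina value matches the 57.5X quoted, and the 454 value is in line with the 17X quoted.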

Assembly statistics:

  • 819.7 Mb in 239,153 contigs (N50, 9,340 bp), 103% coverage of estimated genome size (794 Mb)
  • 724.0 Mb in 41,260 scaffolds (N50, 39.6 Kb), 91.2% coverage of estimated genome size (794 Mb)
  • Organelle DNA: mtDNA
  • Gene Prediction in Genome:
    • 36,478 consensus gene models
    • 38,146 predicted transcripts and peptide sequences
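For readers unfamiliar with the statistics above, the following is a minimal sketch of how an N50 and a coverage percentage are commonly computed. It is not the investigators' pipeline; the function names and the toy length list are hypothetical, and only the totals (819.7 Mb, 724.0 Mb, 794 Mb) come from this page.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of lengths")


def coverage_percent(assembled_bp, genome_bp):
    """Assembled bases as a percentage of the estimated genome size."""
    return 100.0 * assembled_bp / genome_bp


# Hypothetical toy lengths, not real chestnut scaffolds.
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))  # total is 300; 100 + 80 = 180 >= 150, so N50 is 80

# Totals taken from the assembly statistics above.
contig_coverage = round(coverage_percent(819.7e6, 794e6), 1)
scaffold_coverage = round(coverage_percent(724.0e6, 794e6), 1)
print(contig_coverage, scaffold_coverage)  # 103.2 91.2
```

In practice the lengths would be read from the downloaded FASTA files rather than typed in by hand.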


Blight resistance QTL Sequences (v1.0)

Sequence probes were designed from the three blight QTL regions on the consensus genetic map (Kubisiak et al., 2013) and used to probe the BAC libraries. Using the probe results and the physical map (Fang et al., 2012), 190 BACs spanning the QTL regions were selected for sequencing. A total of 503 Mb of 454 single-end reads and 3.8 Gb of MiSeq paired-end reads were generated. The assembly was created with the software Newbler (Margulies et al., 2005), and additional scaffolding was completed with SSPACE (Boetzer et al., 2011). The gene prediction and annotation were completed with GMOD’s software tool Maker (Holt and Yandell, 2011) using the chestnut transcriptome sequences, Prunus persica protein sequences and Arabidopsis thaliana protein sequences.

Assembly statistics:

  • cbr1
    • genetic map location: LGB (40.9-50.4 cM)
    • 99 BACs sequenced
    • 6.7 Mb in 214 scaffolds (N50, 75,056 bp)
    • 432 annotated genes
  • cbr2
    • genetic map location: LGF (38.1-46.8 cM)
    • 51 BACs sequenced
    • 4.1 Mb in 128 scaffolds (N50, 72,331 bp)
    • 219 annotated genes
  • cbr3
    • genetic map location: LGG (35.7-39.5 cM)
    • 40 BACs sequenced
    • 3.0 Mb in 53 scaffolds (N50, 158,218 bp)
    • 131 annotated genes
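The per-region figures above can be totalled as a quick consistency check, for example to confirm that the BAC counts add up to the 190 BACs selected for sequencing. The sketch below copies the numbers from the list; the dictionary layout and function name are hypothetical, and the gene density is a derived illustration rather than a reported statistic.

```python
# Per-region figures copied from the assembly statistics above.
qtl_regions = {
    "cbr1": {"bacs": 99, "size_mb": 6.7, "scaffolds": 214, "genes": 432},
    "cbr2": {"bacs": 51, "size_mb": 4.1, "scaffolds": 128, "genes": 219},
    "cbr3": {"bacs": 40, "size_mb": 3.0, "scaffolds": 53, "genes": 131},
}


def summarize(regions):
    """Sum BACs, assembled size, scaffolds and genes across all regions."""
    totals = {"bacs": 0, "size_mb": 0.0, "scaffolds": 0, "genes": 0}
    for stats in regions.values():
        for key in totals:
            totals[key] += stats[key]
    # Round away floating-point noise in the summed sizes.
    totals["size_mb"] = round(totals["size_mb"], 1)
    return totals


totals = summarize(qtl_regions)
print(totals)
# {'bacs': 190, 'size_mb': 13.8, 'scaffolds': 395, 'genes': 782}

for name, stats in qtl_regions.items():
    density = stats["genes"] / stats["size_mb"]
    print(f"{name}: {density:.1f} annotated genes per Mb")
```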


Download Data - Whole Genome Sequence v1.1

Download Data - QTL BAC Pool Draft Sequence v1.0


  1. Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578–579 (2011).
  2. Fang, G.-C. et al. A physical map of the Chinese chestnut (Castanea mollissima) genome and its integration with the genetic map. Tree Genet. Genomes 1–13 (2012).
  3. Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
  4. Kubisiak, T. L. et al. A transcriptome-based genetic map of Chinese chestnut (Castanea mollissima) and identification of regions of segmental homology with peach (Prunus persica). Tree Genet. Genomes 1–15 (2013).
  5. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380 (2005).
  6. Stanke, M., Steinkamp, R., Waack, S. & Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 32, W309–W312 (2004).