A pan-genome of Solanum and its indigenous crop African eggplant (S. aethiopicum)

Project overview 

Benoit, M., Jenike, K.M., Satterlee, J., Ramakrishnan, S., Gentile, I., Hendelman, A., Passalacqua, M., Suresh, H., Shogat, H., Robitaille, G., Fitzgerald, B., Alonge, M., Wang, X., Santos, R., He, J., Ou, S., Golan, H., Green, Y., Swartwood, K., Sierra, G.P., Orejuela, A., Roda, F., Goodwin, S., McCombie, W.R., Kizito, E.B., Gagnon, E., Knaap, S., Särkinen, T.E., Frary, A., Gillis, J., Van Eck, J., Schatz, M.C., Lippman, Z.B. (2024) Solanum pan-genomics and pan-genetics reveal paralogs as contingencies in crop engineering. bioRxiv DOI:10.1101/2024.09.10.612244

The Michael C. Schatz lab at Johns Hopkins University and the Zachary B. Lippman lab at Cold Spring Harbor Laboratory and the Howard Hughes Medical Institute present a pan-genome of the genus Solanum comprising 22 species and 9 additional accessions of the indigenous crop species S. aethiopicum (African eggplant).

Genome files

Assemblies: Reference-quality genome assemblies for each of the 22 species (and two reference-quality genomes for S. muricatum) were generated using a combination of long-read sequencing (Pacific Biosciences, CA, USA) for contig assembly and optical mapping (Bionano Genomics, CA, USA) for scaffolding. Between one and four PacBio Sequel IIe flow cells were used to sequence each sample in the Solanum-wide pan-genome (average read N50 = 29,067 bp, average coverage = 63X). The exact number of flow cells and sequencing technology for each sample are detailed in. For the additional 9 S. aethiopicum samples, a combination of PacBio Sequel IIe, PacBio Revio, and Oxford Nanopore sequencing was used to assemble the genomes.
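The read N50 and coverage figures quoted above are standard summary statistics. As a minimal sketch of how such values are typically computed (not the authors' own pipeline), the Python below derives both from a list of read lengths. The function names, the toy read lengths, and the toy genome size are all hypothetical and chosen only for illustration.

```python
def read_n50(lengths):
    """Return the N50: the largest length L such that reads of
    length >= L together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest read downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def mean_coverage(lengths, genome_size_bp):
    """Return average depth: total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return sum(lengths) / genome_size_bp


# Toy data (hypothetical): four reads and a 50 kb "genome".
toy_reads = [10_000, 20_000, 30_000, 40_000]
print(read_n50(toy_reads))               # 30000
print(mean_coverage(toy_reads, 50_000))  # 2.0
```

In practice the read lengths would come from the sequencing output for each sample, and the genome size from an estimate or from the finished assembly.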

Gene annotation: Gene annotation was founded on Liftoff-based transfer of the community-established reference annotations of tomato (Heinz reference genome) and eggplant (Brinjal reference genome). We augmented these annotations with de novo annotation based on RNA sequencing from 15 species and multiple tissues. In addition, protein evidence from several published Solanaceae genomes and the UniProt/SwissProt database was used to support gene annotation. Structural gene annotations were generated with the Mikado v2.0rc2 framework, leveraging evidence from the Daijin pipeline.

Repeat annotation: De novo transposable element annotation was first performed on each genome using EDTA v2.1.5 [90], with coding sequences from the ITAG4.0 Eggplant V4 annotation [91] provided (--cds) to purge gene coding sequences from the transposable element annotation, and with the parameters --anno 1 --sensitive 1 for sensitive detection and annotation of repeat sequences. Curated tomato repeats were supplied to EDTA (--curatedlib) for the de novo annotation. The transposable element annotations of the individual genomes were then processed together by panEDTA [92] to create a consistent pan-genome transposable element annotation. Summaries of whole-genome repeat annotations were derived from the .sum files generated by panEDTA. Repeat assembly quality was evaluated using LAI b3.2.
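To make the per-genome EDTA step concrete, the sketch below assembles a command line from the options named above (--cds, --curatedlib, --anno 1, --sensitive 1). It is an illustration under stated assumptions, not the authors' exact invocation: the EDTA.pl entry point and the --genome option come from EDTA's command-line interface rather than from this page, and the helper name and all file names are hypothetical.

```python
import shlex


def build_edta_command(genome_fasta, cds_fasta, curated_lib_fasta):
    """Assemble an EDTA argument list using the options described in the text.

    Assumption: EDTA is invoked as EDTA.pl and takes the genome via --genome.
    """
    return [
        "EDTA.pl",
        "--genome", genome_fasta,           # assembly to annotate (assumed flag)
        "--cds", cds_fasta,                 # coding sequences used to purge genes
        "--curatedlib", curated_lib_fasta,  # curated repeat library
        "--anno", "1",                      # whole-genome TE annotation
        "--sensitive", "1",                 # sensitive repeat detection
    ]


# Hypothetical file names, for illustration only.
cmd = build_edta_command("genome.fa", "reference_cds.fa", "curated_repeats.fa")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be passed directly to subprocess.run without shell quoting issues.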

For further information, please contact Zach B. Lippman or Michael C. Schatz.

Image courtesy of Blaine Fitzgerald