Chromosome-scale haploid genome assembly of Durio zibethinus KanYao

Chromosome-scale haploid genome assembly of Durio zibethinus KanYao Chromosome-scale haploid genome assembly of Durio zibethinus KanYao


Sample preparation

The samples were collected from the germplasm nursery of durian in Baoting Li Autonomous County, Sanya City, Hainan Province, China. Young leaves were used for Illumina, HiFi, ONT, and HiC library construction and sequencing. Samples including stem, leaf, flower, seed, fruit at 7 days after flowering (fruit_PFD_7d), fruit pulp at 30 days after flowering (fruit_PFD_30d_pulp), fruit pulp at 60 days after flowering (fruit_PFD_60d_pulp), fruit pulp at 120 days after flowering (fruit_PFD_120d_pulp), fruit pulp at 150 days after flowering (fruit_PFD_150d_pulp), fruit stalk at 30 days after flowering (fruit_PFD_30d_stalk), fruit stalk at 60 days after flowering (fruit_PFD_60d_stalk), fruit spongy layer at 30 days after flowering (fruit_PFD_30d_SL), and fruit spongy layer at 60 days after flowering (fruit_PFD_60d_SL) were collected for RNA-seq and Iso-Seq library construction and sequencing.

Nucleic acid extraction

DNA extraction was performed using the Plant Genomic DNA Extraction Kit (Tiangen, DP320-03), while RNA extraction utilized the Plant Total RNA Extraction Kit (Tiangen, DP432). The integrity was assessed through 1% agarose gel electrophoresis, purity was determined using Nanodrop, and nucleic acid concentration was measured using the Qubit® DNA Assay Kit in the Qubit® 3.0 Fluorometer (Invitrogen, USA).

Library construction and sequencing

Illumina library construction and sequencing protocol: A total amount of 0.2 μg DNA per sample was used for the DNA library preparations. The sequencing library was generated using the Rapid Plus DNA Lib Prep Kit for Illumina (RK20208) following the manufacturer’s recommendations and index codes were added to each sample. Briefly, genomic DNA sample was fragmented by sonication to a size of 350 bp. Then DNA fragments were endpolished, A-tailed, and ligated with the full-length adapter for Illumina sequencing, followed by further PCR amplification. After purifying PCR products with the AMPure XP system (Beckman Coulter, Beverly, USA), DNA concentration was measured using the Qubit®3.0 Fluorometer (Invitrogen, USA). Libraries were then analyzed for size distribution with the Agilent 2100 Bioanalyzer and quantified by real-time PCR (>2 nM). The clustering of the index-coded samples was carried out on a cBot Cluster Generation System using Illumina PE Cluster Kit (Illumina, USA) following the manufacturer’s instructions. After cluster generation, the DNA libraries were sequenced on the Illumina HiSeq. 2000 platform and 150 bp paired-end reads were generated.

PacBio (CCS) library construction and sequencing protocol: Genomic DNA (gDNA) was sheared using a g-TUBE to fragments between 6 and 20 kb for 10 kb and 20 kb SMRTbell library construction. DNA with long overhangs was treated with ExoVII before damage repair. Repair enzymes from the Template Prep Kit were utilized to correct various DNA damages. T4 DNA Polymerase filled in 5’ overhangs and removed 3’ overhangs, while T4 PNK phosphorylated 5’ ends. Hairpin adapters were ligated to the repaired ends. Excess and imperfect SMRTbells were removed with ExoIII and ExoVII. AMPure PB Beads purified the library, which was further size-selected using the BluePippin System for large fragments. The selected SMRTbell templates were then purified again with AMPure PB Beads. A sequencing primer was attached, and a polymerase bound to the templates for loading into Zero-Mode Waveguides (ZWMs) using the Binding Kit. Finally, libraries were loaded into SMRT Cells, and sequencing was performed using the Sequel II instrument with the SMRT Cell 8 M Tray. The total length of the raw sequencing data is 95.7 Gb, with a total of 5,965,883 reads. The maximum read length is 58,876 bp, the minimum read length is 86 bp, the average read length is 16,038 bp, and the N50 is 19,782 bp.

ONT library construction and sequencing protocol: Genomic DNA was fragmented using a Covaris g-TUBE, followed by size verification. The DNA underwent repair and end-preparation with NEBNext End repair/dA-tailing Module and FFPE DNA Repair Mix, followed by purification with AMPure XP beads. Adapters from the NEBNext Quick Ligation Module were then ligated to the prepared DNA, followed by another round of AMPure XP bead cleaning. Specific fragment lengths were targeted using L or S Fragment Buffers. Finally, the DNA was eluted in Elution Buffer. The library prepared for sequencing was loaded onto the Oxford_Nanopore PromethION platform. The total length of the raw sequencing data is 94.4 Gb, with a total of 4,071,912 reads. The maximum read length is 718,956 bp, the minimum read length is 9 bp, the average read length is 23,178 bp, and the N50 is 53,343 bp.

HiC library construction and sequencing protocol: Hi-C libraries were constructed using an established protocol with modifications. The sample was ground in liquid nitrogen, cross-linked with 4% formaldehyde under vacuum at room temperature for 30 minutes, and the reaction was quenched with 2.5 M glycine. After chilling on ice, the sample was centrifuged, the pellet was washed with PBS, and subsequently, resuspended in lysis buffer. The supernatant was removed post-centrifugation. The nuclei were washed with NEB buffer, resuspended, and solubilized with SDS, then incubated at 65 °C. Triton X-100 was used to quench the SDS. An overnight digestion with DpnII was performed at 37 °C. DNA ends were marked with biotin-14-dCTP, and blunt-end ligation was conducted. The chromatin was ligated and cross-links were reversed by proteinase K treatment at 65 °C. DNA purification followed using phenol-chloroform extraction. Biotin from unligated ends was removed with T4 DNA polymerase. Sonication sheared DNA ends were repaired, and biotin-labeled Hi-C samples were enriched with streptavidin beads. A-tails were added, Illumina PE adapters were ligated, and the libraries were PCR-amplified before being sequenced on an Illumina HiSeq. 2000 platform.

RNA-seq library construction and sequencing protocol: Sequencing libraries were generated using the NEBNext Ultra RNA Library Prep Kit for Illumina following the manufacturer’s protocol, and index codes were assigned to each sample. mRNA was isolated from total RNA with poly-T magnetic beads and fragmented using divalent cations at high temperatures. First-strand cDNA synthesis was conducted using random primers and M-MuLV Reverse Transcriptase, followed by second-strand synthesis with DNA Polymerase I and RNase H. Overhangs were blunted, and 3’ ends were adenylated. NEB Next Adaptors with hairpin loop structures were ligated to the cDNA. The library fragments were then size-selected to 370-420 bp using the AMPure XP system. USER Enzyme was applied to adaptor-ligated cDNA, followed by PCR with Phusion High-Fidelity DNA polymerase and primers. PCR products were purified and the library quality was assessed on the Agilent 5400 system and quantified by QPCR. The qualified libraries were pooled and sequenced on an Illumina HiSeq. 2000 platform using a PE150 approach by Novogene Bioinformatics Technology Co., Ltd. Each sample undergoes independent library construction with three biological replicates.

Iso-Seq library construction and sequencing protocol: Iso-Seq libraries were prepared with the SMRTbell® Prep Kit 3.0 according to the prescribed method. Total RNA was isolated and mRNA was enriched. The mRNA was then reverse-transcribed into cDNA using an oligo-dT primer. Double-stranded cDNA was synthesized and subjected to damage repair and end-conditioning to create blunt-ended, 5’-phosphorylated ends. SMRTbell adapters were ligated to the prepared cDNA, forming the SMRTbell template required for sequencing. Exonuclease treatment was applied to remove failed ligation products and to ensure that only SMRTbell templates were present. The library was then size-selected to the desired range using the BluePippin™ Size Selection System. Following size selection, the library was assessed for quality and quantity. Once validated, the library was sequenced on a PacBio Sequel II system, which employed Single Molecule, Real-Time (SMRT) technology to generate long reads for full-length transcript sequencing.

Whole-genome sequencing yielded a total of 106 Gb of Illumina reads (~140× coverage), 95.7 Gb of HiFi reads (~125× coverage), 75.3 Gb of ONT reads (~100× coverage), and 90.3 Gb of Hi-C reads (~120× coverage). Illumina reads and Hi-C reads were filtered with the default parameters of fastp (v0.20.0)11 software. HiFi and ONT reads were filtered using Filtlong (https://github.com/rrwick/Filtlong) software, retaining reads longer than 12 kb and 30 kb with quality scores above 90% to avoid possible errors, respectively. HiFi and ONT filtered reads of 86.1 Gb (~110× coverage) and 75.3 Gb (~96.5× coverage) were used for genome assembly.

Genome survey and analysis

The clean ILLUMINA reads were then used for k-mer counting with Jellyfish (v2.3.0)12 software. Following 21-mer counting, the resulting matrix was utilized to calculate the haploid genome size and heterozygosity using Genomescope (v2.0)13 with parameters -p 2 -k 21. The results indicate that the genome size is approximately 753 Mb, with a high heterozygosity of around 1.23% (Fig. 2).

Fig. 2
figure 2

Genome survey results based on K-mer analysis. (a) Phenotypic images of ‘Kan Yao’ durian, including the plant, flower, whole fruit, and cross-section of the fruit. (b) Analysis of genome size and heterozygosity using Jellyfish (v2.3.0) and Genomescope (v2.0) with parameters -p 2 -k 21.

Assessment of sequencing data saturation

HiFi reads were randomly divided into different coverage levels (10×, 30×, 50×, 70×, 90×, 110×), and initial assemblies were performed with Hifiasm (v0.19.5)14 software to evaluate data saturation. The results indicate that with the increase in sequencing data, the total length of the assembled sequences also increases (Fig. 3). However, the contig N50 reaches a plateau at 70× coverage, while BUSCO (v4.1.4) completeness plateaus at 30× coverage. Consequently, the HiFi sequencing data in this study (~110×) have achieved saturation, with the optimal coverage being 70×.

Fig. 3
figure 3

Evaluation of Assembly Metrics at Different Sequencing Data Coverage Levels. (a) Line graph showing the contig N50 values for assemblies at various coverage levels (10×, 30×, 50×, 70×, 90×, 110×). (b) Evaluation of total sequence length and BUSCO completeness for assemblies at different coverage levels (10×, 30×, 50×, 70×, 90×, 110×).

Genome assembly

Genome contigs were assembled using Hifiasm (v0.19.5)14 software, incorporating HiFi, ONT, and Hi-C reads (hifiasm -o asm -t48–h1 hic_clean_1.fq.gz–h2 hic_clean_2.fq.gz hifi.fastq.gz–ul ONT.fastq.gz), resulting in hap1 and hap2 haplotype genome drafts. Chromosome mounting and assembly of contigs was conducted using 3D-DNA (v190716)15, followed by manual correction with Juicebox (v1.11.08)16 software. Twenty-eight chromosomes were extracted and used as a reference genome for quarTeT AssemblyMapper (v1.0.3)17 software to re-anchor the haplotype data from Hifiasm. ONT data were then used to fill the gaps by quarTeT GapFiller (v1.0.3)17, resulting in the final genome assemblies, designated as hap1 and hap2, respectively. 21 chromosomes of hap1 had no gaps, whereas 22 chromosomes of hap2 had no gaps (Fig. 3). The assembly results were analyzed using QUAST (v5.0.2)18 software. The total genome sizes for hap1 and hap2 were 737.2 Mb and 763.8 Mb respectively, closely aligning with the estimated size of ~753 Mb. The contig N50 values were 22.9 Mb and 21.5 Mb, while the scaffold N50 values were 25.9 Mb and 26.7 Mb, respectively (Table 1). Telomere detection was performed using TIDK software (https://github.com/tolkit/telomeric-identifier), identifying the telomeric repeat unit as AAACCCT. All 28 chromosomes of the hap1 genome had telomeres detected at one or both ends (Fig. 4), with telomeres present at both ends of 23 chromosomes and at only one end of 5 chromosomes (chr7A, chr12A, chr20A, chr23A, chr25A). Similarly, all 28 chromosomes of the hap2 genome had telomeres detected at one or both ends. Telomeres present at both ends of 22 chromosomes and at only one end of 6 chromosomes (chr7B, chr11B, chr12B, chr20B, chr23B, chr25B). Our assembly results were also compared with a recently published genome of the Durio zibethinus ‘Kan Yao’ cultivar19, which only assembled a single haploid genome. In contrast, our study successfully assembled two haploid genomes. While both assemblies show similar results in terms of chromosome number, total length of sequences anchored to chromosomes, proportion of repetitive sequences, and BUSCO scores, our two haploid assemblies exhibit superior performance in terms of Contig N50 and QV values (Table 1).

Table 1 Summary of genome assembly.
Fig. 4
figure 4

Schematic diagram of chromosome structure. The upper diagram represents the hap1 haplotype genome, while the lower diagram represents the hap2 haplotype genome. Red triangles indicate telomeres, and purple lines represent gaps.

Genome annotation

Repeat sequence annotation and masking were performed using RepeatModeler (v2.0.1)20 and RepeatMasker (v4.0.7)21. First, the genome sequence was input into RepeatModeler, which identifies repeat sequence models based on the Dfam (v3.2)22 and RepBase-2018102623 databases to build a repeat sequence library. Then, RepeatMasker annotated the repeat sequences and converted them to lowercase letters. The proportion of repeat sequences in the hap1 and hap2 genomes was 56.75% and 58.48%, respectively. The types of repeat sequences included transposons such as DNA, LTR, LINE, SINE, and RC, as well as satellite DNA, simple repeats, and unknown types. The majority of repeat sequences were LTR transposons, specifically Gypsy (27.1% and 33.56%) and Copia (5.25% and 4.80%).

Gene structure annotation employed three strategies: de novo, homology-based, and transcriptome-based annotation. For de novo annotation, Braker (v2.1.6)24 was used to build models from Arabidopsis protein sequences (arabidopsis_pep_20101214.fa) and merged RNA-seq reads. Homology-based annotation was conducted with GenomeThreader (v1.7.3)25, referencing ‘Musang King’ durian protein annotations (GCF_002303985.1_Duzib1.0_protein.faa). Transcriptome-based annotation utilized PASA (v2.5.0)26 with Iso-Seq reads to accurately annotate the gene structures. The results from these strategies were then merged using EvidenceModeler (v.1.1.1)27 and updated with PASA to incorporate UTR and alternative splicing information, resulting in the final annotation file. The haplotype genomes hap1 and hap2 were annotated with 50,417 and 50,390 coding genes, corresponding to 92,276 and 91,712 transcripts, respectively (Table 2). Gene and repeat sequence densities were calculated using a 500 kb sliding window (bedtools makewindows -w 500000). Protein sequence homology within the genome was analyzed using DIAMOND (v2.0.4.142)28 software. Collinearity analysis was performed with MCScanX29, and visualizations were created using Circos (v0.69-8)30 (Fig. 5). BUSCO (v4.1.4)31 analysis comparing the CDS sequences of transcripts with the embryophyta_odb10 database (v2020-09-10) revealed that the two sets of haplotype genome contained approximately 94.1% and 94% complete homologous conserved genes, respectively (Table 2).

Table 2 Summary of genome annotation data.
Fig. 5
figure 5

Circos plot of haplotype 1 (a) and haplotype 2 (b) genomes. The outer track represents the Gene density histogram, and the inner track represents the Repeat sequence density histogram.

The function annotation of the protein-coding genes followed this protocol: 1. DIAMOND (v2.0.4.142) was used with parameters -e 0.001 -f 5 -k 1 to align against the GenBank-NR (https://www.ncbi.nlm.nih.gov/protein), SwissProt32, and UniRef9033 databases; 2. The pfam_scan.pl script was run with parameters -clan_overlap -as -cpu 16 -e_seq. 1e-5 -e_dom 1e-5 to align against the Pfam34 database; 3. eggNOG-mapper (v2.0.0)35 was run with parameters –no_file_comments -m diamond to BLAST against the EggNOG (v5.0)36 database for GO and KO annotations. The annotation rates for the NR and UniRef90 databases were similar, both around 91%, while the rates for the Pfam and Swiss-Prot databases were lower, at approximately 70% (Table 2). Non-coding gene annotation was compared with the Rfam37 database using Infernal (v1.1.4)38 software, which identified 5,254 and 5,496 non-coding RNAs in hap1 and hap2, respectively. These included various types such as riboswitches, tRNAs, miRNAs, rRNAs, snRNAs, ribozymes, antisense RNAs, and sRNAs. The majority were rRNAs, followed by snRNAs and tRNAs. A total of 281 and 284 miRNAs were annotated in hap1 and hap2, respectively (Table 2).

Comparative of haplotype genomes

Synteny analysis of the two haplotype genomes was performed using the MUMmer (v4.0.0)39 software, which demonstrated that the two haplotype genomes are nearly identical (Fig. 6). Hi-C interaction heatmap analysis was carried out using the HiCPlotter40 software, which clearly showed strong intra-chromosomal interaction signals and notable interaction signals between homologous chromosomes, while inter-chromosomal interaction signals were weak. The gene collinearity analysis between the two haplotype genomes was performed using the TBtools (v2.096)41 software. This analysis revealed that the majority of genes exhibit conserved chromosomal distributions between the two haplotype genomes, while a small number of genes show evidence of rearrangements or translocations.

Fig. 6
figure 6

Comparative analysis of the two haplotype genome structures. (a) Genomic sequence collinearity analysis. (b) Hi-C interaction heatmap analysis. (c) Gene collinearity analysis. (d) Chromosomal structural variation (SV) analysis.

Filtered ONT sequencing reads were utilized for chromosomal structural variation (SV) analysis with the NGMLR(v0.2.7)42 and Sniffles(v1.0.11)43 software. In total, 25,572 structural variation sites (SVs) were detected, including eight types: deletion (DEL), insertion (INS), duplication (DUP), breakend (BND), inversion (INV), deletion/inversion (DEL/INV), inversion/duplication (INVDUP), and duplication/insertion (DUP/INS) (Fig. 6). Among these, DEL was the most frequent with 12,864 occurrences, followed by INS with 10,282 occurrences.




Source link

Add a comment

Leave a Reply

Your email address will not be published. Required fields are marked *

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use