Thc Credit V1 91
Cannabis sativa L., a plant native to Central Asia, was first cultivated in Asia and Europe and is now one of the most widely cultivated plants worldwide1. In China, hemp fiber has been used to produce textiles for the past 6000 years2.
C. sativa is one of the most valuable agricultural crops. It is widely used to produce paper, textiles, building materials, food, and medicine, and one of its secondary metabolites, tetrahydrocannabinol (THC), is also used to produce well-known drugs. Long-term selective breeding has produced both hemp fiber and medicinal cannabis strains, with medicinal cannabis showing promise in treating various diseases3 by relieving an array of symptoms, including pain, nausea, anxiety, and inflammation4,5,6,7. The therapeutic efficacy of medicinal cannabis depends mainly on cannabinoids, which are metabolites unique to C. sativa8, among which THC and cannabidiol (CBD) are the main compounds.
Although cannabis has considerable economic and medical value, information about its genome is limited. A draft genome was published by Van Bakel et al.9 in 2011, but the assembly of this draft was neither of good quality nor complete, which limits its usefulness.
The cannabis genome has been sequenced9, but the sequenced plant came from a cultivated variety. Generally, cultivated varieties lose substantial genetic diversity through successive bottlenecks due to domestication and selection for traits to increase yield under intensive human cultivation12. Therefore, wild-type varieties are an important source of genetic diversity for molecular breeding. In this report, we performed genomic sequencing, assembly, annotation, and evolutionary analysis in wild-type varieties of C. sativa. The genetic data obtained in this study will be a valuable resource for future studies assessing the pharmacology, chemical constituents, cultivation, and genetic improvement of the traits of these plants and could be used as a reference in future population genetic studies of C. sativa.
The Hi-C approach uses high-throughput sequencing to determine the state of genome folding by measuring the frequency of contact between pairs of loci14,15. The technique was originally developed for genome-wide chromosome conformation capture, and it was subsequently found to be useful for generating chromosome-level genome assemblies16.
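The idea of measuring contact frequency between pairs of loci can be illustrated with a minimal sketch. It assumes that read pairs have already been mapped to positions on a single sequence and simply counts pairs per bin; the function name `contact_matrix`, the bin size, and the toy coordinates are hypothetical and are not taken from the pipeline used in this study.

```python
def contact_matrix(read_pairs, bin_size, genome_length):
    """Count contacts between fixed-size bins along one sequence.

    read_pairs: iterable of (pos_a, pos_b) mapped positions (0-based).
    Returns a symmetric list-of-lists matrix of contact counts.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # mirror the count to keep the matrix symmetric
    return matrix


# Toy data (hypothetical): four read pairs on a 1000-bp sequence.
pairs = [(10, 120), (15, 130), (20, 950), (500, 520)]
m = contact_matrix(pairs, bin_size=100, genome_length=1000)
```

In scaffolding, bins that show many contacts are more likely to lie close together on the same chromosome, which is the signal that Hi-C-based assemblers exploit.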
BUSCO (v3.0)18 was employed to evaluate the accuracy and completeness of our genome assembly, gene set, and transcripts. Based on the OrthoDB database, BUSCO provides several large sets of single-copy genes covering the branches of the evolutionary tree. When the gene set was compared with the genome, the proportion of complete BUSCOs was 92.6% (Supplementary Information; Table S3), indicating that the genome assembly was highly complete.
Because contamination can occur during sequencing and assembly, we further evaluated our genome assembly by using GC depth analysis. The GC depth scatter plot showed no significant differentiation, and points were concentrated around a GC content of 34%, indicating high assembly quality without any bacterial contamination (Supplementary Information; Fig. S2). Finally, the distribution of per-base sequencing depth was close to a Poisson distribution, further indicating that the assembled genome was of high quality (Supplementary Information; Fig. S3).
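A GC depth plot pairs the GC content of each genomic window with its mean sequencing depth. The following is a minimal sketch of how such points could be computed, assuming a sequence string and a per-base depth list of the same length; the function names, the window size, and the toy input are hypothetical and do not reproduce the tool used in this study.

```python
def gc_fraction(seq):
    """Return the GC fraction of seq, ignoring non-ACGT characters."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None  # window contains only N or gap characters
    return (seq.count("G") + seq.count("C")) / acgt


def gc_depth_points(sequence, depths, window):
    """Return (gc_fraction, mean_depth) for each non-overlapping window."""
    points = []
    for start in range(0, len(sequence) - window + 1, window):
        gc = gc_fraction(sequence[start:start + window])
        if gc is None:
            continue
        mean_depth = sum(depths[start:start + window]) / window
        points.append((gc, mean_depth))
    return points


# Toy data (hypothetical): two 8-bp windows with different GC and depth.
sequence = "ATGCATAT" + "GGCCGGCC"
depths = [10] * 8 + [30] * 8
points = gc_depth_points(sequence, depths, window=8)
```

Plotting these points as a scatter plot gives a single cloud for an uncontaminated assembly; a separate cluster at a different GC content or depth would suggest foreign DNA.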
To evaluate the consistency of the assembly with the next-generation sequencing data, we mapped the sequencing reads to the assembled scaffold sequences, and the resulting mapping rate and genomic coverage showed that the coverage was deep and complete (Supplementary Information; Fig. S4). The mapping rate of the next-generation data was 96.77% (Supplementary Information; Table S4), indicating that the assembled genome was of high quality.
We predicted non-coding RNAs, such as rRNAs, tRNAs, snRNAs, and miRNAs, by comparing their sequences with the known non-coding RNA library Rfam24. A total of 2,441 rRNAs, 214 snRNAs, and 281 miRNAs were thus predicted (Supplementary Information; Table S7, Figshare 3, 4, 5). tRNAscan-SE25 was used to predict tRNA sequences in the genome, resulting in 712 tRNAs (Supplementary Information; Table S7). To further verify our gene annotation results, we conducted a BUSCO evaluation using the embryophyta_odb10 database, which yielded a score of 93%, indicating that the annotation results were acceptable (Supplementary Information; Table S8).
OrthoMCL (v1.4)26 was used to classify single-copy and multi-copy gene families from both closely related and remotely related species (Supplementary Information; Table S9 and Fig. S5), resulting in the identification of 930 C. sativa-specific genes. C. sativa shares more genes with Trema orientale and Morus notabilis than with other species (Supplementary Information; Fig. S6). We used MUSCLE software (v3.8.31)27 to perform multiple sequence alignments for all single-copy gene family sequences. After we constructed the integrated supergene sequence, which was based on the four-fold degenerate sites (4DTv sites) of orthologous family genes, we used PhyML (v3.0)28 to construct the species phylogenetic tree (ML-Tree). As shown in Fig. 2, Vitis vinifera and Fragaria vesca (Rosaceae) diverged from one another earlier than T. orientale, M. notabilis, and Ziziphus jujuba diverged from each other, and C. sativa is most closely related to T. orientale.
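The construction of a supergene from four-fold degenerate sites can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this study: it assumes in-frame, codon-aligned coding sequences of equal length, uses the standard genetic code, and keeps a third codon position only when all species share the same four-fold degenerate codon prefix. The function name `fourfold_sites` and the toy alignment are hypothetical.

```python
# First two bases of the eight four-fold degenerate codon families in the
# standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_sites(aligned_cds):
    """Extract third-codon-position bases at four-fold degenerate sites.

    aligned_cds: dict mapping species name -> codon-aligned CDS string.
    Returns a dict mapping species name -> string of retained bases.
    """
    names = list(aligned_cds)
    length = len(aligned_cds[names[0]])
    kept = {name: [] for name in names}
    for i in range(0, length - 2, 3):
        codons = [aligned_cds[name][i:i + 3].upper() for name in names]
        prefixes = {codon[:2] for codon in codons}
        shared_fourfold = len(prefixes) == 1 and prefixes <= FOURFOLD_PREFIXES
        if shared_fourfold and all(codon[2] in "ACGT" for codon in codons):
            for name, codon in zip(names, codons):
                kept[name].append(codon[2])
    return {name: "".join(bases) for name, bases in kept.items()}


# Toy alignment (hypothetical): codons 1 and 3 qualify, codons 2 and 4 do not.
alignment = {"sp1": "GCTATGCTAGGA", "sp2": "GCCATGCTGGCA"}
sites = fourfold_sites(alignment)
```

Concatenating the retained bases across all single-copy families, in the same family order for every species, gives one supergene sequence per species that can then be passed to a tree-building program.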
The longer the branch length, the longer the divergence time. The closer the branches, the closer the predicted genetic relationship. In general, we considered a bootstrapping value above 85 to represent good support for the result. Numbers above the nodes are bootstrap values.
The blue numbers at the node positions represent the divergence time of each species in millions of years (Ma). The numbers in parentheses indicate the confidence interval of the divergence time, which can be used to estimate the divergence time of target species and other species. The red dots mark the calibration times used to correct the species divergence times.
Gene family expansion and contraction were analyzed based on mathematical statistical tests. After the cluster analysis of gene families, those with abnormal gene numbers in individual species were filtered, and CAFE (v4.1)34 and probabilistic graphical models (PGMs) were then used to simulate the acquisition and loss of genes under the specified phylogenetic tree and to analyze gene family expansion and contraction using hypothesis testing (Fig. 4). We found 12,801 gene families in the MRCA (most recent common ancestor). In comparison to M. notabilis, T. orientale, V. vinifera, F. vesca, Musca domestica, Z. jujuba, and Papaver somniferum, there were 2,599 gene families showing expansion and 1,298 gene families showing contraction in C. sativa.
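The bookkeeping behind expansion and contraction counts can be illustrated with a deliberately simplified sketch. It assumes that an ancestral copy number per family is already available (for example, from the output of a tool such as CAFE) and only compares it with the copy number in one species; it does not implement the likelihood model or the hypothesis test. The function name `classify_families` and the toy counts are hypothetical.

```python
def classify_families(ancestral_counts, species_counts):
    """Tally families whose copy number rose, fell, or stayed the same.

    ancestral_counts: dict family id -> estimated ancestral copy number.
    species_counts: dict family id -> copy number in the species of interest.
    No statistical test is applied; this is a plain comparison.
    """
    summary = {"expanded": 0, "contracted": 0, "unchanged": 0}
    for family, ancestral in ancestral_counts.items():
        current = species_counts.get(family, 0)  # absent family = 0 copies
        if current > ancestral:
            summary["expanded"] += 1
        elif current < ancestral:
            summary["contracted"] += 1
        else:
            summary["unchanged"] += 1
    return summary


# Toy data (hypothetical): three gene families.
ancestral = {"fam1": 2, "fam2": 5, "fam3": 1}
species = {"fam1": 6, "fam2": 3, "fam3": 1}
result = classify_families(ancestral, species)
```

A full analysis would additionally test whether each change is larger than expected under a birth-death model of gene gain and loss along the phylogenetic tree.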
Green numbers represent the number of gene families that expanded in a species during evolution, and red numbers represent the number of gene families that contracted in a species during evolution.
Genomic synteny block analysis can be used to determine the evolutionary source of chromosomes between species38,39,40. In this study, we used BLASTP (v2.2.31+) to align the protein sequences of C. sativa and Z. jujuba (Rhamnaceae) and then used MCScan (v0.8) to identify genome synteny blocks from those alignments. Our results showed that C. sativa and Z. jujuba present a strong genomic synteny relationship (Fig. 6a).
Because of the high repeat content and high heterozygosity of the cannabis genome, no high-quality cannabis genome has been generated previously. There are unknown regions in the cannabis genome assembled using SOAPdenovo software by Van Bakel et al.9 in 2011. The genome was not assembled to the chromosome level, and the number and length of the scaffolds in that study are much lower than the values expected for plant genes. Therefore, the cannabis genome assembled in 2011 shows poor quality and does not contain annotation information, which greatly limits its applicability to research on cannabis. In this study, we reassembled a wild-type cannabis genome by using third-generation sequencing data and thus obtained a high-quality cannabis genome.
Single-molecule real-time sequencing offers high throughput and long read lengths, which can reduce the number of contigs and can effectively improve the contiguity and completeness of the genome during assembly. We combined third-generation sequencing (TGS) and next-generation sequencing (NGS) with Hi-C assembly technology to construct a high-density wild-type cannabis genome sequence map.
After obtaining the high-quality cannabis genome, we annotated its genes and thus considerably improved upon the 2011 version of the genome9. Following the completion of the assembly of its repeat sequences and statistical analysis, we found that cannabis has abundant repeat regions, which may be the cause of the poor quality of the cannabis genome assembled by Van Bakel et al.9 in 2011. This high-quality reference genome will benefit researchers in the exploration and manipulation of the agronomic characteristics of C. sativa.
To understand the evolutionary status of cannabis, we analyzed its evolution and divergence times. Through these analyses, we found that cannabis and T. orientale are closely related at the molecular level. The quality of the T. orientale genome is still relatively poor, and the high-quality cannabis genome that we obtained in this study could therefore provide useful information for the future study of the T. orientale genome and its evolution. By analyzing whole-genome duplication (WGD) events in cannabis, we found three recent WGD events and one large-scale duplication event in cannabis, and we found that cannabis shares two WGD events with T. orientale. Our data further elucidate the evolutionary status of cannabis.