Predicting genome sizes and restriction enzyme recognition-sequence probabilities across the eukaryotic tree of life

High-throughput sequencing of reduced representation libraries obtained through digestion with restriction enzymes – generically known as restriction-site associated DNA sequencing (RAD-seq) – is a common strategy to generate genome-wide genotypic and sequence data from eukaryotes. A critical design...


Bibliographic Details
Main authors: Herrera, Santiago, Reyes Herrera, Paula H., Shank, Timothy M.
Format: article
Language: English
Published: Cold Spring Harbor Laboratory (CSH) 2025
Subjects:
Online access: https://www.biorxiv.org/content/10.1101/007781v2
http://hdl.handle.net/20.500.12324/40752
id RepoAGROSAVIA40752
record_format dspace
institution Corporación Colombiana de Investigación Agropecuaria
collection Repositorio AGROSAVIA
language English
topic Agricultural research - A50
Tree
Genome
Enzyme
Experimental design
Cross-cutting
http://aims.fao.org/aos/agrovoc/c_7887
http://aims.fao.org/aos/agrovoc/c_3224
http://aims.fao.org/aos/agrovoc/c_2603
http://aims.fao.org/aos/agrovoc/c_29466
description High-throughput sequencing of reduced representation libraries obtained through digestion with restriction enzymes – generically known as restriction-site associated DNA sequencing (RAD-seq) – is a common strategy to generate genome-wide genotypic and sequence data from eukaryotes. A critical design element of any RAD-seq study is knowledge of the approximate number of genetic markers that can be obtained for a taxon using different restriction enzymes, as this number determines the scope of a project and ultimately defines its success. This number can be determined directly only if a reference genome sequence is available, or it can be estimated if the genome size and restriction recognition-sequence probabilities are known. However, both scenarios are uncommon for non-model species. Here, we performed systematic in silico surveys of recognition sequences for diverse and commonly used type II restriction enzymes across the eukaryotic tree of life. Our observations reveal that recognition-sequence frequencies for a given restriction enzyme are strikingly variable among broad eukaryotic taxonomic groups, being largely determined by phylogenetic relatedness. We demonstrate that genome sizes can be predicted from cleavage frequency data obtained with restriction enzymes targeting ‘neutral’ elements. Models based on genomic compositions are also effective tools to accurately calculate probabilities of recognition sequences across taxa, and can be applied to species for which reduced-representation data are available (including transcriptomes and ‘neutral’ RAD-seq datasets). The analytical pipeline developed in this study, PredRAD (https://github.com/phrh/PredRAD), and the resulting databases constitute valuable resources that will help guide the design of any study using RAD-seq or related methods.
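The estimate described in the abstract (marker count from genome size and recognition-sequence probability, and genome size from observed cleavage frequency) can be illustrated with a minimal sketch. It assumes the simplest possible composition model, in which bases are independent and their frequencies follow from GC content alone. This is an illustrative assumption, not the PredRAD implementation, and all function names below are hypothetical.

```python
from typing import Dict


def base_frequencies(gc_content: float) -> Dict[str, float]:
    """Per-base frequencies under an independent-base (mononucleotide) model.

    Assumption: G and C are equally frequent, as are A and T.
    """
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    return {
        "G": gc_content / 2,
        "C": gc_content / 2,
        "A": (1 - gc_content) / 2,
        "T": (1 - gc_content) / 2,
    }


def site_probability(recognition_seq: str, gc_content: float) -> float:
    """Probability that a given position starts the recognition sequence."""
    freqs = base_frequencies(gc_content)
    prob = 1.0
    for base in recognition_seq.upper():
        if base not in freqs:
            raise ValueError(f"unsupported base: {base!r}")
        prob *= freqs[base]
    return prob


def expected_sites(genome_size_bp: int, recognition_seq: str, gc_content: float) -> float:
    """Expected number of recognition sites: genome size times site probability."""
    return genome_size_bp * site_probability(recognition_seq, gc_content)


def estimated_genome_size(observed_sites: int, recognition_seq: str, gc_content: float) -> float:
    """Invert the relationship: genome size from an observed site count."""
    return observed_sites / site_probability(recognition_seq, gc_content)


if __name__ == "__main__":
    # EcoRI recognizes GAATTC; the genome size and GC content here are made up.
    print(expected_sites(100_000_000, "GAATTC", 0.40))
    print(estimated_genome_size(20_000, "GAATTC", 0.40))
```

Because the abstract reports that recognition-sequence frequencies vary strikingly among taxonomic groups, a GC-only model like this one should be read as a rough first approximation. Richer composition models would replace `site_probability` while leaving the two estimators unchanged.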
format article
author Herrera, Santiago
Reyes Herrera, Paula H.
Shank, Timothy M.
author_sort Herrera, Santiago
title Predicting genome sizes and restriction enzyme recognition-sequence probabilities across the eukaryotic tree of life
publisher Cold Spring Harbor Laboratory (CSH)
publishDate 2025
url https://www.biorxiv.org/content/10.1101/007781v2
http://hdl.handle.net/20.500.12324/40752
spelling RepoAGROSAVIA40752 2025-03-06T03:00:18Z 2025-03-05T13:00:48Z 2025-03-05T13:00:48Z 2015-07 2015 article Scientific article http://purl.org/coar/resource_type/c_2df8fbb1 info:eu-repo/semantics/article https://purl.org/redcol/resource_type/ART http://purl.org/coar/version/c_970fb48d4fbd8a85 reponame:Biblioteca Digital Agropecuaria de Colombia instname:Corporación colombiana de investigación agropecuaria AGROSAVIA eng BioRxiv 3 1 1 35 Attribution-NonCommercial-ShareAlike 4.0 International http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf Cold Spring Harbor Laboratory (CSH) BioRxiv; (2015): BioRxiv (July); p. 1 - 35.