Robust and efficient software for reference-free genomic diversity analysis of GBS data on diploid and polyploid species

Genotype-by-sequencing (GBS) is a widely used cost-effective technique to obtain large numbers of genetic markers from populations. Although a standard reference-based pipeline can be followed to analyze these reads, a reference genome is still not available for a large number of species. Hence,...


Bibliographic Details
Main authors: Parra Salazar, Andrea; Gomez, Jorge; Lozano Arce, Daniela; Reyes Herrera, Paula H.; Duitama, Jorge
Format: article
Language: English
Published: Cold Spring Harbor Laboratory (CSH) 2024
Subjects:
Online access: https://www.biorxiv.org/content/10.1101/2020.11.28.402131v1
http://hdl.handle.net/20.500.12324/39793
https://doi.org/10.1101/2020.11.28.402131
id RepoAGROSAVIA39793
record_format dspace
institution Corporación Colombiana de Investigación Agropecuaria
collection Repositorio AGROSAVIA
language English
topic Agricultural research - A50
Data analysis
Statistical method
Genomics
Cross-cutting
http://aims.fao.org/aos/agrovoc/c_15962
http://aims.fao.org/aos/agrovoc/c_7377
http://aims.fao.org/aos/agrovoc/c_92382
description Genotype-by-sequencing (GBS) is a widely used, cost-effective technique to obtain large numbers of genetic markers from populations. Although a standard reference-based pipeline can be followed to analyze these reads, a reference genome is still not available for a large number of species. Hence, several research groups require reference-free approaches to generate the genetic variability information that can be obtained from a GBS experiment. Unfortunately, tools to perform de novo analysis of GBS reads are scarce, and some of the existing solutions are difficult to operate under the different settings generated by existing GBS protocols. In this manuscript we describe a novel algorithm to perform reference-free variant detection and genotyping from GBS reads. Non-exact searches on a dynamic hash table of consensus sequences enable efficient read clustering and sorting. This algorithm was integrated into the Next Generation Sequencing Experience Platform (NGSEP) so that it can use the state-of-the-art variant detector already implemented in this tool. We performed benchmark experiments with three real populations of plants and animals with different structures and ploidies, sequenced with different GBS protocols at different read depths. These experiments show that NGSEP has comparable, and in some cases better, accuracy and always better computational efficiency than existing solutions. We expect that this new development will be useful for research groups conducting population genetic studies in a wide variety of species.
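The clustering idea in the abstract (non-exact lookups against a hash table of consensus sequences that changes as reads arrive) can be illustrated with a minimal Python sketch. This is not the NGSEP implementation, which the record does not describe in detail. All names here (`cluster_reads`, `key_neighbors`, `key_len`, `max_mismatches`) are hypothetical, and the non-exact search is assumed to be a simple probe of every key one substitution away from the read prefix, followed by a Hamming-distance check against the cluster consensus.

```python
from collections import Counter

BASES = "ACGT"


def key_neighbors(key):
    """Yield the key itself, then every key one substitution away."""
    yield key
    for i, base in enumerate(key):
        for alt in BASES:
            if alt != base:
                yield key[:i] + alt + key[i + 1:]


def hamming(a, b):
    """Count mismatches over the shared length of two sequences."""
    return sum(x != y for x, y in zip(a, b))


def consensus(reads):
    """Majority base per column, truncated to the shortest read."""
    length = min(len(r) for r in reads)
    return "".join(
        Counter(r[i] for r in reads).most_common(1)[0][0] for i in range(length)
    )


def cluster_reads(reads, key_len=8, max_mismatches=2):
    """Greedily assign each read to the first cluster whose consensus is close."""
    table = {}      # consensus prefix -> ids of clusters indexed under it
    members = []    # cluster id -> reads in that cluster
    centers = []    # cluster id -> current consensus sequence
    for read in reads:
        hit = None
        # Non-exact search (assumed scheme): exact prefix first, then neighbors.
        for key in key_neighbors(read[:key_len]):
            for cid in table.get(key, ()):
                if hamming(read, centers[cid]) <= max_mismatches:
                    hit = cid
                    break
            if hit is not None:
                break
        if hit is None:
            # No close consensus found: open a new cluster.
            members.append([read])
            centers.append(read)
            table.setdefault(read[:key_len], []).append(len(members) - 1)
            continue
        # Update the consensus and re-index it if its prefix changed.
        old_key = centers[hit][:key_len]
        members[hit].append(read)
        centers[hit] = consensus(members[hit])
        new_key = centers[hit][:key_len]
        if new_key != old_key:
            table[old_key].remove(hit)
            table.setdefault(new_key, []).append(hit)
    return members
```

Calling `cluster_reads(["ACGTACGTAA", "ACGTACGTAT", "TTTTGGGGCC", "ACGAACGTAA", "TTTTGGGGCC"])` groups the three `ACG...` reads together even though the fourth has a mismatch inside its lookup prefix, because the neighbor probe still reaches the right table entry. A production tool would need to handle indels, quality values, and far larger inputs, none of which this sketch attempts.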
format article
author Parra Salazar, Andrea
Gomez, Jorge
Lozano Arce, Daniela
Reyes Herrera, Paula H.
Duitama, Jorge
title Robust and efficient software for reference-free genomic diversity analysis of GBS data on diploid and polyploid species
publisher Cold Spring Harbor Laboratory (CSH)
publishDate 2024
url https://www.biorxiv.org/content/10.1101/2020.11.28.402131v1
http://hdl.handle.net/20.500.12324/39793
https://doi.org/10.1101/2020.11.28.402131
spelling RepoAGROSAVIA39793 2024-08-06T03:00:51Z
Corporación Colombiana de Investigación Agropecuaria - (AGROSAVIA)
Universidad de los Andes (ULA)
2024-08-05T20:29:56Z
2024-08-05T20:29:56Z
2020-11-28
2020
article
Scientific article
http://purl.org/coar/resource_type/c_2df8fbb1
info:eu-repo/semantics/article
https://purl.org/redcol/resource_type/ART
http://purl.org/coar/version/c_970fb48d4fbd8a85
reponame:Biblioteca Digital Agropecuaria de Colombia
instname:Corporación colombiana de investigación agropecuaria AGROSAVIA
eng
BioRxiv 1 20
Attribution-NonCommercial-ShareAlike 4.0 International
http://creativecommons.org/licenses/by-nc-sa/4.0/
application/pdf
C.I Tibaitatá
Colombia
Cold Spring Harbor Laboratory (CSH)
New York (United States)
BioRxiv; (2020): BioRxiv (Nov.); p. 1-20