Robust and efficient software for reference-free genomic diversity analysis of GBS data on diploid and polyploid species

Genotype-by-sequencing (GBS) is a widely used cost-effective technique to obtain large numbers of genetic markers from populations. Although a standard reference-based pipeline can be followed to analyze these reads, a reference genome is still not available for a large number of species. Hence,...


Bibliographic Details
Main Authors:	Parra Salazar, Andrea; Gomez, Jorge; Lozano Arce, Daniela; Reyes Herrera, Paula H.; Duitama, Jorge
Format:	Article
Language:	English
Published:	Cold Spring Harbor Laboratory (CSH) 2024
Subjects:	Agricultural research - A50; Data analysis; Statistical methods; Genomics; Cross-cutting http://aims.fao.org/aos/agrovoc/c_15962 http://aims.fao.org/aos/agrovoc/c_7377 http://aims.fao.org/aos/agrovoc/c_92382
Online Access:	https://www.biorxiv.org/content/10.1101/2020.11.28.402131v1 http://hdl.handle.net/20.500.12324/39793 https://doi.org/10.1101/2020.11.28.402131

_version_	1855495228756066304
author	Parra Salazar, Andrea; Gomez, Jorge; Lozano Arce, Daniela; Reyes Herrera, Paula H.; Duitama, Jorge
author_browse	Duitama, Jorge; Gomez, Jorge; Lozano Arce, Daniela; Parra Salazar, Andrea; Reyes Herrera, Paula H.
author_facet	Parra Salazar, Andrea; Gomez, Jorge; Lozano Arce, Daniela; Reyes Herrera, Paula H.; Duitama, Jorge
author_sort	Parra Salazar, Andrea
collection	Repositorio AGROSAVIA
description	Genotype-by-sequencing (GBS) is a widely used, cost-effective technique to obtain large numbers of genetic markers from populations. Although a standard reference-based pipeline can be followed to analyze these reads, a reference genome is still not available for a large number of species. Hence, several research groups require reference-free approaches to generate the genetic variability information that can be obtained from a GBS experiment. Unfortunately, tools to perform de novo analysis of GBS reads are scarce, and some of the existing solutions are difficult to operate under the different settings generated by the existing GBS protocols. In this manuscript we describe a novel algorithm to perform reference-free variant detection and genotyping from GBS reads. Non-exact searches on a dynamic hash table of consensus sequences enable efficient read clustering and sorting. This algorithm was integrated into the Next Generation Sequencing Experience Platform (NGSEP) to make use of the state-of-the-art variant detector already implemented in this tool. We performed benchmark experiments with three real populations of plants and animals that differ in structure and ploidy and were sequenced with different GBS protocols at different read depths. These experiments show that NGSEP has comparable and in some cases better accuracy, and always better computational efficiency, than existing solutions. We expect that this new development will be useful for several research groups conducting population genetic studies in a wide variety of species.
format	Article
id	RepoAGROSAVIA39793
institution	Corporación Colombiana de Investigación Agropecuaria
language	English
publishDate	2024
publishDateRange	2024
publishDateSort	2024
publisher	Cold Spring Harbor Laboratory (CSH)
publisherStr	Cold Spring Harbor Laboratory (CSH)
record_format	dspace
spelling	RepoAGROSAVIA39793 2024-08-06T03:00:51Z Corporación Colombiana de Investigación Agropecuaria - (AGROSAVIA) Universidad de los Andes (ULA) 2024-08-05T20:29:56Z 2024-08-05T20:29:56Z 2020-11-28 2020 article Scientific article http://purl.org/coar/resource_type/c_2df8fbb1 info:eu-repo/semantics/article https://purl.org/redcol/resource_type/ART http://purl.org/coar/version/c_970fb48d4fbd8a85 reponame:Biblioteca Digital Agropecuaria de Colombia instname:Corporación colombiana de investigación agropecuaria AGROSAVIA eng BioRxiv 1 20 Attribution-NonCommercial-ShareAlike 4.0 International http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf application/pdf C.I Tibaitatá Colombia Cold Spring Harbor Laboratory (CSH) New York (United States) BioRxiv; (2020): BioRxiv (Nov.);p. 1-20
title	Robust and efficient software for reference-free genomic diversity analysis of GBS data on diploid and polyploid species
topic	Agricultural research - A50; Data analysis; Statistical methods; Genomics; Cross-cutting http://aims.fao.org/aos/agrovoc/c_15962 http://aims.fao.org/aos/agrovoc/c_7377 http://aims.fao.org/aos/agrovoc/c_92382
url	https://www.biorxiv.org/content/10.1101/2020.11.28.402131v1 http://hdl.handle.net/20.500.12324/39793 https://doi.org/10.1101/2020.11.28.402131
