Relative performance of cluster algorithms and validation indices in maize genome-wide structure patterns

A number of clustering algorithms are available to depict population genetic structure (PGS) with genomic data; however, there is no consensus on which methods are the best performing ones. We conducted a simulation study of three PGS scenarios with subpopulations k = 2, 5 and 10, recreating several...


Bibliographic Details
Main authors: Videla, María Eugenia; Iglesias, Juliana; Bruno, Cecilia Inés
Format: Article
Language: English
Published: Springer Nature 2022
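Subjects: Maize; Population Genetics; Genomes; Genetic Improvement; Unsupervised Learning; Population Genetic Structure; Multivariate Technique; Outcome Misclassification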
Online access: http://hdl.handle.net/20.500.12123/11153
https://link.springer.com/article/10.1007/s10681-021-02926-5
https://doi.org/10.1007/s10681-021-02926-5
author Videla, María Eugenia
Iglesias, Juliana
Bruno, Cecilia Inés
collection INTA Digital
description A number of clustering algorithms are available to depict population genetic structure (PGS) with genomic data; however, there is no consensus on which methods perform best. We conducted a simulation study of three PGS scenarios with k = 2, 5 and 10 subpopulations, recreating several maize genomes as a model, to: (1) compare three well-known clustering methods: UPGMA, k-means and a Bayesian method (BM); (2) assess four internal validation indices (CH, Connectivity, Dunn and Silhouette) for determining a reliable number of groups defining a PGS; and (3) estimate the misclassification rate for each validation index. Moreover, a publicly available maize dataset was used to illustrate the outcomes of our simulation. BM was the best method for classifying individuals in all tested scenarios, with no assignment errors. Conversely, UPGMA had the highest misclassification rate. In scenarios with 5 and 10 subpopulations, the CH and Connectivity indices showed the greatest underestimation of the number of groups for all clustering algorithms. The Dunn and Silhouette indices performed best with BM. Nevertheless, since Silhouette measures the degree of confidence in cluster assignment, and BM measures the probability of cluster membership, these results should be considered with caution. In this study, BM was efficient at depicting the PGS in both simulated and real maize datasets. This study offers a robust alternative for unveiling the existing PGS, thereby facilitating population studies and breeding strategies in maize programs. Moreover, the present findings may have implications for other crop species.
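The abstract above reports a misclassification rate per method and relies on the Silhouette index, but this record gives no formulas. The following is a minimal sketch of how both quantities are commonly computed, not the authors' code. Assumptions: Euclidean distance between individuals, toy two-dimensional points standing in for marker data, and brute-force label matching that is only practical for small k. The function names `misclassification_rate` and `mean_silhouette` are hypothetical.

```python
from itertools import permutations
from math import dist


def misclassification_rate(true_labels, cluster_labels):
    """Fraction of individuals placed in the wrong group.

    Cluster ids are arbitrary, so every one-to-one mapping from cluster ids
    to true subpopulation ids is tried and the most favourable one is kept.
    """
    true_ids = sorted(set(true_labels))
    cluster_ids = sorted(set(cluster_labels))
    # Pad with None so surplus clusters map to "no true group" (always an error).
    targets = true_ids + [None] * max(0, len(cluster_ids) - len(true_ids))
    best_errors = len(true_labels)
    for perm in permutations(targets, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, perm))
        errors = sum(mapping[c] != t for t, c in zip(true_labels, cluster_labels))
        best_errors = min(best_errors, errors)
    return best_errors / len(true_labels)


def mean_silhouette(points, labels):
    """Average silhouette width s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a width of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: two well-separated groups of three individuals each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
true_groups = [0, 0, 0, 1, 1, 1]
good = [1, 1, 1, 0, 0, 0]  # correct partition with swapped cluster ids
poor = [0, 0, 1, 1, 1, 1]  # one individual assigned to the wrong group

for name, assignment in (("good", good), ("poor", poor)):
    print(name,
          misclassification_rate(true_groups, assignment),
          round(mean_silhouette(points, assignment), 3))
```

For k = 10 the permutation search becomes slow; an assignment solver such as SciPy's `linear_sum_assignment` would be a reasonable replacement for the brute-force loop.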
format Article
id INTA11153
institution Instituto Nacional de Tecnología Agropecuaria (INTA -Argentina)
language English
publishDate 2022
publisher Springer Nature
record_format dspace
spelling INTA11153 2022-12-21T11:01:53Z
Relative performance of cluster algorithms and validation indices in maize genome-wide structure patterns
Videla, María Eugenia
Iglesias, Juliana
Bruno, Cecilia Inés
Maíz
Genética de Poblaciones
Genomas
Mejoramiento Genético
Maize
Population Genetics
Genomes
Genetic Improvement
Unsupervised Learning
Population Genetic Structure
Multivariate Technique
Outcome Misclassification
EEA Pergamino
Affiliation: Videla, María Eugenia. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Estadística y Biometría; Argentina
Affiliation: Videla, María Eugenia. Consejo Nacional de Investigaciones Científicas y Tecnológicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA -CONICET); Argentina
Affiliation: Videla, María Eugenia. Universidad Nacional de Villa María; Argentina
Affiliation: Iglesias, Juliana. Instituto Nacional de Tecnología Agropecuaria (INTA). Estación Experimental Agropecuaria Pergamino. Departamento de Maíz; Argentina
Affiliation: Iglesias, Juliana. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Agrarias, Naturales y Ambientales; Argentina
Affiliation: Bruno, Cecilia. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Estadística y Biometría; Argentina
Affiliation: Bruno, Cecilia. Consejo Nacional de Investigaciones Científicas y Tecnológicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA -CONICET); Argentina
2022-02-15T14:34:44Z
2022-02-15T14:34:44Z
2021-09
info:ar-repo/semantics/artículo
info:eu-repo/semantics/article
info:eu-repo/semantics/acceptedVersion
http://hdl.handle.net/20.500.12123/11153
https://link.springer.com/article/10.1007/s10681-021-02926-5
1573-5060 (online)
0014-2336
https://doi.org/10.1007/s10681-021-02926-5
eng
info:eu-repograntAgreement/INTA/2019-PE-E6-I114-001/2019-PE-E6-I114-001/AR./Caracterización de la diversidad genética de plantas, animales y microorganismos mediante herramientas de genómica aplicada.
info:eu-repo/semantics/restrictedAccess
application/pdf
Springer Nature
Euphytica 217 (10) : 195 (October 2021)
title Relative performance of cluster algorithms and validation indices in maize genome-wide structure patterns
topic Maíz
Genética de Poblaciones
Genomas
Mejoramiento Genético
Maize
Population Genetics
Genomes
Genetic Improvement
Unsupervised Learning
Population Genetic Structure
Multivariate Technique
Outcome Misclassification
url http://hdl.handle.net/20.500.12123/11153
https://link.springer.com/article/10.1007/s10681-021-02926-5
https://doi.org/10.1007/s10681-021-02926-5