Marriage between variable selection and prediction methods to model plant disease risk

Predicting the risk of a disease in a pathosystem based on a set of climatic variables usually requires handling a high number of input variables, many of which are often irrelevant and/or redundant. Building linear predictive models entails not only dimensionality issues but also the negative impac...


Bibliographic Details
Main authors: Suarez, Franco; Bruno, Cecilia; Kurina Giannini, Franca; Gimenez, Maria; Rodriguez Pardina, Patricia; Balzarini, Mónica Graciela
Format: info:ar-repo/semantics/artículo
Language: English
Published: Elsevier 2023
Subjects: Multicollinearity; Plant Diseases; Logistic Regression; Random Forest; Feature Selection; Prediction Models; Pathosystems
Online access: http://hdl.handle.net/20.500.12123/15634
https://www.sciencedirect.com/science/article/pii/S1161030123002630
https://doi.org/10.1016/j.eja.2023.126995
author Suarez, Franco
Bruno, Cecilia
Kurina Giannini, Franca
Gimenez, Maria
Rodriguez Pardina, Patricia
Balzarini, Mónica Graciela
collection INTA Digital
description Predicting the risk of a disease in a pathosystem based on a set of climatic variables usually requires handling a large number of input variables, many of which are irrelevant and/or redundant. Building linear predictive models entails not only dimensionality issues but also the negative impact of multicollinearity. Several feature selection methods have proved to be efficient in both linear and non-linear models, regardless of those issues. However, in a machine learning (ML) context, these feature selection methods need to be evaluated when embedded into the model fitting algorithm in order to obtain the greatest accuracy. The aim of this work was to assess different combinations of variable selection methods with linear and non-linear predictors to fit climate-based models that predict the occurrence of a disease in a pathosystem. Four selection methods were compared: stepwise, which is frequently used in linear models, combined with VIF and p-value statistical criteria (Step+VIF+Pv), and three methods commonly used in ML: filter (F), genetic algorithm (GA), and Boruta (B). The disease risk predictors were constructed with a logistic linear regression model (LR) and the random forest (RF) algorithm, using all the available variables and the subsets of variables selected by each feature selection method. Data from three pathosystems were processed: two involving Begomovirus, one in common bean (Phaseolus vulgaris L.) and the other in soybean (Glycine max), and a third involving Mal de Rio Cuarto virus in maize (Zea mays L.). The data sets differed in sample size and number of variables. The accuracy of RF prediction did not vary among feature selection methods. Step+VIF+Pv, when used to reduce the model, outperformed the other feature selection methods in fitting LR. Our proposal suggests that the appropriate pairing of variable selection and prediction models would improve the modeling of plant disease risk.
format info:ar-repo/semantics/artículo
id INTA15634
institution Instituto Nacional de Tecnología Agropecuaria (INTA -Argentina)
language English
publishDate 2023
publisher Elsevier
record_format dspace
spelling INTA15634 2024-07-11T12:54:47Z
Instituto de Patología Vegetal
Affiliation: Suarez, Franco. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Cátedra de Estadística y Biometría; Argentina
Affiliation: Suarez, Franco. Consejo Nacional de Investigaciones Científicas y Técnicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA); Argentina
Affiliation: Suarez, Franco. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Patología Vegetal; Argentina
Affiliation: Bruno, Cecilia. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Cátedra de Estadística y Biometría; Argentina
Affiliation: Bruno, Cecilia. Consejo Nacional de Investigaciones Científicas y Técnicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA); Argentina
Affiliation: Bruno, Cecilia. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Patología Vegetal; Argentina
Affiliation: Kurina Giannini, Franca. Aarhus Universitet. institut for agroøkologi. Jornær sektioner; Denmark
Affiliation: Gimenez, Maria De La Paz. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Patología Vegetal; Argentina
Affiliation: Gimenez, Maria De La Paz. Consejo Nacional de Investigaciones Científicas y Técnicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA); Argentina
Affiliation: Rodriguez Pardina, Patricia. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Patología Vegetal; Argentina
Affiliation: Rodriguez Pardina, Patricia. Consejo Nacional de Investigaciones Científicas y Técnicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA); Argentina
Affiliation: Balzarini, Mónica Graciela. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Cátedra de Estadística y Biometría; Argentina
Affiliation: Balzarini, Mónica Graciela. Consejo Nacional de Investigaciones Científicas y Técnicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA); Argentina
Affiliation: Balzarini, Mónica Graciela. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Patología Vegetal; Argentina
2023-10-23T10:21:16Z
2023-10-23T10:21:16Z
2023-10-11
info:ar-repo/semantics/artículo
info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://hdl.handle.net/20.500.12123/15634
https://www.sciencedirect.com/science/article/pii/S1161030123002630
ISSN 1161-0301
https://doi.org/10.1016/j.eja.2023.126995
eng
info:eu-repo/semantics/restrictedAccess
http://creativecommons.org/licenses/by-nc-sa/4.0/
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
application/pdf
Elsevier
European Journal of Agronomy 151: 126995 (November 2023)
title Marriage between variable selection and prediction methods to model plant disease risk
topic Multicollinearity
Plant Diseases
Multicolinearidad
Enfermedades de las Plantas
Logistic Regression
Random Forest
Feature Selection
Prediction Models
Pathosystems
url http://hdl.handle.net/20.500.12123/15634
https://www.sciencedirect.com/science/article/pii/S1161030123002630
https://doi.org/10.1016/j.eja.2023.126995