Interpretable machine learning methods to explain on-farm yield variability of high productivity wheat in Northwest India

The increasing availability of complex, geo-referenced on-farm data demands analytical frameworks that can guide crop management recommendations. Recent developments in interpretable machine learning techniques offer opportunities to use these methods in agronomic studies. Our objectives were two-fo...

Bibliographic Details
Main Authors: Nayak, Harisankar; Silva, João Vasco; Parihar, Chiter Mal; Krupnik, Timothy J.; Sena, Dipaka Ranjan; Kakraliya, S.; Jat, Hanuman Sahay; Sidhu, Harminder Singh; Sharma, Parbodh Chander; Jat, Mangi Lal; Sapkota, Tek Bahadur
Format: Journal Article
Language: English
Published: Elsevier 2022
Subjects: forests; machine learning; wheat; yields; crop residues
Online Access: https://hdl.handle.net/10568/127194
_version_ 1855541969612177408
author Nayak, Harisankar
Silva, João Vasco
Parihar, Chiter Mal
Krupnik, Timothy J.
Sena, Dipaka Ranjan
Kakraliya, S.
Jat, Hanuman Sahay
Sidhu, Harminder Singh
Sharma, Parbodh Chander
Jat, Mangi Lal
Sapkota, Tek Bahadur
collection Repository of Agricultural Research Outputs (CGSpace)
description The increasing availability of complex, geo-referenced on-farm data demands analytical frameworks that can guide crop management recommendations. Recent developments in interpretable machine learning techniques offer opportunities to use these methods in agronomic studies. Our objectives were two-fold: (1) to assess the performance of different machine learning methods in explaining on-farm wheat yield variability in the Northwestern Indo-Gangetic Plains of India, and (2) to identify the most important drivers and interactions explaining wheat yield variability. A suite of fine-tuned machine learning models (ridge and lasso regression, classification and regression trees, k-nearest neighbor, support vector machines, gradient boosting, extreme gradient boosting, and random forest) was statistically compared using R², root mean square error (RMSE), and mean absolute error (MAE). The best-performing model was then fine-tuned again using a grid search approach to manage the bias-variance trade-off. Three post-hoc, model-agnostic techniques were used to interpret the best-performing model: variable importance (a variable was considered “important” if shuffling its values increased or decreased the model error considerably), interaction strength (based on Friedman’s H-statistic), and two-way interaction (i.e., how much of the total variability in wheat yield was explained by a particular two-way interaction). Model outputs were compared against empirical data to contextualize results and provide a blueprint for future analysis in other production systems. Tree-based and decision boundary-based methods outperformed regression-based methods in explaining wheat yield variability. Random forest was the best-performing method in terms of goodness-of-fit, precision, and accuracy, with RMSE, MAE, and R² ranging from 367 to 470 kg ha⁻¹, 276 to 345 kg ha⁻¹, and 0.44 to 0.63, respectively. Random forest was then used to select important variables and interactions.
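The evaluation metrics and the shuffling-based variable importance described above can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic, `toy_model` is a hypothetical stand-in for a fitted model such as a random forest, and the names `n_rate`, `residue`, and `unused` are assumptions made for illustration. This sketch reports only the increase in RMSE after shuffling, which is one common convention.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination (1 - SS_res / SS_tot)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in RMSE when a single column of X is shuffled.

    `predict` is any callable mapping one row to a prediction, so the
    procedure does not depend on the type of model behind it.
    """
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(rmse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return baseline, importances


def toy_model(row):
    """Hypothetical fitted model: uses n_rate and residue, ignores the third column."""
    n_rate, residue, unused = row
    return 4000.0 + 5.0 * n_rate + 300.0 * residue


# Synthetic data (assumed for illustration only, not from the study).
data_rng = random.Random(42)
X = [[data_rng.uniform(80, 160), float(data_rng.randint(0, 1)), data_rng.uniform(0, 1)]
     for _ in range(200)]
y = [toy_model(row) + data_rng.gauss(0, 50) for row in X]

preds = [toy_model(row) for row in X]
baseline, importances = permutation_importance(toy_model, X, y)

print("RMSE:", rmse(y, preds), "MAE:", mae(y, preds), "R2:", r2(y, preds))
for name, value in zip(["n_rate", "residue", "unused"], importances):
    print(name, "importance (RMSE increase):", value)
```

Because the stand-in model never reads the third column, shuffling that column leaves every prediction unchanged and its importance is exactly zero, while the two columns the model uses show a positive increase in error.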
The most important management variables explaining wheat yield variability were nitrogen application rate and crop residue management, whereas the average of monthly cumulative solar radiation during February and March (coinciding with the reproductive phase of wheat) was the most important biophysical variable. The effect size of these variables on wheat yield ranged from 227 kg ha⁻¹ for nitrogen application rate to 372 kg ha⁻¹ for cumulative solar radiation during February and March. Important interactions affecting wheat yield were also detected in the data, namely between crop residue management and disease management, and between nitrogen application rate and seeding rate. For instance, farmers’ fields with moderate disease incidence yielded 750 kg ha⁻¹ less when crop residues were removed than when they were retained. Similarly, the wheat yield response to residue retention was higher under low seeding and N application rates. As an inductive research approach, interpretable machine learning methods, when applied appropriately, can extract agronomically actionable information from large-scale farmer field data.
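The interaction strength mentioned in the abstract is based on Friedman's H-statistic. A minimal sketch of the pairwise form follows, assuming the usual definition from mean-centered partial dependence functions: H² is the sum over observations of [PD_jk − PD_j − PD_k]² divided by the sum of PD_jk². The two models and the data are hypothetical and synthetic, and the record does not state how the authors computed the statistic.

```python
import random


def partial_dependence(predict, X, features, values):
    """Average prediction over X with the chosen columns fixed to `values`."""
    total = 0.0
    for row in X:
        modified = list(row)
        for f, v in zip(features, values):
            modified[f] = v
        total += predict(modified)
    return total / len(X)


def centered(values):
    """Subtract the mean so the partial dependence values average to zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def h_statistic(predict, X, j, k):
    """Pairwise H-statistic for columns j and k, evaluated at the rows of X."""
    pd_j = centered([partial_dependence(predict, X, [j], [row[j]]) for row in X])
    pd_k = centered([partial_dependence(predict, X, [k], [row[k]]) for row in X])
    pd_jk = centered([partial_dependence(predict, X, [j, k], [row[j], row[k]])
                      for row in X])
    numerator = sum((a - b - c) ** 2 for a, b, c in zip(pd_jk, pd_j, pd_k))
    denominator = sum(a ** 2 for a in pd_jk)
    return (numerator / denominator) ** 0.5 if denominator > 0 else 0.0


def additive_model(row):
    """Hypothetical model with no interaction between columns 0 and 1."""
    return 2.0 * row[0] + 3.0 * row[1] + 0.5 * row[2]


def interacting_model(row):
    """Hypothetical model in which the effect of column 0 depends on column 1."""
    return row[0] * row[1] + 0.5 * row[2]


# Synthetic data (assumed for illustration only).
data_rng = random.Random(7)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(60)]

h_additive = h_statistic(additive_model, X, 0, 1)
h_interacting = h_statistic(interacting_model, X, 0, 1)
print("H (additive model):", h_additive)
print("H (interacting model):", h_interacting)
```

The statistic is close to zero when the joint partial dependence is the sum of the two one-variable functions, and it grows as more of the joint effect cannot be split that way. The cost is quadratic in the number of rows, so in practice it is usually evaluated on a subsample.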
format Journal Article
id CGSpace127194
institution CGIAR Consortium
language English
publishDate 2022
publisher Elsevier
record_format dspace
spelling CGSpace127194 2026-01-16T10:10:43Z 2022-10 2023-01-16T13:04:18Z 2023-01-16T13:04:18Z Journal Article https://hdl.handle.net/10568/127194 en Open Access application/pdf Elsevier
Nayak, H. S., Silva, J. V., Parihar, C. M., Krupnik, T. J., Sena, D. R., Kakraliya, S. K., Jat, H. S., Sidhu, H. S., Sharma, P. C., Jat, M. L., & Sapkota, T. B. (2022). Interpretable machine learning methods to explain on-farm yield variability of high productivity wheat in Northwest India. Field Crops Research, 287, 108640. https://doi.org/10.1016/j.fcr.2022.108640
title Interpretable machine learning methods to explain on-farm yield variability of high productivity wheat in Northwest India
topic forests
machine learning
wheat
yields
crop residues
url https://hdl.handle.net/10568/127194