A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset

Background: Single-nucleotide polymorphisms (SNPs) are the form of molecular genetic variation most widely used in genetic studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The genome analysis toolkit (GATK) is one of the most w...

Bibliographic Details
Main Authors: Zhou, Yong; Kathiresan, Nagarajan; Yu, Zhichao; Rivera, Luis F.; Yang, Yujian; Thimma, Manjula; Manickam, Keerthana; Chebotarov, Dmytro; Mauleon, Ramil; Chougule, Kapeel; Wei, Sharon; Gao, Tingting; Green, Carl D.; Zuccolo, Andrea; Xie, Weibo; Ware, Doreen; Zhang, Jianwei; McNally, Kenneth L.; Wing, Rod A.
Format: Journal Article
Language: English
Published: Springer 2024
Subjects: single nucleotide polymorphisms; genomes; rice; sorghum; maize; soybeans; food crops
Online Access: https://hdl.handle.net/10568/163079
author Zhou, Yong
Kathiresan, Nagarajan
Yu, Zhichao
Rivera, Luis F.
Yang, Yujian
Thimma, Manjula
Manickam, Keerthana
Chebotarov, Dmytro
Mauleon, Ramil
Chougule, Kapeel
Wei, Sharon
Gao, Tingting
Green, Carl D.
Zuccolo, Andrea
Xie, Weibo
Ware, Doreen
Zhang, Jianwei
McNally, Kenneth L.
Wing, Rod A.
collection Repository of Agricultural Research Outputs (CGSpace)
description Background: Single-nucleotide polymorphisms (SNPs) are the form of molecular genetic variation most widely used in genetic studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The genome analysis toolkit (GATK) is one of the most widely used SNP calling software tools publicly available, but unfortunately, high-performance computing versions of this tool have yet to become widely available and affordable. Results: Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms, from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy, with results comparable to previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a “subpopulation aware” 16-genome rice reference panel with ~3000 resequenced rice accessions. The entire process took ~16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq). Conclusions: This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable, as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a multi-crop data set of 25 reference genomes produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.
format Journal Article
id CGSpace163079
institution CGIAR Consortium
language English
publishDate 2024
publisher Springer
record_format dspace
spelling CGSpace163079 2025-12-08T10:06:44Z 2024-01-25 2024-12-05T16:04:15Z 2024-12-05T16:04:15Z Journal Article https://hdl.handle.net/10568/163079 en Open Access application/pdf Springer Zhou, Y., Kathiresan, N., Yu, Z. et al. A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset. BMC Biol 22, 13 (2024). https://doi.org/10.1186/s12915-024-01820-5
title A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset
topic single nucleotide polymorphisms
genomes
rice
sorghum
maize
soybeans
food crops
url https://hdl.handle.net/10568/163079