Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets

De novo assembly of transcriptomes from species without a reference genome remains a common problem in functional genomics. While methods and algorithms for transcriptome assembly are continually being developed and published, the quality of de novo assemblies using short reads depends on the complexi...

Bibliographic Details
Main authors: Gonzalez, Sergio Alberto; Rivarola, Maximo Lisandro; Ribone, Andrés Ignacio; Lew, Sergio Eduardo; Paniego, Norma Beatriz
Format: info:ar-repo/semantics/artículo
Language: English
Published: Sage Publications 2025
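Subjects: ARN; Transcriptómica; Genética; Modelos de Simulación; RNA; Transcriptomics; Genetics; Simulation Models; De Novo Assembly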
Online access: http://hdl.handle.net/20.500.12123/21747
https://journals.sagepub.com/doi/full/10.1177/11779322241274957
https://doi.org/10.1177/11779322241274957
author Gonzalez, Sergio Alberto
Rivarola, Maximo Lisandro
Ribone, Andrés Ignacio
Lew, Sergio Eduardo
Paniego, Norma Beatriz
collection INTA Digital
description De novo assembly of transcriptomes from species without a reference genome remains a common problem in functional genomics. While methods and algorithms for transcriptome assembly are continually being developed and published, the quality of de novo assemblies using short reads depends on the complexity of the transcriptome and is limited by several types of errors. One problem to overcome is the research gap regarding the best method to use in each study to obtain high-quality de novo assembly. Currently, there are no established protocols for solving the assembly problem considering the transcriptome complexity. In addition, the accuracy of quality metrics used to evaluate assemblies remains unclear. In this study, we investigate and discuss how different variables accounting for the complexity of RNA-Seq data influence assembly results independently of the software used. For this purpose, we simulated transcriptomic short-read sequence datasets from high-quality full-length predicted transcript models with varying degrees of complexity. Subsequently, we conducted de novo assemblies using different assembly programs, and compared and classified the results using both reference-dependent and independent metrics. These metrics were assessed both individually and combined through multivariate analysis. The degree of alternative splicing and the fragment size of the paired-end reads were identified as the variables with the greatest influence on the assembly results. Moreover, read length and fragment size had different influences on the reconstruction of longer and shorter transcripts. These results underscore the importance of understanding the composition of the transcriptome under study, and making experimental design decisions related to the need to work with reads and fragments of different sizes. In addition, the choice of assembly software will positively impact the final assembly outcome. This selection will affect the completeness of represented genes and assembled isoforms, as well as contribute to error reduction.
format info:ar-repo/semantics/artículo
id INTA21747
institution Instituto Nacional de Tecnología Agropecuaria (INTA - Argentina)
language English
publishDate 2025
publisher Sage Publications
record_format dspace
spelling INTA21747 2025-03-20T12:26:05Z
Instituto de Biotecnología
Fil: Gonzalez, Sergio Alberto. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Agrobiotecnología y Biología Molecular; Argentina
Fil: Gonzalez, Sergio Alberto. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Fil: Rivarola, Maximo Lisandro. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Fil: Ribone, Andrés Ignacio. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Fil: Lew, Sergio Eduardo. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Fil: Lew, Sergio Eduardo. Universidad de Buenos Aires. Facultad de Ingeniería. Instituto de Ingeniería Biomédica; Argentina
Fil: Paniego, Norma Beatriz. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Agrobiotecnología y Biología Molecular; Argentina
Fil: Paniego, Norma Beatriz. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
2025-03-20T12:21:22Z
2025-03-20T12:21:22Z
2024-12
info:ar-repo/semantics/artículo
info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://hdl.handle.net/20.500.12123/21747
https://journals.sagepub.com/doi/full/10.1177/11779322241274957
1177-9322
https://doi.org/10.1177/11779322241274957
eng
info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/4.0/
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
application/pdf
Sage Publications
Bioinformatics and Biology Insights 18 : 1-13 (2024)
title Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets
topic ARN
Transcriptómica
Genética
Modelos de Simulación
RNA
Transcriptomics
Genetics
Simulation Models
De Novo Assembly
url http://hdl.handle.net/20.500.12123/21747
https://journals.sagepub.com/doi/full/10.1177/11779322241274957
https://doi.org/10.1177/11779322241274957