K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes

Bibliographic Details
Main Authors: Contreras-Moreira, Bruno, Filippi, Carla Valeria, Naamati, Guy, García Girón, Carlos, Allen, James E., Flicek, Paul
Format: Article
Language: English
Published: Wiley 2021
Online Access: http://hdl.handle.net/20.500.12123/10882
https://acsess.onlinelibrary.wiley.com/doi/full/10.1002/tpg2.20143
https://doi.org/10.1002/tpg2.20143
Collection: INTA Digital
Description: The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
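The two-step idea summarized in the abstract (call repeats by k-mer counting, then annotate the masked stretches by comparison to a library) can be illustrated with a toy sketch. This is not the algorithm used by Red, nor the code in the plant-scripts repository: the function names, the count threshold, the shared-k-mer score used as a stand-in for a real sequence-similarity search, and the two-entry library are all assumptions made purely for illustration.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def soft_mask(seq, k=4, min_count=3):
    """Lowercase every base covered by a k-mer seen at least min_count times.

    Toy stand-in for step 1 (k-mer-based repeat calling); the threshold
    rule is an assumption, not how Red scores k-mers.
    """
    counts = count_kmers(seq, k)
    upper = seq.upper()
    masked = [False] * len(upper)
    for i in range(len(upper) - k + 1):
        if counts[upper[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(c.lower() if m else c for c, m in zip(upper, masked))


def masked_intervals(masked_seq):
    """Return half-open (start, end) intervals of lowercase (masked) runs."""
    intervals, start = [], None
    for i, c in enumerate(masked_seq):
        if c.islower() and start is None:
            start = i
        elif not c.islower() and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(masked_seq)))
    return intervals


def best_library_match(query, library, k=4, min_shared=0.5):
    """Pick the library entry sharing the largest fraction of query k-mers.

    Toy stand-in for step 2 (annotation by sequence similarity). Returns
    (name, score), or None if no entry reaches min_shared.
    """
    query_kmers = set(count_kmers(query, k))
    if not query_kmers:
        return None
    best_name, best_score = None, 0.0
    for name, ref in library.items():
        score = len(query_kmers & set(count_kmers(ref, k))) / len(query_kmers)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= min_shared else None


if __name__ == "__main__":
    genome = "GCTAGCACACACACACTTGACG"          # made-up sequence
    library = {"toy_satellite": "ACACACACACAC",  # hypothetical entries
               "toy_other": "GGGTTTGGGTTT"}
    masked = soft_mask(genome)
    for start, end in masked_intervals(masked):
        print(start, end, best_library_match(masked[start:end], library))
```

Separating the two steps mirrors the design described in the abstract: the masking step needs no library at all, so a library is consulted only for the stretches that were already called as repeats.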
Record ID: INTA10882
Institution: Instituto Nacional de Tecnología Agropecuaria (INTA - Argentina)
Record format: dspace
Record last modified: 2021-12-10T13:48:21Z
Subjects: Genomes; Plant Genetics; Genetics
Affiliations: Instituto de Biotecnología
Contreras-Moreira, Bruno. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Filippi, Carla Valeria. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Agrobiotecnología y Biología Molecular (IABIMO); Argentina
Filippi, Carla Valeria. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Filippi, Carla Valeria. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Naamati, Guy. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
García Girón, Carlos. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Allen, James E. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Flicek, Paul. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Dates: 2021-12-10T13:45:33Z, 2021-09
Type: info:ar-repo/semantics/artículo, info:eu-repo/semantics/article, info:eu-repo/semantics/publishedVersion
ISSN: 1940-3372
Language code: eng
Access rights: info:eu-repo/semantics/openAccess
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/
File format: application/pdf
Source: The Plant Genome 14 (3) : e20143 (November 2021)