A review on knowledge and information extraction from PDF documents and storage approaches

Automating the extraction of information from Portable Document Format (PDF) documents represents a significant milestone in information extraction, potentially reducing manual labor and facilitating knowledge discovery across diverse domains such as healthcare, law, and biochemistry. However, the re...

Bibliographic Details
Main authors: Atagong, S.D., Tonnang, H.E.Z., Senagi, K., Wamalwa, M., Agboka, K.M., Odindi, J.
Format: Journal Article
Language: English
Published: 2025
Subjects: natural language processing, large language models, knowledge-base system, knowledge
Online access: https://hdl.handle.net/10568/178932
author Atagong, S.D.
Tonnang, H.E.Z.
Senagi, K.
Wamalwa, M.
Agboka, K.M.
Odindi, J.
collection Repository of Agricultural Research Outputs (CGSpace)
description Automating the extraction of information from Portable Document Format (PDF) documents represents a significant milestone in information extraction, potentially reducing manual labor and facilitating knowledge discovery across diverse domains such as healthcare, law, and biochemistry. However, the reliability of current solutions remains contested, particularly in terms of accuracy, domain adaptability, and the effort required to implement robust systems. This study presents a comprehensive review of existing literature on information extraction from PDF documents, conducted using the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) methodology. The review identifies prevailing trends and methodologies in information extraction, including rule-based systems, statistical learning approaches, and neural network-based models, while highlighting their limitations. Challenges include, among others, the rigidity and complexity of rule-based methods, the scarcity of well-annotated, domain-specific datasets for learning-based approaches, and issues such as hallucinations in large language models. To address these shortcomings, the study proposes a conceptual framework comprising nine core components: projects manager, documents manager, document pre-processor, ontology manager, information extractor, annotation engine, question-answering tool, knowledge visualizer, and data exporter. This framework is intended to enhance the accuracy, domain adaptability, and usability of PDF information extraction systems.
format Journal Article
id CGSpace178932
institution CGIAR Consortium
language English
publishDate 2025
record_format dspace
spelling CGSpace178932 2025-12-18T02:15:42Z
2025 2025-12-17T15:46:22Z 2025-12-17T15:46:22Z Journal Article https://hdl.handle.net/10568/178932 en Open Access application/pdf
Atagong, S.D., Tonnang, H.E.Z., Senagi, K., Wamalwa, M., Agboka, K.M. & Odindi, J. (2025). A review on knowledge and information extraction from PDF documents and storage approaches. Frontiers in Artificial Intelligence, 8: 1466092, 1-12.
title A review on knowledge and information extraction from PDF documents and storage approaches
topic natural language processing
large language models
knowledge-base system
knowledge
url https://hdl.handle.net/10568/178932