A review on knowledge and information extraction from PDF documents and storage approaches

Bibliographic Details
Main authors: Atagong, S.D., Tonnang, H.E.Z., Senagi, K., Wamalwa, M., Agboka, K.M., Odindi, J.
Format: Journal Article
Language: English
Published: 2025
Subjects:
Online access: https://hdl.handle.net/10568/178932
Description
Summary: Automating the extraction of information from Portable Document Format (PDF) documents represents a significant milestone in information extraction, potentially reducing manual labor and facilitating knowledge discovery across diverse domains such as healthcare, law, and biochemistry. However, the reliability of current solutions remains contested, particularly in terms of accuracy, domain adaptability, and the effort required to implement robust systems. This study presents a comprehensive review of existing literature on information extraction from PDF documents, conducted using the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) methodology. The review identifies prevailing trends and methodologies in information extraction, including rule-based systems, statistical learning approaches, and neural network-based models, and highlights their limitations. Challenges include the rigidity and complexity of rule-based methods, the scarcity of well-annotated, domain-specific datasets for learning-based approaches, and hallucinations in large language models, among others. To address these shortcomings, the study proposes a conceptual framework comprising nine core components: projects manager, documents manager, document pre-processor, ontology manager, information extractor, annotation engine, question-answering tool, knowledge visualizer, and data exporter. This framework is intended to enhance the accuracy, domain adaptability, and usability of PDF information extraction systems.