A review on knowledge and information extraction from PDF documents and storage approaches
Automating the extraction of information from Portable Document Format (PDF) documents represents a significant milestone in information extraction, potentially reducing manual labor and facilitating knowledge discovery across diverse domains such as healthcare, law, and biochemistry. However, the re...
| Main authors: | Atagong, S.D., Tonnang, H.E.Z., Senagi, K., Wamalwa, M., Agboka, K.M., Odindi, J. |
|---|---|
| Format: | Journal Article |
| Language: | English |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://hdl.handle.net/10568/178932 |
| _version_ | 1855538053839323136 |
|---|---|
| author | Atagong, S.D. Tonnang, H.E.Z. Senagi, K. Wamalwa, M. Agboka, K.M. Odindi, J. |
| collection | Repository of Agricultural Research Outputs (CGSpace) |
| description | Automating the extraction of information from Portable Document Format (PDF) documents represents a significant milestone in information extraction, potentially reducing manual labor and facilitating knowledge discovery across diverse domains such as healthcare, law, and biochemistry. However, the reliability of current solutions remains contested, particularly in terms of accuracy, domain adaptability, and the effort required to implement robust systems. This study presents a comprehensive review of existing literature on information extraction from PDF documents, conducted using the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) methodology. The review identifies prevailing trends and methodologies in information extraction, including rule-based systems, statistical learning approaches, and neural network-based models, while highlighting their limitations. Challenges include, among others, the rigidity and complexity of rule-based methods, the scarcity of well-annotated, domain-specific datasets for learning-based approaches, and issues such as hallucinations in large language models. To address these shortcomings, the study proposes a conceptual framework comprising nine core components: projects manager, documents manager, document pre-processor, ontology manager, information extractor, annotation engine, question-answering tool, knowledge visualizer, and data exporter. This framework is intended to enhance the accuracy, domain adaptability, and usability of PDF information extraction systems. |
| format | Journal Article |
| id | CGSpace178932 |
| institution | CGIAR Consortium |
| language | English |
| publishDate | 2025 |
| record_format | dspace |
| spelling | CGSpace178932 2025-12-18T02:15:42Z A review on knowledge and information extraction from PDF documents and storage approaches Atagong, S.D. Tonnang, H.E.Z. Senagi, K. Wamalwa, M. Agboka, K.M. Odindi, J. natural language processing large language models knowledge-base system knowledge Automating the extraction of information from Portable Document Format (PDF) documents represents a significant milestone in information extraction, potentially reducing manual labor and facilitating knowledge discovery across diverse domains such as healthcare, law, and biochemistry. However, the reliability of current solutions remains contested, particularly in terms of accuracy, domain adaptability, and the effort required to implement robust systems. This study presents a comprehensive review of existing literature on information extraction from PDF documents, conducted using the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) methodology. The review identifies prevailing trends and methodologies in information extraction, including rule-based systems, statistical learning approaches, and neural network-based models, while highlighting their limitations. Challenges include, among others, the rigidity and complexity of rule-based methods, the scarcity of well-annotated, domain-specific datasets for learning-based approaches, and issues such as hallucinations in large language models. To address these shortcomings, the study proposes a conceptual framework comprising nine core components: projects manager, documents manager, document pre-processor, ontology manager, information extractor, annotation engine, question-answering tool, knowledge visualizer, and data exporter. This framework is intended to enhance the accuracy, domain adaptability, and usability of PDF information extraction systems. 2025 2025-12-17T15:46:22Z 2025-12-17T15:46:22Z Journal Article https://hdl.handle.net/10568/178932 en Open Access application/pdf Atagong, S.D., Tonnang, H.E.Z., Senagi, K., Wamalwa, M., Agboka, K.M. & Odindi, J. (2025). A review on knowledge and information extraction from PDF documents and storage approaches. Frontiers in Artificial Intelligence, 8: 1466092, 1-12. |
| title | A review on knowledge and information extraction from PDF documents and storage approaches |
| topic | natural language processing large language models knowledge-base system knowledge |
| url | https://hdl.handle.net/10568/178932 |