Multimodal RAG Analysis of Product Datasheet

DOI:

https://doi.org/10.11113/oiji2024.12n2.309

Keywords:

Retrieval-augmented Generation, Multimodal RAG, Vision Model

Abstract

Large language models such as ChatGPT serve as multipurpose chatbots that can provide information across diverse disciplines. However, to generate timely and accurate responses, retrieval-augmented generation (RAG) has been devised to enhance the responses of these models. The release of vision models has paved the way for practitioners to perform multimodal RAG on documents that commonly combine text, images, and tables. Hence, this method is explored to analyze a product datasheet and match it against the minimum specifications required by potential clients. It is demonstrated that multimodal RAG performed better than basic RAG, which did not specifically consider the information contained in images and tables. While the performance of this method still lagged behind the commercially available GPT-4o, no information is exchanged with external parties, which could address privacy concerns regarding highly sensitive information. Incorporating the best practices highlighted in other studies in this domain could further improve the output of this method toward matching or exceeding the performance of commercially available tools.
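The retrieval step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy bag-of-words similarity stands in for the text/vision embedding models a real multimodal RAG pipeline would use, and the datasheet chunks (including the table and image-caption summaries) are hypothetical example data.

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding'; a real multimodal RAG pipeline
    would use a learned text or vision embedding model instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical datasheet chunks: plain text plus textual summaries
# standing in for table rows and image captions, so that information
# from all three modalities is available to the retriever.
chunks = [
    "Operating temperature range -20 to 60 degrees Celsius",
    "Table: output power 150 W, input voltage 12 V",
    "Image caption: front panel with four USB ports",
]

def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the client requirement."""
    q = embed(query)
    scored = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return scored[:k]

# A minimum-specification query from a potential client.
requirement = "minimum output power 100 W"
print(retrieve(requirement, chunks))
```

The retrieved chunk would then be passed, together with the client requirement, to a language model to judge whether the datasheet meets the specification; summarizing tables and images into text is one common way to make their content retrievable alongside plain text.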

Published

2024-12-27

How to Cite

Lau, D., Narayana Samy, G., Abdul Rahim, D. F., Maarop, N., Selvananthan, M., Ali, M., & Perumal, S. (2024). Multimodal RAG Analysis of Product Datasheet. Open International Journal of Informatics, 12(2), 1–12. https://doi.org/10.11113/oiji2024.12n2.309