
2024 | Original Paper | Book Chapter

MM-PhyQA: Multimodal Physics Question-Answering with Multi-image CoT Prompting

Authors: Avinash Anand, Janak Kapuriya, Apoorv Singh, Jay Saraf, Naman Lal, Astha Verma, Rushali Gupta, Rajiv Shah

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer Nature Singapore

Abstract

While Large Language Models (LLMs) can achieve human-level performance on various tasks, they continue to struggle with multi-step physics reasoning. To identify the shortcomings of existing models and facilitate further research in this area, we curated a novel dataset, MM-PhyQA, comprising well-constructed, high-school-level multimodal physics problems. By evaluating publicly available contemporary LLMs on these problems, both with and without multimodal input, we aim to shed light on their capabilities. For questions with multimodal input (here, images and text), we generated answers with zero-shot GPT-4 and with LLaVA and LLaVA-1.5, the latter two fine-tuned on our dataset. For text-only input, we tested the base and fine-tuned versions of the Mistral-7B and LLaMA2-7B models. We also introduce the novel Multi-Image Chain-of-Thought (MI-CoT) prompting technique, which, when used to train LLaVA-1.5 13B, yielded the best results on our dataset, with superior scores on most metrics and the highest test-set accuracy of 71.65%.
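
The abstract describes MI-CoT prompting only at a high level. As a rough, hypothetical sketch of the idea, the snippet below packs a multi-image physics question, a step-by-step rationale, and the final answer into a single LLaVA-style training record. The JSON keys, the `<image>` placeholder convention, and the worked two-image example are assumptions for illustration; the paper's actual prompt template and data schema are not reproduced here.

```python
import json

def build_mi_cot_sample(question: str, image_paths: list[str],
                        reasoning_steps: list[str], answer: str) -> dict:
    """Pack a physics question, its images, a step-by-step rationale,
    and the final answer into one LLaVA-style training record."""
    # One <image> placeholder per input image in the user turn
    # (assumed convention, mirroring LLaVA's conversation data format).
    image_tokens = "\n".join("<image>" for _ in image_paths)
    rationale = "\n".join(f"Step {i + 1}: {step}"
                          for i, step in enumerate(reasoning_steps))
    return {
        "images": image_paths,  # hypothetical key; the schema is illustrative
        "conversations": [
            {"from": "human",
             "value": f"{image_tokens}\n{question}\nLet's think step by step."},
            {"from": "gpt",
             "value": f"{rationale}\nFinal answer: {answer}"},
        ],
    }

# Illustrative two-image kinematics problem (values invented for the example).
sample = build_mi_cot_sample(
    question="From the two velocity-time graphs shown, which cart travels "
             "farther in the first 5 s?",
    image_paths=["cart_a_vt.png", "cart_b_vt.png"],
    reasoning_steps=[
        "Distance travelled is the area under each velocity-time curve.",
        "Cart A (triangular profile): 0.5 * 5 s * 4 m/s = 10 m.",
        "Cart B (constant velocity): 5 s * 1.5 m/s = 7.5 m.",
        "10 m > 7.5 m, so cart A travels farther.",
    ],
    answer="Cart A",
)
print(json.dumps(sample, indent=2))
```

In LLaVA-style training, each `<image>` token in the user turn is replaced at preprocessing time by the visual features of the corresponding image, which is the mechanism a multi-image chain-of-thought prompt would rely on; the chained "Step n" rationale in the assistant turn is what distinguishes CoT-style supervision from plain answer-only fine-tuning.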

Metadata
Title
MM-PhyQA: Multimodal Physics Question-Answering with Multi-image CoT Prompting
Authors
Avinash Anand
Janak Kapuriya
Apoorv Singh
Jay Saraf
Naman Lal
Astha Verma
Rushali Gupta
Rajiv Shah
Copyright Year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-2262-4_5
