
2024 | Original Paper | Book Chapter

MM-PhyQA: Multimodal Physics Question-Answering with Multi-image CoT Prompting

Authors: Avinash Anand, Janak Kapuriya, Apoorv Singh, Jay Saraf, Naman Lal, Astha Verma, Rushali Gupta, Rajiv Shah

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer Nature Singapore

Abstract

While Large Language Models (LLMs) can achieve human-level performance on various tasks, they continue to struggle with multi-step physics reasoning. To identify the shortcomings of existing models and facilitate further research in this area, we curated a novel dataset, MM-PhyQA, comprising well-constructed, high-school-level multimodal physics problems. By evaluating publicly available contemporary LLMs on these problems, both with and without multimodal input, we aim to shed light on their capabilities. For questions with multimodal input (here, images and text), we generated answers with zero-shot GPT-4 and with LLaVA and LLaVA-1.5, the latter two fine-tuned on our dataset. For text-only input, we tested the base and fine-tuned versions of the Mistral-7B and LLaMA2-7B models. We also introduce the novel Multi-Image Chain-of-Thought (MI-CoT) prompting technique, which, when used to train LLaVA-1.5 13B, yielded the best results on our dataset, with superior scores on most metrics and the highest test-set accuracy of 71.65%.
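
The abstract describes MI-CoT prompting only at a high level. As a rough, hypothetical sketch of the idea, the snippet below packs a multi-image physics question, a step-by-step rationale, and the final answer into a single LLaVA-style training record. The JSON keys, the `<image>` placeholder convention, and the worked two-image example are assumptions for illustration; the paper's actual prompt template and data schema are not reproduced here.

```python
import json

def build_mi_cot_sample(question: str, image_paths: list[str],
                        reasoning_steps: list[str], answer: str) -> dict:
    """Pack a physics question, its images, a step-by-step rationale,
    and the final answer into one LLaVA-style training record."""
    # One <image> placeholder per input image in the user turn
    # (assumed convention, mirroring LLaVA's conversation data format).
    image_tokens = "\n".join("<image>" for _ in image_paths)
    rationale = "\n".join(f"Step {i + 1}: {step}"
                          for i, step in enumerate(reasoning_steps))
    return {
        "images": image_paths,  # hypothetical key; the schema is illustrative
        "conversations": [
            {"from": "human",
             "value": f"{image_tokens}\n{question}\nLet's think step by step."},
            {"from": "gpt",
             "value": f"{rationale}\nFinal answer: {answer}"},
        ],
    }

# Illustrative two-image kinematics problem (values invented for the example).
sample = build_mi_cot_sample(
    question="From the two velocity-time graphs shown, which cart travels "
             "farther in the first 5 s?",
    image_paths=["cart_a_vt.png", "cart_b_vt.png"],
    reasoning_steps=[
        "Distance travelled is the area under each velocity-time curve.",
        "Cart A (triangular profile): 0.5 * 5 s * 4 m/s = 10 m.",
        "Cart B (constant velocity): 5 s * 1.5 m/s = 7.5 m.",
        "10 m > 7.5 m, so cart A travels farther.",
    ],
    answer="Cart A",
)
print(json.dumps(sample, indent=2))
```

In LLaVA-style training, each `<image>` token in the user turn is replaced at preprocessing time by the visual features of the corresponding image, which is the mechanism a multi-image chain-of-thought prompt would rely on; the chained "Step n" rationale in the assistant turn is what distinguishes CoT-style supervision from plain answer-only fine-tuning.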

Metadata
Title
MM-PhyQA: Multimodal Physics Question-Answering with Multi-image CoT Prompting
Authors
Avinash Anand
Janak Kapuriya
Apoorv Singh
Jay Saraf
Naman Lal
Astha Verma
Rushali Gupta
Rajiv Shah
Copyright Year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-2262-4_5
