2024 | Original Paper | Book Chapter

Transformer based Multitask Learning for Image Captioning and Object Detection

Authors: Debolena Basak, P. K. Srijith, Maunendra Sankar Desarkar

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer Nature Singapore

Abstract

In several real-world scenarios, such as autonomous navigation and mobility, image captioning and object detection play a crucial role in obtaining a better visual understanding of the surroundings. This work introduces a novel multitask learning framework that combines image captioning and object detection into a joint model. We propose TICOD, a Transformer-based Image Captioning and Object Detection model, for jointly training both tasks by combining the losses obtained from the image captioning and object detection networks. By leveraging joint training, the model benefits from the complementary information shared between the two tasks, leading to improved performance for image captioning. Our approach utilizes a transformer-based architecture that enables end-to-end network integration of image captioning and object detection and performs both tasks jointly. We evaluate the effectiveness of our approach through comprehensive experiments on the MS-COCO dataset. Our model outperforms the baselines from the image captioning literature, achieving a 3.65% improvement in BERTScore.
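
The core idea, training one model on both tasks by summing their losses over a shared network, can be made concrete with a short sketch. The PyTorch code below is a minimal illustration under simplified assumptions: the flattened-image backbone, single-box detection head, choice of loss terms, and the weighting factor `lam` are placeholders invented for this example, not TICOD's actual components, which the abstract does not specify.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of joint training by combining task losses over a shared
# backbone. Backbone, heads, targets, and `lam` are illustrative
# assumptions, not TICOD's actual architecture.

class JointCaptionDetectionModel(nn.Module):
    def __init__(self, in_dim=3 * 32 * 32, feat_dim=256,
                 vocab_size=1000, num_classes=80):
        super().__init__()
        # Shared image encoder (stand-in for a transformer backbone).
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(in_dim, feat_dim), nn.ReLU())
        # Captioning head: next-token logits over the caption vocabulary.
        self.caption_head = nn.Linear(feat_dim, vocab_size)
        # Detection heads: class logits and one box per image (toy setup).
        self.cls_head = nn.Linear(feat_dim, num_classes)
        self.box_head = nn.Linear(feat_dim, 4)

    def forward(self, images):
        feats = self.backbone(images)
        return (self.caption_head(feats),
                self.cls_head(feats),
                self.box_head(feats))

def joint_loss(cap_logits, cap_tgt, cls_logits, cls_tgt,
               boxes, box_tgt, lam=1.0):
    """Sum the captioning and detection losses into a single objective."""
    l_caption = F.cross_entropy(cap_logits, cap_tgt)
    l_detection = (F.cross_entropy(cls_logits, cls_tgt)
                   + F.l1_loss(boxes, box_tgt))
    return l_caption + lam * l_detection

# Toy usage: one optimization step on random data.
model = JointCaptionDetectionModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.randn(8, 3, 32, 32)
cap_tgt = torch.randint(0, 1000, (8,))
cls_tgt = torch.randint(0, 80, (8,))
box_tgt = torch.rand(8, 4)

cap_logits, cls_logits, boxes = model(images)
loss = joint_loss(cap_logits, cap_tgt, cls_logits, cls_tgt, boxes, box_tgt)
opt.zero_grad()
loss.backward()
opt.step()
```

The single backward pass through the combined loss is what lets the shared backbone receive gradients from both tasks; this is the mechanism behind the complementary-information benefit the abstract describes.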

Metadata
Title
Transformer based Multitask Learning for Image Captioning and Object Detection
Authors
Debolena Basak
P. K. Srijith
Maunendra Sankar Desarkar
Copyright Year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-2253-2_21
