
2024 | Original Paper | Book Chapter

ELOQUENT CLEF Shared Tasks for Evaluation of Generative Language Model Quality

Written by: Jussi Karlgren, Luise Dürlich, Evangelia Gogoulou, Liane Guillou, Joakim Nivre, Magnus Sahlgren, Aarne Talman

Published in: Advances in Information Retrieval

Publisher: Springer Nature Switzerland


Abstract

ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to bring together some high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. The selected tasks for this first year of ELOQUENT are (1) probing a language model for topical competence; (2) assessing the ability of models to generate and detect hallucinations; (3) assessing the robustness of a model output given variation in the input prompts; and (4) establishing the possibility to distinguish human-generated text from machine-generated text.
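Task (3) above, robustness under prompt variation, can be illustrated with a minimal sketch. This is not the ELOQUENT evaluation protocol itself; the `model` stub, the prompt paraphrases, and the token-set Jaccard similarity metric are all illustrative assumptions, standing in for a real generative model and a real semantic similarity measure.

```python
# Hypothetical sketch of a prompt-robustness check: score how stable a
# model's output is across paraphrases of the same prompt. The `model`
# below is a toy stub, and token-set Jaccard is an illustrative metric only.

from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def robustness_score(model, prompt_variants: list[str]) -> float:
    """Mean pairwise similarity of outputs over all prompt paraphrases."""
    outputs = [model(p) for p in prompt_variants]
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy stand-in for a generative model: always gives the same answer,
    # so the robustness score is maximal.
    def model(prompt: str) -> str:
        return "the capital of France is Paris"

    variants = [
        "What is the capital of France?",
        "Name the capital city of France.",
        "France's capital is which city?",
    ]
    print(robustness_score(model, variants))  # 1.0 for a fully stable stub
```

A real evaluation would replace the stub with calls to the model under test and the Jaccard metric with a stronger semantic similarity measure; the score then quantifies how much output drifts as prompts vary.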

Metadata
Title
ELOQUENT CLEF Shared Tasks for Evaluation of Generative Language Model Quality
Written by
Jussi Karlgren
Luise Dürlich
Evangelia Gogoulou
Liane Guillou
Joakim Nivre
Magnus Sahlgren
Aarne Talman
Copyright year
2024
DOI
https://doi.org/10.1007/978-3-031-56069-9_63
