
2024 | OriginalPaper | Book chapter

Class Ratio and Its Implications for Reproducibility and Performance in Record Linkage

Authors: Jeremy Foxcroft, Peter Christen, Luiza Antonie

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer Nature Singapore

Abstract

Record linkage is the process of identifying and matching records from different datasets that refer to the same entity. It can be framed as a pairwise binary classification problem, in which a classification model predicts whether a pair of records match (i.e., refer to the same entity) or not. Although training data is central to model building and to the resulting predictions, the literature rarely reports training data details, especially the ratio of matching to non-matching examples. This lack of reporting has a significant impact on both model building and the reproducibility of research studies. In this paper we demonstrate how the performance measures commonly used in record linkage (precision, recall, and \(F_1\)-measure) vary with respect to this ratio. Specifically, we show that the class imbalance ratio of the training data has a substantial impact on classifier performance, with more imbalanced training data resulting in lower performance. Furthermore, we examine the impact on performance when the class ratio of the test data differs from that of the training data. Our extensive experimental study allows us to offer practical advice for constructing training data, building record linkage models, measuring performance, and reporting training data details.
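To make the dependence of these measures on the class ratio concrete, the following is a minimal sketch and not the authors' experimental setup. It assumes a hypothetical classifier whose true positive rate and false positive rate stay fixed, and recomputes precision, recall, and \(F_1\) as the number of non-matching pairs in the test data grows. The function name `precision_recall_f1` and all numeric values are illustrative assumptions.

```python
def precision_recall_f1(tpr, fpr, n_matches, n_non_matches):
    """Expected precision, recall and F1 for a classifier with fixed
    true positive rate (tpr) and false positive rate (fpr), evaluated on
    a test set with the given numbers of matching and non-matching pairs.
    All inputs are hypothetical and chosen for illustration only."""
    tp = tpr * n_matches          # matches correctly classified as matches
    fn = n_matches - tp           # matches missed by the classifier
    fp = fpr * n_non_matches      # non-matches wrongly classified as matches

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Assumed classifier behaviour: 90% of matches found, 1% of non-matches
# wrongly accepted. Only the test-set class ratio changes between rows.
for non_matches_per_match in (1, 10, 100):
    p, r, f = precision_recall_f1(
        tpr=0.9, fpr=0.01,
        n_matches=1000, n_non_matches=1000 * non_matches_per_match,
    )
    print(f"1:{non_matches_per_match}  precision={p:.3f}  recall={r:.3f}  F1={f:.3f}")
```

Under these assumptions recall is unaffected by the ratio, while precision and therefore \(F_1\) fall as non-matches become more numerous. This is one reason the class ratio needs to be reported alongside the performance figures.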

Metadata

Title: Class Ratio and Its Implications for Reproducibility and Performance in Record Linkage

Authors: Jeremy Foxcroft, Peter Christen, Luiza Antonie

Copyright year: 2024

Publisher: Springer Nature Singapore

DOI: https://doi.org/10.1007/978-981-97-2242-6_16
