
2024 | Original Paper | Book Chapter

Machine Learning Metrics for Network Datasets Evaluation

Written by: Dominik Soukup, Daniel Uhříček, Daniel Vašata, Tomáš Čejka

Published in: ICT Systems Security and Privacy Protection

Publisher: Springer Nature Switzerland

Abstract

High-quality datasets are an essential requirement for leveraging machine learning (ML) in data processing and, more recently, in network security. However, dataset quality is very often overlooked or underestimated. Reliable metrics that measure and describe an input dataset make it possible to assess whether the dataset is feasible for its intended use. Imperfect datasets may require optimization or updating, e.g., by including more data and merging class labels. If a dataset does not contain enough information, applying ML algorithms to it will not bring practical value. This work addresses the neglected topics of dataset evaluation and missing metrics. We propose three novel metrics that estimate the quality of an input dataset and help with improving it or building a new one. This paper describes experiments on public datasets that show the benefits of the proposed metrics, together with theoretical definitions that make the metrics more straightforward to interpret. Additionally, we have implemented and published Python code so that the scientific community can adopt the metrics.

Metadata
Title
Machine Learning Metrics for Network Datasets Evaluation
Written by
Dominik Soukup
Daniel Uhříček
Daniel Vašata
Tomáš Čejka
Copyright year
2024
DOI
https://doi.org/10.1007/978-3-031-56326-3_22
