2024 | OriginalPaper | Book chapter

Machine Learning Metrics for Network Datasets Evaluation

Authors: Dominik Soukup, Daniel Uhříček, Daniel Vašata, Tomáš Čejka

Published in: ICT Systems Security and Privacy Protection

Publisher: Springer Nature Switzerland

Abstract

High-quality datasets are an essential requirement for leveraging machine learning (ML) in data processing and, more recently, in network security as well. However, the quality of datasets is very often overlooked or underestimated. Reliable metrics that measure and describe an input dataset make it possible to assess whether the dataset is feasible to use. Imperfect datasets may require optimization or updating, e.g., by including more data and merging class labels. Applying ML algorithms will not bring practical value if a dataset does not contain enough information. This work addresses the neglected topics of dataset evaluation and missing metrics. We propose three novel metrics that estimate the quality of an input dataset and help with improving it or building a new one. The paper gives theoretical definitions of the metrics for more straightforward interpretation and describes experiments on public datasets that show their benefits. Additionally, we have implemented and published Python code so that the scientific community can adopt the metrics.


https://github.com/soukudom/NDVM.

login.microsoftonline.com, settings-win.data.microsoft.com, outlook.office365.com, api.github.com, v10.events.data.microsoft.com.


Title: Machine Learning Metrics for Network Datasets Evaluation
Authors: Dominik Soukup, Daniel Uhříček, Daniel Vašata, Tomáš Čejka
Publisher: Springer Nature Switzerland
Book: ICT Systems Security and Privacy Protection
Print ISBN: 978-3-031-56325-6
Electronic ISBN: 978-3-031-56326-3
Copyright year: 2024
DOI: https://doi.org/10.1007/978-3-031-56326-3_22
