Skip to main content

2024 | OriginalPaper | Buchkapitel

VulMAE: Graph Masked Autoencoders for Vulnerability Detection from Source and Binary Codes

verfasst von : Mahmoud Zamani, Saquib Irtiza, Latifur Khan, Kevin W. Hamlen

Erschienen in: Foundations and Practice of Security

Verlag: Springer Nature Switzerland

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The first graph masked auto-encoder (GraphMAE) model for software vulnerability detection is designed and developed, with a comparative evaluation against other self-supervised learning (SSL) methods. Evaluation of the domain-specific GraphMAE model (VulMAE) for the vulnerability detection task shows exceptional promise, outperforming all other baseline models in the study. The approach is particularly well-suited for cybersecurity applications where gathering substantial real-world labeled samples is difficult, since graph SSL methods (e.g., contrastive and generative models) offer data classification in AI tasks without requiring vast amounts of labeled data for effective training.
The study fills a key gap in the literature on automated and machine-assisted discovery and patching of software security vulnerabilities, which has become increasingly critical with the dramatic increase in modern software complexity, but for which graph neural network (GNN) approaches are understudied relative to traditional processes, such as manual source code auditing and fuzzing. To conduct the study, the evaluation applies models to source and binary software components sourced from the National Vulnerability Database (NVD). A new dataset is curated by extracting vulnerable code fragments from six applications with NVD-documented security flaws and converting them to four graph types using specialized tools based on code property graphs and binary semantics lifting. The data is used to train contrastive and generative learning models for comparison. VulMAE achieves a weighted F1 score of 0.936 and a weighted Recall of 0.938, which is the highest of all tested methods.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Booth, H., Rike, D., Witte, G.A.: The national vulnerability database (NVD): Overview. ITL Bulletin, National Institute of Standards and Technology (2013) Booth, H., Rike, D., Witte, G.A.: The national vulnerability database (NVD): Overview. ITL Bulletin, National Institute of Standards and Technology (2013)
2.
Zurück zum Zitat Brumley, D., Jager, I., Avgerinos, T., Schwartz, E.J.: BAP: a binary analysis platform. In: Proceedings of International Conference on Computer Aided Verification, pp. 463–469 (2011) Brumley, D., Jager, I., Avgerinos, T., Schwartz, E.J.: BAP: a binary analysis platform. In: Proceedings of International Conference on Computer Aided Verification, pp. 463–469 (2011)
3.
Zurück zum Zitat Chakraborty, S., Krishna, R., Ding, Y., Ray, B.: Deep learning based vulnerability detection: are we there yet. IEEE Trans. Softw. Eng. 48, 3280–3296 (2022)CrossRef Chakraborty, S., Krishna, R., Ding, Y., Ray, B.: Deep learning based vulnerability detection: are we there yet. IEEE Trans. Softw. Eng. 48, 3280–3296 (2022)CrossRef
4.
Zurück zum Zitat Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artifi. Intell. Res. 16(1), 321–357 (2002)CrossRef Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artifi. Intell. Res. 16(1), 321–357 (2002)CrossRef
5.
Zurück zum Zitat Croft, R., Newlands, D., Chen, Z., Babar, M.A.: An empirical study of rule-based and learning-based approaches for static application security testing. In: Proceedings of ACM/IEEE International Symposium Empirical Software Engineering and Measurement (2021) Croft, R., Newlands, D., Chen, Z., Babar, M.A.: An empirical study of rule-based and learning-based approaches for static application security testing. In: Proceedings of ACM/IEEE International Symposium Empirical Software Engineering and Measurement (2021)
7.
Zurück zum Zitat Hassani, K., Khasahmadi, A.H.: Contrastive multi-view representation learning on graphs. In: Proceedings of International Conference on Machine Learning, pp. 4116–4126 (2020) Hassani, K., Khasahmadi, A.H.: Contrastive multi-view representation learning on graphs. In: Proceedings of International Conference on Machine Learning, pp. 4116–4126 (2020)
8.
Zurück zum Zitat Hin, D., Kan, A., Chen, H., Babar, M.A.: LineVD: statement-level vulnerability detection using graph neural networks. In: Proceedings of International Conference on Mining Software Repositories, pp. 596–607 (2022) Hin, D., Kan, A., Chen, H., Babar, M.A.: LineVD: statement-level vulnerability detection using graph neural networks. In: Proceedings of International Conference on Mining Software Repositories, pp. 596–607 (2022)
9.
Zurück zum Zitat Hjelm, R.D., et al.: Learning deep representations by mutual information estimation and maximization. In: Proceedings of International Conference on Learning Representation (2019) Hjelm, R.D., et al.: Learning deep representations by mutual information estimation and maximization. In: Proceedings of International Conference on Learning Representation (2019)
10.
Zurück zum Zitat Hohnka, M.J., Miller, J.A., Dacumos, K.M., Fritton, T.J., Erdley, J.D., Long, L.N.: Evaluation of compiler-induced vulnerabilities. J. Aerospace Inform. Syst. 16(10), 409–426 (2019)CrossRef Hohnka, M.J., Miller, J.A., Dacumos, K.M., Fritton, T.J., Erdley, J.D., Long, L.N.: Evaluation of compiler-induced vulnerabilities. J. Aerospace Inform. Syst. 16(10), 409–426 (2019)CrossRef
11.
Zurück zum Zitat Hou, Z., Liu, X., Cen, Y., Dong, Y., Yang, H., Wang, C., Tang, J.: GraphMAE: self-supervised masked graph autoencoders. In: Proceedings of ACM Conference on Knowledge Discovery and Data Mining, pp. 594–604 (2022) Hou, Z., Liu, X., Cen, Y., Dong, Y., Yang, H., Wang, C., Tang, J.: GraphMAE: self-supervised masked graph autoencoders. In: Proceedings of ACM Conference on Knowledge Discovery and Data Mining, pp. 594–604 (2022)
12.
Zurück zum Zitat Kazius, J., McGuire, R., Bursi, R.: Derivation and validation of toxicophores for mutagenicity prediction. J. Med. Chem. 48(1), 312–320 (2005)CrossRef Kazius, J., McGuire, R., Bursi, R.: Derivation and validation of toxicophores for mutagenicity prediction. J. Med. Chem. 48(1), 312–320 (2005)CrossRef
14.
Zurück zum Zitat Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: Proceedings of International Conferen on Learning Representation (Poster) (2017) Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: Proceedings of International Conferen on Learning Representation (Poster) (2017)
15.
Zurück zum Zitat Le, T., et al.: Maximal divergence sequential autoencoder for binary software vulnerability detection. In: Proceedings of International Conference on Learning Representation (2019) Le, T., et al.: Maximal divergence sequential autoencoder for binary software vulnerability detection. In: Proceedings of International Conference on Learning Representation (2019)
16.
Zurück zum Zitat Li, X., Feng, B., Li, G., Li, T., He, M.: A vulnerability detection system based on fusion of assembly code and source code. Sec. Commun. Netw. 2021 (2021) Li, X., Feng, B., Li, G., Li, T., He, M.: A vulnerability detection system based on fusion of assembly code and source code. Sec. Commun. Netw. 2021 (2021)
17.
Zurück zum Zitat Li, Z., Zou, D., Xu, S., Chen, Z., Zhu, Y., Jin, H.: VulDeeLocator: a deep learning-based fine-grained vulnerability detector. IEEE Trans. Dependable Sec. Comput. 19(4), 2821–2837 (2021)CrossRef Li, Z., Zou, D., Xu, S., Chen, Z., Zhu, Y., Jin, H.: VulDeeLocator: a deep learning-based fine-grained vulnerability detector. IEEE Trans. Dependable Sec. Comput. 19(4), 2821–2837 (2021)CrossRef
18.
Zurück zum Zitat Li, Z., Zou, D., Xu, S., Jin, H., Qi, H., Hu, J.: Vulpecker: an automated vulnerability detection system based on code similarity analysis. In: Proceedings of Annual Computer Security Applications Conference, pp. 201–213 (2016) Li, Z., Zou, D., Xu, S., Jin, H., Qi, H., Hu, J.: Vulpecker: an automated vulnerability detection system based on code similarity analysis. In: Proceedings of Annual Computer Security Applications Conference, pp. 201–213 (2016)
19.
Zurück zum Zitat Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., Chen, Z.: SySeVR: a framework for using deep learning to detect software vulnerabilities. IEEE Trans. Dependable Sec. Comput. 19(4), 2244–2258 (2021)CrossRef Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., Chen, Z.: SySeVR: a framework for using deep learning to detect software vulnerabilities. IEEE Trans. Dependable Sec. Comput. 19(4), 2244–2258 (2021)CrossRef
20.
Zurück zum Zitat Li, Z., et al.: Vuldeepecker: a deep learning-based system for vulnerability detection. In: Proceedings of Annual Network & Distributed System Security Symposium (2018) Li, Z., et al.: Vuldeepecker: a deep learning-based system for vulnerability detection. In: Proceedings of Annual Network & Distributed System Security Symposium (2018)
21.
Zurück zum Zitat Lin, G., Zhang, J., Luo, W., Pan, L., Xiang, Y.: POSTER: vulnerability discovery with function representation learning from unlabeled projects. In: Proceedings of ACM Conference on Computer and Communications Security, pp. 2539–2541 (2017) Lin, G., Zhang, J., Luo, W., Pan, L., Xiang, Y.: POSTER: vulnerability discovery with function representation learning from unlabeled projects. In: Proceedings of ACM Conference on Computer and Communications Security, pp. 2539–2541 (2017)
22.
Zurück zum Zitat Lin, G.: Cross-project transfer representation learning for vulnerable function discovery. IEEE Trans. Indus. Inform. 14(7), 3289–3297 (2018)CrossRef Lin, G.: Cross-project transfer representation learning for vulnerable function discovery. IEEE Trans. Indus. Inform. 14(7), 3289–3297 (2018)CrossRef
23.
Zurück zum Zitat Lipp, S., Banescu, S., Pretschner, A.: An empirical study on the effectiveness of static C code analyzers for vulnerability detection. In: Proceedings of ACM International Symposium on Software Testing and Analysis, pp. 544–555 (2022) Lipp, S., Banescu, S., Pretschner, A.: An empirical study on the effectiveness of static C code analyzers for vulnerability detection. In: Proceedings of ACM International Symposium on Software Testing and Analysis, pp. 544–555 (2022)
24.
Zurück zum Zitat Ma, R., Jian, Z., Chen, G., Ma, K., Chen, Y.: ReJection: a AST-based reentrancy vulnerability detection method. In: Proceedings of Chinese Conference on Trusted Computing and Information Security, pp. 58–71 (2020) Ma, R., Jian, Z., Chen, G., Ma, K., Chen, Y.: ReJection: a AST-based reentrancy vulnerability detection method. In: Proceedings of Chinese Conference on Trusted Computing and Information Security, pp. 58–71 (2020)
27.
Zurück zum Zitat Pinconschi, E., Abreu, R., Adão, P.: A comparative study of automatic program repair techniques for security vulnerabilities. In: Proceedings of IEEE International Symposium on Software Reliability Engineering, pp. 196–207 (2021) Pinconschi, E., Abreu, R., Adão, P.: A comparative study of automatic program repair techniques for security vulnerabilities. In: Proceedings of IEEE International Symposium on Software Reliability Engineering, pp. 196–207 (2021)
28.
Zurück zum Zitat Russell, R., et al.: klM.: Automated vulnerability detection in source code using deep representation learning. In: Proceedings of IEEE International Conference on Machine Learning and Applications, pp. 757–762 (2018) Russell, R., et al.: klM.: Automated vulnerability detection in source code using deep representation learning. In: Proceedings of IEEE International Conference on Machine Learning and Applications, pp. 757–762 (2018)
29.
Zurück zum Zitat Schlichtkrull, M., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: Proceedings of European Semantic Web Conference, pp. 593–607 (2018) Schlichtkrull, M., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: Proceedings of European Semantic Web Conference, pp. 593–607 (2018)
30.
Zurück zum Zitat Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K., Borgwardt, K.M.: Weisfeiler-Lehman graph kernels. J. Mach. Learn. Res. 12(9) (2011) Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K., Borgwardt, K.M.: Weisfeiler-Lehman graph kernels. J. Mach. Learn. Res. 12(9) (2011)
31.
Zurück zum Zitat Shimchik, N., Ignatyev, V., Belevantsev, A.: Improving accuracy and completeness of source code static taint analysis. In: Ivannikov Ispras Open Conference, pp. 61–68 (2021) Shimchik, N., Ignatyev, V., Belevantsev, A.: Improving accuracy and completeness of source code static taint analysis. In: Ivannikov Ispras Open Conference, pp. 61–68 (2021)
32.
Zurück zum Zitat Sun, F.Y., Hoffmann, J., Verma, V., Tang, J.: Infograph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In: Proceedings of International Conference on Learning Representations (2020) Sun, F.Y., Hoffmann, J., Verma, V., Tang, J.: Infograph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In: Proceedings of International Conference on Learning Representations (2020)
33.
Zurück zum Zitat Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: Proceedings of International Conference on Learning Representation (2017) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: Proceedings of International Conference on Learning Representation (2017)
34.
Zurück zum Zitat Veličković, P., Fedus, W., Hamilton, W.L., Liò, P., Bengio, Y., Hjelm, R.D.: Deep graph infomax. In: Proceedings of International Conference on Learning Representation (2019) Veličković, P., Fedus, W., Hamilton, W.L., Liò, P., Bengio, Y., Hjelm, R.D.: Deep graph infomax. In: Proceedings of International Conference on Learning Representation (2019)
35.
Zurück zum Zitat Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? In: Proceedings of International Conference on Learning Representation (2019) Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? In: Proceedings of International Conference on Learning Representation (2019)
36.
Zurück zum Zitat Xu, L., Sun, F., Su, Z.: Constructing precise control flow graphs from binaries. The University of California, Davis, Tech. rep. (2009) Xu, L., Sun, F., Su, Z.: Constructing precise control flow graphs from binaries. The University of California, Davis, Tech. rep. (2009)
37.
Zurück zum Zitat Yamaguchi, F., Golde, N., Arp, D., Rieck, K.: Modeling and discovering vulnerabilities with code property graphs. In: Proceedings IEEE Symposium on Security & Privacy, pp. 590–604 (2014) Yamaguchi, F., Golde, N., Arp, D., Rieck, K.: Modeling and discovering vulnerabilities with code property graphs. In: Proceedings IEEE Symposium on Security & Privacy, pp. 590–604 (2014)
38.
Zurück zum Zitat Yamaguchi, F., Lindner, F.F., Rieck, K.: Vulnerability extrapolation: assisted discovery of vulnerabilities using machine learning. In: Proceedings of USENIX Workshop Offensive Technologies, pp. 118–127 (2011) Yamaguchi, F., Lindner, F.F., Rieck, K.: Vulnerability extrapolation: assisted discovery of vulnerabilities using machine learning. In: Proceedings of USENIX Workshop Offensive Technologies, pp. 118–127 (2011)
39.
Zurück zum Zitat Yanardag, P., Vishwanathan, S.: Deep graph kernels. In: Proceedings of ACM International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374 (2015) Yanardag, P., Vishwanathan, S.: Deep graph kernels. In: Proceedings of ACM International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374 (2015)
40.
Zurück zum Zitat You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z., Shen, Y.: Graph contrastive learning with augmentations. In: Proceedings of Conference on Neural Information Processing Systems, pp. 5812–5823 (2020) You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z., Shen, Y.: Graph contrastive learning with augmentations. In: Proceedings of Conference on Neural Information Processing Systems, pp. 5812–5823 (2020)
41.
Zurück zum Zitat Zhang, H., Wu, Q., Yan, J., Wipf, D., Yu, P.S.: From canonical correlation analysis to self-supervised graph neural networks. In: Proceedings of Conference on Neural Information Processing Systems, pp. 76–89 (2021) Zhang, H., Wu, Q., Yan, J., Wipf, D., Yu, P.S.: From canonical correlation analysis to self-supervised graph neural networks. In: Proceedings of Conference on Neural Information Processing Systems, pp. 76–89 (2021)
42.
Zurück zum Zitat Zhou, M., et al.: A method for software vulnerability detection based on improved control flow graph. Wuhan University J. Nat. Sci. 24(2), 149–160 (2019)CrossRef Zhou, M., et al.: A method for software vulnerability detection based on improved control flow graph. Wuhan University J. Nat. Sci. 24(2), 149–160 (2019)CrossRef
43.
Zurück zum Zitat Zhou, Y., Liu, S., Siow, J., Du, X., Liu, Y.: Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In: Proceedings of Conference on Neural Information Processing Systems, pp. 10197–10207 (2019) Zhou, Y., Liu, S., Siow, J., Du, X., Liu, Y.: Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In: Proceedings of Conference on Neural Information Processing Systems, pp. 10197–10207 (2019)
44.
Zurück zum Zitat Zhu, Q., Du, B., Yan, P.: Self-supervised training of graph convolutional networks. In: Proceedings of International Conference on Machine Learning, Online (2020) Zhu, Q., Du, B., Yan, P.: Self-supervised training of graph convolutional networks. In: Proceedings of International Conference on Machine Learning, Online (2020)
Metadaten
Titel
VulMAE: Graph Masked Autoencoders for Vulnerability Detection from Source and Binary Codes
verfasst von
Mahmoud Zamani
Saquib Irtiza
Latifur Khan
Kevin W. Hamlen
Copyright-Jahr
2024
DOI
https://doi.org/10.1007/978-3-031-57537-2_12

Premium Partner