
2024 | OriginalPaper | Chapter

VulMAE: Graph Masked Autoencoders for Vulnerability Detection from Source and Binary Codes

Authors: Mahmoud Zamani, Saquib Irtiza, Latifur Khan, Kevin W. Hamlen

Published in: Foundations and Practice of Security

Publisher: Springer Nature Switzerland

Abstract

The first graph masked autoencoder (GraphMAE) model for software vulnerability detection is designed and developed, and is evaluated against other self-supervised learning (SSL) methods. The domain-specific GraphMAE model (VulMAE) shows exceptional promise on the vulnerability detection task, outperforming all baseline models in the study. The approach is particularly well suited to cybersecurity applications, where gathering substantial real-world labeled samples is difficult: graph SSL methods (e.g., contrastive and generative models) can classify data without requiring vast amounts of labeled data for effective training.
The study fills a key gap in the literature on automated and machine-assisted discovery and patching of software security vulnerabilities. Such discovery has become increasingly critical with the dramatic increase in modern software complexity, yet graph neural network (GNN) approaches remain understudied relative to traditional processes such as manual source code auditing and fuzzing. The evaluation applies the models to source and binary software components drawn from the National Vulnerability Database (NVD). A new dataset is curated by extracting vulnerable code fragments from six applications with NVD-documented security flaws and converting them to four graph types using specialized tools based on code property graphs and binary semantics lifting. The data are used to train contrastive and generative learning models for comparison. VulMAE achieves a weighted F1 score of 0.936 and a weighted recall of 0.938, the highest of all tested methods.
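For readers unfamiliar with the reported metrics, the following is a minimal sketch of how a weighted recall and a weighted F1 score can be computed. It assumes the common support-weighted definition (per-class scores averaged with weights proportional to the number of true samples in each class); the abstract does not state the exact averaging procedure used in the chapter. The function name weighted_recall_f1 and the toy labels are hypothetical and are not taken from the chapter or its dataset.

```python
from collections import Counter


def weighted_recall_f1(y_true, y_pred):
    """Return (weighted recall, weighted F1) for two equal-length label lists.

    Assumption: per-class scores are weighted by class support, i.e. by the
    number of true samples of each class. This is an illustrative definition,
    not necessarily the procedure used in the chapter.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    weighted_recall = 0.0
    weighted_f1 = 0.0

    for cls, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
        fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / n  # n = tp + fn, and n > 0 for every class in support
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = n / total
        weighted_recall += weight * recall
        weighted_f1 += weight * f1

    return weighted_recall, weighted_f1


# Toy labels only (1 = vulnerable, 0 = benign); not data from the chapter.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
toy_recall, toy_f1 = weighted_recall_f1(toy_true, toy_pred)
print(f"weighted recall={toy_recall:.3f}, weighted F1={toy_f1:.3f}")
```

Under this definition the weighted recall equals plain accuracy, while the weighted F1 additionally penalizes classes with low precision. Support weighting is a common choice for imbalanced data because it stops a small class from dominating the average, at the cost of masking poor performance on that class.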

Metadata
Title
VulMAE: Graph Masked Autoencoders for Vulnerability Detection from Source and Binary Codes
Authors
Mahmoud Zamani
Saquib Irtiza
Latifur Khan
Kevin W. Hamlen
Copyright Year
2024
DOI
https://doi.org/10.1007/978-3-031-57537-2_12
