DOI: 10.1145/3190645.3190692
research-article

Malware classification using deep learning methods

Published: 29 March 2018

ABSTRACT

Malware, short for malicious software, continues to grow in number and sophistication as our digital world expands. It is a serious problem, and considerable effort in today's cybersecurity community is devoted to malware detection. In recent years, many machine learning algorithms have been used to detect malware automatically. Most recently, deep learning has been applied with better performance, and deep learning models have been shown to work much better in the analysis of long sequences of system calls. In this paper, a shallow deep learning-based feature extraction method (word2vec) is used to represent any given malware sample based on its opcodes. A Gradient Boosting algorithm is used for the classification task. k-fold cross-validation is then used to validate model performance without sacrificing a validation split. Evaluation results show up to 96% accuracy with limited sample data.
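The evaluation idea in the abstract, k-fold cross-validation over opcode-derived features, can be illustrated with a minimal standard-library sketch. It is not the paper's implementation: normalized opcode frequencies stand in for word2vec embeddings, a nearest-centroid rule stands in for Gradient Boosting, and the vocabulary, labels, samples, and function names are all hypothetical toy choices.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy opcode vocabulary (assumption, not from the paper).
VOCAB = ["mov", "push", "call", "xor", "jmp"]


def featurize(opcodes: List[str]) -> List[float]:
    """Normalized opcode frequencies; a stand-in for word2vec embeddings."""
    counts = Counter(opcodes)
    return [counts[op] / len(opcodes) for op in VOCAB]


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, validation) index pairs over range(n_samples).

    Every sample appears in exactly one validation fold, so no data is
    permanently held out. Fold sizes differ by at most one.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        val = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        folds.append((train, val))
        start += size
    return folds


def fit_centroids(X: List[List[float]], y: List[str]) -> Dict[str, List[float]]:
    """Nearest-centroid 'training'; a stand-in for Gradient Boosting."""
    sums: Dict[str, List[float]] = {}
    counts = Counter(y)
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids: Dict[str, List[float]], vec: List[float]) -> str:
    """Pick the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], vec)),
    )


def cross_validate(X: List[List[float]], y: List[str], k: int) -> float:
    """Mean validation accuracy across the k folds."""
    scores = []
    for train, val in k_fold_indices(len(X), k):
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in val)
        scores.append(correct / len(val))
    return sum(scores) / len(scores)


# Hypothetical toy samples, interleaved so every training fold sees both labels.
SAMPLES = [
    (["mov", "push", "call", "mov"], "benign"),
    (["xor", "jmp", "xor", "xor"], "malware"),
    (["push", "mov", "call", "call"], "benign"),
    (["jmp", "xor", "jmp", "mov"], "malware"),
    (["mov", "mov", "push", "call"], "benign"),
    (["xor", "xor", "jmp", "jmp"], "malware"),
]
X = [featurize(ops) for ops, _ in SAMPLES]
y = [label for _, label in SAMPLES]
mean_accuracy = cross_validate(X, y, k=3)
print(f"mean 3-fold accuracy: {mean_accuracy:.2f}")
```

Because each sample is validated exactly once and trained on in the remaining folds, the accuracy estimate uses all of the limited data. In practice the folds would usually be shuffled and stratified by label, which this sketch omits for brevity.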


Published in

ACMSE '18: Proceedings of the ACMSE 2018 Conference
March 2018, 246 pages
ISBN: 9781450356961
DOI: 10.1145/3190645
Conference Chair: Ka-Wing Wong
Program Chair: Chi Shen
Publications Chair: Dana Brown

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

ACMSE '18 paper acceptance rate: 34 of 41 submissions, 83%. Overall acceptance rate: 178 of 377 submissions, 47%.
