File Content-based Malware Classification

Abstract

Malicious software (malware) becomes a serious threat to system security the moment any electronic gadget or ‘Thing’ is connected to the World Wide Web (WWW). Malware is stealthy software that collects sensitive information, gains access to private systems, and can disrupt device operation. It thus acts against the user's requirements and threatens all operating systems (OS), particularly Windows and Android, as these are the most widely used. Malware developers try to invade systems by means of viruses, adware, spyware, ransomware, botware, Trojans, and so on, and they employ various anti-forensic techniques so that the malware cannot be detected or investigated; in effect, they play ‘peekaboo’ with malware investigators. As a result, investigating such attacks becomes more complex and often fails because of immature forensic methodology or a lack of appropriate tools. This chapter is a first step towards analysing malware. The process started with collecting a malware dataset and understanding it. Machine learning (ML) has two basic building blocks: feature extraction and classification. In supervised learning, the features play a significant role, so understanding them and their effect on classification was a major task. Two separate experimental processes were explored. The first extracted n-grams from the binary files using the kfNgram tool; the second used a shell script to parse the assembly files for method calls to external API libraries. Several supervised machine learning classifiers, such as Decision Trees, SVM, and Naive Bayes, were used to classify the malware family based on the extracted features. We propose a method that classifies malware into the nine families defined in the Kaggle dataset. It analyses the n-grams of a malware file to generate the feature vector. The value of ‘n’ is selectable; at present, it is four.
The objective was to extract highly probable n-grams from the binary files after a pre-processing step that calculates the IG parameter. At present, the five hundred top-ranked n-grams are selected. SVM and Decision Trees were observed to reach an accuracy of about 98%. Nevertheless, there is room for improvement, because the sequential selection of n-grams may admit irrelevant ones. This method is considered a starting point for malware classification.
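
The n-gram selection step described above can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: that IG denotes information gain, that each n-gram is treated as a binary present/absent feature per file, and that n-grams are taken over raw bytes. It does not reproduce the kfNgram tool, and all function names are hypothetical.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of distinct byte n-grams occurring in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(present, labels) -> float:
    """Information gain of a binary 'n-gram present' feature (assumed meaning of IG)."""
    with_gram = [y for p, y in zip(present, labels) if p]
    without_gram = [y for p, y in zip(present, labels) if not p]
    total = len(labels)
    conditional = (
        (len(with_gram) / total) * entropy(with_gram)
        + (len(without_gram) / total) * entropy(without_gram)
    )
    return entropy(labels) - conditional


def select_top_ngrams(samples, labels, n: int = 4, k: int = 500) -> list:
    """Rank every observed n-gram by information gain and keep the top k."""
    gram_sets = [byte_ngrams(s, n) for s in samples]
    vocabulary = set().union(*gram_sets)
    scored = []
    for gram in vocabulary:
        present = [gram in grams for grams in gram_sets]
        scored.append((information_gain(present, labels), gram))
    # Highest gain first; ties are broken by byte order for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [gram for _, gram in scored[:k]]


def to_feature_vector(data: bytes, selected, n: int = 4) -> list:
    """Binary feature vector: 1 if the selected n-gram occurs in data, else 0."""
    grams = byte_ngrams(data, n)
    return [1 if gram in grams else 0 for gram in selected]


if __name__ == "__main__":
    # Toy data standing in for malware binaries and their family labels.
    samples = [b"AAAAB", b"AAAAC", b"ZZZZB", b"ZZZZC"]
    labels = [0, 0, 1, 1]
    selected = select_top_ngrams(samples, labels, n=4, k=2)
    print(selected)
    print([to_feature_vector(s, selected) for s in samples])
```

The resulting vectors could then be passed to any supervised classifier, such as a decision tree or an SVM. Enumerating the full vocabulary this way is only practical for small inputs; on real binaries the n-gram counts would likely need to be pruned or streamed.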

Keywords: Dynamic, Malware, Machine learning, Random forest, Static, Support vector machine.

Cite as