Unknown malicious code detection: practical issues Academic Article

abstract

  • The recent growth in Internet usage has motivated the creation of new malicious code for various purposes, including information warfare. Today's signature-based anti-virus tools can accurately detect known malicious code but are very limited in detecting new malicious code. New malicious code is created every day, and the volume is expected to increase in the coming years. Recently, machine learning methods, such as classification algorithms, have been used successfully to detect unknown malicious code. These studies were based on test collections of limited size (fewer than 3,000 files), and the proportions of malicious and benign files were identical in the training and test sets. Such test collections do not correspond to real-life conditions, in which the percentage of malicious files is significantly lower than that of benign files. In this study we present a methodology for the detection of unknown malicious code. The executable binary code is represented by n-grams. We performed an extensive evaluation using a test collection of more than 30,000 files, in which we investigated the imbalance problem. Five levels of Malicious Files Percentage (MFP) in the training set (16.7, 33.4, 50, 66.7 and 83.4%) were used to train classifiers. Seventeen levels of MFP (5, 7.5, ... and 95%) were set in the test set to represent various benign/malicious file ratios during detection. Our evaluation results suggest that different classification algorithms react differently to the various benign/malicious file ratios. For 10% MFP in the test set, representing real-life conditions, the highest performance was generally achieved when the training set contained less than 33.3% MFP, and specific classifiers achieved accuracy above 95%. Additionally, we present a chronological evaluation, in which the dataset from 2000 to 2007 was divided into training sets and test sets. The evaluation results show that the training set needs to be updated.
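
As a rough illustration of two ideas named in the abstract, the sketch below counts overlapping byte n-grams in a binary and draws a file sample with a chosen Malicious Files Percentage. It is not the authors' implementation: the function names `byte_ngrams` and `sample_with_mfp`, the choice of n, and the toy file lists are all assumptions made for this example.

```python
import random
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count overlapping byte n-grams in a binary blob.

    Hypothetical helper: the abstract only says binaries are
    represented by n-grams; the value of n is left to the caller.
    """
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def sample_with_mfp(malicious, benign, mfp, size, seed=0):
    """Draw `size` files of which a fraction `mfp` (0..1) is malicious.

    Hypothetical helper showing how a set with a given
    Malicious Files Percentage could be assembled.
    """
    rng = random.Random(seed)
    n_mal = round(size * mfp)
    n_ben = size - n_mal
    if n_mal > len(malicious) or n_ben > len(benign):
        raise ValueError("not enough files for the requested MFP and size")
    sample = rng.sample(malicious, n_mal) + rng.sample(benign, n_ben)
    rng.shuffle(sample)
    return sample


if __name__ == "__main__":
    # Toy data only: a six-byte blob and made-up file labels.
    print(byte_ngrams(b"MZ\x90\x00MZ", 2).most_common(1))
    mal = [("malicious", i) for i in range(50)]
    ben = [("benign", i) for i in range(200)]
    test_set = sample_with_mfp(mal, ben, mfp=0.10, size=100)
    print(sum(1 for label, _ in test_set if label == "malicious"))
```

With a 10% MFP test set like the one above, a classifier trained on a balanced set sees far fewer malicious files at detection time than it did in training, which is the imbalance the study investigates.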

publication date

  • January 1, 2008