eScholarship
Open Access Publications from the University of California

UC Riverside

UC Riverside Electronic Theses and Dissertations

A Systematic Approach for Finding and Profiling Malware Source Code in Public Archives

Creative Commons 'BY-NC' version 4.0 license
Abstract

How can we find malware source code and establish the similarity, influence, and phylogeny of this malware? This question is motivated by a real need: there is a dearth of malware source code, which impedes various types of security research. Our work is driven by the following insight: public archives, like GitHub, have a surprising number of malware repositories. This thesis spans three interrelated problems in this space. First, we address the scarcity of malware source code. We propose SourceFinder, a supervised-learning approach to identify repositories of malware source code efficiently. We evaluate and apply our approach using 97K repositories from GitHub. Second, we propose Repo2Vec, a comprehensive embedding approach that represents a repository as a distributed vector. As our key novelty, it combines features from three types of information: (a) metadata, (b) the structure of the repository, and (c) the source code. This representation enables ML techniques for similarity identification, clustering, and classification tasks. We evaluate our approach with 1013 Java repositories to find similarities and clusters among them. Third, we propose PIMan, a systematic approach to quantify the influence among the repositories in a software archive by focusing on social-level interactions. We introduce the concept of Plausible Influence, which considers three types of information: (a) repository-level interactions, (b) author-level interactions, and (c) temporal considerations. We evaluate and apply our method using 2089 malware repositories from GitHub spanning approximately 12 years.
In our thesis, we use data from GitHub and security forums. First, we show that SourceFinder identifies malware repositories with 89% precision and 86% recall using a labeled dataset. We use SourceFinder to identify 7504 malware source code repositories, which arguably constitute the largest malware source code database. Second, we show that Repo2Vec outperforms previous methods in terms of precision (93% vs. 78%), with nearly twice as many Strongly Similar repositories and 30% fewer False Positives. We show how Repo2Vec provides a solid basis for: (a) distinguishing between malware and benign repositories, and (b) identifying a meaningful hierarchical clustering. For example, we achieve 98% precision and 96% recall in distinguishing malware and benign repositories. Third, using PIMan, we study the social-level interactions between repositories and establish a plausible influence network among them. We find that there is significant collaboration and influence among the repositories in our dataset. We argue that our approach and our large repository of malware source code can be a catalyst for research studies that are currently not possible.
