eScholarship
Open Access Publications from the University of California

UC Irvine

UC Irvine Electronic Theses and Dissertations

Code Clone Detection using Code2Vec

Abstract

Code clone detection is important in software engineering because it addresses problems such as code maintenance, code identification, code reuse, scalability, and plagiarism detection. Software development revolves around implementing logic with tools and technologies, and every developer brings a different coding style and logical approach to reaching the required goals. The end result of many implementations, however, can be the same. This is where the need for code maintainability, reusability, and optimization arises. Code clone detection can leverage the large volume of source code available on the web to reduce code writing time by reusing existing sources. Clone detection in source code is based on either the similarity of program content or the similarity of program functionality. Many techniques have been tried and tested in the literature, but these approaches do not perform adequately for higher-level clones. In this thesis, I explore Code2Vec, a deep-learning-based technique. To identify, compare, and reuse an existing piece of code, deep learning techniques can help predict whether a similar implementation exists in a codebase or a dataset of code. The thesis discusses representing code as vectors and applying Natural Language Processing to code clone detection. Its scope is to devise an approach for detecting functionally similar methods in Java GitHub code repositories, expanding on the Code2Vec model [1]. I demonstrate the capabilities of applying the Code2Vec model to Java source code in order to determine the path vectors used for method similarity detection. Furthermore, I discuss the design, architecture, and usage of the model, and I present an in-depth analysis of data collection and data preprocessing.
In this thesis, I apply a normalization technique that updates the variable names in the Code2Vec approach, and I compare the baseline Code2Vec model with the normalized model. The comparison shows improved precision for the normalized code clone detection approach. Finally, I present the benefits of my approach and a detailed analysis of results on a dataset of Java methods. The results are evaluated in terms of recall and precision: recall is measured with BigCloneBench, and precision with the open-source tools InspectorClone and MeasurePrecision.
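
As a rough illustration of the ideas above, the following Java sketch compares two method vectors and computes precision and recall from raw counts. It is a sketch under stated assumptions, not the thesis implementation: the abstract does not say which similarity measure or threshold is used, so cosine similarity, the `threshold` parameter, and the class and method names (`CloneSketch`, `cosine`, `isClone`, `precision`, `recall`) are all hypothetical choices made for this example.

```java
// Hypothetical helper class; none of these names come from the thesis or from Code2Vec.
class CloneSketch {

    // Cosine similarity between two method vectors of equal length.
    // Assumption: cosine similarity stands in for whatever measure the thesis uses.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // a zero vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Two methods are reported as a clone pair when their vectors are
    // at least `threshold` similar (the threshold is an assumed parameter).
    static boolean isClone(double[] a, double[] b, double threshold) {
        return cosine(a, b) >= threshold;
    }

    // Precision: fraction of reported clone pairs that are true clones.
    static double precision(int truePositives, int falsePositives) {
        int reported = truePositives + falsePositives;
        return reported == 0 ? 0.0 : (double) truePositives / reported;
    }

    // Recall: fraction of known clone pairs that were reported.
    static double recall(int truePositives, int falseNegatives) {
        int known = truePositives + falseNegatives;
        return known == 0 ? 0.0 : (double) truePositives / known;
    }

    public static void main(String[] args) {
        double[] m1 = {1.0, 2.0, 3.0}; // made-up vectors, for illustration only
        double[] m2 = {2.0, 4.0, 6.0};
        System.out.println("clone pair: " + isClone(m1, m2, 0.9));
        System.out.println("precision: " + precision(8, 2));
        System.out.println("recall: " + recall(8, 8));
    }
}
```

In an evaluation like the one described, the recall counts would come from a labeled benchmark such as BigCloneBench and the precision counts from inspecting the reported pairs; the numbers in `main` are invented purely to exercise the functions.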
