Functional Clone Detection in Intelligent Software Components
eScholarship
Open Access Publications from the University of California

UC Irvine

UC Irvine Electronic Theses and Dissertations

Abstract

Similarity detection in software systems, also known as clone detection, has been a focus of software engineering research in recent years. Cloning can be defined using simple notions of similarity, such as similarity in purely syntactic features, or using more sophisticated notions in which the ultimate functionality (or behavior) is the focus and there are few or no syntactic signals. Clones of the latter type are known as functional clones. A special case of functional clones, which is becoming more prevalent with advances in artificial intelligence, concerns behavioral similarity among deep neural network (DNN) models. DNN models are functions in that they define a portion of a software system's functionality. The wide adoption of these components in software brings new challenges to similarity detection techniques. DNN models are black boxes containing matrices of numbers learned from a training dataset. The training code carries little information about the resulting model's behavior, and similar training scripts and network architectures may produce completely different models. Model comparison, therefore, cannot rely on the structural properties of the models or their training scripts. Instead, it must compare the models' outputs on canonical inputs, which are generally the training or testing datasets. However, such datasets may not always be available, for reasons such as models being deployed independently of their datasets, or privacy and security concerns. These issues motivate the need for an approach that can automatically detect functional similarity among DNN models in the absence of canonical inputs.

This dissertation begins by examining the problem of functional clone detection through Oreo, a code clone detection tool focused on clones with diminished syntactic but high semantic similarity. It then presents a systematic study of the precision of code clone detection tools, highlighting the importance of considering clone types (and, more importantly, types with decreased syntactic similarity) when measuring precision. The dissertation continues by formulating the problem of DNN functional similarity detection in the absence of canonical inputs. To solve this problem, it introduces RICA, a technique that generates random inputs and uses them in place of canonical inputs for similarity detection. Three similarity metrics that can be used with RICA are presented, and their strengths and weaknesses are highlighted. RICA is evaluated through extensive experiments on a dataset of more than 56K classifiers collected from GitHub. The evaluation shows that RICA has high precision and recall, and highlights the effectiveness of its similarity metrics. Running RICA on the entire dataset of 56K classifiers involves more than 7 million comparisons and finds a cloning percentage of 26% among the analyzed models. The dissertation then shows how RICA's applicability can be extended beyond classifiers: applying RICA to a regression task demonstrates its effectiveness in finding clones of regression models. Furthermore, a sensitivity analysis reveals how certain model and training properties affect RICA's performance. Finally, a taxonomy of DNN clone types is proposed, which helps specify the ultimate capabilities and limitations of RICA and can inform future studies of DNN similarity detection.
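
The core idea of comparing models on random rather than canonical inputs can be illustrated with a minimal sketch. The code below is not RICA and does not reproduce any of its three similarity metrics, which the abstract does not specify. It assumes toy linear classifiers in place of DNNs, uniform random inputs, and a simple label-agreement rate as a stand-in similarity score. All names (`make_linear_classifier`, `random_input_agreement`) are hypothetical.

```python
import random


def make_linear_classifier(weights):
    """Build a toy classifier (assumption: stands in for a trained DNN).

    The predicted class is the index of the largest dot product between
    the input and each row of `weights`.
    """
    def predict(x):
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        return max(range(len(scores)), key=scores.__getitem__)
    return predict


def random_input_agreement(model_a, model_b, input_dim, n_samples=1000, seed=0):
    """Fraction of random inputs on which both models predict the same label.

    Assumption: inputs are drawn uniformly from [-1, 1]; this agreement
    rate is an illustrative metric only, not one taken from the dissertation.
    """
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_samples):
        x = [rng.uniform(-1.0, 1.0) for _ in range(input_dim)]
        if model_a(x) == model_b(x):
            agree += 1
    return agree / n_samples


# Three hypothetical models over 2-dimensional inputs.
original = make_linear_classifier([[1.0, 0.0], [0.0, 1.0]])
near_copy = make_linear_classifier([[1.0, 0.001], [0.001, 1.0]])  # slightly perturbed weights
different = make_linear_classifier([[0.0, 1.0], [1.0, 0.0]])      # class rows swapped

print(random_input_agreement(original, near_copy, input_dim=2))
print(random_input_agreement(original, different, input_dim=2))
```

In this sketch, a high agreement rate suggests that two models behave alike even though no training or test data was consulted, while a low rate suggests functionally different models. A real setting would also need a threshold for deciding what counts as a clone, which is outside the scope of this illustration.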
