eScholarship
Open Access Publications from the University of California

UC Riverside

UC Riverside Electronic Theses and Dissertations

Novel Structure Similarity-Based Methods for Identifying Drug-Like Compounds

Abstract

The prediction of biologically active compounds is of great importance for high-throughput screening (HTS) approaches in drug discovery and chemical genomics. Many computational methods in this area focus on measuring the structural similarities between chemical structures. However, traditional similarity measures are often too rigid, or they consider only global similarities between structures. This study introduces two new search approaches that overcome most of these limitations. First, the maximum common substructure (MCS) approach provides a more promising and flexible alternative for predicting bioactive compounds. A new backtracking algorithm for MCS is proposed here and compared to global similarity measurements. Our algorithm provides high flexibility in the matching process and is very efficient in identifying local structural similarities. To apply the MCS-based similarity measure in predictive models of the biological activity of compounds, the concept of basis compounds is introduced, which enables researchers to easily combine the MCS-based and traditional similarity measures with modern machine learning techniques. Our experiments on real compound datasets demonstrate that MCS complements the well-known atom pair descriptor-based similarity measure. By combining these two measures, we propose an SVM-based algorithm for predicting the biological activities of chemical compounds with high specificity and sensitivity.
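
The basis-compound idea can be illustrated with a minimal sketch: each compound is represented by its vector of similarities to a fixed set of basis compounds, and that vector can then be passed to a learner such as an SVM. The abstract does not specify how basis compounds are chosen or how descriptors are encoded, so everything below is an assumption made for illustration: the names `tanimoto` and `basis_features` are hypothetical, descriptors are modeled as plain sets of strings, and a Tanimoto coefficient stands in for either the atom pair or the MCS-based similarity.

```python
from typing import FrozenSet, List, Sequence

# Hypothetical stand-in for a compound's descriptor set (e.g. atom pairs).
Descriptor = FrozenSet[str]


def tanimoto(a: Descriptor, b: Descriptor) -> float:
    """Tanimoto (Jaccard) coefficient between two descriptor sets."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def basis_features(compound: Descriptor, basis: Sequence[Descriptor]) -> List[float]:
    """Represent a compound by its similarity to each basis compound.

    Any pairwise similarity (an MCS-based one included) could replace
    `tanimoto` here; the output is a fixed-length numeric vector.
    """
    return [tanimoto(compound, b) for b in basis]


# Toy data, invented for illustration only.
basis = [frozenset({"C-C", "C-N", "C-O"}), frozenset({"C-C", "C-S"})]
query = frozenset({"C-C", "C-N"})
features = basis_features(query, basis)
```

Because the output has one entry per basis compound, vectors computed with different similarity measures could be concatenated before training, which is one plausible way to combine the two measures.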

In similarity search and clustering applications of very large compound sets, most methods are limited in efficiency and scalability and cannot handle today's large compound datasets with several million entries. This is particularly true for MCS-based methods, whose computational complexity renders them infeasible for large compound datasets. The second main topic of this study addresses this performance issue by introducing a new method for greatly accelerating similarity search and clustering of very large compound sets using embedding and indexing techniques. The method, which can be used with MCS-based as well as traditional similarity measures, embeds compounds in a high-dimensional Euclidean space and searches this space using an efficient index-aware nearest neighbor search method based on Locality Sensitive Hashing. The method can also be used to accelerate cluster analysis of large compound sets. When applied to similarity search in compound datasets as large as PubChem, we found that the method was 40-200 times faster than sequential search methods, while maintaining comparable recall rates. It also made MCS-based similarity search tractable for large compound datasets. When applied to the clustering of such compound datasets, it helped to reduce the computation time from several months to only a few days.
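
To make the indexing step concrete, the following is a rough sketch of Locality Sensitive Hashing over vectors that are assumed to be already embedded in Euclidean space. It is not the dissertation's implementation: the class name `EuclideanLSH`, the single hash table, the quantized random projection scheme, and all parameter values are assumptions chosen for brevity. A practical index would typically use several tables and probe neighboring buckets to improve recall.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


class EuclideanLSH:
    """Single-table LSH using quantized random projections (illustrative only)."""

    def __init__(self, dim: int, num_hashes: int = 4, width: float = 4.0, seed: int = 0):
        rng = random.Random(seed)
        self.width = width
        # One Gaussian direction and one random offset per hash function.
        self.projections = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)
        ]
        self.offsets = [rng.uniform(0.0, width) for _ in range(num_hashes)]
        self.buckets: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
        self.points: List[Vector] = []

    def _key(self, v: Vector) -> Tuple[int, ...]:
        # Nearby vectors tend to fall into the same quantization cell.
        return tuple(
            math.floor((sum(a * x for a, x in zip(proj, v)) + off) / self.width)
            for proj, off in zip(self.projections, self.offsets)
        )

    def add(self, v: Vector) -> int:
        idx = len(self.points)
        self.points.append(v)
        self.buckets[self._key(v)].append(idx)
        return idx

    def query(self, v: Vector, k: int = 1) -> List[int]:
        # Rank only the candidates sharing the query's bucket, not the whole set.
        candidates = self.buckets.get(self._key(v), [])
        return sorted(candidates, key=lambda i: math.dist(self.points[i], v))[:k]


# Toy vectors, invented for illustration only.
index = EuclideanLSH(dim=3)
for point in [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (50.0, 50.0, 50.0)]:
    index.add(point)
hits = index.query((0.0, 0.0, 0.0), k=1)
```

The speedup in such a scheme comes from computing exact distances only for the compounds in the query's bucket. The trade-off is that a true neighbor hashed to a different bucket is missed, which is why recall has to be measured against a sequential search.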
