Design and development of a semantic music discovery engine

Abstract

Technology is changing the way in which music is produced, distributed, and consumed. An aspiring musician in West Africa with a basic desktop computer, an inexpensive microphone, and free audio editing software can record and produce reasonably high-quality music. She can post her songs on any number of musically oriented social networks (e.g., MySpace, Last.fm, eMusic), making them accessible to the public. A music consumer in San Diego can then rapidly download her songs over a high-bandwidth Internet connection and store them on a 160-gigabyte personal MP3 player. As a result, millions of songs are now instantly available to millions of people. This "Age of Music Proliferation" has created the need for novel music search and discovery technologies that move beyond the "query-by-artist-name" and "browse-by-genre" paradigms.

In this dissertation, we describe the architecture of a semantic music discovery engine. The engine uses information that is both collected from surveys, annotation games, and music-related websites, and extracted through the analysis of audio signals and web documents. Together, these five sources of data provide a rich representation based on both the audio content and the social context of the music. We show how this representation can be used for various music discovery purposes with the Computer Audition Lab (CAL) Music Discovery Engine prototype. This web application provides a query-by-description interface for music retrieval, recommends music based on acoustic similarity, and generates personalized radio stations.

The backbone of the discovery engine is an autotagging system that can both annotate novel audio tracks with semantically meaningful tags (i.e., short text-based tokens) and retrieve relevant tracks from a database of unlabeled audio content given a text-based query. We treat the related tasks of content-based audio annotation and retrieval as one supervised multi-class, multi-label problem in which we model the joint probability of acoustic features and tags. For each tag in a vocabulary, we use an annotated corpus of songs to train a Gaussian mixture model (GMM) over an audio feature space. We estimate the parameters of each model using the weighted mixture hierarchies expectation-maximization (EM) algorithm. Compared against standard parameter estimation techniques, this algorithm is more scalable and produces density estimates that yield better end performance. The quality of the music annotations produced by our system is comparable to that of humans on the same task, and our query-by-semantic-description system can retrieve appropriate songs for a large number of musically relevant tags. We also demonstrate the generality of our audition system by learning a model that can annotate and retrieve sound effects.

We then present Listen Game, an online, multiplayer music annotation game that measures the semantic relationship between songs and tags. In the normal mode, a player sees a list of semantically related tags (e.g., genres, instruments, emotions, usages) and is asked to pick the best and worst tags to describe a song. In the freestyle mode, a player is asked to suggest a tag that describes the song. Each player receives real-time feedback (e.g., a score) that reflects the amount of agreement among all of the players. Using data collected during a two-week pilot study, we show that we can effectively train our autotagging system.
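To make the per-tag modeling concrete, the sketch below trains one GMM per tag and uses the resulting likelihoods for both annotation (rank tags for a song) and retrieval (rank songs for a tag). It substitutes scikit-learn's standard EM for the weighted mixture hierarchies EM developed in the dissertation, and the feature dimensionality, component count, and synthetic data are illustrative assumptions rather than the actual CAL setup.

```python
# Minimal per-tag GMM annotation/retrieval sketch. Standard EM stands
# in for the weighted mixture hierarchies EM; all parameters and data
# here are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_tag_models(features_by_tag, n_components=4, seed=0):
    """Fit one GMM per tag over the feature vectors of songs
    annotated with that tag."""
    models = {}
    for tag, X in features_by_tag.items():
        models[tag] = GaussianMixture(
            n_components=n_components, random_state=seed).fit(X)
    return models

def annotate(models, song_features):
    """Rank tags for one song by the average log-likelihood of its
    feature vectors under each tag's GMM."""
    scores = {tag: gmm.score(song_features) for tag, gmm in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(models, tag, songs):
    """Rank songs (dict: song id -> feature matrix) for a one-tag query."""
    gmm = models[tag]
    return sorted(songs, key=lambda sid: gmm.score(songs[sid]), reverse=True)

# Toy example: 13-D vectors standing in for MFCC-style audio features.
rng = np.random.default_rng(0)
features_by_tag = {
    "mellow": rng.normal(0.0, 1.0, size=(200, 13)),
    "aggressive": rng.normal(2.0, 1.0, size=(200, 13)),
}
models = train_tag_models(features_by_tag)
song = rng.normal(1.8, 1.0, size=(50, 13))
print(annotate(models, song))  # likely ['aggressive', 'mellow']
```

Because both tasks reduce to evaluating the same per-tag likelihoods, a single set of trained models serves annotation and retrieval alike.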
We compare our autotagging system and annotation game with three other approaches to collecting tags for music: conducting a survey, harvesting social tags, and mining web documents. The comparison covers both scalability (financial cost, human involvement, and computational resources) and quality (the cold-start problem, popularity bias, strong versus weak labeling, tag vocabulary structure and size, and annotation accuracy). Each approach is evaluated on a tag-based music information retrieval task. Using this task, we quantify the effect of popularity bias for each approach by comparing performance on a subset of more popular (short-head) songs against a subset of less popular (long-tail) songs.

Lastly, we explore three algorithms for combining semantic information about music from multiple data sources: RankBoost, kernel combination SVM, and a novel algorithm called Calibrated Score Averaging (CSA). CSA learns a non-parametric function that maps the output of each data source to a probability and then combines these probabilities. We demonstrate empirically that combining multiple sources is superior to using any individual source alone on the task of tag-based retrieval. While the three combination algorithms perform equivalently on average, each shows superior performance for some of the tags in our vocabulary.
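The sketch below illustrates the spirit of CSA: calibrate each source's raw scores into probabilities with a learned non-parametric map, then fuse the probabilities. Isotonic regression stands in for the learned calibration function, and a simple average stands in for the combination rule; both are assumptions for illustration, not the dissertation's exact formulation.

```python
# Calibrated-score-fusion sketch in the spirit of CSA. The isotonic
# calibrator and the averaging rule are illustrative stand-ins.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrators(source_scores, relevance):
    """Fit one monotone score->probability map per data source.
    source_scores: dict of source name -> score per (song, tag) pair.
    relevance: binary ground-truth labels for those pairs."""
    return {name: IsotonicRegression(out_of_bounds="clip").fit(s, relevance)
            for name, s in source_scores.items()}

def combine(calibrators, source_scores):
    """Map each source's scores to probabilities and average them."""
    probs = [calibrators[name].predict(s)
             for name, s in source_scores.items()]
    return np.mean(probs, axis=0)

# Toy example: two noisy sources scoring 500 (song, tag) pairs.
rng = np.random.default_rng(1)
relevance = rng.integers(0, 2, size=500)
source_scores = {
    "autotags": relevance + rng.normal(0, 0.8, size=500),
    "web_docs": relevance + rng.normal(0, 1.2, size=500),
}
calibrators = fit_calibrators(source_scores, relevance)
fused = combine(calibrators, source_scores)  # probabilities in [0, 1]
```

Mapping each source onto a common probability scale before fusing is what lets heterogeneous sources (e.g., audio-based scores and web-mined counts) be compared and averaged meaningfully.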
