UC Berkeley Electronic Theses and Dissertations

Models and Algorithms for Crowdsourcing Discovery

Abstract

The internet enables us to collect and store unprecedented amounts of data. We need better models for processing and analyzing these data and for drawing conclusions from them. In this work, crowdsourcing is presented as a viable approach for collecting data and for extracting patterns and insights from big data. Humans in collaboration, when provided with appropriate tools, can collectively see patterns, extract insights, and draw conclusions from data. We study different models and algorithms for crowdsourcing discovery.

Each section of this dissertation proposes a problem, discusses its importance, and proposes and evaluates solutions. Crowdsourcing is the unifying theme for the projects presented here. In the first half of the dissertation we study different aspects of crowdsourcing, such as pricing, completion times, incentives, and consistency, using in-lab and controlled experiments. In the second half we focus on Opinion Space\footnote{opinion.berkeley.edu} and the algorithms and models that we designed for collecting innovative ideas from participants. This dissertation specifically studies how to use crowdsourcing to discover patterns and innovative ideas.

We start by looking at the CONE Welder project\footnote{Available at http://cone.berkeley.edu/ from 2008 to 2011}, which uses a robotic camera in a remote location to study the effect of climate change on the migration of birds. In CONE, an amateur birdwatcher can operate a robotic camera at a remote location from within her web browser. She can take photos of different bird species and classify them using the CONE user interface. This allowed us to compare the species present in the area from 2008 to 2011 with those reported by Blacklock in 1984 \cite{Blacklock:1984}. Citizen scientists found eight avian species previously unknown to have breeding populations within the region. CONE is an example of using crowdsourcing to discover new migration patterns.

Crowdsourcing can also be used to collect data on human motor movement. Fitts' law is a classical model for predicting the average movement time of a human motor motion. It has traditionally been used in the field of human-computer interaction (HCI) to model the time needed to move a pointing device from an origin to a target; the predicted time is a logarithmic function of the target width ($W$) and the distance from the pointer to the target ($A$). In the next project we first present a square-root variant of Fitts' law similar to Meyer et al. \cite{meyer1988optimality}. To evaluate this model we performed two studies: one uncontrolled, crowdsourced study and one in-lab controlled study with 46 participants. We show that the data collected from the crowdsourced experiment accurately follows the results of the in-lab experiments. For Homogeneous Targets the Square-Root model ($T = a + b \sqrt{\frac{A}{W}}$) results in a smaller RMS error ($E_{RMS}$) than the two other control models, LOG ($T = a + b\log{\frac{2A}{W}}$) and LOG' ($T = a + b\log{\left(\frac{A}{W}+1\right)}$), for $A/W<10$. Similarly, for Heterogeneous Targets the Square-Root model results in a significantly smaller $E_{RMS}$ than the LOG model for $A/W<10$, while the LOG model results in a significantly smaller $E_{RMS}$ for $A/W>15$. For Heterogeneous Targets the LOG' model consistently resulted in a significantly smaller error for $0
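As a rough illustration of how the three candidate models can be compared, the sketch below fits each predictor to a small set of $(A, W, T)$ movement-time observations by least squares and reports the resulting $E_{RMS}$. The synthetic data, the base-2 logarithm, and the NumPy-based fitting are illustrative assumptions, not the dissertation's actual analysis pipeline.

```python
import numpy as np

def fit_and_rms(x, T):
    """Fit T = a + b * x by ordinary least squares and return (a, b, RMS error)."""
    X = np.column_stack([np.ones_like(x), x])
    (a, b), *_ = np.linalg.lstsq(X, T, rcond=None)
    residuals = T - (a + b * x)
    return a, b, np.sqrt(np.mean(residuals ** 2))

# Illustrative movement-time data: target distance A, target width W, observed time T.
A = np.array([64, 128, 256, 512, 512], dtype=float)
W = np.array([32, 32, 64, 64, 128], dtype=float)
T = np.array([0.42, 0.55, 0.61, 0.74, 0.63])

# The three candidate predictors from the abstract; the log base only rescales b.
predictors = {
    "SQRT  T = a + b*sqrt(A/W)":    np.sqrt(A / W),
    "LOG   T = a + b*log2(2A/W)":   np.log2(2 * A / W),
    "LOG'  T = a + b*log2(A/W + 1)": np.log2(A / W + 1),
}

for name, x in predictors.items():
    a, b, e_rms = fit_and_rms(x, T)
    print(f"{name}: a={a:.3f}, b={b:.3f}, E_RMS={e_rms:.4f}")
```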

Opinion Space is a system that directly elicits opinions from participants for idea generation. It collects both numerical and textual data, and we study methods for combining these two sets of data. Canonical Correlation Analysis (CCA) is used to combine the numerical and textual inputs from participants. CCA seeks linear transformations that maximize the correlation between the low-dimensional projections of the numerical ratings ($Xw_x$) and the textual comments ($Yw_y$) onto the two-dimensional space. In other words, it solves $\arg\max_{w_x, w_y} \operatorname{corr}(Xw_x, Yw_y)$, in which $X$ and $Y$ are high-dimensional representations of the participants' numerical ratings and textual comments and $Xw_x$ and $Yw_y$ are their lower-dimensional representations. Using participants' numerical feedback on each other's comments, we then develop an evaluation framework for comparing different dimensionality reduction methods. In this framework, a dimensionality reduction is most appropriate for Opinion Space when the value of $\gamma = -\operatorname{corr}(R, D)$ is largest. Here $R$ is the set of $r_{ij}$ values, where $r_{ij}$ is the rating that participant $i$ gives to the textual opinion of participant $j$; similarly, $D$ is the set of $d_{ij}$ values, where $d_{ij}$ is the Euclidean distance between the locations of participants $i$ and $j$. In this dissertation we provide supporting arguments for why this evaluation framework is appropriate for Opinion Space. We compared different variations of CCA and PCA dimensionality reductions on different datasets. Our results suggest that the $\gamma$ values for CCA are at least $169\%$ larger than the $\gamma$ values for PCA, making CCA a more appropriate dimensionality reduction model for Opinion Space.
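The sketch below shows, under assumed data shapes, one way such a pipeline could be set up: scikit-learn's CCA projects a numerical-rating matrix $X$ and a featurized-comment matrix $Y$ to two dimensions, and $\gamma = -\operatorname{corr}(R, D)$ is then computed from a set of pairwise ratings. The random data, the dictionary of observed ratings, and the choice of the $X$-side projection as participant positions are illustrative assumptions, not the Opinion Space implementation.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from scipy.stats import pearsonr

# Illustrative data: n participants, numerical ratings X and featurized comments Y.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))    # numerical ratings on 5 propositions (assumed)
Y = rng.normal(size=(n, 50))   # featurized textual comments (assumed)

# Project both views to 2-D so each participant gets a point in the plane.
cca = CCA(n_components=2)
X_2d, Y_2d = cca.fit_transform(X, Y)
positions = X_2d               # one possible choice of layout; Y_2d is another

# Evaluation: gamma = -corr(R, D), where r_ij is the rating participant i gave
# to participant j's comment and d_ij is the distance between their positions.
# `observed_ratings` is an assumed sparse mapping {(i, j): r_ij}.
observed_ratings = {(i, (i + 1) % n): rng.uniform(0, 1) for i in range(n)}

R, D = [], []
for (i, j), r_ij in observed_ratings.items():
    R.append(r_ij)
    D.append(np.linalg.norm(positions[i] - positions[j]))

gamma = -pearsonr(R, D)[0]
print(f"gamma = {gamma:.3f}  (larger is better under this framework)")
```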

A product review on an online retailer's website is often accompanied by numerical ratings of the product on different scales, a textual review, and sometimes information on whether or not the review is helpful. Generalized Sentiment Analysis models the correlation between the textual review and the numerical ratings and uses it to infer the ratings on the different scales from the text. We provide formulations for using CCA to solve this problem. We compare our CCA model with Support Vector Machines, Linear Regression, and other traditional machine learning models and highlight the strengths and weaknesses of this model. We found that training the CCA formulation is significantly faster than SVM, which is traditionally used in this context (the fastest training time for SVM in LibSVM was 1,126 seconds while CCA took only 33 seconds to train). We also observed that the Mean Squared Error (MSE) for CCA was smaller than for the other competing models (the MSE for CCA with tf-idf features was 1.69 while this value for SVM was 2.28). Linear regression was more sensitive to the featurization method: it resulted in a larger MSE when used on multinomial ($MSE = 8.88$) and Bernoulli ($MSE = 4.21$) features, but a smaller MSE when tf-idf weights were used ($MSE = 1.47$).
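A minimal sketch of this kind of comparison, assuming scikit-learn's TfidfVectorizer, CCA, and LinearRegression and a tiny synthetic review set; it is not the dissertation's experimental code, and the numbers it prints bear no relation to those reported above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative review data: short texts plus ratings on two scales (1-5), assumed.
templates = [
    ("great battery and a bright screen", [5.0, 4.0]),
    ("terrible packaging, arrived broken", [1.0, 2.0]),
    ("works fine, average value for the price", [3.0, 3.0]),
    ("excellent quality and very fast delivery", [5.0, 5.0]),
]
rng = np.random.default_rng(0)
texts = [t for t, _ in templates] * 25
ratings = np.array([r for _, r in templates] * 25)
ratings += rng.normal(scale=0.3, size=ratings.shape)  # jitter so rows are not identical

# tf-idf featurization of the review text.
X = TfidfVectorizer().fit_transform(texts).toarray()
X_tr, X_te, y_tr, y_te = train_test_split(X, ratings, random_state=0)

# CCA used as a regression model: fit shared components, then predict the ratings.
cca = CCA(n_components=2).fit(X_tr, y_tr)
print("CCA MSE:   ", mean_squared_error(y_te, cca.predict(X_te)))

# Baseline: ordinary least-squares regression on the same tf-idf features.
lin = LinearRegression().fit(X_tr, y_tr)
print("LinReg MSE:", mean_squared_error(y_te, lin.predict(X_te)))
```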
