Leveraging Transitive Relations for Crowdsourced Joins

The development of crowdsourced query processing systems has recently attracted a significant attention in the database community. A variety of crowdsourced queries have been investigated. In this paper, we focus on the crowdsourced join query which aims to utilize humans to find all pairs of matching objects from two collections. As a human-only solution is expensive, we adopt a hybrid human-machine approach that first uses machines to generate a candidate set of matching pairs, and then asks humans to label the pairs in the candidate set as either matching or non-matching.

Given the candidate pairs, existing approaches will publish all pairs for verification to a crowdsourcing platform. However, they neglect the fact that the pairs satisfy transitive relations. As an example, if o1 matches with o2, and o2 matches with o3, then we can deduce that o1 matches with o3 without needing to crowdsource (o1, o3). To this end, we study how to leverage transitive relations for crowdsourced joins.

We present a hybrid transitive-relations and crowd-sourcing labeling framework which aims to crowdsource the minimum number of pairs to label all the candidate pairs. We propose a heuristic labeling order and devise a parallel labeling algorithm to efficiently crowdsource the pairs following the order. We evaluate our approaches in both sim- ulated environment and a real crowdsourcing platform. Experimental results show that our approaches with transitive relations can save much more money and time than existing methods, with a little loss in the result quality.