ACCELERATING TEXT RELATEDNESS COMPUTATIONS ON GENERAL PURPOSE GRAPHICS PROCESSING UNITS

Date

2015

Authors

Angevine, Duffy

Abstract

This thesis investigates a novel approach to accelerating document similarity calculations using the Google Trigram Method (GTM). GTM can be performed as a 1:1 comparison between a pair of documents, a 1:N comparison between one document and several others, or an N:N comparison in which all documents within a set are compared against each other. Existing research in this domain has focused on accelerating GTM on standard processors. In contrast, this thesis focuses on accelerating an N:N document relatedness calculation using a General Purpose Graphics Processing Unit (GPGPU). Fundamental to our approach is the pre-computation of several static elements, namely the GTM inputs: the documents to be compared and the Google N-Grams. The Google N-Grams are processed to produce a word relatedness matrix, and the documents are tokenized; both are then saved to disk so they can be recalled when document relatedness is calculated. Mapping GTM onto a GPGPU requires analysis to establish an effective scheme for transferring documents to the GPGPU, the data structures to be used in the GTM calculations, and how to implement GTM effectively on the GPGPU's unique architecture. Having designed a set of GPGPU methods, we systematically evaluate their performance. In this thesis, the GPGPU methods are compared against a multi-core Central Processing Unit (CPU) method that acts as a baseline; in total, two CPU methods and four GPGPU methods are evaluated. The CPU hardware platform is a workstation with a pair of 8-core Intel Xeon processors, retailing for approximately $10,000. The GPGPU platform is an Nvidia GeForce GTX 660, worth approximately $200 at the time of purchase. Across a wide range of data sets, we observe that the GPGPU achieved between 40% and 80% of the performance of the multi-core workstation, at one fiftieth of the cost.
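To make the abstract's pipeline concrete, the following is a minimal sketch of an N:N document-relatedness computation driven by a precomputed word-relatedness matrix, in the spirit of GTM. The toy vocabulary, the matrix values, and the best-match/average aggregation in `doc_relatedness` are illustrative assumptions, not the thesis's actual formula or implementation; in the real system the matrix would be derived from the Google N-Grams and the computation offloaded to the GPGPU.

```python
import numpy as np

# Hypothetical toy vocabulary; in the thesis this comes from tokenized documents.
vocab = {"cat": 0, "dog": 1, "car": 2, "road": 3}

# Hypothetical symmetric word-relatedness matrix; in the thesis this is
# precomputed from the Google N-Grams and saved to disk for reuse.
W = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
])

def doc_relatedness(a, b):
    """Score two tokenized documents: for each word, take its best-related
    word in the other document, then average both directions (a simplified,
    GTM-style aggregation assumed here for illustration)."""
    ia = [vocab[w] for w in a]
    ib = [vocab[w] for w in b]
    sub = W[np.ix_(ia, ib)]  # pairwise word relatedness between the two documents
    return 0.5 * (sub.max(axis=1).mean() + sub.max(axis=0).mean())

# N:N comparison: every document in the set scored against every other.
docs = [["cat", "dog"], ["dog", "cat"], ["car", "road"]]
scores = np.array([[doc_relatedness(a, b) for b in docs] for a in docs])
print(scores)
```

The N:N loop above is the part the thesis maps onto the GPGPU: each document pair is an independent score, so the pairs can be computed in parallel once the documents and the relatedness matrix reside in device memory.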

Keywords

GPGPU, Text Relatedness, GTM, GPU