dc.contributor.author | Liu, Hongyu. | en_US |
dc.date.accessioned | 2014-10-21T12:33:48Z | |
dc.date.available | 2007 | |
dc.date.issued | 2007 | en_US |
dc.identifier.other | AAINR31499 | en_US |
dc.identifier.uri | http://hdl.handle.net/10222/54962 | |
dc.description | A focused crawler is an efficient tool used to traverse the Web to gather documents on a specific topic. It can be used to build domain-specific Web search portals and online personalized search tools. Focused crawlers can only use information gleaned from previously crawled pages to estimate the relevance of a newly seen URL. Therefore, good performance depends on powerful modeling of context as well as current observations. | en_US |
dc.description | The goal of this research has been to design a robust method for the focused crawling problem, capable of collecting as many pages relevant to the given topics as possible. To address this challenge, we propose a new approach to focused crawling that uses probabilistic models to capture sequential patterns along paths leading to target pages. We model the crawling process as a walk along an underlying chain of hidden states, defined by hop distance from target pages, from which the actual topics of the documents are observed. When a new document is seen, prediction amounts to estimating the distance of this document from a target. | en_US |
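dc.description | The hidden-state formulation above can be sketched as a toy Hidden Markov Model in which hidden states are hop distances from a target page and observations are discrete topic labels of crawled pages. All probabilities below are invented placeholders for illustration, not values from the thesis:

```python
import numpy as np

# Hidden states: hop distance from a target page
# (0 = target, 1 = one hop away, 2 = two or more hops).
# Observations: a topic label assigned to each crawled page.
states = [0, 1, 2]
topics = {"relevant": 0, "related": 1, "off-topic": 2}

# Transition matrix (placeholder values): following a link
# typically moves the crawler one hop closer or farther.
A = np.array([[0.5, 0.4, 0.1],
              [0.4, 0.4, 0.2],
              [0.1, 0.3, 0.6]])

# Emission matrix (placeholder values): pages near targets
# tend to look topically relevant.
B = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6]])

pi = np.array([0.1, 0.3, 0.6])  # crawl usually starts far from targets

def estimate_distance(observed_topics):
    """Forward algorithm: posterior over hop distance given a crawl path."""
    alpha = pi * B[:, observed_topics[0]]
    for t in observed_topics[1:]:
        alpha = (alpha @ A) * B[:, t]
    return alpha / alpha.sum()

# A path whose pages look increasingly on-topic shifts the posterior
# mass toward small hop distances, i.e. "a target is probably near".
path = [topics["off-topic"], topics["related"], topics["relevant"]]
posterior = estimate_distance(path)
```

With these placeholder parameters, the posterior after an increasingly relevant path concentrates on the smallest hop distance, which is the signal a focused crawler would use to prioritize the frontier. | en_US |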
dc.description | Within this framework, we further investigate three probabilistic models for focused crawling. With Hidden Markov Models (HMMs), we focus on semantic content analysis with sequential patterns learned from the user's browsing behavior on specific topics. We extend our work to take advantage of richer representations of multiple features extracted from Web pages. With Maximum Entropy Markov Models (MEMMs), we exploit multiple overlapping features, such as anchor text, to represent useful context and form a chain of local classifier models. With Linear-chain Conditional Random Fields (CRFs), a form of undirected graphical model, we further focus on obtaining globally optimal solutions along the sequences by taking advantage not only of text content, but also of linkage relations. | en_US |
dc.description | We conclude the thesis with an experimental validation and comparison of HMMs, MEMMs and CRFs for focused crawling. Experimental results show that our models achieve significant performance improvement over Best-First crawling (BFC). | en_US |
dc.description | Thesis (Ph.D.)--Dalhousie University (Canada), 2007. | en_US |
dc.language | eng | en_US |
dc.publisher | Dalhousie University | en_US |
dc.subject | Computer Science. | en_US |
dc.title | Probabilistic models for focused Web crawling. | en_US |
dc.type | text | en_US |
dc.contributor.degree | Ph.D. | en_US |