Show simple item record

dc.contributor.author  Liu, Hongyu  en_US
dc.date.accessioned    2014-10-21T12:33:48Z
dc.date.available      2007
dc.date.issued         2007  en_US
dc.identifier.other    AAINR31499  en_US
dc.identifier.uri      http://hdl.handle.net/10222/54962
dc.description  A focused crawler is an efficient tool for traversing the Web to gather documents on a specific topic. It can be used to build domain-specific Web search portals and personalized online search tools. A focused crawler can use only information gleaned from previously crawled pages to estimate the relevance of a newly seen URL, so good performance depends on powerful modeling of context as well as of the current observations.  en_US
dc.description  The goal of this research has been to design a robust method for the focused crawling problem, capable of collecting as many pages relevant to the given topics as possible. To address this challenge, we propose a new approach to focused crawling that captures sequential patterns along paths leading to targets, based on probabilistic models. We model the crawling process as a walk along an underlying chain of hidden states, defined by hop distance from target pages, from which the actual topics of the documents are observed. When a new document is seen, prediction amounts to estimating the distance of this document from a target.  en_US
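The hidden-state formulation described above can be sketched as a small hidden Markov model in which the states are hop distances to the nearest target and the observations are coarse topic labels of fetched pages. All probabilities below are illustrative assumptions for the sketch, not values from the thesis:

```python
# Hidden states: hop distance to the nearest target page (0 = a target itself).
STATES = [0, 1, 2, 3]
# Observations: coarse topic class of a fetched page.
OBS = {"on-topic": 0, "related": 1, "off-topic": 2}

# Illustrative (assumed) transition probabilities:
# T[i][j] = P(next page at distance j | current page at distance i)
T = [
    [0.40, 0.40, 0.20, 0.00],
    [0.30, 0.40, 0.20, 0.10],
    [0.05, 0.25, 0.40, 0.30],
    [0.00, 0.10, 0.30, 0.60],
]
# Illustrative (assumed) emission probabilities:
# E[i][o] = P(observing topic class o | page at distance i)
E = [
    [0.80, 0.15, 0.05],
    [0.40, 0.40, 0.20],
    [0.15, 0.45, 0.40],
    [0.05, 0.25, 0.70],
]
PI = [0.10, 0.20, 0.30, 0.40]  # prior over the first page's distance

def predict_distance(path):
    """Most likely hop distance of the newest page, given the topic
    observations along the crawl path so far (Viterbi recursion)."""
    obs = [OBS[p] for p in path]
    delta = [PI[s] * E[s][obs[0]] for s in STATES]
    for sym in obs[1:]:
        delta = [max(delta[i] * T[i][j] for i in STATES) * E[j][sym]
                 for j in STATES]
    return max(STATES, key=lambda s: delta[s])
```

A page whose predicted distance is small would then be given high priority in the crawl frontier, which is how the distance estimate feeds back into crawl ordering.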
dc.description  Within this framework, we investigate three probabilistic models for focused crawling. With Hidden Markov Models (HMMs), we focus on semantic content analysis, with sequential patterns learned from the user's browsing behavior on specific topics. We then extend this work to take advantage of richer representations built from multiple features extracted from Web pages. With Maximum Entropy Markov Models (MEMMs), we exploit multiple overlapping features, such as anchor text, to represent useful context, forming a chain of local classifier models. With Linear-chain Conditional Random Fields (CRFs), a form of undirected graphical model, we focus on obtaining globally optimal solutions along the sequences by taking advantage not only of text content but also of linkage relations.  en_US
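The overlapping features that MEMMs and CRFs can exploit might be extracted along the following lines; the feature names and inputs here are hypothetical, chosen only to illustrate the idea of multiple simultaneous, non-independent features per link:

```python
def extract_features(anchor_text, url, title):
    """Overlapping binary features of the kind MEMMs/CRFs can combine.
    Unlike an HMM emission model, several features may fire at once
    and they need not be independent of one another."""
    feats = set()
    for w in anchor_text.lower().split():
        feats.add(f"anchor={w}")       # words in the link's anchor text
    for w in title.lower().split():
        feats.add(f"title={w}")        # words in the page title
    for part in url.lower().replace("-", "/").split("/"):
        if part:
            feats.add(f"url={part}")   # tokens from the URL path
    return feats
```

A local classifier (MEMM) or a globally normalized sequence model (CRF) would then weight such features jointly with the previous hidden state, rather than relying on a single generative observation symbol.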
dc.description  We conclude the thesis with an experimental validation and comparison of HMMs, MEMMs, and CRFs for focused crawling. The experimental results show a significant performance improvement of our models over Best-First crawling (BFC).  en_US
dc.description  Thesis (Ph.D.)--Dalhousie University (Canada), 2007.  en_US
dc.language     eng  en_US
dc.publisher    Dalhousie University  en_US
dc.subject             Computer Science.  en_US
dc.title               Probabilistic models for focused Web crawling.  en_US
dc.type                text  en_US
dc.contributor.degree  Ph.D.  en_US