dc.contributor.author | Liu, Hongyu. | en_US |
dc.date.accessioned | 2014-10-21T12:33:48Z | |
dc.date.available | 2007 | |
dc.date.issued | 2007 | en_US |
dc.identifier.other | AAINR31499 | en_US |
dc.identifier.uri | http://hdl.handle.net/10222/54962 | |
dc.description | A focused crawler is an efficient tool used to traverse the Web to gather documents on a specific topic. It can be used to build domain-specific Web search portals and online personalized search tools. Focused crawlers can only use information gleaned from previously crawled pages to estimate the relevance of a newly seen URL. Therefore, good performance depends on powerful modeling of context as well as current observations. | en_US |
dc.description | The goal of this research has been to design a robust method for the focused crawling problem, capable of collecting as many pages relevant to the given topics as possible. To address this challenge, we propose a new approach to focused crawling that uses probabilistic models to capture sequential patterns along paths leading to target pages. We model the crawling process as a walk along an underlying chain of hidden states, defined by hop distance from target pages, from which the actual topics of the documents are observed. When a new document is seen, prediction amounts to estimating the distance of this document from a target. | en_US |
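dc.description | The hidden-state formulation above can be sketched as a toy Hidden Markov Model in which hidden states are hop distances from a target page and observations are discrete topic labels of crawled pages. All probabilities below are invented placeholders for illustration, not values from the thesis:

```python
import numpy as np

# Hidden states: hop distance from a target page
# (0 = target, 1 = one hop away, 2 = two or more hops).
# Observations: a topic label assigned to each crawled page.
states = [0, 1, 2]
topics = {"relevant": 0, "related": 1, "off-topic": 2}

# Transition matrix (placeholder values): following a link
# typically moves the crawler one hop closer or farther.
A = np.array([[0.5, 0.4, 0.1],
              [0.4, 0.4, 0.2],
              [0.1, 0.3, 0.6]])

# Emission matrix (placeholder values): pages near targets
# tend to look topically relevant.
B = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6]])

pi = np.array([0.1, 0.3, 0.6])  # crawl usually starts far from targets

def estimate_distance(observed_topics):
    """Forward algorithm: posterior over hop distance given a crawl path."""
    alpha = pi * B[:, observed_topics[0]]
    for t in observed_topics[1:]:
        alpha = (alpha @ A) * B[:, t]
    return alpha / alpha.sum()

# A path whose pages look increasingly on-topic shifts the posterior
# mass toward small hop distances, i.e. "a target is probably near".
path = [topics["off-topic"], topics["related"], topics["relevant"]]
posterior = estimate_distance(path)
```

With these placeholder parameters, the posterior after an increasingly relevant path concentrates on the smallest hop distance, which is the signal a focused crawler would use to prioritize the frontier. | en_US |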
dc.description | Within this framework, we further investigate three probabilistic models for focused crawling. With Hidden Markov Models (HMMs), we focus on semantic content analysis with sequential patterns learned from the user's browsing behavior on specific topics. We extend our work to take advantage of richer representations of multiple features extracted from Web pages. With Maximum Entropy Markov Models (MEMMs), we exploit multiple overlapping features, such as anchor text, to represent useful context and form a chain of local classifier models. With Linear-chain Conditional Random Fields (CRFs), a form of undirected graphical model, we further focus on obtaining globally optimal solutions along the sequences by taking advantage not only of text content, but also of linkage relations. | en_US |
dc.description | We conclude the thesis with an experimental validation and comparison of HMMs, MEMMs and CRFs for focused crawling. Experimental results show that our models achieve significant performance improvement over Best-First crawling (BFC). | en_US |
dc.description | Thesis (Ph.D.)--Dalhousie University (Canada), 2007. | en_US |
dc.language | eng | en_US |
dc.publisher | Dalhousie University | en_US |
dc.subject | Computer Science. | en_US |
dc.title | Probabilistic models for focused Web crawling. | en_US |
dc.type | text | en_US |
dc.contributor.degree | Ph.D. | en_US |