Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
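The first approach in the abstract, K-means over a vector space that holds both page words and tags, can be illustrated with a minimal sketch. The abstract does not say how the two sources are combined, so everything specific here is an assumption made for illustration: words and tags are normalized separately and then concatenated, the `tag_weight` parameter and all function names are hypothetical, and the four toy pages are invented. It is not the authors' implementation.

```python
import math
from collections import Counter


def l2_normalize(counts):
    """Scale a sparse count vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else {}


def combined_vector(words, tags, tag_weight=1.0):
    """Normalize words and tags separately, then join them in one space.

    Assumption: separate normalization keeps a long page text from
    drowning out a handful of tags. tag_weight is a hypothetical knob.
    """
    vec = {("word", w): v for w, v in l2_normalize(Counter(words)).items()}
    for t, v in l2_normalize(Counter(tags)).items():
        vec[("tag", t)] = tag_weight * v
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for k, x in v.items():
            total[k] += x
    return {k: x / len(vectors) for k, x in total.items()}


def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's K-means; the first k vectors seed the centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each page to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels


# Invented toy pages: (page words, user tags).
pages = [
    (["python", "code", "tutorial"], ["programming", "python"]),
    (["pasta", "recipe", "sauce"], ["cooking", "food"]),
    (["code", "compiler", "rust"], ["programming"]),
    (["recipe", "bread", "oven"], ["food", "cooking"]),
]
vectors = [combined_vector(words, tags) for words, tags in pages]
labels = kmeans(vectors, k=2)
```

On this toy input the two programming pages end up in one cluster and the two cooking pages in the other. Setting `tag_weight` to 0 reduces the sketch to clustering on page text alone, which is the baseline the abstract compares against.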
Similar books and articles
David Hall & Christopher D. Manning, Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora.
Dan Klein & Christopher D. Manning, From Instance-Level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering.
Dan Klein & Christopher D. Manning, Interpreting and Extending Classical Agglomerative Clustering Algorithms Using a Model-Based Approach.
Brian Riordan & Michael N. Jones (2011). Redundancy in Perceptual and Linguistic Experience: Comparing Feature-Based and Distributional Models of Semantic Representation. Topics in Cognitive Science 3 (2):303-345.
Helen Kennedy (2012). Net Work: Ethics and Values in Web Design. Palgrave Macmillan.
Paul R. Smart (2012). The Web-Extended Mind. Metaphilosophy 43 (4):446-463.
Harry Halpin (2011). Sense and Reference on the Web. Minds and Machines 21 (2):153-178.
Marsha Woodbury (1998). Defining Web Ethics. Science and Engineering Ethics 4 (2):203-212.
Knut Borch-Johnsen, Jørgen H. Olsen & Thorkild I. A. Sørensen (1994). Genes and Family Environment in Familial Clustering of Cancer. Theoretical Medicine and Bioethics 15 (4).
Paolo Bouquet, Heiko Stoermer & Massimiliano Vignolo (2012). Web of Data and Web of Entities: Identity and Reference in Interlinked Data in the Semantic Web. Philosophy and Technology 25 (1):5-26.
Alexandre Monnin & Harry Halpin (2012). Toward a Philosophy of The Web. Metaphilosophy 43 (4):361-379.
Added to index: 2010-12-22
Total downloads: 25 (#160,567 of 1,911,834)
Recent downloads (6 months): 6 (#116,720 of 1,911,834)