Clustering the Tagged Web

Abstract
Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in (1) K-means clustering in an extended vector space model that includes tags as well as page text and (2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
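The first approach named in the abstract, K-means over a vector space extended with tags, can be illustrated with a minimal sketch. The abstract does not say how text and tags are combined, so the sketch assumes one plausible scheme: word counts and tag counts are L2-normalized separately and then concatenated, so that long page text cannot swamp a handful of tags. The toy pages, the function names, and the fixed centroid initialization are all hypothetical and chosen only to keep the example small and deterministic.

```python
import math
from collections import Counter


def unit_vector(tokens, vocab):
    """Return an L2-normalized term-frequency vector over a fixed vocabulary."""
    counts = Counter(tokens)
    vec = [float(counts[term]) for term in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def combined_vector(words, tags, word_vocab, tag_vocab):
    # Assumption (not from the abstract): normalize each block on its own,
    # then concatenate, so words and tags carry equal total weight.
    return unit_vector(words, word_vocab) + unit_vector(tags, tag_vocab)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus: (page words, user tags) for four pages.
pages = [
    (["python", "code", "tutorial"], ["programming"]),
    (["recipe", "pasta", "sauce"], ["cooking"]),
    (["java", "code", "compiler"], ["programming"]),
    (["bread", "oven", "recipe"], ["cooking", "food"]),
]

word_vocab = sorted({w for words, _ in pages for w in words})
tag_vocab = sorted({t for _, tags in pages for t in tags})

points = [combined_vector(w, t, word_vocab, tag_vocab) for w, t in pages]
labels = kmeans(points, k=2)
print(labels)
```

In this toy corpus the tag block pulls the two programming pages together and the two cooking pages together even though each pair shares only one word. A real experiment would use many more pages, a weighting such as TF-IDF, and randomized restarts for the initialization.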

Similar books and articles
Paul R. Smart (2012). The Web-Extended Mind. Metaphilosophy 43 (4):446-463.
Harry Halpin (2011). Sense and Reference on the Web. Minds and Machines 21 (2):153-178.
Marsha Woodbury (1998). Defining Web Ethics. Science and Engineering Ethics 4 (2):203-212.
Analytics
Added to index: 2010-12-22
Total downloads: 4 (#251,636 of 1,098,399)
Recent downloads (6 months): 1 (#284,872 of 1,098,399)
