Lately I have been working on several social software projects, one of which required me to display a tag cloud. For the Web 2.0-challenged, a tag cloud is essentially just a weighted list. In most social bookmarking applications, the weight is typically the number of bookmarks associated with a given tag.
There are two main problems that I encountered when trying to build the perfect tag cloud. The first was data volume: the more tags you have, the more you need to accommodate within the summary clouds. If readers have to scroll down to see the rest of the cloud, they can no longer compare the relative weights of tags at a glance, and the cloud loses its ability to convey meaning immediately. The other problem was font size distribution: how do you create the perfect distribution of tags across a fixed number of rendered font sizes? To deal with these questions I considered several approaches for limiting and filtering the data, as well as several algorithms for font size distribution, including linear algorithms, logarithmic algorithms, and even k-means clustering.
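To make the two problems concrete, here is a minimal sketch of how limiting the data and bucketing font sizes might fit together. It is not the code from the Whitepaper: the names `font_bucket` and `build_cloud`, the default of five buckets, and the cutoff of 50 tags are all assumptions made for illustration, and it assumes every tag has a count of at least 1.

```python
import math


def font_bucket(count, min_count, max_count, buckets=5, scale="log"):
    """Map a tag count to a font-size bucket in 1..buckets (hypothetical helper)."""
    if max_count <= min_count:
        # Every tag has the same weight, so there is nothing to distribute.
        return 1
    if scale == "log":
        # Position of the count between min and max on a logarithmic scale.
        position = (math.log(count) - math.log(min_count)) / (
            math.log(max_count) - math.log(min_count)
        )
    else:
        # Position of the count between min and max on a linear scale.
        position = (count - min_count) / (max_count - min_count)
    # The maximum count lands exactly on `buckets`, so clamp it into range.
    return min(buckets, 1 + int(position * buckets))


def build_cloud(tag_counts, limit=50, buckets=5, scale="log"):
    """Keep the `limit` heaviest tags and assign each one a font-size bucket."""
    top = sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True)[:limit]
    if not top:
        return {}
    counts = [count for _, count in top]
    lowest, highest = min(counts), max(counts)
    # Clouds are usually rendered alphabetically, so sort by tag name.
    return {
        tag: font_bucket(count, lowest, highest, buckets, scale)
        for tag, count in sorted(top)
    }


if __name__ == "__main__":
    sample = {"python": 1000, "web": 100, "css": 10, "misc": 1}
    print(build_cloud(sample, scale="linear"))
    print(build_cloud(sample, scale="log"))
```

With skewed counts like the sample above, the linear scale pushes everything except the heaviest tag into the smallest bucket, while the logarithmic scale spreads the tags across several sizes.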
As a result of my research into building tag clouds, I've created a Whitepaper: In Search of the Perfect Tag Cloud. It's a pretty hefty paper that comes in at around 13 pages in PDF format, so I can't blog the whole thing. However, I have uploaded the Whitepaper to my blog host and you can download that PDF here.
I'd love to know what people think of the analysis I've done so far. Hopefully some of the information contained in the Whitepaper will be useful to some people and will help them get further along in their Web 2.0 / social software projects.
Again, click here to download the Whitepaper PDF.
Nice paper, but it kind of fell down at the end where you covered clustering
algorithms. Surely there's a fast heuristic clustering algorithm that,
while not perfect, is good enough. But your conclusion is probably fine too:
logarithmic is 'good enough.'
The algorithms that I discuss are purely about font size distribution.
Filtering the data that goes into a cloud is relatively trivial. I am
currently using the logarithmic cloud algorithm to "cloud" popular tags,
not-so-popular tags, my tags, someone else's tags... you get the idea.
Kevin, it might be worth checking that algorithm for the log(size). The
original pseudo code at echochamber.code was incorrect. In a quick test for
me it looked like it just goes back to linear, plus the top band never
deviates etc...