Why Scale Matters in Tagging Systems

Why and how scale in social tagging systems can leverage the Wisdom of Crowds (much like Google does with links) to make the incorrect tags less influential than certain Aristotelians would have us believe.

Ok, so I got into hot water for my Thoughts on the Impending Death of Information Architecture post…

But I’m completely fascinated by this subject. In that piece I referenced a work by Elaine Peterson entitled Beneath the Metadata: Some Philosophical Problems with Folksonomy. Elaine eloquently argues that since tagging systems can contain incorrect information (non-Aristotelian, she calls it brilliantly), they will eventually fail to serve our needs. She says:

“Although folksonomy advocates are beginning to correct some linguistic and cultural variations when applying tags, inconsistencies within the folksonomic classification scheme will always persist. There are no right or wrong classification terms in a folksonomic world, and the system can break down when applied to databases of journal articles or dissertations.”

This argument, as I’ve mentioned before, is one about relativism. Is it OK to have systems which contain misinformation, even if it happens to be the way someone thinks and tags?

Let me put it more bluntly: Do people have the right to think how they want?

If we re-ask the question in this way, the answer is clear. (And no, I don’t think it’s ridiculous to equate this argument with allowing people to think what they want. At some level it *is* about that, in a weird science-fiction way.)

So, of course, we have the right to think what we want; at least, most people think so. (insert analogous religious argument here about actions and beliefs)

Anyway, if you’ve read Bokardo for any period of time (go here to win prizes) you know that I believe our systems should model our behaviors and thoughts, not the other way around. We shouldn’t have to map what’s in our head to some other idea set every time we use software if we don’t have to.

If I want to tag the New York Yankees as “the best team money can buy”, and someone else thinks that’s just plain wrong, then tough for them. That’s how I want to tag it, that’s how I want to re-find it, and that’s how I think about the Bronx Bombers (or was it the Yankees?). In folksonomies the view of the system is *my* view…warts and all.

Moreover, other folks in Red Sox Nation might tag it similarly, thus propagating the potential falsity in the system for Yankees fans to find (except, of course, the Yankees are the best team money can buy). Note, though, that their version of the system will have their version of tags for the Yankees…we still have a problem, according to Elaine…there is information in the system that doesn’t agree with other information in the system.

Geez…sometimes I don’t even agree with myself.

Scale is the Great Equalizer

But the thing is, and this is where Elaine underestimates folksonomies, scale matters. Even if a few people tag things incorrectly, most people won’t. This has nothing to do with most people being Good; it’s just that if we ask enough people the same question, or have them observe the same phenomenon, the places where their experiences overlap will tend to reflect the reality of the situation.
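
To make that a bit more concrete, here is a minimal sketch in Python of how a shared view might be built by counting tags across many people. Everything in it is invented for illustration: the user names, the tags, the `aggregate_tags` function and the `min_share` threshold are my assumptions, not how del.icio.us or any real tagging service works.

```python
from collections import Counter

def aggregate_tags(taggings, min_share=0.2):
    """Return (tag, count) pairs for tags used by at least `min_share` of users.

    `taggings` maps a hypothetical user name to the set of tags that user
    applied to a single item. Each user counts at most once per tag.
    """
    counts = Counter()
    for tags in taggings.values():
        counts.update(set(tags))
    total_users = len(taggings)
    if total_users == 0:
        return []
    # most_common() orders by count, highest first.
    return [
        (tag, count)
        for tag, count in counts.most_common()
        if count / total_users >= min_share
    ]

# Invented example: ten people tag the same page about the Yankees.
yankees_taggings = {
    "user1": {"baseball", "yankees"},
    "user2": {"baseball", "yankees", "nyc"},
    "user3": {"baseball", "yankees"},
    "user4": {"yankees", "mlb"},
    "user5": {"baseball", "mlb"},
    "user6": {"baseball", "yankees"},
    "user7": {"yankees", "nyc"},
    "user8": {"baseball", "yankees"},
    "user9": {"the best team money can buy"},  # one person's private view
    "user10": {"cheap-watches"},               # a spammer
}

consensus = aggregate_tags(yankees_taggings)
for tag, count in consensus:
    print(f"{tag}: {count}")
```

Note that nothing is deleted here. The one-off tags still live in each person’s own collection, where they remain useful for re-finding things; they simply don’t carry enough weight to shape the shared view.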

At this point, we could go many ways with this topic. One way would be to tie in James Surowiecki’s brilliant book The Wisdom of Crowds, which offers a lengthy treatment of the subject of aggregating individual viewpoints. If, under certain conditions, we aggregate the individual decisions of many people, the result tends to be equal to or better than an expert’s view. Here’s the Wikipedia entry for the Wisdom of Crowds, which gives a quick but good overview, and is no doubt a great irony in and of itself…(the crowd writing about the Wisdom of…itself…in a relativistic system with no authoritative voice except the accumulated voice of all its members).

Another way we could go with this topic is where Dan Stewart went. Dan, commenting on Dave Weinberger’s lengthy reply to Elaine, points to another, relatively important document Bokardoans should all be familiar with by now (I’ve talked about it enough):

“Elaine makes the argument that if an item on the web is tagged with words that do not describe it, then the system breaks down. In The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page the authors state, ‘Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. There are even numerous companies which specialize in manipulating search engines for profit.’”

So Dan ties in the Google PageRank algorithm to the folksonomy argument. Cool! However, at this point you may be thinking that Dan is a proponent of tagging systems. Alas, no, he is not. He goes on to say:

“Metadata is data about data, and tagging a page on the internet is essentially adding metadata. For the same reason that search engines no longer rely on metadata, social bookmarking could be abused and eventually become worthless.”

I think Dan has this second bit all wrong because he fails to distinguish where the metadata comes from and who is using it. If it comes from the expert, it’s expert-supplied metadata. This is exactly the type of metadata that Brin and Page were talking about, and in particular the <meta> tags of HTML. Those are defined by the author of the page (the expert) in the head portion of the HTML document.

As the Brin/Page quote points out, meta tags weren’t shown to the user of the page. This meant that document authors weren’t writing them for their users and thus had little incentive to make them accurate. Instead, their primary use was to tell user agents (search engines) what the page is about.

Because there is no personal use, meta tags get abused. If it doesn’t make a difference to the author what the meta tags say, then they’ll manipulate them away from what best describes their page to what best gets search engines to return them high in the results. This is the inflection point: at this point they become, essentially, SPAM.

However, tags are not defined by authors. They’re supplied by users. They’re user-supplied metadata. As a result, they’re used by the very people who created them. And, it is in that person’s best interest to keep them useful. Even though they can be incorrect like SPAM, they are not like SPAM in that someone actually has incentive to keep them valuable for human use.

BTW: this all seems to follow The Del.icio.us Lesson.

Further, what is the best example of user-supplied metadata on the Web? Links, of course. Links are essentially references to other documents. Links are created by authors but differ from meta tags because people actually use the links, following them and learning from them. Whereas manipulated meta tags didn’t hurt the user experience, manipulated links seriously kill it. If you are putting up bad links on your pages, people respond negatively…and swiftly. They just won’t come back. It’s definitely in the author’s interest to keep links valuable to users.

…and what does Google use to model how we value content? Links!

And we know why we can aggregate links in this way…because we have a large enough set of them to weed out the inconsistencies even as they continue to exist. We’ve got scale, baby!
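
For the curious, here is a toy sketch of that kind of link aggregation, written in Python in the spirit of the PageRank idea from the Brin and Page paper. It is emphatically not Google’s implementation: the `pagerank` function, the tiny `web` graph and the page names are all made up for illustration, and the only borrowed detail is the 0.85 damping factor the paper mentions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration rank over a dict: page -> list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Invented graph: five honest blogs link to a useful page. One spam page
# links to its own junk page, but nobody else links to either of them.
web = {
    "blog_a": ["useful"],
    "blog_b": ["useful"],
    "blog_c": ["useful", "blog_a"],
    "blog_d": ["useful"],
    "blog_e": ["useful", "blog_b"],
    "useful": ["blog_a"],
    "spam": ["junk"],
    "junk": [],
}

ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

The spam link still exists in the graph, just as bad tags still exist in a folksonomy. It simply ends up with far less weight than the page that many independent people chose to link to.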

This isn’t to say that SPAM isn’t a huge problem…it is. I certainly don’t envy the SPAM harvesters at Google. But if we look at all the people making links…the vast majority are creating valuable, non-spammy ones.

So where Dan sees a divergence and a route away from tagging, I see a convergence and a route toward tagging. Not only are tags user-supplied, personal-use metadata (and that will be their primary reason for being), but they also scale really well on a social level because they’re like links…if we have enough of them the incorrect ones (created by spammers and non-spammers alike) actually get lost in the Crowd…

And what does that leave?

Wisdom, I hope.

Published: November 23rd, 2006