Controlled Vocabularies Cut Off the Long Tail

Controlled vocabularies are, by their very definition, exclusive. The Long Tail paradigm, however, relies on inclusion. So, doesn’t that leave us with an easy choice?

In my last post, I ended with the suggestion that within folksonomies we learn all the way down the Long Tail. I would like to say more about that.

An interesting property of folksonomies, or emergent taxonomies culled from tags written by everyone (in the system), is that they’re inclusive. They include everyone’s words, from the popular pundit’s propaganda to the has-been hermit’s hash. Nothing is spared, nothing taken out. It’s like eating chocolate mousse made with heavy cream, a 90% cocoa chocolate bar, and 1 billion calories.

The Power of Including Everyone

The power of folksonomies is what we learn when everyone’s words are aggregated. This is actual user behavior that can provide insights that previously we had little access to. We learn that most people tag items with just one or two words. The combinations of these words can lend insights into how ideas begin to form, what ideas are currently popular, and what people find valuable over time as the system changes. We can also discover Long Tail topics: new or offbeat ideas that normally don’t get much attention or only get attention from a small fraction of the total population.
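To make the aggregation idea concrete, here is a minimal sketch of counting everyone's tags and separating the popular head from the Long Tail. The bookmark data, the function names, and the `head_size` cutoff are all made up for illustration; this is not how del.icio.us or any particular system does it.

```python
from collections import Counter

# Hypothetical bookmark data: (user, item, tags). Not taken from a real system.
bookmarks = [
    ("ann", "url1", ["design", "css"]),
    ("bob", "url1", ["design"]),
    ("cat", "url2", ["design", "web"]),
    ("dan", "url3", ["luciddreaming"]),
    ("eve", "url4", ["css"]),
    ("fay", "url5", ["dungarees"]),
]


def tag_counts(bookmarks):
    """Count how often each tag is used across all users, excluding nothing."""
    counts = Counter()
    for _user, _item, tags in bookmarks:
        counts.update(tags)
    return counts


def split_head_and_tail(counts, head_size):
    """Return (head, tail): the head_size most-used tags, then everything else."""
    ranked = [tag for tag, _count in counts.most_common()]
    return ranked[:head_size], ranked[head_size:]


counts = tag_counts(bookmarks)
head, tail = split_head_and_tail(counts, head_size=2)
print("head:", head)
print("tail:", tail)
```

The point of the sketch is that the tail is kept, not filtered out: rarely used tags such as "luciddreaming" stay in the data, so they remain available for someone to stumble across.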

For example, did you know that lucid dreaming is dreaming while being aware of it? I didn’t until I happened to have the Lucid Dreaming Frequently Asked Questions recommended to me in my del.icio.us inbox (del.icio.us has an inclusive, uncontrolled vocabulary). Now I know some things about lucid dreaming that I’ve often wondered about but never knew.

Discovery vs. Finding

The Long Tail paradigm is about the discovery of information, not just the finding of it. The distinction I’m making here between discovery and finding is that users who discover information didn’t need to know it was there to begin with, and so couldn’t have been trying to find it. In a word: serendipity.

Controlled vocabularies, on the other hand, are mostly about finding information. Users have one, often strikingly static, vocabulary to work with, so their opportunities to discover new information underneath the same old set of categories are few. If the categories always stay the same, doesn’t all the content underneath them stay the same, too? It can certainly feel that way.

Controlled Vocabularies Are a Best Guess

Controlled vocabularies are, at their finest, a best guess, not an artifact of actual behavior. We do have many techniques for making controlled vocabularies more user-friendly as we design them: card sorting, advanced card sorting, synonym dictionaries, variant terms, best bets, mental models. The problem with these techniques, however, is that they’re too deliberate to be inclusive. Someone has to deliberately program the system to show “jeans” when a user types in the variant term “dungarees”. Not only is this expensive and selective, but most systems offer little or no support for it.
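The “dungarees” problem can be sketched in a few lines. The table, the set of preferred terms, and the `resolve` function below are hypothetical, a stand-in for whatever variant-term support a given system might have rather than any real product's API.

```python
# Hypothetical, hand-maintained variant-term table for a controlled vocabulary.
# Every entry exists only because someone deliberately added it.
VARIANT_TERMS = {
    "dungarees": "jeans",
    "denims": "jeans",
}

# Hypothetical set of preferred (official) terms.
PREFERRED_TERMS = {"jeans", "shirts"}


def resolve(query):
    """Map a query to a preferred term, or None if nobody anticipated it."""
    term = query.strip().lower()
    if term in PREFERRED_TERMS:
        return term
    # Falls through to None for any word that was never entered by hand.
    return VARIANT_TERMS.get(term)


print(resolve("Dungarees"))
print(resolve("levis"))
```

Any word that nobody thought to enter resolves to nothing, which is exactly the selectiveness described above: the vocabulary only knows the variants someone paid to add.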

The inability to be inclusive also bleeds through in categories like “miscellaneous” and “other” and “solutions”, all practically meaningless terms but useful buckets to throw orphan content in. Unfortunately, these buckets soon become the wasteland of the unknown: nobody ever desires “miscellaneous” information.

Probability Is Not on the Side of Controlled Vocabularies

Without including everyone’s words, it is unlikely that a system holding an ever-increasing amount of information can meet everyone’s needs. The more content there is, the more likely it is that some of it fits none of the predefined terms, and that knowledge falls by the wayside.

This is how controlled vocabularies cut off The Long Tail. To borrow a term frequently used to question folksonomies, controlled vocabularies marginalize many opinions, works, and ideas that live in the Long Tail simply by making them extremely hard to discover, or even worse, by excluding them.

Who’s Being Left Out?

For the majority of people going after the majority of information, this might be OK. Or is it? Just how many people and ideas are out there on the Long Tail? How many people’s words and ideas are being marginalized? Well, here is one example of the size of the Long Tail of books, from Chris Anderson’s The Long Tail article at Wired.com:

“What’s really amazing about the Long Tail is the sheer size of it. Combine enough nonhits on the Long Tail and you’ve got a market bigger than the hits. Take books: The average Barnes & Noble carries 130,000 titles. Yet more than half of Amazon’s book sales come from outside its top 130,000 titles. Consider the implication: If the Amazon statistics are any guide, the market for books that are not even sold in the average bookstore is larger than the market for those that are…”

In other words, if the Amazon figures carry over, the books that Barnes & Noble stocks in its physical stores (its controlled vocabulary) account for less than half of what it could be selling right now, and that’s with the controlled vocabulary dictating what people are introduced to when they walk in the store! For books, this situation results from the physical limitation of shelf space. In the online world, however, there is no such limitation…

So instead of being exclusive and controlling vocabularies for users, let’s help them discover new things by building inclusive vocabularies. Let’s save the Long Tail.

Published: March 9th, 2005
