Controlled Vocabularies Cut Off the Long Tail

by Joshua Porter  |   March 9th, 2005  |  shortlink: http://bokardo.com/p/50

In my last post, I ended with the suggestion that within folksonomies we learn all the way down the Long Tail. I would like to say more about that.

An interesting property of folksonomies, or emergent taxonomies culled from tags written by everyone (in the system), is that they’re inclusive. They include everyone’s words, from the popular pundit’s propaganda to the has-been hermit’s hash. Nothing is spared, nothing taken out. It’s like eating chocolate mousse made with heavy cream, a 90% cocoa chocolate bar, and 1 billion calories.

The Power of Including Everyone

The power of folksonomies is what we learn when everyone’s words are aggregated. This is actual user behavior that can provide insights that previously we had little access to. We learn that most people tag items with just one or two words. The combinations of these words can lend insights into how ideas begin to form, what ideas are currently popular, and what people find valuable over time as the system changes. We can also discover Long Tail topics: new or offbeat ideas that normally don’t get much attention or only get attention from a small fraction of the total population.
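To make the idea of aggregating everyone's words concrete, here is a minimal sketch in Python. The sample bookmarks and the function names (`tag_counts`, `pair_counts`, `long_tail`) are invented for illustration; this is not how del.icio.us or any real tagging system works, only one simple way to count tags, tag combinations, and rarely used tags.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: each entry is the list of tags one user gave one bookmark.
bookmarks = [
    ["dreaming", "lucid"],
    ["dreaming", "sleep"],
    ["css", "design"],
    ["css"],
    ["design", "css"],
    ["lucid", "dreaming", "faq"],
]

def tag_counts(items):
    """Count how many bookmarks each tag appears on."""
    return Counter(tag for tags in items for tag in set(tags))

def pair_counts(items):
    """Count how often two tags appear together on the same bookmark."""
    pairs = Counter()
    for tags in items:
        # Sorting makes ("css", "design") and ("design", "css") the same pair.
        pairs.update(combinations(sorted(set(tags)), 2))
    return pairs

def long_tail(counts, max_uses=1):
    """Return the tags used at most max_uses times, in alphabetical order."""
    return sorted(tag for tag, n in counts.items() if n <= max_uses)

counts = tag_counts(bookmarks)
pairs = pair_counts(bookmarks)
tail = long_tail(counts)
```

Nothing is filtered out on the way in: the rarely used tags stay in the data, and `long_tail` simply surfaces them instead of discarding them.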

For example, did you know that lucid dreaming is dreaming while being aware of it? I didn’t until I happened to have the Lucid Dreaming Frequently Asked Questions recommended to me in my del.icio.us inbox (del.icio.us has an inclusive, uncontrolled vocabulary). Now I know some things about lucid dreaming that I’ve often wondered about but never knew.

Discovery vs. Finding

The Long Tail paradigm is about the discovery of information, not just the finding of it. The distinction I’m making here between discovery and finding is that users who discover information didn’t need to know it was there to begin with, and so couldn’t have been trying to find it. In a word: serendipity.

Controlled vocabularies, on the other hand, are mostly about finding information. Users have one, often strikingly static, vocabulary to work with, and so their opportunities to discover new information underneath the same old set of categories are small. If the categories always stay the same, doesn’t all the content underneath them stay the same, too? It can certainly feel that way.

Controlled Vocabularies are a Best Guess

Controlled vocabularies are, at their finest, a best guess, not an artifact of actual behavior. However, we have many techniques to make controlled vocabularies more user-friendly as we design them: card-sorting, advanced card sorting, synonym dictionaries, variant terms, best bets, mental models. The problem with these techniques, however, is that they’re too deliberate to be inclusive. Someone has to deliberately program the system to show “jeans” when a user types in the variant term “dungarees”. Not only is this expensive and selective, in most systems there is little or no support for it.
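The “dungarees” case can be sketched in a few lines of Python. Everything here is hypothetical: the `VARIANTS` table, the `CATALOG` contents, and the `search` function are made up to show the shape of the problem, not taken from any real system.

```python
# Hand-maintained variant terms. Every entry has to be added deliberately
# by someone who anticipated the word a user might type.
VARIANTS = {
    "dungarees": "jeans",
    "denims": "jeans",
}

# Hypothetical catalog, keyed by the preferred (controlled) term.
CATALOG = {
    "jeans": ["straight-leg jeans", "relaxed-fit jeans"],
}

def search(query):
    """Map a query to its preferred term, then look that term up."""
    term = query.strip().lower()
    preferred = VARIANTS.get(term, term)
    return CATALOG.get(preferred, [])

found = search("Dungarees")   # a variant someone thought to add
missed = search("overalls")   # a variant nobody added
```

The second lookup returns nothing, not because the store has nothing relevant, but because nobody anticipated that word. That is the selectivity at issue.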

The inability to be inclusive also bleeds through in categories like “miscellaneous” and “other” and “solutions”, all practically meaningless terms but useful buckets to throw orphan content in. Unfortunately, these buckets soon become the wasteland of the unknown: nobody ever desires “miscellaneous” information.

Probability is not on the Side of Controlled Vocabularies

Without including everyone’s words it is unlikely that a system with ever-increasing information can meet everyone’s needs. So probability-wise, some knowledge will fall by the wayside.

This is how controlled vocabularies cut off The Long Tail. To borrow a term frequently used to question folksonomies, controlled vocabularies marginalize many opinions, works, and ideas that live in the Long Tail simply by making them extremely hard to discover, or even worse, by excluding them.

Who’s Being Left Out?

For the majority of people going after the majority of information, this might be OK. Or is it? Just how many people and ideas are out there on the Long Tail? How many people’s words and ideas are being marginalized? Well, here is one example of the size of the long tail of books, from Chris Anderson’s The Long Tail article at Wired.com:

“What’s really amazing about the Long Tail is the sheer size of it. Combine enough nonhits on the Long Tail and you’ve got a market bigger than the hits. Take books: The average Barnes & Noble carries 130,000 titles. Yet more than half of Amazon’s book sales come from outside its top 130,000 titles. Consider the implication: If the Amazon statistics are any guide, the market for books that are not even sold in the average bookstore is larger than the market for those that are…”

In other words, the subset of books that Barnes & Noble stocks in its physical stores (its controlled vocabulary) is less than half of what it could sell right now–and that’s with the controlled vocabulary dictating what people are introduced to when they walk in the store! For books, this situation results from a physical limitation of space. In the online world, however, there is no such limitation…

So instead of being exclusive and controlling vocabularies for users, let’s help them discover new things by building inclusive vocabularies. Let’s save the Long Tail.

Comments

1.  Mike Steckel 11:13am, Wed 9th, 2005

Websites are rarely intended to meet the needs of “everyone.” They are intended to meet the needs of a particular audience segment. “Inclusive” or “not inclusive” just doesn’t mean much when you are talking about the vast majority of websites.

CVs are absolutely the artifacts of actual behavior. A real CV is constantly being updated based on user feedback. B&N rearranges their stores based on feedback also. For them, the books that are “excluded” do not warrant the costs of inclusion. They won’t make their money back by physically restocking them in their stores.

Folksonomies could potentially help with improving the effectiveness of a main CV. They are not mutually exclusive. You are too quick to divide this topic into “good” and “bad.”

I love UIE and have enjoyed other things you have written, but this one seems to be so black and white in its thinking that I remain entirely unconvinced.

2.  Josh 12:38pm, Wed 9th, 2005

Mike, you’re entirely correct, most web sites aren’t built to help everyone. But are the ones that are built satisfying all those folks they should be? Are there folks in the system’s Long Tail being excluded in some way? This is my concern.

In my experience, many CVs aren’t “real” in the sense that you talk about. Most simply stay static over the long term, with little updating. This is unfortunate, not bad.

You’re right to call me on my bias. I’m skeptical of CVs, and always have been, because they’re not inclusive. Perhaps I am being too dismissive, but I have a feeling that those CVs that are continually updated, the “real” ones, are few and far between. In other words, I hesitate to use the “ideal” CV to represent the “usual” CV.

I do disagree, however, with your claim that CVs are artifacts of real behavior. They are not direct artifacts. The changes made to CVs as the result of user feedback are only interpretations of that feedback.

Folksonomies, on the other hand, are direct artifacts. That’s why I’m pushing for them, and, admittedly, giving them the benefit of the doubt at this early stage.

By the way, my opinions and writing here do not necessarily reflect those of UIE. I’m trying out ideas here, and I enjoy productive arguments…

3.  Bud Gibson 5:36pm, Wed 9th, 2005

Josh, I just spent the weekend at the IA Summit in Montreal. My cut: some of the theoretical debate is overblown. People who have to actually manage large controlled vocabularies are looking for help, any help, in maintaining them. They may be “taxonomists” (I actually know one whose nom de guerre is taxonomist), but they are quite willing to resort to folksonomy.

It then becomes an issue of the extent to which folksonomy can help them solve their real-world problems. I think, as you point out, those issues become:

1. Making the taxonomy more friendly to users.
2. Incorporating change into the taxonomy.

Believe it or not, I have a rather gushy post that I wrote on IBM’s treatment of the issue and how they might use folksonomy in a new project. They have a 3700 node *uber* taxonomy (their term) that they are wrestling with.

You can see my post here:

http://thecommunityengine.com/home/archives/2005/03/ibms_intranet_a.html

4.  jim wilde 12:17pm, Thu 31st, 2005

“On the fly” tagging is important because many new ideas (half-baked ones) are hard to explain. Tagging is not categorizing but a way for us to informally describe how we feel about something in our own words. Malcolm Gladwell, author of “The Tipping Point” and “Blink”, offers some anecdotes on why we can’t trust people’s opinions: we don’t have the language to express our feelings.

5.  Teddie 11:04am, Tue 28th, 2006

Joshua, cutting off the long tail isn’t just a problem with tagging. Many search engine marketing agencies are finding controlled vocabularies to be an increasing problem with some mechanisms within Pay Per Click search advertising.

You might be interested in the Cutting off the Long Tail of Search post on the Search Engine War blog that explains it.

PS. I borrowed and credited a quote from your post.
