Mining the Two Types of User-Supplied Content

Sitting in my chiropractor’s office the other day, I read a fascinating article in the offline version of Businessweek: “Math Will Rock Your World.”

In addition to finding out that using a laptop 12-14 hours a day can affect my spine, I also found out about the amazing rise of math in business, from analyzing clickstreams to tracking blog conversations. It seems Google and Yahoo already have next year’s math grads lined up for jobs. They simply cannot get enough brain power to do what they want to do.

What do they want to do? Mine data, of course. From the mountains of search queries in Google to the ever-increasing purchase histories at Amazon, we have more data than we know what to do with. Even at the relatively tiny UIE we have more than we can handle. I simply cannot fathom what millions of users could do to a database.

Here’s an interesting bit about Yahoo:

‘At the Sunnyvale (Calif.) campus of Yahoo, chief researcher Prabhakar Raghavan heads a team of 100 mathematicians and computer scientists. Scribbling on a white board covered with equations, Raghavan describes Yahoo’s immense pool of data, featuring the online activity of 200 million registered customers, as Yahoo’s most precious resource. There is a whole world of uninvented businesses, he believes. They’ll come into being as Yahoo discovers new ways to satisfy the urges, curiosities, and desires of this customer base. The hints of these future businesses float in the oceans of Yahoo’s data. Raghavan’s mandate is to sift through that data and form new connections among consumers, e-marketers, and advertisers. Better algorithms, he says, “are critical to survival.”’

In general, there are two kinds of user-supplied content which can be mined:

  1. User-added content:
    Intentional content. That content which users input themselves. This includes blog posts, comments, reviews, ratings, links, RSS subscriptions, podcasts, and video.
  2. User-generated content:
    Unintentional content. That content which accrues as a byproduct of the actions of users. This includes clickstreams, purchase history, RSS read stats, search history, and other artifacts of behavior. User-generated content serves as evidence that a user passed that way, like footprints.
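
To make the two categories concrete, here is a minimal Python sketch of how a site might tag incoming events as one kind or the other. The event-type names, the `Event` class, and the `classify` function are all hypothetical, invented for illustration; they are not taken from any real system.

```python
from dataclasses import dataclass
from enum import Enum


class ContentKind(Enum):
    USER_ADDED = "user-added"          # intentional: the user put it there
    USER_GENERATED = "user-generated"  # unintentional: a byproduct of behavior


# Hypothetical event-type names, chosen only for illustration.
INTENTIONAL_TYPES = {"blog_post", "comment", "review", "rating", "link",
                     "rss_subscription", "podcast", "video"}
BYPRODUCT_TYPES = {"click", "purchase", "rss_read", "search"}


@dataclass(frozen=True)
class Event:
    user_id: str
    event_type: str


def classify(event: Event) -> ContentKind:
    """Sort an event into one of the two kinds of user-supplied content."""
    if event.event_type in INTENTIONAL_TYPES:
        return ContentKind.USER_ADDED
    if event.event_type in BYPRODUCT_TYPES:
        return ContentKind.USER_GENERATED
    raise ValueError(f"unknown event type: {event.event_type}")
```

Under these assumptions a review lands in the user-added bucket and a search lands in the user-generated bucket, so later analysis can treat the two differently.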

This distinction may or may not be important. I don’t know. But we are seeing a tremendous amount of work in the area of aggregating these types of content in an effort to build recommendation systems out of them.

In general, though, I think we’re learning some basic rules of thumb. Recommendation systems seem to work better if they are built out of users’ direct preferences, like ratings or reviews. If you try to build them out of, say, clickstreams, you won’t get the intentional feedback that you need. For example, Amazon gives recommendations built out of searches on its web site, even when the search was for something you only looked at as a gift for someone else. I recently did a search on knitting for my wife and now I’m stuck with knitting books for a while. However, the recommendations built on top of my wish list are much more valuable to me, and I actually find them useful.
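
One way to picture this rule of thumb is a scorer that counts intentional signals more heavily than behavioral ones. The Python sketch below is only an illustration under made-up assumptions: the signal names and weights are hypothetical, and this is not how Amazon or anyone else actually computes recommendations.

```python
from collections import defaultdict

# Hypothetical weights: intentional signals count far more than byproducts.
SIGNAL_WEIGHTS = {"wish_list": 5.0, "rating": 3.0, "search": 0.5}


def topic_scores(signals):
    """Sum the weighted signals for each topic.

    `signals` is a list of (signal_type, topic) pairs. Signal types
    without a configured weight contribute nothing.
    """
    scores = defaultdict(float)
    for signal_type, topic in signals:
        scores[topic] += SIGNAL_WEIGHTS.get(signal_type, 0.0)
    return dict(scores)


def top_topic(signals):
    """Return the highest-scoring topic, or None if there are no signals."""
    scores = topic_scores(signals)
    return max(scores, key=scores.get) if scores else None


signals = [
    ("search", "knitting"),       # a one-off gift search
    ("search", "knitting"),
    ("wish_list", "web design"),  # a deliberate statement of interest
]
```

With these weights, two knitting searches add up to less than a single wish-list entry, so `top_topic(signals)` picks the wish-list topic and the gift search does not take over the recommendations.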

Going back to the article, I liked this quote:

“People are complicated,…If you have a system, they figure out how to game it. Machines never do.”

Published: January 31st, 2006