Mining the Two Types of User-Supplied Content

by Joshua Porter  |   10 Comments

Sitting in my chiropractor’s office the other day, I read a fascinating article in the offline version of BusinessWeek. Here’s the online version: Math Will Rock Your World.

In addition to finding out that using a laptop 12-14 hours a day can affect my spine, I also found out about the amazing rise of math in business, from analyzing clickstreams to tracking blog conversations. It seems Google and Yahoo already have next year’s math grads lined up for jobs. They simply cannot get enough brain power to do what they want to do.

What do they want to do? Mine data, of course. From the mountains of search queries in Google to the ever-increasing purchase histories at Amazon, we have more data than we know what to do with. Even at the relatively tiny UIE we have more than we can handle. I simply cannot fathom what millions of users could do to a database.

Here’s an interesting bit about Yahoo:

‘At the Sunnyvale (Calif.) campus of Yahoo, chief researcher Prabhakar Raghavan heads a team of 100 mathematicians and computer scientists. Scribbling on a white board covered with equations, Raghavan describes Yahoo’s immense pool of data, featuring the online activity of 200 million registered customers, as Yahoo’s most precious resource. There is a whole world of uninvented businesses, he believes. They’ll come into being as Yahoo discovers new ways to satisfy the urges, curiosities, and desires of this customer base. The hints of these future businesses float in the oceans of Yahoo’s data. Raghavan’s mandate is to sift through that data and form new connections among consumers, e-marketers, and advertisers. Better algorithms, he says, “are critical to survival.”‘

In general, there are two kinds of user-supplied content which can be mined:

  1. User-added content:
    Intentional content. That content which users input themselves. This includes blog posts, comments, reviews, ratings, links, RSS subscriptions, podcasts, and video.
  2. User-generated content:
    Unintentional content. That content which accrues as a byproduct of the actions of users. This includes clickstreams, purchase history, RSS read stats, search history, and other artifacts of behavior. User-generated content serves as evidence that a user passed that way, like footprints.
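
To make the two categories concrete, here is a minimal Python sketch that tags incoming events as one kind or the other. The signal names, `ContentKind`, and `split_by_kind` are all hypothetical, chosen to mirror the examples in the list above; this is an illustration, not anyone's actual data model.

```python
from enum import Enum


class ContentKind(Enum):
    USER_ADDED = "intentional"        # content users input themselves
    USER_GENERATED = "unintentional"  # byproduct of user actions


# Hypothetical signal names, taken from the examples in the list above.
SIGNAL_KINDS = {
    "blog_post": ContentKind.USER_ADDED,
    "comment": ContentKind.USER_ADDED,
    "review": ContentKind.USER_ADDED,
    "rating": ContentKind.USER_ADDED,
    "clickstream": ContentKind.USER_GENERATED,
    "purchase_history": ContentKind.USER_GENERATED,
    "rss_read_stats": ContentKind.USER_GENERATED,
    "search_history": ContentKind.USER_GENERATED,
}


def split_by_kind(events):
    """Group (signal_name, payload) pairs by kind; unknown signals are skipped."""
    grouped = {kind: [] for kind in ContentKind}
    for name, payload in events:
        kind = SIGNAL_KINDS.get(name)
        if kind is not None:
            grouped[kind].append((name, payload))
    return grouped
```

Calling `split_by_kind([("rating", 5), ("search_history", "knitting")])` would put the rating under `USER_ADDED` and the search under `USER_GENERATED`.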

This distinction may or may not be important. I don’t know. But we are seeing a tremendous amount of work in the area of aggregating these types of content in an effort to build recommendation systems out of them.

In general, though, I think we’re learning some basic rules of thumb. Recommendation systems seem to work better if they are built out of users’ direct preferences, like ratings or reviews. If you try to build them out of, say, clickstreams, you won’t get the intentional feedback that you need. For example, Amazon gives recommendations built out of searches on its web site, even if the search was for something that you’ve only looked at as a gift for someone else. I recently did a search on knitting for my wife and now I’m stuck with knitting books for a while. However, their recommendations built on top of my wish list are much more valuable to me, and I actually find them useful.
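
One way to picture this rule of thumb is a scorer that counts intentional signals more heavily than behavioral ones. The sketch below is only an illustration under assumed weights; the signal names, the numbers in `SIGNAL_WEIGHTS`, and `rank_topics` are hypothetical, and nothing here describes how Amazon actually builds its recommendations.

```python
from collections import defaultdict

# Assumed weights: intentional signals count far more than behavioral ones.
SIGNAL_WEIGHTS = {
    "wish_list": 1.0,
    "rating": 1.0,
    "search": 0.2,
    "click": 0.1,
}


def rank_topics(events, weights=SIGNAL_WEIGHTS):
    """Rank topics by weighted signal count, strongest first.

    events is a list of (signal_name, topic) pairs; signals without a
    weight contribute nothing.
    """
    scores = defaultdict(float)
    for signal, topic in events:
        scores[topic] += weights.get(signal, 0.0)
    return sorted(scores, key=lambda topic: scores[topic], reverse=True)
```

With these weights, two knitting searches plus one wish-list entry for web design would rank web design ahead of knitting, which is roughly the behavior I’d want after shopping for a gift.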

Going back to the article, I liked this quote:

“People are complicated,…If you have a system, they figure out how to game it. Machines never do.”

Comments ( 10 Responses so far )

1.  Jonathan on January 31st, 2006

Your comment about being stuck with knitting books is an interesting addition to a post I read a year back called My TiVo thinks I’m gay. TiVo has somewhat of a disquieting ability to discern your viewing preferences from the things you ask it to record. TiVo uses those perceived preferences to thoughtfully record other stuff it ‘thinks’ you might enjoy. The article talks about how one owner’s TiVo started recording shows that clearly indicated it thought that he was gay. To compensate, the owner started recording programs about war and other ‘manly’ subjects. His TiVo then began overcompensating, thinking his tastes were more in line with those of a WWII Nazi official. In the parlance of show biz: wackiness ensued!

2.  Josh on February 1st, 2006

That’s a great pointer! And a great headline, too! Thanks, Jonathan. I love the guy’s reaction and strategy. Brilliant.

3.  Nir Ben-Dor on February 1st, 2006

Wow, this post really hit the spot for me. I just wrote an article in Linkadelic Magazine earlier today about the problems of the web. Here is a small part:

What does it mean for the future?

Wherever there is something wrong and an ongoing change process, good will eventually take place. I think that the web is still very much in its infancy, and that there is going to be a gradual change which will make the web a better place for its users. This may be likened to the collapse of a bad regime. Users will jump on new services which put the user in the center and empower the user by taking the preferences of the individual as the main consideration. Not the makers of the web sphere, not the “democratic” groups representing the users, but the user as an individual entity.

The rest is at There’s something very wrong with today’s internet

4.  Jared Spool on February 3rd, 2006

Isn’t #2 (User-generated content) what Gillmor keeps insisting are “Gestures”?

5.  evano on February 3rd, 2006

I’m not sure about Gillmor’s Gestures, but #2 also reminds me of “Attention” or the artifacts of Attention. Or am I just missing the point? (BTW — your comments preview function is something I hope to see replicated everywhere there’s a comment box!)

6.  Josh on February 3rd, 2006

Jared, I’m not sure where or if Gillmor’s line of gestures would be drawn…interesting question.

7.  Dewayne Mikkelson on February 6th, 2006

This is a great quote and it sounds like the service MINT has hit the same problem.
“People are complicated,…If you have a system, they figure out how to game it. Machines never do.”


8.  Pari Sportifs on January 5th, 2007

That bit about TiVo was also on King of Queens, where Spence’s TiVo recorded figure skating and all kinds of musicals for him, very funny…

9.  Realtor on September 21st, 2007

Read, interesting!



ABOUT

Bokardo is the blog of Joshua Porter, a web designer/developer, researcher, and writer. I live in Newburyport, MA, USA.

