Mining the Two Types of User-Supplied Content
Sitting in my chiropractor’s office the other day, I read a fascinating article in the offline version of Businessweek. Here’s the online version: Math Will Rock Your World.
In addition to finding out that using a laptop 12-14 hours a day can affect my spine, I also found out about the amazing rise of math in business, from analyzing clickstreams to tracking blog conversations. It seems Google and Yahoo already have next year’s math grads lined up for jobs. They simply cannot get enough brain power to do what they want to do.
What do they want to do? Mine data, of course. From the mountains of search queries in Google to the ever-increasing purchase histories at Amazon, we have more data than we know what to do with. Even at the relatively tiny UIE we have more than we can handle. I simply cannot fathom what millions of users could do to a database.
Here’s an interesting bit about Yahoo:
‘At the Sunnyvale (Calif.) campus of Yahoo, chief researcher Prabhakar Raghavan heads a team of 100 mathematicians and computer scientists. Scribbling on a white board covered with equations, Raghavan describes Yahoo’s immense pool of data, featuring the online activity of 200 million registered customers, as Yahoo’s most precious resource. There is a whole world of uninvented businesses, he believes. They’ll come into being as Yahoo discovers new ways to satisfy the urges, curiosities, and desires of this customer base. The hints of these future businesses float in the oceans of Yahoo’s data. Raghavan’s mandate is to sift through that data and form new connections among consumers, e-marketers, and advertisers. Better algorithms, he says, “are critical to survival.”’
In general, there are two kinds of user-supplied content which can be mined:
- User-added content:
Intentional content. That content which users input themselves. This includes blog posts, comments, reviews, ratings, links, RSS subscriptions, podcasts, and video.
- User-generated content:
Unintentional content. That content which accrues as a byproduct of the actions of users. This includes clickstreams, purchase history, RSS read stats, search history, and other artifacts of behavior. User-generated content serves as evidence that a user passed that way, like footprints.
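To make the two categories concrete, here is a minimal sketch of how a site might tag incoming events by kind. The event names and the `classify` helper are hypothetical, invented for illustration; they simply mirror the examples in the list above.

```python
from enum import Enum


class ContentKind(Enum):
    USER_ADDED = "user-added"          # intentional: the user chose to contribute it
    USER_GENERATED = "user-generated"  # unintentional: a byproduct of behavior


# Hypothetical event names, taken from the examples in the two definitions.
INTENTIONAL_EVENTS = {
    "blog_post", "comment", "review", "rating",
    "link", "rss_subscription", "podcast", "video",
}
BEHAVIORAL_EVENTS = {"click", "purchase", "rss_read", "search"}


def classify(event_type: str) -> ContentKind:
    """Return which of the two kinds of content an event belongs to."""
    if event_type in INTENTIONAL_EVENTS:
        return ContentKind.USER_ADDED
    if event_type in BEHAVIORAL_EVENTS:
        return ContentKind.USER_GENERATED
    raise ValueError(f"unknown event type: {event_type}")
```

A rating would come back as `ContentKind.USER_ADDED` and a search as `ContentKind.USER_GENERATED`. Real events will not always sort this cleanly, which is part of why the line between the two is fuzzy.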
This distinction may or may not be important. I don’t know. But we are seeing a tremendous amount of work in the area of aggregating these types of content in an effort to build recommendation systems out of them.
In general, though, I think we’re learning some basic rules of thumb. Recommendation systems seem to work better if they are built out of users’ direct preferences, like ratings or reviews. If you try to build them out of, say, clickstreams, you won’t get the intentional feedback that you need. For example, Amazon gives recommendations built out of searches on their web site, even if the search was for something that you only looked at as a gift for someone else. I recently did a search on knitting for my wife and now I’m stuck with knitting books for a while. However, their recommendations built on top of my wish list are much more valuable to me, and I actually find them useful.
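As a rough illustration of that rule of thumb, here is a small sketch that scores topics by giving intentional signals more weight than behavioral ones. This is not how Amazon or anyone else actually does it; the signal names, the weights, and both functions are assumptions made up for the example.

```python
from collections import defaultdict

# Assumed weights: intentional signals count far more than behavioral ones.
SIGNAL_WEIGHTS = {
    "wish_list": 1.0,  # user-added, intentional
    "rating": 1.0,     # user-added, intentional
    "search": 0.2,     # user-generated, behavioral
    "click": 0.1,      # user-generated, behavioral
}


def interest_scores(events):
    """Sum the weighted signals per topic for (signal, topic) pairs."""
    scores = defaultdict(float)
    for signal, topic in events:
        # Signals we have no weight for contribute nothing.
        scores[topic] += SIGNAL_WEIGHTS.get(signal, 0.0)
    return dict(scores)


def top_topics(events, n=3):
    """Return up to n topics, highest score first, ties broken by name."""
    scores = interest_scores(events)
    return sorted(scores, key=lambda topic: (-scores[topic], topic))[:n]


# One gift search and one click on knitting, two wish list items on usability.
events = [
    ("search", "knitting"),
    ("click", "knitting"),
    ("wish_list", "usability"),
    ("wish_list", "usability"),
]
print(top_topics(events, n=1))  # ['usability']
```

With weights like these, a single gift search is not enough to push knitting ahead of what I deliberately put on my wish list. Picking the weights well is the hard part, and presumably where all those math grads come in.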
Going back to the article, I liked this quote:
“People are complicated,…If you have a system, they figure out how to game it. Machines never do.”
Comments
1. Jonathan 10:44pm, Tue 31st, 2006
Your comment about being stuck with knitting books is an interesting addition to a post I read a year back called My TiVo thinks I’m gay. TiVo has a somewhat disquieting ability to discern your viewing preferences from the things you ask it to record. TiVo uses those perceived preferences to thoughtfully record other stuff it ‘thinks’ you might enjoy. The article talks about how one owner’s TiVo started recording shows that clearly indicated it thought that he was gay. To compensate, the owner started recording programs about war and other ‘manly’ subjects. His TiVo then began overcompensating, thinking his tastes were more in line with those of a WWII Nazi official. In the parlance of show biz: wackiness ensued!
2. Josh 6:41am, Wed 1st, 2006
That’s a great pointer! And a great headline, too! Thanks, Jonathan. I love the guy’s reaction and strategy. Brilliant.
3. Nir Ben-Dor 5:14pm, Wed 1st, 2006
Wow, this post really hit the spot for me. I just wrote an article in Linkadelic Magazine earlier today about the problems of the web. Here is a small part:
the rest is at There’s something very wrong with today’s internet
4. Jared Spool 8:45am, Fri 3rd, 2006
Isn’t #2 (User-generated content) what Gillmor keeps insisting are “Gestures”?
5. evano 4:52pm, Fri 3rd, 2006
I’m not sure about Gillmor’s Gestures, but #2 also reminds me of “Attention” or the artifacts of Attention. Or am I just missing the point? (BTW — your comments preview function is something I hope to see replicated everywhere there’s a comment box!)
6. Josh 11:24pm, Fri 3rd, 2006
Jared, I’m not sure where or if Gillmor’s line of gestures would be drawn…interesting question.
7. Dewayne Mikkelson 10:41am, Mon 6th, 2006
This is a great quote and it sounds like the service MINT has hit the same problem.
“People are complicated,…If you have a system, they figure out how to game it. Machines never do.”
8. Pari Sportifs 9:28am, Fri 5th, 2007
That bit about TiVo was also on King of Queens, where Spence’s TiVo recorded figure skating and all kinds of musicals for him, very funny…
9. Realtor 5:49pm, Fri 21st, 2007
Read, interesting!