How Does Amazon Scale Behavior Modelling?

In thinking about how the most successful sites model human behavior (my current meme), Amazon kept coming to mind. Amazon is amazing at modelling how we talk about books.

Notice that I’m not just talking about buying books, I’m talking about how we talk about books. In other words, they don’t just make it easy to buy books, they make it easy to find and decide what books are right for us. This is not something that comes to mind when you think about web commerce: the first thing is usually “How can people buy stuff from us online?”

I think this is a major differentiator in how Web 2.0 companies succeed. They provide great services by modelling complete processes, not just the endgame. And I suspect that scaling it is as much a problem as doing it in the first place.

So, the core competencies of Web 2.0 companies like Amazon seem to be:

  1. The ability to successfully model human behavior over the complete decision-making process
  2. The ability to provide recommendations based on that modelling, in real time
  3. The ability to provide multiple facets to personalization
  4. The ability to scale this for millions of users

For folks in the usability industry, this modelling business sounds pretty obvious. Their job is all about modelling human behavior: first by conducting field research into how people actually do something, and then by building that into the system. The problem is that usability work like this is incredibly hard to scale.

The metadata involved in modelling human behavior is astounding. Not only do you need to know what each person is doing, you also need to record the trends that emerge between all of those people.

Think about how much it must cost to model how we talk about books. Imagine the backend system Amazon must have to hold millions of reviews, wishlists, The Page You Made, people-who-bought-this-also-bought-that, Listmania lists, So You’d Like to… guides, etc. Each of these requires database capacity that probably dwarfs 99% of the other sites on the web. This is something only the upper echelon of techies could ever hope to pull off. Most sites would hit scalability problems long before they reached the size of an Amazon.

(warning: wild estimates, including very high numbers, follow)

If we look at Metcalfe’s Law, which suggests that the value of a network grows roughly as the square of the number of its users (n²-n connections between them), we see how difficult modelling must be. For example, if we have a network of 5 people shopping for books, then the number of connections we need to investigate for trends is 20, because each person might be forming some interesting trend with every other person (each of the 5 is “connected” to the other 4). Amazon has about 30 million customers, putting the possible connections at a little less than 900,000,000,000,000 (900 trillion). Needless to say, Amazon has a database that could be mined until the end of time without divulging all of its secrets.
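
To make the arithmetic concrete, here’s a minimal Python sketch of that pair count. The function name is mine, and this is just the n²-n formula above, not anything Amazon actually computes:

```python
def pairwise_connections(n: int) -> int:
    """Ordered pairs of distinct users: n**2 - n, i.e. the number of
    person-to-person "trends" a recommender might have to examine."""
    return n * n - n

print(pairwise_connections(5))           # 20, the 5-person example
print(pairwise_connections(30_000_000))  # 899,999,970,000,000, a little less than 900 trillion
```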

900 trillion is a mighty big number. However, to leverage the Long Tail you’ve got to do something like that. Let’s hypothesize about what has to be done in order to implement the “people who viewed this also viewed that” feature. So, if you’re building the new iPod page, you have to perform a query to find all those folks who viewed the Nano (I wonder how long that query takes to return). Then you have to perform another query to see what else they viewed (the number of things people viewed, not just bought, must be in the billions). Then you have to sort out the most popular products from that list. Then you have to index the results so you’re not running this query in real time. And then you present those indexed answers to the user in less than a second, millions of times per day.
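
In case it helps to see it spelled out, here’s roughly that brute-force pipeline sketched in Python. Everything in it is hypothetical: the toy view log, the names, and the offline precomputation are my guesses at one way to do it, not Amazon’s actual design:

```python
from collections import Counter, defaultdict

# Hypothetical view log of (user, product) pairs. A real system would
# stream billions of these events from disk, not hold them in memory.
views = [
    ("alice", "ipod_nano"), ("alice", "ipod_case"),
    ("bob",   "ipod_nano"), ("bob",   "headphones"),
    ("carol", "ipod_nano"), ("carol", "ipod_case"),
]

# Build the two lookups the queries need:
# product -> who viewed it, and user -> what they viewed.
viewers_of = defaultdict(set)
products_of = defaultdict(set)
for user, product in views:
    viewers_of[product].add(user)
    products_of[user].add(product)

def also_viewed(target, top_n=10):
    """Brute-force 'people who viewed this also viewed that':
    find the viewers of `target` (query one), count everything
    else they viewed (query two), keep the most popular."""
    counts = Counter()
    for user in viewers_of[target]:
        for product in products_of[user]:
            if product != target:
                counts[product] += 1
    return counts.most_common(top_n)

# Precompute (index) the answers offline so nothing runs at request time.
index = {product: also_viewed(product) for product in viewers_of}
print(index["ipod_nano"])  # [('ipod_case', 2), ('headphones', 1)]
```

The last step is the point made above: you precompute the answers in a batch job so the page itself only does a cheap indexed lookup.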

(If someone is familiar with how Amazon actually does this, I would be very interested in hearing how brilliant their solution is compared with my brute-force silliness.) I really, really doubt I’ll hear from anyone, because this is the stuff that separates them from, well, people who wonder about it.

So, at first, this modelling sounds relatively straightforward, but you can bet that the size of the datasets makes these some of the most difficult problems on the Web today. Imagine being the system administrator for this system! You have to do all this, and you also have the added little side-constraint of NO DOWNTIME. Wow, I thought my job was stressful at times.

So, what’s new about this? Am I making a novel observation here? I don’t think so, but I do think it is important to focus on what I call the “cliche companies” (Amazon, eBay, Google, Yahoo) in order to figure out what exactly sets them apart. We all know they’re the cream of the crop, but what exactly is it that these companies are doing that others aren’t? Is it only a technological advantage they hold over their competitors? Or is it more algorithmic: a combination of building user-centered systems that model behavior better than anyone else’s and scaling them to simply astounding proportions? I think this is an important question about Web 2.0.

Having the whole world as a marketplace suddenly doesn’t seem so easy…

Published: October 22nd, 2005