Amazon: A Data-Driven Company
Amazon is way nutty with data. They have buckets and buckets of it.
They've got more than 10 million customers. They've got more than
10 million catalog items. They've got data on everything.
Groups within Amazon propose items to go into various "slots" on
the Amazon homepage. Their software automatically gives half of
the new sessions the proposed item and half of the new sessions
the previous item. If the experimental group buys more than the
control group during their sessions, the new item gets the slot
for everyone.
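
To make the slot experiment concrete, here is a minimal sketch in Python. It is not Amazon's code. The function names, the hash-based 50/50 split, and the "units bought per session" metric are all assumptions for illustration. The description above does not mention a significance test, so the sketch simply compares raw averages.

```python
import hashlib


def assign_arm(session_id: str) -> str:
    """Place a session in 'experiment' or 'control', roughly 50/50.

    Hashing the session ID (an assumed scheme) keeps the assignment
    stable for the whole session without storing any state.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return "experiment" if digest[0] % 2 == 0 else "control"


def pick_winner(sessions):
    """Decide which item keeps the slot.

    sessions: iterable of (session_id, units_bought) pairs.
    Returns 'experiment' only if that group bought more per session.
    """
    totals = {"control": [0, 0], "experiment": [0, 0]}  # [sessions, units]
    for session_id, units in sessions:
        arm = assign_arm(session_id)
        totals[arm][0] += 1
        totals[arm][1] += units

    def rate(arm):
        count, units = totals[arm]
        return units / count if count else 0.0

    return "experiment" if rate("experiment") > rate("control") else "control"
```

A real system would presumably want enough sessions in each group before trusting the result; the sketch leaves that out.
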
Amazon calculates, per item, the sparse vector of what other items
people who bought that item also bought. They weight the information
so that if people bought two items together, that's more important
than if they bought them in separate years. They also squelch things
that are really popular so that "Customers who bought this item, also
bought 'Harry Potter and the Order of the Phoenix'" doesn't show up
for every item. They do these calculations offline, but they query
the results in real time.
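
The "also bought" calculation can be sketched roughly as follows. This is an illustration under stated assumptions, not Amazon's algorithm: the half-life decay for purchases made far apart, and dividing by an item's total purchase count to squelch bestsellers, are stand-ins chosen for simplicity. The names `build_also_bought` and `also_bought` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_also_bought(purchases, half_life_days=30.0):
    """Offline step: build a ranked 'also bought' list per item.

    purchases: iterable of (customer_id, item_id, day_number).
    """
    by_customer = defaultdict(list)
    popularity = defaultdict(int)
    for customer, item, day in purchases:
        by_customer[customer].append((item, day))
        popularity[item] += 1

    # Sparse co-purchase weights: scores[item][other_item] -> weight.
    scores = defaultdict(lambda: defaultdict(float))
    for bought in by_customer.values():
        for (a, day_a), (b, day_b) in combinations(bought, 2):
            if a == b:
                continue
            # Bought together counts 1.0; the weight halves for every
            # half_life_days between the two purchases (assumed scheme).
            weight = 0.5 ** (abs(day_a - day_b) / half_life_days)
            scores[a][b] += weight
            scores[b][a] += weight

    table = {}
    for item, related in scores.items():
        # Dividing by popularity damps items that nearly everyone buys.
        table[item] = sorted(
            related,
            key=lambda other: related[other] / popularity[other],
            reverse=True,
        )
    return table


def also_bought(table, item, k=5):
    """Real-time step: a plain lookup into the precomputed table."""
    return table.get(item, [])[:k]
```

With a table built offline, the page-time cost is one dictionary lookup and a slice, which is what makes the offline/real-time split attractive.
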
Every element that goes onto a page has goals for its timing. Finding
and organizing the reviews can only take x milliseconds. Finding and
organizing the "also bought" information can only take y milliseconds.
Finding user book lists that contain your book (or books like it) can
only take z milliseconds. If pages start to trend out of those "contracts",
pagers start going off... people are summoned... panic ensues.
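
A per-element timing contract might look something like the sketch below. The budget numbers are made up (the note above only says x, y, and z), and `run_with_budget` and its `alert` callback are hypothetical names. The sketch checks a single call; spotting a trend would mean tracking these timings across many requests.

```python
import time

# Hypothetical budgets in milliseconds; the real x, y, z are not known.
BUDGETS_MS = {"reviews": 50.0, "also_bought": 30.0, "book_lists": 40.0}


def run_with_budget(name, func, budgets_ms=BUDGETS_MS, alert=print):
    """Run one page element and flag it if it exceeds its budget.

    Returns (result, elapsed_ms, over_budget).
    """
    start = time.perf_counter()
    result = func()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    over_budget = elapsed_ms > budgets_ms[name]
    if over_budget:
        # Stand-in for the pager going off.
        alert(f"{name} took {elapsed_ms:.1f} ms, budget is {budgets_ms[name]:.1f} ms")
    return result, elapsed_ms, over_budget
```

In use, each page element would be wrapped in a call such as `run_with_budget("reviews", fetch_reviews)`, where `fetch_reviews` is whatever function gathers that element.
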
Of more direct interest is this paper:
--
PatrickStein - 24 Mar 2005