Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers


An example of a cycle of hyperlinks. Pages A, B, and E form a cycle because you can start at A, click through to B, then E, and then return to your starting point at A.

The problem caused by cycles. A, B, and E are always out of date, and their scores keep growing forever.

Calculating authority scores this way creates a chicken-and-egg problem. If we knew the true authority score for A, we could compute the authority scores for B and E. And if we knew the true scores for B and E, we could compute the score for A. But because each depends on the other, it seems as if this would be impossible.

Fortunately, we can solve this chicken-and-egg problem using a technique we'll call the random surfer trick. Beware: the initial description of the random surfer trick bears no resemblance to the hyperlink and authority tricks discussed so far. Once we've covered the basic mechanics of the random surfer trick, we'll do some analysis to uncover its remarkable properties: it combines the desirable features of the hyperlink and authority tricks, but works even when cycles of hyperlinks are present.

The random surfer model. Pages visited by the surfer are darkly shaded, and the dashed arrows represent random restarts. The trail starts at page A and follows randomly selected hyperlinks, interrupted by two random restarts.

The trick begins by imagining a person who is randomly surfing the internet. To be specific, our surfer starts off at a single web page selected at random from the entire World Wide Web. The surfer then examines all the hyperlinks on the page, picks one of them at random, and clicks on it. The new page is then examined and one of its hyperlinks is chosen at random. This process continues, with each new page being selected randomly by clicking a hyperlink on the previous page. The figure above shows an example, in which we imagine that the entire World Wide Web consists of just 16 web pages. Boxes represent the web pages, and arrows represent hyperlinks between the pages. Four of the pages are labeled for easy reference later. Web pages visited by the surfer are darkly shaded, hyperlinks clicked by the surfer are colored black, and the dashed arrows represent random restarts, which are described next.

There is one twist in the process: every time a page is visited, there is some fixed restart probability (say, 15%) that the surfer does not click on one of the available hyperlinks. Instead, he or she restarts the procedure by picking another page randomly from the whole web. It might help to think of the surfer having a 15% chance of getting bored by any given page, causing him or her to follow a new chain of links instead. To see some examples, take a closer look at the figure above. This particular surfer started at page A and followed three random hyperlinks before getting bored by page B and restarting on page C. Two more random hyperlinks were followed before the next restart. (By the way, all the random surfer examples in this chapter use a restart probability of 15%, which is the same value used by Google cofounders Page and Brin in the original paper describing their search engine prototype.)
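The process just described can be sketched in a few lines of Python. The link structure below is a made-up four-page web used purely for illustration; the book's 16-page example exists only in its figure.

```python
import random

# A made-up four-page web: each page maps to its outgoing hyperlinks.
# (This is an illustrative stand-in, not the book's 16-page figure.)
LINKS = {
    "A": ["B"],
    "B": ["E"],
    "E": ["A", "C"],
    "C": ["A"],
}
RESTART_PROBABILITY = 0.15  # the value used by Page and Brin

def surf(steps, seed=0):
    """Simulate the random surfer and count visits to each page."""
    rng = random.Random(seed)
    visits = dict.fromkeys(LINKS, 0)
    page = rng.choice(sorted(LINKS))  # start at a random page
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < RESTART_PROBABILITY or not LINKS[page]:
            page = rng.choice(sorted(LINKS))  # bored: restart anywhere
        else:
            page = rng.choice(LINKS[page])    # click a random hyperlink
    return visits

visits = surf(1000)  # 1000 simulated visits, as in the text
```

Each call to `surf` produces one long surfing trail; the visit counts are what the next paragraphs turn into authority scores.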

It's easy to simulate this process by computer. I wrote a program to do just that and ran it until the surfer had visited 1000 pages. (Of course, this doesn't mean 1000 distinct pages. Multiple visits to the same page are counted, and in this small example, all of the pages were visited many times.) The results of the 1000 simulated visits are shown in the top panel of the figure on the next page. You can see that page D was the most frequently visited, with 144 visits.

Just as with public opinion polls, we can improve the accuracy of our simulation by increasing the number of random samples. I reran the simulation, this time waiting until the surfer had visited one million pages. (In case you're wondering, this takes less than half a second on my computer!) With such a large number of visits, it's preferable to present the results as percentages. This is what you can see in the bottom panel of the figure on the facing page. Again, page D was the most frequently visited, with 15% of the visits.
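Converting raw visit counts into these percentages is a one-line calculation. In the sketch below, only page D's count (144 out of 1000) comes from the text; the other numbers are invented, with the remaining eleven pages lumped together as "others".

```python
# Hypothetical visit counts from a 1000-visit run. Only page D's
# count (144) appears in the text; the rest are made-up placeholders.
visits = {"A": 130, "B": 98, "C": 21, "D": 144, "E": 60, "others": 547}

total = sum(visits.values())  # 1000 visits in all
scores = {page: 100 * count / total for page, count in visits.items()}
# Page D's share: 100 * 144 / 1000 = 14.4, close to the 15%
# reported for the million-visit simulation.
```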

What is the connection between our random surfer model and the authority trick that we would like to use for ranking web pages? The percentages calculated from random surfer simulations turn out to be exactly what we need to measure a page's authority. So let's define the surfer authority score of a web page to be the percentage of time that a random surfer would spend visiting that page. Remarkably, the surfer authority score incorporates both of our earlier tricks for ranking the importance of web pages. We'll examine each of these in turn.

First, we had the hyperlink trick: the main idea here was that a page with many incoming links should receive a high ranking. This is also true in the random surfer model, because a page with many incoming links has many chances to be visited. Page D in the lower panel on the next page is a good example of this: it has five incoming links, more than any other page in the simulation, and ends up having the highest surfer authority score (15%).

Second, we had the authority trick. The main idea was that an incoming link from a highly authoritative page should improve a page's ranking more than an incoming link from a less authoritative page. Again, the random surfer model takes account of this. Why? Because an incoming link from a popular page will have more opportunities to be followed than a link from an unpopular page. To see an instance of this in our simulation example, compare pages A and C in the lower panel above: each has exactly one incoming link, but page A has a much higher surfer authority score (13% vs. 2%) because of the quality of its incoming link.

Random surfer simulations. Top: Number of visits to each page in a 1000-visit simulation. Bottom: Percentage of visits to each page in a simulation of one million visits.

Surfer authority scores for the scrambled egg example on page 29. Bert and Ernie each have exactly one incoming link conferring authority on their pages, but Bert's page will be ranked higher in a web search query for “scrambled eggs.”

Notice that the random surfer model simultaneously incorporates both the hyperlink trick and the authority trick. In other words, the quality and quantity of incoming links at each page are all taken into account. Page B demonstrates this: it receives its relatively high score (10%) due to three incoming links from pages with moderate scores, ranging from 4% to 7%.

The beauty of the random surfer trick is that, unlike the authority trick, it works perfectly well whether or not there are cycles in the hyperlinks. Going back to our earlier scrambled egg example (page 29), we can easily run a random surfer simulation. After several million visits, my own simulation produced the surfer authority scores shown in the figure above. Notice that, as with our earlier calculation using the authority trick, Bert's page receives a much higher score than Ernie's (28% vs. 1%)—despite the fact that each has exactly one incoming link. So Bert would be ranked higher in a web search query for “scrambled eggs.”

Now let's turn to the more difficult example from earlier: the figure on page 30, which caused an insurmountable problem for our original authority trick because of the cycle of hyperlinks. Again, it's easy to run a computer simulation of random surfers, producing the surfer authority scores in the figure above. The surfer authority scores determined by this simulation tell us the final ranking that would be used by a search engine when returning results: page A is highest, followed by B, then E, with C and D sharing last place.

Surfer authority scores for the earlier example with a cycle of hyperlinks (page 30). The random surfer trick has no trouble computing appropriate scores, despite the presence of a cycle (A → B → E → A).
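A simulation along the following lines reproduces the qualitative ranking just described. The link structure here is a guess consistent with the cycle A → B → E → A, with pages C and D each assumed to link to A; the book's actual figure is not reproduced in the text.

```python
import random

# A guessed five-page graph containing the cycle A -> B -> E -> A;
# pages C and D each link to A. (An assumption consistent with the
# described ranking, not the exact graph from the book's figure.)
LINKS = {
    "A": ["B"],
    "B": ["E"],
    "E": ["A"],
    "C": ["A"],
    "D": ["A"],
}
RESTART_PROBABILITY = 0.15

def surfer_authority_scores(steps, seed=1):
    """Simulate the random surfer and return visit percentages."""
    rng = random.Random(seed)
    visits = dict.fromkeys(LINKS, 0)
    page = rng.choice(sorted(LINKS))
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < RESTART_PROBABILITY:
            page = rng.choice(sorted(LINKS))  # random restart
        else:
            page = rng.choice(LINKS[page])    # follow a hyperlink
    return {p: 100 * v / steps for p, v in visits.items()}

scores = surfer_authority_scores(500_000)
# Despite the cycle, every page settles on a stable score:
# A is highest, then B, then E, with C and D sharing last place
# (C and D are reached only via restarts, so each gets about 3%).
```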

PAGERANK IN PRACTICE

The random surfer trick was described by Google's cofounders in their now-famous 1998 conference paper, “The Anatomy of a Large-scale Hypertextual Web Search Engine.” In combination with many other techniques, variants of this trick are still used by the major search engines. There are, however, numerous complicating factors, which mean that the actual techniques employed by modern search engines differ somewhat from the random surfer trick described here.

