The pi searcher uses a combination of two search techniques: Linear search and indexed search. We'll see why in a few minutes.
Before Pi day 2006, this was the only way the pi searcher operated. The pi searcher has a file with the first 200M digits of pi. When a visitor entered a search, the program would open the file and go through it digit by digit, comparing them against the search string. For example, if you search for 9265, the program would do the following:
Pi: 141592653589793238462643... Query: 9265 (nope) Pi: 141592653589793238462643... Query: 9265 Pi: 141592653589793238462643... Query: 9265 Pi: 141592653589793238462643... Query: 9265 Pi: 141592653589793238462643... Query: 9265 (yep, it matches!)
For a computer, having a lot of Pi is like a human having a lot of pie: It makes it fat and slow. If you store Pi in a normal, human-readable text file (like you can create in Notepad, Emacs, vi, or whatever your favorite editor is), then Pi requires one byte per digit. So 200,000,000 digits of pi takes up 200,000,000 bytes, or 190 megabytes (computer scientists use the term megabyte to mean 2^20 bytes, or 1048576, instead of just 1 million). Even with nicely fast computers, it can still take a few seconds to read 190 megabytes of data from a hard disk, and a significant fraction of a second even to read through it when it's already in memory!
The pi searcher uses half this amount of space. Note that Pi only contains 10 different characters (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), unlike completely general files that can contain 256 different characters (0-9, a-z, A-Z, and all of the punctuation characters, and then a big set of "unprintable" characters). Computers store information as bits - ones and zeros. Normally, one character is stored as a byte, or 8 bits. (2^8 = 256, the number of different characters). But we only have 10 possibilities, so the Pi searcher can store those using 4 bits each (2^4 = 16, just a little more than we need).
So, the solution: The pi searcher stores the digits of Pi packed two digits per byte. As a result, the file of Pi is totally unreadable to a human, but it's half the size of the human-readable version!
If you were counting carefully, you'll have noticed that we use 4 bits (which can hold 16 values) to store 10 different values. So there's actually more room for savings: We technically need only log2 (10) = 3.322 bits to store each number. Unfortunately, saving that extra .6 bits per digit makes the code a lot more complicated, and slower.
For longer search strings, strings, the Pi Searcher gets a little bit more clever and packs the first four digits of the search string the same way it packs the digits of Pi. It then goes through and compares the packed digits.
Pi: 14 15 92 65 35 89 79 32 38 46 26 43...
Query: 65 35
^^ both checked in one step, because they're just one byte
Pi: 14 15 92 65 35 89 79 32 38 46 26 43...
Query: 6 53 5
^^^ still has to do multiple comparisons because they bytes don't line up
With this optimization, every other step, the Pi searcher only has to do one comparison to check two digits, instead of one comparison per digit. We also throw in a few little optimizations (like comparing four digits in a row before checking to see if any of the comparisons worked) that make the linear search faster on a modern CPU. This happens because it makes the resulting code have fewer branches (if-then choices), which can be fairly slow.
When developing the pi searcher, I benchmarked a number of cool, fancy linear search algorithms that are supposed to be faster than the basic algorithm. In general, the linear search I described above - even the Pi searcher's optimized one - takes time proportional to the number of digits of Pi and the number of digits in the search string: Pretend you're comparing two strings: 11110 and 11111. You compare the first digits -- match! So then you compare the second digits, and the third... and you don't realize that they're different until you get to the last digit.
There are improved search algorithms, such as Boyer-Moore or Knuth-Morris-Pratt, that require fewer comparisons. With the Pi searcher, it turns out that the complexity of the "faster" algorithms actually makes them slower for searching Pi. They are, however, fairly cool algorithms; Wikipedia's String searching algorithm entry has more information about them if you're curious.
All of the optimizations I mentioned above help, but in the worst case, it still takes a few hundred million comparisons to figure out whether a string was found, which took about a second or two on our current hardware. A second isn't bad, but during the busiest periods of Pi day, the pi searcher handles about 5 searches every second. Clearly, taking one second is way too long.
The pi searcher now has a pre-computed index of the location of all of the substrings of Pi. The index is computed using the suffix array code from an implementation by Sean Quinlan and Sean Dorward. (Cool trivia: Suffix arrays were invented in 1990 by Udi Manber, who subsequently was chief scientist at Yahoo!, then the chief algorithms officer at Amazon, and who is now at Google.) The index takes up 8 times the space of the compressed digits, but that's only about a gigabyte. This pre-computed index lets us jump almost immediately to the place where a particular search string is. But how can we index every search string, when people could search for anything?
The answer is the beauty of suffix arrays. Let's say we're indexing the first 10 digits of pi:
1415926535
The suffix array maintains a list in lexicographical order of where strings start in pi. If we look at those 10 digits, we see that the "smallest" string is the one that starts with "141", and the next smallest is "159", and then "265", and so on. So the suffix array stores the positions at which each of those substrings begins: 0, 2, 5, 8, 1, 9, 7, 3, 6, 4, which correspond to all possible substrings of the first ten digits of pi, in increasing order:
0: 1415926535 2: 15926535 5: 26535 8: 35 1: 415926535 9: 5 7: 535 3: 5926535 6: 6535 4: 926535
So now, if someone asks for a string starting with "5", we can look in the index and find it quickly, because the index is in sorted order. A technique known as binary search does the trick nicely: We look in the middle of the list. If the value at the place we look is larger than what we're looking for, we know that what we want is in the first half, otherwise it's in the second half. And so on, and so on. It's much like how you might search through a phone book to find someone's number.
If you were sharp-eyed, you noticed that there were possibly many things in the index that match a particular query. For instance, if you searched for "5", you would have three possibilities. To make things worse, the entries in the index are stored in lexicographical order; the order has nothing to do with where in Pi the entries occur. Note how the first "5" in there is actually the one at the very end! But the Pi searcher needs to return the results in order.
Resolving this isn't too hard: Once we find the values in the index, we do a linear search through all of them that match the search query and take the smallest. Unfortunately, that could be a lot of values. For example, 10\% of the substrings of Pi begin with the digit "5". So if someone wants to find the first 5 in Pi, we might have to scan through twenty million entries still. :( That's no good. So this is where we combine linear search and our index search:
The observant will note that I didn't split things up optimally. The best way to do this would be to perform linear search on 4 digit strings (10,000 comparisons), and then do an index search where you then have to scan the remaining 2,000 entries, making the maximum number of comparisons smallest overall. (roughly the square root of the number of digits). The reason it's unbalanced is because the 5 digit linear search is really, really optimized, and the linear search in the index is quite a bit slower. So the time balances out better.
The pi searcher runs on a 1.7Ghz pentium 4 server with 512M of RAM and a very fast 80GB Seagate SCSI hard drive. The computer was moderately nice in 2004, but isn't all that any more. The Pi Searcher is hosted by Xmission in their very cool co-location facility in Salt Lake City, UT. (Their fingerprint scanner for door access rocks.) The server runs an older version of FreeBSD and uses the Apache web server. The Pi Searcher itself is written in C.
Last updated: Tue Dec 29 21:43:46 GMT 2009 [validate xhtml]
dga - at - pobox dot com.