Google's Matt Cutts posted an interesting video today, responding to a user-submitted question:
"What resources (textbooks, online PDFs etc) would you recommend to people interested in learning more about LSI, search engine algorithms, etc?"
Cutts first suggests checking out the original PageRank papers. "So there's a whole bunch of different stuff about the anatomy of a large-scale hypertext search engine and then also a bunch of papers about PageRank," he says.
Here's "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Google co-founders Larry Page and Sergey Brin.
Here's "The PageRank Citation Ranking: Bringing Order to the Web" (pdf).
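For a rough feel of what the PageRank paper describes, here is a minimal sketch of the idea as repeated rank redistribution over a link graph. The function name, the toy four-page web, the iteration count, and the way pages with no outgoing links are handled are all illustrative assumptions, not Google's implementation; the 0.85 damping factor is the value the paper itself suggests.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by repeatedly redistributing rank.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a baseline share from the "random jump" term.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical toy web: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the page with the most incoming links ends up with the highest score, and the page nobody links to keeps only the baseline share. A real engine combines a signal like this with text scoring and indexing, which is where the textbooks Cutts mentions come in.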
Cutts also recommends some textbooks. "One is Modern Information Retrieval," he says. "That's got a lot of good stuff about the scoring and the science and thinking about that. And then there's also one called Managing Gigabytes. I think Ian Witten wrote that one. And that one is just a little bit more about the logistics and being able to horse around that much data and thinking about some of the machine's issues and how does a large scale engine work."
"So those three together, and then of course, you can always do searches," says Cutts. "Google Research actually has a ton of different papers that we've published. So you might want to look into that a little bit as well. But basically PageRank, the early Google papers, can give you an idea of how to write a very simple search engine that can scale to 100 million documents or so, Managing Gigabytes, and Modern Information Retrieval, and that will give you a pretty good view of the sort of different parts of the space."
Here's a list of all the areas of focus Google Research has papers on: