Google Indexing Scanned Text

Treating Text Imagery Like Text in Search Results

Google is now indexing scanned documents in search results. In other words, if you scan a page of text and post it to the web, it will be treated like an actual page of text rather than the image that it truly is (in theory, at least).

As Google says, while reading the scanned text may be very easy for a human, it’s a very error-prone process for a computer, so it is unlikely that this will be a flawless endeavor. In a post on the Official Google Blog, Product Manager Erin Levey elaborates a little bit on what Google’s doing:

In the past, scanned documents were rarely included in search results as we couldn’t be sure of their content. We had occasional clues from references to the document– so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format. This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found. This is a small but important step forward in our mission of making all the world’s information accessible and useful.

While we’ve indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read. Scanning is the reverse of printing. Printing turns digital words into text on paper, while scanning makes a digital picture of the physical paper (and text) so you can store and view it on a computer. The scanned picture of the text is not quite the same as the original digital words, however — it is a picture of the printed words. Often you can see telltale signs: the ring of a coffee cup, ink smudges, or even fold creases in the pages.
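
To make the "error-prone" point concrete, here is a minimal sketch of one common clean-up idea: snapping misread words to the closest entry in a known word list. This is not Google's method, which the post does not describe. The word list, the function names and the 0.75 similarity cutoff are all assumptions made for illustration, and the sample words are borrowed from the example queries in this article.

```python
import difflib

# Hypothetical mini-vocabulary; a real system would use a far larger word list.
VOCABULARY = ["repairing", "aluminum", "wiring", "spin", "lock", "performance"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest vocabulary word, or the token unchanged if none is close enough."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Apply correct_token to every whitespace-separated token in the text."""
    return " ".join(correct_token(token) for token in text.split())


if __name__ == "__main__":
    # Typical OCR confusions: "i" read as "l", and "m" read as "rn".
    print(correct_text("repairlng alurninum wiring"))
```

With this word list, the misread phrase should come back as "repairing aluminum wiring", while a token with no close match is left as it was. A fixed cutoff is a blunt instrument: set it too low and valid but unlisted words get "corrected" into the wrong thing.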

Google invites us to take a look at some results that incorporate these listings (note the document excerpts):

repairing aluminum wiring
spin lock performance
Mumps and Severe Neutropenia
Steady success in a volatile world

This project ought to save some people a lot of typing. I can’t help but wonder if this will contribute to Google Book Search pages coming up in regular search results in the future. Perhaps this was a big motivator for Google to do this in the first place. Just speculation.

  • http://www.thefreelibrary.com/Cristian+Stan/Contributed-a218585 CristianStan

    WOW… that’s astonishing. It seems impossible to index scanned pages because they are pictures, right?

    Or is there a difference between a scanned picture and a photo that makes it possible to know more about the scanned text? Like a signature or anything?


    • http://www.mtuba4u.co.za Mtuba4u

      This is a special technique that is used to recognise typing or printing of the written word.

      The process is complicated and long-winded, but now that computers are faster and have more memory to play around with, these techniques can be used to recover printed text from scanned images.

      It is really quite scary how much computing power is needed to decipher printed text from a scanned document. Spell checks and language composition checks are also employed here to ensure correctness, so it is really a very significant technological feat to be able to perform OCR.

      Well done to the techies, and thanks for a great site.

  • http://www.foursquareinnovations.co.uk/ Internet marketer Leeds

    I guess it might be possible to really optimise PDFs well using tags, metadata, etc. in order to give them the edge, but I have to wonder how good OCR can get at recognising and crediting headings, subheadings and other important text from a scan.

    • http://www.mtuba4u.co.za Mtuba4u

      OCR, or Optical Character Recognition, has come a long way since it was first used. It is about time this happened, but yes, there is no way of adding special meta tags to the data on the page.

      So what !!!!!

      For some time now many search engines have been implying that they do not use these meta tags anyway.

      As for headings, subheadings, etc., well, that is easy: you can recognise them with your eyes by the fact that they are bold, underlined or on a separate line.

      Good writing will be rewarded, and bad writing will come off second best. Let’s hope this works well, but I am sure there will be some teething problems with people using mixed fonts and other formatting tricks that are there for visual appeal in the original printed document.

      Looking forward to the many extra pages that will now suddenly appear and have strange relevance due to their age and location within web sites.

  • http://letgon.blogspot.com/ letgon

    Let’s hope this works well, but I am sure there will be some teething problems with people using mixed fonts and other formatting tricks that are there for visual appeal in the original printed document.
