Diffbot Uses Robots To Extract Data From E-commerce Sites
By: Chris Crum - July 31, 2013
Diffbot announced today that it is releasing a new API that uses robots to understand and extract data from e-commerce sites.
The robotics company, which uses vision, machine learning and artificial intelligence to analyze and extract data from web pages, appeared at the Bing-sponsored LAUNCH event last year, where it laid out its plans to make the entire web machine-readable. More on that here.
The new API uses computer vision to turn any e-commerce site into a product database, the company says in an email.
“Software developers can use the API to extract a variety of data from the page, including product image, SKU code, price, shipping cost, discount price, MSRP, etc.,” a spokesperson for Diffbot tells WebProNews. “The API can identify and structure information regardless of a site’s design, layout, markup or language.”
Additionally, Diffbot has developed a spider technology, which can analyze an entire site, skip non-product pages, and extract data from only the relevant page types.
“Think about Target.com, or Wal-Mart.com, and being able to extract ALL of the product data from all of the product pages,” the spokesperson says.
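To make the idea concrete, here is a minimal Python sketch of a spider that walks a site, skips non-product pages, and keeps only structured product records. It is not Diffbot's implementation or API: the `Product` and `Page` classes, the `crawl_products` function, and the toy `SITE` data are all hypothetical, and each page's type is simply given rather than inferred by computer vision as the company describes.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Product:
    """Record whose fields echo those named in the article (assumed schema)."""
    url: str
    sku: str
    price: float
    msrp: Optional[float] = None
    shipping_cost: Optional[float] = None
    image_url: Optional[str] = None


@dataclass
class Page:
    """Hypothetical in-memory stand-in for a fetched and classified web page."""
    url: str
    page_type: str                 # e.g. "home", "article", "product"
    links: List[str]
    fields: Optional[dict] = None  # product data, present only on product pages


def crawl_products(start_url: str, site: Dict[str, Page]) -> List[Product]:
    """Breadth-first walk of `site`, collecting records from product pages only."""
    seen = {start_url}
    queue = deque([start_url])
    products: List[Product] = []
    while queue:
        url = queue.popleft()
        page = site.get(url)
        if page is None:
            continue  # link points outside the toy site
        # Non-product pages are skipped for extraction but still followed.
        if page.page_type == "product" and page.fields:
            products.append(Product(url=url, **page.fields))
        for link in page.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return products


# Toy site (made-up data): one home page, one article page, two product pages.
SITE = {
    "/": Page("/", "home", ["/about", "/p/1", "/p/2"]),
    "/about": Page("/about", "article", ["/"]),
    "/p/1": Page("/p/1", "product", ["/p/2"],
                 {"sku": "A-100", "price": 19.99, "msrp": 24.99}),
    "/p/2": Page("/p/2", "product", ["/"], {"sku": "B-200", "price": 5.00}),
}

products = crawl_products("/", SITE)
```

In this sketch the home and article pages contribute links but no records, so `products` holds only the two product entries. In a real system the hard part is the step this toy takes for granted: deciding what type of page it is looking at and where the fields sit on it.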
“E-commerce is one of the most popular activities on the web. With 28% of US internet users shopping on a daily basis, we figured we should teach our robot how to understand products,” said CEO Mike Tung. “The Product API represents our latest advances in pushing the capabilities of automated page extraction. We are one step closer to the imminent goal of making the entire web machine-readable.”
Diffbot believes the entire web can be broken down into about twenty or so page types, such as home pages, article pages, product pages, location pages, social network pages, etc., and says it will continue to roll out APIs for new page types until it has tools to index the entire Internet. It already has APIs for home pages, article pages and image pages.
The company is backed by EarthLink founder Sky Dayton, who sits on its board.