Diffbot organizing Web data for enterprise use

The company claims to have created a structured representation of much of the data on the Web

Google's Knowledge Graph organizes information on the Web so it can be programmatically queried

Diffbot is trying to reorganize all the data on the Web so it can be put to better use.

The service "converts the existing Web into a structured database-like representation that can essentially be used for all sorts of intelligent applications," said Mike Tung, Diffbot CEO.

On Thursday, Diffbot said it had received $500,000 in funding from Bloomberg Beta, the investment arm of the Bloomberg media company. Andy Bechtolsheim, a founder of Sun Microsystems and the first major investor in Google, is also a backer. Diffbot says it already has paying customers for the service, which is being used by Microsoft's Bing, Adobe, Salesforce.com, and eBay.

The service creates an object for each Web page it finds. An object provides structure to a set of related data so that it can be programmatically reused, along with other similar objects, by a query engine or an external application. The software has been copying all the pages it finds on the Web and reorganizing them into objects.
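As a rough illustration of what such an object might look like, the Python sketch below models a page as a typed record and filters a small collection by type and field value. The class name `PageObject`, its fields, the `query` helper, and the sample records are all hypothetical, made up for this example; they are not Diffbot's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class PageObject:
    """Hypothetical structured record extracted from one Web page."""
    url: str
    page_type: str  # e.g. "product" or "article"
    fields: dict = field(default_factory=dict)


def query(objects, page_type, **criteria):
    """Return objects of the given type whose fields match every criterion."""
    return [
        obj for obj in objects
        if obj.page_type == page_type
        and all(obj.fields.get(key) == value for key, value in criteria.items())
    ]


# Made-up sample data standing in for pages already converted into objects.
OBJECTS = [
    PageObject("https://example.com/shoes/1", "product",
               {"brand": "Acme", "name": "Trail Runner", "price": 89.0}),
    PageObject("https://example.com/shoes/2", "product",
               {"brand": "Zenith", "name": "Road Racer", "price": 120.0}),
    PageObject("https://example.com/news/1", "article",
               {"title": "Shoe sales rise", "author": "A. Writer"}),
]

if __name__ == "__main__":
    for obj in query(OBJECTS, "product", brand="Acme"):
        print(obj.url, obj.fields["name"])
```

Because every record of the same type shares the same field names, an external application can treat the collection much like a database table rather than a pile of HTML.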

Perhaps the best-known example of this object-based approach is Google's Knowledge Graph, a Semantic Web project. If a search is done on a particular keyword, such as the name "Johnny Depp," Google will return, along with a standard list of Web pages, a box containing basic information on the actor, such as birth date and height. That box of information is a rendering of the "Johnny Depp" Knowledge Graph object built by Google.

Diffbot, which is based in Palo Alto, California, and was founded in 2008, claims its own collection of objects is superior to Google's.

The 14-person company says it has created an entirely automated system for accurately creating objects. Google's approach is at least partly manual, requiring individuals to edit objects after they have been created, a Google spokesman confirmed.

Google's Knowledge Graph is larger than Diffbot's, containing roughly a billion objects, while Diffbot's global index of the Web now includes 600 million objects. But Google doesn't yet offer a Knowledge Graph API for third-party commercial use, though it is working on one.

Diffbot is based on the idea that businesses could use such a collection of organized information for their own purposes. Nike, for instance, could deploy the service to build a profile of other shoe companies and their offerings, Tung suggested. Diffbot offers a set of APIs (application programming interfaces) that third-party applications can use to query the massive object set.

The company has developed a set of AI algorithms, some of which it is in the process of patenting, that can identify the context and subject of Web pages. One novel algorithm relies on computer vision, which is not a widely used technique for indexing Web pages, Tung acknowledged. The layout and design of Web pages can provide important clues to help better define objects. "The layout is the signal that helps us determine what kind of page it is," Tung said. An e-commerce site has an entirely different structure from a news site, for instance.
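To make the idea of layout as a signal concrete, here is a deliberately simple Python sketch. It does not use computer vision, and it is not Diffbot's algorithm. It swaps in hand-written scoring rules over a few layout cues that a vision system might measure on a rendered page. The `LayoutFeatures` fields, the weights, and the 0.4 threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayoutFeatures:
    """Hypothetical visual cues measured on a rendered page."""
    has_price_near_image: bool  # a price displayed next to a large image
    buy_button_count: int       # visible "add to cart"-style buttons
    main_text_fraction: float   # share of page area taken by body text (0..1)
    has_byline: bool            # author/date line under a headline


def classify_page(f: LayoutFeatures) -> str:
    """Score each page type from layout cues and return the best guess."""
    # Illustrative weights only; a real system would learn these from data.
    product_score = 2 * f.has_price_near_image + min(f.buy_button_count, 2)
    article_score = 2 * f.has_byline + (2 if f.main_text_fraction > 0.4 else 0)
    if product_score == article_score:
        return "unknown"
    return "product" if product_score > article_score else "article"


if __name__ == "__main__":
    shop_page = LayoutFeatures(True, 1, 0.1, False)
    news_page = LayoutFeatures(False, 0, 0.6, True)
    print(classify_page(shop_page))
    print(classify_page(news_page))
```

Once the page type is known, an extractor can look for the fields that type is expected to carry, such as a price on a product page or an author on a news story.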

Diffbot is one of a number of companies building such "knowledge graphs," through various sets of technologies, said Dave Schubmehl, an IDC research director who covers content analytics, discovery and cognitive systems. Such technology could be of potential value to any business that relies on understanding large amounts of external data, he said via email.

Another company working in this field is IBM, Schubmehl wrote. Last year, IBM purchased two companies to add similar capabilities to its Watson cognitive computing service. One was AlchemyAPI, which builds taxonomies of data assets, and the other was Blekko, which developed software for indexing Web sites.

Some organizations use other technologies to organize and synthesize large sets of otherwise unstructured information, according to Schubmehl. Neo4j and Oracle both offer graph databases, which are well-suited for identifying the connections across large collections of data. Others rely on Semantic Web technologies, such as the Sesame Java framework, which is used for working with data in the structured RDF (Resource Description Framework) format.
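Both graph databases and RDF rest on the same basic shape: facts stored as connections between named things. The Python sketch below shows that shape using plain tuples as subject-predicate-object triples, with a small pattern-matching helper. It does not use Neo4j, Oracle, or Sesame, and the `ex:` identifiers, the `match` helper, and the sample facts are invented for illustration.

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
# All identifiers below are made up for this example.
TRIPLES = [
    ("ex:AcmeShoes", "rdf:type", "ex:Company"),
    ("ex:AcmeShoes", "ex:sells", "ex:TrailRunner"),
    ("ex:TrailRunner", "rdf:type", "ex:Product"),
    ("ex:TrailRunner", "ex:price", "89.00"),
    ("ex:ZenithShoes", "rdf:type", "ex:Company"),
    ("ex:ZenithShoes", "ex:sells", "ex:RoadRacer"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def products_by_company(triples):
    """Follow type and 'sells' edges to list what each company offers."""
    companies = {s for s, _, _ in match(triples, p="rdf:type", o="ex:Company")}
    return {
        company: sorted(o for _, _, o in match(triples, s=company, p="ex:sells"))
        for company in sorted(companies)
    }


if __name__ == "__main__":
    for company, products in products_by_company(TRIPLES).items():
        print(company, products)
```

Answering a question then becomes a matter of chaining pattern matches across edges, which is the kind of connection-finding that dedicated graph stores are built to do at scale.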

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com

