Crawl, Walk, then Run on the Way to Identity and Context Virtualization
After I published my post on identity and context virtualization, Tom Kaczmarek posted an insightful comment in the IAM section on LinkedIn, and in doing so inspired this blog post (I love the opportunity to riff on these topics, so thanks, Tom!). He raised two points:
- His first question has to do with scalability: “How does one make this sustainable as the number of local views increases?”
- His other issue is more fundamental and has to do with the soundness of the approach: “A missing piece in the explanation of this framework is a strong semantic model (of the type AI researchers have developed) to act as the glue to bind the pieces.”
Question 1: Is it scalable?
As always, this is a key question. The whole idea is to build a hugely scalable, distributed infrastructure that serves as the basis for integrating multiple data sources (the “contextual” part of the story) linked to a potentially huge number of identities/actors (which can be humans or computers). This is a daunting task, and as Tom emphasized, it needs to be sustainable as the number of local views increases.
Now, we know better than to try to invent a whole new philosopher’s stone. As the saying often attributed to Picasso goes, “good artists copy, great artists steal.” So we borrowed a pattern that has proven its worth. In this case, we see this pattern working for us every day; it’s the same one behind the Internet and DNS. The idea is to allow the local sources to publish their own views of the world and then link and aggregate this information as you go. So a local website exposes itself as HTML documents, and then hyperlinks and DNS tie the whole system together, turning it into the Web we know today. In the same way, the virtualization and publication of structured data will enable a progressive integration of all data silos. Will that scale? Absolutely, in the same way that DNS scales.
Of course, the Internet was not born in a day, with a master plan to publish our worldview in a consistent way. Instead, it was built on this idea of providing the tools for a progressive, grassroots level of aggregation and linking of information. DNS is another example of grassroots aggregation/integration of name services based on zone files and a hierarchical structure.
In fact, think of DNS as a set of hierarchical zone files, distributed around the world and served by many machines, but still able to resolve any name like “www.mycrazyworld.net” into an IP address you can actually connect to. Push DNS one step further and you transform a simple name service into an object naming service (aka a directory). Push it one step past that by linking related objects into sentences, and those sentences into relevant contexts. Now you’re reaching a world where your data silos are turned into a set of human-readable contexts and sentences that can be searched using your favorite Google search box (or Bing box, to keep our friends from Redmond happy…).
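To make the zone-file picture a little more concrete, here is a minimal sketch of hierarchical delegation. It is a toy, not real DNS: the nested dictionary stands in for zones owned by different parties, and the zone contents, the `resolve` function, and the address (taken from a range reserved for documentation) are all assumptions made up for illustration.

```python
# Toy model of hierarchical delegation, loosely inspired by DNS zones.
# All zone contents and the address below are made up for illustration.

ROOT = {
    "net": {                      # zone delegated to the "net" owner
        "mycrazyworld": {         # zone delegated to the site owner
            "www": "192.0.2.10",  # leaf record (documentation-range address)
        },
    },
}

def resolve(name, root=ROOT):
    """Walk the labels right to left, one delegation at a time."""
    node = root
    for label in reversed(name.lower().rstrip(".").split(".")):
        if not isinstance(node, dict) or label not in node:
            raise KeyError(f"cannot resolve {name!r}: no entry for {label!r}")
        node = node[label]
    if isinstance(node, dict):
        raise KeyError(f"{name!r} names a zone, not a record")
    return node

print(resolve("www.mycrazyworld.net"))
```

The point of the sketch is that no single party holds the whole tree: each level only needs to know who owns the next level down, which is what lets the real system grow without a master plan.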
This is what we mean by “manage globally, act locally.” Delegate the integration task to the owner of the data source—he knows his mess the best!—and let him publish the view of his data that he wants to share. Then provide a flexible layer to integrate those views in whatever way works for you. This uniform approach unlocks the data from its silo, while allowing a highly scalable and distributed architecture.
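Here is a rough sketch of that division of labor, under heavy assumptions: two hypothetical data owners (an HR system and a CRM) each publish a view as a plain function, and a thin `aggregate` layer joins the views on a shared key. The names, fields, and records are invented for the example and say nothing about how any real product does it.

```python
# Sketch of "manage globally, act locally": each owner publishes a view,
# and a thin layer merges the views. All names and records are hypothetical.

def hr_view():
    # The HR owner decides what to expose from the HR system.
    return [{"id": "jdoe", "name": "Jane Doe", "department": "Finance"}]

def crm_view():
    # The CRM owner exposes only the accounts each employee manages.
    return [{"id": "jdoe", "accounts": ["Acme Corp"]}]

def aggregate(views, key="id"):
    """Join published views on a shared key; later views add attributes."""
    merged = {}
    for view in views:
        for record in view():
            merged.setdefault(record[key], {}).update(record)
    return merged

global_view = aggregate([hr_view, crm_view])
print(global_view["jdoe"])
```

Adding a third silo means writing one more view function and passing it to `aggregate`; nothing about the existing owners has to change.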
Question 2: Is it sound?
We just saw that by following a well-established technical pattern, we could publish, aggregate, and scale a huge amount of context and identity information. But scalability and ease of deployment are not guarantees of consistency or soundness. As you know, one of the main challenges when it comes to data integration is the fact that beyond matters of format or protocols, each data silo represents concepts, entities, and relationships differently. You could have homonymy (same words to designate different concepts), synonymy (different words to designate the same concepts), and everything in between.
The first and absolutely necessary step to support a real data service virtualization layer like RadiantOne is the translation of all local models to a “canonical common data model.” In fact, this is what differentiates us from any current “virtual directory” on the market. Unlike our competition, we try hard to understand your data by creating a model for it. I won’t belabor the point, but you can learn more here. Still, Tom makes a good point when he says: “A missing piece in the explanation of this framework is a strong semantic model (of the type AI researchers have developed) to act as the glue to bind the pieces.”
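To illustrate what translating local models into a canonical one can look like at its very simplest, here is a sketch that deals with one synonym and one homonym. It is an assumption-laden toy, not a description of how RadiantOne works: the `CANONICAL` table, the source names, and the field names are all hypothetical.

```python
# Hypothetical mapping from (source, local field) to a canonical attribute.
# Keying on the source as well as the field name handles homonyms.
CANONICAL = {
    ("hr", "surname"): "family_name",     # synonym of crm "last_name"
    ("crm", "last_name"): "family_name",
    ("hr", "title"): "job_title",         # homonym: "title" means two things
    ("crm", "title"): "honorific",
}

def to_canonical(source, record):
    """Rename a local record's fields; fail loudly on unmapped fields."""
    out = {}
    for field, value in record.items():
        try:
            out[CANONICAL[(source, field)]] = value
        except KeyError:
            raise ValueError(f"no canonical mapping for {source}.{field}") from None
    return out

print(to_canonical("hr", {"surname": "Doe", "title": "Controller"}))
print(to_canonical("crm", {"last_name": "Doe", "title": "Dr."}))
```

A flat lookup table like this captures naming differences only. It cannot tell you whether two mapped concepts contradict each other, which is exactly the gap Tom is pointing at.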
In the semantic web space of logic, ontologies, description logics, and computational linguistics, there is a whole “conceptual” revolution cooking. Even today, there are wonderful tools in academia and elsewhere designed to solve the category of problem Tom outlined in his comment: being sure that a new concept in the global model does not contradict your existing world, and being able to infer new consequences based on changes recorded in your data sources.
The problem with this semantic world is not the absence of tools; it is the absence of data. If all data were tagged as RDF and defined in OWL, then we’d have no trouble. The problem, of course, is jumpstarting the process. In many ways, the semantic world is in the same state as the original Internet, when academics began using protocols and tools such as FTP, csh, Gopher, etc. to exchange info. Over time, this sea of info started to build up, and when HTML and the web browser came along, suddenly the whole Internet was a tool for the rest of us.
I believe the same thing will happen with our structured data, using new tools based on data service virtualization (like RadiantOne!) to surface the data as readable contexts and sentences. While our transactions take up less than 10% of the volume, they represent at least 90% of the commercial value. So having searchable, contextual access to this data across silos will be a very big deal indeed.
But how do we get from here to there? As usual, first we crawl, then we walk, and then we run (or maybe even fly):
- Virtualize, so that you can publish, aggregate, and reach information in a uniform way.
- Then turn it into a format (RDF) where you can leverage semantic web tools, such as ontologies and inference engines, to create and maintain consistency. As I keep saying, don’t reinvent the wheel, steal it.
- And finally, converge toward a newer, better, and more consistent system.
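For a taste of what the second step buys you, here is a small sketch that turns a record into subject-predicate-object triples (the shape RDF uses) and applies a single inference rule until no new facts appear. It is a hand-rolled toy rather than a real RDF store or reasoner, and the vocabulary (`memberOf`, `subGroupOf`), the group names, and both functions are assumptions invented for the example.

```python
# Records as (subject, predicate, object) triples, the shape RDF uses.
# The vocabulary ("memberOf", "subGroupOf") is invented for this example.

def to_triples(subject, record):
    return {(subject, pred, obj) for pred, obj in record.items()}

def infer_membership(triples):
    """Apply one rule until nothing new appears:
    X memberOf G and G subGroupOf H  =>  X memberOf H."""
    facts = set(triples)
    while True:
        new = {
            (x, "memberOf", h)
            for (x, p1, g) in facts if p1 == "memberOf"
            for (g2, p2, h) in facts if p2 == "subGroupOf" and g2 == g
        } - facts
        if not new:
            return facts
        facts |= new

facts = to_triples("jdoe", {"memberOf": "finance-emea"})
facts |= {("finance-emea", "subGroupOf", "finance"),
          ("finance", "subGroupOf", "staff")}
closed = infer_membership(facts)
print(("jdoe", "memberOf", "staff") in closed)
```

Nobody recorded that jdoe belongs to "staff"; the fact falls out of the group hierarchy. Real ontology and inference tools do this with far richer rules, which is the argument for stealing them instead of rebuilding them.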
That will get you to a good brisk walk. As for running and flying, watch this space for more discussion.
And thanks again, Tom, for your insightful comment. I’d love to discuss these topics with you further. Get in touch with me using the email address on the right.
