Saturday, 6 April 2013

Connecting the Dots: Linked Data

Not so long ago, Tim Berners-Lee came up with the notion of a World Wide Web, revolutionizing the way people exchange information and documents. We now live in an economy where data plays a central role in decision making. As research groups and enterprises realize the implications of using large data sets, there is also a certain underlying frustration when it comes to acquiring quality data sets. The Internet has proved to be a phenomenal source of information. However, most of it is unstructured and scattered. Linked Data provides a publishing paradigm in which not only documents but also data can be first-class citizens of the Web, thereby extending the Web with a global data space based on open standards: the Web of Data.

Partial Snapshot of the Linked Data Graph
The difference between the current web and a web of data is best explained with an example. Nowadays a typical online store would consist of pages describing different products. Such a page contains all the information a human reader requires. But this information is represented in a way that makes automatic processing of it hard. In a web of data, the online store would actually publish data resources to the web, which represent the products. These resources can be retrieved in different representations, such that a browser would allow you to view a product page while a web crawler would get a machine understandable representation of the product.
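
One common way to serve both audiences from a single URI is HTTP content negotiation. Here is a minimal sketch of the idea, assuming the store supports it. The product URI is hypothetical, the choice of Turtle as the machine-readable format is my assumption, and the requests are only built, never sent:

```python
from urllib.request import Request

# Hypothetical product URI; example.org is a placeholder, not a real store.
PRODUCT_URI = "http://example.org/products/42"

def build_request(uri, want_data):
    """Build (but do not send) a GET request for a human or a machine client."""
    # A browser asks for HTML; a crawler asks for an RDF serialization.
    accept = "text/turtle" if want_data else "text/html"
    return Request(uri, headers={"Accept": accept})

browser_request = build_request(PRODUCT_URI, want_data=False)
crawler_request = build_request(PRODUCT_URI, want_data=True)
print(browser_request.get_header("Accept"))
print(crawler_request.get_header("Accept"))
```

Both requests name the same resource; only the Accept header differs, and that is what lets the server choose a representation.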

Now comes the part which makes it even more exciting for research. Consider a researcher developing a drug to treat Alzheimer’s. Suppose she wants to find the proteins which are involved in signal transduction AND are related to pyramidal neurons. (The question probably makes sense to her.)
Searching for this on Google returns about 2,240,000 results, not one of which leads to an answer. Why? Because no one has ever had that idea before! No single webpage on the web contains the result. Running the same query against the Linked healthcare database pulls in data from two distinct data sets and produces 32 hits, EACH of which is a protein which has those properties.
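
The linked query can answer this because it joins facts from two sources on a shared identifier. Below is a toy sketch of that join. The protein identifiers, predicate names and both data sets are invented for illustration and do not reflect the real healthcare data:

```python
# Two toy data sets; every identifier below is made up for illustration.
dataset_a = {
    ("protein:P1", "involvedIn", "signal transduction"),
    ("protein:P2", "involvedIn", "signal transduction"),
    ("protein:P3", "involvedIn", "lipid transport"),
}
dataset_b = {
    ("protein:P2", "relatedTo", "pyramidal neuron"),
    ("protein:P3", "relatedTo", "pyramidal neuron"),
}

def subjects(triples, predicate, obj):
    """Return the subjects of all triples matching (?, predicate, obj)."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}

# The "AND" in the question becomes a set intersection on the shared subject.
hits = (subjects(dataset_a, "involvedIn", "signal transduction")
        & subjects(dataset_b, "relatedTo", "pyramidal neuron"))
print(sorted(hits))
```

The join only works because both data sets use the same name for the same protein, which is exactly what shared URIs provide.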

The vision of Linked Data is the liberation of raw knowledge and making it (literally) accessible to the world. Indeed, a Web of Ideas.

Let us now take a deeper look at the principles involved and how they can be applied in the Complex Networks domain. Broadly, Linked Data rests on four basic principles:
  1. Use URIs as names for things. These may include tangible things such as people, places and cars, or those that are more abstract, such as the relationship type of knowing somebody, the set of all green cars in the world, or the color green itself.
  2. Use HTTP URIs, so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
  4. Include links to other URIs, so that they can discover more things. For example, a hyperlink of the type "friend of" may be set between two people, or a hyperlink of the type "based near" may be set between a person and a place.
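
To make the principles concrete, here is a small sketch of the two example links written as RDF triples and serialized as N-Triples. I am assuming the FOAF vocabulary for the link types (foaf:knows is the closest term to "friend of"). The people and the place are hypothetical example.org URIs:

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical people and place; only the FOAF property URIs are real.
alice = "http://example.org/people/alice"
bob = "http://example.org/people/bob"
berlin = "http://example.org/places/berlin"

triples = [
    (alice, FOAF + "knows", bob),          # a "friend of" style link
    (alice, FOAF + "based_near", berlin),  # a "based near" link
]

def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

print(to_ntriples(triples))
```

Every name in a statement, including the link type itself, is an HTTP URI that can be looked up.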
The power here lies in the links. There are three important types of RDF links:
  • Relationship Links: point at related things in other data sources
  • Identity Links: point at URI aliases used by other data sources to identify the same real-world object or abstract concept
  • Vocabulary Links: point from data to the definitions of the vocabulary terms that are used to represent the data
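
The three link types can be told apart roughly by looking at the predicate and at where the link points. The sketch below is a heuristic of my own, not a standard algorithm. It assumes owl:sameAs marks identity links and rdf:type marks vocabulary links, and it treats any other link that leaves the source host as a relationship link:

```python
from urllib.parse import urlparse

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def link_type(subject, predicate, obj):
    """Roughly classify a triple whose object is a URI (heuristic only)."""
    if predicate == OWL_SAME_AS:
        return "identity"
    if predicate == RDF_TYPE:
        return "vocabulary"
    # Anything else counts as a relationship link if it leaves the source host.
    if urlparse(subject).netloc != urlparse(obj).netloc:
        return "relationship"
    return "internal"

# Hypothetical URIs on two placeholder hosts.
alice = "http://example.org/people/alice"
print(link_type(alice, OWL_SAME_AS, "http://example.com/id/alice"))
print(link_type(alice, RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"))
print(link_type(alice, "http://xmlns.com/foaf/0.1/based_near",
                "http://example.com/places/berlin"))
```

A real data set uses more vocabulary-level predicates than rdf:type, so treat the predicate checks as a starting point.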
Having a good idea of the basic concepts, we are now ready to see how we can harness Linked Data for research involving Complex Networks.

As I see it, there are two interesting areas here:

As described in this old article, Linked Data has been growing at an exponential rate since its modest beginnings back in 2007. Over the years, different data sets have started linking to the global database, and genres have clearly emerged. The Linked Data graph may offer a good opportunity to study the temporal behavior of the largest structured knowledge database the world has ever witnessed.

Then again, as mentioned before, the data itself has huge potential for network research. A good question therefore is how to get our hands on this data.

One approach is to use existing systems to get data. Linked Data is accessible through browsers like the Disco Hyperdata browser, the Tabulator browser or the more recent LinkSailor. We can also make use of search engines such as Falcons and SWSE.

For example, research on Citation Analysis may be supplemented with data available on authors (example) or subject areas (example). Linguistic research stands to gain from the relationships relating words and objects with actions, events, facts, etc. (example, example).

As with all generic applications, however, the problem is that deployed applications offer little control over how the data is accessed. For specific research, therefore, it is best to develop applications and crawlers from scratch. This website would be a good place to start. It explains RDF and the architecture of Linked Data applications in detail.
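
As a rough idea of what a crawler written from scratch might look like, here is a minimal breadth-first sketch. It follows links from a seed URI up to a fixed budget. The function names and the in-memory toy_web are hypothetical; in a real crawler, fetch_links would dereference the URI over HTTP and extract the URIs from the returned RDF:

```python
from collections import deque

def crawl(seed, fetch_links, max_resources=100):
    """Breadth-first traversal: visit a URI, then the URIs it links to."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_resources:
        uri = queue.popleft()
        order.append(uri)
        for target in fetch_links(uri):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Stand-in for real HTTP lookups: a hypothetical in-memory link graph.
toy_web = {
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/c"],
    "http://example.org/c": ["http://example.org/a"],
}
visited = crawl("http://example.org/a", lambda uri: toy_web.get(uri, []))
print(visited)
```

Passing the fetch step in as a function keeps the traversal logic separate from networking, so the same loop can be tried on toy data first and pointed at live data later.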

To conclude, Linked Data provides a more generic, more flexible publishing paradigm which makes it easier for data consumers to discover and integrate data from a large number of data sources. Though still in its infancy, it has come a long way since its inception. Coupled with the onset of an Internet of Things, Linked Datasets will encompass the physical world, identifying social relationships, behavioral patterns and events. What we do with this information is entirely up to us.


