In order to study some complex networks like the Internet, the Web, social networks or biological networks, one first has to explore them. So this blog gives a partial and biased view of the real object, which is generally assumed to be representative of the whole.
Modern Physics 74, 47, 2002.
Nature, 406:378–382, 2000.
Using the example of the Internet and a rough model of its exploration process, we show that the way a given complex network is explored may strongly influence the observed properties. We therefore insist on the necessity of developing a theory of complex network metrology. Its aim would be to study how the partial and biased view of a network relates to the properties of the whole network. Our global approach is the following: we consider a (known) network G, we simulate an exploration of this network to obtain a view G’ of it, and then we compare the two objects. The final aim is to deduce properties of G from properties of G’’.
We want to simulate an exploration process. In order to do this, we first need a network to explore. There are several natural choices for this. We will choose the most simple and well known model of random networks to generate the topology to explore: the Erdös and Rényi random graph model. This model has two parameters: the number of nodes, n, and the probability of existence of any link, p. A network is then generated by considering that each possible pair of nodes is linked with probability p.
This gives an expected number of links m = p .n.(n-1)/2.
The traceroute tool gives the path followed by messages from a source to a destination. Up to now, very little is known on the properties of such paths. For instance, one may suppose that the aim of network protocols is to deliver information efficiently, and so that the paths they follow are shortest paths (paths of minimal length). It is however known that this is not always the case, but no precise information is currently available on how they differ from shortest paths. Moreover, there exist in general many shortest paths for a given pair of computers, and there is no apriori reason for traceroute to give one of them rather than another. Finally, the paths change during time but again very few is known on their dynamics.In the current state of our knowledge, designing a realistic model of traceroute is therefore impossible. The assumption usually made is that traceroute always gives a shortest path, which will actually be sufficient for our current aim. We will also consider that, during the exploration process, one may use traceroute many times, which lead to the discovery of all the shortest paths between given sources and destinations.We have a model to generate the network to explore, and some models for the traceroute tool. We now need a model for the exploration process itself. As already noticed, we will suppose that it only relies on traceroute. But this is not sufficient: we must say how we will choose sources and destinations, and how many of them we will consider. Our aim being to show that the exploration method may influence the obtained view of the actual network, we will consider several realistic models of the exploration.
All the values we plot are averaged over 1000 instances. The variance is in general negligible (we plotted it in the case of Figure 2). The shortest path computations are done using breadth first search.
Let us denote by Gu(x) the view of G obtained from a given source if we consider x random destinations, with the usp model for traceroute. Let nu(x) be the number of nodes of this view, and mu(x) its number of links. Similarly, we introduce Ga(x), na(x) and ma(x) the results obtained with the asp model for traceroute. The plots of these functions, Figure 1, show how much of the network we obtain, both in terms of nodes and links, as a function of the number of destinations. At various points, these plots fit well the intuition. First, when we consider very few destinations, we obtain a very small part of the network. Then, if the number of destination grows, we see more and more. Finally, we of course see all the nodes when we consider each of them as a destination.
There are however a few remarkable facts. Both nu(x) and na(x) grow rapidly and reach a critical point where they start a linear growth, but the initial growth of na(x) is much more rapid than the one of nu(x). On the contrary, mu(x) and ma(x) grows linearly from the beginning, but the maximal values they reach, mu(n) and ma(n), remain surprisingly low. It means that the exploration misses many links, even if we consider all
the possible destinations, which indicates that the obtained view is very incomplete. This is even more surprising when we consider the optimistic case where all the shortest paths are discovered, and all the nodes are used as destinations.
These behaviors are similar for any values of n and p (the plots presented in Figure 1 always have the same shape). However, the maximal value reached by mu(x) and ma(x), i.e. the maximal proportion of discovered links, varies with the probability p of existence of any link. To know how p influences these values, let us study the proportion of links discovered using one source and all the possible destinations, as a function of p. They are plotted in Figure 2 for the two models of traceroute we consider.
The two plots have some properties in common which can be easily explained. First notice that below a certain value of p, the network is not connected (it is composed of many independent parts). Therefore, below this threshold, any exploration using a small number of sources will give a very small part of the whole. When the network becomes connected, it is almost a tree, in which there is a unique path from the source
to each node. Therefore, the two exploration methods we consider discover almost all the links, which corresponds to the maximal values reached by the plots in Figure 2. On the opposite, when p is almost 1, then almost every possible link exists, and so almost every node is at distance 1 from the source. Therefore, the obtained view, both with the usp and with the asp model, is almost a star. It therefore contains almost
n -1 links, which, compared to the total number of links, almost n·(n−1)/2 , is negligible. The plot for the usp model is easy to understand. Indeed, the exploration using this model gives a tree (it has no cycle), and therefore it contains exactly n - 1 links if p is above log(n)/n.
Since, in this case the network is almost surely connected. The expected total number of links being itself
m = p ·n·(n−1)/2 , the ratio between the number of links discovered during the exploration and the total number of links is then n – 1 / m = 2 / p · n. When p grows, this ratio decays as 1/p, which is confirmed by the simulation. On the contrary, the irregular shape of the plot for the asp model is very surprising: it has many peaks and valleys of high amplitude, which have no obvious interpretation.
This is so surprising that we will name it the camel plot. There is however a natural explanation of this shape, which comes from specific properties of the exploration.
Until now, we have restricted ourselves to explorations using only one source. However, in practical cases, one generally uses several, but few, sources. We investigate here how this may influence the quality of the view we obtain. Again, we only concentrate on the ratio of the total number of discovered links, which previous remarks have shown to be essential.
Figure 4 shows the evolution of this ratio when the number of sources is increasing. Let us first consider the two topmost plots, which correspond to the cases where we use all the possible destinations. As expected, the quality of the view grows rapidly with the number of sources, and one may even be surprised by the rapidity of this growth. Despite our model of Internet exploration is very rough; one may consider this plot as good news since it indicates that one does not need many sources to obtain accurate views of the network. This is important since it is very difficult (and never done) to use many sources in practice.
In this communication, we considered the simplest possible question concerning the quality of a network view obtained by an exploration of a real network: the amount of the total number of nodes and links we obtain. Making natural variations on the way we model the Internet exploration, we show that this amount varies a lot and is very difficult to estimate.
- R. Albert and A.-L. Barab´asi. Emergence of scaling in random networks. Science, 286:509–
- R. Albert and A.-L. Barab´asi. Statistical mechanics of complex networks. Reviews of
Modern Physics 74, 47, 2002.
- R. Albert, H. Jeong, and A.-L. Barab´asi. Diameter of the world wide web. Nature, 401:130–
- R. Albert, H. Jeong, and A.-L. Barab´asi. Error and attack tolerance in complex networks.
Nature, 406:378–382, 2000.