In order to study some complex networks like the Internet, the Web, social networks or biological networks, one first has to explore them. So this blog gives a partial and biased view of the real object, which is generally assumed to be representative of the whole.
References
512, 1999.
Modern Physics 74, 47, 2002.
131, 1999.
Nature, 406:378–382, 2000.
Using the example of the Internet and a rough model of its exploration process, we show that the way a given complex network is explored may strongly influence the observed properties. We therefore insist on the
necessity of developing a theory of complex network metrology. Its aim would be
to study how the partial and biased view of a network relates to the properties
of the whole network. Our global approach is the following: we consider a
(known) network G, we simulate an exploration of this network to obtain a view
G’ of it, and then we compare the two objects. The final aim is to deduce properties
of G from properties of G’’.
Modelling
We want to simulate an
exploration process. In order to do this, we first need a network to explore.
There are several natural choices for this. We will choose the most simple and well
known model of random networks to generate the topology to explore: the Erdös
and Rényi random graph model. This model has two parameters: the number of
nodes, n, and the probability of existence of any link, p. A network is then
generated by considering that each possible pair of nodes is linked with
probability p.
This gives an expected number of links m = p .n.(n-1)/2.
The traceroute tool gives the
path followed by messages from a source to a destination. Up to now, very
little is known on the properties of such paths. For instance, one may suppose
that the aim of network protocols is to deliver information efficiently, and so
that the paths they follow are shortest paths (paths of minimal length). It is
however known that this is not always the case, but no precise information is
currently available on how they differ from shortest paths. Moreover, there
exist in general many shortest paths for a given pair of computers, and there
is no apriori reason for traceroute to give one of them rather than another.
Finally, the paths change during time but again very few is known on their
dynamics.In the current state of our knowledge, designing a realistic model of
traceroute is therefore impossible. The assumption usually made is that
traceroute always gives a shortest path, which will actually be sufficient for
our current aim. We will also consider that, during the exploration process,
one may use traceroute many times, which lead to the discovery of all the
shortest paths between given sources and destinations.We have a model to
generate the network to explore, and some models for the traceroute tool. We
now need a model for the exploration process itself. As already noticed, we will
suppose that it only relies on traceroute. But this is not sufficient: we must
say how we will choose sources and destinations, and how many of them we
will consider. Our aim being to show that the exploration method may influence
the obtained view of the actual network, we will consider several realistic
models of the exploration.
All the values we plot are
averaged over 1000 instances. The variance is in general negligible (we
plotted it in the case of Figure 2). The shortest path computations are done
using breadth first search.
Unique source
Let us denote by Gu(x) the view of G obtained from a given source if we
consider x random destinations, with the usp model for traceroute. Let nu(x) be the
number of nodes of this view, and mu(x) its number of links. Similarly, we introduce Ga(x), na(x) and ma(x) the results
obtained with the asp model for traceroute. The plots of these functions, Figure 1, show how much of the network
we obtain, both in terms of nodes and links, as a function of the number of
destinations. At various points, these plots fit well the intuition. First,
when we consider very few destinations, we obtain a very small part of the
network. Then, if the number of destination grows, we see more and more.
Finally, we of course see all the nodes when we consider each of them as a destination.
There are however a few remarkable facts. Both nu(x) and na(x) grow rapidly and reach a critical point where they
start a linear growth, but the initial growth of na(x) is much more rapid than the one of nu(x). On the
contrary, mu(x) and ma(x) grows linearly
from the beginning, but the maximal values they reach, mu(n) and ma(n), remain
surprisingly low. It means that the exploration misses many links, even if we
consider all
the possible destinations, which indicates that the obtained view is very
incomplete. This is even more surprising when we consider the optimistic case
where all the shortest paths are discovered, and all the nodes are used as
destinations.
These behaviors are similar for any values of n and p (the plots
presented in Figure 1 always have the same shape). However, the maximal value
reached by mu(x) and ma(x), i.e. the
maximal proportion of discovered links, varies with the probability p of
existence of any link. To know how p influences these values, let us study the
proportion of links discovered using one source and all the possible
destinations, as a function of p. They are plotted in Figure 2 for the two
models of traceroute we consider.
The two plots have some properties in common which can be easily
explained. First notice that below a certain value of p, the network is not
connected (it is composed of many independent parts). Therefore, below this
threshold, any exploration using a small number of sources will give a very
small part of the whole. When the network becomes connected, it is almost a
tree, in which there is a unique path from the source
to each node. Therefore, the two exploration methods we consider
discover almost all the links, which corresponds to the maximal values reached
by the plots in Figure 2. On the opposite, when p is almost 1, then almost
every possible link exists, and so almost every node is at distance 1 from the
source. Therefore, the obtained view, both with the usp and with the asp model,
is almost a star. It therefore contains almost
n -1 links, which, compared to
the total number of links, almost n·(n−1)/2 , is negligible.
The plot for the usp model is easy to understand. Indeed, the exploration using
this model gives a tree (it has no cycle), and therefore it contains exactly n - 1 links if p is above log(n)/n.
Since, in this case the network is almost surely connected. The expected
total number of links being itself
m = p
·n·(n−1)/2 , the ratio
between the number of links discovered during the exploration and the total
number of links is then n – 1 / m = 2 /
p · n. When p grows, this ratio decays as 1/p, which is confirmed by the simulation. On the contrary,
the irregular shape of the plot for the asp model is very surprising: it has
many peaks and valleys of high amplitude, which have no obvious interpretation.
This is so surprising that we will name it the camel plot. There is
however a natural explanation of this shape, which comes from specific
properties of the exploration.
Several sources
Until now, we have restricted ourselves to explorations using only one
source. However, in practical cases, one generally uses several, but few,
sources. We investigate here how this may influence the quality of the view we obtain.
Again, we only concentrate on the ratio of the total number of discovered
links, which previous remarks have shown to be essential.
Figure 4 shows the evolution of this ratio when the number of sources is
increasing. Let us first consider the two topmost plots, which correspond to
the cases where we use all the possible destinations. As expected, the quality
of the view grows rapidly with the number of sources, and one may even be
surprised by the rapidity of this growth. Despite our model of Internet
exploration is very rough; one may consider this plot as good news since it
indicates that one does not need many sources to obtain accurate views of the
network. This is important since it is very difficult (and never done) to use
many sources in practice.
Conclusion
In this communication, we considered the simplest
possible question concerning the quality of a network view obtained by an
exploration of a real network: the amount of the total number of nodes and
links we obtain. Making natural variations on the way we model the Internet
exploration, we show that this amount varies a lot and is very difficult to estimate.
- R. Albert and A.-L. Barab´asi. Emergence of scaling in random networks. Science, 286:509–
512, 1999.
- R. Albert and A.-L. Barab´asi. Statistical mechanics of complex networks. Reviews of
Modern Physics 74, 47, 2002.
- R. Albert, H. Jeong, and A.-L. Barab´asi. Diameter of the world wide web. Nature, 401:130–
131, 1999.
- R. Albert, H. Jeong, and A.-L. Barab´asi. Error and attack tolerance in complex networks.
Nature, 406:378–382, 2000.
No comments:
Post a Comment