Sunday, 16 June 2013

There's no such thing as harmless data

After the leak of the spying project PRISM, one of the many lame excuses given to justify such an immoral thing is that the data being collected is not the content of messages, just the sources, destinations and times.

As I have been working with machine learning for quite some time, I feel compelled to explain that even with this apparently harmless amount of data we can discover a lot about someone.

It is commonplace that every site where you buy things on the internet gives you suggestions based on items you have bought before. The more you buy, the more accurately the suggestions seem to agree with your taste. In this case, however, you think this is reasonable, given that you are actually giving them what seems to be the appropriate information to infer just that.

What you don't really know, and probably don't really care about, is how this is done. There are many ways, in fact, but let me explain one of them: the one that will make you understand how powerful these methods can be.

In fact, I will explain the one I worked on two years ago in collaboration with Joerg Reichardt, a colleague from Germany who is a specialist in complex networks. Networks are simply formed by objects that are connected by something. Graphically, the objects are represented by dots and, when they have a connection, we draw a line between them. The graph below is an example taken from Wikipedia:

For instance, the dots, or in the above case the circles, can be computers and the lines a physical connection between them. The interesting thing is that the concept of connection can be generalised to any kind of relationship between the objects. One example could be a graph where the dots, also called vertices or nodes, are any kind of real objects and the lines, technically called edges, connect any two objects that share some colour.

A very interesting kind of network, from the commercial point of view, is the one relating consumers to films on sites like Lovefilm or Netflix. This is a special case of a network which is represented by a bipartite graph. By that, we mean a graph with two different kinds of nodes. In this case, we will have nodes which are films and nodes which are consumers. But to be considered bipartite, a graph like that must obey one more condition: nodes never connect to other nodes of the same kind. For our example, that means that we always connect consumers to films, but never consumers to consumers or films to films.

The way connections are made is now very clear. If a consumer watches a film, a connection is established. In the end, we have a network like this:

A bit of advertisement here. This was taken from our work, The Interplay between Microscopic and Mesoscopic Structures in Complex Networks, which can also be found on my website. I will explain the work in a bit, but let me now continue with the graph above. You can clearly see now that the name "bipartite" is really appropriate. Consider that the green lozenges are consumers and the blue circles are films. The red line means that, at least once, consumer number 4 watched film g.

This is a very compact way of visualising this kind of "who watched what" information. Of course the website can do even better and collect other data, like the rating each consumer gives to each film, but we can already do a lot without it.
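To make the idea concrete, here is a minimal sketch in Python of how such "who watched what" data could be stored. The viewing records are invented, and the helper names build_bipartite and is_bipartite are hypothetical, chosen only for this illustration. Consumers are numbers and films are letters, loosely mirroring the figure.

```python
from collections import defaultdict

# Hypothetical viewing records: (consumer, film). The data is invented.
views = [(1, "a"), (1, "b"), (2, "b"), (3, "c"), (4, "g"), (4, "c")]


def build_bipartite(pairs):
    """Return two adjacency maps: consumer -> films and film -> consumers."""
    films_of = defaultdict(set)
    consumers_of = defaultdict(set)
    for consumer, film in pairs:
        films_of[consumer].add(film)
        consumers_of[film].add(consumer)
    return dict(films_of), dict(consumers_of)


def is_bipartite(pairs):
    """True if no node appears on both sides, so no edge joins two nodes of the same kind."""
    consumers = {c for c, _ in pairs}
    films = {f for _, f in pairs}
    return consumers.isdisjoint(films)


films_of, consumers_of = build_bipartite(views)
print(films_of[4])         # films watched by consumer 4
print(consumers_of["c"])   # consumers who watched film c
print(is_bipartite(views))
```

Nothing but the list of pairs is needed: no ratings, no reviews, no content.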

The next step is now to use it to give recommendations. The cool way in which you can do that is the following. You might think that if someone watches a horror film, then we can recommend another horror film to that person, but that is too simplistic. What if the person watched 10 romantic films and only one horror film? We'd do better by suggesting another romantic film, right? Things would be so easy if people were easy to read like that...

What usually happens is that people watch dozens, sometimes hundreds, of films, and most of them watch films from a huge number of different categories. Consider that this happens with thousands or millions of consumers and you can understand why we call it a complex network.

But think about the kinds of people you know. Geeks usually watch a lot of science fiction, adventure and even horror, but far fewer romantic films. Non-geeks usually watch some science fiction, but would watch many more adventure films, for instance. The trick we want to perform is, by looking at our bipartite graph, to identify something like that: communities of people with similar interests. If we succeed, our recommendations will surely be more appropriate. This is called community detection or clustering in graphs. It's not very easy to see in bipartite graphs, but it's very clear when you look at graphs like the one below:

This graph was taken from the paper Mixture models and exploratory analysis in networks by Newman and Leicht and represents a network of friendships between members of a karate club which split in two because of internal disputes. You can see that the members are separated into two clusters of nodes. Connections between members are much more frequent inside the clusters than between them. Finding these clusters might be easy for relatively small networks, but is tricky for large ones. The interesting thing is that the authors of the above paper found an algorithm that, given the nodes and edges only, could find the split almost perfectly.
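To give a flavour of how a split can be found from nodes and edges alone, here is a small Python sketch. Note that it is not the mixture model of Newman and Leicht: it swaps in a much simpler technique, greedy modularity maximisation, where communities are merged as long as the merge increases the fraction of edges inside communities beyond what random wiring would give. The toy network (two groups of four friends joined by a single friendship) and the function name greedy_communities are invented for the illustration.

```python
from itertools import combinations

# Toy network (invented): everyone knows everyone inside each group,
# and a single friendship (3, 4) bridges the two groups.
group_a, group_b = [0, 1, 2, 3], [4, 5, 6, 7]
edges = list(combinations(group_a, 2)) + list(combinations(group_b, 2)) + [(3, 4)]


def greedy_communities(edges):
    """Merge communities greedily while a merge still increases modularity."""
    m = len(edges)
    nodes = sorted({n for e in edges for n in e})
    community = {n: n for n in nodes}   # node -> community label
    degree = {n: 0 for n in nodes}      # community label -> total degree
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    while True:
        # Count the edges running between each pair of distinct communities.
        between = {}
        for u, v in edges:
            cu, cv = community[u], community[v]
            if cu != cv:
                key = (min(cu, cv), max(cu, cv))
                between[key] = between.get(key, 0) + 1
        # Modularity gain of merging communities i and j:
        #   e_ij / m - d_i * d_j / (2 * m**2)
        best_gain, best_pair = 0.0, None
        for (i, j), e_ij in sorted(between.items()):
            gain = e_ij / m - degree[i] * degree[j] / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            break  # no merge improves modularity any more
        i, j = best_pair
        for n in nodes:
            if community[n] == j:
                community[n] = i
        degree[i] += degree.pop(j)

    groups = {}
    for n in nodes:
        groups.setdefault(community[n], []).append(n)
    return sorted(groups.values())


print(greedy_communities(edges))
```

On this toy network the procedure stops with the two groups of friends separated, even though it was never told who belongs where.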

Now you should start to be scared. Imagine that someone constructs this kind of network by connecting two people if they speak through their mobile phones for, let's say, more than ten minutes. Using the above algorithm, this person can classify people into groups. If this person has some information about the interests inside each group, then a better classification can be achieved. Imagine now that this classification can be geeks, businessmen, religious people or... terrorists.

Let's come back to our films, because now you can understand how that information can allow for a finer classification than the karate club one above. By using the algorithm that my colleagues and I developed, for instance, you can cluster people into groups according to their interests in films, books, products and so on. You don't need access to the text of their messages, nor their ratings, and even less their reviews. The simple connection information is enough.

The amazing thing about the bipartite example is that people never connect directly with one another, only indirectly through the products. And the algorithm doesn't need any prior information about what profile of people buys which product: it can cluster both the people and the products into groups at the same time!
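A rough Python sketch of how these indirect connections can be made explicit is given below. It is only the simplest possible step (weighting pairs of people by the number of products they share), not the algorithm from our paper, which clusters people and products simultaneously. The purchase records and the function name shared_products are invented for the illustration.

```python
from collections import defaultdict
from itertools import combinations

# Invented purchase records: (person, product). No person-to-person links exist.
purchases = [
    ("ann", "sci-fi box set"), ("ann", "soldering iron"),
    ("bob", "sci-fi box set"), ("bob", "soldering iron"),
    ("cat", "romance novel"), ("cat", "scented candle"),
    ("dan", "romance novel"), ("dan", "scented candle"), ("dan", "sci-fi box set"),
]


def shared_products(pairs):
    """Weight each pair of people by the number of products they have in common."""
    buyers_of = defaultdict(set)
    for person, product in pairs:
        buyers_of[product].add(person)
    weight = defaultdict(int)
    for buyers in buyers_of.values():
        for p, q in combinations(sorted(buyers), 2):
            weight[(p, q)] += 1
    return dict(weight)


weights = shared_products(purchases)
# Strongest indirect links first.
for pair, w in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
    print(pair, w)
```

Even in this tiny example the strongest links pair up ann with bob and cat with dan, although none of them ever interacted with each other.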

If you're a professional in the area and look at the papers above, and also at others in the field, you will see that the results are very good. The amount of information extracted can be impressive. And that was obtained simply from the information contained in the connections of the graph! Of course the algorithms are not perfect, which is a terrible thing actually, because bureaucrats won't care.

The bottom line is that every private piece of information about you that leaks from websites or mobiles can reveal much more about you than you can imagine. The paper I published appeared in 2011. These things evolve very fast and, if you have enough computing power, you can do amazing things.

So, don't believe the agencies when they say that the data they are collecting is not very informative. First, it is. Second, if it weren't, either they would not be collecting it, or they would be idiots wasting a lot of money and time. And, believe me, they are not.

Monday, 10 June 2013

Western Democracies through a PRISM

The document published by The Guardian, leaked by ex-NSA employee Edward Snowden, is only shocking for those who have been brainwashed enough to think that governments, especially the western democracies modeled after the USA, are moral entities that protect the rights of their citizens above everything. I'm not saying that they're not law-abiding entities, not least because they have enough expertise to circumvent any imaginable law. I'm saying that they should not be blindly trusted about anything. And for those who think I'm saying something new, this is a quote by Thomas Jefferson:
"Eternal vigilance is the price of liberty."
The above quote, ironically by the 3rd president of the USA, first appeared around 1810. Since then, it seems, a lot of things have changed.

The PRISM project, which aims to spy on any citizen through mobile phones and the internet without any a priori reason to suspect them, is just another piece of evidence that the objective of any government in the world is to retain power. Controlling what people think, or in this case what they are interested in or talking about, is just one way of guaranteeing that everybody has the correct "mindset" for this to happen.

As happened with Julian Assange, and in a different context with Aaron Swartz, the US government will do everything it can to get its hands on Snowden and try to make an example of him. Surveillance of digital communications is a way to make sure that the example will stick.

Fortunately, we still have newspapers like The Guardian. Not that I trust them blindly either, but they have enough power not to be censored, and that's something. So much so that neither the USA nor the UK, which is also involved, could deny the accusations, going into "defense mode" instead. As they cannot deny their deeds, they try to justify them on the basis that they are fighting terrorism, with the word "terrorism" meaning whatever they need it to mean to justify their acts. Are you skeptical? Then read this piece of dialogue, which was also published in The Guardian. Pay attention to the answer William Hague, the British foreign secretary, gave to the question of an MP:
Angus MacNeil, the SNP MP, asks if "within the law" always means the same as moral. 
Hague says "within the law" means for the purposes set out, such as preventing terrorism.
I'm not impressed that Hague said that, but I would be very impressed if people failed to read all the implications of this answer. First of all, it implies the old "the ends justify the means", because Hague is saying that they are allowed to do whatever they want to fulfill an objective and morality plays no role in that. Another thing which is clear is that Hague is saying that the law is whatever they want to do. I'm pretty sure that the last time I studied democracy, the word "accountability" was part of it.

It's laughable that these same democracies constantly criticize authoritarian regimes, China mainly, for trying to do exactly the same thing. It's also interesting that Barack Obama was awarded the Nobel Peace Prize some years ago.

There is no new lesson here. It's the same old lesson that we all have the duty to pass on to the next generations. The lesson of eternal vigilance. The lesson that governments, rulers, are interested in power and power relies on control. Control relies on limiting freedom, spreading fear and censoring. Once this sets in, governments can take whatever decisions they want. And if you really believe that governments will only do what is good for you, seriously, you must be really stupid. Sorry, but I can't find a better word.

Friday, 7 June 2013

Digital Universe

This is a project by the American Museum of Natural History. In their own words:
"The Digital Universe, developed by the American Museum of Natural History’s Hayden Planetarium, incorporates data from dozens of organizations worldwide to create the most complete and accurate 3-D atlas of the Universe from the local solar neighborhood out to the edge of the observable Universe."

Thursday, 6 June 2013

The Universe is fine, thanks.

I am considered neither a high-profile theoretical physicist nor a leading figure in my area, but I do consider myself a theoretical physicist with enough knowledge to weigh in with some rational thoughts, even against the strongest arguments from authority.

You have no idea why I am moaning like that, so let me explain the reason for my indignation. I have just read the following article from Scientific American, which I consider to be a rather good popular science magazine:

It's true that the article is originally from the Simons Foundation, but SA published it anyway.

Well, now I am going to say what pisses me off about this article. A fair summary would be that it is nothing more than a desperate and kind of arrogant attempt to justify something that should not be a worry at all in science: our own ignorance.

The article starts with a quote from Nima Arkani-Hamed, a high-profile theoretical physicist (have I said already that I hate authority arguments?). At a conference he said that the "universe is impossible". In support, some other high-profile physicists (!) give their statements as well.

Their argument goes like this. All accessible experiments to date confirm the Standard Model in its pre-string/supersymmetry form with reasonable accuracy. There are, however, a lot of unexplained things. Because the only explanations we could think of do not work, we must assume that the only solution is the multiverse hypothesis, where every kind of universe exists and we happen to be in one of them.

I'm being unfair, of course. The arguments are more complicated than that, but the essence is the same.

For instance, they all make a big deal about what we usually call "naturalness". Naturalness is not a rigid principle of nature, it's more like a hope. A hope that quantities appearing in Nature are not too strange for OUR taste. I guess you are all smart enough to recognise that the catch here is the fact that we are judging how Nature should behave by our standards of symmetry, which in the end is what it boils down to.

Then comes a series of questions that are still unanswered. For some reason, the argument again is that if we could not find a better solution, then the solution is a multiverse. That sounds like desperation to me. Can't we accept that maybe we still don't have enough data to understand the problems? Can't it be that we are missing something? That some of our ideas have problems and must be replaced? Of course it can.

There is something even crazier happening. Many versions of the multiverse idea are unfalsifiable. I have said this many times and I will say it again. One unfalsifiable answer is as good as any other unfalsifiable answer. Be it the multiverse, god or the Matrix.

Even those versions which are marginally falsifiable, if there is such a thing, are simply jokers, wild cards that can stand for anything. Once you postulate that every kind of universe exists, all problems of why the universe is the way it is are solved. Here enters the probable human explanation for why this idea might be becoming so attractive to those who spent so much time trying to find a solution but didn't. If anything goes, it's not their fault that they haven't found one.

Like naturalness, the claims rely on even more concepts which are at best disputable. Take the idea that we live in an extremely unlikely universe because very few variations would support life. That is not true. If I put aside the fact that we have no real agreement about what we want to call "life", it would be fair enough to say that a universe in which anything that looks like a computer program could run would be able to support life. I can imagine an infinitude of variations of physics that keep the mathematics necessary for this to happen intact.

Of course the hypothesis can be true. Solipsism can be true as well. I can be the only thing that exists in the universe. Or maybe you. Shall I say that the current problems of physics support the solipsism idea? It surely can explain physics and a lot more...

Honestly, I didn't like the article at all. A similar thing happened about 110 years ago, although it was on the other side of the spectrum. Around 1900, Lord Kelvin, a high-profile physicist whom we all know, said that we had the explanation for everything and that all that remained were some more precise measurements. There were only two insignificant problems to solve. As you know, their solution only reaffirmed the fact that Nature abhors arguments from authority.

More than a century later, we know enough science and philosophy not to fall into the same kind of trap again. Still, humans have a hard time admitting failure even when it's definitely not their fault.

Meanwhile, the "impossible" universe goes on, apparently unworried by all the inconsistencies in our descriptions of it. Thank you.