Big Data: “The Great Disk Drive in the Sky….”

January 27th, 2012

Big Data

Ars Technica just published a story on how companies like Google, Amazon and Facebook deal with the petabytes of data they produce and consume. It’s a bit long, but a must-read for anyone interested in big data and the future of storage.

The insights into their decision processes and implementation details suggest that approaches to big data storage will keep diverging before they converge (if they ever do). It really gives you an impression of how big “big” is in Big Data!

Is Twitter Evil?

January 27th, 2012

Twitter is turning Evil?

You can’t service all of humanity if you allow the needs of politics to triumph over the needs of the people. And if you can’t service all of humanity, what is your relevance?

via Forbes – Twitter Commits Social Suicide

I don’t think that Twitter will fade out because, like Facebook, Twitter is too big to fail now. The problem is that these kinds of measures (some might call them features) are now part of the process rather than some kind of CIA tapping on the pipes. By automating self-censorship, Twitter might be stirring up a wasps’ nest. That’s sad.

Finding communities in networks with R and igraph

January 21st, 2012

Finding communities in networks is a common task in the study of complex systems. Doing it in R is easy, and there are several ways to partition a graph into communities using very different packages. The one I’m talking about here is igraph.

igraph is a lovely library for working with graphs. 95% of what you’ll need is available in igraph, with the advantage that the core library is written in C and is therefore fast as hell.

OK, now for the list of algorithms that you might be interested in:

walktrap.community

This algorithm finds densely connected subgraphs by performing random walks. The idea is that random walks will tend to stay inside communities instead of jumping to other communities.

Pascal Pons, Matthieu Latapy: Computing communities in large networks using random walks, http://arxiv.org/abs/physics/0512106
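
A minimal sketch of the random-walk idea, assuming a recent igraph release (1.0 or later), where walktrap.community is also exposed as cluster_walktrap. The toy graph (two 5-node cliques joined by a single edge) is made up for illustration and does not come from the paper.

```r
library(igraph)

# Hypothetical toy graph: two 5-node cliques joined by a single bridge edge
g_toy <- make_full_graph(5) %du% make_full_graph(5)
g_toy <- add_edges(g_toy, c(1, 6))

# Random walks of 4 steps (igraph's default walk length)
wc <- cluster_walktrap(g_toy, steps = 4)

membership(wc)  # community id of each vertex
sizes(wc)       # number of vertices per community
```

Short walks that start inside a clique rarely cross the single bridge, which is why the two cliques should end up in separate communities.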

edge.betweenness.community

This algorithm is the Girvan-Newman algorithm. Basically it is a divisive algorithm where at each step the edge with the highest betweenness is removed from the graph. For each division you can compute the modularity of the graph and then choose to cut the dendrogram where the process gives you the highest value of modularity.

M Newman and M Girvan: Finding and evaluating community structure in networks, Physical Review E 69, 026113 (2004)
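
To see a single divisive step in isolation, here is a small sketch, assuming a recent igraph release (1.0 or later) and a made-up toy graph of two cliques joined by one bridge. It only shows the first removal, not the full algorithm.

```r
library(igraph)

# Hypothetical toy graph: two 5-node cliques joined by a single bridge edge
g_toy <- make_full_graph(5) %du% make_full_graph(5)
g_toy <- add_edges(g_toy, c(1, 6))

# Betweenness of every edge; the bridge carries all paths between the cliques
eb <- edge_betweenness(g_toy, directed = FALSE)
bridge <- which.max(eb)
ends(g_toy, bridge)            # endpoints of the edge that would be removed first

# Removing that edge splits the graph into two connected components
g_cut <- delete_edges(g_toy, bridge)
components(g_cut)$no
```

The full algorithm repeats this step, recomputing the betweenness after every removal, until no edges are left.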

fastgreedy.community

This algorithm is the Clauset-Newman-Moore algorithm. In this case the algorithm is agglomerative: at each step it merges the pair of communities whose merge yields the largest gain in modularity. This is very fast, but it has the disadvantage of being a greedy algorithm, so it might not produce the best overall community partitioning, although I find it very useful and very accurate.

A Clauset, MEJ Newman, C Moore: Finding community structure in very large networks, http://www.arxiv.org/abs/cond-mat/0408187
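
Since the merge criterion is modularity, it may help to see what that number is. The sketch below assumes a recent igraph release (1.0 or later) and an undirected, unweighted graph. The helper modularity_by_hand is a hypothetical name of mine, not part of igraph, and the toy graph is made up.

```r
library(igraph)

# Hypothetical toy graph: two 5-node cliques joined by a single bridge edge
g_toy <- make_full_graph(5) %du% make_full_graph(5)
g_toy <- add_edges(g_toy, c(1, 6))

# Modularity of a partition: for each community, the fraction of edges that
# fall inside it minus the squared fraction of edge ends attached to it
modularity_by_hand <- function(g, memb) {
  m   <- ecount(g)
  el  <- ends(g, E(g), names = FALSE)  # two-column matrix of vertex ids
  deg <- degree(g)
  sum(sapply(unique(memb), function(k) {
    inside <- sum(memb[el[, 1]] == k & memb[el[, 2]] == k)
    inside / m - (sum(deg[memb == k]) / (2 * m))^2
  }))
}

# Greedy agglomerative clustering, cut where modularity is highest
fc   <- cluster_fast_greedy(g_toy)
memb <- as.integer(membership(fc))

modularity_by_hand(g_toy, memb)  # hand-rolled value
modularity(g_toy, memb)          # igraph's value for the same partition
```

The two values should agree, which is a quick sanity check that the partition is being scored the way you expect.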

spinglass.community

This algorithm uses a spin-glass model and simulated annealing to find the communities inside a network.

J. Reichardt and S. Bornholdt: Statistical Mechanics of Community Detection, Phys. Rev. E, 74, 016110 (2006), http://arxiv.org/abs/cond-mat/0603718

M. E. J. Newman and M. Girvan: Finding and evaluating community structure in networks, Phys. Rev. E 69, 026113 (2004)
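
A minimal usage sketch, assuming a recent igraph release (1.0 or later), where spinglass.community is also available as cluster_spinglass. The algorithm is stochastic, so the seed below only makes this particular run repeatable, and the toy graph and the choice of 5 spins are arbitrary. As far as I know the function expects a connected graph.

```r
library(igraph)
set.seed(42)  # simulated annealing is random; fix the seed for repeatability

# Hypothetical toy graph: two 5-node cliques joined by a single bridge edge
g_toy <- make_full_graph(5) %du% make_full_graph(5)
g_toy <- add_edges(g_toy, c(1, 6))

# 'spins' is an upper bound on the number of communities that can be found
sg <- cluster_spinglass(g_toy, spins = 5, gamma = 1)

membership(sg)
modularity(g_toy, membership(sg))
```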

An example:

# First we load the igraph package
library(igraph)
 
# let's generate two networks and merge them into one graph.
g2 <- barabasi.game(50, power=2, directed=FALSE)
g1 <- watts.strogatz.game(1, size=100, nei=5, p=0.05)
g <- graph.union(g1,g2)
 
# let's remove multi-edges and loops
g <- simplify(g)
 
# let's see if we have communities here using the Girvan-Newman algorithm
# 1st we calculate the edge betweenness, merges, etc...
ebc <- edge.betweenness.community(g, directed=F)
 
# Now we have the merges/splits and we need to calculate the modularity for each split
# for this we'll use a function that, for each number of removed edges, creates a second graph,
# takes its connected components as the membership and uses that membership to calculate
# the modularity of the original graph (not of the graph with the edges removed)
mods <- sapply(0:ecount(g), function(i){
  g2 <- delete.edges(g, ebc$removed.edges[seq(length=i)])
  cl <- clusters(g2)$membership
  modularity(g, cl)
})
 
# we can now plot all modularities
plot(mods, pch=20)
 
# Now, let's color the nodes according to their membership
g2 <- delete.edges(g, ebc$removed.edges[seq(length=which.max(mods)-1)])
V(g)$color <- clusters(g2)$membership
 
# Let's choose a layout for the graph
g$layout <- layout.fruchterman.reingold
 
# plot it
plot(g, vertex.label=NA)
 
# if we wanted to use the fastgreedy.community algorithm we would do
fc <- fastgreedy.community(g)
com <- community.to.membership(g, fc$merges, steps= which.max(fc$modularity)-1)
V(g)$color <- com$membership+1
g$layout <- layout.fruchterman.reingold
plot(g, vertex.label=NA)

Try it!
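
If you are on a recent igraph release (1.0 or later), the functions were renamed and some of the older helper names used above, such as community.to.membership, may not be available. The sketch below is my rough equivalent of the example with the newer names. It assumes that membership() already returns the partition with the highest modularity, so the manual loop is not needed.

```r
library(igraph)
set.seed(1)  # both generators are random; fix the seed for repeatability

# Same two networks as above, merged by vertex id and simplified
g1 <- sample_smallworld(1, size = 100, nei = 5, p = 0.05)
g2 <- sample_pa(50, power = 2, directed = FALSE)
g  <- simplify(g1 %u% g2)

# Girvan-Newman and Clauset-Newman-Moore
ebc <- cluster_edge_betweenness(g, directed = FALSE)
fc  <- cluster_fast_greedy(g)

# Color the nodes by their Girvan-Newman community and plot
V(g)$color <- as.integer(membership(ebc))
plot(g, vertex.label = NA, layout = layout_with_fr)

# How similar are the two partitions? (normalized mutual information)
nmi <- compare(ebc, fc, method = "nmi")
nmi
```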

Apple: Imagine this…

January 20th, 2012

Imagine Art

Imagine you are a visual artist.

Imagine you need to buy some new brushes.

Imagine that with the receipt there was also a small paper saying that
by using these brushes, any painting you made could be offered free to
anyone you wanted, but if you wanted to sell your painting you could
only sell it through the brush company, which would require
a cut of the sale price.

Imagine how you’d feel about that.

Now can someone explain to me why any book you make with the iBooks Author
software from Apple
comes with this exact condition?

Big data roundup

January 20th, 2012

For future reference, in case anyone needs to point someone to Big Data: O’Reilly Radar has a nice roundup of Big Data companies that offer Hadoop solutions.

The coming war on general computation

January 19th, 2012

The video of Cory Doctorow at the 28th Chaos Communication Congress, held in the last days of December 2011, is a must-see for anyone interested in the copyright wars and the attack on general computation. After yesterday’s blackout it is important to see this.

Cory Doctorow: The coming war on general computation

The copyright war was just the beginning

The last 20 years of Internet policy have been dominated by the copyright war, but the war turns out only to have been a skirmish. The coming century will be dominated by war against the general purpose computer, and the stakes are the freedom, fortune and privacy of the entire human race.

The problem is twofold: first, there is no known general-purpose computer that can execute all the programs we can think of except the naughty ones; second, general-purpose computers have replaced every other device in our world. There are no airplanes, only computers that fly. There are no cars, only computers we sit in. There are no hearing aids, only computers we put in our ears. There are no 3D printers, only computers that drive peripherals. There are no radios, only computers with fast ADCs and DACs and phased-array antennas. Consequently anything you do to “secure” anything with a computer in it ends up undermining the capabilities and security of every other corner of modern human society.

And general purpose computers can cause harm — whether it’s printing out AR15 components, causing mid-air collisions, or snarling traffic. So the number of parties with legitimate grievances against computers are going to continue to multiply, as will the cries to regulate PCs.

The primary regulatory impulse is to use combinations of code-signing and other “trust” mechanisms to create computers that run programs that users can’t inspect or terminate, that run without users’ consent or knowledge, and that run even when users don’t want them to.

The upshot: a world of ubiquitous malware, where everything we do to make things better only makes it worse, where the tools of liberation become tools of oppression.

Our duty and challenge is to devise systems for mitigating the harm of general purpose computing without recourse to spyware, first to keep ourselves safe, and second to keep computers safe from the regulatory impulse.

The full transcript is also available.

R in the Top 20 of Programming Languages

January 17th, 2012

Programming languages come and go, but it’s nice to see what’s gaining momentum and what’s not. In the latest Tiobe report for January 2012 we can see some interesting surprises in the top 20 chart of programming languages. C is still in high demand and closing in on Java. Together they account for about a third of the programming language landscape.

Another interesting aspect is the fading of Python. Python lost half of its market share. Maybe this is because of Python 3 and its incompatibilities with Python 2.x, which might have sent many programmers in search of other solutions. Another problem might be the GIL, which hinders threaded programming in Python at a time when programming is moving towards concurrent and distributed models.

Also interesting is the rise of R. R is one of my favorite languages for science. It makes reproducibility of research results very easy (especially if you use Sweave with R) and for any kind of statistical analysis it is almost perfect. It also produces great plots for scientific publications.

Kill the Poor – Don’t show this to our PM

January 16th, 2012

#sopa, #pl118 and #pipa half baked stories…

January 16th, 2012

It looks like SOPA has been put to sleep for a while, but that doesn’t mean much when there’s also PIPA (where do they get these names?). PIPA is not as well known as SOPA but basically does the same thing, allowing DNS censorship.

Around the burg, #pl118 is a bit dormant and most of the online traffic has been retweets of scattered texts. Although there are many users online (according to some sources Twitter gains 11 new accounts per second), there are lots of trolls and hashtag surfers.

I’m now working on a “social graph” of the #pl118 tag, and the main results show a strongly hierarchical network, with the discussion highly centralized around very few individuals. Remove those nodes and the meme is probably gone. New individuals need to join the fight against #pl118 to make the conversation more sustained and robust.
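
To make “highly centralized” concrete, here is a toy sketch, assuming a recent igraph release (1.0 or later). The star graph is a hypothetical stand-in for a hub account and its retweeters. It is not the #pl118 data.

```r
library(igraph)

# Hypothetical retweet network: one hub account connected to 19 others
g_star <- make_star(20, mode = "undirected")

# Degree centralization: close to 1 means one node dominates the network
centr <- centr_degree(g_star, loops = FALSE)$centralization
centr

# Remove the most connected node and count the pieces that remain
hub <- which.max(degree(g_star))
g_without_hub <- delete_vertices(g_star, hub)
n_pieces <- components(g_without_hub)$no
n_pieces
```

In a network like this, removing the single hub leaves every other account isolated, which is the fragility I mean above.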

(The above pic is from the #pl118 graph. It’s not complete, so in the next few days I may have more details.)

#pl118 – Is it losing steam? Some stats on the Portuguese copy levy.

January 14th, 2012

Count of #pl118 on Twitter
The Portuguese private copy levy is still being discussed on social networks, but the discussion seems to have lost a bit of steam. Maybe that is because the mainstream media hasn’t really picked it up (why?) except for some sporadic entries. In any case it is necessary to keep the discussion going, as the proposal is totally absurd.

Most active users discussing the #pl118

jonasnuts      :  412
RuiSeabra      :  356
jmcest         :  223
ncruz77        :  208
super_nortenho :  173
streetfiteri   :  172
paulasimoes    :  168
DiogoCMoreira  :  160
ZW3I           :  122
(Other)        : 2698

Most used clients to post about the #pl118

web                : 1617
TweetDeck          : 1280
HootSuite          :  404
Twitter for Mac    :  308
Twitter for iPhone :  109
Twitter for iPad   :  106
Tweet Button       :  103
Echofon            :   90
Seesmic            :   79
(Other)            :  596

(Note: Data for Jan 14th not final, updated the lists to include a few more users and clients.)
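
For anyone who wants to build lists like these, one way to get a “top N plus (Other)” tally in base R is summary() on a factor with the maxsum argument. The vector below is made-up sample data, not the #pl118 tweets, and the variable names are only for illustration.

```r
# Hypothetical sample: one entry per tweet, holding the author's screen name
users <- c(rep("alice", 5), rep("bob", 3), rep("carol", 2), "dave")

# With maxsum = 3, the two most frequent levels are kept and
# everything else is summed into "(Other)"
top_users <- summary(factor(users), maxsum = 3)
top_users
```

Raising maxsum to 10 gives the same shape as the lists above: nine named entries plus an (Other) row.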