February 25, 2016

Data analysis of strikes in Italy

Hi guys!

As a few of you probably know, I spent a few years of my life living in Milan. Public transportation there is very weird: public transportation strikes happen unexpectedly often. Moreover, these strikes are unusually frequent on Fridays and Mondays, and I couldn't come up with a rational explanation for that </sarcasm>.

I was really puzzled by this strange behavior, and I wanted to investigate further. Long story short, a few months ago my friend Andrea and I had the idea of forecasting strikes in Italy; our hypothesis was that strikes can be predicted because they follow a regular (roughly monthly) pattern.

Our idea was to answer the following questions:
  • Can we cluster strikes into different categories? How many kinds of strikes are there? Can we define a distance between strikes and place them in a metric space?
  • Is there any correlation with other data (how the economy is going, stock markets, GDP, and so on)?
  • What are the conditional probabilities? That is: what is the conditional probability of event X given that Y happened previously?
  • Is it true that strikes happen regularly, and only on specific days of the week?
  • Can we forecast the next strike given the previous strikes for that sector/region/company?
  • Can we make a heatmap of the probability density of strikes for each sector/region/company across the 365 days of the year? (I imagine that as a 12x31 heatmap)
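
To make the day-of-week and heatmap questions a bit more concrete, here is a minimal sketch in plain Python. It is not what we actually ran: the dates in `strike_dates` are made up for illustration, and the function names `weekday_counts` and `month_day_grid` are my own. With the real data, the list of dates would come from the MySQL dump instead.

```python
from collections import Counter
from datetime import date

# Hypothetical sample dates, NOT taken from the real dataset.
strike_dates = [
    date(2014, 1, 10),  # Friday
    date(2014, 1, 13),  # Monday
    date(2014, 2, 14),  # Friday
    date(2014, 3, 5),   # Wednesday
    date(2015, 1, 10),  # Saturday
]

WEEKDAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_counts(dates):
    """Count how many strikes fall on each day of the week."""
    counts = Counter(d.weekday() for d in dates)  # Monday == 0
    return {name: counts.get(i, 0) for i, name in enumerate(WEEKDAY_NAMES)}


def month_day_grid(dates):
    """Build a 12x31 grid of counts: rows are months, columns are days.

    Cells for days that do not exist (e.g. February 31) simply stay at 0.
    """
    grid = [[0] * 31 for _ in range(12)]
    for d in dates:
        grid[d.month - 1][d.day - 1] += 1
    return grid


by_weekday = weekday_counts(strike_dates)
grid = month_day_grid(strike_dates)

print(by_weekday)
print("Strikes on January 10 (all years):", grid[0][9])
```

Dividing each cell of the grid by the total number of strikes would turn the counts into an empirical probability, and the grid can then be handed to any plotting tool to draw the heatmap. Filtering `strike_dates` by sector, region or company before calling these functions would give one grid per group.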

It took a couple of calls with the very kind Commissione di Garanzia Sciopero, which gave us the necessary data in a proper MySQL format. (The data is public, but we wanted the .sql.)


We then selected a few tools that we could use to answer these questions, or that we simply wanted to use or learn. Among the many:
  • MySQL, R
  • Orange
  • scikit-learn
  • IBM analytics
  • Some implementation of neural network algorithms (PyBrain perhaps)
  • A statistical model developed from scratch by us

So what? Time runs quickly, we were pretty busy at that time, and we set this project aside. I have released all the data on GitHub, plus a few notes I made back in those days. I apologize: some of them might be in Italian.

The data is as accurate as possible: under our current law, CGSSE must record every strike that occurs, and every trade union must inform this organization of its strikes.

It might be an interesting project for whoever wants to learn any of these tools (or wants an answer to those questions). If anyone is willing to play with this kind of data, please let me know! I wouldn't rule out some follow-ups appearing on this blog in the coming months: I'm currently busy with other urgent/important projects, but I may well dedicate a weekend or two to this analysis.

A SELECT from sql db


Trenitalia we love you





P.S.
It is now 2016, and the data for 2015 should be available as a test set :P

P.P.S.
Even if you find a significant correlation in this kind of data, you cannot prove any malice in the organization of strikes. It can be expected that strikes are organized when it is most convenient, under various notions of convenience, and you couldn't prove malice or anything else (for example, that people strike because they are lazy).


Stay tuned (in drop D).