People today aspire to use Big Data in elections, sports, healthcare, business, national planning, and what not. Michael Lewis's 2003 book Moneyball depicted how the general manager of the Oakland Athletics built a successful baseball team by using data and computer analytics to recruit new players. The 'Moneyball' culture soon began to dominate every bit of our life. And Silicon Valley entered the fray with a new set of professionals, the data scientists, a profession that, according to Harvard Business Review, is the most attractive job of the 21st century.
People today expect data science to devise profit-making business strategies, come up with winning election tactics, or generate World Cup-winning styles of play. Data scientists themselves often aspire to do so by reading the heartbeat of the data. But can they succeed? Big Data analytics may be like 'churning the ocean' in search of the 'nectar' hidden deep within it, as depicted in the great epic Mahabharata. That is a gigantic project, for sure. One needs a lot of effort and expertise to obtain the nectar, and there is every chance of being deceived by the other substances, including deadly poison, that emerge in the process of churning.
The ongoing pandemic, however, provided a golden opportunity for data science to exhibit its strength. It was its litmus test as well.
Misleading predictions
As Covid-19 yielded loads of freely available data, various data scientists came up with lots of predictions and strategies: the eventual number of infections, the eventual number of deaths, the duration of lockdown needed to control the pandemic, and so on.
In fact, forecasting the trajectory of the disease over time became almost a fashionable exercise for many. No wonder, then, that in many cases these forecasts contradicted one another, and most of them eventually proved utterly wrong, misleading and useless.
Predicting the future course of events using the techniques of data science reminds one of the Tom Cruise-starrer, Steven Spielberg's 2002 film Minority Report, where the PreCrime police force of Washington DC in 2054 predicts even future murders using data mining and predictive analytics!
In practice, data science often uses statistical models and techniques that rest on various underlying assumptions. Often, real data does not satisfy the assumptions of these models.
For example, for analysing the data of the pandemic, models such as SIR and SEIR, or some of their variants, were widely used. But the dynamics of a new and unknown disease may be far more complicated and unpredictable, and it is most likely that they would fail to satisfy the assumptions of those classical models or their tweaks. Serious errors are then bound to occur, and they get compounded as the data grow. Running routine software packages over big data is never adequate, and is often plain wrong.
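To see how strong those assumptions are, consider a minimal sketch of the classical SIR model (a hypothetical illustration in Python; the parameter values below are invented, not fitted to any real outbreak):

```python
# A minimal sketch of the classical SIR model, for illustration only.
# The parameter values (beta, gamma, population size) are made up,
# not estimated from any real epidemic data.
import numpy as np

def simulate_sir(beta=0.3, gamma=0.1, n=1_000_000, i0=10, days=200, dt=1.0):
    """Simulate SIR dynamics with simple Euler steps:
    S' = -beta*S*I/N,  I' = beta*S*I/N - gamma*I,  R' = gamma*I
    """
    s, i, r = n - i0, i0, 0.0
    history = []
    for day in np.arange(0, days, dt):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((day, s, i, r))
    return history

# The model's core assumptions -- homogeneous mixing, constant beta and
# gamma, permanent immunity -- are exactly what a new disease may violate.
trajectory = simulate_sir()
peak_day, _, peak_infected, _ = max(trajectory, key=lambda row: row[2])
print(f"Peak of {peak_infected:,.0f} simultaneous infections around day {peak_day:.0f}")
```

Even in this toy version, a small change in the contact rate beta shifts the timing and height of the peak dramatically; for a novel pathogen, neither beta nor gamma is known in advance, and neither stays constant, which is precisely why such forecasts go astray.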
With the ever-expanding horizon of the 'Internet of Things', data is growing exponentially. The size of the digital universe was predicted to double every two years beyond 2020, and the ongoing pandemic may well have induced an even higher rate of growth.
However, unless some event like the Cambridge Analytica scandal breaks, we rarely notice that our every footstep is being added to this ocean of data. The world has become data-addicted. But with so much data, the needle now sits in an increasingly large haystack.
In 2008, Google launched the web service Google Flu Trends, with the objective of making accurate predictions about flu outbreaks by aggregating Google Search queries. The project, however, failed: people often search for symptoms that resemble flu but are not actually flu. And when the much-hyped project turned out to be a disastrous failure, people came to understand that big data might not be the holy grail.
Also, current computational equipment is certainly inadequate to handle millions of variables and billions of data points. Worse, the number of pairs of variables showing significant 'spurious' or 'nonsense' correlation grows in the order of the square of the number of variables: with p variables there are p(p-1)/2 pairs, so a million variables yield about half a trillion pairs, and the chance correlations among them are almost impossible to identify.
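A toy numerical experiment makes the point (an assumed setup: purely independent Gaussian noise and an arbitrary correlation cutoff of 0.2, both chosen only for illustration):

```python
# A toy demonstration of 'spurious' correlations: with p independent random
# variables there are p*(p-1)/2 pairs, so the number of pairs that look
# 'significantly' correlated by pure chance grows roughly with the square of p.
import numpy as np

rng = np.random.default_rng(0)

def count_spurious_pairs(p, n=100, threshold=0.2):
    """Count variable pairs whose sample correlation exceeds the threshold,
    even though all variables are independent noise by construction."""
    data = rng.standard_normal((n, p))        # n observations of p variables
    corr = np.corrcoef(data, rowvar=False)    # p-by-p correlation matrix
    upper = corr[np.triu_indices(p, k=1)]     # each pair counted once
    return int(np.sum(np.abs(upper) > threshold))

for p in (10, 100, 1000):
    print(p, count_spurious_pairs(p))
```

Every correlation this experiment flags is 'nonsense' by construction, since the variables are independent by design; yet the count of flagged pairs roughly quadruples each time the number of variables doubles.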
Thus, churning the ocean of big data may yield both nectar and poison, and separating the two is a daunting task. Statistics is still in its infancy in this context, and is not yet equipped to handle these kinds of problems. Let us be honest enough to admit that.
Overall, data science, relying as it does on statistics for its models and analyses, may not yet be ready for complex predictions such as the complicated yet verifiable trajectory of Covid-19. For the time being, data science's best bet may be to engage with open-ended, unverifiable problems.
The writer is Professor of Statistics, Indian Statistical Institute, Kolkata