
Big Data in Life Sciences – Why ‘doing things the old way’ is the Biggest Barrier to Progress

 

2020 was a record year for many reasons, not least because it saw the world produce more data than ever before. In the life sciences sector, COVID-19 certainly played a significant part: researchers from around the world published more than 87,000 papers about coronavirus between the start of the pandemic and October 2020 [1].

We’re likely to break that record again, as global data creation is projected to grow to more than 180 zettabytes by 2025 [2].

That’ll be why big data is such a ‘hot topic’ then, right? Well, yes, partly. But let me put it to you that we should really be talking about the big questions individual researchers and organizations are trying to answer before we talk about big data challenges and tools. In my experience, the pharma research industry is very pragmatic: if researchers can answer their questions with ‘smaller data’, they will.

 

Big questions

So it is difficult questions that drive the need for ever more data. Questions like: how many individuals in the biobank carry the exact genetic mutation connected to my drug target, and what lab measurements, lifestyle data, and clinical and family history can I access for those individuals? (See the sketch after these questions for what answering this might look like in practice.)

Or: as our digital transformation accelerates, how can I make sure our data is organized and rapidly accessible in a scalable computational platform, one that is future-proofed to meet requirements we don’t know about today?

Or: how can I give my bioinformatics and R&D bench scientists access to easy-to-use APIs and GUI tools to cost-effectively compute, visualize, and report their findings within a research platform that is not siloed, outdated, or inefficient at handling very large datasets? I want both computational efficiency and rigorous statistical methods for multi-omics analysis and human genetics.
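
To make the first of these questions concrete, here is a minimal sketch of what answering it might look like in code. It assumes small, hypothetical tabular extracts (genotypes.csv, labs.csv, lifestyle.csv, clinical_history.csv) with made-up column names and a made-up gene/variant pair; a real biobank query would run against purpose-built genomic and phenotypic stores, not flat files.

```python
import pandas as pd

# Hypothetical biobank extracts; file names and column names are assumptions
# made for illustration, not a real biobank schema.
genotypes = pd.read_csv("genotypes.csv")        # participant_id, gene, variant
labs = pd.read_csv("labs.csv")                  # participant_id, test_name, value, unit
lifestyle = pd.read_csv("lifestyle.csv")        # participant_id, smoker, alcohol_units_per_week
clinical = pd.read_csv("clinical_history.csv")  # participant_id, diagnosis_code, family_history

# Question 1: how many individuals carry the exact variant linked to my drug target?
target_gene, target_variant = "BRCA2", "c.5946delT"   # assumed target, for the example only
carriers = genotypes.loc[
    (genotypes["gene"] == target_gene) & (genotypes["variant"] == target_variant),
    "participant_id",
].unique()
print(f"{len(carriers)} participants carry {target_gene} {target_variant}")

# Question 2: what other data do I hold for those individuals?
carrier_labs = labs[labs["participant_id"].isin(carriers)]
carrier_lifestyle = lifestyle[lifestyle["participant_id"].isin(carriers)]
carrier_clinical = clinical[clinical["participant_id"].isin(carriers)]

# One joined, per-participant view of the multi-modal data available for the cohort.
cohort = (
    carrier_labs
    .merge(carrier_lifestyle, on="participant_id", how="outer")
    .merge(carrier_clinical, on="participant_id", how="outer")
)
print(cohort.head())
```

At biobank scale, of course, these ‘extracts’ are terabytes of multi-modal data spread across silos, which is exactly where flat-file habits stop working and a scalable, well-curated platform becomes necessary.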

 

Old school vs modern computational platforms

As I mentioned above, pragmatism often rules. A recent conversation I had with a long-time industry expert summed things up like this: pharma research is good at collecting and analyzing data focused on today’s problem. If researchers need the same dataset again later (sometimes much later), it is, without a doubt, easier to simply collect it again, perhaps including additional data, than to search for it in a data archive.

Furthermore, as new techniques allow new data types to be captured, and as we understand more and more about what multi-modal data can add, the cost of data collection continues to rise dramatically. Millions of dollars are now routinely spent each year collecting data, and the cycle of data-silo creation continues.

But, of course, real insight comes not from data collection but from intelligent data curation, computation, and application.

It is striking how many of the people I speak with are convinced that this industry inertia needs to change. They are willing to step back from old-school wisdom and consider new methods and new approaches.

 

References:

  1. Cai, X., Fry, C.V. & Wagner, C.S. International collaboration during the COVID-19 crisis: autumn 2020 developments. Scientometrics 126, 3683–3692 (2021). https://doi.org/10.1007/s11192-021-03873-7
  2. Statista. Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025 (Total data volume worldwide 2010–2025). Accessed 12 October 2021.

 © 2022 LeackStat.com