While doing the research for this article, I found numerous references in dark data content to astrophysical terms like dark energy and dark matter. I almost forgot that most of our universe is actually dark: roughly 68% dark energy and 27% dark matter, leaving us with 5% normal matter. Comparatively, not much to work with, but as humans we have done quite alright so far with much less.
When it comes to data though, I guesstimate the situation to be even darker: 99.9999999% dark. In this post, I want to explain why this is, the problem it will quickly pose for us humans, and how we can enlighten the situation.
Defining Dark Data
Within the literature, you will find numerous definitions of the term “dark data”. Gartner sees it as “information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.” This focus on ignored or abandoned data within an organization is quite common in scientific publications.
Yet, it falls short in many cases for machine learning or other large-scale analytics. Quite often you need data that is not already accessible within your organization. Furthermore, with this perspective we wouldn’t get close to the 99.9+% guesstimate.
The best definition I could find, and the one with the biggest overlap with what we will be discussing, comes from the famous statistics professor David J. Hand. He literally wrote the book about dark data, and I can highly recommend reading it. Hand’s simple definition of dark data is “data you don’t have”.
Within his book, Hand describes 15 different types of dark data, why you don’t have each of them, and how that leads to suboptimal results. No worries, this post will not be a review or a summary of his book.
We will be taking a rather practical, solution-oriented viewpoint to a subset of dark data. Namely, data you theoretically could have but don’t have access to. Data that doesn’t exist is also dark but why bother?
A Universe of Data
No one can know exactly how much digital data already exists. But the latest predictions from the International Data Corporation (IDC) amount to 64 zettabytes in 2020 (growing to 90 ZB in 2025). For comparison, that is 660 billion Blu-ray discs, 33 million human brains or 330 million of the world’s largest hard drives. So quite a bit.
Now ask yourself: How much of this data do you have access to? If you come to the same conclusion as I did, then you might also think more towards 99.9+% of data being dark to you.
Of course, it is rather improbable that most of this data would be useful to you. Nevertheless, there is a huge amount of value out in the data universe that you cannot access today.
There are some first initiatives to simplify data access. Open data initiatives are visible in many countries already and a growing number of organizations contribute. NASA, CERN, Wikipedia, GitHub, OpenStreetMap are just a few examples where collaboration on data is made open and simpler.
But, in many cases, data that would help you in a machine learning project, does not necessarily reside within your organizational boundaries nor is it available as open data.
Let’s assume you are researching a cure for a rare disease. A rare disease is most commonly defined as a life-threatening disease that fewer than 1 in 2,000 people have. Within a single country, it is unlikely that you will find a sufficiently large dataset to train your model. But a combined effort of clinics across the world could yield a large enough dataset.
For instance, clinics in Europe and the USA could theoretically aggregate a dataset of up to ~390,000 patients. A federated learning approach across these clinics could make a lot of sense here: it avoids centralizing data while still training models as close to the data as possible.
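To make the idea concrete, here is a minimal sketch of federated averaging (FedAvg), the textbook federated learning algorithm: each simulated clinic trains on its own data locally, and only model weights, never patient records, are aggregated. The clinic setup and function names are illustrative, not a real federated learning framework.

```python
# Minimal FedAvg sketch: local training per site, weight averaging globally.
import numpy as np

rng = np.random.default_rng(0)

def local_train(X, y, w, lr=0.1, epochs=50):
    """Plain gradient descent on a local linear regression task."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fed_avg(datasets, n_rounds=10, dim=3):
    """Average locally trained weights, weighted by local dataset size."""
    w_global = np.zeros(dim)
    sizes = np.array([len(y) for _, y in datasets])
    for _ in range(n_rounds):
        local_ws = [local_train(X, y, w_global) for X, y in datasets]
        w_global = np.average(local_ws, axis=0, weights=sizes)
    return w_global

# Simulate three clinics holding disjoint slices of the same task.
true_w = np.array([1.0, -2.0, 0.5])
datasets = []
for n in (200, 150, 50):  # differently sized patient cohorts
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=0.01, size=n)
    datasets.append((X, y))

w = fed_avg(datasets)
print(np.round(w, 2))  # close to the true weights [1., -2., 0.5]
```

Note that the raw arrays `X` and `y` never leave the loop that represents each clinic; only the trained weight vectors are shared and averaged. Real deployments add secure aggregation and differential privacy on top, since even weights can leak information.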
But within such a project, one would face one or all of the following roadblocks preventing direct access:
Data is protected by privacy-regulations like GDPR, CCPA, HIPAA or others.
Data is the intellectual property of someone else and by providing direct access, the custodian would lose this intellectual property and its value.
Data is held by a direct competitor, so why give away a potential competitive edge?
To make it even more realistic: a machine learning model is also intellectual property, namely yours or your company’s. Sending your model to external servers to comply with, for example, GDPR or HIPAA is maybe not the best idea either.
So, let’s think through the requirements for letting your models learn from such dark data without compromising your IP or privacy needs.
Requirements for Enlightening Dark Data
So far, we have established that a lot of the valuable data you need is dark to you and that different roadblocks can prevent direct access. To achieve optimal accuracy and minimize loss, our algorithms need granular access to large, high-quality datasets. But do we really need full access for training?
As you might already know, the good news is we don’t. All we need are the patterns we can harness from the datasets, so computational access to external data would suffice. To meet real-world requirements, this computational access needs to be governable and mapped to compliance requirements.
Evolving Regulations. Adaptive Compliance.
As always, it is helpful to see the world with different eyes. For our case here, let’s explore the perspective of the entity holding the data, aka the data custodian.
When running computations on sensitive or privacy-protected data, the data custodian might be bound to internal compliance procedures or external regulations. GDPR, HIPAA and other data-related regulatory frameworks are explicit about data access by third parties.
For instance, GDPR defines different rules per use case and differentiates between access for research and for commercial purposes. Most recently, the European Union published the AI Act for discussion, the White House released a strategy paper on AI and data privacy, and the UK is boosting its AI capability with a dedicated taskforce and regulatory framework. For our rare disease project, this means a data custodian and our ML project already need to comply with three very different legal ecosystems to enlighten the data we need.
It is unlikely that regulators will stop there, so a general solution for making data accessible across boundaries is necessary.
Governance but Computational
For some, governance might be a dry and complex topic. But anyone contributing data to a project needs computational governance: the capability to manage the access of each algorithm running on the data while considering who sent the computation request.
The necessary data might also be IP-sensitive, so a data custodian needs to stay in full control of which computations are allowed, to mitigate the risk of IP leaks or reverse engineering of the data. This requires control over computations and, if you want to do it right, on a functional level.
Furthermore, a data custodian might also be required to know at any given time who did what with the data (think HIPAA or PCI regulations), so traceability and auditability play a key role in allowing any kind of access in the first place.
In a nutshell: A data custodian needs a highly adaptive solution enabling and managing all these things for us in an automated fashion.
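To illustrate what such computational governance could look like in its simplest form, here is a toy sketch of a custodian-side gateway: computations are allowlisted per requester on a functional level, only aggregate results cross the boundary, and every request, granted or denied, lands in an audit trail. All names here (`ComputeGateway` and so on) are hypothetical, not the API of any real product.

```python
# Toy computational governance: function-level allowlists plus an audit log.
import statistics
from datetime import datetime, timezone

class ComputeGateway:
    def __init__(self, data):
        self._data = data      # raw records never leave the custodian
        self._policy = {}      # requester -> set of allowed function names
        self.audit_log = []    # traceability: who did what, and when

    def allow(self, requester, func_name):
        self._policy.setdefault(requester, set()).add(func_name)

    def run(self, requester, func_name):
        granted = func_name in self._policy.get(requester, set())
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "who": requester,
            "what": func_name,
            "granted": granted,
        })
        if not granted:
            raise PermissionError(f"{requester} may not run {func_name}")
        # Only aggregate results cross the boundary, never raw records.
        fn = {"mean": statistics.mean, "count": len}[func_name]
        return fn(self._data)

gateway = ComputeGateway(data=[4.1, 3.7, 5.2, 4.8])
gateway.allow("ml-project-a", "mean")

print(gateway.run("ml-project-a", "mean"))   # 4.45
try:
    gateway.run("ml-project-a", "count")     # not allowlisted -> denied
except PermissionError as e:
    print(e)
```

The point of the sketch is the shape, not the scale: real systems enforce policies per computation signature, attach cryptographic identities to requesters, and make the audit trail tamper-evident, but the custodian-controlled allowlist and the always-on log are the core of it.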
Signs of Enlightenment Today
The potential of privacy-preserving computations accessing dark data is huge. Across industries, we can already see organizations collaborating with their partners, customers, or even competitors to shed light on the dark data relevant to them. Let’s look at a few examples.
MELLODDY is a collaborative project where 10 pharma companies came together to train ML models on their distributed, highly sensitive, small molecule data.
In the financial sector, privacy-preserving federated machine learning models trained and run across jurisdictional data silos are used to detect financial crime in real time and to respond quickly and effectively. Recently, Microsoft even partnered with the London Stock Exchange Group (LSEG) to gain access to financial datasets suitable for machine learning. By training ML models on this data, Microsoft can develop solutions to problems facing the financial industry today.
A final example is a packaging company, Greiner, which reduced its fixed asset investments by 35% by collaborating with its suppliers on their data along the whole supply chain.
The value of accessing dark data may not be immediately apparent and varies across use cases. However, these examples show that collaborating on data can unlock immense value.
A Glimpse Into the Future
Personally, I think the challenges facing an interconnected, global humanity are way too big not to collaborate on a global data level. With machine learning and artificial intelligence becoming an integrated part of our daily lives, we had better leverage the best, most diverse data sources out there to train these crucial tools. To create a global data economy, we need to find a way for organizations to collaborate across regulatory boundaries and to overcome competitive gatekeeping. The technology is already there today, at Apheris and elsewhere.
In the previous section, we discussed projects in which this future is already a reality. Our founders put it well when they said at a conference: “Your own data is just a small proportion of what is available – leverage what is out there.” Let’s enlighten as much dark data as possible.
Gartner (2023): https://www.gartner.com/en/information-technology/glossary/dark-data
Hand, D. J. (2022): Dark Data: Why What You Don’t Know Matters
Reinsel, Gantz, Rydning (2018): The Digitization of the World; IDC White Paper
Image Sources (Thank you so much Kayla & Alexander)
Title photo by Kayla Farmer
“Europe by night” by Alexander Gerst ESA