When trying to answer a question about the value of data, it is not difficult to end up in philosophical discussions. But as vast and complex as the question may seem, it boils down to supply and demand: there is only value in data if it can be used, and if there is appropriate access to that data when it is needed. A recent report by PwC UK, putting a value on data, described it with the following statement: “All things being equal, a consumer will choose the most accessible dataset.” Or, more bluntly in the words of Henry Ford, “If it doesn’t add value, it’s waste”.
We are facing more uncertainty and global challenges than ever, such as the Covid-19 pandemic, supply chain disruptions, and the climate crisis. At the same time, companies are making massive strides in the industrialization and operationalization of machine learning (ML) and artificial intelligence (AI). Both trends lead to a sharp increase in demand for research-grade, high-quality data, and result in more data sharing. If manufacturers and suppliers could share their production data, they could create highly valuable AI models that help to reduce wastage or optimize their supply chain for carbon impact. If pharmaceutical companies could get reliable and repeatable access to data from healthcare providers or clinical research organizations, they could find new treatments and improve patient outcomes. No matter the industry, organizations that operate in value chains own different pieces of data that could be combined to present a more accurate picture of reality. That is where the value of complementary data lies.
From federated data to complementary strategic data assets
Today, many different terms are used to describe data sitting in the silos of different companies: federated data, decentralized data, or distributed data. They all describe the same thing: data sets that are currently difficult or even impossible to access because of data sensitivity, IP protection, or the technical complexity of integration. Nonetheless, their business value is only realized when they can be leveraged in collaboration with third parties. Data collaboration helps to overcome these complexities, so that federated or decentralized data sets can become strategic, complementary data assets that fuel innovation.
7 Value Drivers of Complementary Data
To understand the value of complementary data, you must look at it holistically and in the context of data collaboration. In reality, data is distributed and sits in silos under the ownership of different organizations. Today’s challenge is to accept this federated data reality and to work with it. In this article, we help you assess your data against seven key value drivers (based on the aforementioned PwC report) and show how your organization can optimize its data’s potential by using federated approaches to accelerate innovation and generate future economic benefits with partners.
1. Data Value Driver: Exclusivity
“The more unique the dataset, the more valuable the data is.”
“The extent of the impact of exclusivity is dependent on the use case.”
The exclusivity of data sets is both a curse and a blessing. On the one hand, the value of data increases the more unique it is. On the other hand, the need for security measures to protect IP also increases. Often a company's most valuable data cannot be shared or leveraged at all due to its sensitivity and requirements for IP protection.
A federated approach would allow authorized users to perform queries on the data within a network of organizations. The results retrieved from each partner would then be aggregated and returned to the user who submitted the query – while the data never leaves the organization that holds it. By layering additional privacy-enhancing technologies, such as Differential Privacy, and adding encryption methods to the computation and data flows, analyses can be performed even on the most sensitive datasets.
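The query flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular platform: the names `Partner`, `local_sum_and_count`, and `federated_mean` are hypothetical, the query is a simple mean, the record count is treated as public, and differential privacy is approximated by adding Laplace noise to the clipped sum.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Partner:
    """Hypothetical data holder: `values` stay inside this object."""
    name: str
    values: List[float]

    def local_sum_and_count(self, lower: float, upper: float) -> Tuple[float, int]:
        # Clip locally so that replacing one record shifts the sum by at most
        # (upper - lower); only the two aggregates are returned.
        clipped = [min(max(v, lower), upper) for v in self.values]
        return sum(clipped), len(clipped)


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponential draws follows a Laplace
    # distribution with the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def federated_mean(partners: List[Partner], lower: float, upper: float,
                   epsilon: Optional[float] = None,
                   rng: Optional[random.Random] = None) -> float:
    """Combine per-partner aggregates into one mean, optionally with noise."""
    results = [p.local_sum_and_count(lower, upper) for p in partners]
    total = sum(s for s, _ in results)
    count = sum(n for _, n in results)
    if count == 0:
        raise ValueError("no records available across partners")
    if epsilon is not None:
        # Assumption: the record count is public, so only the sum is noised.
        total += laplace_noise((upper - lower) / epsilon, rng or random.Random())
    return total / count


# Illustrative data only; the partner names and values are made up.
partners = [
    Partner("supplier_a", [1.0, 2.0, 3.0]),
    Partner("supplier_b", [4.0, 5.0]),
]
exact_mean = federated_mean(partners, lower=0.0, upper=10.0)
noisy_mean = federated_mean(partners, lower=0.0, upper=10.0, epsilon=1.0)
print(exact_mean, noisy_mean)
```

In this sketch the noise is added at the aggregation step, so the aggregator still sees each partner's exact sum. A real deployment might instead add noise at each partner or encrypt the intermediate results, which is the kind of layering the paragraph above refers to.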
2. Data Value Driver: Timeliness
“The more timely and up to date the data is, the more valuable it is.”
“However, timeliness is a relative measure, which is dependent on the use case.”
How long does it take you to onboard new data partners? In our experience, depending on the use case, it can take up to a year and 20-30 calls to align all compliance teams for integrating new data partners into an existing pharma data sharing network (and this excludes the technical integration which we will cover later in this article).
The costs of adapting data sharing frameworks and infrastructures can become unpredictable, especially for longitudinal data analysis, for example when a clinical trial examines the effect of treatments on the disease process over time.
In federated data networks, the onboarding of new data partners is much more streamlined than before. By using contractual blueprints in combination with privacy-enhancing technologies, the need for manual and very detailed data-sharing agreements can be circumvented.
3. Data Value Driver: Accuracy
“Accuracy concerns the degree to which the data correctly describes the ‘real world’”
“Provenance is a critical aspect of accuracy, as it lets the user understand the history of data and account for errors”
Virtually every organization in a value chain collects data, for example in the production process of a particular product. This data exists in silos at different companies and under different ownership. If this complementary data could be combined into one large data set, a much more accurate depiction of reality would be obtained. Eventually, this would lead to better ML models or other data applications that have the potential to optimize the entire production chain. Of course, this description is massively simplified and does not reflect reality: for compliance and privacy reasons, it is impossible to create large, centralized data sets from individual ones that belong to different organizations. Doing so would dilute data ownership rights and entail downstream consequences, for example when trying to audit said data applications.
Data provenance is becoming increasingly relevant due to the rise of Ethical AI. Especially when collaborating on sensitive data, there must be robust safeguards in place that ensure transparency, reproducibility, and governance. Capgemini states in a report that 47% of data sharing companies are conducting frequent audits of data partners to ensure that ethical guidelines are being followed. AstraZeneca put the internal costs of an ethics-based auditing process for a single longitudinal industry case at 2,000 person-hours, as described in this paper.
Federated approaches can achieve both: data analysis and data science on accurate and granular data, and accountable governance frameworks that balance data risk against data value.
4. Data Value Driver: Completeness & Consistency
“Generally, the more complete a dataset, the more valuable it is as it reduces bias.”
“Completeness of the data must be determined subject to the use case.”
“Data is consistent if it conforms to the syntax of its definition.”
Only in very rare cases can data sets be used right away. Often, they lack a data structure or an alignment toward a Common Data Model. They are of questionable quality, full of errors, and have missing values and selection distortions. There is always a need for good data engineering, which requires not only domain expertise and knowledge of the business case, but also a much deeper level of collaboration between all parties than was previously possible.
Federated data does require more upfront investment to achieve interoperability, and harmonized data standards should ideally be agreed as early as possible when the data collaboration is established. But since these standardization efforts are made with a shared value proposition and business cases in mind, a positive return on investment can be achieved much faster than before.
5. Data Value Driver: Usage Restrictions
“The less restricted the use of the data is, the higher its value.”
Protecting data privacy, data sovereignty and intellectual property is far from trivial. The real challenge lies in scaling privacy and security measures across different data applications. As there is no universal data-sharing mechanism, a specific privacy-enhancing technology (PET) that works for one use case cannot simply be reused to solve similar challenges elsewhere. Data collaboration scenarios call instead for an orchestration of different PETs, combined with access controls and additional security layers.
Adapting data sharing agreements or technical frameworks quickly becomes a huge overhead for all teams involved. Federated approaches simplify contractual frameworks and technical implementation, as data doesn’t have to be moved and stays under the control of its owner.
6. Data Value Driver: Interoperability & Accessibility
“Interoperability of the data is often critical for the consumer to use the data.”
Seamless interoperability with existing tech stacks and accessibility of data become increasingly relevant with the adoption and industrialization of MLOps. To drive innovation with AI, enterprises are seeking scalable patterns and practices that close the loop between data acquisition, experimentation, and model deployment. Currently, this is hardly possible to scale across company borders, as every data partner has slightly different data needs. This requires massive investment in customization and in data and machine learning engineering.
In its report “Data Sharing Masters: How smart organizations use data ecosystems to gain an unbeatable competitive edge”, Capgemini recommends addressing specific data capability gaps, such as the integration of data from external sources into internal MLOps workflows and systems, by working with specialized partners or platforms from within or outside the ecosystem.
7. Data Value Driver: Liabilities & Risks
“Potential liability and risks associated with the data will reduce its value.”
Reputational consequences and financial penalties for breaching data regulations such as GDPR can be severe. To prevent this, companies are crafting detailed data sharing agreements that prescribe any form of use of the data in advance, to exclude any possible liability. This significantly reduces the flexibility of the use of the data, and thus also its value.
Approaches to establishing trust in scenarios with federated data need to cater to the technical, legislative, and business needs of all different partners. Here again, organizations have the possibility to close these capability gaps with third-party services and platforms. Data privacy, intellectual property protection, regulatory compliance, and security must play an integral part in realizing the full value of complementary data.
Unlock complementary data value
We need to rethink how we approach data sharing and how we innovate with external partners on AI and machine learning. By adopting a value-driven and data-centric perspective, you can solve one of today’s biggest bottlenecks in data science, and scale ML innovation across the value chain to ensure your company’s future success.
Adapt to the reality of federated data
In the past, data collaboration meant combining different datasets in one centralized environment. As seen above, this traditional way of centralizing data brings heavy security and privacy challenges. Leveraging federated approaches is a critical component of powering sustainable, transparent, and collaborative data ecosystems. At Apheris, we believe that the widespread use of federated data can stimulate innovation by allowing reliable access to sensitive data while minimizing risks and unintended consequences. Virtually every data value driver can be enhanced: you add more flexibility to how you operate and innovate on data and AI, and you ultimately drive more business value.
|Data Value Driver|Centralized Approaches|Federated Approaches|
|Completeness & Consistency| | |
|Liabilities & Risks| | |