At Apheris, we predict six trends that will have a major influence on how frontrunners will build the next breakthrough data products of this decade. Read on for our assessment and what conclusions you can draw.
Trend #1: It will be easier to build and operationalize data products, but harder to maintain a competitive edge
AI & data technologies are constantly evolving. While core data processing systems have remained relatively stable over the past years, the number of supporting tools and applications has increased rapidly. In the recently updated article ‘Emerging Architectures for Modern Data Infrastructure’, analysts from Andreessen Horowitz describe three reference architectures with the current best-in-class stack for modern business intelligence, multimodal data processing, and AI and machine learning.
As companies rely on the same kind of blueprints for their data analytics architecture, technical differentiation of data products, ranging from analytic to operational systems, becomes harder to achieve. New model architectures are being developed mostly in open and academic settings, pre-trained models are available from open-source libraries, and model parameters can be optimized automatically.
The bottom line: Your competitors are starting to build products on similar infrastructures and setups. To defend your competitive edge, you must find new ways to differentiate, for example by gaining exclusive access to data that was inaccessible before. Data may have been out of reach for different reasons: using it was too risky from a legal perspective, too complex from a technical standpoint, or simply too expensive to make a valid business case.
Trend #2: Broader adoption of AI requires the shift from model-centric to data-centric AI
In March 2021, Andrew Ng coined the term “data-centric AI”, which shifts the focus of MLOps frameworks to the systematic engineering of the data used to build an AI system. This responds to the shortcomings encountered so far when trying to implement AI in industrial scenarios.
According to Ng, “the dominant paradigm over the last decade was to download the data set while you focus on improving the code. Thanks to that paradigm, […] deep learning networks have improved significantly, to the point where for a lot of applications the code – the neural network architecture – is basically a solved problem. For many practical applications, it’s now more productive to hold the neural network architecture fixed, and instead find ways to improve the data.”
His view is further supported by Google researchers in the paper “‘Everyone wants to do the model work, not the data work’: Data Cascades in High-Stakes AI”, published in May 2021. The paper illustrates how much weight data quality carries in AI: data issues compound downstream and heavily influence high-stakes predictions such as cancer detection or loan allocation.
Ng further states: “In the last decade, the biggest shift in AI was […] to deep learning. I think it’s quite possible that in this decade the biggest shift will be to data-centric AI. With the maturity of today’s neural network architectures, I think for a lot of the practical applications the bottleneck will be whether we can efficiently get the data we need to develop systems that work well.”
Trend #3: Data quality rises across industries, which sets the foundation for cross-company data sharing
The shift to data-centric AI not only boosts internal data quality in the form of advanced data labeling or data augmentation techniques, it is also driving rapid progress at the industry level. Across verticals, new data standards, data interoperability and common data models such as CDISC, OMOP, OPC-UA, IDS or FAIR data are high-priority topics and set the foundation for cross-company data sharing. Interoperability in turn increases the value of data, which changes how we need to think about generating ROI from earlier data standardization investments.
Trend #4: Distributed data mesh frees data from internal silos and creates decentralized governance models
The most common way to fight data silos has been to centralize the data in data lakes. However, data platforms based on the data lake architecture can fail to deliver on their promises at scale. To address this issue, a very promising paradigm shift in how we manage analytical data has been developed and described in detail in the articles “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh” and “Data Mesh Principles and Logical Architecture” by Zhamak Dehghani, Director of Emerging Technologies at Thoughtworks.
In the articles, Dehghani describes how the paradigm of Data Meshes is based on four principles:
Domain-oriented decentralization of data ownership and architecture
Domain-oriented data served as a product
Self-serve data infrastructure as a platform to enable autonomous domain-oriented data teams
Federated governance to enable ecosystems and interoperability
Each of these principles drives a new logical view of the technical architecture and organizational structure, and ownership of the data and of entire data products shifts toward the lines of business.
We see this trend as an indicator that the data sharing partners of the future will not only be organizations as a whole, but also individual lines of business within an enterprise. This will massively increase the number of data sharing opportunities available, improve the quality of partnerships, and speed up how quickly they can be formed.
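To make the four principles more tangible, here is a minimal Python sketch of how they might surface in code. It is an illustration under our own assumptions, not a design taken from Dehghani's articles: the names `DataProduct`, `Catalog`, `GLOBAL_POLICIES` and `governance_violations`, as well as the three example policies, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A dataset published by a domain team (hypothetical structure)."""
    name: str
    owner_domain: str                 # principle 1: the domain owns its data
    schema: dict                      # principle 2: a documented interface, as for any product
    tags: set = field(default_factory=set)


# Principle 4: a small set of rules agreed across domains and applied uniformly.
# These three rules are illustrative assumptions only.
GLOBAL_POLICIES = [
    ("has an owning domain", lambda p: bool(p.owner_domain)),
    ("publishes a schema", lambda p: len(p.schema) > 0),
    ("flags PII as restricted", lambda p: "pii" not in p.tags or "restricted" in p.tags),
]


def governance_violations(product: DataProduct) -> list:
    """Return the names of all global policies the product fails."""
    return [name for name, check in GLOBAL_POLICIES if not check(product)]


class Catalog:
    """Principle 3: a self-serve entry point that any domain team can publish to."""

    def __init__(self):
        self._products = {}

    def publish(self, product: DataProduct) -> None:
        violations = governance_violations(product)
        if violations:
            raise ValueError(f"{product.name} rejected: {', '.join(violations)}")
        self._products[product.name] = product

    def owners(self) -> dict:
        """Map each published product to the domain that owns it."""
        return {name: p.owner_domain for name, p in self._products.items()}
```

In this sketch, each line of business publishes its own products, while the shared policies are the only thing enforced centrally.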
Trend #5: Privacy-enhancing Technologies achieve market readiness, but they must be applied thoughtfully
There are reasons why the established manual data sharing processes, cumbersome and expensive as they are, have persisted: until now, there weren’t enough potential data sharing opportunities to make automation necessary. If an entire vertical only includes one or two well-known data providers that own production-grade data, it is relatively straightforward to address privacy, trust and regulatory requirements. Developing governance around data sharing usually includes setting up data sharing agreements, defining API specifications, and building pseudonymization or anonymization pipelines for when sensitive data must be shared.
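As a rough illustration of what one step of such a pseudonymization pipeline can look like, the sketch below replaces a direct identifier with a keyed hash (HMAC-SHA256) using only the Python standard library. The function names, the record layout and the choice of a keyed hash are our assumptions for this example. Real pipelines also need key management, and pseudonymized data is generally still treated as personal data.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256.

    The same identifier and key always yield the same pseudonym, so records
    can still be joined, while the mapping cannot be recomputed without the key.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_records(records, id_field: str, secret_key: bytes):
    """Return copies of the records with `id_field` replaced by its pseudonym."""
    return [
        {**record, id_field: pseudonymize(record[id_field], secret_key)}
        for record in records
    ]
```

Multiply this by every data source, partner and agreement, and the maintenance burden of hand-built pipelines becomes apparent.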
But something becomes clear very quickly: One company alone can’t build, integrate, maintain, and govern hundreds, or potentially thousands, of secure data pipelines to fuel a broad portfolio of data products.
The traditional way of secure data sharing does not scale beyond a handful of use cases, as it becomes unreasonably expensive and complex. The complexity compounds and increases friction, costs, and risks until it becomes impossible to efficiently and securely establish or future-proof data collaborations at a larger scale.
Privacy-enhancing Technologies, such as Differential Privacy, Synthetic Data or Federated Learning, aim to automate and simplify these previously complex approaches, and in recent years there has been a steadily growing number of open-source frameworks and proprietary tools.
Yet, bringing such technologies into production is still very complex, as the trend toward bespoke mechanisms and privacy-preserving data pipelines prevails. These mechanisms again only increase the complexity instead of reducing it.
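To give a feel for one of these technologies, the sketch below shows the textbook Laplace mechanism from Differential Privacy applied to a simple count query, using only the Python standard library. The function names and record layout are hypothetical, and this is a teaching example rather than a production mechanism: naive floating-point noise sampling like this has known weaknesses, which is one reason vetted libraries exist.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count that satisfies epsilon-differential privacy."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity of this query is 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

A smaller `epsilon` adds more noise and gives stronger privacy, while a larger one gives more accurate answers. Choosing that trade-off is a governance decision as much as a technical one.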
In our whitepaper “Beyond MLOps – How Secure Data Collaboration Unlocks the Next Frontier of AI Innovation”, we go into more detail about this topic and explain how Privacy-enhancing Technologies can be safely and universally managed with enterprise controls across companies.
Trend #6: Involvement in more complex, collaborative and secure data ecosystems will be a key differentiator
Thought leaders in management and technology consultancies, such as McKinsey, BCG, Gartner and Deloitte, agree that data ecosystems are the means to further extend the competitive edge. While many organizations are already engaging in simple data sharing ecosystems today, the real value lies in more complex and highly collaborative data ecosystems. In their extensive report “Data Sharing Masters”, Capgemini describes winning data ecosystems as those that proactively address privacy, ethics, trust and regulatory requirements and allow for the sharing of heterogeneous data between multiple parties to ultimately enable the development of highly differentiated data products.
Building breakthrough data products in this decade requires the enterprise capability to securely collaborate on data and AI at scale
Challenging the status quo has never been easy, but that is what separates leaders from the rest. According to Capgemini, a central step in kicking off collaborative data ecosystems is to run small-scale pilots that align on required capabilities and processes, and then to define new internal ways of working. In the following phases, this advantage can be sustained by scaling up use cases to their full potential and moving up the data and insights value chain. If you want to learn more about how to implement such a solution in your organization and start building next-gen data products today, we have compiled tons of valuable insights in our Buyer’s Guide to Secure Data Collaboration.