Hey, your data is on mute

Reaching universal agreement on how to regulate AI is impossible. But a carefully designed cyclical process can lead to efficiencies in an enterprise's journey towards trustworthy, accountable, and sustainable AI. In this blog post, I navigate the regulation space through the eyes of a data custodian.
José-Tomás (JT) Prieto, PhD
AI Program Lead at Apheris
Published 29 June 2023

Regulation is one of those topics that resonates with everyone. We have all, to some extent, created, followed, and disobeyed rules. And when a sensitive, hot topic comes up, people often gain general awareness quickly and become strongly opinionated about how to control the shiny, scary new thing.

Ask your neighbors how they feel about AI regulation, and you will see how different their perspectives are. "It will end humanity" or "we should care about other more pressing problems" are answers I've gotten in my recent travels for work. It's like talking about institutional systems with friends: which one is better, a liberal albeit corrupt regime or an authoritarian yet benevolent one? Depending on who you are and where you live, you will have your own views, shaped by your environment and conditioned by your priors.

In this post, I want to make a simple yet contentious point: reaching universal agreement on how to regulate AI is impossible. But a carefully designed cyclical process powered by norms, regulations, and recent tech innovations can lead to efficiencies in an enterprise's important journey towards trustworthy, accountable, and sustainable AI.

My framework

In 2011, Brousseau et al. wrote about "a constitutional theory of development," in which they explain how political systems around the world have continually evolved in bargaining cycles, each triggered by someone calling out the problems with the status quo. In their theory, the most important stakeholders are citizens and the ruling elite; it's a helpful oversimplification. In the AI world, the three most important stakeholders are those who own and produce the data, those who build AI models, and those who use the models. Under Brousseau et al.'s framework, we're currently witnessing these actors calling out the problems with the status quo.

[Figure: Conceptual diagram of the main stakeholders and their misaligned incentives]

Academics, the private sector, and the general public flagged their perceived risks of AI more vehemently after seeing applications, such as ChatGPT, that are based on foundation models. Many of these flags were not necessarily well informed and, in retrospect, looked more like PR stunts. Importantly, however, these flags consolidate the initial thinking needed to begin a revision cycle of the AI ecosystem's status quo.

Today, discussions around what the right regulation of foundation models should be span three broad areas that have been well articulated by Dr. Philipp Hacker in a series of recent papers. There is the AI model regulation domain: for deep learning models trained on large amounts of data, think of the parameter weights that are ready for customization in downstream applications. But there is also the content regulation domain: think of the inaccurate or fake information that can be created in downstream applications and how it can affect human lives through bullying, reputational damage, and more severe outcomes. And none of this makes sense without data protection regulations: think of the safeguards defined by policy makers to ensure that private and sensitive personal information is not improperly used to train AI models in the first place.

People talk about these three areas interchangeably. After all, the regulation debate is a bit all over the place and difficult to navigate. Putting everything under the same umbrella allows us to talk about everything at the same time, but the devil is in the details. Incentives for regulation are simply not aligned because stakeholders have different behavioral imperatives as shown in the diagram above. I will focus here on one of these three important stakeholders.

The AI regulation oxymoron through the eye of the data custodian

A few years ago, leveraging data without sharing it was an oxymoron, but not today.

Those who produce, own, or control access to data are critical stakeholders in the AI world. In GDPR (General Data Protection Regulation), they're known as data controllers (with a slight nuance that I am purposely ignoring). Let's call them data custodians because the term encompasses more than determining the purpose and means of the processing of personal data. You are a data custodian. Governments are data custodians too; they protect the information of their citizens. And a company has data custodians making sure their sensitive data – which is not always personal data – continues to be owned and used the way it should.

For the data custodian, AI regulation is both a need and a problem. A need because there has to be a structure that ensures governance, security, and privacy with respect to their confidential information[1]. A problem because, to comply with institutional rules, they must ensure that private, confidential information is shared only in accordance with broad regulations. In many cases, the best solution is simply to do nothing and confine the data.

In a very unfortunate paradox, data custodians, in their core role of protecting sensitive information, stop their organizations from benefitting from machine learning workflows. They become blockers: gatekeepers that effectively put the data on mute. For data custodians, the holy grail is an institutional setup (supported by regulations and technical tools) that allows them to share the potential benefits of their data without ever sharing the data itself.

The bargaining cycle for data custodians

Traditionally, data custodians, in their effort to protect their data, have been left with just a few options:

  • Giving blanket access to the raw data by approving exemptions on laws or regulations

  • Sharing an idea of what the data may be so that people can work hoping that the real data will match

  • Sharing the data after transforming it (for example, through data minimization, differential privacy, or encryption)
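To make the third option concrete, here is a minimal sketch of one such transformation: a differentially private count using the Laplace mechanism. The function names, the record layout, and the choice of epsilon are assumptions made for illustration; nothing here comes from a specific product or library, and a real deployment would need careful privacy budgeting.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records matching `predicate` (hypothetical helper)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # A counting query has sensitivity 1: adding or removing one record
    # changes the result by at most 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller epsilon means more noise and stronger protection, which is exactly the trade-off a custodian has to negotiate: the shared answer becomes safer but less useful.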

None of these options truly solves the problem. Data custodians come under organizational pressure, told that the end goal far outweighs the risks of sharing confidential data, and ultimately they give in and share the data itself, not just its potential. And just like that, data custodians have failed in their core role without being able to call out the problems. No bargaining cycles. If the data is just a bunch of drawings no one cares about, it's not a big deal. But what if this data showed that my friends are struggling with their health at a time when they are looking for a new job? What if this data is confidential protein information that a pharma company is currently developing and protecting? What if the data from governments must leave the boundaries of their geographies?

In today’s world, data custodians should not succumb to pressure. But they have a more difficult job because AI models make it complex to control and protect confidential data across the different levels of regulation:

  • In companies, data custodians are now tasked with the tough job of devising and enforcing both normative and positive behaviors for the possible uses of their data. Normative behaviors are guided by regulatory frameworks and organization policies, and positive behaviors must be carried out with the help of efficient legal protections and sophisticated technical tools.

  • Data custodians need to ensure that the data stays protected. But can they actually evaluate an entire ML pipeline ex ante to make sure no process is leaking private, confidential data and that the pipeline is compliant with company- or country-specific rules, such as ensuring that private data does not move to a public cloud hosted in a different country?

  • Data custodians need to be sure that what they share is used within the limits of their approvals. Sharing weights of models after training might seem like a good idea to protect and leverage the data. Data custodians would not be sharing the data, only the outcomes of the models. But some models are prone to inference attacks, model reconstruction, and sophisticated manipulations that can reveal the confidential data they were trying to protect in the first place. So how can data custodians be sure that what they share is not used to reconstruct the data they were trying to protect?

  • After enforcing reasonable safeguards to protect sensitive data at time t0, it is possible that at t1 new models, such as foundation models, exhibit emergent behavior that was impossible to foresee and that reconstructs the data that was initially protected at t0. Data custodians must be empowered to respond both proactively and reactively to the protean vulnerability space in AI.
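As an illustration of what an ex ante review could look like, the sketch below checks a declared pipeline against a custodian's policy before anything runs. The `Step` and `Policy` types, the region names, and the output kinds are all hypothetical. Note the limits: a check like this only validates what the pipeline declares about itself, so it cannot catch inference attacks or emergent behavior, which is why the questions above remain hard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    region: str       # where the step would execute (hypothetical label)
    output_kind: str  # what the step would release, e.g. "model_metrics"


@dataclass(frozen=True)
class Policy:
    allowed_regions: frozenset
    allowed_outputs: frozenset


def review_pipeline(steps, policy):
    """Return a list of violations; an empty list means the pipeline passes."""
    violations = []
    for step in steps:
        if step.region not in policy.allowed_regions:
            violations.append(
                f"{step.name}: region '{step.region}' is outside the approved boundary"
            )
        if step.output_kind not in policy.allowed_outputs:
            violations.append(
                f"{step.name}: output '{step.output_kind}' is not approved for release"
            )
    return violations
```

Because the policy is plain data, it can be revised when a new risk is called out at t1, and every previously approved pipeline can be reviewed again against the updated rules.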

Nor can data custodians give in just because the job is tough. It actually feels like a renaissance for them. New tools empower data custodians by letting them be more efficient collaborators in an organization's journey to trustworthy AI. Some of these tools are legal or compliance constructions. For instance, companies like OpenAI should be compelled to establish notice and action mechanisms: if they receive a complaint about hateful or dangerous content, they should act upon it. Others, however, are ex ante technical tools: products like the Apheris Compute Gateway ensure that only approved computations can be launched on sensitive data and power federation across organizational boundaries, allowing AI-powered insights with no need to centralize data.
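The general idea of federation with approved computations can be sketched in a few lines. This is a toy illustration under my own assumptions, not the Apheris API: the `CustodianSite` class, the `"sum_and_count"` computation name, and `federated_mean` are all made up for this example.

```python
class CustodianSite:
    """Holds raw values locally and runs only computations the custodian approved."""

    def __init__(self, values, approved):
        self._values = list(values)    # raw data never leaves this object
        self._approved = set(approved)

    def run(self, computation: str):
        if computation not in self._approved:
            raise PermissionError(f"computation '{computation}' is not approved")
        if computation == "sum_and_count":
            # Only an aggregate is returned, never individual records.
            return sum(self._values), len(self._values)
        raise ValueError(f"unknown computation '{computation}'")


def federated_mean(sites):
    """Combine per-site aggregates into a global mean without centralizing data."""
    partials = [site.run("sum_and_count") for site in sites]
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    if count == 0:
        raise ValueError("no records across sites")
    return total / count
```

Each site answers only the question it agreed to answer, and the coordinator sees totals rather than records. Aggregates from very small sites can still reveal individual values, so in practice an approach like this would be combined with safeguards such as minimum cohort sizes or added noise.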


I often think of Brousseau et al. 2011 because many things in the world evolve in bargaining cycles that are triggered by spontaneous calls for change. In the current AI context, however, spontaneous calls for change cannot be expected to happen; some topics are too esoteric. For that reason, data custodians should adopt frameworks that allow them to have a significant effect on bargaining cycles.

The EU AI Act – at the time I'm writing this – is evolving. A few weeks ago, new provisions were added to account for the new perceived risks emerging with foundation models. The European Parliament's position on the Artificial Intelligence Act (AI Act) was adopted a few days ago. Experts are now calling for additional changes because the regulatory framework in the Act may be detrimental to European innovation and does not consider sustainability. This is great. This constant evolution of provisions is what we need. But it should not end with the enactment of a law.

Historically, data custodians have had very few options to do their jobs in the AI world. That's not the case today. Sensitive and confidential data should be protected, "on mute" in a way. As we saw here, that does not mean that your data's potential must be muted: you can still enable collaboration on it with current tools.

I joined Apheris for three fundamental reasons. One of them is the elegance of the solution that the company has created to solve part of the regulation paradox for data custodians. At Apheris, we all see the importance of having secure, private, and governed access to data for ML and analytics through a powerful product that allows organizations to grow in their impending AI journeys.

[1] The funny thing is that, depending on where you are reading this (there are notable differences between Europe and the US), you will care more or less about personal data because of Cross-Cultural Privacy Differences. But you will align with your data custodian peers if you look at this through the eyes of intellectual property (IP) or your business, or even through the eyes of a parent, because that's when you feel the need to mitigate all risks.
