Achieving Data Privacy in ML/AI

In this article, we explore Privacy Enhancing Technologies (PETs) and highlight their need in the context of ML/AI projects that use sensitive or regulated data. We start by demystifying PETs and providing concrete examples to help CTOs and product owners make informed decisions about how to improve data protection.
Inken Hagestedt
Privacy Expert
Published 29 March 2024

Abstract

In the rapidly evolving landscape of Machine Learning (ML) and Artificial Intelligence (AI), the safeguarding of sensitive data emerges as a paramount concern for CTOs and product owners in digitally mature organizations. This article delves into the realm of Privacy Enhancing Technologies (PETs), illuminating their necessity in the context of ML/AI projects that utilize sensitive or regulated data. We begin by demystifying PETs, providing tangible examples and explaining why they are indispensable for maintaining data privacy and security in AI-driven solutions. However, reliance on a singular PET solution often reveals limitations in scope and scalability, incurring significant costs in time and resources.

Our focus shifts to an innovative paradigm: the integration of flexible PET approaches with computational governance. We spotlight solutions like Apheris, which exemplify this method, offering a tested, scalable framework that seamlessly integrates into existing technology stacks with minimal code alterations. This approach not only enhances data security but also ensures compliance with evolving regulations, thus providing a comprehensive solution for enterprises dealing with sensitive data in their ML/AI endeavors.

This article aims to equip CTOs and product owners with a nuanced understanding of PETs in the ML/AI sphere, guiding them towards making informed decisions that align with their organization’s digital maturity and strategic objectives. The insights presented herein underscore the importance of adopting a multifaceted approach to data privacy, ultimately enabling enterprises to leverage the full potential of AI and ML while upholding the highest standards of data confidentiality.

Introduction

In the rapidly advancing world of Machine Learning (ML) and Artificial Intelligence (AI), the safeguarding of sensitive data stands as a cornerstone of ethical and effective technology deployment. As organizations increasingly lean into AI-driven solutions to harness the power of data, the need for robust data privacy measures has never been more critical. This urgency is particularly pronounced for CTOs and product owners, who face the dual challenge of leveraging data for innovation while ensuring its confidentiality and integrity.

Privacy Enhancing Technologies (PETs) have emerged as vital tools in this endeavor, offering ways to protect data privacy. However, the landscape of PETs is complex and evolving: a single solution often falls short of the multifaceted privacy needs of large-scale, sophisticated projects, or meets them only at a cost to data utility. This article provides an overview of PETs, explores the limitations of relying on a single PET solution, and advocates for a more nuanced approach. We examine why integrated solutions like Apheris, which combine flexible PET strategies with computational governance, represent the future of data privacy in ML/AI, offering scalability, efficiency, and seamless integration into existing technology stacks.

What are PETs?

PETs are methods to protect the privacy and security of an individual and their data in collaborative (machine learning) computations while maintaining functionality and utility. Their primary purpose is to enable data-driven technologies and analytics while safeguarding the privacy and security of the data subjects. PETs are particularly crucial in the field of ML and AI, where they help balance the need for large datasets with privacy concerns.

Example PETs, their pros and cons

| PET | Definition | Pros | Cons |
| --- | --- | --- | --- |
| Data anonymization and pseudonymization | Alters personal data to prevent identification. | Effectively protects identity. | Risks re-identification with additional data sources obtained e.g. through advanced data mining. |
| Differential privacy | Offers a mathematical privacy guarantee that individual data points' contribution cannot be extracted. | Safe for aggregate analysis. | Hard to select algorithm and parameters to maintain data analysis accuracy. |
| Homomorphic encryption | Allows computation on encrypted data. | Ensures privacy in untrusted environments. | Involves significant computational overhead, and not all computations are supported. |
| Secure multi-party computation (SMPC) | Enables joint computations while keeping inputs private. | Enhances privacy in collaborative settings. | Complex and computationally expensive. |
| Federated learning | Trains models across decentralized datasets without sharing raw data. | Reduces privacy risks by keeping data local. | Challenging to manage and still susceptible to privacy attacks. |
| Data masking | Obscures specific data within a database. | Allows functional use while protecting information. | Risk of reverse engineering if not properly implemented. |
| Synthetic data | Artificially generated data mimicking real data's statistical properties. | Mitigates privacy risks; useful for training models. | May not accurately capture real data's complexities and biases. |
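
To make the first and sixth rows concrete, here is a minimal Python sketch of pseudonymization and masking. The record layout, salt handling, and truncated hash length are illustrative assumptions, not a production design:

```python
# A minimal sketch of pseudonymization and data masking; real deployments
# need proper salt/key management, rotation, and a re-identification
# risk assessment.
import hmac
import hashlib

SECRET_SALT = b"rotate-me-and-store-me-in-a-vault"  # illustrative only

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (a pseudonym)."""
    return hmac.new(SECRET_SALT, identifier.encode(), hashlib.sha256).hexdigest()[:16]

def mask_email(email: str) -> str:
    """Obscure the local part of an email address."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

record = {"name": "Ada Lovelace", "email": "ada@example.org", "diagnosis": "..."}
safe_record = {
    "patient_id": pseudonymize(record["name"]),
    "email": mask_email(record["email"]),
    "diagnosis": record["diagnosis"],  # still sensitive; masking alone is not enough
}
print(safe_record)
```

Note the comment on the last field: as the table's "Cons" column warns, identifiers can be hidden while the remaining attributes still enable re-identification when joined with other data sources.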

In an era where ML and AI systems often require vast amounts of data to improve accuracy and functionality, the challenge lies in harnessing this data while respecting individual privacy and adhering to stringent data protection regulations like the GDPR, HIPAA, or CCPA. PETs help address this challenge by providing mechanisms to use and analyze data without exposing sensitive information, thereby ensuring that ML and AI applications can operate within the legal frameworks designed to protect personal privacy.

However, while PETs offer significant benefits in safeguarding data in ML applications, they are not a panacea.

“PETs are at different stages of development and will likely need to be part of data governance frameworks to ensure they are used properly in line with the associated privacy risks. Many of these tools are still in their infancy and limited to specific data processing use cases.” - Emerging privacy enhancing technologies: Maturity, opportunities and challenges, OECD

The shortcomings of a singular PET solution

The idea that a single PET solution can comprehensively address all privacy and compliance concerns is overly optimistic. Each of these technologies has unique strengths and limitations and works best in specific contexts and scenarios. For instance, techniques like differential privacy (DP) might protect individual identities in the final result but can compromise data accuracy, impacting the effectiveness of ML models. Similarly, homomorphic encryption offers robust security during computation at the cost of computational efficiency, but provides no protection for the final (decrypted) result.
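
A small sketch makes this privacy-utility tension tangible. Below, a mean query is released under the Laplace mechanism at several illustrative values of the privacy parameter epsilon; tighter privacy (smaller epsilon) visibly degrades accuracy. The toy salary data and clipping bounds are assumptions for the example:

```python
# A minimal sketch of the differential-privacy accuracy trade-off,
# using the Laplace mechanism on a mean query.
import numpy as np

rng = np.random.default_rng(0)
salaries = rng.normal(50_000, 10_000, size=1_000)  # synthetic toy data

def dp_mean(data, epsilon, lower=0.0, upper=100_000.0):
    """Release a mean with epsilon-DP via the Laplace mechanism."""
    clipped = np.clip(data, lower, upper)
    sensitivity = (upper - lower) / len(clipped)  # max influence of one record
    noise = rng.laplace(scale=sensitivity / epsilon)
    return clipped.mean() + noise

true_mean = salaries.mean()
for eps in (0.01, 0.1, 1.0):
    noisy = dp_mean(salaries, eps)
    print(f"epsilon={eps}: mean={noisy:,.0f} (error {abs(noisy - true_mean):,.0f})")
```

The same trade-off that is manageable for a single aggregate query becomes much harder to tune across the many queries an ML training run effectively makes, which is the "hard to select algorithm and parameters" caveat from the table above.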

Additionally, the rapidly evolving landscape of data privacy regulations and the increasing sophistication of cyber threats mean that relying solely on one type of PET can leave gaps in privacy and security measures. Therefore, it's crucial to understand that while PETs are integral tools for privacy protection in ML and AI, they must be part of a broader, multi-layered strategy tailored to the specific needs and risks of each project.

The impact of these limitations on enterprise ML and AI

The limitations of PETs can have significant impacts on ML and AI projects, especially in regulated industries. These limitations can affect various aspects of project development and deployment, including scalability, code portability, computational efficiency, and accuracy.

  • Scalability: As enterprises scale their ML/AI projects, the demand for data processing increases exponentially. Some PETs, like homomorphic encryption or secure multi-party computation, can introduce substantial computational overhead, making them less viable for large-scale applications. This scalability challenge can hinder the ability to process large datasets efficiently, a necessity for robust ML/AI models.

  • Code portability and rewrites: Implementing PETs often requires specialized knowledge and can lead to significant changes in existing codebases. For example, a custom pre-processing pipeline for anonymization and data masking might necessitate substantial rewrites, because such code is customized to the specific data it processes. The portability of these solutions across platforms and technologies can also be limited, requiring additional resources for adaptation and testing.

  • Computational efficiency: Some PETs can dramatically reduce computational efficiency. The increased computational load leads to longer processing times and higher operational costs; the micro-benchmark after this list illustrates the gap for homomorphic encryption. In time-sensitive applications this can be a critical drawback, potentially making certain PETs impractical for real-time processing needs.

  • Accuracy in regulated industries: In sectors like healthcare, finance, and legal, where accuracy is paramount, the use of PETs can sometimes compromise the quality of insights derived from data. Techniques like data masking or synthetic data generation might not capture the intricate patterns and nuances present in the original data, leading to less accurate ML models. This trade-off between privacy and accuracy can be particularly challenging in regulated industries, where decisions based on AI models can have significant consequences.
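
To substantiate the computational-efficiency point, here is a rough micro-benchmark sketch, assuming the open-source python-paillier package (`pip install phe`). Absolute timings depend on hardware, but the plaintext/ciphertext gap is typically several orders of magnitude:

```python
# A rough micro-benchmark of homomorphic-encryption overhead using
# Paillier encryption, which supports addition on ciphertexts.
import time
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)
values = list(range(1, 501))

start = time.perf_counter()
plain_sum = sum(values)
plain_time = time.perf_counter() - start

start = time.perf_counter()
encrypted = [public_key.encrypt(v) for v in values]  # encryption dominates
enc_sum = encrypted[0]
for c in encrypted[1:]:
    enc_sum = enc_sum + c  # addition performed on ciphertexts
hom_time = time.perf_counter() - start

assert private_key.decrypt(enc_sum) == plain_sum
print(f"plaintext sum: {plain_time:.6f}s, homomorphic sum: {hom_time:.3f}s")
```

And this is for simple addition; Paillier supports only addition and scalar multiplication on ciphertexts, so the richer arithmetic that ML training requires needs fully homomorphic schemes with even greater overhead.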

“PETs should not be regarded as a silver bullet to meet all of your data protection requirements…” - ICO

For example, while DP is the go-to method for mitigating attacks that extract whether a given data point was part of the training data set (membership inference attacks), it does not protect against reconstructing data that is merely similar to the training data. This follows from the definition of DP, which bounds the influence of any single record but says nothing about distribution-level information, and it has been confirmed empirically by Zhang et al. Moreover, when a model must not only preserve privacy but also produce predictions that meet fairness goals, DP can reduce fairness by amplifying bias towards the more popular training data points, as shown by Bagdasaryan et al. In sum, a PET that mitigates one privacy threat can fail to mitigate another, or even amplify it, so a carefully chosen combination of PETs should be used.
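
For reference, the formal guarantee reads as follows, where M is the training mechanism, D and D' are datasets differing in a single record, and S is any set of outputs:

```latex
% (epsilon, delta)-differential privacy: for all neighboring datasets
% D, D' (differing in one record) and all output sets S,
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,] + \delta
```

The bound constrains each individual record's influence on the output, which is exactly what membership inference exploits, but it places no restriction on what the model reveals about the overall data distribution, which is why reconstruction of similar-looking data remains possible.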

The case for a multi-layered, integrated PET approach

Incorporating a multi-layered, integrated approach to PETs in ML and AI is vital for robust data protection. This strategy addresses a wider range of privacy and security concerns than singular PET solutions, which on their own are rarely sufficient for privacy-preserving computation. By combining various PETs, organizations can ensure comprehensive protection, with each technology addressing a specific aspect of data privacy, from securing individual identities to safeguarding data in operation.
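
As a minimal illustration of such layering, the sketch below combines federated learning's "keep data local" principle with a differential-privacy-style step that clips and noises a model update before it leaves the data owner. The clip norm and noise scale are illustrative and not calibrated to a formal privacy budget:

```python
# A minimal sketch of layering two PETs: bounding one participant's
# influence (clipping), then adding noise, before a federated update
# is shared with the aggregator.
import numpy as np

rng = np.random.default_rng(42)

def privatize_update(update: np.ndarray, clip_norm: float = 1.0,
                     noise_std: float = 0.1) -> np.ndarray:
    """Clip an update to a norm bound, then add Gaussian noise."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / norm)
    return clipped + rng.normal(0.0, noise_std, size=update.shape)

local_update = rng.normal(size=10)   # stand-in for a gradient or weight delta
shared_update = privatize_update(local_update)  # only this leaves the site
```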

A key advantage of this approach is the ability to balance data privacy with utility. Different PETs can be strategically employed based on the data's sensitivity and the project's scale, optimizing the balance between maintaining data utility and ensuring privacy. This flexibility is also crucial in adapting to diverse regulatory requirements, enabling compliance with various standards like GDPR and HIPAA.

Moreover, a multi-layered approach provides a dynamic defense against evolving digital threats while preserving scalability and computational efficiency. It allows more efficient PETs to be used for large-scale processes while reserving resource-intensive PETs for highly sensitive operations, optimizing both performance and security.

5 benefits of a multi-layered approach to PETs

  • Enhanced data privacy and security: Integrating multiple PETs offers robust protection against data breaches, crucial for handling sensitive data and avoiding legal or reputational damages.

  • Regulatory compliance: A multi-layered PET approach facilitates adherence to data protection regulations like GDPR and HIPAA, reducing non-compliance risks.

  • Operational efficiency: A combination of PETs allows enterprises to optimize resource use, balancing computational efficiency with necessary data protection for scalable ML solutions.

  • Balanced data utility and privacy: Multiple PETs enable better maintenance of data quality while ensuring privacy, crucial for effective ML applications.

  • Innovation and competitive advantage: Enterprises using a sophisticated PET approach can safely leverage more data for innovation, giving them a competitive edge.

Federated learning meets computational governance

Apheris adopts a practical approach to data privacy in ML and AI, utilizing multiple PETs. This approach allows data custodians to set controls that align with specific compliance requirements. Central to their strategy is the use of federated learning, enabling collaborative ML model development across decentralized data sources without direct data sharing. This method reduces data transfer risks and upholds privacy.
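
The aggregation step at the heart of federated learning can be sketched in a few lines. This is the generic federated averaging (FedAvg) idea, shown for intuition only, not Apheris' actual implementation:

```python
# A minimal illustration of federated averaging: model parameters,
# not raw records, leave each site, and the server combines them
# weighted by local dataset size.
import numpy as np

def fed_avg(client_weights: list[np.ndarray], client_sizes: list[int]) -> np.ndarray:
    """Weighted average of client models, proportional to local data size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three hospitals train locally and share only their parameters.
site_models = [np.array([0.9, 1.1]), np.array([1.0, 0.8]), np.array([1.2, 1.0])]
site_sizes = [1_000, 500, 2_000]
global_model = fed_avg(site_models, site_sizes)
print(global_model)
```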

Complementing this, Apheris integrates a computational governance framework which allows data custodians to define which PETs must be used, effectively enforcing compliant data processing. To simplify the usage of these PETs, Apheris offers ready-to-use sample implementations. Additionally, the system maintains detailed logs of all activity, essential for audit trails and regulatory compliance. The combination of clear permissions for data use, PET implementation, and logs offers a streamlined solution for enterprises to securely exploit AI and ML potential while adhering to data privacy and compliance standards.
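
To give a feel for what computational governance means in practice, here is a hypothetical policy sketch. The schema, field names, and enforcement helper are invented for illustration and do not represent Apheris' actual configuration format:

```python
# A hypothetical computational-governance policy: the data custodian
# declares which computations are allowed and which PETs each one
# requires, and the platform enforces it before anything runs.
policy = {
    "dataset": "hospital_a/oncology_records",
    "allowed_computations": ["federated_training", "aggregate_statistics"],
    "required_pets": {
        "federated_training": ["secure_aggregation", "differential_privacy"],
        "aggregate_statistics": ["differential_privacy"],
    },
    "audit": {"log_all_requests": True, "retention_days": 365},
}

def is_permitted(computation: str, pets_applied: list[str]) -> bool:
    """Check a requested computation against the custodian's policy."""
    if computation not in policy["allowed_computations"]:
        return False
    return set(policy["required_pets"][computation]) <= set(pets_applied)

print(is_permitted("federated_training",
                   ["secure_aggregation", "differential_privacy"]))  # True
print(is_permitted("federated_training", ["secure_aggregation"]))    # False
```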

Apheris: a closer look

The Apheris approach, incorporating federated learning along with multiple PETs, offers a robust solution to ensuring privacy in ML and AI projects. This combination addresses data privacy and security effectively, allowing the use of diverse data sources for model training without the need for data centralization. This ensures adherence to various data privacy laws and minimizes breach risks.

Figure: An integrated governance, privacy, and security layer for compliant ML. Reliable privacy controls help data custodians set the appropriate level of privacy depending on specific data characteristics.

In addition, Apheris provides guidance on which PETs to use for which data and models via our Trust Center and expert advice. Our experts implement PETs directly or draw on best-in-class libraries, so there is no need to integrate with poorly maintained ones.

A critical advantage of Apheris' method is its minimal impact on existing codebases, significantly aiding in scalability. Product builders can integrate Apheris' solutions with minimal code porting and rewrites, efficiently scaling their applications for larger and more complex data sets.

The use of multiple PETs, in addition to federated learning, allows for a more nuanced approach to data privacy. It enables data custodians to select the most appropriate technology based on specific data characteristics and compliance requirements. This multi-PET strategy enhances the robustness and accuracy of AI models while ensuring global regulatory compliance.

Apheris' system also provides clear audit trails, essential for regulatory compliance and transparency. The combination of federated learning, multiple PETs, and computational governance results in an agile, scalable, and cost-effective approach for developing AI-driven products. It empowers product builders to expand their solutions securely and responsibly, addressing the contemporary challenges of data privacy in AI.

Conclusion

In conclusion, the exploration of PETs in ML reveals that a singular PET approach, while beneficial, is insufficient for the complex demands of modern data privacy. The multi-layered strategy, exemplified by Apheris’ integration of federated learning with computational governance, offers a more comprehensive solution. This approach effectively addresses scalability, compliance, and data utility challenges in ML/AI projects.

Looking forward, the evolving landscape of AI necessitates continuous innovation in data privacy strategies. The commitment to robust, adaptable PET strategies will be crucial for ethical and sustainable AI development. Choosing the right PET strategy is not merely a technical consideration; it's fundamental to building trust and ensuring responsible AI growth in the future. Enterprises embracing a multi-layered PET approach will be better equipped to harness AI's potential while upholding data privacy and security standards.
