
Improving ADME Model Performance by Collaborating on Proprietary Data

The challenge: Missing training data and a lack of diversity limit the performance and applicability domain of ADME prediction models.

Predicting ADME (Absorption, Distribution, Metabolism, Excretion) endpoints is essential for drug optimization, yet the limited availability of chemically diverse, high-quality datasets constrains the development and application of QSAR (Quantitative Structure-Activity Relationship) models. Recent advancements in computational chemistry, such as free energy binding calculations and generative chemistry, have enhanced structure-based drug discovery. However, accurately predicting ADME endpoints remains difficult due to the complexity of these processes, low-throughput experimental assays, and insufficient chemically relevant data, especially in early-stage research.

Few companies have the extensive proprietary datasets needed to build robust QSAR models that can be adapted to a specific chemical space. Most companies lack comparable datasets, limiting their ability to develop predictive models of sufficient quality. This gap presents a significant obstacle to the adoption of advanced predictive workflows and underscores the need for collaborative solutions to enhance modelling capabilities.

The potential solution: Biopharmaceutical consortium for collaborative training of ADME models while protecting data confidentiality

There is an alternative to training models only on public data and your own data: secure federated learning (FL) on data from multiple parties. Pharmaceutical companies can form a consortium to enable training on their proprietary chemical data without the need for data sharing. Apheris has enabled exactly such use cases for the AI for Structural Biology (AISB) Consortium, with members including AbbVie, Boehringer Ingelheim, Johnson & Johnson, and Sanofi.

Apheris currently orchestrates a consortium of pharmaceutical companies working toward a state-of-the-art ADME model trained on participants' ADME data, while keeping the confidentiality and IP of the datasets fully protected.
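Conceptually, federated training works in rounds: each participant trains on its own data locally, and only model updates leave the premises, to be averaged into a shared global model. The sketch below illustrates the averaging step in its simplest form (federated averaging); all names and values are hypothetical and not part of the Apheris product API.

```python
# Illustrative federated averaging (FedAvg) round: each partner trains
# locally and shares only model weights; a coordinator averages them.
# All names and values here are toy examples, not the Apheris API.

def federated_average(weight_sets):
    """Average corresponding weights across several participants."""
    n = len(weight_sets)
    return [sum(ws) / n for ws in zip(*weight_sets)]

# Three partners' (toy) model weights after one local training round:
partner_updates = [
    [0.2, 1.0, -0.5],
    [0.4, 0.8, -0.3],
    [0.6, 1.2, -0.1],
]
global_weights = federated_average(partner_updates)
# global_weights is approximately [0.4, 1.0, -0.3]
```

In a real deployment, the local training, update exchange, and aggregation run inside governed infrastructure, so the raw chemical data never leaves each custodian's environment.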

Consortium Objectives

The consortium seeks to:

  • Achieve state-of-the-art ADME prediction models through collaborative federated learning.

  • Expand the applicability domain and accuracy of these models beyond current state-of-the-art models.

  • Enable participants to further fine-tune global models to their proprietary chemical spaces using transfer learning.
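The transfer-learning objective can be pictured as follows: a participant starts from the shared global model's weights and takes a few gradient steps on its own proprietary data to adapt the model to its chemical space. The toy example below shows one such step for a 1-D linear model; the model, data, and function names are hypothetical illustrations, not the consortium's actual workflow.

```python
# Toy transfer-learning step: start from shared "global" model weights
# and nudge them toward a partner's local data with one gradient step.
# Hypothetical illustration only, not the consortium's actual pipeline.

def finetune_step(w, b, xs, ys, lr=0.1):
    """One gradient step of a 1-D linear model y = w*x + b on local data,
    minimizing mean squared error."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return w - lr * grad_w, b - lr * grad_b

global_w, global_b = 1.0, 0.0               # weights from the federated model
local_x, local_y = [1.0, 2.0], [3.0, 5.0]   # partner's proprietary points
w1, b1 = finetune_step(global_w, global_b, local_x, local_y)
# After one step the weights move toward the local data (w1 = 1.8, b1 = 0.5)
```

The same idea scales to deep ADME models: freeze or lightly update most layers and adapt the rest on in-house assay data.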


Focus Parameters

The consortium participants will identify a core set of ADME parameters that best suits their interests. Additional endpoints will be included as the project evolves, ensuring that the collaborative effort addresses the diverse needs of its members while maximizing the scientific and practical impact of the models. An initial proposal is as follows:

  • LogD: Lipophilicity, a critical determinant of solubility, permeability, and overall pharmacokinetics.

  • Permeability: Assessed through well-established assays (e.g., MDCK, Caco-2, PAMPA) to evaluate membrane transport.

  • Liver Microsomal and Hepatocyte Clearance: Reflects metabolic stability across species (human, rat, mouse).

  • Plasma Protein Binding (PPB): Affects free drug concentration and pharmacological activity.

  • Cytochrome P450 (CYP450) Enzyme Interactions: Focus on key isoforms such as CYP3A4 for metabolic pathway elucidation.

  • Blood-Brain Barrier (BBB) Penetration: Essential for central nervous system (CNS) drug candidates.

  • Oral Bioavailability: Indicates the fraction of the drug absorbed systemically after oral administration.

  • Aqueous Solubility: Impacts formulation strategies and bioavailability.

Data Requirements

Data requirements will be clearly defined for consortium participants to ensure fair contributions. To maximize model utility and ensure robust predictive performance:

  • Volume: Datasets should include at least 10,000 data points for common ADME endpoints (e.g., logD, permeability, clearance), with a preference for even larger datasets to improve model accuracy and domain coverage.

  • Format: Data format and preparation could follow the MELLODDY Tuner framework, which simplifies data integration and ensures uniformity. This includes using standardized molecular representations (e.g., SMILES strings) along with assay results, modifiers (e.g., > or <), and numerical values. The MELLODDY Tuner's open-source codebase could be leveraged to harmonize datasets efficiently and enable seamless ingestion into the FL pipeline.
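To make the expected record shape concrete, here is a hedged sketch of what a harmonized endpoint record and a basic ingestion check might look like. The field names are illustrative only; the actual MELLODDY Tuner schema should be consulted for the authoritative format.

```python
# Illustrative shape of a harmonized ADME record: a standardized SMILES
# string, an endpoint label, a censoring modifier ('>', '<', or '='),
# and the measured value. Field names are hypothetical, not the exact
# MELLODDY Tuner schema.

VALID_MODIFIERS = {">", "<", "="}

def validate_record(rec):
    """Basic sanity checks before ingestion into a federated pipeline."""
    required = {"smiles", "endpoint", "modifier", "value"}
    if not required <= rec.keys():
        return False
    if rec["modifier"] not in VALID_MODIFIERS:
        return False
    return isinstance(rec["value"], (int, float))

records = [
    {"smiles": "CCO", "endpoint": "logD", "modifier": "=", "value": -0.31},
    {"smiles": "c1ccccc1", "endpoint": "Caco-2", "modifier": ">", "value": 10.0},
    {"smiles": "CCN", "endpoint": "logD", "modifier": "~", "value": 0.1},  # bad modifier
]
clean = [r for r in records if validate_record(r)]
# clean keeps the first two records and drops the malformed one
```

In practice, SMILES standardization (tautomers, salts, stereochemistry) would be handled by the harmonization tooling before records reach this stage.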

Federation and IP Considerations

Federated learning enables collaborative model training without the need for direct data sharing. Apheris' federated computing infrastructure follows a rigorous GovPrivSec posture, ensuring governance, privacy, and security for all data and models involved. Our product builds on NVIDIA's NVFlare framework (while also supporting the Flower framework as an alternative). Data custodians (e.g., pharmaceutical companies) always remain in full control and have auditable records of the computations run on their sensitive data.

Key design features of the federated computing infrastructure by Apheris include:

  • Privacy-Enhancing Technologies (PETs): Built-in options such as homomorphic encryption and differential privacy, with flexibility to expand as needed.

  • Computational Governance: Tracks, supervises, and enforces privacy and security across all computations. This ensures data custodians maintain control, contributing securely without moving data, with results aggregated and returned.

  • Output Assessments: Outgoing results are airlocked for validation against privacy tests (e.g., membership inference) or custom disclosure controls.
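The output-assessment idea can be sketched as a simple gate: a result leaves a data custodian's environment only after passing configured disclosure checks. The check and threshold below are hypothetical examples chosen for illustration, not Apheris defaults.

```python
# Sketch of an output "airlock": a result may leave the data custodian's
# environment only after passing configured disclosure checks. The check
# and threshold here are hypothetical examples, not Apheris defaults.

def airlock_release(result, n_contributing, min_aggregate=5):
    """Release an aggregate result only if enough records contributed,
    reducing the risk that an output reveals a single compound."""
    if n_contributing < min_aggregate:
        return None          # blocked: too few underlying data points
    return result            # approved for release

blocked = airlock_release({"mean_logD": 1.7}, n_contributing=2)
released = airlock_release({"mean_logD": 1.7}, n_contributing=120)
# blocked is None; released passes the disclosure check
```

Real deployments would layer additional tests (e.g., membership-inference checks on outgoing model updates) behind the same gate.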


Consortium Setup

The consortium is designed to be flexible, launching with a small seed of initial participants in H1 2025 and scaling over time. Key aspects include:

  • Recurring Access Model: Participants typically pay an annual fee for ongoing access to the latest, most up-to-date models, ensuring continuous benefits from collaboration.

  • Balanced Contributions and Benefits: Partners will collaborate based on pre-defined data requirements to ensure fairness in both contributions and derived benefits. Incentive structures will reward contribution of high-quality data that enhances model performance.

  • Centralized Facilitation: Apheris, with experience in coordinating collaborations like the AISB Consortium, will act as the central Collaboration Coordinator. This includes handling commercial, legal, and operational considerations, enabling participants to focus on scientific and technical contributions.

What’s next?

We invite you to an initial conversation to discuss the consortium proposal in more detail and clarify any open questions. At a later stage, Apheris will also facilitate direct discussions among interested parties to align on goals and parameters, ensuring a collaborative and successful launch of the initiative.


