
Fine-tuning the OpenFold3 affinity head on a small JAK2 macrocycle dataset

We rebuilt the SandboxAQ affinity head inside ApherisFold and fine-tuned it on 49 JAK2 macrocycles. Architectural changes alone lifted Spearman ranking from 0.418 to 0.60; fine-tuning then pushed validation Pearson from 0.23 to 0.76, mostly by correcting how the model handled inactives.

We rebuilt the SandboxAQ affinity head inside ApherisFold, restoring the probability-of-binding output, making the pipeline differentiable end-to-end, and admitting representations from multiple protein chains. On a varied JAK2 inhibitor set, the improved cropping alone moved Spearman ranking from 0.418 to 0.60 out of the box. On a 69-compound JAK2 macrocycle series, a short fine-tune on 49 training compounds then pushed validation-set Pearson from 0.23 to 0.76, almost entirely by correcting how the model handled the inactives.

Protein–ligand co-folding has matured to the point where the predicted pose is often credible. OpenFold3, Boltz-2, and Protenix-v1 routinely place small molecules into binding pockets in geometries close to the experimental ground truth, including for chemistry that postdates the training cutoff. We have written about that elsewhere: see the PDE10A fine-tuning case study and the TrmD walkthrough.

Pose is only one half of what a medicinal chemist needs. The other half is ranking: which of the 200 enumerated analogues in front of me is most likely to bind, and does that ranking line up with the affinity cliffs in my series? For ranking, co-folding models depend on a separate component, the affinity head, which consumes the structure module's internal representations and outputs an affinity-related score.

Two production-grade affinity heads currently ship alongside the public co-folding models. The Boltz lineage exposes both a 0–1 probability of binding and a potency value x defined by IC50 = 10^x. The SandboxAQ affinity head, the OpenFold3 companion, uses the standard IC50 = 10^(-x) convention (which, to be fair to SandboxAQ, is the more correct one: the "p" in pIC50 is -log10(IC50)). Both are reasonable starting points. Neither, in our experience, ranks reliably across a typical lead-optimization series without some adaptation.

This article documents what we did to get there, what it buys you on a chemotype with industrial relevance, and where the approach still has limits worth being honest about.
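The two potency conventions mentioned above are easy to mix up when comparing heads side by side. The sketch below shows the conversion under stated assumptions: the function names are illustrative, and the micromolar unit used for the IC50 = 10^x convention is an assumption of this example, not something stated in this post, so check the documentation of the model you are using.

```python
import math

def pic50_from_ic50_molar(ic50_molar: float) -> float:
    """Standard convention: pIC50 = -log10(IC50), with IC50 in mol/L."""
    return -math.log10(ic50_molar)

def ic50_molar_from_pic50(pic50: float) -> float:
    """Inverse of the standard convention: IC50 = 10^(-x)."""
    return 10.0 ** (-pic50)

def pic50_from_log10_ic50(log10_ic50: float, unit_in_molar: float = 1e-6) -> float:
    """Convert a value x reported as IC50 = 10^x (in some unit) to pIC50.

    unit_in_molar is the size of that unit in mol/L. The default of 1e-6
    (micromolar) is an assumption made for this sketch.
    """
    return -(log10_ic50 + math.log10(unit_in_molar))

# A 10 nM compound is 1e-8 mol/L, or 0.01 uM (log10 of that is -2).
ten_nanomolar_standard = pic50_from_ic50_molar(1e-8)
ten_nanomolar_from_log = pic50_from_log10_ic50(-2.0)
```

Both calls describe the same compound, so they should agree once the sign and the unit are handled; higher is more potent in pIC50, while lower is more potent in the IC50 = 10^x convention.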

[Figure: Affinity heads]

Where the public OpenFold3 affinity head needs work

The SandboxAQ release uses a two-step inference pattern: run a forward pass through OpenFold3, generate a "pocket embedding" by cropping a region around the predicted binding site (up to ~200 protein residues), re-run inference on that cropped sequence and ligand through the base OF3 weights, and only then feed the pair representation, single representation, and structure into a lightweight 19 MB affinity head.

The second inference pass is the part worth a closer look. Boltz-2 does something similar in spirit: it crops, then feeds the crop into another model with the same architecture as the trunk but a different set of weights. That effectively re-co-folds the cropped system with a model that is no longer trained to produce sensible structures. I suspect that is part of why Boltz-2 has been observed to be relatively insensitive to point mutations.

On a varied JAK2 inhibitor set (full pIC50 range plus ritonavir as a structurally unrelated negative control), the public SandboxAQ head achieves a Spearman rank correlation of 0.418 against measured potency. Not nothing. But three things are relevant for industrial drug-discovery work, and the public release falls short on each:

  1. No probability-of-binding head: Boltz exposes both a potency prediction and a binary 0–1 binding probability. The SandboxAQ release does not. That probability output is what you would naturally use to ingest single-shot high-throughput screening data, the most plentiful affinity signal in any pharma's data warehouse, and the one routinely discarded because the model architecture cannot consume it.

  2. No gradient flow between affinity head and structure module: The two-step inference detaches the affinity head from the structure module. You cannot use the affinity signal to refine the pose, and you cannot easily co-train both. Every affinity update is a static lookup against a frozen structural representation. There is also a scientific debate worth flagging here: at Charm and elsewhere we have observed that letting affinity gradients flow into the trunk can degrade structural metrics if the affinity signal is muddy. So enabling gradient flow is not the same as always using it. The architectural change gives you the choice.

  3. Cropping limited to a single protein chain plus a single ligand: This is a hard ceiling for anything involving ternary complexes (molecular glues, PROTACs, kinase–cyclin systems), or any case where the relevant binding interface spans more than one chain. The model cannot consume features from multiple protein chains at the same time.

We rebuilt the affinity head inside ApherisFold to address each of these.

What we changed

The reworked, integrated affinity head exposes:

  • A probability-of-binding output, restored to parity with the Boltz head, ready to consume binary HTS data alongside quantitative pIC50 measurements.

  • A differentiable pipeline. Gradients propagate from the affinity head back into the structure module, so structure and affinity can be co-trained when the data supports it (and not when it does not).

  • A revised cropping scheme that admits representations from multiple protein chains. This is the change that unlocks ternary complexes.

  • An on-the-fly inference path during structure-module training, so the affinity head updates alongside structural fine-tuning.

  • A simplified pocket-cropping pass that skips the second re-inference step the public SandboxAQ pipeline performs. We just crop the predicted structure and representations from the first OF3 forward pass and feed those directly into the affinity head.
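To make the single-pass, multi-chain crop in the list above concrete, here is a minimal sketch. It is not the ApherisFold implementation: the function name, the nearest-to-ligand selection rule, and the use of plain Python lists in place of tensors are all assumptions of this example, and the default limit of 200 protein tokens is borrowed from the public pipeline's description.

```python
import math

def crop_pocket(coords, is_ligand, single, pair, max_protein_tokens=200):
    """Crop tokens around the ligand and slice the representations to match.

    coords:    list of (x, y, z) per token, from the first forward pass
    is_ligand: list of bool per token
    single:    per-token features, indexed single[i]
    pair:      pairwise features, indexed pair[i][j]
    Chain identity is deliberately ignored, so residues from any protein
    chain close to the ligand can enter the crop.
    """
    ligand_idx = [i for i, lig in enumerate(is_ligand) if lig]
    protein_idx = [i for i, lig in enumerate(is_ligand) if not lig]
    if not ligand_idx:
        raise ValueError("no ligand tokens to crop around")

    def dist_to_ligand(i):
        # Distance from token i to its closest ligand token.
        return min(math.dist(coords[i], coords[j]) for j in ligand_idx)

    nearest = sorted(protein_idx, key=dist_to_ligand)[:max_protein_tokens]
    keep = sorted(nearest + ligand_idx)
    single_crop = [single[i] for i in keep]
    pair_crop = [[pair[i][j] for j in keep] for i in keep]
    return keep, single_crop, pair_crop

# Toy system: three residues of chain A, one of chain B, one ligand token.
coords = [(0, 0, 0), (1, 0, 0), (50, 0, 0), (2, 0, 0), (1.5, 0, 0)]
is_ligand = [False, False, False, False, True]
single = ["A:1", "A:2", "A:3", "B:1", "LIG"]
pair = [[(i, j) for j in range(5)] for i in range(5)]

keep, single_crop, pair_crop = crop_pocket(
    coords, is_ligand, single, pair, max_protein_tokens=2
)
```

In the toy system the two protein tokens closest to the ligand come from different chains, and both survive the crop. No second inference pass is involved: the representations are only sliced.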

The cropping change alone, independent of any fine-tuning, improved out-of-the-box ranking on the JAK2 inhibitor set from Spearman 0.418 (public SandboxAQ) to 0.60 (Apheris integrated head). Boltz-2, for reference, scored 0.545 on the same compounds.
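For readers who want to reproduce this kind of comparison on their own series, Spearman correlation is simply Pearson correlation computed on ranks. The sketch below uses only the standard library. The five predicted and measured values are made-up illustration numbers, not data from the JAK2 set.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over any run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Illustration only: hypothetical measured and predicted pIC50 values.
measured = [5.0, 6.0, 7.0, 8.0, 9.0]
predicted = [6.1, 6.0, 6.9, 7.5, 7.4]
rho = spearman(measured, predicted)
```

Because only the ordering matters, Spearman is insensitive to a head that compresses or shifts the potency scale, which makes it a reasonable first metric when comparing heads that use different output conventions.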

[Figure: JAK2 JH1 Affinity Predictions]

That is the new baseline. The next interesting question is what happens when you fine-tune on a chemotype the model does not yet rank well.

The JAK2 macrocycle case study

I wanted to test the affinity head on a chemotype the model has probably seen little of (macrocycles), using a published dataset with experimental potencies for the full series.

Dataset: We used a series of 69 macrocyclic JAK2 inhibitors around the approved drug pacritinib, drawn from a published study that compared free energy perturbation, 3D-QSAR, and experimental IC50 measurements across the series. We took the first 49 compounds in time order as the training set and held out the last 20 as a time-split validation set. No information leakage in either direction.

Why pacritinib? Two reasons. First, a single pacritinib-bound JAK2 structure exists in the PDB, so we can confirm the structure module places the macrocycle reasonably in the pocket. Second, macrocycles sit at the edge of the structural training distribution for OpenFold3. That makes them a useful stress test for the central question of this post: can fine-tuning recover ranking in a chemotype where the out-of-the-box model fails?
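The time split described above can be sketched in a few lines. The record layout, the "date" field, and the compound IDs are hypothetical placeholders, since the post does not say which timestamp defines the ordering. Only the 49/20 split sizes come from the text.

```python
import random
from datetime import date, timedelta

def time_split(records, n_train):
    """Earliest n_train records go to training; everything later is held out."""
    ordered = sorted(records, key=lambda r: r["date"])
    return ordered[:n_train], ordered[n_train:]

# Placeholder records: 69 compounds with made-up dates, shuffled to show that
# the split depends on the timestamp and not on file order.
records = [
    {"id": f"cpd_{i:02d}", "date": date(2020, 1, 1) + timedelta(days=7 * i)}
    for i in range(69)
]
random.Random(0).shuffle(records)

train, valid = time_split(records, n_train=49)
latest_train = max(r["date"] for r in train)
earliest_valid = min(r["date"] for r in valid)
```

A time split is a harder test than a random split for a lead-optimization series, because the held-out compounds are the ones a team would actually have had to predict prospectively.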

[Figure: Case Study: JAK2 JH1 Macrocycles]

Pose check first: Before touching the affinity head, we co-folded all 69 compounds with JAK2 and inspected the predicted poses. All compounds landed in the correct pocket, with poses analogous to the experimental pacritinib structure. This step is non-negotiable. If the model cannot find the pocket, no amount of affinity fine-tuning will save you. Pose is upstream of ranking, and an affinity head trained on top of a wrong pose learns the wrong thing.

[Figure: Check the structural outputs make sense]

Out-of-the-box affinity: With poses confirmed, we ran the integrated affinity head out of the box on all 69 compounds. The result was a weak-to-zero correlation between predicted and measured pIC50. The macrocycle chemotype is clearly outside the affinity head's training distribution, even though the structures themselves are within the structure module's reach. This is the regime where the public model is unhelpful for prioritization decisions.

[Figure: Initial Inference Results]

Fine-tuning setup: We then fine-tuned the affinity head on the 49 training compounds. The setup was deliberately lightweight, and worth describing in more detail because the headline takeaway is that this regime is accessible:

  • Hardware: a single NVIDIA L40S GPU.

  • Time: 4 minutes 30 seconds per epoch, with each compound seen 10 times per epoch.

  • Convergence: the first checkpoint had already converged, and further training did not improve validation performance. We selected that first checkpoint for evaluation rather than picking the best of many checkpoints, a small but important detail for honest reporting.
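The figures in the list above imply a rough per-compound cost, worked through below. The last line relates the epoch time to the roughly 45 minutes of total GPU time quoted later in this post. Reading that as ten epochs is an inference that assumes all of the time was spent in training epochs, which the post does not state.

```python
n_train = 49                    # training compounds
repeats_per_epoch = 10          # each compound seen 10 times per epoch
epoch_seconds = 4 * 60 + 30     # 4 minutes 30 seconds

samples_per_epoch = n_train * repeats_per_epoch
seconds_per_sample = epoch_seconds / samples_per_epoch

# Assumption: the ~45 minutes quoted elsewhere is all training-epoch time.
epochs_in_45_minutes = (45 * 60) / epoch_seconds

print(f"{samples_per_epoch} presentations per epoch")
print(f"{seconds_per_sample:.2f} s per presentation")
print(f"{epochs_in_45_minutes:.0f} epochs fit in 45 minutes")
```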

[Figure: Simple Training Setup]

Results: On the training set, the model fit the y = x line as expected. The main question is what happened on the held-out 20. Validation-set Pearson correlation improved from 0.23 to 0.76.

[Figure: Before and After]

The mechanism behind the improvement is the part I want to draw out. The bulk of the gain came from the inactives. Out of the box, the model predicted weak binders with similar potency to the strong ones, which is precisely the failure mode that makes a model useless for prioritization. After fine-tuning, those inactives were pushed away from the active compounds, in the right direction. The model learned what makes a macrocycle weaker against this target, not just what makes one stronger. That distinction is crucial in lead optimization: a model that cannot identify weak binders within your own series is a model that cannot prioritize synthesis.
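A toy calculation shows why correcting the inactives alone can move Pearson so much. All numbers below are synthetic and chosen for illustration. They are not the JAK2 macrocycle data, and the size of the effect here says nothing about the size of the effect in the real series.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Synthetic pIC50 values: four actives followed by two inactives.
measured = [8.0, 7.5, 7.8, 8.2, 5.0, 5.2]

# "Before": the inactives are predicted about as potent as the actives.
before = [7.6, 7.8, 7.5, 7.7, 7.7, 7.6]

# "After": predictions for the actives are unchanged, and only the two
# inactives have been pushed down.
after = [7.6, 7.8, 7.5, 7.7, 6.0, 6.2]

r_before = pearson(measured, before)
r_after = pearson(measured, after)
print(f"before: {r_before:.2f}, after: {r_after:.2f}")
```

The predictions for the actives are identical in both cases, so the entire change in correlation comes from separating the inactives. The flip side is that a high Pearson on a set with a few well-separated inactives does not by itself show good resolution among the actives, which is worth checking separately.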

On the thermodynamic objection to single-pose affinity

Some readers will look at this setup and reasonably object that affinity from a single pose ignores the entropy term in ΔG = ΔH − TΔS, and that ranking from a static prediction has no first-principles thermodynamic justification. The objection has been raised at recent conferences and is fair as stated. Deep learning methods may be able to correct for some of the entropic contribution by taking the estimated flexibility of the input ligand and its 3D geometry into account, but a component from the dynamics and the full ensemble of ligand poses will still be missing. The affinity head ultimately sees a single pocket embedding per prediction. That holds even where the structure module is diffusion-based and samples multiple seeds. In parallel work on a VHL molecular glue degrader series led by my colleague Alwin Otto Bucher, to be published as a separate post, we see the model predict a higher pKd for the seed that lands in the correct aromatic orientation, but that signal is per-seed, not an integral over the conformational ensemble.

More pragmatically, the question for a discovery team is not whether the model approximates ΔG from first principles. It is whether the model ranks compounds well enough to inform a synthesis prioritization decision better than the alternative being used today. A Pearson of 0.76 on a held-out validation set of macrocycles, achieved with 49 training compounds and 45 minutes of GPU time, clears that bar for the cases I have looked at. The thermodynamic concern is real and should constrain how confidently you treat the absolute predicted pIC50 values. It does not invalidate the ranking.
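For a sense of scale when reading predicted values against the ΔG framing above, the standard relation ΔG° = RT ln(Kd) converts a pKd into a binding free energy. The sketch below assumes a true dissociation constant and a 1 M standard state. An IC50 is assay-dependent and only approximates Kd, so applying this to pIC50 values is an approximation at best.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def delta_g_from_pkd(pkd: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy in kcal/mol.

    dG = RT * ln(Kd) = -RT * ln(10) * pKd, with Kd relative to 1 mol/L.
    """
    return -R_KCAL * temperature_k * math.log(10) * pkd

per_log_unit = abs(delta_g_from_pkd(1.0))      # energy per pKd unit
nanomolar_binder = delta_g_from_pkd(9.0)       # Kd = 1 nM

print(f"{per_log_unit:.2f} kcal/mol per log unit at 298.15 K")
print(f"{nanomolar_binder:.1f} kcal/mol for a 1 nM binder")
```

One log unit of potency corresponds to a little under 1.4 kcal/mol at room temperature, which is why small unmodelled entropic contributions are enough to reorder closely spaced compounds even when the coarse ranking holds.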

What this means for a lead-optimization workflow

Different drug-discovery activities make different demands on an affinity model, and conflating them is the source of most of the noise in the "is co-folding affinity good?" debate. A practical taxonomy:

  • Virtual screening / hit finding: high throughput, low per-compound inference cost. Wants a model that ranks reasonably across diverse scaffolds. Does not need fine within-series resolution.

  • Hit-to-lead: global ranking across multiple scaffold candidates. Cross-chemotype generalization matters.

  • Lead optimization: local, scaffold-specific ranking is the whole game. Within-chemotype discrimination is what filters which of next week's analogues get synthesized.

  • Off-target / safety panels (kinome, GPCR panels, Safety Screen 44): selectivity prediction across a defined panel of anti-targets. The shape of the problem is closer to multi-task prediction.

The JAK2 fine-tune above is squarely a lead-optimization tool. The integrated head's out-of-the-box Spearman 0.60 on the broad JAK2 set is closer to a hit-to-lead tool. These are separate evaluations, not collapsed answers to a single question.

For lead optimization specifically, three regimes are worth distinguishing:

  1. In-distribution chemotype, public affinity coverage adequate: Out-of-the-box affinity ranking is sometimes good enough to be useful for early filtering. Spearman 0.60 on a diverse JAK2 set without any adaptation is in this regime. For series that look like things in PDBbind or ChEMBL, this is often where you start.

  2. Edge-of-distribution chemotype, small in-house affinity dataset available: The JAK2 macrocycle case. Out-of-the-box ranking is weak. Lightweight fine-tuning on a few dozen compounds recovers most of the ranking signal, fast enough (roughly 45 minutes total on a single L40S) that this could plausibly run on a per-program basis inside a normal DMTA cycle, not as a one-off science project. The bar to making an affinity head useful for a specific chemotype is now low enough that this should be the first thing you try, not the last.

  3. Out-of-distribution chemotype, or novel system class: Ternary complexes, such as those formed by molecular glues and PROTACs, fall here. Out-of-the-box ranking fails for structural reasons before you even get to affinity. You need structure fine-tuning and affinity fine-tuning together, and the data design dominates whether you get a meaningful improvement. The VHL ablation work has a clean demonstration: removing the negative-data compound from the fine-tuning set collapses the active/inactive separation entirely, and skipping the structure fine-tune leaves the affinity head fitting noise. The follow-up post will go through both ablations in detail.

Why HTS data and the probability head matter

Most pharmaceutical affinity data is not clean pIC50. It is single-shot percent-inhibition at a fixed concentration. It is sparse dose-response curves. It is qualitative hit-or-miss calls from a primary screen. The conventional response from structural modellers is to filter most of this out and train only on the clean quantitative measurements. That is the wrong call when the quantitative data is sparse and the qualitative data is plentiful, which is the situation for almost every program at almost every pharma I have worked with. Every HTS run a company has ever done encodes structured signal about which compounds bind and which do not.

The argument that combining heterogeneous affinity data improves prediction is not new. Eric Martin and colleagues at Novartis showed years ago with pQSAR and All-Assay-Max2 pQSAR that activity predictions trained jointly across thousands of heterogeneous assays can rival four-concentration IC50 measurements in individual assays. My own earlier work on imputation of heterogeneous bioactivity data and practical deep learning for drug-discovery imputation showed similar effects. What is new is that we can now do that trick on top of a structural representation that already encodes the binding pose, instead of starting from molecular fingerprints alone.

Restoring the probability-of-binding head means we can feed HTS data in as a binary classification target alongside whatever quantitative pIC50 data exists for the same series. The affinity head learns from both, and the two modalities reinforce each other where they overlap.

This matters even more for federated training. Across the partner organizations in the AISB Network, the combined affinity data inventory is large but heterogeneous in assay format, units, and confidence level. The model architecture has to be able to consume that heterogeneity, or the data goes unused. The probability head and the differentiable affinity pipeline are what make that possible.
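One common way to let a two-output head learn from both kinds of label is a masked multi-task loss: squared error where a pIC50 exists, binary cross-entropy where only a hit/no-hit call exists. The sketch below illustrates that idea under stated assumptions. The loss form, the weights, the 50% inhibition threshold, and every name in it are hypothetical, and none of it is taken from the ApherisFold training code.

```python
import math

def binder_label(percent_inhibition: float, threshold: float = 50.0) -> int:
    """Turn a single-shot HTS readout into a 0/1 label (threshold is an assumption)."""
    return 1 if percent_inhibition >= threshold else 0

def bce(prob: float, label: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction, clipped for numerical safety."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def mixed_affinity_loss(batch, regression_weight=1.0, binary_weight=1.0):
    """Mean squared error over rows with a pIC50 plus mean BCE over rows with a label.

    Each row holds model outputs 'pred_pic50' and 'pred_prob', and optional
    targets 'pic50' and 'binder'. A missing target is None and is skipped,
    so every compound contributes whatever supervision it has.
    """
    reg = [(r["pred_pic50"] - r["pic50"]) ** 2
           for r in batch if r.get("pic50") is not None]
    cls = [bce(r["pred_prob"], r["binder"])
           for r in batch if r.get("binder") is not None]
    loss = 0.0
    if reg:
        loss += regression_weight * sum(reg) / len(reg)
    if cls:
        loss += binary_weight * sum(cls) / len(cls)
    return loss

batch = [
    # Dose-response compound: quantitative target only.
    {"pred_pic50": 7.0, "pred_prob": 0.9, "pic50": 8.0, "binder": None},
    # HTS-only compound: 72% inhibition at a fixed concentration.
    {"pred_pic50": 6.0, "pred_prob": 0.5, "pic50": None,
     "binder": binder_label(72.0)},
]
loss = mixed_affinity_loss(batch)
```

The design point is the masking: an HTS-only compound never contributes a made-up pIC50 to the regression term, and a compound with both labels would contribute to both terms.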

Two limitations worth being honest about

Limitation 1: the affinity head is bound to the structure trunk it was trained on. The SandboxAQ affinity head was trained on top of OpenFold Preview 1 embeddings, and it does not transfer cleanly to other trunks: not to OpenFold Preview 2, not to fine-tuned structural variants, and not to any structure-prediction model that does not share that exact representation. This means today's affinity fine-tuning workflow is most useful when your targets are already well predicted by the OFP1 trunk. For chemistry that needs a fine-tuned structure model first (the PDE10A case study is a worked example), you currently get the structural improvement but lose the affinity head in the process. We are training a new affinity head from scratch on top of the AISB-1C federated structure prediction model as part of AISB-2, which will give pharma partners the option to fine-tune affinity on top of a structure trunk that was itself trained on proprietary industry data. That is the next obvious capability gap to close.

Limitation 2: affinity-head reliability is tightly coupled to structure-module accuracy on your chemotype. As the VHL work shows, getting the pose wrong before predicting affinity collapses everything downstream. Pose first, affinity second, every time. If your structure module is putting the ligand in the wrong place, no amount of affinity fine-tuning will fix it. You need to fine-tune the structure module first.

What's next: combining structure, affinity, and HTS data at scale

The JAK2 case study above is one mechanism on one dataset. The broader program we are building inside ApherisFold and the AISB Network chains these mechanisms together:

  • Structural data fine-tunes the structure module to place ligands correctly for a target class.

  • Quantitative affinity data fine-tunes the affinity head to rank within that target class.

  • HTS-style binary data extends the affinity head's coverage into the chemical space where dose-response measurements were never run.

When pharma companies participate in a federated training run, what is aggregated is not raw data but the model update derived from each company's slice of all three modalities. The federated OpenFold3 result we reported recently showed this works at the structure level: five top-20 pharma companies, training across proprietary structural datasets without any data leaving the originating organization. The next milestone is showing it works at the affinity level: the binding-mode signal in structures, the SAR signal in affinity data, and the activity signal in HTS data, combined inside one model that none of the participants could have trained alone.

A larger case study is in flight on a single target that combines HTS and quantitative affinity data, with explicit ablations for how each modality contributes to the other. The early indication, both from the JAK2 work above and from the VHL ternary-complex work, is that the synergy between modalities is real, and the ablations confirm it. More on that soon.
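As a rough picture of what "aggregating the model update" can mean, here is a minimal weighted-averaging sketch in the style of federated averaging. The post does not say which aggregation rule the AISB Network uses, so the rule, the function name, and the toy updates are all assumptions of this example. Secure aggregation and other privacy machinery are out of scope here.

```python
def federated_average(updates, weights=None):
    """Weighted mean of per-participant parameter updates.

    updates: list of dicts mapping parameter name -> list of floats
             (each participant's locally computed delta)
    weights: optional per-participant weights, for example local example
             counts; defaults to an unweighted mean
    """
    if not updates:
        raise ValueError("need at least one update")
    if weights is None:
        weights = [1.0] * len(updates)
    total = float(sum(weights))
    return {
        name: [
            sum(w * u[name][k] for w, u in zip(weights, updates)) / total
            for k in range(len(updates[0][name]))
        ]
        for name in updates[0]
    }

# Two hypothetical participants; only these deltas leave each site.
site_a = {"affinity_head.w": [1.0, 2.0]}
site_b = {"affinity_head.w": [3.0, 6.0]}
global_update = federated_average([site_a, site_b], weights=[1, 3])
```

The coordinator only ever sees parameter deltas, never structures, potencies, or screening results, which is the property the paragraph above relies on.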

Try ApherisFold

For teams who want to build a co-folding capability into their discovery engine, ApherisFold provides an enterprise-ready way to run models like OpenFold3, Boltz-2, and Protenix-v1 locally, benchmark them on proprietary structures, fine-tune both structure and affinity heads on program-specific data, and operate the whole stack inside the partner's own IT environment.

Acknowledgements

The VHL molecular glue work referenced above was led by Dr. Alwin Otto Bucher. The integration work builds on the public SandboxAQ release and the upstream OpenFold3 model from the AlQuraishi lab.
