Using AI to predict how a ligand binds to its protein pocket, or how two proteins lock together, has moved from wishful thinking to daily practice thanks to co-folding models such as AlphaFold3, Boltz-2, and the community-driven OpenFold3. Yet the accuracy of these models still drops in the industrial applications that matter most in drug discovery: proprietary chemotypes, novel protein targets, and modalities such as molecular glues or antibody/antigen complexes that are underrepresented in public data. Customizing these models on proprietary datasets would close the gap, but proprietary data can't leave company firewalls. Federated learning (FL) solves this by sending the model to the data and aggregating weight updates into a shared global model. Using Boltz-1 and OpenFold3, we measured how additional training data and different FL setups affect performance. If effective, the same approach can be applied to real proprietary datasets without moving IP-sensitive data. All experiments ran on 8× H100 NVIDIA GPUs via DGX Cloud through the NVIDIA Inception program, with NVFLARE orchestrating cross-site training of Boltz-1 and OpenFold3.
Pharmaceutical and biotech companies hold rich proprietary protein–ligand data that often cannot be shared due to contractual, regulatory, and competitive constraints. Privacy-preserving federated learning (FL) enables expanding the training distribution across diverse chemotypes and targets while keeping data IP protected. By sending models to the data and aggregating only weight updates, FL ensures that raw data remains within each organization’s infrastructure. In our experiments, we compared two aggregation techniques:
- Federated Averaging (FedAvg): a well-known baseline that simply averages model updates from different sites
- DiLoCo: a newer method that reduces communication by allowing clients to sync less often
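To make the averaging step concrete, here is a minimal FedAvg sketch on a toy 1-D least-squares problem. This is illustrative only, not the NVFLARE implementation; all names and the toy objective are placeholders.

```python
# Toy FedAvg: each client runs H local SGD steps, then the server replaces
# the global model with the unweighted average of the client models.

def local_steps(w, xs, ys, lr=0.05, H=10):
    """H local SGD steps minimizing mean (w*x - y)^2 on one client's data."""
    for _ in range(H):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

def fedavg_round(w_global, client_data, lr=0.05, H=10):
    """One federation round: broadcast, train locally, average the weights."""
    models = [local_steps(w_global, xs, ys, lr, H) for xs, ys in client_data]
    return sum(models) / len(models)

# Two i.i.d. clients whose data both satisfy y = 3x.
clients = [([1.0, 2.0], [3.0, 6.0]), ([0.5, 1.5], [1.5, 4.5])]
w = 0.0
for _ in range(20):  # 20 sync rounds
    w = fedavg_round(w, clients)
print(round(w, 3))  # converges toward 3.0
```

In the real runs the "model" is the full set of Boltz-1 or OpenFold3 weights and the local optimizer is far more involved, but the broadcast/train/average pattern is the same.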
We tested how these approaches perform across a range of settings on Boltz-1 and OpenFold3. In the next section, we explain how we federated Boltz-1 and OpenFold3 by varying the number of clients (W), sync intervals (H), gradient accumulation (G), and the aggregation technique (FedAvg vs. DiLoCo) and report the results of these experiments.
Boltz-1 is an open-source diffusion model for predicting 3D structures of biomolecular complexes, released with code and weights (Boltz-1 preprint). As a predecessor to the Boltz‑2 model, Boltz‑1 provided a solid testbed for training from scratch as Boltz-2 source code wasn’t available. We ran 1200 optimization steps on NVIDIA DGX Cloud (8× H100), starting from random weights and simulating either 2 or 8 clients with different compute allocations and training schedules. All experiments were conducted on i.i.d. data across clients, as a first step to understand how hyperparameters affect convergence and model performance in a controlled setting. To explore how system settings affect convergence and accuracy, we varied four parameters:
| Variable | Values explored | Rationale |
|---|---|---|
| W (clients) | 2 vs. 8 clients | Compare setups with fewer clients (each 4 GPUs) vs. many clients (1 GPU each) |
| G (Gradient Accumulation Steps) | 16 vs. 32 | G sets the per-client batch size; in FL the global effective batch is the sum across clients, e.g., W=8 with G=16 → 8×16=128 |
| H (local steps) | 5, 10, 40 | Number of local updates before synchronizing |
| Aggregator | FedAvg vs. DiLoCo | Compare standard averaging vs. momentum-aware aggregation, which theoretically could work with less frequent synchronization |
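The batch arithmetic behind the table can be spelled out. This sketch assumes one sample per GPU per accumulation step, matching the 8×16=128 example above; the function name is ours, not from any library.

```python
# Per-client and global effective batch sizes for the explored configurations
# (assumes 1 sample per GPU per accumulation step).

def effective_batch(clients, gpus_per_client, grad_accum):
    per_client = gpus_per_client * grad_accum
    return per_client, clients * per_client

for W, gpus, G in [(2, 4, 16), (8, 1, 16), (8, 1, 32)]:
    per_client, global_batch = effective_batch(W, gpus, G)
    print(f"W={W}, G={G}: per-client {per_client}, global {global_batch}")
```

Note how the W=8, G=16 case keeps the global batch at 128 while the per-client batch drops to 16; that small per-client batch is what caused instability in our runs.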
We tracked training loss, RMSD (pose distance), and lDDT (local distance agreement) to evaluate structural prediction quality across setups. Next, we compare the different federation setups against the baseline (centralized training). For all graphics: blue = centralized baseline (single-site), red = average across clients.
W=2, H=10, G=16 → matches centralized performance
With 2 clients (each on 4 GPUs), syncing every 10 steps and G=16, training and validation metrics matched centralized training (no accuracy loss from FL).
W=8, H=10, G=16 → diverges unless G↑ to 32 or H↓ to 5
With 8 clients (each on 1 GPU), keeping the same parameters made learning worse, likely because each client's local batch size (1 GPU × G=16) was too small to produce stable updates. While the global batch size remained at 128 (8 GPUs × G=16), the small per-client batch caused unstable gradients, leading to poor local convergence and increased drift across clients. As a result, the aggregated global model diverged from the centralized baseline. Fixes that worked (restoring performance to match centralized training, and sometimes slightly exceed it on validation metrics):
- Increase G to 32 (bigger per-client batch, more stable local learning)
- Decrease H to 5 (sync more often, which reduces drift and keeps the global model closer to the optimum)
W=8, H=40, G=32 → training collapses
Syncing too infrequently (every 40 steps) lets clients drift too far apart, and optimization breaks down: the loss spikes or stops improving.
Key takeaway: FedAvg matches centralized training on i.i.d. data when the per-client batch isn't too small and syncs are reasonably frequent (e.g., W=2, H=10, G=16). With many small clients (W=8), the local batch size (1 GPU × G=16) may become too small, leading to unstable updates at each client. This local instability propagates across clients and reduces the quality of the aggregated model. To recover accuracy, we either increased G to 32 (more stable local convergence) or decreased H to 5 (more frequent syncs, reducing between-client drift). These adjustments restored performance to match or slightly exceed centralized training, but at the cost of longer step times and higher communication overhead. As a rule of thumb, fewer clients with larger per-client batches (low W, high G) yield smoother optimization, while more clients with longer local horizons (high W, high H) reduce sync cost but increase between-site drift, often hurting convergence and final accuracy.
DiLoCo is a federated optimization method that reduces how often clients synchronize. Its performance depends on two key parameters: learning rate and momentum, both typically between 0 and 1. The learning rate determines the size of each model update: lower values slow learning but improve stability. Momentum carries forward a portion of the previous update to smooth learning; if set too high it can cause the training to oscillate or diverge. Tuned correctly, DiLoCo can match centralized accuracy with far fewer synchronizations (source).
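The outer update DiLoCo adds on top of local training can be sketched on a toy scalar parameter (inner steps abstracted away; names and values are illustrative). The published DiLoCo recipe uses Nesterov momentum in the outer optimizer; plain momentum is shown here for brevity.

```python
# DiLoCo-style outer step: after clients each run H local steps, the server
# treats delta = w_global - mean(client weights) as a "pseudo-gradient" and
# applies SGD with momentum to it. With momentum -> 0 and outer lr -> 1 this
# reduces to plain FedAvg: the global model jumps straight to the client average.

def diloco_outer_step(w_global, client_models, velocity, outer_lr=0.8, momentum=0.1):
    avg = sum(client_models) / len(client_models)
    pseudo_grad = w_global - avg                 # how far the clients moved, negated
    velocity = momentum * velocity + pseudo_grad
    return w_global - outer_lr * velocity, velocity

# FedAvg-limit sanity check: momentum=0, outer_lr=1 is exact averaging.
w, v = diloco_outer_step(1.0, [0.4, 0.6], velocity=0.0, outer_lr=1.0, momentum=0.0)
print(w)  # 0.5
```

With nonzero momentum, part of the previous outer step carries over into the next one, which is exactly what made tuning it critical in our runs.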
Training became stable only after momentum was lowered from 0.75 → 0.25 → 0.10
"Stable" here means the loss decreases smoothly, validation improves, and runs don't oscillate or blow up after syncs. With DiLoCo, momentum values common in LLM training (0.9/0.75) didn't work well. Reducing momentum (to 0.25, then 0.10) produced steady learning, with the outer learning rate (lr) tuned in the 0.7–0.9 range.
W=8, H=40, G=32, mom=0.10 → tracks near-baseline with far fewer syncs
With H=40, DiLoCo syncs ~8× less often than FedAvg (H=5), yet with momentum = 0.10 and G=32 it still tracks near-baseline performance (run stopped early; conclusions tentative).
Weight norms start high and converge toward FedAvg once momentum is tuned
With DiLoCo, high momentum made the weights larger; lowering momentum pulled the weight norms back toward FedAvg's range. This matches theory: as momentum → 0 and lr → 1, DiLoCo approaches FedAvg.
Key takeaway: DiLoCo can tolerate long gaps between syncs (e.g., H=40) and still track near-baseline performance when momentum is tuned low (≈0.1), lr is set to ~0.7–0.9, and the per-client batch is adequate (G≈32), which is useful when bandwidth is limited. The method is sensitive to hyperparameters; off-the-shelf LLM settings (high momentum) performed poorly. Rule of thumb: use FedAvg when clients are few and syncs can be reasonably frequent; use DiLoCo when bandwidth is tight and you can tune momentum/lr for stability.
For OpenFold3 (OF3), the goal was to evaluate federated fine-tuning. OpenFold3 is an open-source reproduction of AlphaFold3's unified, diffusion-based approach to joint complex prediction across proteins, ligands, nucleic acids, ions, and modified residues (AlphaFold 3, Nature 2024). We first ran centralized training to study our baseline dynamics, then resumed from a checkpoint to fine-tune on four datasets:
- Baseline PDB: original training data
- Post-2023 PDB: newer, unseen structures
- Mixed (50/50): combining baseline and post-2023 data
- Monomer + Ligand only: a specialized subset focused on small-molecule complexes
We used a warmup of 500 steps (the lr grows linearly from 0 to the target of 1.8e-3); OF3's default is 1,000 steps, used when training from scratch.
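The schedule is just a linear ramp to the target; a minimal sketch with the values from our setup (the function name is ours):

```python
# Linear lr warmup: ramp from 0 to the target over `warmup_steps`, then hold.
def warmup_lr(step, warmup_steps=500, target=1.8e-3):
    return target * min(step / warmup_steps, 1.0)

print(warmup_lr(250))  # halfway through warmup: 0.0009
print(warmup_lr(500))  # target reached: 0.0018
```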
We resumed from a 5,000-step checkpoint, chosen because the initial rapid metric improvements had already stabilized, while training was still progressing quickly enough to observe meaningful changes within the next 2,000 steps. Later checkpoints showed much slower improvements, making them less suitable for quick evaluation. We then ran two FL experiments with two clients whose data distributions differed. In the first setup, both used independent and identically distributed (i.i.d.) data. This means both datasets looked statistically similar, as if drawn from the same pool. In the second, the data were non-i.i.d.: one client used the baseline PDB set while the other used only post-2023 entries.
Key findings
When we fine-tuned on new or smaller datasets, the model overfit quickly; e.g., the Post-2023 PDB split began to memorize within ~750 steps. A targeted Monomer + Ligand split behaved as expected: intra-ligand lDDT improved while intra-protein lDDT declined, a specialization vs. generalization trade-off that comes from focusing the model on specific structure types.
In the two-client federation i.i.d. case, the average training loss across sites matched the centralised baseline.
In the non-i.i.d. case, each site exhibited different training losses, but their average still aligned with the expectations from a centralized set-up.
We previously held off on interpreting validation metrics due to a bookkeeping bug: in structure prediction models like OpenFold3, validation doesn’t use the raw training weights but an exponential moving average (EMA) of them. In federated learning, this means the EMA also needs to be synchronized across clients, doubling bandwidth if done naively. To solve this, we applied EMA only during synchronization (every H steps), rather than at each local step. Since EMA in centralized training uses a decay rate of 0.999 per step, we adjusted this for the lower sync frequency (e.g., using 0.999⁵). This fix aligns the validation curves with training progress and allows us to meaningfully evaluate federated fine-tuning results going forward.
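A sketch of the adjustment on toy scalar weights (function names are ours; the sketch assumes, for the equality check, that the weights are constant between syncs):

```python
# Per-step EMA (what centralized training does) vs. EMA applied only at each
# sync, with the per-step decay compounded over the H local steps in between
# (0.999 per step -> 0.999**H per sync).

def ema_per_step(ema, weight_trace, decay=0.999):
    for w in weight_trace:
        ema = decay * ema + (1 - decay) * w
    return ema

def ema_at_sync(ema, w_synced, H, decay=0.999):
    eff = decay ** H                  # e.g. 0.999**5 for H=5
    return eff * ema + (1 - eff) * w_synced

# If the weights were constant over the H local steps, the two agree exactly.
print(ema_per_step(1.0, [0.3] * 5))
print(ema_at_sync(1.0, 0.3, 5))
```

In practice the weights do change between syncs, so the compounded decay is an approximation, but it removes the need to exchange a second full copy of the weights at every local step.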
Since each run takes several days on 8×H100 GPUs, learning training dynamics early, like when to checkpoint, how to mix datasets, and which regularization strategies to apply, saves compute and guides how we scale future federated experiments.
Supported by NVIDIA’s Inception program and compute credits on DGX Cloud, our experiments show that with the right federated learning setup, distributed training can match the performance of centralized training. This makes FL a viable path to close the gap where today’s protein models still fall short: on the proprietary targets that matter most in drug discovery. These experiments mark a first step toward our broader mission: improving structure prediction through access to more representative training data, including proprietary data held by pharmaceutical and biotech companies. FL offers a way to make this possible while protecting data privacy and intellectual property. This foundation now feeds directly into real deployments. Together with pharmaceutical partners in the AI Structural Biology (AISB) Network, we are beginning to federate models like OpenFold3 across large volumes of proprietary data, applying the lessons from these experiments to support collaborative training across organizational boundaries without compromising sensitive data.
Authors: Avelino Javer, Nicolas Gautier, Ian Hales