One of the recurring conversations in ADMET modelling is what to do once a model saturates on internal data. The default response is usually to add more: additional plates, retrospective studies, or updated assay variants. But recent literature suggests that this instinct, while understandable, is sometimes misaligned with how models actually learn. The ChemMedChem study by Schliephacke, Kuhn, and Friedrich offers a reminder that performance gains come from diversity and balance rather than volume. Their study examined six endpoints using both Merck’s internal datasets and high-quality public data from the Biogen release. Across most endpoints, the pattern repeated. When external data added complementary chemical or assay space, multitask models improved accuracy, applicability, and in some cases uncertainty stability. When the external data made up only a small fraction of the combined training set and added little new chemistry, the benefits disappeared and sometimes reversed. Their observation is consistent with what I see in our Apheris ADMET Network, where model performance gains depend far more on complementarity than on volume.
A central finding of the Schliephacke et al. study is that the effect of integrating public and proprietary data depends heavily on how balanced those sources are. This matters because most organizations work with datasets that differ vastly in size, assay design, and chemical space coverage. Understanding how these imbalances affect model behavior is essential before designing a smart collaboration. In the study, the endpoints with the largest internal datasets, such as HLM and RLM, showed the clearest limitations of simple data addition. When public data were added, the pooled and multitask models did not improve and in some cases showed slightly worse performance on the internal test sets than the internal-only models. The authors attribute this to a combination of shifted label distributions, differences in microsomal stability assays, and relatively little added chemical diversity from the external set. In this regime, the internal-only model remained the best-performing baseline for internal compounds. The opposite pattern appeared in MDR1-MDCK ER. Here, the internal dataset was sparse and the public dataset contributed substantial new chemistry. Multitask learning produced one of the most pronounced improvements in the study: lower error, wider applicability domain, and more consistent uncertainty estimates. This endpoint benefited precisely because the public data filled structural gaps rather than repeating what was already present. Across most endpoints, the same principle held. Models improved when external data meaningfully broadened the training space. When the external data were small, redundant, or dominated by internal data, the benefit diminished and calibration often degraded. For organizations considering participation in an ADMET data network, this points to an important conclusion: the scientific value of collaboration comes from complementary contributions, not from volume alone.
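To make the multitask idea concrete, a minimal sketch of this kind of setup is a shared trunk with one regression head per endpoint, so that a sparse endpoint such as MDR1-MDCK ER can borrow representation learned from better-covered assays. The endpoint list, fingerprint input, layer sizes, and masked loss below are illustrative choices of mine, not the architecture or training protocol used in the paper.

```python
# Minimal multitask MLP sketch (PyTorch): a shared trunk feeding one
# regression head per ADMET endpoint. Endpoint names, the fingerprint
# length, and layer sizes are illustrative, not taken from the paper.
import torch
import torch.nn as nn

ENDPOINTS = ["HLM", "RLM", "MDR1_MDCK_ER"]  # hypothetical subset of endpoints


class MultitaskMLP(nn.Module):
    def __init__(self, n_bits: int = 2048, hidden: int = 512):
        super().__init__()
        # Shared representation learned jointly from all endpoints.
        self.trunk = nn.Sequential(
            nn.Linear(n_bits, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One small regression head per endpoint.
        self.heads = nn.ModuleDict({ep: nn.Linear(hidden, 1) for ep in ENDPOINTS})

    def forward(self, x):
        z = self.trunk(x)
        return {ep: head(z).squeeze(-1) for ep, head in self.heads.items()}


def masked_mse(preds, targets, mask):
    """MSE averaged only over observed (compound, endpoint) labels.

    targets and mask have shape (batch, n_endpoints); mask is 1 where a
    measurement exists, so compounds lacking a given assay contribute
    nothing to that head's loss.
    """
    losses = []
    for i, ep in enumerate(ENDPOINTS):
        m = mask[:, i].bool()
        if m.any():
            losses.append(((preds[ep][m] - targets[m, i]) ** 2).mean())
    return torch.stack(losses).mean()
```

The masked loss is the practical detail that makes pooling workable at all: in a combined internal and public set, most compounds are measured in only a subset of assays, and the shared trunk is where any cross-endpoint benefit has to come from.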
Another key insight from the Schliephacke et al. study is that complementary data can broaden a model’s applicability domain (AD) even when headline accuracy does not change. This matters because many ADMET models appear sufficiently predictive on global metrics while still behaving unpredictably in regions of chemical space that are poorly represented in their training data. The study’s similarity-stratified analyses illustrate this clearly. For endpoints where internal and public data were reasonably balanced, multitask learning produced a smoother and more consistent error profile across similarity bins, thereby extending the AD. The HLM endpoint is a representative example. Although the MAE remained unchanged, the multitask model showed fewer abrupt error shifts in regions where single-source training had little representation, indicating a more stable response surface. This pattern reflects what often occurs in siloed proprietary settings. Models trained on a single internal dataset tend to perform well on structures they have seen, but their errors rise sharply in areas with little or no training coverage. When new data broaden the chemistry in a genuinely complementary way, these steep error gradients often flatten and the model behaves more consistently across previously sparse regions. In contrast, adding data that are structurally redundant or heavily overlapping with existing chemistry offers little benefit to the AD. It can give the impression of broader chemical coverage, even though the model does not improve in areas that extend beyond its existing training chemistry. In short, ADs become more predictable when contributions are complementary and more fragile when combined datasets reinforce the same structural biases.
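A rough way to look at this in your own data, without reproducing the paper's exact protocol, is to bin each test compound by its nearest-neighbour Tanimoto similarity to the training set and report the error per bin. The fingerprint settings, bin edges, and function below are assumptions on my part, intended only to show the shape of the analysis.

```python
# Sketch of a similarity-stratified error profile: bin each test compound by
# its nearest-neighbour Tanimoto similarity to the training set and report
# MAE per bin. Fingerprint settings and bin edges are illustrative choices.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def fingerprints(smiles_list):
    # Assumes valid SMILES; real pipelines should handle parse failures.
    return [_gen.GetFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]


def similarity_binned_mae(train_smiles, test_smiles, y_true, y_pred,
                          bins=(0.0, 0.3, 0.5, 0.7, 1.01)):
    train_fps = fingerprints(train_smiles)
    test_fps = fingerprints(test_smiles)
    # Maximum similarity of each test compound to any training compound.
    max_sim = np.array([
        max(DataStructs.BulkTanimotoSimilarity(fp, train_fps)) for fp in test_fps
    ])
    abs_err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (max_sim >= lo) & (max_sim < hi)
        if in_bin.any():
            print(f"similarity [{lo:.1f}, {hi:.1f}): "
                  f"n={in_bin.sum():4d}  MAE={abs_err[in_bin].mean():.3f}")
```

A relatively flat error profile across the low-similarity bins is the behavior the study associates with a broadened applicability domain; steep error growth in those bins is the signature of a model confined to its own chemistry.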
What I find most interesting about the Schliephacke et al. study is that several results challenge familiar assumptions in ADMET modelling, including the expectation that more data improves uncertainty estimates, that more expressive models provide better calibration, and that accuracy gains are required for a broader applicability domain.
First, more data does not always improve uncertainty estimates. Some multitask models became less calibrated when public data were added, with imbalance between sources appearing to drive the effect.
Second, simpler models provided the most reliable uncertainty estimates. The MLPs outperformed more expressive architectures, counter to the common assumption that representational flexibility naturally supports better uncertainty quantification (a minimal calibration check is sketched after this list).
Third, the applicability domain sometimes expanded without gains in absolute accuracy. This indicates that a model can learn more stable structural relationships even when endpoint-level performance remains static.
Finally, public data improved internal predictions in several cases, showing that high-quality, complementary public data can strengthen even in-house model behavior.
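For readers who want to sanity-check their own uncertainty estimates, two simple diagnostics capture most of what matters here: whether larger predicted uncertainties actually correspond to larger errors, and whether nominal confidence intervals achieve their stated coverage. The sketch below assumes Gaussian predictive uncertainties and is not the evaluation protocol used in the study.

```python
# Two simple diagnostics for regression uncertainty estimates (illustrative,
# not the paper's evaluation protocol): rank correlation between predicted
# uncertainty and actual error, and a Gaussian interval-coverage curve.
import numpy as np
from scipy.stats import spearmanr, norm


def uq_diagnostics(y_true, y_pred, y_std):
    y_true, y_pred, y_std = map(np.asarray, (y_true, y_pred, y_std))
    abs_err = np.abs(y_true - y_pred)

    # 1) Do larger predicted uncertainties correspond to larger errors?
    rank_corr, _ = spearmanr(y_std, abs_err)

    # 2) Coverage: for each nominal confidence level, compare the fraction of
    #    true values inside the predicted Gaussian interval with that level.
    levels = np.linspace(0.05, 0.95, 19)
    half_widths = norm.ppf(0.5 + levels / 2)[:, None] * y_std[None, :]
    observed = (abs_err[None, :] <= half_widths).mean(axis=1)
    miscalibration_area = np.trapz(np.abs(observed - levels), levels)

    return {"rank_corr": float(rank_corr),
            "miscalibration_area": float(miscalibration_area)}
```

A rank correlation near zero, or a large miscalibration area, is the kind of degradation described above when one data source dominates the combined training set.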
These findings show that intuition is a poor substitute for understanding how models behave under heterogeneous training conditions. ADMET model behavior remains governed by the availability, balance, and complementarity of training data; no architecture compensates for these factors when they are misaligned. They also raise a broader question for collaborative ADMET efforts: how can participants assess whether their data are truly complementary without revealing proprietary IP? Our recent work on privacy-preserving federated diversity analysis provides one approach. Using federated clustering and chemistry-informed metrics such as SF–ICF, we can estimate structural diversity, scaffold coverage, and sparsity across distributed datasets without centralizing any structures or fingerprints. As demonstrated in our benchmarking of federated k-means and PCA-based methods, this analysis helps quantify whether a dataset broadens the existing chemical space or merely overlaps with it. These signals indicate when combined training is likely to improve ADMET models and when size imbalance or redundancy may limit the benefit.
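To illustrate the principle rather than the production implementation (the SF–ICF metric itself is more than a short snippet), the sketch below shows how each site could compute salted-hash counts of Murcko scaffolds locally and share only those aggregates, so a coordinator can estimate scaffold overlap without ever seeing a structure. The function names and salting scheme are my own illustrative choices; a real deployment would layer secure aggregation and count thresholds on top.

```python
# Illustrative sketch (not the production SF-ICF implementation): each site
# computes salted-hash counts of Murcko scaffolds locally, and only those
# aggregates leave the site. A coordinator can then estimate scaffold
# overlap across participants without seeing any structures.
import hashlib
from collections import Counter
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def local_scaffold_counts(smiles_list, salt: str) -> Counter:
    """Run inside each organization: map compounds to hashed Murcko scaffolds."""
    counts = Counter()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)
        digest = hashlib.sha256((salt + scaffold).encode()).hexdigest()
        counts[digest] += 1
    return counts


def scaffold_overlap(counts_a: Counter, counts_b: Counter) -> float:
    """Run at the coordinator: Jaccard overlap of hashed scaffold sets."""
    a, b = set(counts_a), set(counts_b)
    return len(a & b) / max(len(a | b), 1)
```

Because all sites use the same salt, identical scaffolds map to identical digests and overlap can be computed, while the digests themselves do not directly expose the underlying scaffolds; additional privacy machinery is still needed in practice, which is exactly what the federated setting provides.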
The findings confirm a familiar principle in federated ADMET modelling: performance improves when contributions are complementary rather than large in volume. This challenges the usual single-partner mindset that more data always yields better models. The results show that representative and well-balanced datasets improve performance, whereas large quantities of marginal or domain-shifted data may not. This has direct implications for how ADMET collaborations should be designed. Contribution requirements need to account for chemical diversity, endpoint comparability, and assay consistency, not just raw volume. Imbalances in these dimensions can limit the benefit of joint training or lead to degraded calibration. These considerations are reflected in our current ADMET Network design, where endpoints are harmonized, units are aligned, and data contributions are evaluated with attention to their structural diversity. The federated setup then allows each organization to retain full control over its data and IP while still benefiting from a broader and more representative training domain.
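As a toy illustration of what that harmonization step involves, the snippet below maps site-specific assay identifiers onto shared endpoint names, converts values into an agreed unit, and log-transforms them so label distributions are comparable across contributors. All identifiers, units, and factors here are hypothetical placeholders, not the network's actual schema.

```python
# Toy illustration of endpoint and unit harmonization before joint training.
# Assay identifiers, unit conventions, and conversion factors are
# hypothetical placeholders, not the network's actual schema.
import math

CANONICAL_ENDPOINT = {
    "hlm_clint_v2": "HLM",            # human liver microsomal stability
    "rlm_clint": "RLM",               # rat liver microsomal stability
    "mdck_mdr1_efflux": "MDR1_MDCK_ER",
}

# Factors that convert a reported value into the agreed unit for its endpoint
# (e.g. uL/min/mg protein for intrinsic clearance).
TO_CANONICAL_UNIT = {
    ("HLM", "uL/min/mg"): 1.0,
    ("HLM", "mL/min/g"): 1.0,   # mL->uL and g->mg scale by 1000 each and cancel
    ("RLM", "uL/min/mg"): 1.0,
}


def harmonize(record: dict) -> dict:
    """Map one raw assay record onto the shared endpoint, unit, and label scale."""
    endpoint = CANONICAL_ENDPOINT[record["assay_id"]]
    factor = TO_CANONICAL_UNIT[(endpoint, record["unit"])]
    value = record["value"] * factor        # assumes positive, quantified values
    return {
        "smiles": record["smiles"],
        "endpoint": endpoint,
        "log_value": math.log10(value),     # log scale keeps label distributions comparable
    }
```

For example, harmonize({"assay_id": "hlm_clint_v2", "unit": "mL/min/g", "value": 12.0, "smiles": "CCO"}) would produce an HLM record with log_value = log10(12), ready to sit alongside contributions recorded in other units.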
The most interesting outcome from the study is not that multitask learning can help, but the conditions under which it does. The clearest gains appear when additional data introduce genuinely new chemical or assay space. Limited or overlapping contributions offer little benefit and can even reduce calibration when one source dominates the training set. This matters because ADMET prediction is often most valuable precisely where experimental data are sparse or at the limit of the applicability domain. Chemists rely on model-based hypotheses to explore ideas early, to prioritize compounds before assays are available, or to test directions that fall outside routine chemical series. In these settings, a model’s behavior in underrepresented regions is often more important than its accuracy on well-characterized scaffolds. Complementary data broaden these regions. Redundant data do not. The study also reinforces a broader point for collaborative ADMET efforts. Federated networks can provide access to a wider range of chemical and assay space, but the value comes from how contributions are selected and harmonized. Diversity, balance, and assay comparability shape the behavior of any collaborative model. When these elements are aligned, federated learning can support more reliable predictions in exactly the parts of chemical space where they are most needed.