May the -∇E be with you
Uncertainty Quantification for MLIPs is nothing like vision, and needs to be approached very carefully
Foreword
It’s widely known that AI for molecular modeling lags behind traditional ML research by 1-2 years, and there’s an elegant transfer of methods and techniques (meant for images and text) from the latter to the former when being repurposed for 3D molecular data (eg: generative models like flows and diffusion for de novo design). There’s been a lot of action in the MLIP space the past few years, and more recently, we’re seeing uncertainty quantification (UQ) efforts from traditional ML start to bloom.
Here’s a short commentary on where I see this going and some potential pitfalls to avoid (or at least, navigate).
This post was inspired by a recent conversation, among many others, on MLIPs and UQ with Ty Perez, a brilliant PhD student at MIT, friend, and co-first-author on our latest work, Zatom-1, which features some MLIP experiments. A lot of what I say below came up during these calls, and this is a good chance to consolidate all of that.
On MLIPs
Machine Learning Interatomic Potentials (MLIPs) are neural networks trained to predict per-atom energies and forces given a molecule’s 3D conformer and chemical formula. The eventual goal is to replace (or supplement) the expensive quantum-mechanical calculations that drive molecular dynamics and land at energies and forces directly. Over the past few years, it’s become a hot topic with a lot of interest from academia (which has produced models like MACE, SevenNet, Allegro, etc) and industry (teams like Orb, Meta’s FAIR Chemistry, IBM, etc are building leading models). Force-energy and quantum property datasets are getting larger than ever: OMat24, OMol25, OPoly26. The list goes on.
MLIPs come in two flavors: conservative, where energies are predicted using the neural network and forces are computed as the negative energy gradient (following the law of conservation of energy); and non-conservative, where the neural network has two heads to predict scalar energies and 3D forces separately. The latter is also called direct force prediction. Recent literature has shown non-conservative MLIPs perform as well as, if not better than, conservative ones, and the jury is still out on whether one really needs things like strict equivariance to get the job done [1].
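To make the conservative flavor concrete, here is a minimal sketch. It assumes a toy harmonic pair energy as a stand-in for the learned energy model, and it uses central finite differences where a real MLIP would use automatic differentiation. The names `pair_energy`, `total_energy` and `conservative_forces` are made up for illustration.

```python
import math

def pair_energy(r, r0=1.0, k=1.0):
    """Toy harmonic bond energy; a stand-in for a learned energy model."""
    return 0.5 * k * (r - r0) ** 2

def total_energy(positions):
    """Sum the toy pair energy over all unique atom pairs."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            energy += pair_energy(math.dist(positions[i], positions[j]))
    return energy

def conservative_forces(positions, h=1e-5):
    """F = -dE/dx via central finite differences (autodiff in a real MLIP)."""
    forces = []
    for i in range(len(positions)):
        f_atom = []
        for d in range(3):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][d] += h
            minus[i][d] -= h
            f_atom.append(-(total_energy(plus) - total_energy(minus)) / (2 * h))
        forces.append(f_atom)
    return forces

positions = [[0.0, 0.0, 0.0], [1.3, 0.0, 0.0], [0.0, 0.9, 0.2]]
forces = conservative_forces(positions)

# The energy only depends on interatomic distances, so the forces
# derived from it sum to (numerically) zero.
net_force = [sum(f[d] for f in forces) for d in range(3)]
```

Because the forces come from a single scalar energy, the net force vanishes by construction. A direct force head has no such built-in constraint and has to learn it from data.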
Ultimately, MLIPs are trying to solve a big regression problem. That makes them a natural place to provide error bars for how uncertain an MLIP is about its energy and force predictions (like examining a classifier’s logits as a proxy for confidence).
And this is exactly where we need to be very careful.
Classic Uncertainty Quantification
UQ measures how confident a model is at a given task. It indicates how reliable a prediction is in the broader context of the model’s parameters (or specifically, its algorithm for making decisions) and random noise inherently present in the training data. We can define uncertainty as being epistemic, where it’s a limitation of the model itself, or aleatoric, where the limitation comes from external noise and less-than-ideal data quality. The former is addressable through better modeling practices while the latter can only be managed and not fully removed. In most cases, throwing more (balanced) data at the model or using more expressive architectures helps reduce epistemic uncertainty.
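One common way to put numbers on this split, sketched here under the assumption that each ensemble member predicts both a mean and a noise variance for the same input, is the law of total variance: the average predicted variance is read as aleatoric, and the spread of the predicted means as epistemic. The function name and the numbers are made up for illustration.

```python
from statistics import fmean, pvariance

def decompose_uncertainty(member_means, member_variances):
    """Split the predictive variance for one input into two parts.

    aleatoric: average noise variance predicted by the members
    epistemic: spread (population variance) of the members' means
    """
    aleatoric = fmean(member_variances)
    epistemic = pvariance(member_means)
    return aleatoric, epistemic

# Made-up outputs of a 4-member ensemble for a single input.
means = [1.0, 1.2, 0.8, 1.0]
variances = [0.05, 0.04, 0.06, 0.05]

aleatoric, epistemic = decompose_uncertainty(means, variances)
total = aleatoric + epistemic
```

More data or a better model can shrink the second term. The first term stays as long as the noise in the labels stays.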
In structured, closed-world contexts, the uncertainty shows up when the model is at the decision boundary between classes or categories. Knowing which way the model is leaning, in a classification problem, for example, tells us how the model perceives the world. UQ for computer vision has largely been successful, with better uncertainty estimates in medical imaging and self-driving cars. The number of categories or situations is usually finite and the boundaries between them are mostly discernible. When a model is confused between a dog and cat, the features (eg: fur texture, fur color, ear and muzzle shape, etc) lie at this blurry boundary with almost equal logits for both classes. The problem-setter can resolve this, based on their assumptions and needs, by nudging the model towards a proper prediction.
In open-world, unbounded domains, UQ fails to realise when the model is in out-of-distribution (OOD) territory. There’s little to no clear semantic boundary between objects of the same or different category. Most times, the differences between samples aren’t apparent in the space they lie in, but only when viewed in the context of the domain they’re from. Molecular modeling falls under this bucket, and as a result, UQ for force-energy prediction is at risk of jumping to unsubstantiated conclusions about absolute and relative MLIP performance.
Typical UQ practices
Here’s a flash explanation of what’s done in UQ for traditional ML.
Ensembling and measuring disagreement
The most common approach is to train multiple models on the same data and measure how much they disagree on predictions. If all the models give similar answers, you're probably in a regime where the model is confident. If they disagree wildly, something unusual is happening.
Disagreement signals uncertainty: when the model is at a decision boundary, or when the evidence is ambiguous, the models will naturally disagree more. In medical imaging, this works well: a radiologist looking at a borderline tumor and an ensemble of models will both be uncertain. But this only works if you actually have decision boundaries to measure disagreement against. In closed-world problems with discrete categories, this isn’t really a concern. In open-ended prediction problems without clear boundaries, disagreement becomes noise that’s hard to disentangle.
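Here is a minimal sketch of ensembling on a toy 1D regression problem. It assumes bootstrap resampling as the source of diversity between members (random initialisations are the more common choice for neural networks) and uses straight-line fits in place of real models. All names and numbers are illustrative.

```python
import random
from statistics import fmean, pstdev

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def train_ensemble(xs, ys, n_members, rng):
    """Fit each member on a bootstrap resample of the same data."""
    n = len(xs)
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(n) for _ in range(n)]
        members.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return members

def disagreement(members, x):
    """Standard deviation of the members' predictions at x."""
    return pstdev([a + b * x for a, b in members])

rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(50)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
members = train_ensemble(xs, ys, n_members=20, rng=rng)

near = disagreement(members, 0.5)  # inside the training range
far = disagreement(members, 5.0)   # far outside it
```

In this toy setting the disagreement grows as you move away from the data, which is the behaviour the method relies on. Nothing guarantees that richer models on richer inputs behave the same way.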
Bayesian methods
Bayesian approaches try to quantify uncertainty over the model’s parameters, not just give a point estimate. More uncertainty in parameters means more uncertainty in predictions. This gives us credible intervals (the Bayesian counterpart of confidence intervals), which is nice.
It’s elegant and sometimes works, but Bayesian methods assume that more data in the same regime reduces your uncertainty. What they’re really measuring is parameter uncertainty, and not whether your model is fundamentally broken for what you’re asking it to do.
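A small sketch of what “parameter uncertainty” means here, assuming the simplest conjugate case: a one-parameter linear model y = w * x with a Gaussian prior on w and a known noise variance. The function names and default values are my own choices for illustration.

```python
def posterior_over_slope(xs, ys, prior_var=1.0, noise_var=0.25):
    """Gaussian posterior over w in y = w * x + noise, prior w ~ N(0, prior_var)."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_var) / precision
    return mean, 1.0 / precision

def predictive_variance(x_new, post_var, noise_var=0.25):
    """Parameter uncertainty propagated to x_new, plus irreducible noise."""
    return x_new ** 2 * post_var + noise_var

xs_small = [0.1 * i for i in range(1, 6)]
ys_small = [2.0 * x for x in xs_small]
# Ten times more data drawn from the same inputs, i.e. the same regime.
xs_large, ys_large = xs_small * 10, ys_small * 10

mean_small, var_small = posterior_over_slope(xs_small, ys_small)
mean_large, var_large = posterior_over_slope(xs_large, ys_large)

# Same inputs but all-zero targets: the posterior variance is unchanged.
_, var_wrong_targets = posterior_over_slope(xs_small, [0.0] * len(xs_small))
```

Two things stand out in this toy. More data from the same regime shrinks the posterior variance, and that variance never looks at the targets at all, so it cannot tell you whether a straight line was the right model in the first place.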
Calibration and Conformal Prediction
Recent work tries to rescale uncertainty estimates after training so they actually match reality. If the model says 95% confident but is only right 70% of the time, calibration can be used to fix the mismatch. It works when the problem is just miscalibration within a known, finite setting or domain. For example, if we train a house price predictor on California houses and test on more California houses, the uncertainty might just be systematically overconfident. Calibration fixes that. Conformal prediction also gives you coverage guarantees. If we say the prediction intervals will contain the actual value 95% of the time, this could work well as long as the test data comes from the same distribution as the training data. Both are useful tools for understanding a model’s reliability in familiar territory, but they don’t tell you when you’ve left that territory.
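A minimal sketch of split conformal prediction on toy data shows both halves of that story. Everything here is an assumption for illustration: the data is y = sin(x) plus noise, and the “model” is the hypothetical predictor y = x, which is only accurate near x = 0.

```python
import math
import random

def conformal_radius(residuals, alpha=0.1):
    """Split-conformal quantile of absolute calibration residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[rank - 1]

def coverage(model, xs, ys, radius):
    """Fraction of targets inside [prediction - radius, prediction + radius]."""
    hits = sum(abs(y - model(x)) <= radius for x, y in zip(xs, ys))
    return hits / len(xs)

def sample(rng, low, high, n):
    """Toy data: y = sin(x) + Gaussian noise."""
    xs = [rng.uniform(low, high) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def model(x):
    """Hypothetical fitted model; accurate only near x = 0."""
    return x

rng = random.Random(0)
cal_x, cal_y = sample(rng, -0.5, 0.5, 500)
radius = conformal_radius([abs(y - model(x)) for x, y in zip(cal_x, cal_y)])

test_x, test_y = sample(rng, -0.5, 0.5, 500)   # same distribution
shift_x, shift_y = sample(rng, 2.0, 3.0, 500)  # a different regime

in_dist_coverage = coverage(model, test_x, test_y, radius)
shifted_coverage = coverage(model, shift_x, shift_y, radius)
```

On the in-distribution test set the coverage lands near the 90% target. On the shifted set the same intervals are still reported with the same width, and the coverage collapses without any warning from the method itself.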
Molecules are funky …
Molecules are multimodal collections of discrete atom types (the elements) and continuous 3D coordinates. This multimodality is a big culprit: the discrete and continuous features interact in ways that create hard-to-find and hard-to-quantify regime changes in the model’s representation space, making it hard to demarcate clear semantic boundaries. Unlike vision problems (where pixel space contains the semantic separation) or generic regression settings, small discrete changes can shift the entire physico-chemical landscape. Continuous changes in coordinates are actually governed by the underlying energy landscape, where certain conformers are allowed to varying degrees, while other conformations are completely illegal and violate physics. The blurry but discernible boundaries don’t really exist the same way they do for images, text, and audio.
The next big issue is OOD. Given that existing large-scale datasets oversample equilibrium/meta-stable conformations with relatively low potential energy, they do not cover the illegal conformations; perhaps the closest we can get are the non-equilibrium states in datasets like OMol25. Interpolating between two valid conformations doesn’t necessarily give us a third valid conformation, from either an energy-landscape or a geometric-validity perspective. When given a unique conformation (ie, a new regime) that’s not present in the training data [2], there’s no guarantee the conformation is valid, and the model gives some unreliable force-energy prediction. We can’t say, for sure, whether this falls under epistemic uncertainty or aleatoric uncertainty. You can imagine a magic third category of poor model specification, where there are so many modeling assumptions not adequately addressed. There are also large gaps in the force and energy landscapes without any representative states. Any predictions from the model are at best an overconfident interpolative guess.
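The interpolation point is easy to check with a toy geometry. This sketch assumes a single bonded atom (think of a hydrogen on a carbon, at an illustrative bond length of 1.09) that sits at two positions related by a 120 degree rotation, and then takes the naive Cartesian midpoint.

```python
import math

def rotate_z(point, degrees):
    """Rotate a 3D point about the z-axis."""
    t = math.radians(degrees)
    x, y, z = point
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t),
            z)

def lerp(p, q, t):
    """Linear interpolation between two points in Cartesian space."""
    return tuple((1 - t) * a + t * b for a, b in zip(p, q))

center = (0.0, 0.0, 0.0)    # the atom the bond is attached to
h_a = (1.09, 0.0, 0.0)      # bonded atom in conformer A
h_b = rotate_z(h_a, 120)    # same atom in conformer B (rotated by 120 degrees)

h_mid = lerp(h_a, h_b, 0.5)  # naive midpoint "conformer"

bond_a = math.dist(center, h_a)
bond_b = math.dist(center, h_b)
bond_mid = math.dist(center, h_mid)
```

Both endpoints have the same bond length, but the midpoint compresses the bond to half of it (a factor of cos 60°). Two valid geometries, averaged, give a geometry no reasonable dataset would contain.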
To know you’re in OOD territory assumes you have a way of computing some form of distance from the training samples. If I have a dog and cat dataset, it’s easy to say monkeys are OOD. For molecules, distances in embedding space preserve no such notion. Two molecules being far away in embedding/representation space can still be valid from a physics standpoint (eg: tautomers, chameleon sequences, fold-switching proteins). Surprisingly, even if you’re within distribution, there’s a lot more nuance to molecules that can’t be captured by their chemical formula (or sequence) and 3D structure. A small molecule with a valid conformation and a newly embedded molecule with the same formula but completely invalid conformation can look similar in embedding space. Only deep domain knowledge can be used to delineate boundaries undescribed by formula/sequence and structure.

UQ for MLIPs is hard
I mention that previous bit on regime changes and OOD because a lot of UQ methods in vision heavily rely on disagreement to measure uncertainty. Thankfully, the UQ for MLIP literature agrees that ensemble disagreement correlates poorly with MLIP error. Kurniawan et al. (2025) say uncertainty estimates "behave counterintuitively in OOD settings, often plateauing or even decreasing as predictive errors grow". PROBE shows "member-disagreement signals correlate weakly with per-molecule prediction error", leading them to pivot toward learned classifiers on backbone embeddings instead [3]. But we don’t really need better model ensemble design (which is difficult, needless to say). We need to move away from purely measuring disagreement in-distribution.
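If you want to check this on your own model, one simple diagnostic is the rank correlation between per-structure uncertainty and per-structure error. Below is a small standard-library sketch of Spearman correlation. The numbers are synthetic and purely illustrative; they are not taken from either paper.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

# Synthetic numbers for illustration only.
errors = [0.1, 0.2, 0.4, 0.8, 1.6, 3.2]
tracking_uq = [0.05, 0.1, 0.3, 0.5, 0.9, 2.0]  # rises with the error
plateau_uq = [0.05, 0.1, 0.3, 0.3, 0.25, 0.2]  # plateaus, then drops

rho_tracking = spearman(errors, tracking_uq)
rho_plateau = spearman(errors, plateau_uq)
```

An uncertainty that keeps rising with the error ranks perfectly. One that plateaus and then dips scores much lower, even though it looks reasonable on the low-error structures.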
Recent flexible calibration and conformal prediction methods improve within-distribution uncertainty estimates, which is useful for active learning and relative ranking. But they don’t yet solve the regime-crossing issue. Calibration makes your wrongness more self-aware by rescaling uncertainty within distribution, not across regimes. Conformal prediction gives coverage guarantees, but only when the IID assumption (or, more generally, exchangeability) is fulfilled. The moment a sample leaves the training distribution, those guarantees evaporate.
This is why the reactive chemistry gap exists. Zhao et al. (2025) found that all pre-trained universal MLIPs struggle with transition state search. Transition states live in a fundamentally different regime from equilibrium structures. They have different bonding patterns, different electronic structure, different force landscapes. Here’s exactly where this fails: when you feed an MLIP trained only on equilibrium geometries a transition state, it sees familiar local features. C-C bonds are at roughly normal lengths. H-C-H angles look normal. All ensemble members trained on the same equilibrium data will confidently agree on the prediction because nothing really is locally “wrong” or different for the models. But again, the regime has completely changed.
Rather than chasing universality for MLIPs or better statistical UQ, we need better infrastructure. What could this look like? Maybe regime-aware training where you explicitly stratify data by chemistry type (equilibrium, reactive, high-energy, defects). We can stop pretending datasets like OMol25 are one homogeneous collection. This might be slightly expensive and labor-intensive, but we can explicitly annotate and separate structures that are at their minima, are high-energy, have defects, or are undergoing partial bond breaking. Separate models can be trained for each regime, or we can include those regime labels as explicit conditioning information. Demarcating all this makes the assessment of MLIPs more honest, since users no longer accidentally use MLIPs in the wrong regime and we can move away from universality. We can also better decide where to focus active learning efforts. Beyond that, we could add physics-informed flags that signal when you might be extrapolating (bond lengths approaching breaking points, temperatures far above the training range), and validation-indexed uncertainty where we’re more honest about scope: “this model is reliable on these regimes, untested on those”. Imagine saying something more specific like “this MLIP is validated to ±2 kcal/mol on equilibrium organics (~300 K) and is not validated on transition states, metals, or anything >500 K.”
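As a rough sketch of what physics-informed flags plus a declared validation scope could look like in code: the `ValidatedScope` record, the `extrapolation_flags` function, the regime names and the 1.3x bond-stretch threshold are all hypothetical choices of mine, not an existing API or an established cutoff.

```python
import math
from dataclasses import dataclass

@dataclass
class ValidatedScope:
    """Hypothetical record of where a model has actually been validated."""
    regimes: frozenset
    max_temperature_k: float
    max_bond_stretch: float  # allowed ratio of bond length to its reference

def extrapolation_flags(positions, bonds, temperature_k, regime, scope):
    """Return human-readable warnings; an empty list means nothing was flagged.

    bonds maps an (i, j) atom-index pair to a reference bond length.
    """
    flags = []
    if regime not in scope.regimes:
        flags.append(f"regime '{regime}' is not validated")
    if temperature_k > scope.max_temperature_k:
        flags.append(f"temperature {temperature_k} K exceeds validated range")
    for (i, j), reference in bonds.items():
        ratio = math.dist(positions[i], positions[j]) / reference
        if ratio > scope.max_bond_stretch:
            flags.append(f"bond {i}-{j} stretched to {ratio:.2f}x reference")
    return flags

# Illustrative scope: equilibrium structures up to 500 K, bonds up to 1.3x.
scope = ValidatedScope(frozenset({"equilibrium"}), 500.0, 1.3)

positions = [(0.0, 0.0, 0.0), (1.54, 0.0, 0.0), (3.8, 0.0, 0.0)]
bonds = {(0, 1): 1.54, (1, 2): 1.54}

in_scope = extrapolation_flags(positions[:2], {(0, 1): 1.54}, 300.0, "equilibrium", scope)
out_of_scope = extrapolation_flags(positions, bonds, 800.0, "transition_state", scope)
```

None of this requires a better uncertainty estimator. It only requires writing down what the model was validated on and checking inputs against it.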
I was on a call with someone working on RNA and molecular dynamics recently and the topic of hybrid workflows came up. It’s something I’m excited to see. Imagine hybrid setups where the MLIP rapidly screens structures during exploration (MD, structure optimization, pathway search), uncertainty and physics guide which structures to validate with DFT or experiment, and validated results feed back into training for the appropriate regime. But knowing when to validate also becomes a modeling design choice: should we validate when MLIP uncertainty is high, or when MLIP uncertainty and physics flags disagree, or when the MLIP is confident but in an unseen regime?
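One possible policy, sketched below, is to check all three triggers and record which one fired, so the reason for each DFT call is auditable later. The function name, the uncertainty threshold and the order of the checks are assumptions of mine.

```python
def validation_reason(uncertainty, physics_flags, regime_seen, threshold=0.1):
    """Return why a structure should go to DFT, or None if it should not.

    uncertainty:   the MLIP's own uncertainty estimate for this structure
    physics_flags: list of physics-based warnings (empty if none)
    regime_seen:   whether this regime was covered during validation
    """
    high_uncertainty = uncertainty > threshold
    flagged = bool(physics_flags)
    if not high_uncertainty and not regime_seen:
        return "confident but in an unseen regime"
    if high_uncertainty != flagged:
        return "uncertainty and physics flags disagree"
    if high_uncertainty:
        return "high uncertainty"
    return None

routine = validation_reason(0.02, [], regime_seen=True)
unseen = validation_reason(0.02, [], regime_seen=False)
conflict = validation_reason(0.02, ["bond stretched"], regime_seen=True)
uncertain = validation_reason(0.5, ["bond stretched"], regime_seen=True)
```

Only the first case is skipped. The “confident but unseen” case is checked first on purpose, since it is the one a purely uncertainty-driven loop would never send for validation.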
Fortunately, I have yet to come across anyone in the community who believes universal MLIPs are possible in the short term. There’s a healthy skepticism in how the community is approaching this, and while we’re still training on large homogeneous collections of molecular data, we’re aware of the gaps that prevent universality.
Tiny epilogue
Slight digression, but something else that popped up during that conversation on RNA and MD was the practical utility of MLIP-like architectures. There are so many leaderboards and competitions (eg: Open Catalyst) for MLIPs but pharma companies really love their scoring and property prediction models. So the big question on my mind is whether we really do care about MLIPs and can give them a proper application, or whether we really just care about building powerful, expressive architectures for MLIPs that will eventually be repurposed for property prediction and scoring. Food for thought.
[1] Obviously, given my leanings towards geometry-informed methods, I’m in the pro-equivariance crowd, or as Erik Bekkers calls it, equivariance extremists.
[2] By this, I specifically mean if that exact new conformation is not represented or covered in the set of all unique conformations present in the data for the same molecule or molecules that are structurally very similar (eg: proteins).
[3] Big fan of Olexandr Isayev’s work!

