Validation — Predicted vs Experiment

At a glance

Mean absolute error (MAE) of MoleBench against experiment, across the benchmark sets below:

Property	Method	MAE	Verdict
Bond lengths	GFN2-xTB geometry	≈ 0.007 Å	excellent
Dipole moment	GFN2-xTB	≈ 0.22 D	good (slight over-prediction)
¹H chemical shift	GIAO B3LYP/6-31G*	≈ 0.15 ppm	good
¹³C chemical shift	GIAO B3LYP/6-31G*	≈ 2 ppm*	good
pKa	GFN2 ΔG + per-class scaling	≈ 0.3 units	good (per functional group)
UV-Vis λ_max	TD-B3LYP/6-31G*	≈ 5 nm	good (basis-dependent)

*Excluding one gas-phase carboxylic-acid outlier (the acetic-acid carbonyl), discussed below. All calculations were run on this site; you can reproduce any of them in the Studio.

The honest one-liner: MoleBench is excellent for geometry and trends, and quantitatively good for NMR, dipoles and (after a per-class re-fit) pKa. Use it to understand and compare, and reach for the literature when you need a publication-grade number.

Geometry — GFN2-xTB bond lengths

Getting the shape right is the foundation of everything else, and here MoleBench's default engine is genuinely strong: bond lengths land within about a hundredth of an ångström of experiment.

Molecule	Bond	MoleBench (Å)	Experiment (Å)	Δ
water	O–H	0.959	0.958	+0.001
methane	C–H	1.082	1.087	−0.005
ethane	C–C	1.522	1.535	−0.013
ethane	C–H	1.088	1.094	−0.006
benzene	C–C	1.385	1.397	−0.012
benzene	C–H	1.080	1.084	−0.004
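
As a check on the headline figure, the mean absolute error can be recomputed directly from the six rows above. This is a minimal sketch: the function and variable names are illustrative and are not part of MoleBench.

```python
def mean_absolute_error(pairs):
    """Average of |predicted - experiment| over (predicted, experiment) pairs."""
    return sum(abs(calc - exp) for calc, exp in pairs) / len(pairs)

# (molecule, bond, MoleBench Å, experiment Å), copied from the table above
BOND_LENGTHS = [
    ("water", "O-H", 0.959, 0.958),
    ("methane", "C-H", 1.082, 1.087),
    ("ethane", "C-C", 1.522, 1.535),
    ("ethane", "C-H", 1.088, 1.094),
    ("benzene", "C-C", 1.385, 1.397),
    ("benzene", "C-H", 1.080, 1.084),
]

bond_mae = mean_absolute_error([(calc, exp) for _, _, calc, exp in BOND_LENGTHS])
print(f"bond-length MAE = {bond_mae:.4f} Å")
```

The result rounds to the ≈ 0.007 Å quoted in the summary table.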

Dipole moments — GFN2-xTB

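The dipole tests the electronic structure, not just the shape. Trends are captured well: non-polar benzene comes out at zero, acetonitrile (the most polar) comes out most polar, and the only rank swap is between water and chloromethane, whose experimental values differ by just 0.02 D. There is a systematic over-prediction of a few tenths of a debye on most of the polar molecules, largest for acetone (+0.54 D); formaldehyde and acetonitrile are the exceptions.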

Molecule	MoleBench (D)	Experiment (D)	Δ
benzene	0.00	0.00	0.00
formaldehyde	2.33	2.33	0.00
acetonitrile	3.85	3.92	−0.07
chloromethane	2.04	1.87	+0.17
methanol	1.92	1.70	+0.22
dimethyl ether	1.58	1.30	+0.28
ammonia	1.78	1.47	+0.31
water	2.22	1.85	+0.37
acetone	3.42	2.88	+0.54
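
The two claims about this table, a positive bias and a preserved ordering, can be checked separately. The sketch below computes the mean signed error and lists any pair of molecules whose predicted order is the reverse of the experimental one. The helper names are illustrative only.

```python
from itertools import combinations

# (molecule, MoleBench D, experiment D), copied from the table above
DIPOLES = [
    ("benzene", 0.00, 0.00),
    ("formaldehyde", 2.33, 2.33),
    ("acetonitrile", 3.85, 3.92),
    ("chloromethane", 2.04, 1.87),
    ("methanol", 1.92, 1.70),
    ("dimethyl ether", 1.58, 1.30),
    ("ammonia", 1.78, 1.47),
    ("water", 2.22, 1.85),
    ("acetone", 3.42, 2.88),
]

def mean_signed_error(rows):
    """Positive values mean the prediction is too large on average."""
    return sum(calc - exp for _, calc, exp in rows) / len(rows)

def swapped_pairs(rows):
    """Pairs whose predicted order is the reverse of the experimental order."""
    swapped = []
    for (name_a, calc_a, exp_a), (name_b, calc_b, exp_b) in combinations(rows, 2):
        if (calc_a - calc_b) * (exp_a - exp_b) < 0:
            swapped.append((name_a, name_b))
    return swapped

dipole_bias = mean_signed_error(DIPOLES)
dipole_swaps = swapped_pairs(DIPOLES)
print(f"mean signed error = {dipole_bias:+.2f} D")
print("rank swaps:", dipole_swaps)
```

On these nine molecules the bias is about +0.2 D, and chloromethane/water is the only swapped pair.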

¹³C NMR — GIAO B3LYP/6-31G* (Advanced tier)

The quantum NMR is calibrated against experiment and performs well across a 200-ppm range, from a shielded methyl to a deshielded carbonyl.

Molecule	Carbon	MoleBench (ppm)	Experiment (ppm)	Δ
benzene	CH	128.7	128.5	+0.2
acetone	C=O	205.7	206.0	−0.3
acetone	CH₃	28.2	30.9	−2.7
toluene	C1 (ipso)	139.2	137.8	+1.4
toluene	C2–C6 (avg)	128.1	127.8	+0.3
toluene	CH₃	22.3	21.4	+0.9
methanol	CH₃	53.1	50.4	+2.7
ethanol	CH₂	61.9	58.4	+3.5
acetic acid	C=O	168.7	178.1	−9.4

⚠ The acetic-acid carbonyl is the one large miss — and it's instructive. The calculation is gas-phase, single molecule; real acetic acid hydrogen-bonds and dimerizes, which shifts that carbon by several ppm. This is a model error (the chemistry of the environment), not a method failure — and it's exactly why we show it. Most carbons land within ~2–3 ppm.
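
To see how much that single carbon moves the summary statistic, the sketch below recomputes the mean and median absolute error with and without it. It uses only the values in the table above; the names are illustrative.

```python
from statistics import mean, median

# (molecule, carbon, MoleBench ppm, experiment ppm), copied from the table above
C13_SHIFTS = [
    ("benzene", "CH", 128.7, 128.5),
    ("acetone", "C=O", 205.7, 206.0),
    ("acetone", "CH3", 28.2, 30.9),
    ("toluene", "C1", 139.2, 137.8),
    ("toluene", "C2-C6", 128.1, 127.8),
    ("toluene", "CH3", 22.3, 21.4),
    ("methanol", "CH3", 53.1, 50.4),
    ("ethanol", "CH2", 61.9, 58.4),
    ("acetic acid", "C=O", 168.7, 178.1),
]

def abs_errors(rows):
    return [abs(calc - exp) for _, _, calc, exp in rows]

OUTLIER = ("acetic acid", "C=O")
all_errors = abs_errors(C13_SHIFTS)
without_outlier = abs_errors([r for r in C13_SHIFTS if (r[0], r[1]) != OUTLIER])

for label, errs in (("all nine carbons", all_errors),
                    ("without acetic-acid C=O", without_outlier)):
    print(f"{label}: MAE = {mean(errs):.2f} ppm, median = {median(errs):.2f} ppm")
```

The mean drops from about 2.4 ppm to 1.5 ppm when the outlier is excluded, while the median changes far less, which is the usual signature of one environment-driven miss rather than a broad method error.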

¹H NMR — GIAO B3LYP/6-31G*

Molecule	Proton	MoleBench (ppm)	Experiment (ppm)	Δ
ethanol	CH₃	1.16	1.21	−0.05
toluene	CH₃	2.32	2.34	−0.02
acetone	CH₃	1.95	2.09	−0.14
benzene	ArH	7.12	7.26	−0.14
acetic acid	CH₃	1.93	2.10	−0.17
methanol	CH₃	3.66	3.40	+0.26
ethanol	CH₂	3.96	3.69	+0.27

O–H / N–H protons are omitted: they are dominated by hydrogen bonding and concentration, so a gas-phase value is not comparable to a solution measurement.

pKa — GFN2 deprotonation + per-class calibration

This one has a story. An earlier single global calibration over-predicted carboxylic-acid pKa values by 1.5–2 units, a bias this very benchmark exposed. The cause: the GFN2 deprotonation energy maps to pKa with a class-dependent slope (carboxylic acids, phenols and alcohols each follow a different line), so no single line can fit them all. We re-fit per functional-group class against a 20-acid set spanning pKa 0–17. The bias is gone, and the error dropped to ~0.3 units:

Acid	Class	MoleBench	Experiment	Δ
trifluoroacetic acid	acid	0.1	0.23	−0.1
formic acid	acid	3.3	3.75	−0.5
acetic acid	acid	4.7	4.76	−0.1
benzoic acid	acid	4.6	4.20	+0.4
p-nitrophenol	phenol	6.4	7.15	−0.8
phenol	phenol	10.1	9.99	+0.1
thiophenol	thiol	6.8	6.62	+0.2
ethanol	alcohol	16.1	16.0	+0.1
phosphoric acid	P-oxyacid	2.0	2.15	−0.2
methanesulfonic acid	S-oxyacid	−2.0	−1.9	−0.1

Held out from the calibration set, then predicted blind: 2-naphthol 9.8 (exp 9.51) and propanoic acid 4.7 (exp 4.87). Both land within 0.3 units, which suggests the fit generalizes rather than memorizes. Acidity ranking across 17 units is reliable, and absolute values are now good to roughly ±0.5 unit for the common classes; p-nitrophenol, at −0.8, is the largest miss in the table above.
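
The idea behind the per-class re-fit can be sketched as one least-squares line per functional-group class. The code below is an illustration only, under loud assumptions: the map is taken to be linear (pKa ≈ slope × ΔG + intercept), the function names are hypothetical, and the data are synthetic points generated from two made-up lines in arbitrary energy units. They are not GFN2 output and not MoleBench's calibration set.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_per_class(records):
    """records: iterable of (class_name, delta_g, experimental_pka)."""
    grouped = defaultdict(list)
    for cls, delta_g, pka in records:
        grouped[cls].append((delta_g, pka))
    return {cls: fit_line(*zip(*points)) for cls, points in grouped.items()}

def mean_abs_residual(records, line_for):
    """line_for(class_name) returns the (slope, intercept) to apply."""
    total = 0.0
    for cls, delta_g, pka in records:
        slope, intercept = line_for(cls)
        total += abs(slope * delta_g + intercept - pka)
    return total / len(records)

# SYNTHETIC placeholder data: class_A follows pKa = 0.5*dG - 10,
# class_B follows pKa = 0.8*dG - 25. Not real deprotonation energies.
SYNTHETIC = [
    ("class_A", 20.0, 0.0), ("class_A", 28.0, 4.0), ("class_A", 30.0, 5.0),
    ("class_B", 40.0, 7.0), ("class_B", 44.0, 10.2), ("class_B", 50.0, 15.0),
]

per_class = fit_per_class(SYNTHETIC)
pooled = fit_line([r[1] for r in SYNTHETIC], [r[2] for r in SYNTHETIC])

per_class_mae = mean_abs_residual(SYNTHETIC, lambda cls: per_class[cls])
pooled_mae = mean_abs_residual(SYNTHETIC, lambda cls: pooled)
print(f"per-class MAE = {per_class_mae:.3f}, single-line MAE = {pooled_mae:.3f}")
```

When the classes genuinely follow different slopes, as in this toy data, a single pooled line leaves a residual on every class that the per-class fit removes.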

Two honest edges, now flagged in the tool itself. Phosphorus/sulfur oxyacids (phosphoric, phosphonic, sulfonic) were originally mis-scored as alcohols (phosphoric came out ~8 instead of ~2); they now use dedicated P- and S-oxyacid classes and carry an "approximate, strong acid" note. Amino acids are detected and labelled: the gas-phase neutral model can't form the zwitterion that dominates in water, so glycine's −COOH reads ≈3.9 rather than the measured ≈2.35 — the tool now says so up front instead of quietly handing you the wrong number.

UV-Vis & why the basis set matters

UV-Vis is a great lesson in how method choice drives accuracy. The strong π→π* absorption of paracetamol (experimental λ_max ≈ 243–249 nm) marches steadily toward experiment as the basis set improves:

Basis set	Predicted λ_max	Δ vs exp (≈244 nm), in nm
STO-3G (minimal)	212 nm	−32
3-21G (Quick tier)	235 nm	−9
6-31G* (Advanced tier)	240 nm	−4
6-31+G* (diffuse)	246 nm	+2

This is why the Studio's Quick UV uses 3-21G and Advanced uses 6-31G*. TD-DFT is reliable for ordinary valence (π→π*, n→π*) excitations but should not be trusted for charge-transfer or Rydberg states.
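
Because excitation errors are often quoted in eV rather than nm, it can help to convert the table above using E = hc/λ with hc ≈ 1239.84 eV·nm. The sketch below does only that conversion; the 244 nm reference is the value used in the table header, and the names are illustrative.

```python
HC_EV_NM = 1239.842  # Planck constant times speed of light, in eV·nm

def nm_to_ev(wavelength_nm):
    """Photon energy in eV for a wavelength in nm."""
    return HC_EV_NM / wavelength_nm

EXPERIMENT_NM = 244.0
PREDICTED_NM = {"STO-3G": 212.0, "3-21G": 235.0, "6-31G*": 240.0, "6-31+G*": 246.0}

energy_errors = {
    basis: nm_to_ev(lam) - nm_to_ev(EXPERIMENT_NM)
    for basis, lam in PREDICTED_NM.items()
}

for basis, lam in PREDICTED_NM.items():
    print(f"{basis:8s} {lam - EXPERIMENT_NM:+5.0f} nm  {energy_errors[basis]:+.2f} eV")
```

A wavelength that is too short corresponds to an excitation energy that is too high, so the minimal basis overestimates the energy by roughly 0.77 eV while 6-31G* is within about 0.1 eV.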

The instant (empirical) NMR — and its honest limits

The Quick NMR returns shifts in milliseconds using substituent-additivity rules. For the chemistry it was built for it is remarkably good — but it knows its limits, and now tells you so.

Molecule	Works well?	Why
aspirin, paracetamol, toluene	yes (±~2–3 ppm)	substituted benzenes + carbonyls + simple aliphatics — its sweet spot
pyridine, furan (heteroaromatic)	no	additivity has no good base values; flagged "low confidence"
caffeine (fused rings)	no	fused/heteroaromatic; flagged, with a "Run Advanced" button

When the instant estimate is unreliable, MoleBench shows a warning and offers to run the real quantum calculation instead, so a fast estimate never masquerades as a trustworthy one. The Advanced (quantum) tier handles these hard cases: here it is on the kind of heteroaromatics the instant tier flags, within ~2–3 ppm of experiment:

Molecule	Carbon	Advanced (QM)	Experiment	Δ
pyridine	C2/C6	151.8	149.9	+1.9
pyridine	C4	136.3	136.0	+0.3
pyridine	C3/C5	124.2	123.8	+0.4
furan	C2/C5	141.3	142.8	−1.5
furan	C3/C4	111.6	109.6	+2.0
thiophene	C2/C5	124.8	125.4	−0.6
thiophene	C3/C4	129.9	127.2	+2.7

The two tiers are complementary by design: the instant estimate for speed on its sweet spot, the quantum calculation for the cases it can't reach — and the tool always tells you which one you should be using.

How to read this

Trends are more reliable than absolutes. Comparing two similar molecules with the same method cancels most systematic error — relative answers are the safest use of any of these tools.
The model matters as much as the method. Several of the larger errors above are gas-phase vs. solution effects, not the quantum chemistry being wrong.
Pick the right tier. Quick tiers are for speed and exploration; Advanced (quantum) tiers are for the numbers you'll quote.
Everything here is reproducible. Build any of these molecules in the Studio and run the same calculation yourself.
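
The first point above, that trends are more reliable than absolutes, can be seen in the ¹³C table itself. The sketch below compares the error on two individual shifts with the error on their difference, using the methanol CH₃ and ethanol CH₂ rows; the helper names are illustrative.

```python
# (MoleBench ppm, experiment ppm) for two 13C entries from the tables above
METHANOL_CH3 = (53.1, 50.4)
ETHANOL_CH2 = (61.9, 58.4)

def absolute_error(entry):
    """Signed error of a single predicted shift."""
    calc, exp = entry
    return calc - exp

def relative_error(entry_a, entry_b):
    """Signed error of the predicted shift difference (a minus b)."""
    return absolute_error(entry_a) - absolute_error(entry_b)

print(f"methanol CH3 error:  {absolute_error(METHANOL_CH3):+.1f} ppm")
print(f"ethanol CH2 error:   {absolute_error(ETHANOL_CH2):+.1f} ppm")
print(f"error on difference: {relative_error(ETHANOL_CH2, METHANOL_CH3):+.1f} ppm")
```

Each shift is off by about 3 ppm on its own, but both are off in the same direction, so the predicted gap between the two carbons is wrong by less than 1 ppm.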


Benchmark run on the live MoleBench compute service. Experimental values from standard reference compilations (CRC Handbook, NIST, SDBS and the primary literature). Updated 2026-06-25.
