Reliability isn't a claim you make — it's a number you measure. The flagship FetalGrowth engine — the deepest module in ObGynAssist™ — is held to an independent guideline oracle, replayed bit-for-bit across thousands of seeds, and gated in continuous integration so a regression cannot merge. Every figure below is reproducible from the repository, and the honest limits are stated as plainly as the wins.
Each synthetic case is independently graded by a guideline oracle that knows nothing of the engine's internals. Across 2,994 graded cases, the engine and the oracle agreed on every one.
2,994 graded of a 3,000-case synthetic corpus · 0 under-calls · 0 over-calls
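The tally behind a line like the one above can be sketched in a few lines. This is a minimal illustration, not the project's grading harness: the `GradedCase` type, the `tally` function and the ordinal `SEVERITY` scale are assumptions made for the example, and the four sample cases are invented to show how under-calls and over-calls are counted.

```python
from dataclasses import dataclass

# Hypothetical ordinal scale: a higher index means a more urgent classification.
SEVERITY = ["outpatient", "admit-consider", "inpatient"]


@dataclass(frozen=True)
class GradedCase:
    engine_label: str
    oracle_label: str


def tally(cases):
    """Count agreements, under-calls and over-calls against the oracle."""
    counts = {"agree": 0, "under_call": 0, "over_call": 0}
    for case in cases:
        engine = SEVERITY.index(case.engine_label)
        oracle = SEVERITY.index(case.oracle_label)
        if engine == oracle:
            counts["agree"] += 1
        elif engine < oracle:
            counts["under_call"] += 1  # engine less urgent than the oracle
        else:
            counts["over_call"] += 1   # engine more urgent than the oracle
    return counts


# Invented sample data, including one deliberate under-call and one over-call.
sample = [
    GradedCase("outpatient", "outpatient"),
    GradedCase("inpatient", "inpatient"),
    GradedCase("outpatient", "inpatient"),
    GradedCase("inpatient", "admit-consider"),
]
result = tally(sample)
concordance = result["agree"] / len(sample)
```

Full concordance in this sketch means both error counters stay at zero, which is the shape of the claim in the figures above.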
Concordance is agreement with an independent guideline oracle on synthetic cases — it measures how faithfully the engine reproduces standard-of-care references like RCOG, NICE and ISUOG. It is not a measure of neonatal outcomes, and it does not claim clinical superiority. A prospective clinical-validation study — perinatologist assessment against the engine — is in design. We publish the limit as prominently as the result on purpose.
A clinical decision can't rest on a coin flip. Across 50 distinct configurations replayed over 1,000 seeds each, every single run produced a byte-identical output. No sampling, no drift, no hidden randomness — the engine is a pure function of its input.
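A replay check of this kind can be sketched as follows. The `toy_engine` function is a hypothetical stand-in for the real engine, and the 10th-centile cut-off and rule id are illustrative assumptions. The idea is to hash a canonical serialization of each output and confirm that only one distinct digest ever appears, whatever the random seed.

```python
import hashlib
import json
import random


def toy_engine(case):
    """Hypothetical stand-in for the engine: a pure function of its input."""
    flagged = case["efw_centile"] < 10.0  # illustrative cut-off
    return {
        "label": "SGA" if flagged else "AGA",
        "firedRules": ["R-demo-1"] if flagged else [],
    }


def output_digest(output):
    # Canonical JSON (sorted keys, fixed separators) so equal outputs give equal bytes.
    blob = json.dumps(output, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def replay(case, seeds):
    """Run the engine once per seed and collect the distinct output digests."""
    digests = set()
    for seed in seeds:
        random.seed(seed)  # perturb global RNG state; a pure engine must not care
        digests.add(output_digest(toy_engine(case)))
    return digests


digests = replay({"efw_centile": 7.5}, seeds=range(1000))
```

A set with exactly one digest means every run returned identical bytes. More than one would point to hidden state or randomness.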
Reproducibility is also forward-looking: every output is stamped with its engineVersion and a full-content rulesetSHA, so any decision can be replayed bit-for-bit years later against the exact ruleset that produced it. The result is commit-backed in the repository.
stamp: engineVersion + rulesetSHA + firedRules[]
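One way such a stamp could be built is sketched below. The field names `engineVersion`, `rulesetSHA` and `firedRules` come from the line above; everything else is an assumption for illustration, including the use of SHA-256 over canonical JSON, the helper names and the demo version string.

```python
import hashlib
import json


def ruleset_sha(rules):
    """Hash the full ruleset content (assumed here: SHA-256 over canonical JSON)."""
    blob = json.dumps(rules, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def stamp(output, engine_version, rules, fired_rules):
    """Attach provenance fields to an output without mutating the original."""
    return {
        **output,
        "engineVersion": engine_version,
        "rulesetSHA": ruleset_sha(rules),
        "firedRules": list(fired_rules),
    }


def can_replay(stamped, engine_version, rules):
    """True only when the engine version and ruleset match the stamp exactly."""
    return (
        stamped["engineVersion"] == engine_version
        and stamped["rulesetSHA"] == ruleset_sha(rules)
    )


# Hypothetical ruleset and version string, for illustration only.
rules = [{"id": "R-demo-1", "cutoff": 10.0}]
edited_rules = [{"id": "R-demo-1", "cutoff": 5.0}]
stamped = stamp({"label": "SGA"}, "0.0.0-demo", rules, ["R-demo-1"])
```

Because the hash covers the whole ruleset content, changing a single cut-off gives a different `rulesetSHA`, so an old decision cannot be silently replayed against edited rules.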
Illustration with a fixed example case — in the app, the live engine computes this on-device. The point here is the replay: every run returns the identical bytes. Outputs are classification labels that support clinician judgment; clinician evaluation required.
A second, version-pinned oracle (v1.0.0) runs inside continuous integration. It holds 99.3% concordance on 1,000 synthetic cases — but the point isn't only the number. The floor is enforced: if agreement drops below it, the build fails and the change does not land.
99.3% overall concordance on 1,000 synthetic cases against the CI-locked oracle v1.0.0.
the enforced overall floor. Drop below it and the build fails — the regression cannot merge.
a separate floor on each of the six graded decision dimensions — no single axis can quietly degrade.
Two oracles, two purposes. Metric A, the 2,994-case result above, comes from a continuously evolving tooling oracle used in development; Metric B is the frozen, version-pinned reference (v1.0.0) whose floors are wired into the test gate. Together they keep clinical accuracy from regressing silently between releases: the engine has to keep earning its concordance, build after build.
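A floor gate of this kind might look like the sketch below. The floor values, the dimension names and the per-dimension scores are all hypothetical placeholders; only the 99.3% overall figure is taken from the text. In a CI job, a non-empty failure list would typically be turned into a non-zero exit code.

```python
OVERALL_FLOOR = 0.95        # hypothetical value, not the project's actual floor
PER_DIMENSION_FLOOR = 0.90  # hypothetical value, not the project's actual floor


def floor_failures(overall, per_dimension):
    """Return a human-readable message for every floor that is breached."""
    failures = []
    if overall < OVERALL_FLOOR:
        failures.append(f"overall {overall:.3f} < {OVERALL_FLOOR:.3f}")
    for name in sorted(per_dimension):
        value = per_dimension[name]
        if value < PER_DIMENSION_FLOOR:
            failures.append(f"{name} {value:.3f} < {PER_DIMENSION_FLOOR:.3f}")
    return failures


# Invented per-dimension scores; the six names are placeholders.
healthy = {
    "growth_class": 0.99,
    "disposition": 0.99,
    "acuity": 0.99,
    "deliver_now": 1.00,
    "surveillance": 0.99,
    "delivery_window": 0.99,
}
degraded = dict(healthy, disposition=0.85)

ok = floor_failures(0.993, healthy)
blocked = floor_failures(0.993, degraded)
```

The second call shows why per-dimension floors matter: the overall figure is unchanged, yet the build would still be blocked because one axis slipped.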
Concordance answers "is it right on typical cases?" These axes answer the harder question — "what happens at the edges, under mutation, and across paths that ought to agree?"
Deliberate faults are injected into the engine to confirm the suite notices. A first pass killed 15 / 15 mutants; the campaign later extended to 27 mutants.
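The idea can be shown in miniature. In this sketch the `classify` function, its 10th-centile threshold and the three hand-written mutants are illustrative assumptions. A real campaign would generate mutants from the engine source, but the principle is the same: a mutant is "killed" when at least one test fails against it.

```python
def classify(centile, threshold=10.0):
    """Hypothetical classifier under test."""
    return "SGA" if centile < threshold else "AGA"


def suite_passes(fn):
    """A tiny test suite; True only when every check passes."""
    checks = [(9.9, "SGA"), (10.0, "AGA"), (50.0, "AGA"), (3.0, "SGA")]
    return all(fn(centile) == expected for centile, expected in checks)


# Hand-written mutants, each injecting one deliberate fault.
mutants = {
    "boundary uses <= instead of <": lambda c: "SGA" if c <= 10.0 else "AGA",
    "threshold shifted to 5": lambda c: "SGA" if c < 5.0 else "AGA",
    "labels swapped": lambda c: "AGA" if c < 10.0 else "SGA",
}

# A mutant survives if the suite still passes with the fault in place.
killed = [name for name, mutant in mutants.items() if not suite_passes(mutant)]
survived = [name for name in mutants if name not in killed]
```

A surviving mutant is the useful signal here: it marks a fault the suite cannot see, and therefore a test that still needs writing.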
Twin cases evaluated through independent code paths are compared verdict-for-verdict. Across 504 twin cases: 0 divergent verdicts.
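Read as a parity check between two independently written code paths, the comparison could be sketched like this. Both implementations, the cut-offs and the labels are assumptions for illustration: one path uses explicit branching, the other a table lookup, and the harness lists every input on which they disagree.

```python
import bisect

# Hypothetical centile cut-offs and the labels for the bands between them.
CUTS = [3.0, 10.0, 90.0]
LABELS = ["severe-SGA", "SGA", "AGA", "LGA"]


def verdict_branching(centile):
    """Path A: explicit conditionals."""
    if centile < 3.0:
        return "severe-SGA"
    if centile < 10.0:
        return "SGA"
    if centile < 90.0:
        return "AGA"
    return "LGA"


def verdict_table(centile):
    """Path B: table lookup via binary search over the same cut-offs."""
    return LABELS[bisect.bisect_right(CUTS, centile)]


# Sweep 0.0 to 100.0 in half-centile steps, which hits every cut-off exactly.
cases = [i / 2 for i in range(201)]
divergent = [c for c in cases if verdict_branching(c) != verdict_table(c)]
```

The sweep deliberately lands on each cut-off, since boundary handling (`<` versus `<=`) is where two independent paths are most likely to drift apart.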
A high-volume harness checks invariants that must never break. Across 39,000 cases: 0 violations.
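A harness in this spirit can be sketched with a seeded generator and a couple of invariants. The classifier, the severity ordering and the two invariants chosen here (every output is a known label, and a smaller centile is never graded less severe than a larger one) are assumptions for the example, not the project's actual invariant set.

```python
import random

# Hypothetical severity ordering for the labels the toy classifier can emit.
ORDER = {"AGA": 0, "SGA": 1, "severe-SGA": 2}


def classify(centile):
    """Hypothetical classifier under test."""
    if centile < 3.0:
        return "severe-SGA"
    if centile < 10.0:
        return "SGA"
    return "AGA"


def violations(n_cases, seed=0):
    """Generate random case pairs and return every invariant breach found."""
    rng = random.Random(seed)  # seeded so any failure can be reproduced
    found = []
    for _ in range(n_cases):
        low, high = sorted(rng.uniform(0.0, 100.0) for _ in range(2))
        a, b = classify(low), classify(high)
        if a not in ORDER or b not in ORDER:
            found.append(("unknown label", low, high))
        elif ORDER[a] < ORDER[b]:
            found.append(("smaller centile graded less severe", low, high))
    return found


breaches = violations(5000)
```

Seeding the generator keeps the harness itself deterministic, so a breach found once can be replayed exactly.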
A further 77 boundary tests probe gestational-age limits, centile cut-offs and threshold crossings, where silent extrapolation would otherwise hide.
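Boundary probing of this kind can be sketched with `math.nextafter` (Python 3.9+), which yields the closest representable floats on either side of a cut-off. The supported gestational-age range, the centile cut-off and the helper names below are hypothetical. The point is that values just outside the range are rejected rather than extrapolated.

```python
import math

GA_MIN_WEEKS, GA_MAX_WEEKS = 14.0, 42.0  # hypothetical supported range
SGA_CUTOFF = 10.0                         # hypothetical centile cut-off


def classify(ga_weeks, centile):
    """Hypothetical classifier that refuses to extrapolate outside its range."""
    if not (GA_MIN_WEEKS <= ga_weeks <= GA_MAX_WEEKS):
        raise ValueError("gestational age outside supported range")
    return "SGA" if centile < SGA_CUTOFF else "AGA"


def around(x):
    """The nearest float below x, x itself, and the nearest float above x."""
    return math.nextafter(x, -math.inf), x, math.nextafter(x, math.inf)


def rejected(ga_weeks):
    """True when the classifier refuses the given gestational age."""
    try:
        classify(ga_weeks, 50.0)
    except ValueError:
        return True
    return False


below, at, above = around(SGA_CUTOFF)
labels = [classify(30.0, c) for c in (below, at, above)]
ga_low, ga_min, _ = around(GA_MIN_WEEKS)
_, ga_max, ga_high = around(GA_MAX_WEEKS)
```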
An adversarial "break-it" campaign hunts for contradictory or under-called outputs, then hardens the engine against every reproducible failure it surfaces.
A golden-corpus regression suite locks expected outputs so any unintended change in clinical behavior shows up as drift — re-blessed only on a deliberate, reviewed decision.
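A golden-corpus check reduces to a snapshot and a diff. In the sketch below the engine, the three-case corpus and the function names are illustrative assumptions. A real suite would load the blessed snapshot from a reviewed file rather than compute it in place.

```python
def engine(case):
    """Hypothetical engine under test."""
    return {"label": "SGA" if case["efw_centile"] < 10.0 else "AGA"}


def changed_engine(case):
    """The same engine after an unintended threshold change."""
    return {"label": "SGA" if case["efw_centile"] < 5.0 else "AGA"}


def run_corpus(engine_fn, corpus):
    """Evaluate every case and key the outputs by case id."""
    return {case_id: engine_fn(case) for case_id, case in corpus.items()}


def drift(golden, current):
    """Case ids whose output differs from, or is missing in, the golden snapshot."""
    ids = sorted(set(golden) | set(current))
    return [case_id for case_id in ids if golden.get(case_id) != current.get(case_id)]


# Invented corpus; "blessing" here is simply taking a snapshot of known-good outputs.
corpus = {
    "case-001": {"efw_centile": 4.0},
    "case-002": {"efw_centile": 9.5},
    "case-003": {"efw_centile": 55.0},
}
golden = run_corpus(engine, corpus)

clean = drift(golden, run_corpus(engine, corpus))
regressed = drift(golden, run_corpus(changed_engine, corpus))
```

Only the case between the old and new thresholds shows up as drift, which is the kind of narrow behavioral change a golden corpus is meant to surface.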
The full suite — 2,358 tests, 0 failing — runs the unit, parity, boundary and golden checks together, so a single command tells you whether the engine still behaves exactly as it should.
Beyond the test suite, a runtime coherence gate checks structural invariants on every output before it reaches the screen — for example that an FGR case is never silently blocked, that a severe pre-eclampsia crossing is honored, and that periviable ordering stays consistent. Contradictory outputs are trapped rather than displayed.
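A gate of this shape can be sketched as a function that every output must pass before display. The two invariants below are hypothetical stand-ins, not the rules named above, and the field names other than `firedRules` are assumptions. What matters is the control flow: an incoherent output raises instead of being returned.

```python
class IncoherentOutput(Exception):
    """Raised when an output breaks a structural invariant."""


def coherence_gate(output):
    """Return the output unchanged, or raise if it contradicts itself."""
    problems = []
    # Hypothetical invariant 1: deliver-now cannot coexist with outpatient care.
    if output.get("deliverNow") and output.get("disposition") == "outpatient":
        problems.append("deliver-now criteria met but disposition is outpatient")
    # Hypothetical invariant 2: any non-AGA label must cite at least one rule.
    if output.get("label") != "AGA" and not output.get("firedRules"):
        problems.append("non-AGA label without any fired rule")
    if problems:
        raise IncoherentOutput("; ".join(problems))
    return output


def displayable(output):
    """True when the output passes the gate and may reach the screen."""
    try:
        coherence_gate(output)
    except IncoherentOutput:
        return False
    return True


coherent = {"label": "SGA", "disposition": "inpatient",
            "deliverNow": True, "firedRules": ["R-demo-1"]}
contradictory = dict(coherent, disposition="outpatient")
unsourced = {"label": "SGA", "disposition": "outpatient",
             "deliverNow": False, "firedRules": []}
```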
Concordance isn't a single yes/no. The oracle grades each output along six independent axes — and the engine matched the guideline reference on all of them.
FGR / SGA stage and size classification from biometry and Doppler.
Outpatient, admit-consider or inpatient classification.
The acuity tier label attached to the assessment.
Whether criteria are met for a deliver-now classification.
The reassess-at surveillance timing the case supports.
The gestational-age window associated with planned birth.
All of this resolves on the phone, in real time. Validation never costs the clinician a wait.
A full sourced plan computes in well under a millisecond.
The plan updates in real time as biometry and Doppler are typed.
You can't replay a stochastic answer, gate it in CI, or stamp it for audit. Determinism is what makes every metric on this page possible in the first place.
Every threshold the engine fires traces to a named reference card. Concordance is only meaningful because the standard being matched is itself sourced, line by line.
The 243 rules span five tables — 112 FGR / SGA, 43 AGA, 26 LGA, 43 Twin-pair and 19 Modifier — and four centile standards: WHO (default), INTERGROWTH-21st, Hadlock 1991 and NICHD. The technology page walks the pipeline that consumes them.
The seven-step pipeline, the gating layers and the provenance stamp are what make this evidence possible. See how the engine works — or reach out about clinical partnership and the prospective study.
Concordance figures reflect agreement with an independent guideline oracle on synthetic cases — agreement with standard-of-care references, not neonatal outcomes. ObGynAssist™ is a clinical companion and reference tool, not a regulated medical device. A prospective clinical-validation study is in design. Outputs are classification labels and guideline citations that support clinician evaluation; clinician evaluation required.