The proof

The numbers behind the structure.

Reliability isn't a claim you make; it's a number you measure. The flagship FetalGrowth engine, the deepest module in ObGynAssist™, is held to an independent guideline oracle, replayed bit-for-bit across 50 configurations at 1,000 seeds each, and gated in continuous integration so that a change which drops concordance below the enforced floor cannot merge. Every figure below is reproducible from the repository, and the limits are stated as plainly as the wins.

2,994 Graded cases
100% Oracle concordance
50,000 / 50,000 Bit-identical runs
2,358 Passing tests
v1.56.0 Engine version

Metric A · tooling oracle

Measured, not asserted.

Each synthetic case is independently graded by a guideline oracle that knows nothing of the engine's internals. Across 2,994 graded cases, the engine and the oracle agreed on every one.

2,994
Graded synthetic cases
100%
Oracle concordance
0
Under-calls · missed danger
0
Over-calls · false alarms

2,994 graded of a 3,000-case synthetic corpus · 0 under-calls · 0 over-calls
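
As a rough illustration of how a tally like this could be produced, the sketch below compares engine and oracle labels on an ordered severity scale. The `SEVERITY` scale, the label names, the `tally` function and the convention that an ungraded case carries `None` are all assumptions made for the example, not the repository's actual grader.

```python
from dataclasses import dataclass

# Hypothetical ordered severity scale; the real oracle's labels are not shown on this page.
SEVERITY = {"routine": 0, "expedited": 1, "urgent": 2}

@dataclass
class Tally:
    graded: int = 0
    concordant: int = 0
    under_calls: int = 0  # engine less severe than the oracle (missed danger)
    over_calls: int = 0   # engine more severe than the oracle (false alarm)

    @property
    def concordance(self) -> float:
        return self.concordant / self.graded if self.graded else 0.0

def tally(pairs):
    """pairs: iterable of (engine_label, oracle_label); oracle_label None means ungraded."""
    t = Tally()
    for engine_label, oracle_label in pairs:
        if oracle_label is None:
            continue  # assumption: ungraded cases are excluded from the denominator
        t.graded += 1
        e, o = SEVERITY[engine_label], SEVERITY[oracle_label]
        if e == o:
            t.concordant += 1
        elif e < o:
            t.under_calls += 1
        else:
            t.over_calls += 1
    return t
```

Splitting disagreements by direction matters because the two errors are not equal: an under-call hides risk, while an over-call costs attention.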

What this number is — and isn't

Honesty is a feature, not fine print.

Concordance is agreement with an independent guideline oracle on synthetic cases — it measures how faithfully the engine reproduces standard-of-care references like RCOG, NICE and ISUOG. It is not a measure of neonatal outcomes, and it does not claim clinical superiority. A prospective clinical-validation study — perinatologist assessment against the engine — is in design. We publish the limit as prominently as the result on purpose.

Synthetic cases · Guideline references · Not neonatal outcomes · Prospective study in design

Determinism

The same answer — yesterday, today, in five years.

A clinical decision can't rest on a coin flip. Across 50 distinct configurations replayed over 1,000 seeds each, every single run produced a byte-identical output. No sampling, no drift, no hidden randomness — the engine is a pure function of its input.
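
A replay check of this shape can be sketched in a few lines. Everything here is an assumption for illustration: the `engine` callable, the canonical-JSON hashing and the way a seed is passed in. How the seed enters the real harness is not described on this page.

```python
import hashlib
import json

def digest(output: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal outputs hash to equal bytes.
    blob = json.dumps(output, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def replay(engine, configs: list, seeds: range) -> tuple:
    """Run engine(config, seed) for every pair; return (total_runs, divergent_runs)."""
    total = divergent = 0
    for config in configs:
        baseline = digest(engine(config, seeds[0]))
        for seed in seeds:
            total += 1
            if digest(engine(config, seed)) != baseline:
                divergent += 1
    return total, divergent

def toy_engine(config: dict, seed: int) -> dict:
    """Hypothetical stand-in: a pure function of its input that ignores the seed."""
    small = config["efw_centile"] < 10
    return {"label": "SGA" if small else "AGA", "firedRules": ["R-size"] if small else []}
```

Calling `replay(toy_engine, configs, range(1000))` with 50 configurations performs 50,000 runs; any output that depends on the seed shows up as a non-zero divergent count.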

Reproducibility is also forward-looking: every output is stamped with its engineVersion, a full-content rulesetSHA and the firedRules[] it triggered, so any decision can be replayed bit-for-bit years later against the exact ruleset that produced it. The replay result is checked into the repository and tied to a commit.

50 configurations · 1,000 seeds each · commit-backed
50,000
of 50,000 bit-identical
50 × 1,000
configs × seeds
0
divergent outputs

stamp: engineVersion + rulesetSHA + firedRules[]
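
One way a stamp like this could work is sketched below. The three field names come from the line above; the choice of SHA-256 over canonical JSON, and the helper names `ruleset_sha`, `stamp` and `can_replay`, are assumptions for the example.

```python
import hashlib
import json

def ruleset_sha(ruleset: dict) -> str:
    """Hash the full ruleset content in canonical form (hypothetical scheme)."""
    blob = json.dumps(ruleset, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def stamp(output: dict, engine_version: str, ruleset: dict, fired_rules: list) -> dict:
    """Attach the three stamp fields to an output without mutating it."""
    return {
        **output,
        "engineVersion": engine_version,
        "rulesetSHA": ruleset_sha(ruleset),
        "firedRules": list(fired_rules),
    }

def can_replay(stamped: dict, engine_version: str, ruleset: dict) -> bool:
    """A replay is valid only against the exact version and ruleset content in the stamp."""
    return (stamped["engineVersion"] == engine_version
            and stamped["rulesetSHA"] == ruleset_sha(ruleset))
```

Hashing the content rather than a file name means a one-character change to any threshold yields a different stamp, so a mismatched replay is refused rather than silently accepted.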

Replay it yourself

Same case in. Same bytes out.

Input — fixed example case GA 32+4 · singleton · EFW 1,612 g · 3rd centile
UA-PI >95th centile · end-diastolic flow present
MCA normal · DV normal · AFI normal
Output — run the case to produce the output —

Illustration with a fixed example case — in the app, the live engine computes this on-device. The point here is the replay: every run returns the identical bytes. Outputs are classification labels that support clinician judgment; clinician evaluation required.

Metric B · CI-locked oracle

Validation that a regression can't merge.

A second, version-pinned oracle (v1.0.0) runs inside continuous integration. It holds 99.3% concordance on 1,000 synthetic cases — but the point isn't only the number. The floor is enforced: if agreement drops below it, the build fails and the change does not land.

Held
99.3%

overall concordance on 1,000 synthetic cases against the CI-locked oracle v1.0.0.

CI floor
≥ 98.5%

the enforced overall floor. Drop below it and the build fails — the regression cannot merge.

Per dimension
≥ 95.5%

a separate floor on each of the six graded decision dimensions — no single axis can quietly degrade.
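
The two floors can be sketched as a single gate. The thresholds are the ones stated above; the function names and the dictionary of per-dimension scores are hypothetical, and the real CI wiring is not shown on this page.

```python
OVERALL_FLOOR = 0.985        # overall floor stated above
PER_DIMENSION_FLOOR = 0.955  # per-dimension floor stated above

def gate(overall: float, per_dimension: dict) -> list:
    """Return a list of floor breaches; an empty list means the gate passes."""
    failures = []
    if overall < OVERALL_FLOOR:
        failures.append(f"overall {overall:.1%} is below the floor")
    for name, value in sorted(per_dimension.items()):
        if value < PER_DIMENSION_FLOOR:
            failures.append(f"{name} {value:.1%} is below the floor")
    return failures

def enforce(overall: float, per_dimension: dict) -> None:
    """Exit non-zero on any breach so a CI job calling this fails the build."""
    failures = gate(overall, per_dimension)
    if failures:
        raise SystemExit("concordance gate failed: " + "; ".join(failures))
```

A strong overall score does not rescue a weak axis: the per-dimension check runs independently, so one dimension at 94% fails the build even when the total sits at 99%.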

Two oracles, two purposes. Metric A is a continuously evolving tooling oracle used in development; Metric B is a frozen, version-pinned reference whose floors are wired into the test gate. Together they keep guideline concordance from regressing silently between releases: the engine has to keep earning its concordance, build after build.

The axes of rigor

Many ways to try to break it.

Concordance answers "is it right on typical cases?" These axes answer the harder question — "what happens at the edges, under mutation, and across paths that ought to agree?"

01 / Mutation testing

Tests that catch tampering

Deliberate faults are injected into the engine to confirm the suite notices. A first pass killed 15 / 15 mutants; the campaign later extended to 27 mutants.
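
In miniature, the idea looks like the sketch below. The toy rule, its 10th-centile threshold and the two hand-written mutants are hypothetical; a real campaign would generate mutants of the engine itself.

```python
def classify(centile: float) -> str:
    """Toy rule under test (hypothetical): below the 10th centile is 'small'."""
    return "small" if centile < 10 else "appropriate"

def mutant_flipped(centile: float) -> str:
    return "small" if centile > 10 else "appropriate"   # injected fault: comparison flipped

def mutant_off_by_one(centile: float) -> str:
    return "small" if centile <= 10 else "appropriate"  # injected fault: boundary shifted

def suite_passes(fn) -> bool:
    """A tiny suite; the case at exactly 10 is what catches the shifted boundary."""
    cases = [(3, "small"), (9.9, "small"), (10, "appropriate"), (50, "appropriate")]
    return all(fn(c) == expected for c, expected in cases)

def mutation_score(mutants) -> tuple:
    """Return (killed, total); a mutant is killed when the suite fails on it."""
    killed = sum(1 for m in mutants if not suite_passes(m))
    return killed, len(mutants)
```

A surviving mutant is the useful finding: it points at behavior the suite does not actually pin down.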

02 / Cross-path parity

Two routes, one verdict

Twin cases evaluated through independent code paths are compared verdict-for-verdict. Across 504 twin cases: 0 divergent verdicts.
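
A parity check can be sketched with two deliberately different implementations of the same banding. The cut-offs, band labels and function names are hypothetical and chosen only to make the comparison concrete.

```python
import bisect

# Hypothetical centile cut-offs and band labels, for illustration only.
CUTOFFS = [3, 10, 90]
BANDS = ["<3rd", "3rd to <10th", "10th to <90th", ">=90th"]

def path_branches(centile: float) -> str:
    """Route one: an explicit chain of comparisons."""
    if centile < 3:
        return BANDS[0]
    if centile < 10:
        return BANDS[1]
    if centile < 90:
        return BANDS[2]
    return BANDS[3]

def path_table(centile: float) -> str:
    """Route two: a table lookup by binary search."""
    return BANDS[bisect.bisect_right(CUTOFFS, centile)]

def divergent(cases: list) -> list:
    """Return every case on which the two routes disagree."""
    return [c for c in cases if path_branches(c) != path_table(c)]
```

Because the two routes share no logic, an off-by-one at a cut-off in either of them surfaces as a non-empty `divergent` list.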

03 / Stress-flow invariants

Structural rules that always hold

A high-volume harness checks invariants that must never break. Across 39,000 cases: 0 violations.

04 / Boundary tests

The edges of validity

A further 77 boundary tests probe gestational-age limits, centile cut-offs and threshold crossings, the places where silent extrapolation would otherwise hide.

05 / Adversarial campaign

A deliberate break-it pass

An adversarial "break-it" campaign hunts for contradictory or under-called outputs, then hardens the engine against every reproducible failure it surfaces.

06 / Golden-corpus regression

A frozen baseline

A golden-corpus regression suite locks expected outputs so any unintended change in clinical behavior shows up as drift — re-blessed only on a deliberate, reviewed decision.
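
A minimal sketch of the pattern, assuming a JSON baseline keyed by case id. The `evaluate` stand-in, `bless` and `drift` are hypothetical names, not the suite's own.

```python
import json

def evaluate(case: dict) -> dict:
    """Hypothetical stand-in for the engine."""
    return {"label": "SGA" if case["efw_centile"] < 10 else "AGA"}

def bless(cases: dict) -> str:
    """Serialize current outputs as the golden baseline (only on a reviewed decision)."""
    return json.dumps({cid: evaluate(c) for cid, c in cases.items()}, sort_keys=True, indent=2)

def drift(cases: dict, golden_json: str) -> list:
    """Return the ids of cases whose current output differs from the golden baseline."""
    golden = json.loads(golden_json)
    return sorted(cid for cid, c in cases.items() if evaluate(c) != golden.get(cid))
```

Keeping `bless` as a separate, explicit step is the point of the design: the baseline never updates as a side effect of running the tests.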

All of it, on every build.

The full suite — 2,358 tests, 0 failing — runs the unit, parity, boundary and golden checks together, so a single command tells you whether the engine still behaves exactly as it should.

0
OutputCoherenceGate

A last sentinel before anything renders.

Beyond the test suite, a runtime coherence gate checks structural invariants on every output before it reaches the screen — for example that an FGR case is never silently blocked, that a severe pre-eclampsia crossing is honored, and that periviable ordering stays consistent. Contradictory outputs are trapped rather than displayed.
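
The shape of such a gate can be sketched as a list of named predicates checked before rendering. The field names (`isFGR`, `blocked`, `severePECrossed`) and the `severe-PE` rule id are hypothetical, and the two invariants only paraphrase examples from the paragraph above; the real gate's checks are not listed on this page.

```python
class IncoherentOutput(Exception):
    """Raised instead of rendering a contradictory output."""

# Hypothetical field names and rule id; each entry is (description, predicate).
INVARIANTS = [
    ("an FGR case must not be blocked",
     lambda o: not (o["isFGR"] and o["blocked"])),
    ("a severe pre-eclampsia crossing must appear in firedRules",
     lambda o: not o["severePECrossed"] or "severe-PE" in o["firedRules"]),
]

def coherence_gate(output: dict) -> dict:
    """Return the output unchanged if every invariant holds; otherwise trap it."""
    broken = [name for name, holds in INVARIANTS if not holds(output)]
    if broken:
        raise IncoherentOutput("; ".join(broken))
    return output
```

Raising, rather than patching the output, keeps the failure visible: a contradictory result is never quietly repaired on its way to the screen.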

What gets graded

Six decision dimensions. Zero divergence.

Concordance isn't a single yes/no. The oracle grades each output along six independent axes — and the engine matched the guideline reference on all of them.

01

Staging

FGR / SGA stage and size classification from biometry and Doppler.

02

Disposition

Outpatient, admit-consider or inpatient classification.

03

Urgency

The acuity tier label attached to the assessment.

04

Deliver-now

Whether criteria are met for a deliver-now classification.

05

Scan interval

The reassess-at surveillance timing the case supports.

06

Delivery window

The gestational-age window associated with planned birth.

Staging · Disposition · Urgency · Deliver-now · Scan interval · Delivery window · 0 divergence
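
Grading along separate axes can be sketched as below. The six keys mirror the dimensions listed above, but their spelling, the output dictionaries and the function names are assumptions for the example.

```python
from collections import Counter

# Hypothetical keys for the six graded dimensions named above.
DIMENSIONS = ("staging", "disposition", "urgency",
              "deliverNow", "scanInterval", "deliveryWindow")

def grade(engine_out: dict, oracle_out: dict) -> dict:
    """Per-dimension agreement for one case: dimension -> True or False."""
    return {d: engine_out[d] == oracle_out[d] for d in DIMENSIONS}

def per_dimension_concordance(pairs: list) -> dict:
    """Fraction of cases in agreement on each dimension (pairs must be non-empty)."""
    agree = Counter()
    for engine_out, oracle_out in pairs:
        for d, ok in grade(engine_out, oracle_out).items():
            agree[d] += ok  # True counts as 1, False as 0
    return {d: agree[d] / len(pairs) for d in DIMENSIONS}
```

Reporting six numbers instead of one means a disagreement on, say, scan interval cannot hide behind agreement on staging.
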
Performance

Rigor you don't feel.

All of this resolves on the phone, in real time. Validation never costs the clinician a wait.

< 0.4 ms
p95 engine latency

A full sourced plan computes in well under a millisecond.
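
For readers who want to reproduce a figure like this, a p95 can be measured with a monotonic clock and a nearest-rank percentile. This is a generic sketch, not the project's benchmark; `fn` stands in for whatever call is being timed.

```python
import time

def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]

def measure_ms(fn, arg, runs: int = 1000) -> list:
    """Time fn(arg) `runs` times; returns per-call durations in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn(arg)
        durations.append((time.perf_counter_ns() - start) / 1e6)
    return durations
```

A tail percentile is a better headline than a mean here, because it reflects what the slowest typical interaction feels like rather than the average one.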

~250 ms
live recompute

The plan updates in real time as biometry and Doppler are typed.

Why structure beats sampling

A language model can't be validated like this.

You can't replay a stochastic answer, gate it in CI, or stamp it for audit. Determinism is what makes every metric on this page possible in the first place.

A language model

  • Stochastic: two runs are not guaranteed to match, so an answer can't be replayed
  • Concordance can't be locked to a CI floor
  • No firedRules[] to grade dimension by dimension
  • Can confidently invent thresholds and citations
  • No version stamp, so a past decision can't be reproduced

The ObGynAssist engine

  • Deterministic — 50,000 / 50,000 bit-identical
  • Concordance floors enforced in continuous integration
  • Every one of six dimensions graded independently
  • Every threshold wired to a named guideline card
  • Stamped with engineVersion + rulesetSHA for replay

The reference wall

58 guideline cards behind 243 rules.

Every threshold the engine fires traces to a named reference card. Concordance is only meaningful because the standard being matched is itself sourced, line by line.

58
Guideline reference cards
243
Guideline-wired rules
5
Rule tables
4
Centile standards
RCOG GTG · NICE NG · ISUOG · ACOG · INTERGROWTH-21st · Delphi 2016 · WHO · NICHD

The 243 rules span five tables (112 FGR / SGA, 43 AGA, 26 LGA, 43 Twin-pair and 19 Modifier) and draw on four centile standards: WHO (default), INTERGROWTH-21st, Hadlock 1991 and NICHD. The technology page walks through the pipeline that consumes them.

Go deeper

Read the structure that earns the numbers.

The seven-step pipeline, the gating layers and the provenance stamp are what make this evidence possible. See how the engine works — or reach out about clinical partnership and the prospective study.

Concordance figures reflect agreement with an independent guideline oracle on synthetic cases — agreement with standard-of-care references, not neonatal outcomes. ObGynAssist™ is a clinical companion and reference tool, not a regulated medical device. A prospective clinical-validation study is in design. Outputs are classification labels and guideline citations that support clinician evaluation; clinician evaluation required.