Which clinical trial should be funded next?

We review nearly 150,000 clinical trials to find where medical research has blind spots and where new funding would make the biggest difference.

149,947
clinical trials reviewed
271,182
connections mapped
57,605
research blind spots found
5
health checks per trial

How It Works

📊
Gather All Trials
150K trials ingested
🧠
Find Patterns
ML ensemble
🚨
Spot Trouble Early
5 checks per trial
🗺️
Map the Blind Spots
What's missing & why
🎯
Recommend Action
Top 20 priorities

How Are Clinical Trials Doing?

We run health checks on every trial — here's the overall picture across assessable indicators

  • Looking healthy
  • Need a closer look
  • Not enough info yet

What's New in v4

Autonomous ML predictions for 6,187 active Phase 3 trials

Autoresearch Predictions
AI predicts each trial's probability of being practice-changing
5 Early Warning Checks
Each trial is monitored for enrollment pace, protocol changes, study design, biological rationale, and prior evidence
Research Gap Breakdown
Tells you why a gap exists — not enough studies, weak designs, or no independent confirmation
Condition Rankings
Recommendations surface which disease areas have the most promising active trials, ranked by predicted success probability

Early Warning Scorecards v2.0 — LEAP Pipeline

How healthy is each clinical trial? Health checks across 150K trials

How to read this

Each trial is scored on 3 core indicators that we can assess for nearly every trial:

  • Enrollment — Has the trial enrolled patients? Completed enrollment = on track. Still recruiting = attention.
  • Protocol Stability — Is the trial running smoothly? Completed or active = on track. Suspended or terminated = attention.
  • Design Strength — How rigorous is the study? Randomized + blinded + multi-arm = on track. Open-label or single-arm = attention.

Two advanced indicators appear when data is available: biological plausibility (knowledge graph distance) and prior evidence chain (earlier-phase trial results). The score shows how many checks pass out of those with enough data (e.g., "3/3" means all assessable checks passed).
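The "checks passed out of assessable" score can be sketched in a few lines. This is an illustrative reconstruction, not the pipeline's actual code — the indicator names and the trial dict shape are assumptions:

```python
# Hypothetical sketch of the "x/assessable" scorecard described above.
# Indicators with no data are excluded from the denominator.

def scorecard(trial: dict) -> str:
    """Return a score like '3/3', counting only indicators with enough data."""
    indicators = [
        "enrollment", "protocol_stability", "design_strength",
        "biological_plausibility", "evidence_chain",  # advanced, often missing
    ]
    assessable = [trial.get(k) for k in indicators if trial.get(k) is not None]
    passed = sum(1 for status in assessable if status == "on_track")
    return f"{passed}/{len(assessable)}"

trial = {
    "enrollment": "on_track",
    "protocol_stability": "on_track",
    "design_strength": "on_track",
    "biological_plausibility": None,  # insufficient data -> excluded
    "evidence_chain": None,
}
print(scorecard(trial))  # -> 3/3
```

A trial missing the two advanced indicators is scored only on the three core checks, which is why "3/3" can mean "all assessable checks passed" rather than "all five passed".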

Note: Completed trials naturally score higher because enrollment and protocol stability are confirmed after the fact. This doesn't mean ongoing trials are worse — they simply have less data available so far.

  • Total Trials
  • On Track Signals
  • Attention Signals
  • Insufficient Data

Health Check Results by Category


How Many Assessable Checks Are Trials Passing?


Evidence Gap Map v2.0 — LEAP Pipeline

Where is medical research falling short? 57K blind spots mapped

How to read this

Each cell is a disease + treatment combination. We break down why the evidence is lacking into three reasons: Volume (not enough trials have been done), Quality (existing studies have weak designs or uncertain results), and Replication (only one research group has studied it). Darker red = bigger gap. Use the tabs to explore each reason. Hover any cell for the full breakdown.

Composite Evidence Gap

Weighted combination of volume, quality, and replication deficiencies


Deficiency Types

Volume
Not enough trials have been done
Quality
Studies exist but results are uncertain
Replication
Only one group has studied it

Legend

Large gap (high need)
Moderate gap
Evidence adequate

Recommended Next Trials v4.0 — Autoresearch Ensemble

Which disease areas have the most promising active Phase 3 trials right now?

What you're looking at

An ML model (4-model ensemble, AUROC 0.710) scanned 6,187 active Phase 3 clinical trials and predicted each one's probability of reporting a statistically significant positive result (p < 0.05).

We grouped trials by disease area and surfaced the top 20 conditions where our model is most confident a trial will succeed. For each condition, we show the single highest-rated trial. Click any row to see the full model rationale.
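The grouping step above is simple to sketch: per condition, keep the trial with the highest predicted probability, then rank conditions by that value. Field names (`nct_id`, `condition`, `p_positive`) are illustrative assumptions, not the pipeline's actual schema:

```python
# Per-condition "best trial" ranking, as described in the text.

def top_conditions(predictions, k=20):
    best = {}
    for p in predictions:
        cond = p["condition"]
        if cond not in best or p["p_positive"] > best[cond]["p_positive"]:
            best[cond] = p
    return sorted(best.values(), key=lambda p: p["p_positive"], reverse=True)[:k]

preds = [
    {"nct_id": "NCT00000001", "condition": "Oncology", "p_positive": 0.81},
    {"nct_id": "NCT00000002", "condition": "Oncology", "p_positive": 0.74},
    {"nct_id": "NCT00000003", "condition": "Cardiology", "p_positive": 0.77},
]
print(top_conditions(preds, k=2))
```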

Top Conditions by P(Positive Result)

Each bar shows the model's predicted probability that the top trial for that condition will report p < 0.05. Higher = more confident.

# | Condition | Type | Top Trial | P(positive) | 95% CI | Active Trials | Avg P(positive)

Predictions Explorer v7.1 — Practice-Changing Predictor

How likely is each active Phase 3 trial to be practice-changing? AI predictions for 6,187 trials.

What you're looking at

For each of the 6,187 active Phase 3 clinical trials, our model predicts the probability it will be practice-changing — published in a high-impact journal and widely cited in the literature.

The model is a 4-model ensemble autonomously optimized over 100 experiments using publication-based ground truth. Trials are classified as: high (≥70%), medium (50–70%), low (30–50%), unlikely (<30%).
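The four rating buckets map directly onto a threshold function. A minimal sketch, with boundary handling as an assumption (the source gives ranges, not inclusive/exclusive endpoints):

```python
# Rating buckets quoted above: high (>=70%), medium (50-70%),
# low (30-50%), unlikely (<30%).

def rating(p: float) -> str:
    if p >= 0.70:
        return "high"
    if p >= 0.50:
        return "medium"
    if p >= 0.30:
        return "low"
    return "unlikely"

print(rating(0.83))  # -> high
print(rating(0.12))  # -> unlikely
```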

  • Active Phase 3 Trials
  • Median P(positive)
  • Mean P(positive)
  • Unlikely / Low / Med / High counts

Distribution of Predicted Success Probabilities

How many trials fall in each probability bucket? Most cluster around 50–80%, meaning the model sees moderate-to-good chances for most active Phase 3 trials.


Model Performance

AUROC, AUPRC, and 1−Brier scores on held-out test data (trials registered 2020+).


Most Likely to Succeed

Top 20 trials with the highest predicted P(practice-changing). These are the trials our model is most confident will be published in top journals and highly cited.

NCT ID | P(positive) | 95% CI | Condition | Rating

Least Likely to Succeed

Bottom 20 trials with the lowest predicted probabilities. The model sees these as having the hardest path to a positive primary outcome.

NCT ID | P(positive) | 95% CI | Condition | Rating

Look Up a Trial

Portfolio Simulator v2.0 — LEAP Pipeline

Given a budget, which mix of new trials would close the most research gaps?

How to read this

Pick a budget and see which combination of new trials would fill the most blind spots. We automatically find the best mix for each spending level. Gap closure shows what percentage of missing evidence would be addressed. The equity version spreads funding across disease areas. Shaded bands show the range of possible outcomes.
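The "best mix for each spending level" idea can be sketched as a greedy knapsack over gap-closure per dollar. The real optimizer and cost model are not shown here; trial names, costs, and gap-closure fractions below are illustrative assumptions:

```python
# Greedy budget allocation sketch: fund trials in order of
# gap-closure-per-dollar until the budget is exhausted.

def allocate(candidates, budget):
    """candidates: list of (name, cost_in_millions, gap_closure) tuples."""
    chosen, spent = [], 0.0
    for name, cost, gain in sorted(candidates, key=lambda c: c[2] / c[1], reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

candidates = [
    ("Trial A", 7.2, 0.20),  # cost in $M, fraction of a gap closed
    ("Trial B", 8.0, 0.15),
    ("Trial C", 7.5, 0.25),
]
picked, spent = allocate(candidates, budget=15.0)
print(picked, spent)  # Trial C and Trial A fit; Trial B does not
```

An equity-constrained variant would add a cap or quota per disease area inside the loop; the greedy version above is purely score-optimal.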

How Quickly Could We Close Research Gaps?

Projected progress over time, with uncertainty bands


Score-Optimal ($50M)

Equity-Constrained ($50M)

Key Insight

With a $50M budget, requiring equitable disease coverage sacrifices no efficiency: the equity-constrained portfolio closes 64% of evidence gaps across 3 disease areas, while the purely score-optimized portfolio closes only 60% across 2. Funders can spread impact more broadly at no cost in effectiveness.

More Money = More Gaps Closed?

How much impact does each additional dollar buy?


Autoresearch v7.1 — Practice-Changing Predictor

Predicting which clinical trials will change medical practice

What Is Practice-Changing?

We define a trial as practice-changing based on its real-world impact: publication in a top-tier journal (NEJM, Lancet, JAMA, BMJ, Nature Medicine, Annals of Internal Medicine), high citation count (≥20 in OpenAlex), or significant downstream literature (DERIVED references in ClinicalTrials.gov). This replaces the prior p-value-based label, which was circular with model features.

An AI agent systematically sweeps label thresholds (journal tier, citation count, derived reference count) across 100 autonomous experiments, keeping only configurations that improve AUROC. The model uses 94 pre-outcome features from AACT — no post-hoc data.
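The label itself is a simple disjunction over the three external signals. A minimal sketch with the thresholds quoted above; the publication dict fields are assumptions:

```python
# A trial counts as practice-changing if ANY of the three signals fires:
# top-tier journal, >=20 citations, or >=5 DERIVED references.

TOP_TIER = {"NEJM", "Lancet", "JAMA", "BMJ", "Nature Medicine",
            "Annals of Internal Medicine"}

def practice_changing(pub, min_citations=20, min_derived=5):
    return (
        pub.get("journal") in TOP_TIER
        or pub.get("citations", 0) >= min_citations
        or pub.get("derived_refs", 0) >= min_derived
    )

print(practice_changing({"journal": "NEJM", "citations": 3}))        # journal signal
print(practice_changing({"journal": "Other", "citations": 45}))      # citation signal
print(practice_changing({"journal": "Other", "citations": 5}))       # no signal
```

The agent's threshold sweep amounts to re-running label construction with different `min_citations` / `min_derived` values and keeping whichever configuration improves held-out AUROC.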

What Changed in v7.1

An autonomous AI agent ran 100 experiments (up from 24 in v7.0), exploring label thresholds, feature engineering, and model architecture. AUROC improved from 0.676 to 0.710 (+5.0%). Of 100 experiments, only 8 were kept — a 92% discard rate, showing the model is in a plateau region where most changes are neutral or harmful.

Architectural Diversity

Added ExtraTrees (bagging) alongside XGBoost and LightGBM (boosting). Bagging and boosting make different kinds of errors — averaging their predictions lets those errors partially cancel, improving ensemble robustness without overfitting.

Noise Reduction

Reduced text SVD components from 15 to 8. Higher dimensions captured noise rather than signal from trial descriptions. Fewer, cleaner text features improved generalization on the holdout set.
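The reduction step itself is ordinary truncated SVD over text features. The real pipeline presumably uses scikit-learn's TruncatedSVD over TF-IDF vectors; this self-contained NumPy version, on synthetic data, shows the same projection with the v7.1 setting of 8 components:

```python
# Truncated SVD sketch: project raw text features down to 8 components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 15))  # 100 trials x 15 raw text features (synthetic)

def truncated_svd(X, k):
    """Keep the top-k singular directions of mean-centered X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # (n_samples, k) projection

Z = truncated_svd(X, k=8)
print(Z.shape)  # -> (100, 8)
```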

Experiment Log (Key Improvements)
Exp | Description | AUROC | Delta
68 | ExtraTrees 4th ensemble member | 0.7044 | +0.0062
71 | ExtraTrees depth=15 | 0.7072 | +0.0028
72 | ExtraTrees depth=20 | 0.7081 | +0.0009
85 | Reduce text SVD 15→8 | 0.7096 | +0.0015

4 of 43 experiments kept in this session (exps 58–100). 39 discarded — most landed in the 0.703–0.708 plateau.

Model Performance

100 experiments across label thresholds, features, and model architecture — March 2026

AUROC
0.710
Wide temporal split
AUPRC
0.836
High-prevalence label
Brier Score
0.176
Well-calibrated
Test Trials
1,685
Registered 2020+
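The three headline metrics are standard. In practice one would call scikit-learn's `roc_auc_score`, `average_precision_score`, and `brier_score_loss`; this self-contained sketch computes AUROC (as the rank statistic) and 1−Brier directly, on toy labels and predictions:

```python
# AUROC as the probability a random positive outranks a random negative,
# and Brier score as mean squared error of the predicted probabilities.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.3, 0.8, 0.5])

def auroc(y, p):
    pos, neg = p[y == 1], p[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

brier = np.mean((p - y) ** 2)
print(f"AUROC={auroc(y, p):.3f}  1-Brier={1 - brier:.3f}")  # -> AUROC=0.875  1-Brier=0.845
```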

Ground Truth: Publication Impact

  • 5,003 practice-changing trials (68% of labeled)
  • Tier 4 journal (1,184) | 20+ citations (4,924) | 5+ derived refs (97)
  • 120,950 PMIDs enriched via OpenAlex
  • Median 62 citations | P99: 3,495 | Max: 251,219
Non-Circular Label Design

All label components (journal tier, citation count, derived references) come from external sources (OpenAlex, AACT study_references) — not from the trial's own outcome data. The model uses only pre-outcome features (phase, enrollment, sponsor type, condition, etc.), preventing data leakage.

ROC Curve

Feature Importance (SHAP)

Top 10 features by mean |SHAP| value. Features with × are interaction terms autonomously discovered by the AI agent.

Calibration

Predicted probability vs observed outcome frequency (10 bins).
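The numbers behind such a plot come from equal-width binning. A minimal sketch (10 bins, as in the chart; toy data):

```python
# Calibration: bin predictions into 10 equal-width buckets and compare
# mean predicted probability to the observed positive rate in each bin.
import numpy as np

def calibration_bins(y, p, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p < hi)  # note: p == 1.0 falls outside the last bin
        if mask.any():
            rows.append((p[mask].mean(), y[mask].mean()))
    return rows  # (mean predicted, observed frequency) per non-empty bin

y = np.array([0, 0, 1, 1, 1, 0, 1, 1])
p = np.array([0.05, 0.15, 0.65, 0.72, 0.78, 0.35, 0.88, 0.91])
for pred, obs in calibration_bins(y, p):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```

A well-calibrated model's (predicted, observed) pairs sit near the diagonal; isotonic regression (below) is what pushes them there.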

Model Architecture

The agent converged on a 4-model ensemble with isotonic calibration, combining both boosting and bagging architectures for maximum diversity:

  • XGB #1 — XGBoost logloss: gradient-boosting baseline, max_depth=4, 1,200 estimators, learning_rate=0.04
  • LGBM — LightGBM GBDT: matched hyperparameters, min_split_gain=1.0
  • XGB #2 — XGBoost rank:pairwise: learning-to-rank objective, captures relative trial ordering
  • ET — ExtraTrees: bagging-based (vs. boosting), max_depth=20, adds architectural diversity

Each model is independently calibrated with 5-fold cross-validated isotonic regression, then predictions are averaged. The key insight: mixing boosting (XGB, LGBM) with bagging (ExtraTrees) improves ensemble diversity more than adding another boosted model. Training uses the combined train+validation set (registered before 2020), with a temporal holdout test set (2020+) for final evaluation.
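A schematic version of that recipe, in scikit-learn on synthetic data: each base model is wrapped in 5-fold isotonic calibration, then the calibrated probabilities are averaged. The real members are XGBoost/LightGBM/ExtraTrees; here a GradientBoostingClassifier stands in for the boosted models so the sketch stays self-contained:

```python
# Per-model isotonic calibration followed by probability averaging.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

base_models = [
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost/LightGBM
    ExtraTreesClassifier(n_estimators=100, max_depth=20, random_state=0),
]

probs = []
for model in base_models:
    calibrated = CalibratedClassifierCV(model, method="isotonic", cv=5)
    calibrated.fit(X_train, y_train)
    probs.append(calibrated.predict_proba(X_test)[:, 1])

p_ensemble = np.mean(probs, axis=0)  # simple average of calibrated members
print(p_ensemble.shape)  # one probability per held-out trial
```

Calibrating each member before averaging keeps the ensemble's output interpretable as a probability rather than a raw score.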

How It Works

  1. Data preparation — 151,313 trials from AACT, enriched with OpenAlex citations (120K PMIDs) and journal tier classification. Temporal split: train (<2018), validation (2018–2019), test (2020+)
  2. Label construction — Practice-changing defined by publication impact: top-tier journal, high citation count (≥20), or downstream literature influence (≥5 DERIVED refs)
  3. Autonomous optimization — Agent runs 100 experiments across label thresholds, feature engineering, and model architecture, keeping only improvements to AUROC
  4. Ensemble training — 4-model ensemble (XGBoost logloss + LightGBM GBDT + XGBoost rank:pairwise + ExtraTrees) with isotonic calibration on 94 pre-outcome features
  5. Portfolio scoring — Best model scores all 6,187 active Phase 3 trials with P(practice-changing) and 95% bootstrap CIs
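Step 5's 95% bootstrap CI can be sketched by resampling the ensemble members' predictions for a trial and taking the 2.5th/97.5th percentiles of the resampled means. The member values below are illustrative; the pipeline's exact resampling scheme is not specified here:

```python
# Bootstrap CI sketch for one trial's P(practice-changing).
import numpy as np

rng = np.random.default_rng(42)
member_preds = np.array([0.62, 0.58, 0.71, 0.66])  # 4 ensemble members, one trial

samples = rng.choice(member_preds, size=(10_000, len(member_preds)), replace=True)
boot_means = samples.mean(axis=1)
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"P(positive) = {member_preds.mean():.2f}  95% CI [{lo:.2f}, {hi:.2f}]")
```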

Temporal Evaluation

Strict temporal split prevents data leakage — model never sees future trials during training:

  • Train — before 2018 — 4,805 labeled
  • Validation — 2018–2019 — 848 labeled
  • Test — 2020+ — 1,685 labeled
  • Active Pool — recruiting

Prediction Target

practice_changing — Did the trial's results get published in a top-tier journal and/or receive significant citations? Ground truth is constructed from three external signals: (1) Journal tier classification from AACT RESULT publications, (2) Citation counts from OpenAlex, (3) DERIVED reference counts from ClinicalTrials.gov. This provides a non-circular, externally-validated label that captures real-world research impact.

About BayesianScience v7.1.0

Methodology

BayesianScience reads every clinical trial on ClinicalTrials.gov, maps how diseases, treatments, and sponsors connect, then uses statistical models to predict which trials will succeed, flag ones in trouble, and recommend where new research funding would have the biggest impact.

v7.1: Practice-Changing Predictor — The model predicts which trials will be practice-changing — published in top journals and highly cited — using publication-based ground truth from OpenAlex citations and journal tier classification. AUROC 0.710 on the wide temporal split (train <2018, test 2020+, n=1,685) with 94 pre-outcome features and a 4-model ensemble (two XGBoost variants + LightGBM + ExtraTrees) across 100 autonomous experiments. The key breakthroughs were architectural diversity (adding bagging via ExtraTrees alongside boosted models) and noise reduction (reducing text SVD components from 15 to 8).

  1. Gather all trials — 149,947 trials from ClinicalTrials.gov
  2. Map connections — 271,182 relationships between trials, diseases, drugs, and sponsors
  3. Predict outcomes — AI model estimates how effective each treatment will be, with uncertainty ranges
  4. Spot trouble early — 5 health checks per trial flag problems before they derail a study
  5. Map the gaps — For each disease-treatment pair, identify what's missing and why
  6. Explain in plain language — Each recommendation tells the story: why this gap exists, what we'd learn, how patients benefit, and the science behind it
  7. Simulate budgets — Given a budget, find the best mix of new trials to fund

Key Formulas

Gap Deficiency (Typed)
G(c,i) = w_v × volume_deficiency + w_q × quality_deficiency + w_r × replication_deficiency
Composite Recommendation Score (Single Drug)
C = 0.30 × kg_similarity + 0.30 × predicted_effect + 0.20 × evidence_chain + 0.20 × gap_severity
Combination Score
C_combo = 0.60 × mean(drug_A, drug_B) + 0.20 × (1 - drug_drug_similarity) + 0.20 × gap_severity
Bayesian Posterior
P(θ | data) ∝ P(data | θ) × P(θ) — hierarchical priors by disease domain
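The two scores with published weights can be checked with direct arithmetic. The input values below are illustrative, and all quantities are assumed to be normalized to [0, 1]:

```python
# Worked example of the composite scores above, using the listed weights.

kg_similarity, predicted_effect, evidence_chain, gap_severity = 0.7, 0.6, 0.5, 0.8

# Composite recommendation score (single drug)
C = (0.30 * kg_similarity + 0.30 * predicted_effect
     + 0.20 * evidence_chain + 0.20 * gap_severity)
print(round(C, 3))  # 0.21 + 0.18 + 0.10 + 0.16 = 0.65

# Combination score: rewards two strong drugs that are mechanistically
# dissimilar (1 - similarity) in a severe gap.
drug_A, drug_B, drug_drug_similarity = 0.65, 0.55, 0.4
C_combo = (0.60 * (drug_A + drug_B) / 2
           + 0.20 * (1 - drug_drug_similarity) + 0.20 * gap_severity)
print(round(C_combo, 3))  # 0.36 + 0.12 + 0.16 = 0.64
```

The gap deficiency G(c,i) follows the same pattern, but its weights w_v, w_q, w_r are not published here, so it is left symbolic.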

Model Performance

Limitations

  • Bayesian posteriors are model-based estimates, not observed outcomes — treat as informative priors for decision-making
  • Knowledge graph embeddings capture structural similarity, not guaranteed biological mechanism
  • Cost estimates are heuristic ($7.2M-$8M per trial) — real costs vary by phase, indication, and geography
  • Based on a frozen AACT snapshot — not real-time
  • Recommendations are structured prompts for human decision-makers, not autonomous allocation decisions

Author

Shuhan He, MD

Cite This

He S. BayesianScience: Bayesian Clinical Trial Intelligence for Research Funding Optimization. 2026. Available at: bayesianscience.org