Which clinical trial should be funded next?

We review nearly 150,000 clinical trials to find where medical research has blind spots and where new funding would make the biggest difference.

149,947
clinical trials reviewed
271,182
connections mapped
57,605
research blind spots found
5
health checks per trial

How It Works

📊
Gather All Trials
150K trials ingested
🧠
Find Patterns
ML ensemble
🚨
Spot Trouble Early
5 checks per trial
🗺️
Map the Blind Spots
What's missing & why
🎯
Recommend Action
Top 20 priorities

How Are Clinical Trials Doing?

We run health checks on every trial — here's the overall picture across assessable indicators

  • Looking healthy
  • Need a closer look
  • Not enough info yet

What's New in v4

Autonomous ML predictions for 6,187 active Phase 3 trials

Autoresearch Predictions
AI predicts each trial's probability of being practice-changing
5 Early Warning Checks
Each trial is monitored for enrollment pace, protocol changes, study design, biological rationale, and prior evidence
Research Gap Breakdown
Tells you why a gap exists — not enough studies, weak designs, or no independent confirmation
Condition Rankings
Recommendations surface which disease areas have the most promising active trials, ranked by predicted success probability

Early Warning Scorecards v2.0 — LEAP Pipeline

How healthy is each clinical trial? Health checks across 150K trials

How to read this

Each trial is scored on 3 core indicators that we can assess for nearly every trial:

  • Enrollment — Has the trial enrolled patients? Completed enrollment = on track. Still recruiting = attention.
  • Protocol Stability — Is the trial running smoothly? Completed or active = on track. Suspended or terminated = attention.
  • Design Strength — How rigorous is the study? Randomized + blinded + multi-arm = on track. Open-label or single-arm = attention.

Two advanced indicators appear when data is available: biological plausibility (knowledge graph distance) and prior evidence chain (earlier-phase trial results). The score shows how many checks pass out of those with enough data (e.g., "3/3" means all assessable checks passed).
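The "checks passed out of assessable" score can be sketched in a few lines. This is an illustrative reconstruction, not the pipeline's actual code — the indicator names and the trial dict shape are assumptions:

```python
# Hypothetical sketch of the "x/assessable" scorecard described above.
# Indicators with no data are excluded from the denominator.

def scorecard(trial: dict) -> str:
    """Return a score like '3/3', counting only indicators with enough data."""
    indicators = [
        "enrollment", "protocol_stability", "design_strength",
        "biological_plausibility", "evidence_chain",  # advanced, often missing
    ]
    assessable = [trial.get(k) for k in indicators if trial.get(k) is not None]
    passed = sum(1 for status in assessable if status == "on_track")
    return f"{passed}/{len(assessable)}"

trial = {
    "enrollment": "on_track",
    "protocol_stability": "on_track",
    "design_strength": "on_track",
    "biological_plausibility": None,  # insufficient data -> excluded
    "evidence_chain": None,
}
print(scorecard(trial))  # -> 3/3
```

A trial missing the two advanced indicators is scored only on the three core checks, which is why "3/3" can mean "all assessable checks passed" rather than "all five passed".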

Note: Completed trials naturally score higher because enrollment and protocol stability are confirmed after the fact. This doesn't mean ongoing trials are worse — they simply have less data available so far.

  • Total Trials
  • On Track Signals
  • Attention Signals
  • Insufficient Data

Health Check Results by Category


How Many Assessable Checks Are Trials Passing?


Evidence Gap Map v2.0 — LEAP Pipeline

Where is medical research falling short? 57K blind spots mapped

How to read this

Each cell is a disease + treatment combination. We break down why the evidence is lacking into three reasons: Volume (not enough trials have been done), Quality (existing studies have weak designs or uncertain results), and Replication (only one research group has studied it). Darker red = bigger gap. Use the tabs to explore each reason. Hover any cell for the full breakdown.

Composite Evidence Gap

Weighted combination of volume, quality, and replication deficiencies


Deficiency Types

Volume
Not enough trials have been done
Quality
Studies exist but results are uncertain
Replication
Only one group has studied it

Legend

Large gap (high need)
Moderate gap
Evidence adequate

Recommended Next Trials v4.0 — Autoresearch Ensemble

Which disease areas have the most promising active Phase 3 trials right now?

What you're looking at

An ML model (4-model ensemble, AUROC 0.710) scanned 6,187 active Phase 3 clinical trials and predicted each one's probability of reporting a statistically significant positive result (p < 0.05).

We grouped trials by disease area and surfaced the top 20 conditions where our model is most confident a trial will succeed. For each condition, we show the single highest-rated trial. Click any row to see the full model rationale.
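The grouping step above is simple to sketch: per condition, keep the trial with the highest predicted probability, then rank conditions by that value. Field names (`nct_id`, `condition`, `p_positive`) are illustrative assumptions, not the pipeline's actual schema:

```python
# Per-condition "best trial" ranking, as described in the text.

def top_conditions(predictions, k=20):
    best = {}
    for p in predictions:
        cond = p["condition"]
        if cond not in best or p["p_positive"] > best[cond]["p_positive"]:
            best[cond] = p
    return sorted(best.values(), key=lambda p: p["p_positive"], reverse=True)[:k]

preds = [
    {"nct_id": "NCT00000001", "condition": "Oncology", "p_positive": 0.81},
    {"nct_id": "NCT00000002", "condition": "Oncology", "p_positive": 0.74},
    {"nct_id": "NCT00000003", "condition": "Cardiology", "p_positive": 0.77},
]
print(top_conditions(preds, k=2))
```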

Top Conditions by P(Positive Result)

Each bar shows the model's predicted probability that the top trial for that condition will report p < 0.05. Higher = more confident.

# | Condition | Type | Top Trial | P(positive) | 95% CI | Active Trials | Avg P(positive)

Predictions Explorer v7.1 — Practice-Changing Predictor

How likely is each active Phase 3 trial to be practice-changing? AI predictions for 6,187 trials.

What you're looking at

For each of the 6,187 active Phase 3 clinical trials, our model predicts the probability it will be practice-changing — published in a high-impact journal and widely cited in the literature.

The model is a 4-model ensemble autonomously optimized over 100 experiments using publication-based ground truth. Trials are classified as: high (≥70%), medium (50–70%), low (30–50%), unlikely (<30%).
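The four rating buckets map directly onto a threshold function. A minimal sketch, with boundary handling as an assumption (the source gives ranges, not inclusive/exclusive endpoints):

```python
# Rating buckets quoted above: high (>=70%), medium (50-70%),
# low (30-50%), unlikely (<30%).

def rating(p: float) -> str:
    if p >= 0.70:
        return "high"
    if p >= 0.50:
        return "medium"
    if p >= 0.30:
        return "low"
    return "unlikely"

print(rating(0.83))  # -> high
print(rating(0.12))  # -> unlikely
```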

  • Active Phase 3 Trials
  • Median P(positive)
  • Mean P(positive)
  • Unlikely / Low / Med / High counts

Distribution of Predicted Success Probabilities

How many trials fall in each probability bucket? Most cluster around 50–80%, meaning the model sees moderate-to-good chances for most active Phase 3 trials.


Model Performance

AUROC, AUPRC, and 1−Brier scores on held-out test data (trials registered 2020+).


Most Likely to Succeed

Top 20 trials with the highest predicted P(practice-changing). These are the trials our model is most confident will be published in top journals and highly cited.

NCT ID | P(positive) | 95% CI | Condition | Rating

Least Likely to Succeed

Bottom 20 trials with the lowest predicted probabilities. The model sees these as having the hardest path to a positive primary outcome.

NCT ID | P(positive) | 95% CI | Condition | Rating

Look Up a Trial

Portfolio Simulator v2.0 — LEAP Pipeline

Given a budget, which mix of new trials would close the most research gaps?

How to read this

Pick a budget and see which combination of new trials would fill the most blind spots. We automatically find the best mix for each spending level. Gap closure shows what percentage of missing evidence would be addressed. The equity version spreads funding across disease areas. Shaded bands show the range of possible outcomes.
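The "best mix for each spending level" idea can be sketched as a greedy knapsack over gap-closure per dollar. The real optimizer and cost model are not shown here; trial names, costs, and gap-closure fractions below are illustrative assumptions:

```python
# Greedy budget allocation sketch: fund trials in order of
# gap-closure-per-dollar until the budget is exhausted.

def allocate(candidates, budget):
    """candidates: list of (name, cost_in_millions, gap_closure) tuples."""
    chosen, spent = [], 0.0
    for name, cost, gain in sorted(candidates, key=lambda c: c[2] / c[1], reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

candidates = [
    ("Trial A", 7.2, 0.20),  # cost in $M, fraction of a gap closed
    ("Trial B", 8.0, 0.15),
    ("Trial C", 7.5, 0.25),
]
picked, spent = allocate(candidates, budget=15.0)
print(picked, spent)  # Trial C and Trial A fit; Trial B does not
```

An equity-constrained variant would add a cap or quota per disease area inside the loop; the greedy version above is purely score-optimal.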

How Quickly Could We Close Research Gaps?

Projected progress over time, with uncertainty bands


Score-Optimal ($50M)

Equity-Constrained ($50M)

Key Insight

With a $50M budget, requiring equitable disease coverage sacrifices no efficiency: the equity-constrained portfolio closes 64% of evidence gaps across 3 disease areas, while the purely score-optimized portfolio closes only 60% across 2. Funders can spread impact more broadly at no cost in effectiveness.

More Money = More Gaps Closed?

How much impact does each additional dollar buy?


Autoresearch v7.1 — Practice-Changing Predictor

Predicting which clinical trials will change medical practice

What Is Practice-Changing?

We define a trial as practice-changing based on its real-world impact: publication in a top-tier journal (NEJM, Lancet, JAMA, BMJ, Nature Medicine, Annals of Internal Medicine), high citation count (≥20 in OpenAlex), or significant downstream literature (DERIVED references in ClinicalTrials.gov). This replaces the prior p-value-based label, which was circular with model features.

An AI agent systematically sweeps label thresholds (journal tier, citation count, derived reference count) across 100 autonomous experiments, keeping only configurations that improve AUROC. The model uses 94 pre-outcome features from AACT — no post-hoc data.
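The label itself is a simple disjunction over the three external signals. A minimal sketch with the thresholds quoted above; the publication dict fields are assumptions:

```python
# A trial counts as practice-changing if ANY of the three signals fires:
# top-tier journal, >=20 citations, or >=5 DERIVED references.

TOP_TIER = {"NEJM", "Lancet", "JAMA", "BMJ", "Nature Medicine",
            "Annals of Internal Medicine"}

def practice_changing(pub, min_citations=20, min_derived=5):
    return (
        pub.get("journal") in TOP_TIER
        or pub.get("citations", 0) >= min_citations
        or pub.get("derived_refs", 0) >= min_derived
    )

print(practice_changing({"journal": "NEJM", "citations": 3}))        # journal signal
print(practice_changing({"journal": "Other", "citations": 45}))      # citation signal
print(practice_changing({"journal": "Other", "citations": 5}))       # no signal
```

The agent's threshold sweep amounts to re-running label construction with different `min_citations` / `min_derived` values and keeping whichever configuration improves held-out AUROC.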

What Changed in v7.1

An autonomous AI agent ran 100 experiments (up from 24 in v7.0), exploring label thresholds, feature engineering, and model architecture. AUROC improved from 0.676 to 0.710 (+5.0%). Of 100 experiments, only 8 were kept — a 92% discard rate, showing the model is in a plateau region where most changes are neutral or harmful.

Architectural Diversity

Added ExtraTrees (bagging) alongside XGBoost and LightGBM (boosting). Bagging and boosting make different kinds of errors — averaging their predictions lets those errors partially cancel, improving ensemble robustness without overfitting.

Noise Reduction

Reduced text SVD components from 15 to 8. Higher dimensions captured noise rather than signal from trial descriptions. Fewer, cleaner text features improved generalization on the holdout set.
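The reduction step itself is ordinary truncated SVD over text features. The real pipeline presumably uses scikit-learn's TruncatedSVD over TF-IDF vectors; this self-contained NumPy version, on synthetic data, shows the same projection with the v7.1 setting of 8 components:

```python
# Truncated SVD sketch: project raw text features down to 8 components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 15))  # 100 trials x 15 raw text features (synthetic)

def truncated_svd(X, k):
    """Keep the top-k singular directions of mean-centered X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # (n_samples, k) projection

Z = truncated_svd(X, k=8)
print(Z.shape)  # -> (100, 8)
```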

Experiment Log (Key Improvements)
Exp | Description | AUROC | Delta
68 | ExtraTrees 4th ensemble member | 0.7044 | +0.0062
71 | ExtraTrees depth=15 | 0.7072 | +0.0028
72 | ExtraTrees depth=20 | 0.7081 | +0.0009
85 | Reduce text SVD 15→8 | 0.7096 | +0.0015

4 of 43 experiments kept in this session (exps 58–100). 39 discarded — most landed in the 0.703–0.708 plateau.

Model Performance

100 experiments across label thresholds, features, and model architecture — March 2026

AUROC
0.710
Wide temporal split
AUPRC
0.836
High-prevalence label
Brier Score
0.176
Well-calibrated
Test Trials
1,685
Registered 2020+
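The three headline metrics are standard. In practice one would call scikit-learn's `roc_auc_score`, `average_precision_score`, and `brier_score_loss`; this self-contained sketch computes AUROC (as the rank statistic) and 1−Brier directly, on toy labels and predictions:

```python
# AUROC as the probability a random positive outranks a random negative,
# and Brier score as mean squared error of the predicted probabilities.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.3, 0.8, 0.5])

def auroc(y, p):
    pos, neg = p[y == 1], p[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

brier = np.mean((p - y) ** 2)
print(f"AUROC={auroc(y, p):.3f}  1-Brier={1 - brier:.3f}")  # -> AUROC=0.875  1-Brier=0.845
```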

Ground Truth: Publication Impact

  • 5,003 practice-changing trials (68% of labeled)
  • Tier 4 journal (1,184) | 20+ citations (4,924) | 5+ derived refs (97)
  • 120,950 PMIDs enriched via OpenAlex
  • Median 62 citations | P99: 3,495 | Max: 251,219
Non-Circular Label Design

All label components (journal tier, citation count, derived references) come from external sources (OpenAlex, AACT study_references) — not from the trial's own outcome data. The model uses only pre-outcome features (phase, enrollment, sponsor type, condition, etc.), preventing data leakage.

ROC Curve

Feature Importance (SHAP)

Top 10 features by mean |SHAP| value. Features with × are interaction terms autonomously discovered by the AI agent.

Calibration

Predicted probability vs observed outcome frequency (10 bins).
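The numbers behind such a plot come from equal-width binning. A minimal sketch (10 bins, as in the chart; toy data):

```python
# Calibration: bin predictions into 10 equal-width buckets and compare
# mean predicted probability to the observed positive rate in each bin.
import numpy as np

def calibration_bins(y, p, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p < hi)  # note: p == 1.0 falls outside the last bin
        if mask.any():
            rows.append((p[mask].mean(), y[mask].mean()))
    return rows  # (mean predicted, observed frequency) per non-empty bin

y = np.array([0, 0, 1, 1, 1, 0, 1, 1])
p = np.array([0.05, 0.15, 0.65, 0.72, 0.78, 0.35, 0.88, 0.91])
for pred, obs in calibration_bins(y, p):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```

A well-calibrated model's (predicted, observed) pairs sit near the diagonal; isotonic regression (below) is what pushes them there.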

Model Architecture

The agent converged on a 4-model ensemble with isotonic calibration, combining both boosting and bagging architectures for maximum diversity:

  • XGB #1 — XGBoost logloss: gradient-boosting baseline, max_depth=4, 1,200 estimators, learning_rate=0.04
  • LGBM — LightGBM GBDT: matched hyperparameters, min_split_gain=1.0
  • XGB #2 — XGBoost rank:pairwise: learning-to-rank objective, captures relative trial ordering
  • ET — ExtraTrees: bagging-based (vs. boosting), max_depth=20, adds architectural diversity

Each model is independently calibrated with 5-fold cross-validated isotonic regression, then predictions are averaged. The key insight: mixing boosting (XGB, LGBM) with bagging (ExtraTrees) improves ensemble diversity more than adding another boosted model. Training uses the combined train+validation set (registered before 2020), with a temporal holdout test set (2020+) for final evaluation.
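A schematic version of that recipe, in scikit-learn on synthetic data: each base model is wrapped in 5-fold isotonic calibration, then the calibrated probabilities are averaged. The real members are XGBoost/LightGBM/ExtraTrees; here a GradientBoostingClassifier stands in for the boosted models so the sketch stays self-contained:

```python
# Per-model isotonic calibration followed by probability averaging.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

base_models = [
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost/LightGBM
    ExtraTreesClassifier(n_estimators=100, max_depth=20, random_state=0),
]

probs = []
for model in base_models:
    calibrated = CalibratedClassifierCV(model, method="isotonic", cv=5)
    calibrated.fit(X_train, y_train)
    probs.append(calibrated.predict_proba(X_test)[:, 1])

p_ensemble = np.mean(probs, axis=0)  # simple average of calibrated members
print(p_ensemble.shape)  # one probability per held-out trial
```

Calibrating each member before averaging keeps the ensemble's output interpretable as a probability rather than a raw score.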

How It Works

  1. Data preparation — 151,313 trials from AACT, enriched with OpenAlex citations (120K PMIDs) and journal tier classification. Temporal split: train (<2018), validation (2018–2019), test (2020+)
  2. Label construction — Practice-changing defined by publication impact: top-tier journal, high citation count (≥20), or downstream literature influence (≥5 DERIVED refs)
  3. Autonomous optimization — Agent runs 100 experiments across label thresholds, feature engineering, and model architecture, keeping only improvements to AUROC
  4. Ensemble training — 4-model ensemble (XGBoost logloss + LightGBM GBDT + XGBoost rank:pairwise + ExtraTrees) with isotonic calibration on 94 pre-outcome features
  5. Portfolio scoring — Best model scores all 6,187 active Phase 3 trials with P(practice-changing) and 95% bootstrap CIs
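Step 5's 95% bootstrap CI can be sketched by resampling the ensemble members' predictions for a trial and taking the 2.5th/97.5th percentiles of the resampled means. The member values below are illustrative; the pipeline's exact resampling scheme is not specified here:

```python
# Bootstrap CI sketch for one trial's P(practice-changing).
import numpy as np

rng = np.random.default_rng(42)
member_preds = np.array([0.62, 0.58, 0.71, 0.66])  # 4 ensemble members, one trial

samples = rng.choice(member_preds, size=(10_000, len(member_preds)), replace=True)
boot_means = samples.mean(axis=1)
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"P(positive) = {member_preds.mean():.2f}  95% CI [{lo:.2f}, {hi:.2f}]")
```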

Temporal Evaluation

Strict temporal split prevents data leakage — model never sees future trials during training:

  • Train — before 2018 — 4,805 labeled
  • Validation — 2018–2019 — 848 labeled
  • Test — 2020+ — 1,685 labeled
  • Active Pool — recruiting

Prediction Target

practice_changing — Did the trial's results get published in a top-tier journal and/or receive significant citations? Ground truth is constructed from three external signals: (1) Journal tier classification from AACT RESULT publications, (2) Citation counts from OpenAlex, (3) DERIVED reference counts from ClinicalTrials.gov. This provides a non-circular, externally-validated label that captures real-world research impact.

About BayesianScience v7.1.0

Methodology

BayesianScience reads every clinical trial on ClinicalTrials.gov, maps how diseases, treatments, and sponsors connect, then uses statistical models to predict which trials will succeed, flag ones in trouble, and recommend where new research funding would have the biggest impact.

v7.1: Practice-Changing Predictor — The model predicts which trials will be practice-changing — published in top journals and highly cited — using publication-based ground truth from OpenAlex citations and journal tier classification. AUROC 0.710 on the wide temporal split (train <2018, test 2020+, n=1,685) with 94 pre-outcome features and a 4-model ensemble (two XGBoost variants + LightGBM + ExtraTrees) across 100 autonomous experiments. The key breakthroughs were architectural diversity (adding bagging via ExtraTrees alongside boosted models) and noise reduction (reducing text SVD components from 15 to 8).

  1. Gather all trials — 149,947 trials from ClinicalTrials.gov
  2. Map connections — 271,182 relationships between trials, diseases, drugs, and sponsors
  3. Predict outcomes — AI model estimates how effective each treatment will be, with uncertainty ranges
  4. Spot trouble early — 5 health checks per trial flag problems before they derail a study
  5. Map the gaps — For each disease-treatment pair, identify what's missing and why
  6. Explain in plain language — Each recommendation tells the story: why this gap exists, what we'd learn, how patients benefit, and the science behind it
  7. Simulate budgets — Given a budget, find the best mix of new trials to fund

Key Formulas

Gap Deficiency (Typed)
G(c,i) = w_v × volume_deficiency + w_q × quality_deficiency + w_r × replication_deficiency
Composite Recommendation Score (Single Drug)
C = 0.30 × kg_similarity + 0.30 × predicted_effect + 0.20 × evidence_chain + 0.20 × gap_severity
Combination Score
C_combo = 0.60 × mean(drug_A, drug_B) + 0.20 × (1 - drug_drug_similarity) + 0.20 × gap_severity
Bayesian Posterior
P(θ | data) ∝ P(data | θ) × P(θ) — hierarchical priors by disease domain
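The two scores with published weights can be checked with direct arithmetic. The input values below are illustrative, and all quantities are assumed to be normalized to [0, 1]:

```python
# Worked example of the composite scores above, using the listed weights.

kg_similarity, predicted_effect, evidence_chain, gap_severity = 0.7, 0.6, 0.5, 0.8

# Composite recommendation score (single drug)
C = (0.30 * kg_similarity + 0.30 * predicted_effect
     + 0.20 * evidence_chain + 0.20 * gap_severity)
print(round(C, 3))  # 0.21 + 0.18 + 0.10 + 0.16 = 0.65

# Combination score: rewards two strong drugs that are mechanistically
# dissimilar (1 - similarity) in a severe gap.
drug_A, drug_B, drug_drug_similarity = 0.65, 0.55, 0.4
C_combo = (0.60 * (drug_A + drug_B) / 2
           + 0.20 * (1 - drug_drug_similarity) + 0.20 * gap_severity)
print(round(C_combo, 3))  # 0.36 + 0.12 + 0.16 = 0.64
```

The gap deficiency G(c,i) follows the same pattern, but its weights w_v, w_q, w_r are not published here, so it is left symbolic.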

Model Performance

Limitations

  • Bayesian posteriors are model-based estimates, not observed outcomes — treat as informative priors for decision-making
  • Knowledge graph embeddings capture structural similarity, not guaranteed biological mechanism
  • Cost estimates are heuristic ($7.2M-$8M per trial) — real costs vary by phase, indication, and geography
  • Based on a frozen AACT snapshot — not real-time
  • Recommendations are structured prompts for human decision-makers, not autonomous allocation decisions

Author

Shuhan He, MD

Cite This

He S. BayesianScience: Bayesian Clinical Trial Intelligence for Research Funding Optimization. 2026. Available at: bayesianscience.org