What clinical trial should be
funded next?
We review nearly 150,000 clinical trials to find where medical research has blind spots and where new funding would make the biggest difference.
How It Works
How Are Clinical Trials Doing?
We run health checks on every trial — here's the overall picture across assessable indicators
What's New in v4
Autonomous ML predictions for 6,187 active Phase 3 trials
Early Warning Scorecards v2.0 — LEAP Pipeline
How healthy is each clinical trial? Health checks across 150K trials
Each trial is scored on 3 core indicators that we can assess for nearly every trial:
- Enrollment — Has the trial enrolled patients? Completed enrollment = on track. Still recruiting = attention.
- Protocol Stability — Is the trial running smoothly? Completed or active = on track. Suspended or terminated = attention.
- Design Strength — How rigorous is the study? Randomized + blinded + multi-arm = on track. Open-label or single-arm = attention.
Two advanced indicators appear when data is available: biological plausibility (knowledge graph distance) and prior evidence chain (earlier-phase trial results). The score shows how many checks pass out of those with enough data (e.g., "3/3" means all assessable checks passed).
Note: Completed trials naturally score higher because enrollment and protocol stability are confirmed after the fact. This doesn't mean ongoing trials are worse — they simply have less data available so far.
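The scoring rule above — count passing checks among only the assessable indicators — can be sketched in a few lines. This is an illustrative sketch, not the production code; the indicator names and the `True`/`False`/`None` encoding are assumptions.

```python
def health_score(trial):
    """Count passing checks among indicators with enough data to assess.

    `trial` maps indicator name -> True (pass), False (needs attention),
    or None (not assessable). Field names are illustrative.
    """
    assessable = [v for v in trial.values() if v is not None]
    return f"{sum(assessable)}/{len(assessable)}"

example = {
    "enrollment": True,               # enrollment completed
    "protocol_stability": True,       # active, not suspended or terminated
    "design_strength": False,         # open-label -> needs attention
    "biological_plausibility": None,  # no knowledge-graph data available
    "prior_evidence_chain": None,     # no earlier-phase results
}
score = health_score(example)  # "2/3": two of three assessable checks pass
```

A trial with all five indicators assessable would be scored out of 5 instead of 3.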
Health Check Results by Category
How Many Assessable Checks Are Trials Passing?
Evidence Gap Map v2.0 — LEAP Pipeline
Where is medical research falling short? 57K blind spots mapped
Each cell is a disease + treatment combination. We break down why the evidence is lacking into three reasons: Volume (not enough trials have been done), Quality (existing studies have weak designs or uncertain results), and Replication (only one research group has studied it). Darker red = bigger gap. Use the tabs to explore each reason. Hover any cell for the full breakdown.
Composite Evidence Gap
Weighted combination of volume, quality, and replication deficiencies
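The composite score is a weighted sum of the three deficiency types (see the gap formula in the Key Formulas section). A minimal sketch, assuming each deficiency is normalized to [0, 1]; the weights shown are illustrative placeholders, not the production values.

```python
def composite_gap(volume_def, quality_def, replication_def,
                  w_v=0.4, w_q=0.3, w_r=0.3):
    """Composite gap G(c,i) = w_v*volume + w_q*quality + w_r*replication.

    Deficiencies are assumed normalized to [0, 1]; weights here are
    illustrative, not the pipeline's actual values.
    """
    return w_v * volume_def + w_q * quality_def + w_r * replication_def

# A cell with no trials at all (volume_def=1.0), moderate quality issues,
# and replication by multiple groups:
gap = composite_gap(1.0, 0.5, 0.0)  # 0.55
```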
Deficiency Types
Legend
Recommended Next Trials v4.0 — Autoresearch Ensemble
Which disease areas have the most promising active Phase 3 trials right now?
An ML model (4-model ensemble, AUROC 0.710) scanned 6,187 active Phase 3 clinical trials and predicted each one's probability of reporting a statistically significant positive result (p < 0.05).
We grouped trials by disease area and surfaced the top 20 conditions where our model is most confident a trial will succeed. For each condition, we show the single highest-rated trial. Click any row to see the full model rationale.
Top Conditions by P(Positive Result)
Each bar shows the model's predicted probability that the top trial for that condition will report p < 0.05. Higher = more confident.
| # | Condition | Type | Top Trial | P(positive) | 95% CI | Active Trials | Avg P(positive) |
|---|---|---|---|---|---|---|---|
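The grouping step — take each condition's best-scoring trial, then rank conditions by that trial's probability — can be sketched as follows. The dictionary keys (`condition`, `nct_id`, `p_positive`) are illustrative, not the actual schema.

```python
from collections import defaultdict

def top_trial_per_condition(predictions, n=20):
    """Return the n conditions whose best trial has the highest predicted
    P(positive), each with that single top trial attached.

    `predictions`: dicts with 'condition', 'nct_id', 'p_positive'
    (field names are illustrative).
    """
    by_condition = defaultdict(list)
    for p in predictions:
        by_condition[p["condition"]].append(p)
    best = {c: max(trials, key=lambda t: t["p_positive"])
            for c, trials in by_condition.items()}
    return sorted(best.values(),
                  key=lambda t: t["p_positive"], reverse=True)[:n]

preds = [
    {"condition": "Melanoma", "nct_id": "NCT1", "p_positive": 0.81},
    {"condition": "Melanoma", "nct_id": "NCT2", "p_positive": 0.64},
    {"condition": "Asthma",   "nct_id": "NCT3", "p_positive": 0.72},
]
top = top_trial_per_condition(preds, n=2)  # NCT1 first, then NCT3
```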
Predictions Explorer v7.1 — Practice-Changing Predictor
How likely is each active Phase 3 trial to be practice-changing? AI predictions for 6,187 trials.
For each of the 6,187 active Phase 3 clinical trials, our model predicts the probability it will be practice-changing — published in a high-impact journal and widely cited in the literature.
The model is a 4-model ensemble autonomously optimized over 100 experiments using publication-based ground truth. Trials are classified as: high (≥70%), medium (50–70%), low (30–50%), unlikely (<30%).
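The four-bucket classification above is a simple threshold rule; here is a direct sketch of it.

```python
def rating(p):
    """Bucket a predicted probability into the explorer's four classes:
    high (>=70%), medium (50-70%), low (30-50%), unlikely (<30%)."""
    if p >= 0.70:
        return "high"
    if p >= 0.50:
        return "medium"
    if p >= 0.30:
        return "low"
    return "unlikely"
```

Boundary values fall into the higher bucket (e.g., exactly 0.50 is "medium").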
Distribution of Predicted Success Probabilities
How many trials fall in each probability bucket? Most cluster around 50–80%, meaning the model sees moderate-to-good chances for most active Phase 3 trials.
Model Performance
AUROC, AUPRC, and 1−Brier scores on held-out test data (trials registered 2022+).
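For readers unfamiliar with these metrics, here is a minimal pure-Python sketch of two of them: AUROC is the probability that a randomly chosen positive trial outranks a randomly chosen negative one, and 1−Brier is one minus the mean squared error of the predicted probabilities (higher is better for both). The production pipeline presumably uses a standard library implementation instead.

```python
def auroc(y_true, y_score):
    """Probability a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_minus_brier(y_true, y_score):
    """1 minus the mean squared error of the probabilities."""
    return 1.0 - sum((s - y) ** 2
                     for y, s in zip(y_true, y_score)) / len(y_true)
```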
Most Likely to Succeed
Top 20 trials with the highest predicted P(practice-changing). These are the trials our model is most confident will be published in top journals and highly cited.
| NCT ID | P(positive) | 95% CI | Condition | Rating |
|---|---|---|---|---|
Least Likely to Succeed
Bottom 20 trials with the lowest predicted probabilities. The model sees these as having the hardest path to practice-changing results.
| NCT ID | P(positive) | 95% CI | Condition | Rating |
|---|---|---|---|---|
Look Up a Trial
Portfolio Simulator v2.0 — LEAP Pipeline
Given a budget, which mix of new trials would close the most research gaps?
Pick a budget and see which combination of new trials would fill the most blind spots. We automatically find the best mix for each spending level. Gap closure shows what percentage of missing evidence would be addressed. The equity version spreads funding across disease areas. Shaded bands show the range of possible outcomes.
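One simple way to "find the best mix for each spending level" is a greedy knapsack heuristic: fund the trial with the best gap-closure-per-dollar ratio until the budget runs out. This is an illustrative sketch — the actual optimizer may solve the problem exactly — and the candidate trials and numbers below are made up.

```python
def greedy_portfolio(candidates, budget):
    """Greedy sketch of the simulator: fund trials in descending order of
    gap closure per dollar until the budget is exhausted.

    `candidates`: (name, cost, gap_closed) tuples, values illustrative.
    """
    chosen, spent, closed = [], 0.0, 0.0
    for name, cost, gap in sorted(candidates,
                                  key=lambda t: t[2] / t[1], reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
            closed += gap
    return chosen, closed

# (trial, cost in $M, fraction of total evidence gap closed)
candidates = [("A", 8.0, 0.10), ("B", 7.2, 0.12), ("C", 7.5, 0.05)]
chosen, closed = greedy_portfolio(candidates, budget=15.5)
# B has the best ratio, then A; C no longer fits the budget.
```

The equity-constrained variant would add a constraint on how funding is distributed across disease areas before the greedy pass.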
How Quickly Could We Close Research Gaps?
Projected progress over time, with uncertainty bands
Score-Optimal ($50M)
Equity-Constrained ($50M)
Key Insight
With a $50M budget, requiring equitable disease coverage costs nothing in efficiency: the equity-constrained portfolio closes 64% of evidence gaps across 3 disease areas, while the purely score-optimized portfolio closes only 60% across 2. Funders can spread impact more broadly without sacrificing effectiveness.
More Money = More Gaps Closed?
How much impact does each additional dollar buy?
Autoresearch v7.1 — Practice-Changing Predictor
Predicting which clinical trials will change medical practice
What Is Practice-Changing?
We define a trial as practice-changing based on its real-world impact: publication in a top-tier journal (NEJM, Lancet, JAMA, BMJ, Nature Medicine, Annals of Internal Medicine), high citation count (≥20 in OpenAlex), or significant downstream literature (DERIVED references in ClinicalTrials.gov). This replaces the prior p-value-based label, which was circular with model features.
An AI agent systematically sweeps label thresholds (journal tier, citation count, derived reference count) across 100 autonomous experiments, keeping only configurations that improve AUROC. The model uses 94 pre-outcome features from AACT — no post-hoc data.
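The "keep only configurations that improve AUROC" loop is a simple hill-climbing acceptance rule. A minimal sketch, where `evaluate` stands in for a full train-and-score run and the configuration names are made up:

```python
def accept_if_improves(candidates, evaluate, baseline):
    """Sketch of the agent's loop: try each candidate configuration in
    order and keep it only if held-out AUROC beats the current best."""
    best, kept = baseline, []
    for cfg in candidates:
        score = evaluate(cfg)
        if score > best:
            best, kept = score, kept + [cfg]
    return kept, best

# Stand-in for real experiments: config name -> held-out AUROC.
scores = {"tweak_label": 0.700, "new_feature": 0.690, "add_bagging": 0.710}
kept, best = accept_if_improves(list(scores), scores.get, baseline=0.676)
# "new_feature" is discarded because it does not beat 0.700.
```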
What Changed in v7.1
An autonomous AI agent ran 100 experiments (up from 24 in v7.0), exploring label thresholds, feature engineering, and model architecture. AUROC improved from 0.676 to 0.710 (+5.0%). Of 100 experiments, only 8 were kept — a 92% discard rate showing the model is in a plateau region where most changes are neutral or harmful.
Added ExtraTrees (bagging) alongside XGBoost and LightGBM (boosting). Bagging and boosting make different errors — their disagreements cancel out, improving ensemble robustness without overfitting.
Reduced text SVD components from 15 to 8. Higher dimensions captured noise rather than signal from trial descriptions. Fewer, cleaner text features improved generalization on the holdout set.
| Exp | Description | AUROC | Delta |
|---|---|---|---|
| 68 | ExtraTrees 4th ensemble member | 0.7044 | +0.0062 |
| 71 | ExtraTrees depth=15 | 0.7072 | +0.0028 |
| 72 | ExtraTrees depth=20 | 0.7081 | +0.0009 |
| 85 | Reduce text SVD 15→8 | 0.7096 | +0.0015 |
4 of 43 experiments kept in this session (exps 58–100). 39 discarded — most landed in the 0.703–0.708 plateau.
Model Performance
100 experiments across label thresholds, features, and model architecture — March 2026
Ground Truth: Publication Impact
All label components (journal tier, citation count, derived references) come from external sources (OpenAlex, AACT study_references) — not from the trial's own outcome data. The model uses only pre-outcome features (phase, enrollment, sponsor type, condition, etc.), preventing data leakage.
ROC Curve
Feature Importance (SHAP)
Top 10 features by mean |SHAP| value. Features with × are interaction terms autonomously discovered by the AI agent.
Calibration
Predicted probability vs observed outcome frequency (10 bins).
Model Architecture
The agent converged on a 4-model ensemble with isotonic calibration, combining both boosting and bagging architectures for maximum diversity:
Each model is independently calibrated with 5-fold cross-validated isotonic regression, then predictions are averaged. The key insight: mixing boosting (XGB, LGBM) with bagging (ExtraTrees) improves ensemble diversity more than adding another boosted model. Training uses the combined train+validation set (registered before 2022), with temporal holdout test set (2022+) for final evaluation.
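The final averaging step is straightforward once each member has been calibrated. A sketch of just that step, operating on already-calibrated probability lists (one per ensemble member); the per-member values are illustrative.

```python
def ensemble_predict(member_probs):
    """Average the members' already isotonic-calibrated probabilities,
    trial by trial. `member_probs`: one list of P(practice-changing)
    per ensemble member, all the same length."""
    n = len(member_probs)
    return [sum(ps) / n for ps in zip(*member_probs)]

# Two trials scored by the four members (values illustrative):
avg = ensemble_predict([
    [0.62, 0.31],  # XGBoost (logloss)
    [0.58, 0.35],  # LightGBM (GBDT)
    [0.66, 0.29],  # XGBoost (rank:pairwise)
    [0.54, 0.33],  # ExtraTrees (bagging)
])
```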
How It Works
- Data preparation — 151,313 trials from AACT, enriched with OpenAlex citations (120K PMIDs) and journal tier classification. Temporal split: train (<2018), validation (2018–2019), test (2020+)
- Label construction — Practice-changing defined by publication impact: top-tier journal, high citation count (≥20), or downstream literature influence (≥5 DERIVED refs)
- Autonomous optimization — Agent runs 100 experiments across label thresholds, feature engineering, and model architecture, keeping only improvements to AUROC
- Ensemble training — 4-model ensemble (XGBoost logloss + LightGBM GBDT + XGBoost rank:pairwise + ExtraTrees) with isotonic calibration on 94 pre-outcome features
- Portfolio scoring — Best model scores all 6,187 active Phase 3 trials with P(practice-changing) and 95% bootstrap CIs
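The 95% bootstrap CIs in the last step can be illustrated with a resampling sketch. This simplified version bootstraps over the ensemble members' probabilities for a single trial; the production pipeline may instead bootstrap over training data, so treat this as a conceptual example only.

```python
import random

def bootstrap_ci(member_probs, n_boot=1000, alpha=0.05, seed=0):
    """95% bootstrap CI for one trial's score: resample the per-member
    probabilities with replacement and take percentiles of the
    resampled means. Simplified sketch, not the production method."""
    rng = random.Random(seed)
    k = len(member_probs)
    means = sorted(sum(rng.choices(member_probs, k=k)) / k
                   for _ in range(n_boot))
    return (means[int(alpha / 2 * n_boot)],
            means[int((1 - alpha / 2) * n_boot) - 1])

lo, hi = bootstrap_ci([0.62, 0.58, 0.66, 0.54])  # interval around ~0.60
```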
Temporal Evaluation
Strict temporal split prevents data leakage — model never sees future trials during training:
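Concretely, a temporal split just partitions trials by registration year so nothing from the future leaks into training. A sketch using the years given in the data-preparation step above (train <2018, validation 2018–2019, test 2020+), with `(nct_id, registration_year)` pairs standing in for full trial records:

```python
def temporal_split(trials):
    """Split (nct_id, registration_year) pairs by year:
    train < 2018, validation 2018-2019, test 2020+."""
    train = [t for t in trials if t[1] < 2018]
    val = [t for t in trials if 2018 <= t[1] <= 2019]
    test = [t for t in trials if t[1] >= 2020]
    return train, val, test

train, val, test = temporal_split(
    [("NCT_A", 2016), ("NCT_B", 2019), ("NCT_C", 2023)]
)
```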
Prediction Target
practice_changing — Did the trial's results get published in a top-tier journal and/or receive significant citations? Ground truth is constructed from three external signals: (1) Journal tier classification from AACT RESULT publications, (2) Citation counts from OpenAlex, (3) DERIVED reference counts from ClinicalTrials.gov. This provides a non-circular, externally-validated label that captures real-world research impact.
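The label is an OR over the three external signals described above. A direct sketch (the journal-name strings are abbreviations, not the exact values used for matching):

```python
TOP_TIER = {"NEJM", "Lancet", "JAMA", "BMJ",
            "Nature Medicine", "Annals of Internal Medicine"}

def practice_changing(journal, citations, derived_refs):
    """A trial is labeled practice-changing if ANY signal fires:
    top-tier journal publication, >= 20 OpenAlex citations, or
    >= 5 DERIVED references on ClinicalTrials.gov."""
    return journal in TOP_TIER or citations >= 20 or derived_refs >= 5
```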
About BayesianScience v7.1.0
Methodology
BayesianScience reads every clinical trial on ClinicalTrials.gov, maps how diseases, treatments, and sponsors connect, then uses statistical models to predict which trials will succeed, flag ones in trouble, and recommend where new research funding would have the biggest impact.
v7.1: Practice-Changing Predictor — The model predicts which trials will be practice-changing — published in top journals and highly cited — using publication-based ground truth from OpenAlex citations and journal tier classification. AUROC 0.710 on the wide temporal split (train <2018, test 2020+, n=1,685) with 94 pre-outcome features and a 4-model ensemble (two XGBoost variants + LightGBM + ExtraTrees) across 100 autonomous experiments. The key breakthroughs were architectural diversity (adding bagging via ExtraTrees alongside boosted models) and noise reduction (reducing text SVD components from 15 to 8).
- Gather all trials — 149,947 trials from ClinicalTrials.gov
- Map connections — 271,182 relationships between trials, diseases, drugs, and sponsors
- Predict outcomes — AI model estimates how effective each treatment will be, with uncertainty ranges
- Spot trouble early — 5 health checks per trial flag problems before they derail a study
- Map the gaps — For each disease-treatment pair, identify what's missing and why
- Explain in plain language — Each recommendation tells the story: why this gap exists, what we'd learn, how patients benefit, and the science behind it
- Simulate budgets — Given a budget, find the best mix of new trials to fund
Key Formulas
G(c,i) = w_v × volume_deficiency + w_q × quality_deficiency + w_r × replication_deficiency
C = 0.30 × kg_similarity + 0.30 × predicted_effect + 0.20 × evidence_chain + 0.20 × gap_severity
C_combo = 0.60 × mean(drug_A, drug_B) + 0.20 × (1 - drug_drug_similarity) + 0.20 × gap_severity
P(θ | data) ∝ P(data | θ) × P(θ) — hierarchical priors by disease domain
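The candidate and combination scores above translate directly into code. A sketch, assuming all inputs are normalized to [0, 1]:

```python
def candidate_score(kg_similarity, predicted_effect, evidence_chain,
                    gap_severity):
    """Single-treatment candidate score C from the formula above;
    all inputs assumed normalized to [0, 1]."""
    return (0.30 * kg_similarity + 0.30 * predicted_effect
            + 0.20 * evidence_chain + 0.20 * gap_severity)

def combo_score(c_drug_a, c_drug_b, drug_drug_similarity, gap_severity):
    """Combination score C_combo: mean of the two single-drug scores,
    plus a bonus for dissimilar (mechanistically complementary) drugs."""
    return (0.60 * (c_drug_a + c_drug_b) / 2
            + 0.20 * (1 - drug_drug_similarity) + 0.20 * gap_severity)
```

Note that the `(1 - drug_drug_similarity)` term rewards pairing drugs that work differently, since redundant combinations add less evidence.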
Model Performance
Limitations
- Bayesian posteriors are model-based estimates, not observed outcomes — treat them as informative priors for decision-making
- Knowledge graph embeddings capture structural similarity, not guaranteed biological mechanism
- Cost estimates are heuristic ($7.2M–$8M per trial) — real costs vary by phase, indication, and geography
- Based on a frozen AACT snapshot — not real-time
- Recommendations are structured prompts for human decision-makers, not autonomous allocation decisions
Author
Shuhan He, MD
Cite This
He S. BayesianScience: Bayesian Clinical Trial Intelligence for Research Funding Optimization. 2026. Available at: bayesianscience.org