methods summary · Daniel Saltzberg · @dargason

OpenADMET PXR Challenge — Phase 1 model

Generated: 2026-05-26 23:47:33 UTC

My final model for the first phase of the challenge contains four levels, each using a different swath of OpenADMET-supplied and derived data. I will add more detail in the coming days now that my modeling sprint is over.

The model consists of:

  1. PXR foundation models: Two PXR-specific foundation models were generated by fine-tuning ChemProp v2 / CheMeleon on the 10,870 molecules in the single-point high-throughput screening dataset: one to a continuous (log2fc_median) target, and one to a binary (log2fc_gt_0.75) target.
  2. Blended regression models: Multiple regressors were trained on the foundation-model embeddings using a curated set of the dose-response data. Stack weights were optimized on 5-fold Butina out-of-fold predictions: each head predicted compounds from held-out chemical clusters, and NNLS / positive regression learned the best nonnegative blend against the known pEC50 labels. The resulting weights were:
    • TabICL continuous — 0.311
    • TabPFN continuous — 0.204
    • LGBM continuous — 0.189
    • LGBM binary — 0.146
    • auxMT anchor — 0.143
    • TabICL binary — 0
    • TabPFN binary — 0
  3. Analog-series active perturbation: A small perturbation was added to any molecule with high similarity (via RDKit2D and DrugLike descriptors) to highly active compounds (pEC50 > 5.5).
  4. Docking-enhanced lift: Cross-docking was performed for all dose-response and challenge compounds against a battery of 8 PXR crystal structures using GNINA. Poses were filtered, and the top 12 poses for each molecule were embedded in UniMol2 and regressed against the dose-response actives (pEC50 > 4.5). Perturbations were capped at ±0.05 pEC50 units.

    I ran out of time before testing the uncapped model.
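As an illustration of the level-2 blend, here is a minimal sketch of fitting nonnegative stack weights with NNLS on out-of-fold predictions. It is not the code used for the submission: the data are synthetic, the three heads are hypothetical stand-ins, the function name `fit_nnls_blend` is invented for this example, and the Butina cluster folds that would produce the out-of-fold matrix are not shown.

```python
import numpy as np
from scipy.optimize import nnls


def fit_nnls_blend(oof_preds: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """Return one nonnegative blend weight per head (column of oof_preds)."""
    weights, _residual_norm = nnls(oof_preds, y_true)
    return weights


# Synthetic stand-in data (hypothetical): 200 compounds, 3 heads.
rng = np.random.default_rng(0)
y_true = rng.uniform(3.0, 7.0, size=200)
oof_preds = np.column_stack([
    y_true + rng.normal(0.0, 0.3, size=200),  # a good head
    y_true + rng.normal(0.0, 0.6, size=200),  # a noisier head
    rng.uniform(3.0, 7.0, size=200),          # an uninformative head
])

weights = fit_nnls_blend(oof_preds, y_true)
blend = oof_preds @ weights
print("weights:", np.round(weights, 3))
print("blend MAE:", round(float(np.mean(np.abs(blend - y_true))), 3))
```

Because the fit has no intercept and no sum-to-one constraint in this sketch, the weights need not add up to exactly 1.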

Other notes and thoughts:

Waterfall: where the final predicted pEC50 came from

Legend: pre-perturbation 2D/activity baseline · analog-series gate · GNINA/UniMol2 correction
Gray is the 2D/activity baseline entering the analog gate; green and gold are the two bounded perturbations added afterward. The 3D correction is a small sliver — the base model does the main work.
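The three waterfall components can be sketched as a baseline plus two clipped corrections. This is a minimal illustration, not the submitted pipeline: the ±0.05 docking cap comes from the description above, while the function name, the `analog_cap` value of 0.2, and the input numbers are hypothetical.

```python
import numpy as np


def apply_bounded_corrections(baseline, analog_delta, docking_delta,
                              analog_cap, docking_cap=0.05):
    """Add two capped perturbations to the baseline pEC50 prediction.

    Returns (final, analog_part, docking_part) so each waterfall
    component can be plotted separately.
    """
    baseline = np.asarray(baseline, dtype=float)
    analog_part = np.clip(np.asarray(analog_delta, dtype=float),
                          -analog_cap, analog_cap)
    docking_part = np.clip(np.asarray(docking_delta, dtype=float),
                           -docking_cap, docking_cap)
    return baseline + analog_part + docking_part, analog_part, docking_part


# Hypothetical inputs for two compounds.
final, analog_part, docking_part = apply_bounded_corrections(
    baseline=[4.2, 5.8],
    analog_delta=[0.0, 0.3],     # raw analog-series lift
    docking_delta=[-0.2, 0.03],  # raw docking-based correction
    analog_cap=0.2,              # assumed value, not from the write-up
)
print(final, analog_part, docking_part)
```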
post-mortem · added 2026-05-27

Phase 1 Unblinded: successes, failures, head-smacks, and thoughts for Phase 2

The challenge organizers unblinded 253 of the 513 test compounds on 2026-05-27 (49.3% of the test set; the remaining 260 stay blinded). This gives us a good picture of where the model works, where it fails, and some potential paths forward.

Final rank: 14 / 338
MAE: 0.4291 (RAE 0.5664 · R² 0.6061)
Spearman ρ: 0.7947 (weakest metric; diagnostic below)
Activity call (≥ 5.0): 0.848 (precision & recall balanced)
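For readers who want to reproduce metrics of this kind on their own predictions, here is a minimal sketch. The definitions are assumptions rather than the organizers' scoring code: RAE is taken as the summed absolute error divided by the summed absolute deviation from the mean of the true values, and the activity call is scored as precision and recall at pEC50 ≥ 5.0. The function name and the four-compound example are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr


def regression_metrics(y_true, y_pred, active_threshold=5.0):
    """MAE, RAE, R², Spearman rho, and precision/recall of the activity call."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    baseline_dev = y_true - y_true.mean()

    rho, _pvalue = spearmanr(y_true, y_pred)

    true_active = y_true >= active_threshold
    pred_active = y_pred >= active_threshold
    true_pos = np.sum(true_active & pred_active)
    precision = true_pos / pred_active.sum() if pred_active.any() else float("nan")
    recall = true_pos / true_active.sum() if true_active.any() else float("nan")

    return {
        "mae": abs_err.mean(),
        "rae": abs_err.sum() / np.abs(baseline_dev).sum(),
        "r2": 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum(baseline_dev ** 2),
        "spearman": rho,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical four-compound example.
metrics = regression_metrics([3.0, 4.0, 5.0, 6.0], [3.5, 4.0, 5.5, 6.0])
print(metrics)
```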

Initial thoughts

As soon as I saw the scatterplot below, I smacked my head. I had spent the better part of the last two weeks chasing what I thought was active-region compression, which is clearly an issue (blue dots under the diagonal). However, I had significantly larger problems at the bottom end, which I had neglected to investigate fully. There were always going to be some duds in the analog expansion, yet my model predicted pEC50 > 3.0 for almost all compounds, while the training set shows that the assay happily reports values under 2.

Predicted vs true pEC₅₀ on the unblinded set

Legend: Very inactive (< 3.5) · Inactive (3.5–4.5) · Weak active (4.5–5.5) · Active (≥ 5.5)
253 unblinded compounds, colored by true-activity category. The model does well in the middle and needs improvement at both ends.

What worked

Activity prediction for actives: For confirmed actives (pEC₅₀ ≥ 5.0, n=112), the model MAE is 0.292 with a healthy underprediction bias (−0.207). The perturbation layers (the analog-series gate and the GNINA / UniMol2 correction) improved fidelity in the active tail.

Binary activity classification (threshold pEC₅₀ ≥ 5.0) came back at 84.8% precision and recall: the model correctly learns what an active PXR compound looks like even though the training set is heavily enriched for actives. Rank ordering within the mid-range pEC₅₀ (3.5–5.5) is also strong at Spearman ≈ 0.52–0.53 in-window.

What failed

The inactive region (pEC₅₀ < 3.5) has Spearman ρ = −0.14: anti-correlated rank order! All 10 worst absolute errors fall in this range, with predictions overshooting true potencies by 1.6–2.8 pEC₅₀ units.
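The in-window rank correlations quoted here can be computed by restricting to compounds whose true pEC₅₀ falls in a given range. The sketch below assumes that windows are defined on the true values (the write-up does not say); the function name and the six-compound example are hypothetical. The example is built so that a prediction floor near 3 inverts the ordering among true inactives.

```python
import numpy as np
from scipy.stats import spearmanr


def windowed_spearman(y_true, y_pred, low=-np.inf, high=np.inf):
    """Spearman rho over compounds with low <= true pEC50 < high."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    in_window = (y_true >= low) & (y_true < high)
    if in_window.sum() < 3:
        return float("nan")  # too few points for a meaningful rank correlation
    rho, _pvalue = spearmanr(y_true[in_window], y_pred[in_window])
    return rho


# Hypothetical data: good ordering in the mid-range, inverted below 3.5.
y_true = np.array([2.0, 2.5, 3.0, 4.0, 4.5, 5.0])
y_pred = np.array([3.6, 3.4, 3.2, 4.1, 4.6, 5.2])

rho_inactive = windowed_spearman(y_true, y_pred, high=3.5)
rho_mid = windowed_spearman(y_true, y_pred, low=3.5, high=5.5)
print(rho_inactive, rho_mid)
```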

Plate 1 top compounds: I generally consider a model a success if it places the most potent molecules in the first experimental batch, whether that batch is 8, 24, or however many compounds. By this metric, the model is not great. The most potent molecule (OADMET-0006546) is ranked 22nd by the model; the second most potent is ranked 66th. Only two of the true top eight are in the model's top eight, and only seven of the true top 24 are in the model's top 24. When resources are limited, this model does not do a satisfactory job of prioritizing the best compounds.

Rank order of top 42 actives (true pEC50 > 5.5)

42 unblinded actives (true pEC₅₀ > 5.5) ranked by true potency (x) vs predicted potency (y). Red box: only 2/8 of the actual top 8 land in the predicted top 8. Blue box: only 7/24 of the actual top 24 land in the predicted top 24.
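The plate-style check described above reduces to a top-k overlap count plus the predicted rank of each truly potent compound. Here is a minimal sketch; the function names and the five-compound example are hypothetical, and whether ranks are taken over all unblinded compounds or only over the actives is left to the caller.

```python
import numpy as np


def top_k_overlap(y_true, y_pred, k):
    """Number of compounds in both the true top-k and the predicted top-k."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    true_top = set(np.argsort(-y_true, kind="stable")[:k].tolist())
    pred_top = set(np.argsort(-y_pred, kind="stable")[:k].tolist())
    return len(true_top & pred_top)


def predicted_rank(y_pred, index):
    """1-based rank of compound `index` when sorted by predicted potency."""
    order = np.argsort(-np.asarray(y_pred, dtype=float), kind="stable")
    return int(np.where(order == index)[0][0]) + 1


# Hypothetical five-compound example.
y_true = [7.0, 6.5, 6.0, 5.5, 5.0]
y_pred = [5.9, 6.2, 5.1, 6.0, 5.0]

overlap_top2 = top_k_overlap(y_true, y_pred, k=2)
rank_of_best = predicted_rank(y_pred, index=int(np.argmax(y_true)))
print(overlap_top2, rank_of_best)
```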

pEC₅₀ distributions: test predictions, test actuals, training set

Legend: experimental (test set 1) · predicted · training set (n=4,139)
Distribution of pEC₅₀: actual, predicted, and training data. Predictions clearly missed the low end and, to a lesser degree, the high end.

Failure Diagnosis:

Active-enriched training + MAE-optimized stacking + positive-only NNLS created a model that cannot express "this molecule really does not bind PXR."

The NNLS (non-negative least squares) stack was chosen to keep ensemble weights interpretable and prevent cancellation artifacts. It also means the stack cannot pull a prediction below what the base heads themselves predict, and the base heads were trained on an active-heavy target distribution. There is no mechanism in the current stack to recognize and downshift true inactives, outside of the minor perturbations.
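To make the floor effect concrete: with nonnegative weights and no intercept (an assumption about the stack), the blend can never fall below the sum of the weights times the lowest head prediction. The sketch below uses the five nonzero stack weights listed earlier, which sum to 0.993; the head predictions for a single compound are hypothetical.

```python
import numpy as np


def blend_bounds(weights, head_preds):
    """Lower and upper bounds of a nonnegative, intercept-free blend."""
    weights = np.asarray(weights, dtype=float)
    head_preds = np.asarray(head_preds, dtype=float)
    total = weights.sum()
    return total * head_preds.min(), total * head_preds.max()


# Nonzero stack weights from the write-up (TabICL, TabPFN, LGBM cont.,
# LGBM binary, auxMT anchor).
stack_weights = np.array([0.311, 0.204, 0.189, 0.146, 0.143])

# Hypothetical head outputs for one true inactive (true pEC50 near 2).
head_preds = np.array([3.4, 3.6, 3.2, 3.5, 3.3])

blend = float(head_preds @ stack_weights)
lower, upper = blend_bounds(stack_weights, head_preds)
print(f"blend={blend:.3f}, bounds=({lower:.3f}, {upper:.3f})")
```

If every head predicts above 3 for a compound, the blend lands near 3 or higher no matter how the weights are chosen, so a true value near 2 is out of reach without an additional downshift mechanism.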

Some thoughts for Phase 2

  1. Trust the training distribution: One difficulty of this challenge is that the training compounds are far from the challenge compounds. However, in aggregate, the expansion analogs performed pretty similarly to the initial dose-response data. Stratifying the final ensemble to fit this distribution may help at both ends.
  2. Rank-aware stack loss: Replace the pure MAE-optimized NNLS objective with a blended ranking-aware loss (e.g., MAE + a pairwise rank penalty), so the stack is explicitly rewarded for getting the ordering right, not just the absolute residuals.
  3. Full-range docking perturbation: Docking was applied conservatively; it was one of the last things I enabled. I concentrated on the active region, under the hypothesis that docking could help distinguish partial agonists from full agonists. Perhaps it is just as effective, if not more so, at the low end, identifying clear non-binders that should have no activity.
  4. External inactive augmentation: Curated public inactives could help on the low side of the prediction window. Earlier attempts hurt, but those attempts predate the CheMeleon representation.
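One possible shape for the rank-aware stack loss in item 2 is sketched below: MAE plus a pairwise logistic penalty on mis-ordered pairs, minimized under nonnegativity bounds. This is an assumption-laden illustration, not a tested recipe: the penalty form, the `rank_weight` of 0.5, the choice of L-BFGS-B with numerical gradients, the function names, and the synthetic two-head data are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def rank_aware_loss(weights, head_preds, y_true, rank_weight=0.5):
    """MAE plus a pairwise logistic penalty on pairs with y_i > y_j."""
    blend = head_preds @ weights
    mae = np.mean(np.abs(blend - y_true))
    true_diff = y_true[:, None] - y_true[None, :]
    pred_diff = blend[:, None] - blend[None, :]
    ordered = true_diff > 0  # pairs where compound i is truly more potent
    # softplus(-d): small when the blend also ranks i above j
    rank_penalty = np.mean(np.logaddexp(0.0, -pred_diff[ordered]))
    return mae + rank_weight * rank_penalty


def fit_rank_aware_blend(head_preds, y_true, rank_weight=0.5):
    """Nonnegative blend weights minimizing the rank-aware loss."""
    n_heads = head_preds.shape[1]
    start = np.full(n_heads, 1.0 / n_heads)
    result = minimize(
        rank_aware_loss, start, args=(head_preds, y_true, rank_weight),
        method="L-BFGS-B", bounds=[(0.0, None)] * n_heads,
    )
    return result.x, start


# Synthetic stand-in data (hypothetical): 60 compounds, 2 heads.
rng = np.random.default_rng(0)
y_true = rng.uniform(2.0, 7.0, size=60)
head_preds = np.column_stack([
    y_true + rng.normal(0.0, 0.4, size=60),              # unbiased, noisy
    0.5 * y_true + 2.5 + rng.normal(0.0, 0.3, size=60),  # compressed range
])

weights, start = fit_rank_aware_blend(head_preds, y_true)
print("weights:", np.round(weights, 3))
```

The MAE term is not smooth, so a gradient-based optimizer with finite differences is only a rough fit here; a smoothed absolute error or a derivative-free method would be reasonable alternatives.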