methods summary · Daniel Saltzberg · @dargason

OpenADMET PXR Challenge — Phase 1 model

Generated: 2026-05-26 23:47:33 UTC

My final model for the first phase of the challenge contains four levels, each using a different swath of OpenADMET-supplied and derived data. I will add more detail in the coming days now that my modeling sprint is over.

The model consists of:

1. PXR foundation models. Two PXR-specific foundation models were generated by fine-tuning ChemProp v2 / CheMeleon on the 10,870 molecules in the single-point high-throughput screening dataset: one to a continuous (log2fc_median) target, and one to a binary (log2fc_gt_0.75) target.
2. Blended regression models. Multiple regressors were trained on the foundation-model embeddings using a curated set of the dose-response data. Stack weights were optimized on 5-fold Butina out-of-fold predictions: each head predicted compounds from held-out chemical clusters, and NNLS / positive regression learned the best nonnegative blend against the known pEC50 labels. Final stack weights:
- TabICL continuous — 0.311
- TabPFN continuous — 0.204
- LGBM continuous — 0.189
- LGBM binary — 0.146
- auxMT anchor — 0.143
- TabICL binary — 0
- TabPFN binary — 0
3. Analog-series active perturbation. A small perturbation was added to any molecule with high similarity (via RDKit2D and DrugLike descriptors) to high-active compounds (pEC50 > 5.5).
4. Docking-enhanced lift. Cross-docking was performed for all dose-response and challenge compounds against a battery of 8 PXR crystal structures using GNINA. Poses were filtered, and the top 12 poses for each molecule were embedded in UniMol2 and regressed against the dose-response actives (pEC50 > 4.5). Perturbations were capped at ±0.05 pEC50 units. I ran out of time before testing the uncapped model.
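The stacking step in level 2 can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the actual pipeline: the three heads, their noise levels, and the array names are assumptions, and the out-of-fold matrix is assumed to be already assembled from the cluster-held-out folds. Note that `scipy.optimize.nnls` minimizes squared error; the exact objective used in the real stack may differ.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 200 compounds and 3 hypothetical heads.
n_compounds = 200
y_true = rng.uniform(3.0, 7.0, size=n_compounds)   # known pEC50 labels
noise_scale = np.array([0.3, 0.5, 0.8])            # head 0 is the most accurate

# One column per head; each row is an out-of-fold prediction for one compound.
oof_preds = y_true[:, None] + rng.normal(0.0, noise_scale, size=(n_compounds, 3))

# Solve min ||oof_preds @ w - y_true||_2 subject to w >= 0.
weights, residual_norm = nnls(oof_preds, y_true)

# Blend the heads with the learned non-negative weights.
stacked = oof_preds @ weights

print("weights:", np.round(weights, 3))
print("stacked MAE:", round(float(np.mean(np.abs(stacked - y_true))), 3))
```

Heads that add nothing beyond the others tend to be driven to exactly zero weight by the non-negativity constraint, which is consistent with the two zero-weight heads in the list above.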

Other notes and thoughts:

Chemistry was cleaned with RDKit; tautomers / protomers were enumerated with UniPKa at pH 7.4.
The dose-response data was cleaned of obvious electrophiles and compounds with very high molecular weight (e.g., rifamycin and friends).
Cofolding was not as useful as hoped. I generated hundreds of thousands of cofolding models, but traditional docking performed better in the end.
ECFP4 / Tanimoto was found to be less useful than RDKit2D and DrugLike descriptors for representing the universe of PXR molecular activity.
Enhancing the dataset with ChEMBL or BindingDB only hurt results. I tried a number of ways to add that data, specifically only including EC50 data on the same cell line, but in every instance it did not help. I think this really drives home how much assay differences confound the data.
I had an extremely hard time constructing an internal validation suite that accurately — or even directionally — tracked my model's holdout performance.
For the docking perturbation, I regressed only against actives. Regression against the entire set was net-negative, and I rationalized that the docking signal was most useful for distinguishing partial-agonist-like poses from fuller-agonist-like poses.
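The two constraints on the docking perturbation (fit only on actives, then cap the output) can be sketched as below. The cutoff of 4.5 and the cap of 0.05 come from the text; the record layout, compound IDs, and raw correction values are hypothetical, and the regressor that would produce the raw corrections is left out.

```python
ACTIVE_CUTOFF = 4.5  # pEC50 threshold for the docking regression set (from the text)
CAP = 0.05           # maximum absolute perturbation in pEC50 units (from the text)


def select_actives(records, cutoff=ACTIVE_CUTOFF):
    """Keep only dose-response records above the activity cutoff."""
    return [r for r in records if r["pec50"] > cutoff]


def apply_capped_correction(baseline, raw_correction, cap=CAP):
    """Add a correction to a baseline pEC50 after clamping it to [-cap, +cap]."""
    bounded = max(-cap, min(cap, raw_correction))
    return baseline + bounded


# Hypothetical training records: only the actives would feed the docking regression.
training = [
    {"id": "cpd-A", "pec50": 3.9},
    {"id": "cpd-B", "pec50": 4.8},
    {"id": "cpd-C", "pec50": 6.1},
]
docking_training_set = select_actives(training)

# Hypothetical baseline predictions and raw (uncapped) docking corrections.
baseline_pec50 = [4.20, 5.10, 6.00]
raw_docking_correction = [0.30, -0.02, -0.40]
final_pec50 = [
    apply_capped_correction(b, c)
    for b, c in zip(baseline_pec50, raw_docking_correction)
]
print(final_pec50)
```

With a cap this tight, even a large raw correction moves the final prediction by at most 0.05 units.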

Waterfall: where the final predicted pEC50 came from

Legend: pre-perturbation 2D/activity baseline · analog-series gate · GNINA/UniMol2 correction

Gray is the 2D/activity baseline entering the analog gate; green and gold are the two bounded perturbations added afterward. The 3D correction is a small sliver — the base model does the main work.

post-mortem · added 2026-05-27

Phase 1 Unblinded: successes, failures, head-smacks, and thoughts for Phase 2

The challenge organizers unblinded 253 of the 513 test compounds on 2026-05-27 (49.3% of the test set; the remaining 260 stay blinded). This gives us a good picture of where the model works, where it fails, and some potential paths forward.

Final rank: 14 / 338
MAE: 0.4291 (RAE 0.5664 · R² 0.6061)
Spearman ρ: 0.7947 (weakest metric; diagnostic below)
Activity call (≥ 5.0): 0.848 (precision & recall balanced)
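For reference, the error metrics and the activity call above can be computed from paired true and predicted pEC50 values roughly as follows. This is a sketch under assumptions: RAE is taken to be the common definition (summed absolute error divided by the summed absolute deviation of the true values from their mean), which may not match the challenge's exact scoring code, and the four example values are made up.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RAE, and R^2 for paired true/predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    centered = y_true - y_true.mean()
    mae = abs_err.mean()
    rae = abs_err.sum() / np.abs(centered).sum()
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum(centered ** 2)
    return {"mae": float(mae), "rae": float(rae), "r2": float(r2)}


def activity_call(y_true, y_pred, threshold=5.0):
    """Precision and recall for calling a compound active at pEC50 >= threshold."""
    true_active = np.asarray(y_true, dtype=float) >= threshold
    pred_active = np.asarray(y_pred, dtype=float) >= threshold
    true_pos = np.sum(true_active & pred_active)
    precision = true_pos / pred_active.sum() if pred_active.any() else float("nan")
    recall = true_pos / true_active.sum() if true_active.any() else float("nan")
    return {"precision": float(precision), "recall": float(recall)}


# Made-up example values, for illustration only.
example_true = [3.0, 4.0, 5.0, 6.0]
example_pred = [3.5, 4.0, 5.0, 5.5]
metrics = regression_metrics(example_true, example_pred)
calls = activity_call(example_true, example_pred)
print(metrics, calls)
```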

Initial thoughts

As soon as I saw the scatterplot below, I smacked my head. I had spent the better part of the last two weeks chasing what I thought was active-region compression, which is clearly an issue (blue dots under the diagonal). However, the model had significantly larger problems at the bottom end, which I had neglected to investigate fully. There were bound to be some duds in the analog expansion, yet my model predicted pEC50 > 3.0 for almost all compounds, even though the training set shows that the assay happily reports values under 2.

Predicted vs true pEC₅₀ on the unblinded set

Legend: Very inactive (< 3.5) · Inactive (3.5–4.5) · Weak active (4.5–5.5) · Active (≥ 5.5)

253 unblinded compounds, colored by true-activity category. The model does well in the middle and needs improvement at both ends.

What worked

Active-compound prediction. For confirmed actives (pEC₅₀ ≥ 5.0, n=112) the model MAE is 0.292 with a healthy underprediction bias (−0.207). The perturbation layers (the analog-series gate and the GNINA / UniMol2 correction) successfully improved fidelity in the active tail.

Binary activity classification (threshold pEC₅₀ ≥ 5.0) came back at 84.8% for both precision and recall: the model correctly learns what an active PXR compound looks like even though the training set is heavily enriched for actives. Rank ordering within the mid-range pEC₅₀ windows (3.5–5.5) is also strong, at in-window Spearman ≈ 0.52–0.53.

What failed

The inactive region. The pEC₅₀ < 3.5 region has Spearman ρ = −0.14: anti-correlated rank order! All 10 worst absolute errors fall in this range, overshooting their true potencies by 1.6–2.8 pEC₅₀ units.
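An in-window rank correlation of this kind can be computed by masking on the true pEC₅₀ before calling `scipy.stats.spearmanr`. The sketch below uses made-up values chosen so that the global correlation looks fine while the inactive window is anti-correlated; the function name and window edges are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr


def in_window_spearman(y_true, y_pred, low, high):
    """Spearman rho restricted to compounds with low <= true pEC50 < high."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = (y_true >= low) & (y_true < high)
    rho, _ = spearmanr(y_true[mask], y_pred[mask])
    return float(rho), int(mask.sum())


# Made-up values: the three inactives are predicted in reverse order.
toy_true = [2.0, 2.5, 3.0, 4.0, 4.5, 5.0]
toy_pred = [4.0, 3.8, 3.6, 4.1, 4.4, 5.2]

overall_rho, n_all = in_window_spearman(toy_true, toy_pred, -np.inf, np.inf)
inactive_rho, n_inactive = in_window_spearman(toy_true, toy_pred, -np.inf, 3.5)
mid_rho, n_mid = in_window_spearman(toy_true, toy_pred, 3.5, 5.5)
print(overall_rho, inactive_rho, mid_rho)
```

Because a window covers a narrow slice of the potency range, in-window correlations are expected to be lower than the global value even for a good model; a negative value is still a clear warning sign.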

Plate 1 top compounds. I generally consider it a success if a model predicts the top molecules to be included in the first experimental batch, whether that be 8, 24, or however many. By this metric, the model is not great. The most potent molecule (OADMET-0006546) is ranked 22nd by the model; the second most potent is ranked 66th. Only two of the top eight molecules are in the model's top eight, and only seven of the top 24 are in the model's top 24. Where resources are limited, this model is not satisfactory at triaging the best compounds to the top.

Rank order of top 42 actives (true pEC50 > 5.5)

42 unblinded actives (true pEC₅₀ > 5.5) ranked by true potency (x) vs predicted potency (y). Red box: only 2/8 of the actual top 8 land in the predicted top 8. Blue box: only 7/24 of the actual top 24 land in the predicted top 24.
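The top-k overlap counts quoted here (2/8 and 7/24) follow from a simple set intersection. A minimal sketch, using six made-up compounds rather than the real unblinded data:

```python
def top_k_overlap(true_values, predicted_values, k):
    """Count compounds that are in both the true top k and the predicted top k."""
    indices = range(len(true_values))
    top_true = set(sorted(indices, key=lambda i: true_values[i], reverse=True)[:k])
    top_pred = set(sorted(indices, key=lambda i: predicted_values[i], reverse=True)[:k])
    return len(top_true & top_pred)


def predicted_rank(predicted_values, index):
    """1-based rank of one compound when sorting by predicted value, descending."""
    order = sorted(
        range(len(predicted_values)),
        key=lambda i: predicted_values[i],
        reverse=True,
    )
    return order.index(index) + 1


# Made-up pEC50 values for six compounds, listed in order of true potency.
toy_true = [6.9, 6.5, 6.2, 6.0, 5.8, 5.6]
toy_pred = [6.0, 6.4, 5.5, 6.1, 5.9, 5.7]

overlap_at_2 = top_k_overlap(toy_true, toy_pred, k=2)
overlap_at_4 = top_k_overlap(toy_true, toy_pred, k=4)
rank_of_best = predicted_rank(toy_pred, index=0)
print(overlap_at_2, overlap_at_4, rank_of_best)
```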

pEC₅₀ distributions: test predictions, test actuals, training set

Legend: experimental (test set 1) · predicted · training set (n=4,139)

Distribution of pEC₅₀: actual, predicted, and training data. Predictions clearly missed the low end and, to a lesser extent, the high end.

Failure Diagnosis:

Active-enriched training + MAE-optimized stacking + positive-only NNLS created a model that cannot express "this molecule really does not bind PXR."

The NNLS (non-negative least squares) stack was chosen to keep ensemble weights interpretable and prevent cancellation artifacts. It also means that no head can pull a prediction down through a negative weight: the blend is only low when the base heads themselves predict low, and the base heads were trained on an active-heavy target distribution. There is no mechanism in the current stack to recognize and downshift true inactives, outside of the minor perturbations.
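The floor effect can be made concrete with the nonzero stack weights reported above, which sum to 0.993. Assuming every head outputs a non-negative pEC50 estimate and the blend has no intercept (both assumptions, since the exact stack form is not spelled out here), the blended value can never fall below 0.993 times the lowest head prediction. The head predictions below are hypothetical values for a single truly inactive compound.

```python
import numpy as np

# Nonzero stack weights as reported in the model summary.
weights = np.array([0.311, 0.204, 0.189, 0.146, 0.143])

# Hypothetical head predictions for one compound whose true pEC50 is near 2.
head_preds = np.array([3.6, 3.4, 3.9, 3.5, 3.7])

blend = float(weights @ head_preds)

# With w >= 0 and non-negative predictions, the blend is bracketed by
# sum(w) * min(head) and sum(w) * max(head).
floor = float(weights.sum() * head_preds.min())
ceiling = float(weights.sum() * head_preds.max())

print(f"floor={floor:.3f} blend={blend:.3f} ceiling={ceiling:.3f}")
```

If none of the heads ever predicts below roughly 3, no choice of non-negative weights near this total can produce a prediction near 2.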

Some thoughts for Phase 2

1. Trust the training distribution. One difficulty of this challenge is that the training compounds are far from the challenge compounds. However, in aggregate, the expansion analogs performed pretty similarly to the initial dose-response data. Stratifying the final ensemble to fit this distribution may help at both ends.
2. Rank-aware stack loss. Replace the pure MAE-optimized NNLS objective with a blended ranking-aware loss (e.g., MAE + a pairwise rank penalty), so the stack is explicitly rewarded for getting the ordering right, not just the absolute residuals.
3. Full-range docking perturbation. Docking was applied conservatively; it was one of the last things I enabled. I concentrated on the active region, under the hypothesis that docking could help distinguish partial agonists from full agonists. Perhaps it is just as effective, if not more so, at the low end, identifying clear non-binders that should have no activity.
4. External inactive augmentation. Curated public inactives could help on the low side of the prediction window. Earlier attempts hurt, but those attempts predate the CheMeleon representation.
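One possible shape for the rank-aware loss in item 2 is MAE blended with a pairwise hinge penalty on mis-ordered pairs. This is only a sketch of the idea: the hinge form, the `alpha` mixing weight, and the `margin` parameter are all assumptions, not a tested design, and the example values are made up.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


def pairwise_rank_penalty(y_true, y_pred, margin=0.0):
    """Mean hinge loss over pairs (i, j) where i is truly more potent than j."""
    true_diff = y_true[:, None] - y_true[None, :]
    pred_diff = y_pred[:, None] - y_pred[None, :]
    ordered = true_diff > 0
    if not ordered.any():
        return 0.0
    # A pair is penalized when the predicted gap falls short of the margin.
    hinge = np.maximum(0.0, margin - pred_diff[ordered])
    return float(hinge.mean())


def blended_loss(y_true, y_pred, alpha=0.5, margin=0.0):
    """(1 - alpha) * MAE + alpha * pairwise rank penalty."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = mean_absolute_error(y_true, y_pred)
    rank = pairwise_rank_penalty(y_true, y_pred, margin)
    return (1.0 - alpha) * mae + alpha * rank


# Two made-up prediction sets with identical MAE but different ordering.
toy_true = np.array([3.0, 4.0, 5.0, 6.0])
shifted = toy_true + 0.5                     # order preserved, constant offset
swapped = np.array([3.0, 5.0, 4.0, 6.0])     # middle pair mis-ordered

loss_shifted = blended_loss(toy_true, shifted)
loss_swapped = blended_loss(toy_true, swapped)
print(loss_shifted, loss_swapped)
```

Pure MAE cannot tell these two prediction sets apart, while the blended loss prefers the one that keeps the ordering intact.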