OpenADMET PXR Challenge - Phase 2 Activity Report

Summary

In the end, I could not defensibly improve on my Phase 1 model, which had a public Phase 1 MAE of 0.4291. Its MAE against the Phase 2 compounds is 0.4317. This was not for lack of trying, as shown below.

Information Gathering

I generated the following features and information.

Challenge Data

Name	Description
PXR single-concentration HTS summaries	Max log2 fold-change and strict-active labels.
PXR counter-assay data	Null-assay pEC50 and Emax.
Emax labels	Maximum response from dose-response assays.

Gathered and Generated Data/Features

Name	Description
Curated ChEMBL PXR EC50 data	External functional EC50 measurements for PXR.
Physicochemical descriptors	Molecular weight, logP, TPSA, HBD, HBA, rotatable bonds, ring counts, aromatic rings, fraction sp3, formal charge, and related descriptors.
2D molecular fingerprints	Morgan/ECFP-style fingerprints used for folds, duplicate checks, rough similarity, and legacy baselines.
CheMeleon-FT embeddings	CheMeleon/ChemProp embeddings fine-tuned on PXR HTS readouts, plus base-model embeddings.
MolE predictions	Fine-tuned MolE activity predictions.
MolFormer XL predictions	Fine-tuned MolFormer activity predictions.
GNINA docking	Cross-docking on 8 PXR binding-site configurations and CNN scores.
Boltz2 cofolding	3 samples over all dose-response compounds.
Mixed-method cofolding	25 samples over 5 cofolding methods for all activity challenge compounds and a curated set of 180 training compounds.
UniMol2 pose embeddings and predictions	Structure-derived ligand-pose representations from docking.
Protein-ligand contact features (ProLIF)	Interaction fingerprints and contact frequencies.
Ligand geometry and pose-QC features	Strain, topology checks, and PoseBusters-style quality measures.
Analog-series support features	Active-neighbor support, inactive-neighbor support, and series-level activity evidence.
RDKit2D and DrugLike neighborhood features	Descriptor-space similarity and coverage features for analog reasoning.

Final Stack

Base NNLS Stack

Component	Coef
TabICL on CheMeleon-FT continuous HTS log2FC representation	0.311
TabPFN on CheMeleon-FT continuous HTS log2FC representation	0.204
LightGBM on CheMeleon-FT continuous HTS log2FC representation	0.189
LightGBM on CheMeleon-FT binary strict-active HTS representation	0.146
auxMT positive-only active-tail uplift anchor	0.143
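
As a rough illustration of how non-negative stacking weights like those in the table can be fit, the sketch below solves a small NNLS problem on out-of-fold predictions. It is not the pipeline used for this report: the function names `fit_nnls_weights` and `blend`, the toy numbers, and the pure-Python coordinate-descent solver are assumptions made for the example, and a library NNLS routine would normally take its place.

```python
def fit_nnls_weights(columns, target, n_sweeps=200):
    """Fit non-negative stacking weights by cyclic coordinate descent.

    columns: one list of out-of-fold predictions per base model (hypothetical input).
    target:  measured pEC50 values, same length as each column.
    """
    weights = [0.0] * len(columns)
    residual = list(target)  # target minus current blend; the blend starts at zero
    norms = [sum(c * c for c in col) for col in columns]
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            if norms[j] == 0.0:
                continue  # an all-zero column cannot carry weight
            # Exact one-dimensional least-squares update, clamped at zero.
            step = sum(c * r for c, r in zip(col, residual)) / norms[j]
            new_w = max(0.0, weights[j] + step)
            delta = new_w - weights[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                weights[j] = new_w
    return weights


def blend(columns, weights):
    """Weighted sum of base-model predictions, row by row."""
    return [sum(w * col[i] for w, col in zip(weights, columns))
            for i in range(len(columns[0]))]


# Toy example with made-up numbers (not challenge data).
oof = [
    [5.1, 4.8, 6.0, 5.5],  # hypothetical base model A
    [5.0, 4.9, 5.7, 5.6],  # hypothetical base model B
]
measured = [5.2, 4.7, 6.1, 5.4]
weights = fit_nnls_weights(oof, measured)
stacked = blend(oof, weights)
```

Because the weights are constrained to be non-negative, a component that would only help with a negative coefficient is dropped to zero rather than being allowed to cancel another model.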

Post-Stack Perturbation Layers

Layer	Description
Analog-series active perturbation	Uses RDKit2D and DrugLike descriptor-space similarity to find molecules supported by highly active neighbors (pEC50 > 5.5). This produced `current_best_alltrain_active55`, which moved 14 rows by at least 0.05 pEC50, with a maximum movement of about 0.0844 pEC50 and a mean delta of about 0.00628.
GNINA/UniMol2 top12 activefit perturbation	GNINA cross-docking against 8 PXR crystal/template configurations; top pose-derived ligand representations embedded with UniMol2; available-rank mean over up to top 12 GNINA/UniMol2 pose predictions. Applied only where base pEC50 >= 5.0 and UniMol2 top12 prediction existed.

GNINA/UniMol2 rule:

new_prediction = base + 0.25 * clip(unimol2_prediction - base, -0.20, +0.20)

Maximum possible movement: ±0.05 pEC50. For the allraw_top12 report model: 427 covered rows, 217 rows moved after base-pEC50 gate, 138 moved up, 79 moved down, mean delta 0.00603, max absolute delta 0.05.
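
The rule above can be written as a small function. This is a minimal sketch: the function and parameter names are hypothetical, a missing UniMol2 top12 prediction is represented as `None` by assumption, and the example values are made up.

```python
def clip(value, lower, upper):
    """Clamp value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


def apply_unimol2_perturbation(base, unimol2_prediction,
                               gate=5.0, scale=0.25, bound=0.20):
    """Bounded move of a base pEC50 toward the UniMol2 top12 prediction."""
    # No move when there is no top12 prediction or the base is below the gate.
    if unimol2_prediction is None or base < gate:
        return base
    return base + scale * clip(unimol2_prediction - base, -bound, bound)


# Made-up values: a large disagreement is capped at scale * bound = 0.05.
moved_up = apply_unimol2_perturbation(5.5, 6.5)
gated_out = apply_unimol2_perturbation(4.9, 6.5)
```

With `scale=0.25` and `bound=0.20`, no prediction can move by more than 0.05 pEC50 in either direction, however large the disagreement with the base model.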

Interpretation notes:

"CheMeleon-FT" means CheMeleon/ChemProp foundation model fine-tuned on PXR HTS data.
"HTS strict075" means the representation was trained on a binary-like strict-activity target from the single-concentration HTS set (the "binary strict-active" representation in the stack table).
"HTS log2FC" means the representation was trained on continuous single-concentration HTS response, median log2 fold-change.
MolE and MolFormer belonged to a different upstream/raw stack candidate, not to the final Phase 1 layered model.
UniMol2 did not enter as a raw stack component; it entered later as a bounded structural perturbation layer.
"auxMT positive-only active-tail uplift anchor" is a kitchen-sink model I developed very early on. It used the primary dose-response data, single-point data, null counter-assay data, and curated ChEMBL data. Its primary target was pEC50, with auxiliary targets primary_exam_vs_pos_ctrl, null_pEC50, null_exam_vs_pos_ctrl, and single_point_max_log2fc. Overall, this model had an MAE of 0.4878; its major limitation was that it badly underpredicted high actives. However, the compounds it did predict as active were almost uniformly pEC50 > 5.0. I therefore used this signal to boost those molecules in other models that were weak in the active region. It consistently performed well in NNLS stacks.
ChEMBL data were curated by physicochemical relevance. In total, 592 of 704 rows were kept, and 35 of these were designated high tier.

What Did Not Work

Rebuilding from ligand representations. I rebuilt the activity model using standard 2D chemical descriptors: topology, polarity, ring counts, hydrogen-bond features, logP-like terms, and related calculated properties. I also included learned molecular embeddings, including CheMeleon and ChemProp-style representations. Fine-tuned CheMeleon models were better than frozen embeddings and better than the simplest 2D controls (MAE 0.4756 vs. 0.5226). NNLS, TabICL, and TabPFN stacks maxed out at MAE 0.4842, however.
Pairwise difference model. I tried modeling activity differences between related compounds rather than predicting each compound in isolation. Chemically, this makes sense: many errors looked like local SAR problems, where the question was "does this analog go up or down from its neighbor?" In practice, the model mostly rediscovered what the original model already knew. If the base model already predicted a large difference, that was usually the strongest signal. The pairwise model hit a ceiling at a similar MAE to the 2D models. So the problem seemed to lie not in the representation but in the information content of the available data.
Hidden activity cliff model. I then focused on specific cliff pairs: compounds with Tanimoto similarity > 0.5 and a pEC50 difference > 1.0. The base model tends to smooth such pairs together. Broadly, for very similar structures where one is a known active, lower hydrophobicity in the analog was a signal that it would have significantly lower activity. While this made a useful review list, it could not safely tell us which Phase 2 compounds to move. Precision in cliff identification never got above 0.2.
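
The cliff-pair definition above (Tanimoto similarity > 0.5 and pEC50 difference > 1.0) can be sketched as follows. The fingerprint type is not specified here, so the example assumes each fingerprint is simply a set of on-bit indices; the compound IDs, bit sets, activities, and function names are made up for illustration.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def find_cliff_pairs(fingerprints, pec50, min_similarity=0.5, min_gap=1.0):
    """Return (id_a, id_b, similarity, gap) for pairs above both thresholds."""
    pairs = []
    for id_a, id_b in combinations(sorted(fingerprints), 2):
        similarity = tanimoto(fingerprints[id_a], fingerprints[id_b])
        gap = abs(pec50[id_a] - pec50[id_b])
        if similarity > min_similarity and gap > min_gap:
            pairs.append((id_a, id_b, similarity, gap))
    return pairs


# Made-up fingerprints and activities, for illustration only.
fingerprints = {
    "cpd_1": {1, 2, 3, 4},
    "cpd_2": {1, 2, 3, 5},
    "cpd_3": {7, 8, 9},
}
pec50 = {"cpd_1": 6.5, "cpd_2": 5.0, "cpd_3": 4.2}
cliffs = find_cliff_pairs(fingerprints, pec50)
```

This only labels cliffs among compounds with measured activity; the hard part described above is predicting which unmeasured analogs sit on the low side of such a pair.
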
Structure-informed model. Could predicted binding poses, protein contacts, and pose-quality features rescue some of the local SAR failures? Cross-docking and cofolding were run extensively on the challenge compounds, a limited set of training compounds, and a sparse sampling of the full set of 4,100 training compounds. The structural data were not junk; some contact patterns do correlate with activity, including interactions with Asn285 and Phe251. But the signal was sparse. It helped explain a few cases after the fact, and the GNINA/UniMol2 top12 activefit signal did enter the final model as a bounded perturbation, but heuristic thresholds had precision far too low (0.2-0.3 at best) to justify an unbounded uplift or downgrade rule.
CheMeleon and UniMol. I looked at whether external learned representations, including CheMeleon- and UniMol-like support, could identify compounds the main model was likely to miss. These signals were sometimes directionally useful, especially around possible high-activity undercalls. UniMol2 was not decisive enough as a raw stack component, but it was useful as part of the GNINA top12 bounded structural perturbation. When I turned broader signals into rules, they caught some interesting compounds but also swept in too many false positives.
Analog-series correction. I tried grouping compounds into likely analog series and using related active compounds as anchors. This did recover some real relationships, and the bounded active-tail perturbation remained in the final model. Broader activity transfer within a series was not stable enough: some analogs followed the parent, while others broke the pattern. A fixed per-series correction added as much error as it removed.
Expanded ensemble. I tested whether adding more models or reweighting the final average could improve the submitted model. This was a kitchen-sink approach, hoping that models had enough orthogonal information to be complementary. The expanded blends did not beat the original, and in some cases diluted the better signal.
Improving the validation suite. No model fitting can withstand a poor objective. At the top end (MAE < 0.46), my validation suite was anti-correlated with actual performance against the Phase 1 compounds. I spent considerable time trying to improve on the standard 5-fold Butina clustering. I tried some leaderboard-hacky methods: assessing different holdout approaches, retraining all models under each one, and checking whether the rank ordering improved. I developed a closed-loop approach and ran an autoresearch loop with a Hermes agent to test different ways of attacking the validator's weakness in the low and high activity regimes. I built hundreds of models under each validation protocol, scored them against the Phase 1 set, and compared the rank ordering of the models by validation MAE with their ordering by Phase 1 MAE. Nothing beat the original 5-fold Butina clustering beyond noise. In the end, as above, I think the high actives and the miscalls on the low end are so sparse and so discrete that no adjustment to the validator could capture them better.
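
One way to quantify how well a validation protocol orders models is a rank correlation between validation MAE and Phase 1 MAE across models. The sketch below uses Spearman correlation as an assumed choice of statistic; the function names and the MAE values are hypothetical and are not results from this work.

```python
def mae(predicted, observed):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side has no rank variation
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model MAEs under one validation protocol and on Phase 1.
validation_mae = [0.452, 0.455, 0.458, 0.471, 0.490]
phase1_mae = [0.447, 0.439, 0.431, 0.462, 0.480]
agreement = spearman(validation_mae, phase1_mae)
```

A value near +1 means the protocol ranks models the way the held-out set does; a negative value restricted to the best-scoring models would correspond to the anti-correlation described above.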

Final Thoughts

In the end, somehow my original model hit the maximum for the information I had and the approaches I considered, so I am sticking with it. Trying to identify the true activity cliffs in PXR is exceedingly difficult.

I am eager to see how those who scored better did it, whether through a different model representation or by generating or identifying features with orthogonal information.

Thanks so much to the organizers and all of the participants in what has really been a very difficult and stimulating challenge.