PPAR-δ agonists dataset
Agonists (789) were collected from different literature sources including CHEMBL29 and WOMBAT30. The range of their agonist activities (EC50) was 0.03–1000 nM. Figures S1–S3 present some characteristics of those 789 PPAR-δ agonists. They partially obey Lipinski’s Rule of Five (ROF) for Oral availability of drugs31 and Oprea’s rules for “Lead like molecules”30 (63% and 25% of the ligands, respectively). The violations are mainly due to high lipophilicity and high molecular weight of PPAR-δ agonists. The range of molecular weights is between 300–600 Daltons. There are no hydrophilic ligands, and most are hydrophobic with clogP value above 3. However, their functional groups have the potential to form strong specific interactions – hydrogen bonds and electrostatic interactions.
This dataset was divided in two, with one-half (394 molecules) serving as training set, and the others (395) as test set. Typically, active molecules are prone to be highly similar due to their shared synthetic procedures when applied for hit and lead optimization, often using the same scaffolds and chemical series. To avoid such bias, the diversity of the training set of actives needs to be increased, by limiting the similarity within the set. On the other hand, a demand for highly diverse molecules might exclude too many potential ones. Figure S4 presents the numbers of active molecules that remain at each “cutting edge” of Tanimoto imilarity scores (T)32 in both the training and the test set. The T = 0.8 threshold was found to present a fair balance between having diverse molecules and having enough molecules. Thus, 129 active molecules were excluded from the training set for model construction and 265 actives remained.
Five thousand molecules were collected randomly from the ENAMINE database to represent the set of “inactives”
The assumption is that, statistically, most of those randomly chosen molecules are inactive although they have not been confirmed as such. The set size of inactives is a compromise: it should represent accurately the properties’ space of the inactives but the number should not be too large, which might extend computation time and risk considerable bias (see discussion). Nearly 98% of these molecules have Tanimoto values of <0.4 to others. Those inactives were picked by limiting some of their properties based on the idea of an “applicability domain”33, as described in the methods.
A model for identifying of PPAR-δ agonists was created by Iterative Stochastic Elimination (ISE)
A flowchart of the ISE process is presented in Fig. S5. The ISE modeling was performed with a training set of 5265 molecules, for classifying the 265 actives against the class of 5000 inactives. We generated sixty-eight unique filters, with 4 descriptors each, that distinguish best between PPAR-δ agonists and the inactive molecules. The top filter had an MCC value (Matthews Correlation Coefficient)34 of 0.97 (98% TP, 99% TN) and the least classifying filter had an MCC of 0.755 (91% TP, 84% TN). Filters have been clustered to ensure dissimilarity >1% between filters. That is, if two filters are identical in the numbers of actives/inactives to the extent of >99%, the one with lower MCC is discarded. For details about the compositions of the filters – and for the meaning of descriptors see Table S2.
The ISE model was validated by the test set
The test set contained 395 known agonists and 10,000 “newly picked” inactives from ENAMINE. It was used in order to test the sixty-eight unique filters (the final model). The model produced a molecular bioactivity index (MBI) for each molecule in the test set. The MBI score4 is a result of the number of filters successfully passed (by having properties that fully fit the 4 descriptor ranges of a filter), which add their TP/FP value to the successfully passed molecule, while missing any filter (if one non-fitting descriptor’s value of any filter is found in a molecule) reduces the MBI score by TP/FP of that filter.
The model was highly efficient for scoring bioactivity on PPAR-δ. More than 96% of the actives in the test set were captured at the top 1% of the scores (10,395 molecules, including 10,000 random molecules). The area under the ROC is above 0.98, revealing a highly accurate and efficient model. See the enrichment plots in Fig. S6 and ROC curve plots in Fig. S7. In Fig. 2, the X-axis indicates MBI values (between −25 and +25) for the molecules that are simply counted on the Y-Axis, with no particular order. The relatively small number of filters is responsible for the fact that there are many very negative scores (mostly of TN) while there are only few molecules with MBI scores between −25 and −15. Larger MBI values are associated mostly with the true positives. In Table 1, each column represents an index (MBI) border between molecules considered to be positives (which are with higher MBI values) and those considered to be negatives, which have lower MBI values. As we know which ones, on each side, are actives or are assumed inactives, it is easy to compute the 6 values of each column. It is clear that enrichments are much larger at higher MBI values, while the numbers of TP become smaller, and so does the number of TN. However, MCC values do not change linearly, and are maximal around the MBI value of 3. As much as there is no dramatic difference from +6 to +10, we deal in the discussion with our decision to pick from screening only molecules with MBI of 10 and greater.
The ENAMINE database of 1.56 million molecules was screened through the model
Only 2,491 molecules achieved an MBI score >10. These molecules were used in the next step of docking and then for the selection for in vitro tests. Table S3 presents the number of the commercial molecules in ENAMINE, which have indexes over various thresholds. None of the 2,491 molecules were examined previously for PPAR-δ binding. The top 2,491 commercial molecules that got indexes above +10 in the ISE model were subject to docking with OpenEye’s FRED35. Finally, 306 top molecules were purchased for in vitro binding experiments as described below.
Five PDB complexes were collected and used to define the most important residues for docking
We used the 5 complexes in order to construct a list of interactions of protein residues with the crystallized ligands, shown in Fig. S8. Nearly 30 residues in those complexes have one or more close connections to their ligands by distance criteria that are described in the methods. The maximum distance for VdW interactions is 3.9 Å. Cys285, Thr288, Thr289, His323, Leu330, Ile364, His449, Met453, Leu469 and Tyr473 interact with more than 3 different ligands or create H-bonds with one of them, and these residues were defined as “important”. In addition, His323, His449 and Tyr473 consistently create a Hydrogen bond with all the ligands and so we define them as “crucial to H-bonds”.
Two PDB complexes were found optimal for docking
The Ramachandran plots38 of 3GZ9 and 3D5F do not have any outliers, while those of 3ET2, 3GWX and 1GWX have 2, 5 and 10 outliers, respectively (Supporting Information Fig. S9). In re-docking, we test how well the original ligand of a complex is predicted by the docking protocol. Except for the case of 3ET2, all the other ligands fulfilled the criteria described in the methods. Geometrical data, as well as energy score, for each selected pose in the re-docking is presented in Table S4.
Since we search for novel agonists, a crucial test of the ability of the docking algorithm and of a specific crystal complex is that of distinguishing between two sets of molecules, the known agonists and the inactives. The measure for success of discrimination between the sets is MCC. One hundred thirty-five molecules were picked out of the 789 agonists (each of them has Tanimoto value <0.7 to the others) and were tested on each one of the complexes. The criteria for success in docking were the same as in the re-docking validation test. Two chains (3D5Fa and 3D5Fb) of the same crystal complex and one complex, 3GZ9 identified the largest numbers of true positives (more than 100 out of the 135. See Table S5). The data regarding resolutions and Ramachandran Plots support these results, as these complexes have better resolution and Ramachandran Plot. Out of 1000 Random molecules (different from the set that was described above – see applicability domain in Table S6), only 179, 130 and 143 were docked, respectively. Therefore, MCC of 0.68–0.69 were calculated for each complex.
The 2491 molecules with top MBI scores were screened by docking to the three selected chains
The best results were picked on the basis of “voting”: 335 ISE hits were successfully docked to all three PPAR-δ complexes, 349 were successful in only two chains and 489 were successful in one only. The 335 hits that were successful in docking to all the 3 selected PPAR-δ complexes are highly diverse in comparison to the 394 agonists of the training set. Tanimoto index <0.3 is found for 318 molecules to all the others, while the rest 17 molecules have Tanimoto index <0.4. Among these 318 hits, 306 were available for purchasing, and sent to the Genomics Institute of Novartis Research Foundation (GNF) for in vitro experiments.
EC50 values (for activation) of the 306 candidates were determined for each of the three human PPARs
GW501516 was used as the positive control for hPPAR-δ with EC50 value of 0.001 µM for hPPAR-δ (EC50 values of 0.704 and 0.839 µM for hPPAR-α and hPPAR-γ, respectively)39. GW7647 was used as the positive control for hPPAR-α with EC50 value of 0.003 µM for hPPAR-α (EC50 values of 0.974 and 0.85 µM for hPPAR-δ and hPPAR-γ, respectively)39. GW1929 was used as the positive control for hPPAR-γ with EC50 value of 0.013 µM for hPPAR-γ (EC50 values of 1 µM for hPPAR-δ and hPPAR-α)40.
The in vitro results were categorized into 4 main classes: 1) agonists of hPPAR-δ with highest affinity (EC50 < 1 µM), 2) agonists of hPPAR-δ with low affinity (EC50 > 1 µM), 3) non-agonists of hPPAR-δ but agonists of other hPPARs and 4) non-agonists of hPPARs. The first class has 14 hits (Table 2 and Fig. 1). Only one of them is a “pan-agonist” hitting all three targets (GNF-8560), two of them are agonists of hPPAR-δ and hPPAR-α only (GNF-3632 & GNF-6952), and two of them are agonists of hPPAR-δ and hPPAR-γ only (GNF-5295 & GNF-0341). All the remaining 8 agonists are highly selective for hPPAR-δ. The second class has 13 hits (Table 3 and Fig. 3). Four of them are selective to hPPAR-δ. The third class has 64 hits (Table S7 and Fig. S10). Most of them have low affinities. Only 3 of them have EC50 < 1 µM for hPPAR-γ, and the best (GNF-6635) has EC50 = 0.277 µM and is not active on the other two targets. The other two (GNF-7017 & GNF-1165) have EC50 < 1 µM for hPPAR-γ, and EC50 > 1 µM for hPPAR-α.
All the new active scaffolds are diverse both with respect to the “learning set” as well as with respect to each other
One of the advantages of our methods is due to the representation of molecules as sets of physico-chemical properties and not as structural fragments. This has been repeatedly shown to lead to the discovery of novel scaffolds in our previous projects41,42. Tanimoto index was used in the training set to eliminate active molecules that are similar at a level of Tanimoto index ~0.8 and greater.
We compared the novel agonists to the training and the test set of actives. It is a matrix of Tanimoto values (each row is a known agonist while each column is a newly discovered agonist). The highest value in this matrix is 0.57, indicating that novel agonists are very different from the known 789 agonists from the ChEMBL and WOMBAT databases. Another interesting result is the diversity among the 27 newly discovered agonists. None of the values is greater that T = 0.7. Out of 351 Tanimoto values, only 8 are with T > 0.4, again indicating large diversity of the novel agonists.
Figure 4 presents the distribution of Tanimoto values for the three comparisons – the 789 known agonist set vs. novel agonists (Fig. 4A) and novel agonists among themselves (Fig. 4B). The third comparison is between the discovered 27 agonists and the randomly picked set of 5000. Only 10 out of the 135000 have T values above 0.7 (See Fig. 4C).
Recently, Wu et. al. discovered 16 new agonists43, all of which have high MBI scores of 13–19 in our models (Table S8), so the ISE model could identify them as agonists had they been in the dataset of the virtual screening. These agonists have no similarity to our novel agonists (see Fig. S11). We searched for similar molecules of these agonists in the ENAMINE catalog (in order to check if our model missed potential agonists). Only four molecules out of the 1.56 million have similarities of Tanimoto value > 0.7 to compound 1 (T0517-7230 & T5405641 have 0.74; T5681815 & T5999586 have 0.71). None of the ENAMINE molecules is similar to the other agonists.
Scaffolds of these new agonists are based on the substrate, and are similar in most of the cases. Figure S11b shows that all have Tanimoto values >0.6 to each other (and in most cases it is >0.7). Furthermore, 4 of these agonists (Compounds 1, 12, 13 & 14) have Tanimoto values > 0.7 to the 789 known agonists (our training and test sets), while in the case of our novel agonists none has similarity to Tanimoto value > 0.7 to the known agonists.
Finally, we used the Tanimoto index to compare our novel agonists vs. all molecules that are known in BindingDB44 or predicted agonists in ZINC1545. None of our agonists were found to have high similarity (>0.7) to any of the known or predicted ones.