# Bibliography of Computer-Aided Drug Design

Updated on 7/18/2014. Currently 2130 references

## Docking / Benchmarks


## 2014

• SAMPL4 & DOCK3.7: lessons for automated docking procedures.
Coleman, Ryan G and Sterling, Teague and Weiss, Dahlia R
Journal of computer-aided molecular design, 2014, 28(3), 201-209
PMID: 24515818     doi: 10.1007/s10822-014-9722-6

The SAMPL4 challenges were used to test current automated methods for solvation energy, virtual screening, pose and affinity prediction of the molecular docking pipeline DOCK 3.7. Additionally, first-order models of binding affinity were proposed as milestones for any method predicting binding affinity. Several important discoveries about the molecular docking software were made during the challenge: (1) Solvation energies of ligands were five-fold worse than any other method used in SAMPL4, including methods that were similarly fast; (2) HIV Integrase is a challenging target, and automated docking on the correct allosteric site performed well in terms of virtual screening and pose prediction (compared to other methods), but affinity prediction, as expected, was very poor; (3) Molecular docking grid sizes can be very important: serious errors were discovered with default settings, which have been adjusted for all future work. Overall, lessons from SAMPL4 suggest many changes to molecular docking tools, not just DOCK 3.7, that could improve the state of the art. Future difficulties and projects will be discussed.

## 2013

• Evaluation and Optimization of Virtual Screening Workflows with DEKOIS 2.0 - A Public Library of Challenging Docking Benchmark Sets.
Bauer, Matthias R and Ibrahim, Tamer M and Vogel, Simon M and Boeckler, Frank M
Journal of chemical information and modeling, 2013, 53(6), 1447-1462
PMID: 23705874     doi: 10.1021/ci400115b

The application of molecular benchmarking sets helps to assess the actual performance of virtual screening (VS) workflows. To improve the efficiency of structure-based VS approaches, the selection and optimization of various parameters can be guided by benchmarking. With the DEKOIS 2.0 library, we aim to further extend and complement the collection of publicly available decoy sets. Based on BindingDB bioactivity data, we provide 81 new and structurally diverse benchmark sets for a wide variety of different target classes. To ensure a meaningful selection of ligands, we address several issues that can be found in bioactivity data. We have improved our previously introduced DEKOIS methodology with enhanced physicochemical matching, now including the consideration of molecular charges, as well as a more sophisticated elimination of latent actives in the decoy set (LADS). We evaluate the docking performance of Glide, GOLD, and AutoDock Vina with our data sets and highlight existing challenges for VS tools. All DEKOIS 2.0 benchmark sets will be made accessible at http://www.dekois.com .

• CSAR Data Set Release 2012: Ligands, Affinities, Complexes, and Docking Decoys
Dunbar, James B and Smith, Richard D and Damm-Ganamet, Kelly L and Ahmed, Aqeel and Esposito, Emilio Xavier and Delproposto, James and Chinnaswamy, Krishnapriya and Kang, You-Na and Kubish, Ginger and Gestwicki, Jason E and Stuckey, Jeanne A and Carlson, Heather A
Journal of chemical information and modeling, 2013, 53(8), 1842-1852
PMID: 23617227     doi: 10.1021/ci4000486

A major goal in drug design is the improvement of computational methods for docking and scoring. The Community Structure Activity Resource (CSAR) has collected several data sets from industry and added in-house data sets that may be used for this purpose ( www.csardock.org ). CSAR has currently obtained data from Abbott, GlaxoSmithKline, and Vertex and is working on obtaining data from several others. Combined with our in-house projects, we are providing a data set consisting of 6 protein targets, 647 compounds with biological affinities, and 82 crystal structures. Multiple congeneric series are available for several targets with a few representative crystal structures of each of the series. These series generally contain a few inactive compounds, usually not available in the literature, to provide an upper bound to the affinity range. The affinity ranges are typically 3-4 orders of magnitude per series. For our in-house projects, we have had compounds synthesized for biological testing. Affinities were measured by Thermofluor, Octet RED, and isothermal titration calorimetry for the most soluble. This allows the direct comparison of the biological affinities for those compounds, providing a measure of the variance in the experimental affinity. It appears that there can be considerable variance in the absolute value of the affinity, making the prediction of the absolute value ill-defined. However, the relative rankings within the methods are much better, and this fits with the observation that predicting relative ranking is a more tractable problem computationally. For those in-house compounds, we also have measured the following physical properties: logD, logP, thermodynamic solubility, and pKa. This data set also provides a substantial decoy set for each target consisting of diverse conformations covering the entire active site for all of the 58 CSAR-quality crystal structures. 
The CSAR data sets (CSAR-NRC HiQ and the 2012 release) provide substantial, publicly available, curated data sets for use in parametrizing and validating docking and scoring methods.

## 2012

• FRED and HYBRID docking performance on standardized datasets.
McGann, Mark
Journal of computer-aided molecular design, 2012, 26(8), 897-906
PMID: 22669221     doi: 10.1007/s10822-012-9584-8

The docking performance of the FRED and HYBRID programs is evaluated on two standardized datasets from the Docking and Scoring Symposium of the ACS Spring 2011 national meeting. The evaluation includes cognate docking and virtual screening performance. FRED docks 70 % of the structures to within 2 Å in the cognate docking test. In the virtual screening test, FRED is found to have a mean AUC of 0.75. The HYBRID program uses a modified version of FRED's algorithm that uses both ligand- and structure-based information to dock molecules, which increases its mean AUC to 0.78. HYBRID can also implicitly account for protein flexibility by making use of multiple crystal structures. Using multiple crystal structures improves HYBRID's performance (mean AUC 0.80) with a negligible increase in docking time (~15 %).
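Several entries in this list report virtual screening quality as an area under the ROC curve (AUC). As a minimal sketch of what that number means, and not the procedure used in any of the cited papers, the AUC can be computed as the fraction of active/decoy pairs in which the active is ranked ahead of the decoy. The function name `roc_auc` and the convention that a higher score is better are assumptions made for this illustration; many docking programs report energies where lower is better, in which case the scores should be negated first.

```python
def roc_auc(active_scores, decoy_scores):
    """Area under the ROC curve via pairwise comparison.

    Assumption for this sketch: a HIGHER score means a better rank.
    Returns the fraction of (active, decoy) pairs in which the active
    outscores the decoy; ties count as half a win.
    """
    if not active_scores or not decoy_scores:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    print(roc_auc([0.9, 0.4], [0.5, 0.1]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of actives from decoys. The pairwise loop is quadratic, which is fine for a sketch but would normally be replaced by a rank-based computation for large screening libraries.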

• How Good Are State-of-the-Art Docking Tools in Predicting Ligand Binding Modes in Protein-Protein Interfaces?
Krüger, Dennis M and Jessen, Gisela and Gohlke, Holger
Journal of chemical information and modeling, 2012, 52(11), 2807-2811
PMID: 23072688     doi: 10.1021/ci3003599

Protein-protein interfaces (PPIs) are an important class of drug targets. We report on the first large-scale validation study on docking into PPIs. DrugScore-adapted AutoDock3 and Glide showed good success rates with a moderate drop-off compared to docking to "classical targets". An analysis of the binding energetics in a PPI allows identifying those interfaces that are amenable for docking. The results are important for deciding if structure-based design approaches can be applied to a particular PPI.

• Evaluation of DOCK 6 as a pose generation and database enrichment tool.
Brozell, Scott R and Mukherjee, Sudipto and Balius, Trent E and Roe, Daniel R and Case, David A and Rizzo, Robert C
Journal of computer-aided molecular design, 2012, 26(6), 749-773
PMID: 22569593     doi: 10.1007/s10822-012-9565-y

In conjunction with the recent American Chemical Society symposium titled "Docking and Scoring: A Review of Docking Programs" the performance of the DOCK6 program was evaluated through (1) pose reproduction and (2) database enrichment calculations on a common set of organizer-specified systems and datasets (ASTEX, DUD, WOMBAT). Representative baseline grid score results averaged over five docking runs yield a relatively high pose identification success rate of 72.5 % (symmetry corrected rmsd) and sampling rate of 91.9 % for the multi site ASTEX set (N

• Variability in docking success rates due to dataset preparation.
Corbeil, Christopher R and Williams, Christopher I and Labute, Paul
Journal of computer-aided molecular design, 2012, 26(6), 775-786
PMID: 22566074     doi: 10.1007/s10822-012-9570-1

The results of cognate docking with the prepared Astex dataset provided by the organizers of the "Docking and Scoring: A Review of Docking Programs" session at the 241st ACS national meeting are presented. The MOE software with the newly developed GBVI/WSA dG scoring function is used throughout the study. For 80 % of the Astex targets, the MOE docker produces a top-scoring pose within 2 Å of the X-ray structure. For 91 % of the targets a pose within 2 Å of the X-ray structure is produced in the top 30 poses. Docking failures, defined as cases where the top scoring pose is greater than 2 Å from the experimental structure, are shown to be largely due to the absence of bound waters in the source dataset, highlighting the need to include these and other crucial information in future standardized sets. Docking success is shown to depend heavily on data preparation. A "dataset preparation" error of 0.5 kcal/mol is shown to cause fluctuations of over 20 % in docking success rates.
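The cognate docking numbers quoted in these abstracts (top-scoring pose within 2 Å, or any pose within 2 Å among the top 30) are success rates over a set of targets. The following is a small sketch of that bookkeeping, assuming the per-pose RMSD values have already been computed and sorted by docking score; the function name `success_rate` and the input layout are hypothetical and not taken from any of the cited tools.

```python
def success_rate(rmsds_by_target, top_n=1, cutoff=2.0):
    """Fraction of targets with at least one pose within `cutoff` Å
    of the experimental structure among the `top_n` best-scored poses.

    rmsds_by_target: list of lists; each inner list holds the RMSD (in Å)
    of every pose for one target, ordered from best to worst score.
    """
    if not rmsds_by_target:
        raise ValueError("no targets given")
    hits = sum(
        1
        for rmsds in rmsds_by_target
        if any(r <= cutoff for r in rmsds[:top_n])
    )
    return hits / len(rmsds_by_target)


if __name__ == "__main__":
    # Invented RMSD values for four targets, for illustration only.
    data = [[1.2, 3.0], [4.5, 1.8], [5.0, 6.0], [0.5]]
    print(success_rate(data, top_n=1))  # top-scoring pose only
    print(success_rate(data, top_n=2))  # any of the two best-scored poses
```

The gap between the top-1 and top-N rates separates scoring failures (a good pose was sampled but not ranked first) from sampling failures (no good pose was generated at all).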

• Lead Finder docking and virtual screening evaluation with Astex and DUD test sets.
Novikov, Fedor N and Stroylov, Viktor S and Zeifman, Alexey A and Stroganov, Oleg V and Kulkov, Val and Chilov, Ghermes G
Journal of computer-aided molecular design, 2012, 26(6), 725-735
PMID: 22569592     doi: 10.1007/s10822-012-9549-y

Lead Finder is a molecular docking program. Sampling uses an original implementation of the genetic algorithm that involves a number of additional optimization procedures. Lead Finder's scoring functions employ a set of semi-empiric molecular mechanics functionals that have been parameterized independently for docking, binding energy predictions and rank-ordering for virtual screening. Sampling and scoring both utilize a staged approach, moving from fast but less accurate algorithm versions to computationally more intensive but more accurate versions. Lead Finder includes tools for the preparation of full atom protein and ligand models. In this exercise, Lead Finder achieved a 72.9% docking success rate on the Astex test set when the original author-prepared full atom models were used, and a 74.1% success rate when the structures were prepared by Lead Finder. The major cause of docking failures was scoring errors resulting from the use of imperfect solvation models. In many cases, docking errors could be corrected by the proper protonation and the use of correct cyclic conformations of ligands. In virtual screening experiments on the DUD test set, an early enrichment factor of several tens was achieved on average. However, the area under the ROC curve ("AUC ROC") ranged from 0.70 to 0.74 depending on the screening protocol used, and the separation from the null model was not perfect: 0.12-0.15 units of AUC ROC. We assume that effective virtual screening in the whole range of enrichment curve and not just at the early enrichment stages requires more accurate solvation modeling and accounting for the protein backbone flexibility.

• Docking and scoring with ICM: the benchmarking results and strategies for improvement.
Neves, Marco A C and Totrov, Maxim and Abagyan, Ruben
Journal of computer-aided molecular design, 2012, 26(6), 675-686
PMID: 22569591     doi: 10.1007/s10822-012-9547-0

Flexible docking and scoring using the internal coordinate mechanics software (ICM) was benchmarked for ligand binding mode prediction against the 85 co-crystal structures in the modified Astex data set. The ICM virtual ligand screening was tested against the 40 DUD target benchmarks and 11-target WOMBAT sets. The self-docking accuracy was evaluated for the top 1 and top 3 scoring poses at each ligand binding site with near native conformations below 2 Å RMSD found in 91 and 95% of the predictions, respectively. The virtual ligand screening using single rigid pocket conformations provided the median area under the ROC curves equal to 69.4 with 22.0% true positives recovered at 2% false positive rate. Significant improvements up to ROC AUC

• Surflex-Dock: Docking benchmarks and real-world application.
Spitzer, Russell and Jain, Ajay N
Journal of computer-aided molecular design, 2012, 26(6), 687-699
PMID: 22569590     doi: 10.1007/s10822-011-9533-y

Benchmarks for molecular docking have historically focused on re-docking the cognate ligand of a well-determined protein-ligand complex to measure geometric pose prediction accuracy, and measurement of virtual screening performance has been focused on increasingly large and diverse sets of target protein structures, cognate ligands, and various types of decoy sets. Here, pose prediction is reported on the Astex Diverse set of 85 protein ligand complexes, and virtual screening performance is reported on the DUD set of 40 protein targets. In both cases, prepared structures of targets and ligands were provided by symposium organizers. The re-prepared data sets yielded results not significantly different than previous reports of Surflex-Dock on the two benchmarks. Minor changes to protein coordinates resulting from complex pre-optimization had large effects on observed performance, highlighting the limitations of cognate ligand re-docking for pose prediction assessment. Docking protocols developed for cross-docking, which address protein flexibility and produce discrete families of predicted poses, produced substantially better performance for pose prediction. Virtual screening performance was shown to benefit from employing and combining multiple screening methods: docking, 2D molecular similarity, and 3D molecular similarity. In addition, use of multiple protein conformations significantly improved screening enrichment.

• In Silico Mutagenesis and Docking Study of Ralstonia solanacearum RSL Lectin: Performance of Docking Software To Predict Saccharide Binding.
Mishra, Sushil Kumar and Adam, Jan and Wimmerová, Michaela and Koca, Jaroslav
Journal of chemical information and modeling, 2012, 52(5), 1250-1261
PMID: 22506916     doi: 10.1021/ci200529n

In this study, in silico mutagenesis and docking in Ralstonia solanacearum lectin (RSL) were carried out, and the ability of several docking software programs to calculate binding affinity was evaluated. In silico mutation of six amino acid residues (Arg17, Glu28, Gly39, Ala40, Trp76, and Trp81) was done, and a total of 114 in silico mutants of RSL were docked with Me-α-L-fucoside. Our results show that polar residues Arg17 and Glu28, as well as nonpolar amino acids Trp76 and Trp81, are crucial for binding. Gly39 may also influence ligand binding because any mutations at this position lead to a change in the binding pocket shape. The Ala40 residue was found to be the most interesting residue for mutagenesis and can affect the selectivity and/or affinity. In general, the docking software used performs better for high affinity binders and fails to place the binding affinities in the correct order.

• Docking performance of the glide program as evaluated on the Astex and DUD datasets: a complete set of glide SP results and selected results for a new scoring function integrating WaterMap and glide.
Repasky, Matthew P and Murphy, Robert B and Banks, Jay L and Greenwood, Jeremy R and Tubert-Brohman, Ivan and Bhat, Sathesh and Friesner, Richard A
Journal of computer-aided molecular design, 2012, 26(6), 787-799
PMID: 22576241     doi: 10.1007/s10822-012-9575-9

Glide SP mode enrichment results for two preparations of the DUD dataset and native ligand docking RMSDs for two preparations of the Astex dataset are presented. Following a best-practices preparation scheme, an average RMSD of 1.140 Å for native ligand docking with Glide SP is computed. Following the same best-practices preparation scheme for the DUD dataset an average area under the ROC curve (AUC) of 0.80 and average early enrichment via the ROC (0.1 %) metric of 0.12 were observed. 74 and 56 % of the 39 best-practices prepared targets showed AUC over 0.7 and 0.8, respectively. Average AUC was greater than 0.7 for all best-practices protein families demonstrating consistent enrichment performance across a broad range of proteins and ligand chemotypes. In both Astex and DUD datasets, docking performance is significantly improved employing a best-practices preparation scheme over using minimally-prepared structures from the PDB. Enrichment results for WScore, a new scoring function and sampling methodology integrating WaterMap and Glide, are presented for four DUD targets, hivrt, hsp90, cdk2, and fxa. WScore performance in early enrichment is consistently strong and all systems examined show AUC > 0.9 and superior early enrichment to DUD best-practices Glide SP results.

• Pose prediction and virtual screening performance of GOLD scoring functions in a standardized test.
Liebeschuetz, John W and Cole, Jason C and Korb, Oliver
Journal of computer-aided molecular design, 2012, 26(6), 737-748
PMID: 22371207     doi: 10.1007/s10822-012-9551-4

The performance of all four GOLD scoring functions has been evaluated for pose prediction and virtual screening under the standardized conditions of the comparative docking and scoring experiment reported in this Edition. Excellent pose prediction and good virtual screening performance were demonstrated using unmodified protein models and default parameter settings. The best performing scoring function for both pose prediction and virtual screening was demonstrated to be the recently introduced scoring function ChemPLP. We conclude that existing docking programs already perform close to optimally in the cognate pose prediction experiments currently carried out and that more stringent pose prediction tests should be used in the future. These should employ cross-docking sets. Evaluation of virtual screening performance remains problematic and much remains to be done to improve the usefulness of publicly available active and decoy sets for virtual screening. Finally we suggest that, for certain target/scoring function combinations, good enrichment may sometimes be a consequence of 2D property recognition rather than a modelling of the correct 3D interactions.

• Protein-Ligand-Based Pharmacophores: Generation and Utility Assessment in Computational Ligand Profiling.
Meslamani, Jamel and Li, Jiabo and Sutter, Jon and Stevens, Adrian and Bertrand, Hugues-Olivier and Rognan, Didier
Journal of chemical information and modeling, 2012, 52, 943-955
PMID: 22480372     doi: 10.1021/ci300083r

Ligand profiling is an emerging computational method for predicting the most likely targets of a bioactive compound and therefore anticipating adverse reactions, side effects and drug repurposing. A few encouraging successes have already been reported using ligand 2-D similarity searches and protein-ligand docking. The current study describes the use of receptor-ligand-derived pharmacophore searches as a tool to link ligands to putative targets. A database of 68,056 pharmacophores was first derived from 8,166 high-resolution protein-ligand complexes. In order to limit the number of queries, a maximum of 10 pharmacophores was generated for each complex according to their predicted selectivity. Pharmacophore search was compared to ligand-centric (2-D and 3-D similarity searches) and docking methods in profiling a set of 157 diverse ligands against a panel of 2,556 unique targets of known X-ray structure. As expected, ligand-based methods outperformed, in most of the cases, structure-based approaches in ranking the true targets among the top 1% scoring entries. However, we could identify ligands for which only a single method was successful. Receptor-ligand-based pharmacophore search is notably a fast and reliable alternative to docking when little ligand information is available for some targets. Overall, the present study suggests that a workflow using the best profiling method according to the protein-ligand context is the best strategy to follow. We notably present concrete guidelines for selecting the optimal computational method according to simple ligand and binding site properties.

## 2011

• Virtual decoy sets for molecular docking benchmarks.
Wallach, Izhar and Lilien, Ryan
Journal of chemical information and modeling, 2011, 51(2), 196-202
PMID: 21207928     doi: 10.1021/ci100374f

Virtual docking algorithms are often evaluated on their ability to separate active ligands from decoy molecules. The current state-of-the-art benchmark, the Directory of Useful Decoys (DUD), minimizes bias by including decoys from a library of synthetically feasible molecules that are physically similar yet chemically dissimilar to the active ligands. We show that by ignoring synthetic feasibility, we can compile a benchmark that is comparable to the DUD and less biased with respect to physical similarity.

• FRED pose prediction and virtual screening accuracy.
McGann, Mark
Journal of chemical information and modeling, 2011, 51(3), 578-596
PMID: 21323318     doi: 10.1021/ci100436p

Results of a previous docking study are reanalyzed and extended to include results from the docking program FRED and a detailed statistical analysis of both structure reproduction and virtual screening results. FRED is run both in a traditional docking mode and in a hybrid mode that makes use of the structure of a bound ligand in addition to the protein structure to screen molecules. This analysis shows that most docking programs are effective overall but highly inconsistent, tending to do well on one system and poorly on the next. Comparing methods, the difference in mean performance on DUD is found to be statistically significant (95% confidence) 61% of the time when using a global enrichment metric (AUC). Early enrichment metrics are found to have relatively poor statistical power, with 0.5% early enrichment only able to distinguish methods to 95% confidence 14% of the time.

• Construction and test of ligand decoy sets using MDock: community structure-activity resource benchmarks for binding mode prediction.
Huang, Sheng-You and Zou, Xiaoqin
Journal of chemical information and modeling, 2011, 51(9), 2107-2114
PMID: 21755952     doi: 10.1021/ci200080g

Two sets of ligand binding decoys have been constructed for the community structure-activity resource (CSAR) benchmark by using the MDock and DOCK programs for rigid- and flexible-ligand docking, respectively. The decoys generated for each complex in the benchmark thoroughly cover the binding site and also contain a certain number of near-native binding modes. A few scoring functions have been evaluated using the ligand binding decoy sets for their ability to predict near-native binding modes. Among them, ITScore achieved a success rate of 86.7% for the rigid-ligand decoys and 79.7% for the flexible-ligand decoys, under the common definition of a successful prediction as root-mean-square deviation <2.0 Å from the native structure if the top-scored binding mode was considered. The decoy sets, which are available at the CSAR Web site ( http://www.csardock.org/ ), may serve as benchmarks for binding mode prediction of a scoring function.

• Evaluation of docking performance in a blinded virtual screening of fragment-like trypsin inhibitors.
Surpateanu, Georgiana and Iorga, Bogdan I
Journal of computer-aided molecular design, 2011, 26(5), 595-601
PMID: 22180049     doi: 10.1007/s10822-011-9526-x

In this study, we have "blindly" assessed the ability of several combinations of docking software and scoring functions to predict the binding of a fragment-like library of bovine trypsin inhibitors. The most suitable protocols (involving Gold software and GoldScore scoring function, with or without rescoring) were selected for this purpose using a training set of compounds with known biological activities. The selected virtual screening protocols provided good results with the SAMPL3-VS dataset, showing enrichment factors of about 10 for Top 20 compounds. This methodology should be useful in difficult cases of docking, with a special emphasis on the fragment-based virtual screening campaigns.
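Enrichment factors such as the "about 10 for Top 20 compounds" quoted above compare the hit rate in the top of a ranked list with the hit rate of the whole library. A minimal sketch under the usual definition, EF = (actives in top N / N) / (total actives / library size), follows; the function name `enrichment_factor` and the 0/1 label encoding are assumptions for this example, and the cited study may define or report the metric differently.

```python
def enrichment_factor(labels_ranked, top_n):
    """Enrichment factor for the `top_n` best-ranked compounds.

    labels_ranked: list of 1 (active) / 0 (inactive or decoy), ordered
    from best to worst docking rank.
    EF = (actives in top_n / top_n) / (total actives / library size).
    """
    total = len(labels_ranked)
    n_actives = sum(labels_ranked)
    if not 0 < top_n <= total:
        raise ValueError("top_n must be between 1 and the library size")
    if n_actives == 0:
        raise ValueError("library contains no actives")
    found = sum(labels_ranked[:top_n])
    return (found / top_n) / (n_actives / total)


if __name__ == "__main__":
    # Invented ranking of ten compounds with two actives.
    ranking = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    print(enrichment_factor(ranking, top_n=2))
```

An enrichment factor of 1 means the selection is no better than picking compounds at random. The maximum achievable value depends on the ratio of actives to decoys, so enrichment factors from libraries of different composition are not directly comparable.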

• Docking performance of fragments and druglike compounds.
Verdonk, Marcel L and Giangreco, Ilenia and Hall, Richard J and Korb, Oliver and Mortenson, Paul N and Murray, Christopher W
Journal of medicinal chemistry, 2011, 54(15), 5422-5431
PMID: 21692478     doi: 10.1021/jm200558u

This paper addresses two questions of key interest to researchers working with protein-ligand docking methods: (i) Why is there such a large variation in docking performance between different test sets reported in the literature? (ii) Are fragments more difficult to dock than druglike compounds? To answer these, we construct a test set of in-house X-ray structures of protein-ligand complexes from drug discovery projects, half of which contain fragment ligands, the other half druglike ligands. We find that a key factor affecting docking performance is ligand efficiency (LE). High LE compounds are significantly easier to dock than low LE compounds, which we believe could explain the differences observed between test sets reported in the literature. There is no significant difference in docking performance between fragments and druglike compounds, but the reasons why dockings fail appear to be different.

• Ligand and Decoy Sets for Docking to G Protein-Coupled Receptors.
Gatica, Edgar A and Cavasotto, Claudio N
Journal of chemical information and modeling, 2011, 52(1), 1-6
PMID: 22168315     doi: 10.1021/ci200412p

We compiled a G protein-coupled receptor (GPCR) ligand library (GLL) for 147 targets, selecting for each ligand 39 decoy molecules, collected in the GPCR Decoy Database (GDD). Decoys were chosen ensuring a ligand-decoy similarity of six physical properties, while enforcing ligand-decoy chemical dissimilarity. The performance in docking of the GDD was evaluated on 19 GPCRs, showing a marked decrease in enrichment compared to bias-uncorrected decoy sets. Both the GLL and GDD are freely available for the scientific community.
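The decoy selection idea described here (and in the DUD and DEKOIS entries) is to pick molecules that resemble a ligand in physical properties while being chemically dissimilar to it. The sketch below illustrates that filter only in outline and is not the published GDD protocol: the property tuples, the relative tolerance `prop_tol`, the Tanimoto cutoff `max_sim`, and the representation of fingerprints as Python sets of integer bit indices are all assumptions made for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def pick_decoys(ligand, candidates, n_decoys=39, prop_tol=0.1, max_sim=0.5):
    """Select up to `n_decoys` candidates that match the ligand's physical
    properties but are chemically dissimilar to it.

    ligand / candidates: dicts with
      'props': tuple of floats (e.g. molecular weight, logP); hypothetical
      'fp':    set of ints (on-bits of a 2D fingerprint); hypothetical
    A property matches if it lies within `prop_tol` (relative) of the
    ligand's value; chemical dissimilarity means Tanimoto < `max_sim`.
    """
    def props_match(p, q):
        return all(
            abs(a - b) <= prop_tol * max(abs(a), 1.0) for a, b in zip(p, q)
        )

    chosen = [
        c
        for c in candidates
        if props_match(ligand["props"], c["props"])
        and tanimoto(ligand["fp"], c["fp"]) < max_sim
    ]
    return chosen[:n_decoys]


if __name__ == "__main__":
    # Invented molecules: (molecular weight, logP) and toy fingerprints.
    lig = {"props": (300.0, 2.0), "fp": {1, 2, 3, 4}}
    pool = [
        {"name": "A", "props": (310.0, 2.1), "fp": {5, 6, 7}},
        {"name": "B", "props": (305.0, 2.0), "fp": {1, 2, 3, 9}},
        {"name": "C", "props": (500.0, 2.0), "fp": {8}},
    ]
    print([d["name"] for d in pick_decoys(lig, pool)])
```

In this toy run only candidate A passes both filters: B is too similar chemically and C differs too much in molecular weight. The default of 39 decoys per ligand mirrors the ratio quoted in the abstract; a real pipeline would compute properties and fingerprints with a cheminformatics toolkit rather than take them as given.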

• DEKOIS: Demanding Evaluation Kits for Objective in Silico Screening - A Versatile Tool for Benchmarking Docking Programs and Scoring Functions.
Vogel, Simon M and Bauer, Matthias R and Boeckler, Frank M
Journal of chemical information and modeling, 2011, 51(10), 2650-2665
PMID: 21774552     doi: 10.1021/ci2001549

For widely applied in silico screening techniques success depends on the rational selection of an appropriate method. We herein present a fast, versatile, and robust method to construct demanding evaluation kits for objective in silico screening (DEKOIS). This automated process enables creating tailor-made decoy sets for any given sets of bioactives. It facilitates a target-dependent validation of docking algorithms and scoring functions helping to save time and resources. We have developed metrics for assessing and improving decoy set quality and employ them to investigate how decoy embedding affects docking. We demonstrate that screening performance is target-dependent and can be impaired by latent actives in the decoy set (LADS) or enhanced by poor decoy embedding. The presented method allows extending and complementing the collection of publicly available high quality decoy sets toward new target space. All present and future DEKOIS data sets will be made accessible at www.dekois.com .

• SERAPhiC: A Benchmark for in Silico Fragment-Based Drug Design.
Favia, Angelo D and Bottegoni, Giovanni and Nobeli, Irene and Bisignano, Paola and Cavalli, Andrea
Journal of chemical information and modeling, 2011, 51(11), 2882-2896
PMID: 21936510     doi: 10.1021/ci2003363

Our main objective was to compile a data set of high-quality protein-fragment complexes and make it publicly available. Once assembled, the data set was challenged using docking procedures to address the following questions: (i) Can molecular docking correctly reproduce the experimentally solved structures? (ii) How thorough must the sampling be to replicate the experimental data? (iii) Can commonly used scoring functions discriminate between the native pose and other energy minima? The data set, named SERAPhiC (Selected Fragment Protein Complexes), is publicly available in a ready-to-dock format ( http://www.iit.it/en/drug-discovery-and-development/seraphic.html ). It offers computational medicinal chemists a reliable test set for both in silico protocol assessment and software development.

• LigDockCSA: Protein-ligand docking using conformational space annealing.
Shin, Woong-Hee and Heo, Lim and Lee, Juyong and Ko, Junsu and Seok, Chaok and Lee, Jooyoung
Journal of computational chemistry, 2011, 32(15), 3226-3232
PMID: 21837636     doi: 10.1002/jcc.21905

Protein-ligand docking techniques are one of the essential tools for structure-based drug design. Two major components of a successful docking program are an efficient search method and an accurate scoring function. In this work, a new docking method called LigDockCSA is developed by using a powerful global optimization technique, conformational space annealing (CSA), and a scoring function that combines the AutoDock energy and the piecewise linear potential (PLP) torsion energy. It is shown that the CSA search method can find lower energy binding poses than the Lamarckian genetic algorithm of AutoDock. However, lower-energy solutions CSA produced with the AutoDock energy were often less native-like. The loophole in the AutoDock energy was fixed by adding a torsional energy term, and the CSA search on the refined energy function is shown to improve the docking performance. The performance of LigDockCSA was tested on the Astex diverse set which consists of 85 protein-ligand complexes. LigDockCSA finds the best scoring poses within 2 Å root-mean-square deviation (RMSD) from the native structures for 84.7% of the test cases, compared to 81.7% for AutoDock and 80.5% for GOLD. The results improve further to 89.4% by incorporating the conformational entropy.

• Can We Trust Docking Results? Evaluation of Seven Commonly Used Programs on PDBbind Database
Plewczynski, Dariusz and Lazniewski, Michal and Augustyniak, Rafal and Ginalski, Krzysztof
Journal of computational chemistry, 2011, 32(4), 742-755
PMID: 20812323     doi: 10.1002/jcc.21643

Docking is one of the most commonly used techniques in drug design. It is used both for identifying correct poses of a ligand in the binding site of a protein and for estimating the strength of the protein-ligand interaction. Because millions of compounds must be screened, before a suitable target for biological testing can be identified, all calculations should be done in a reasonable time frame. Thus, all programs currently in use exploit empirically based algorithms, avoiding systematic search of the conformational space. Similarly, the scoring is done using simple equations, which makes it possible to speed up the entire process. Therefore, docking results have to be verified by subsequent in vitro studies. The purpose of our work was to evaluate seven popular docking programs (Surflex, LigandFit, Glide, GOLD, FlexX, eHiTS, and AutoDock) on the extensive dataset composed of 1300 protein-ligand complexes from PDBbind 2007 database, where experimentally measured binding affinity values were also available. We compared independently the ability of proper posing [according to root-mean-square deviation (or root-mean-square distance) of predicted conformations versus the corresponding native one] and scoring (by calculating the correlation between docking score and ligand binding strength). To our knowledge, it is the first large-scale docking evaluation that covers both aspects of docking programs, that is, predicting ligand conformation and calculating the strength of its binding. More than 1000 protein-ligand pairs cover a wide range of different protein families and inhibitor classes. Our results clearly showed that the ligand binding conformation could be identified in most cases by using the existing software, yet we still observed the lack of universal scoring function for all types of molecules and protein families. (C) 2010 Wiley Periodicals, Inc. J Comput Chem 32: 742-755, 2011

• VoteDock: Consensus Docking Method for Prediction of Protein-Ligand Interactions
Plewczynski, Dariusz and Lazniewski, Michal and Von Grotthuss, Marcin and Rychlewski, Leszek and Ginalski, Krzysztof
Journal of computational chemistry, 2011, 32(4), 568-581
PMID: 20812324     doi: 10.1002/jcc.21642

Molecular recognition plays a fundamental role in all biological processes, and that is why great efforts have been made to understand and predict protein-ligand interactions. Finding a molecule that can potentially bind to a target protein is particularly essential in drug discovery and still remains an expensive and time-consuming task. In silico tools are frequently used to screen molecular libraries to identify new lead compounds, and if protein structure is known, various protein-ligand docking programs can be used. The aim of docking procedure is to predict correct poses of ligand in the binding site of the protein as well as to score them according to the strength of interaction in a reasonable time frame. The purpose of our studies was to present the novel consensus approach to predict both protein-ligand complex structure and its corresponding binding affinity. Our method used as the input the results from seven docking programs (Surflex, LigandFit, Glide, GOLD, FlexX, eHiTS, and AutoDock) that are widely used for docking of ligands. We evaluated it on the extensive benchmark dataset of 1300 protein-ligand pairs from refined PDBbind database for which the structural and affinity data was available. We compared independently its ability of proper scoring and posing to the previously proposed methods. In most cases, our method is able to dock properly approximately 20% of pairs more than docking methods on average, and over 10% of pairs more than the best single program. The RMSD value of the predicted complex conformation versus its native one is reduced by a factor of 0.5 angstrom. Finally, we were able to increase the Pearson correlation of the predicted binding affinity in comparison with the experimental value up to 0.5. (C) 2010 Wiley Periodicals, Inc. J Comput Chem 32: 568-581, 2011

• Comments on "leave-cluster-out cross-validation is appropriate for scoring functions derived from diverse protein data sets": significance for the validation of scoring functions.
Ballester, Pedro J and Mitchell, John B O
Journal of chemical information and modeling, 2011, 51(8), 1739-1741
PMID: 21591735     doi: 10.1021/ci200057e

## 2010

• Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods.
Venkatraman, Vishwesh and Pérez-Nueno, Violeta I and Mavridis, Lazaros and Ritchie, David W
Journal of chemical information and modeling, 2010, 50(12), 2079-2093
PMID: 21090728     doi: 10.1021/ci100263p

In recent years, many virtual screening (VS) tools have been developed that employ different molecular representations and have different speed and accuracy characteristics. In this paper, we compare ten popular ligand-based VS tools using the publicly available Directory of Useful Decoys (DUD) data set comprising over 100 000 compounds distributed across 40 protein targets. The DUD was developed initially to evaluate docking algorithms, but our results from an operational correlation analysis show that it is also well suited for comparing ligand-based VS tools. Although it is conventional wisdom that 3D molecular shape is an important determinant of biological activity, our results based on permutational significance tests of several commonly used VS metrics show that the 2D fingerprint-based methods generally give better VS performance than the 3D shape-based approaches for surprisingly many of the DUD targets. To help understand this finding, we have analyzed the nature of the scoring functions used and the composition of the DUD data set itself. We propose that to improve the VS performance of current 3D methods, it will be necessary to devise screening queries that can represent multiple possible conformations and which can exploit knowledge of known actives that span multiple scaffold families.

• Prediction of protein-ligand binding affinities using multiple instance learning.
Teramoto, Reiji and Kashima, Hisashi
Journal of molecular graphics & modelling, 2010, 29(3), 492-497
PMID: 20965757     doi: 10.1016/j.jmgm.2010.09.006

Accurate prediction of protein-ligand binding affinities for lead optimization in drug discovery remains an important and challenging problem on scoring functions for docking simulation. In this paper, we propose a data-driven approach that integrates multiple scoring functions to predict protein-ligand binding affinity directly. We then propose a new method called multiple instance regression based scoring (MIRS) that incorporates unbound ligand conformations using multiple scoring functions. We evaluated the predictive performance of MIRS using 100 protein-ligand complexes and their binding affinities. The experimental results showed that MIRS outperformed the 11 conventional scoring functions including LigScore, PLP, AutoDock, G-Score, D-Score, LUDI, F-Score, ChemScore, X-Score, PMF, and DrugScore. In addition, we confirmed that MIRS performed well on binding pose prediction. Our results reveal that it is indispensable to incorporate unbound ligand conformations in both binding affinity prediction and binding pose prediction. The proposed method will accelerate efficient lead optimization on structure-based drug design and provide a new direction for designing new scoring functions.

• Comparative evaluation of 3D virtual ligand screening methods: impact of the molecular alignment on enrichment.
Giganti, David and Guillemain, Hélène and Spadoni, Jean-Louis and Nilges, Michael and Zagury, Jean-François and Montes, Matthieu
Journal of chemical information and modeling, 2010, 50(6), 992-1004
PMID: 20527883     doi: 10.1021/ci900507g

In the early stage of drug discovery programs, when the structure of a complex involving a target and a small molecule is available, structure-based virtual ligand screening methods are generally preferred. However, ligand-based strategies like shape-similarity search methods can also be applied. Shape-similarity search methods consist in exploring a pseudo-binding-site derived from the known small molecule used as a reference. Several of these methods use conformational sampling algorithms which are also shared by corresponding docking methods: for example Surflex-dock/Surflex-sim, FlexX/FlexS, ICM, and OMEGA-FRED/OMEGA-ROCS. Using 11 systems issued from the challenging "own" subsets of the Directory of Useful Decoys (DUD-own), we evaluated and compared the performance of the above-cited programs in terms of molecular alignment accuracy, enrichment in active compounds, and enrichment in different chemotypes (scaffold-hopping). Since molecular alignment is a crucial aspect of performance for the different methods, we have assessed its impact on enrichment. We have also illustrated the paradox of retrieving active compounds with good scores even if they are inaccurately positioned. Finally, we have highlighted possible positive aspects of using shape-based approaches in drug-discovery protocols when the structure of the target in complex with a small molecule is known.

• Reducing docking score variations arising from input differences.
Feher, Miklos and Williams, Christopher I
Journal of chemical information and modeling, 2010, 50(9), 1549-1560
PMID: 20698562     doi: 10.1021/ci100204x

The variability of docking results as a function of variations in ligand input conformations was studied for the GOLD, Glide, FlexX, and Surflex programs. It is concluded that there are two major effects leading to such variability: the adequacy of conformational search during docking and random "chaotic" effects arising from sensitivity to small input perturbations. It is shown that although the former is generally the stronger effect, the latter is also highly significant for almost all docking engines. The strong target-to-target variation of the magnitude of these effects is emphasized. The performance of different packages is compared using these measures. Guidelines are provided for different programs to reduce variability and improve reproducibility, which involve using a small number of input conformations as starting points for docking, followed by the selection of the top scoring docked pose from the results as the best docked solution.

• Evaluation of the Performance of Four Molecular Docking Programs on a Diverse Set of Protein-Ligand Complexes
Li, Xun and Li, Yan and Cheng, Tiejun and Liu, Zhihai and Wang, Renxiao
Journal of computational chemistry, 2010, 31(11), 2109-2125
PMID: 20127741     doi: 10.1002/jcc.21498

Many molecular docking programs are available nowadays, and thus it is of great practical value to evaluate and compare their performance. We have conducted an extensive evaluation of four popular commercial molecular docking programs, including Glide, GOLD, LigandFit, and Surflex. Our test set consists of 195 protein-ligand complexes with high-resolution crystal structures (resolution <

• Comparison of Structure- and Ligand-Based Virtual Screening Protocols Considering Hit List Complementarity and Enrichment Factors
Krueger, Dennis M. and Evers, Andreas
Chemmedchem, 2010, 5(1), 148-158
PMID: 19908272     doi: 10.1002/cmdc.200900314

Structure- and ligand-based virtual-screening methods (docking, 2D- and 3D-similarity searching) were analysed for their effectiveness in virtual screening against four different targets: angiotensin-converting enzyme (ACE), cyclooxygenase 2 (COX-2), thrombin and human immunodeficiency virus 1 (HIV-1) protease. The relative performance of the tools was compared by examining their ability to recognise known active compounds from a set of actives and nonactives. Furthermore, we investigated whether the application of different virtual-screening methods in parallel provides complementary or redundant hit lists. Docking was performed with GOLD, Glide, FlexX and Surflex. The obtained docking poses were rescored by using nine different scoring functions in addition to the scoring functions implemented as objective functions in the docking algorithms. Ligand-based virtual screening was done with ROCS (3D-similarity searching), Feature Trees and Scitegic Functional Fingerprints (2D-similarity searching). The results show that structure- and ligand-based virtual-screening methods provide comparable enrichments in detecting active compounds. Interestingly, the hit lists that are obtained from different virtual-screening methods are generally highly complementary. These results suggest that a parallel application of different structure- and ligand-based virtual-screening methods increases the chance of identifying more (and more diverse) active compounds from a virtual-screening campaign.

## 2009

• Docking ligands into flexible and solvated macromolecules. 4. Are popular scoring functions accurate for this class of proteins?
Englebienne, Pablo and Moitessier, Nicolas
Journal of chemical information and modeling, 2009, 49(6), 1568-1580
PMID: 19445499     doi: 10.1021/ci8004308

In our previous report, we investigated the impact of protein flexibility and the presence of water molecules on the pose-prediction accuracy of major docking programs. To complete these investigations, we report herein a study of the impact of these two aspects on the accuracy of scoring functions. To this effect, we developed two sets of protein/ligand complexes made up of ligands cross-docked or cocrystallized with a large variety of proteins, featuring bridging water molecules and demonstrating protein flexibility. Efforts were made to reduce the correlation between the molecular weights of the selected ligands and their binding affinities, a major bias in some previously reported benchmark sets. Using these sets, 18 available scoring functions have been assessed for their accuracy to predict binding affinities and to rank-order compounds by their affinity to cocrystallized proteins. This study confirmed the good and similar accuracy of Xscore, GlideScore, DrugScore(CSD), GoldScore, PLP1, ChemScore, RankScore, and the eHiTS scoring function. Our next investigations demonstrated that most of the assessed scoring functions were much less accurate when the correct protein conformation was not provided. This study also revealed that considering the water molecules for scoring does not greatly affect the accuracy. Finally, this work sheds light on the high correlation between scoring functions and the poor increase in accuracy one can expect from consensus scoring.

• Carborane clusters in computational drug design: a comparative docking evaluation using AutoDock, FlexX, Glide, and Surflex.
Tiwari, Rohit and Mahasenan, Kiran and Pavlovicz, Ryan and Li, Chenglong and Tjarks, Werner
Journal of chemical information and modeling, 2009, 49(6), 1581-1589
PMID: 19449853     doi: 10.1021/ci900031y

Compounds containing boron atoms play increasingly important roles in the therapy and diagnosis of various diseases, particularly cancer. However, computational drug design of boron-containing therapeutics and diagnostics is hampered by the fact that many software packages used for this purpose lack parameters for all or part of the various types of boron atoms. In the present paper, we describe simple and efficient strategies to overcome this problem, which are based on the replacement of boron atom types with carbon atom types. The developed methods were validated by docking closo- and nido-carboranyl antifolates into the active site of a human dihydrofolate reductase (hDHFR) using AutoDock, Glide, FlexX, and Surflex and comparing the obtained docking poses with the poses of their counterparts in the original hDHFR-carboranyl antifolate crystal structures. Under optimized conditions, AutoDock and Glide were equally good in docking of the closo-carboranyl antifolates followed by Surflex and FlexX, whereas Autodock, Glide, and Surflex proved to be comparably efficient in the docking of nido-carboranyl antifolates followed by FlexX. Differences in geometries and partial atom charges in the structures of the carboranyl antifolates resulting from different data sources and/or optimization methods did not impact the docking performances of AutoDock or Glide significantly. Binding energies predicted by all four programs were in accordance with experimental data.

• Validation of molecular docking programs for virtual screening against dihydropteroate synthase.
Hevener, Kirk E and Zhao, Wei and Ball, David M and Babaoglu, Kerim and Qi, Jianjun and White, Stephen W and Lee, Richard E
Journal of chemical information and modeling, 2009, 49(2), 444-460
PMID: 19434845     doi: 10.1021/ci800293n

Dihydropteroate synthase (DHPS) is the target of the sulfonamide class of antibiotics and has been a validated antibacterial drug target for nearly 70 years. The sulfonamides target the p-aminobenzoic acid (pABA) binding site of DHPS and interfere with folate biosynthesis and ultimately prevent bacterial replication. However, widespread bacterial resistance to these drugs has severely limited their effectiveness. This study explores the second and more highly conserved pterin binding site of DHPS as an alternative approach to developing novel antibiotics that avoid resistance. In this study, five commonly used docking programs, FlexX, Surflex, Glide, GOLD, and DOCK, and nine scoring functions, were evaluated for their ability to rank-order potential lead compounds for an extensive virtual screening study of the pterin binding site of B. anthracis DHPS. Their performance in ligand docking and scoring was judged by their ability to reproduce a known inhibitor conformation and to efficiently detect known active compounds seeded into three separate decoy sets. Two other metrics were used to assess performance; enrichment at 1% and 2% and Receiver Operating Characteristic (ROC) curves. The effectiveness of postdocking relaxation prior to rescoring and consensus scoring were also evaluated. Finally, we have developed a straightforward statistical method of including the inhibition constants of the known active compounds when analyzing enrichment results to more accurately assess scoring performance, which we call the 'sum of the sum of log rank' or SSLR. Of the docking and scoring functions evaluated, Surflex with Surflex-Score and Glide with GlideScore were the best overall performers for use in virtual screening against the DHPS target, with neither combination showing statistically significant superiority over the other in enrichment studies or pose selection. Postdocking ligand relaxation and consensus scoring did not improve overall enrichment.
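
The "enrichment at 1% and 2%" metric mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common definition of the enrichment factor (the active rate in the top-ranked fraction divided by the active rate in the whole library); the function name, argument names, and toy data are hypothetical and are not taken from the paper.

```python
def enrichment_factor(scores, is_active, fraction=0.01, higher_is_better=True):
    """Enrichment factor at a given top fraction of a ranked library.

    Assumed definition (not taken from the cited paper):
    EF = (actives in top fraction / size of top fraction)
         / (actives in library / size of library)
    """
    n = len(scores)
    if n == 0 or n != len(is_active):
        raise ValueError("scores and is_active must be non-empty and equal in length")
    total_actives = sum(1 for flag in is_active if flag)
    if total_actives == 0:
        raise ValueError("the library contains no actives")
    # Size of the top-ranked subset, at least one compound.
    n_top = max(1, int(round(n * fraction)))
    # Rank compounds from best to worst score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=higher_is_better)
    hits = sum(1 for i in order[:n_top] if is_active[i])
    return (hits / n_top) / (total_actives / n)


# Toy library: 100 compounds, the 10 best-scoring ones are the actives.
toy_scores = [float(i) for i in range(100)]
toy_labels = [i >= 90 for i in range(100)]
ef_2pct = enrichment_factor(toy_scores, toy_labels, fraction=0.02)
```

For docking programs whose scores are energies (more negative is better), `higher_is_better=False` would be passed instead.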

• GARD: a Generally Applicable Replacement for RMSD.
Baber, J Christian and Thompson, David C and Cross, Jason B and Humblet, Christine
Journal of chemical information and modeling, 2009, 49(8), 1889-1900
PMID: 19618919     doi: 10.1021/ci9001074

The root-mean-squared deviation (rmsd) is a widely used measure of distance between two aligned objects - often chemical structures. However, rmsd has a number of known limitations including difficulty of interpretation, no limit on weighting for any portion of the alignment, and a lack of normalization. In this work, a Generally Applicable Replacement for rmsD (GARD) is proposed. In this implementation atomic contributions are weighted by their relative importance to binding, as determined statistically by Andrews et al. (1), and as such this method is 'chemically aware'. This novel measure is normalized and does not have many of the failings of traditional rmsd. It is, thus, perfectly suited for a wide variety of uses, including the assessment of the quality of poses produced from molecular docking programs and the comparison of conformers. Rmsd and GARD are compared in their ability to assess docking software and multiple examples of the use of GARD to rescue essentially correct poses with a high rmsd are presented.
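
For reference, the plain, unweighted rmsd that this abstract contrasts with GARD can be sketched as follows. This is only the textbook formula, not GARD itself: it assumes the two poses are already in the same coordinate frame with atoms listed in the same order, and it applies no symmetry correction or per-atom weighting. The function name and coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Unweighted root-mean-square deviation between two matched 3D point sets.

    Assumes identical atom ordering and a shared coordinate frame;
    no superposition, symmetry handling, or weighting is performed.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy example: a two-atom "pose" shifted by 1 Å along z.
native_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(native_pose, docked_pose)
# A frequently used success criterion in the docking literature is rmsd < 2.0 Å.
is_success = deviation < 2.0
```

Because every atom contributes equally and the value is unbounded, a single misplaced flexible group can dominate the result, which is the kind of limitation the abstract attributes to rmsd.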

• Docking ligands into flexible and solvated macromolecules. 3. Impact of input ligand conformation, protein flexibility, and water molecules on the accuracy of docking programs.
Corbeil, Christopher R and Moitessier, Nicolas
Journal of chemical information and modeling, 2009, 49(4), 997-1009
PMID: 19391631     doi: 10.1021/ci8004176

Several modifications and additions to Fitted1.5 led to the development of Fitted2.6. Among the novel implementations are a matching algorithm-enhanced genetic algorithm and a ring conformational search algorithm. With these various optimizations, we also hoped to remove the biases and to develop a docking program that would provide results (i.e., poses) as independent as possible to the input ligand and protein conformations and used parameters, although keeping the options to provide additional experimental information. These biases were investigated within Fitted2.6 along with FlexX, GOLD, Glide, and Surflex. The input ligand conformation was found to have a major impact on the program accuracy as drops as large as 10-50% were observed with all the programs but Fitted. This comparative study also demonstrates that the accuracy of Fitted is similar to that of other widely used programs. We have also demonstrated that protein flexibility, displaceable water molecules, and ring conformational search algorithms, three of the main Fitted features, significantly increased its accuracy. Finally, we also proposed potential modifications to the available programs to further improve their accuracy in binding mode prediction.

• Blind docking of pharmaceutically relevant compounds using RosettaLigand.
Davis, Ian W and Raha, Kaushik and Head, Martha S and Baker, David
Protein science : a publication of the Protein Society, 2009, 18(9), 1998-2002
PMID: 19554568     doi: 10.1002/pro.192

It is difficult to properly validate algorithms that dock a small molecule ligand into its protein receptor using data from the public domain: the predictions are not blind because the correct binding mode is already known, and public test cases may not be representative of compounds of interest such as drug leads. Here, we use private data from a real drug discovery program to carry out a blind evaluation of the RosettaLigand docking methodology and find that its performance is on average comparable with that of the best commercially available current small molecule docking programs. The strength of RosettaLigand is the use of the Rosetta sampling methodology to simultaneously optimize protein sidechain, protein backbone and ligand degrees of freedom; the extensive benchmark test described here identifies shortcomings in other aspects of the protocol and suggests clear routes to improving the method.

• An improved adaptive genetic algorithm for protein-ligand docking.
Kang, Ling and Li, Honglin and Jiang, Hualiang and Wang, Xicheng
Journal of computer-aided molecular design, 2009, 23(1), 1-12
PMID: 18777161     doi: 10.1007/s10822-008-9232-5

A new optimization model of molecular docking is proposed, and a fast flexible docking method based on an improved adaptive genetic algorithm is developed in this paper. The algorithm takes some advanced techniques, such as multi-population genetic strategy, entropy-based searching technique with self-adaptation and the quasi-exact penalty. A new iteration scheme in conjunction with above techniques is employed to speed up the optimization process and to ensure very rapid and steady convergence. The docking accuracy and efficiency of the method are evaluated by docking results from GOLD test data set, which contains 134 protein-ligand complexes. In over 66.2% of the complexes, the docked pose was within 2.0 Å root-mean-square deviation (RMSD) of the X-ray structure. Docking time is approximately in proportion to the number of the rotatable bonds of ligands.

• Comparative assessment of scoring functions on a diverse test set.
Cheng, Tiejun and Li, Xun and Li, Yan and Liu, Zhihai and Wang, Renxiao
Journal of chemical information and modeling, 2009, 49(4), 1079-1093
PMID: 19358517     doi: 10.1021/ci9000053

Scoring functions are widely applied to the evaluation of protein-ligand binding in structure-based drug design. We have conducted a comparative assessment of 16 popular scoring functions implemented in main-stream commercial software or released by academic research groups. A set of 195 diverse protein-ligand complexes with high-resolution crystal structures and reliable binding constants were selected through a systematic nonredundant sampling of the PDBbind database and used as the primary test set in our study. All scoring functions were evaluated in three aspects, that is, "docking power", "ranking power", and "scoring power", and all evaluations were independent from the context of molecular docking or virtual screening. As for "docking power", six scoring functions, including GOLD::ASP, DS::PLP1, DrugScore(PDB), GlideScore-SP, DS::LigScore, and GOLD::ChemScore, achieved success rates over 70% when the acceptance cutoff was root-mean-square deviation < 2.0 Å. Combining these scoring functions into consensus scoring schemes improved the success rates to 80% or even higher. As for "ranking power" and "scoring power", the top four scoring functions on the primary test set were X-Score, DrugScore(CSD), DS::PLP, and SYBYL::ChemScore. They were able to correctly rank the protein-ligand complexes containing the same type of protein with success rates around 50%. Correlation coefficients between the experimental binding constants and the binding scores computed by these scoring functions ranged from 0.545 to 0.644. Besides the primary test set, each scoring function was also tested on four additional test sets, each consisting of a certain number of protein-ligand complexes containing one particular type of protein. Our study serves as an updated benchmark for evaluating the general performance of today's scoring functions. Our results indicate that no single scoring function consistently outperforms others in all three aspects. Thus, it is important in practice to choose the appropriate scoring functions for different purposes.

## 2008

• Consensus scoring with feature selection for structure-based virtual screening
Teramoto, Reiji and Fukunishi, Hiroaki
Journal of chemical information and modeling, 2008, 48(2), 288-295
doi: 10.1021/ci700239t

The evaluation of ligand conformations is a crucial aspect of structure-based virtual screening, and scoring functions play significant roles in it. While consensus scoring (CS) generally improves enrichment by compensating for the deficiencies of each scoring function, the strategy of how individual scoring functions are selected remains a challenging task when few known active compounds are available. To address this problem, we propose feature selection-based consensus scoring (FSCS), which performs supervised feature selection with docked native ligand conformations to select complementary scoring functions. We evaluated the enrichments of five scoring functions (F-Score, D-Score, PMF, G-Score, and ChemScore), FSCS, and RCS (rank-by-rank consensus scoring) for four different target proteins: acetylcholine esterase (AChE), thrombin (thrombin), phosphodiesterase 5 (PDE5), and peroxisome proliferator-activated receptor gamma (PPAR gamma). The results indicated that FSCS was able to select the complementary scoring functions and enhance ligand enrichments and that it outperformed RCS and the individual scoring functions for all target proteins. They also indicated that the performances of the single scoring functions were strongly dependent on the target protein. An especially favorable result with implications for practical drug screening is that FSCS performs well even if only one 3D structure of the protein-ligand complex is known. Moreover, we found that one can infer which scoring functions significantly enrich active compounds by using feature selection before actual docking and that the selected scoring functions are complementary.
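
The rank-by-rank consensus scoring (RCS) baseline named in this abstract can be illustrated with a minimal sketch. It assumes the usual meaning of rank-by-rank consensus, in which each compound's final value is the average of its ranks under the individual scoring functions; it is not the FSCS method, and the function names, score-table layout, and toy numbers are hypothetical.

```python
def ranks(scores, higher_is_better=True):
    """Return 1-based ranks (1 = best). Ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better)
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def rank_by_rank_consensus(score_table, higher_is_better):
    """Average rank of each compound across several scoring functions.

    score_table maps a scoring-function name to a list of scores, one per
    compound, in the same compound order; higher_is_better maps each name
    to the direction of that score. Lower average rank means better consensus.
    """
    names = list(score_table)
    n_compounds = len(score_table[names[0]])
    per_function = [ranks(score_table[name], higher_is_better[name]) for name in names]
    return [sum(r[i] for r in per_function) / len(per_function) for i in range(n_compounds)]


# Toy example with two hypothetical scoring functions and three compounds.
toy_table = {"f1": [3.0, 1.0, 2.0], "f2": [-5.0, -1.0, -9.0]}
toy_direction = {"f1": True, "f2": False}  # f2 is energy-like: lower is better
consensus = rank_by_rank_consensus(toy_table, toy_direction)
```

Averaging ranks rather than raw scores sidesteps the different units and scales of the individual functions, at the cost of discarding how large the score gaps between compounds are.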

• Bias, reporting, and sharing: computational evaluations of docking methods.
Jain, Ajay N
Journal of computer-aided molecular design, 2008, 22(3-4), 201-212
PMID: 18075713     doi: 10.1007/s10822-007-9151-x

Computational methods for docking ligands to protein binding sites have become ubiquitous in drug discovery. Despite the age of the field, no standards have been established with respect to methodological evaluation of docking accuracy, virtual screening utility, or scoring accuracy. There are critical issues relating to data sharing, data set design and preparation, and statistical reporting that have an impact on the degree to which a report will translate into real-world performance. These issues also have an impact on whether there is a transparent relationship between methodological changes and reported performance improvements. This paper presents detailed examples of pitfalls in each area and makes recommendations as to best practices.

• Evaluating docking programs: keeping the playing field level.
Liebeschuetz, John W
Journal of computer-aided molecular design, 2008, 22(3-4), 229-238
PMID: 18196461     doi: 10.1007/s10822-008-9169-8

Over recent years many enrichment studies have been published which purport to rigorously compare the performance of two or more docking protocols. It has become clear however that such studies often have flaws within their methodologies, which cast doubt on the rigour of the conclusions. Setting up such comparisons is fraught with difficulties and no best mode of practice is available to guide the experimenter. Careful choice of structural models and ligands appropriate to those models is important. The protein structure should be representative for the target. In addition the set of active ligands selected should be appropriate to the structure in cases where different forms of the protein bind different classes of ligand. Binding site definition is also an area in which errors arise. Particular care is needed in deciding which crystallographic waters to retain and again this may be predicated by knowledge of the likely binding modes of the ligands making up the active ligand list. Geometric integrity of the ligand structures used is clearly important yet it is apparent that published sets of actives + decoys may contain sometimes high proportions of incorrect structures. Choice of protocol for docking and analysis needs careful consideration as many programs can be tweaked for optimum performance. Should studies be run using 'black box' protocols supplied by the software provider? Lastly, the correct method of analysis of enrichment studies is a much discussed topic at the moment. However currently promoted approaches do not consider a crucial aspect of a successful virtual screen, namely that a good structural diversity of hits be returned. Overall there is much to consider in the experimental design of enrichment studies. Hopefully this study will be of benefit in helping others plan such experiments.

## 2007

• Supervised consensus scoring for docking and virtual screening
Teramoto, Reiji and Fukunishi, Hiroaki
Journal of chemical information and modeling, 2007, 47(2), 526-534
doi: 10.1021/ci6004993

Docking programs are widely used to discover novel ligands efficiently and can predict protein-ligand complex structures with reasonable accuracy and speed. However, there is an emerging demand for better performance from the scoring methods. Consensus scoring (CS) methods improve the performance by compensating for the deficiencies of each scoring function. However, conventional CS and existing scoring functions have the same problems, such as a lack of protein flexibility, inadequate treatment of solvation, and the simplistic nature of the energy function used. Although there are many problems in current scoring functions, we focus our attention on the incorporation of unbound ligand conformations. To address this problem, we propose supervised consensus scoring (SCS), which takes into account protein-ligand binding process using unbound ligand conformations with supervised learning. An evaluation of docking accuracy for 100 diverse protein-ligand complexes shows that SCS outperforms both CS and 11 scoring functions (PLP, F-Score, LigScore, DrugScore, LUDI, X-Score, AutoDock, PMF, G-Score, ChemScore, and D-score). The success rates of SCS range from 89% to 91% in the range of rmsd < 2 Å, while those of CS range from 80% to 85%, and those of the scoring functions range from 26% to 76%. Moreover, we also introduce a method for judging whether a compound is active or inactive with the appropriate criterion for virtual screening. SCS performs quite well in docking accuracy and is presumably useful for screening large-scale compound databases before predicting binding affinity.

• Evaluation of docking programs for predicting binding of Golgi alpha-mannosidase II inhibitors: a comparison with crystallography.
Englebienne, Pablo and Fiaux, Hélène and Kuntz, Douglas A and Corbeil, Christopher R and Gerber-Lemaire, Sandrine and Rose, David R and Moitessier, Nicolas
Proteins, 2007, 69(1), 160-176
PMID: 17557336     doi: 10.1002/prot.21479

Golgi alpha-mannosidase II (GMII), a zinc-dependent glycosyl hydrolase, is a promising target for drug development in anti-tumor therapies. Using X-ray crystallography, we have determined the structure of Drosophila melanogaster GMII (dGMII) complexed with three different inhibitors exhibiting IC50's ranging from 80 to 1000 microM. These structures, along with those of seven other available dGMII/inhibitor complexes, were then used as a basis for the evaluation of seven docking programs (GOLD, Glide, FlexX, AutoDock, eHiTS, LigandFit, and FITTED). We found that small inhibitors could be accurately docked by most of the software, while docking of larger compounds (i.e., those with extended aromatic cycles or long aliphatic chains) was more problematic. Overall, Glide provided the best docking results, with the most accurately predicted binding around the active site zinc atom. Further evaluation of Glide's performance revealed its ability to extract active compounds from a benchmark library of decoys.

• Evaluations of molecular docking programs for virtual screening.
Onodera, Kenji and Satou, Kazuhito and Hirota, Hiroshi
Journal of chemical information and modeling, 2007, 47(4), 1609-1618
PMID: 17602548     doi: 10.1021/ci7000378

Structure-based virtual screening is carried out using molecular docking programs. A number of such docking programs are currently available, and the selection of docking program is difficult without knowing the characteristics or performance of each program. In this study, the screening performances of three molecular docking programs, DOCK, AutoDock, and GOLD, were evaluated with 116 target proteins. The screening performances were validated using two novel standards, along with a traditional enrichment rate measurement. For the evaluations, each docking run was repeated 1000 times with three initial conformations of a ligand. While each docking program has some merit over the other docking programs in some aspects, DOCK showed an unexpectedly better screening performance in the enrichment rates. Finally, we made several recommendations based on the evaluation results to enhance the screening performances of the docking programs.

• Comments on the article "On evaluating molecular-docking methods for pose prediction and enrichment factors".
Perola, Emanuele and Walters, W Patrick and Charifson, Paul
Journal of chemical information and modeling, 2007, 47(2), 251-253
PMID: 17260981     doi: 10.1021/ci600460h

The recent article "On Evaluating Molecular-Docking Methods for Pose Prediction and Enrichment Factors" (Chen H. et al. J. Chem. Inf. Model. 2006, 46, 401-415) contains a series of comments on a similar study we published in Proteins in 2004 (Perola et al. Proteins 2004, 56, 235-249). We believe that some of these comments are misleading, and we feel that an adequate response is in order.

• pso@autodock: a fast flexible molecular docking program based on Swarm intelligence.
Namasivayam, Vigneshwaran and Günther, Robert
Chemical biology & drug design, 2007, 70(6), 475-484
PMID: 17986206     doi: 10.1111/j.1747-0285.2007.00588.x

On the quest of novel therapeutics, molecular docking methods have proven to be valuable tools for screening large libraries of compounds determining the interactions of potential drugs with the target proteins. A widely used docking approach is the simulation of the docking process guided by a binding energy function. On the basis of the molecular docking program autodock, we present pso@autodock as a tool for fast flexible molecular docking. Our novel Particle Swarm Optimization (PSO) algorithms varCPSO and varCPSO-ls are suited for rapid docking of highly flexible ligands. Thus, a ligand with 23 rotatable bonds was successfully docked within as few as 100 000 computing steps (rmsd

• Comparison of topological, shape, and docking methods in virtual screening.
McGaughey, Georgia B and Sheridan, Robert P and Bayly, Christopher I and Culberson, J Chris and Kreatsoulas, Constantine and Lindsley, Stacey and Maiorov, Vladimir and Truchon, Jean-Francois and Cornell, Wendy D
Journal of chemical information and modeling, 2007, 47(4), 1504-1519
PMID: 17591764     doi: 10.1021/ci700052x

Virtual screening benchmarking studies were carried out on 11 targets to evaluate the performance of three commonly used approaches: 2D ligand similarity (Daylight, TOPOSIM), 3D ligand similarity (SQW, ROCS), and protein structure-based docking (FLOG, FRED, Glide). Active and decoy compound sets were assembled from both the MDDR and the Merck compound databases. Averaged over multiple targets, ligand-based methods outperformed docking algorithms. This was true for 3D ligand-based methods only when chemical typing was included. Using mean enrichment factor as a performance metric, Glide appears to be the best docking method among the three with FRED a close second. Results for all virtual screening methods are database dependent and can vary greatly for particular targets.

• Diverse, high-quality test set for the validation of protein-ligand docking performance.
Hartshorn, Michael J and Verdonk, Marcel L and Chessari, Gianni and Brewerton, Suzanne C and Mooij, Wijnand T M and Mortenson, Paul N and Murray, Christopher W
Journal of medicinal chemistry, 2007, 50(4), 726-741
PMID: 17300160     doi: 10.1021/jm061277y

A procedure for analyzing and classifying publicly available crystal structures has been developed. It has been used to identify high-resolution protein-ligand complexes that can be assessed by reconstructing the electron density for the ligand using the deposited structure factors. The complexes have been clustered according to the protein sequences, and clusters have been discarded if they do not represent proteins thought to be of direct interest to the pharmaceutical or agrochemical industry. Rules have been used to exclude complexes containing non-drug-like ligands. One complex from each cluster has been selected where a structure of sufficient quality was available. The final Astex diverse set contains 85 diverse, relevant protein-ligand complexes, which have been prepared in a format suitable for docking and are to be made freely available to the entire research community (http://www.ccdc.cam.ac.uk). The performance of the docking program GOLD against the new set is assessed using a variety of protocols. Relatively unbiased protocols give success rates of approximately 80% for redocking into native structures, but it is possible to get success rates of over 90% with some protocols.

## 2006

• A method for induced-fit docking, scoring, and ranking of flexible ligands. Application to peptidic and pseudopeptidic beta-secretase (BACE 1) inhibitors.
Moitessier, Nicolas and Therrien, Eric and Hanessian, Stephen
Journal of medicinal chemistry, 2006, 49(20), 5885-5894
PMID: 17004704     doi: 10.1021/jm050138y

Inhibition of beta-secretase (BACE 1) has recently been investigated as a promising therapeutic approach in the treatment of Alzheimer's disease, and a growing number of BACE 1 inhibitors and crystal structures of BACE 1/inhibitors complexes have been reported. We report herein a predictive computational method and its application to potential BACE 1 inhibitors. Using a training set of 50 known highly flexible inhibitors, we developed a docking method that accounts for the flexibility of both the protein and the inhibitors. Protein flexibility is accounted for using a specifically designed genetic algorithm. We next developed a scoring function consisting of force field evaluation of the inhibitor/protein interactions and two additional terms for hydrogen bonding and entropy change upon binding. Discarding three outliers from the training set, our protocol was found to perform well with an rmsd of 1.19 kcal/mol. Evaluation of the predictive power was next carried out by virtual screening of 80 synthetic compounds. The significant enrichment at the top of the ranking list in active compounds demonstrated the ability of the docking and scoring protocol to rank the compounds relative to their activities.

• A critical assessment of docking programs and scoring functions.
Warren, Gregory L and Andrews, C Webster and Capelli, Anna-Maria and Clarke, Brian and LaLonde, Judith and Lambert, Millard H and Lindvall, Mika and Nevins, Neysa and Semus, Simon F and Senger, Stefan and Tedesco, Giovanna and Wall, Ian D and Woolven, James M and Peishoff, Catherine E and Head, Martha S
Journal of medicinal chemistry, 2006, 49(20), 5912-5931
PMID: 17004707     doi: 10.1021/jm050362n

Docking is a computational technique that samples conformations of small molecules in protein binding sites; scoring functions are used to assess which of these conformations best complements the protein binding site. An evaluation of 10 docking programs and 37 scoring functions was conducted against eight proteins of seven protein types for three tasks: binding mode prediction, virtual screening for lead identification, and rank-ordering by affinity for lead optimization. All of the docking programs were able to generate ligand conformations similar to crystallographically determined protein/ligand complex structures for at least one of the targets. However, scoring functions were less successful at distinguishing the crystallographic conformation from the set of docked poses. Docking programs identified active compounds from a pharmaceutically relevant pool of decoy compounds; however, no single program performed well for all of the targets. For prediction of compound affinity, none of the docking programs or scoring functions made a useful prediction of ligand binding affinity.

• On evaluating molecular-docking methods for pose prediction and enrichment factors.
Chen, Hongming and Lyne, Paul D and Giordanetto, Fabrizio and Lovell, Timothy and Li, Jin
Journal of chemical information and modeling, 2006, 46(1), 401-415
PMID: 16426074     doi: 10.1021/ci0503255

Four of the most well-known, commercially available docking programs, FlexX, GOLD, GLIDE, and ICM, have been examined for their ligand-docking and virtual-screening capabilities. The relative performance of the programs in reproducing the native ligand conformation from starting SMILES strings for 164 high-resolution protein-ligand complexes is presented and compared. Applying only the native scoring functions, the latest versions of these four docking programs were also used to conduct virtual screening for 12 protein targets of therapeutic interest, involving both publicly available structures and AstraZeneca in-house structures. The capability of the four programs to correctly rank-order target-specific active compounds over alternative binders and nonbinders (decoys plus randomly selected compounds) and thereby enrich a small subset of a screening library is compared. Enrichments from the virtual-screening experiments are contrasted with those obtained with alternative 3D shape-matching and 2D similarity database-search methods.

## 2005

• Comparison of automated docking programs as virtual screening tools.
Cummings, Maxwell D and Desjarlais, Renee L and Gibbs, Alan C and Mohan, Venkatraman and Jaeger, Edward P
Journal of medicinal chemistry, 2005, 48(4), 962-976
PMID: 15715466     doi: 10.1021/jm049798d

The performance of several commercially available docking programs is compared in the context of virtual screening. Five different protein targets are used, each with several known ligands. The simulated screening deck comprised 1000 molecules from a cleansed version of the MDL drug data report and 49 known ligands. For many of the known ligands, crystal structures of the relevant protein-ligand complexes were available. We attempted to run experiments with each docking method that were as similar as possible. For a given docking method, hit rates were improved versus what would be expected for random selection for most protein targets. However, the ability to prioritize known ligands on the basis of docking poses that resemble known crystal structures is both method- and target-dependent.

• Evaluation of library ranking efficacy in virtual screening
Kontoyianni, M and Sokol, GS and McClellan, LM
Journal of computational chemistry, 2005, 26(1), 11-22
PMID: 15526325     doi: 10.1002/jcc.20141

We present the results of a comprehensive study in which we explored how the docking procedure affects the performance of a virtual screening approach. We used four docking engines and applied 10 scoring functions to the top-ranked docking solutions of seeded databases against six target proteins. The scores of the experimental poses were placed within the total set to assess whether the scoring function required an accurate pose to provide the appropriate rank for the seeded compounds. This method allows a direct comparison of library ranking efficacy. Our results indicate that the LigandFit/Ligscore1 and LigandFit/GOLD docking/scoring combinations, and to a lesser degree FlexX/FlexX, Glide/Ligscore1, DOCK/PMF (Tripos implementation), LigandFit1/Ligscore2 and LigandFit/PMF (Tripos implementation) were able to retrieve the highest number of actives at a 10% fraction of the database when all targets were looked upon collectively. We also show that the scoring functions rank the observed binding modes higher than the inaccurate poses provided that the experimental poses are available. This finding stresses the discriminatory ability of the scoring algorithms, when better poses are available, and suggests that the number of false positives can be lowered with conformers closer to bioactive ones.

• Comparing protein-ligand docking programs is difficult.
Cole, Jason C and Murray, Christopher W and Nissink, J Willem M and Taylor, Richard D and Taylor, Robin
Proteins, 2005, 60(3), 325-332
PMID: 15937897     doi: 10.1002/prot.20497

There is currently great interest in comparing protein-ligand docking programs. A review of recent comparisons shows that it is difficult to draw conclusions of general applicability. Statistical hypothesis testing is required to ensure that differences in pose-prediction success rates and enrichment rates are significant. Numerical measures such as root-mean-square deviation need careful interpretation and may profitably be supplemented by interaction-based measures and visual inspection of dockings. Test sets must be of appropriate diversity and of good experimental reliability. The effects of crystal-packing interactions may be important. The method used for generating starting ligand geometries and positions may have an appreciable effect on docking results. For fair comparison, programs must be given search problems of equal complexity (e.g. binding-site regions of the same size) and approximately equal time in which to solve them. Comparisons based on rescoring require local optimization of the ligand in the space of the new objective function. Re-implementations of published scoring functions may give significantly different results from the originals. Ostensibly minor details in methodology may have a profound influence on headline success rates.
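
As a hedged illustration of the hypothesis testing this abstract calls for, the sketch below applies a two-sided two-proportion z-test to a pair of pose-prediction success rates. The function name and the counts are hypothetical and do not come from the paper. The test also assumes the two rates were measured on independent sets of complexes; when both programs are run on the same complexes, a paired test such as McNemar's would be the more appropriate choice.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two independent success rates.

    Returns (z, p_value). Illustrative helper, not taken from the cited paper.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    variance = pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b)
    if variance == 0.0:
        raise ValueError("pooled rate is 0 or 1; the z-test is undefined")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: program A docks 70 of 100 complexes correctly,
# program B docks 60 of 100.
z, p = two_proportion_z_test(70, 100, 60, 100)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up counts the ten-point gap in success rate does not reach significance at the 0.05 level, which is the kind of result the abstract warns about when headline rates are compared on test sets of modest size.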

## 2004

• Scoring functions for protein-ligand interactions: a critical perspective
Schulz-Gasch, T
Drug Discovery Today: Technologies, 2004, 1(3), 231-239

Scoring functions play an essential role in structure-based virtual screening. They are required to guide the docking of candidate compounds to structures of receptor binding sites, to select probable binding modes, and to discriminate binders from non-binders. Although many scoring functions have successfully been used to identify novel ligands for a wide variety of targets, much work remains to be done to avoid incorrect prediction of binding modes and high numbers of false positives. This review gives an overview of the current state of the field and outlines key issues for the further development of scoring functions.

• A practical approach to docking of zinc metalloproteinase inhibitors.
Hu, Xin and Balaz, Stefan and Shelver, William H
Journal of molecular graphics & modelling, 2004, 22(4), 293-307
PMID: 15177081     doi: 10.1016/j.jmgm.2003.11.002

Forty zinc-dependent metalloproteinase/ligand complexes with known crystal structures were re-docked using five docking/scoring approaches (DOCK, FlexX, DrugScore, GOLD, and AutoDock). Correct geometry of the coordination bonds between the ligand's zinc binding group (ZBG) and the catalytic zinc is important for docking accuracy and scoring reliability. More than 75% of docked poses with RMSD less than 2 Å were found to have appropriate ZBG binding, but for poor ZBG binding, about 95% of poses failed to dock correctly. Elimination of poses with inappropriate zinc binding resulted in better binding energy predictions that were further improved by dividing the ligands into subsets according to the ZBG (carboxylates, hydroxamates, and phosphorus containing groups). After a subset re-scoring using the regression functions obtained for individual subsets, DrugScore was able to explain 77% and the consensus scoring scheme X-CSCORE even 88% of variance in binding energies. The approach combining ZBG-based pose selection and subset re-scoring improved the hit rate in virtual screening for metalloproteinase inhibitors for all tested methods by 4-16%.

• Evaluation and application of multiple scoring functions for a virtual screening experiment.
Xing, Li and Hodgkin, Edward and Liu, Qian and Sedlock, David
Journal of computer-aided molecular design, 2004, 18(5), 333-344
PMID: 15595460

In order to identify novel chemical classes of factor Xa inhibitors, five scoring functions (FlexX, DOCK, GOLD, ChemScore and PMF) were engaged to evaluate the multiple docking poses generated by FlexX. The compound collection was composed of confirmed potent factor Xa inhibitors and a subset of the LeadQuest screening compound library. Except for PMF the other four scoring functions succeeded in reproducing the crystal complex (PDB code: 1FAX). During virtual screening the highest hit rate (80%) was demonstrated by FlexX at an energy cutoff of -40 kJ/mol, which is about 40-fold over random screening (2.06%). Limited results suggest that presenting more poses of a single molecule to the scoring functions could deteriorate their enrichment factors. A series of promising scaffolds with favorable binding scores was retrieved from LeadQuest. Consensus scoring by pair-wise intersection failed to enrich the hit rate yielded by single scorings (i.e. FlexX). We note that reported successes of consensus scoring in hit rate enrichment could be artificial because their comparisons were based on a selected subset of single scoring and a markedly reduced subset of double or triple scoring. The findings presented in this report are based upon a single biological system and support further studies.
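
The roughly 40-fold figure quoted in this abstract follows from dividing the hit rate in the selected subset by the hit rate expected from random screening. A minimal sketch of that arithmetic is given below; the function names are hypothetical, and the counts in the second example are made up purely for illustration.

```python
def enrichment_from_rates(hit_rate_selected, hit_rate_random):
    """Enrichment factor as the ratio of two hit rates given in the same units."""
    if hit_rate_random <= 0:
        raise ValueError("random hit rate must be positive")
    return hit_rate_selected / hit_rate_random


def enrichment_from_counts(hits_selected, n_selected, hits_total, n_total):
    """Enrichment factor computed from raw counts of actives and compounds."""
    if n_selected <= 0 or n_total <= 0 or hits_total <= 0:
        raise ValueError("counts must be positive")
    return (hits_selected / n_selected) / (hits_total / n_total)


# Rates reported in the abstract: 80% for FlexX versus 2.06% for random screening.
print(f"{enrichment_from_rates(80.0, 2.06):.1f}")  # prints 38.8

# Made-up counts: 8 actives among 10 selected compounds, 20 actives in 1000 overall.
print(f"{enrichment_from_counts(8, 10, 20, 1000):.1f}")  # prints 40.0
```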

• A detailed comparison of current docking and scoring methods on systems of pharmaceutical relevance.
Perola, Emanuele and Walters, W Patrick and Charifson, Paul S
Proteins, 2004, 56(2), 235-249
PMID: 15211508     doi: 10.1002/prot.20088

A thorough evaluation of some of the most advanced docking and scoring methods currently available is described, and guidelines for the choice of an appropriate protocol for docking and virtual screening are defined. The generation of a large and highly curated test set of pharmaceutically relevant protein-ligand complexes with known binding affinities is described, and three highly regarded docking programs (Glide, GOLD, and ICM) are evaluated on the same set with respect to their ability to reproduce crystallographic binding orientations. Glide correctly identified the crystallographic pose within 2.0 Å in 61% of the cases, versus 48% for GOLD and 45% for ICM. In general Glide appears to perform most consistently with respect to diversity of binding sites and ligand flexibility, while the performance of ICM and GOLD is more binding site-dependent and it is significantly poorer when binding is predominantly driven by hydrophobic interactions. The results also show that energy minimization and reranking of the top N poses can be an effective means to overcome some of the limitations of a given docking function. The same docking programs are evaluated in conjunction with three different scoring functions for their ability to discriminate actives from inactives in virtual screening. The evaluation, performed on three different systems (HIV-1 protease, IMPDH, and p38 MAP kinase), confirms that the relative performance of different docking and scoring methods is to some extent binding site-dependent. GlideScore appears to be an effective scoring function for database screening, with consistent performance across several types of binding sites, while ChemScore appears to be most useful in sterically demanding sites since it is more forgiving of repulsive interactions. Energy minimization of docked poses can significantly improve the enrichments in systems with sterically demanding binding sites. Overall Glide appears to be a safe general choice for docking, while the choice of the best scoring tool remains to a larger extent system-dependent and should be evaluated on a case-by-case basis.

• Comparative evaluation of eight docking tools for docking and virtual screening accuracy.
Kellenberger, Esther and Rodrigo, Jordi and Muller, Pascal and Rognan, Didier
Proteins, 2004, 57(2), 225-242
PMID: 15340911     doi: 10.1002/prot.20149

Eight docking programs (DOCK, FLEXX, FRED, GLIDE, GOLD, SLIDE, SURFLEX, and QXP) that can be used for either single-ligand docking or database screening have been compared for their propensity to recover the X-ray pose of 100 small-molecular-weight ligands, and for their capacity to discriminate known inhibitors of an enzyme (thymidine kinase) from randomly chosen "drug-like" molecules. Interestingly, both properties are found to be correlated, since the tools showing the best docking accuracy (GLIDE, GOLD, and SURFLEX) are also the most successful in ranking known inhibitors in a virtual screening experiment. Moreover, the current study pinpoints some physicochemical descriptors of either the ligand or its cognate protein-binding site that generally lead to docking/scoring inaccuracies.

• Evaluation of docking performance: comparative data on docking algorithms.
Kontoyianni, Maria and McClellan, Laura M and Sokol, Glenn S
Journal of medicinal chemistry, 2004, 47(3), 558-565
PMID: 14736237     doi: 10.1021/jm0302997

Docking molecules into their respective 3D macromolecular targets is a widely used method for lead optimization. However, the best known docking algorithms often fail to position the ligand in an orientation close to the experimental binding mode. It was reported recently that consensus scoring enhances the hit rates in a virtual screening experiment. This methodology focused on the top-ranked pose, with the underlying assumption that the orientation/conformation of the docked compound is the most accurate. In an effort to eliminate the scoring function bias, and assess the ability of the docking algorithms to provide solutions similar to the crystallographic modes, we investigated the most known docking programs and evaluated all of the resultant poses. We present the results of an extensive computational study in which five docking programs (FlexX, DOCK, GOLD, LigandFit, Glide) were investigated against 14 protein families (69 targets). Our findings show that some algorithms perform consistently better than others, and a correspondence between the nature of the active site and the best docking algorithm can be found.

• Assessment of docking poses: interactions-based accuracy classification (IBAC) versus crystal structure deviations.
Kroemer, Romano T and Vulpetti, Anna and McDonald, Joseph J and Rohrer, Douglas C and Trosset, Jean-Yves and Giordanetto, Fabrizio and Cotesta, Simona and McMartin, Colin and Kihlén, Mats and Stouten, Pieter F W
Journal of Chemical Information and Computer Sciences, 2004, 44(3), 871-881
PMID: 15154752     doi: 10.1021/ci049970m

Six docking programs (FlexX, GOLD, ICM, LigandFit, the Northwestern University version of DOCK, and QXP) were evaluated in terms of their ability to reproduce experimentally observed binding modes (poses) of small-molecule ligands to macromolecular targets. The accuracy of a pose was assessed in two ways: First, the RMS deviation of the predicted pose from the crystal structure was calculated. Second, the predicted pose was compared to the experimentally observed one regarding the presence of key interactions with the protein. The latter assessment is referred to as interactions-based accuracy classification (IBAC). In a number of cases significant discrepancies were found between IBAC and RMSD-based classifications. Despite being more subjective, the IBAC proved to be a more meaningful measure of docking accuracy in all these cases.
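
The first of the two accuracy measures described in this abstract, the RMS deviation of a predicted pose from the crystal structure, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' procedure: atoms are assumed to be listed in the same order in both poses, both poses are assumed to share the receptor's coordinate frame (so no superposition is applied), ligand symmetry is ignored, and the 2.0 Å success cutoff is an assumed default that the abstract does not specify. The function names are hypothetical.

```python
import math


def pose_rmsd(predicted, reference):
    """Root-mean-square deviation between two poses given as lists of (x, y, z).

    Assumes identical atom ordering and a shared coordinate frame.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and contain the same atoms")
    squared_sum = 0.0
    for (px, py, pz), (rx, ry, rz) in zip(predicted, reference):
        squared_sum += (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
    return math.sqrt(squared_sum / len(predicted))


def is_rmsd_success(predicted, reference, cutoff=2.0):
    """Classify a pose as correct when its RMSD is within the cutoff (in Å)."""
    return pose_rmsd(predicted, reference) <= cutoff


# Toy two-atom ligand: the predicted pose is the crystal pose shifted 1 Å along z.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
docked = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]
print(pose_rmsd(docked, crystal))        # prints 1.0
print(is_rmsd_success(docked, crystal))  # prints True
```

A single number of this kind says nothing about whether key protein-ligand contacts are reproduced, which is the gap the interactions-based classification in the abstract is meant to address.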

• Glide: A new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening
Halgren, TA and Murphy, RB and Friesner, RA and Beard, HS and Frye, LL and Pollard, WT and Banks, JL
Journal of medicinal chemistry, 2004, 47(7), 1750-1759
PMID: 15027866     doi: 10.1021/jm030644s

Glide's ability to identify active compounds in a database screen is characterized by applying Glide to a diverse set of nine protein receptors. In many cases, two, or even three, protein sites are employed to probe the sensitivity of the results to the site geometry. To make the database screens as realistic as possible, the screens use sets of "druglike" decoy ligands that have been selected to be representative of what we believe is likely to be found in the compound collection of a pharmaceutical or biotechnology company. Results are presented for releases 1.8, 2.0, and 2.5 of Glide. The comparisons show that average measures for both "early" and "global" enrichment for Glide 2.5 are 3 times higher than for Glide 1.8 and more than 2 times higher than for Glide 2.0 because of better results for the least well-handled screens. This improvement in enrichment stems largely from the better balance of the more widely parametrized GlideScore 2.5 function and the inclusion of terms that penalize ligand-protein interactions that violate established principles of physical chemistry, particularly as it concerns the exposure to solvent of charged protein and ligand groups. Comparisons to results for the thymidine kinase and estrogen receptors published by Rognan and co-workers (J. Med. Chem. 2000, 43, 4759-4767) show that Glide 2.5 performs better than GOLD 1.1, FlexX 1.8, or DOCK 4.01.

## 2003

• Comparative evaluation of 11 scoring functions for molecular docking.
Wang, Renxiao and Lu, Yipin and Wang, Shaomeng
Journal of medicinal chemistry, 2003, 46(12), 2287-2303
PMID: 12773034     doi: 10.1021/jm0203783

Eleven popular scoring functions have been tested on 100 protein-ligand complexes to evaluate their abilities to reproduce experimentally determined structures and binding affinities. They include four scoring functions implemented in the LigFit module in Cerius2 (LigScore, PLP, PMF, and LUDI), four scoring functions implemented in the CScore module in SYBYL (F-Score, G-Score, D-Score, and ChemScore), the scoring function implemented in the AutoDock program, and two stand-alone scoring functions (DrugScore and X-Score). These scoring functions are not tested in the context of a particular docking program. Instead, conformational sampling and scoring are separated into two consecutive steps. First, an exhaustive conformational sampling is performed by using the AutoDock program to generate an ensemble of docked conformations for each ligand molecule. This conformational ensemble is required to cover the entire conformational space as much as possible rather than to focus on a few energy minima. Then, each scoring function is applied to score this conformational ensemble to see if it can identify the experimentally observed conformation from all of the other decoys. Among all of the scoring functions under test, six of them, i.e., PLP, F-Score, LigScore, DrugScore, LUDI, and X-Score, yield success rates higher than the AutoDock scoring function. The success rates of these six scoring functions range from 66% to 76% if using root-mean-square deviation < or

• Virtual screening to enrich hit lists from high-throughput screening: a case study on small-molecule inhibitors of angiogenin.
Jenkins, Jeremy L and Kao, Richard Y T and Shapiro, Robert
Proteins, 2003, 50(1), 81-93
PMID: 12471601     doi: 10.1002/prot.10270

"Hit lists" generated by high-throughput screening (HTS) typically contain a large percentage of false positives, making follow-up assays necessary to distinguish active from inactive substances. Here we present a method for improving the accuracy of HTS hit lists by computationally based virtual screening (VS) of the corresponding chemical libraries and selecting hits by HTS/VS consensus. This approach was applied in a case study on the target-enzyme angiogenin, a potent inducer of angiogenesis. In conjunction with HTS of the National Cancer Institute Diversity Set and ChemBridge DIVERSet E (approximately 18,000 compounds total), VS was performed with two flexible library docking/scoring methods, DockVision/Ludi and GOLD. Analysis of the results reveals that dramatic enrichment of the HTS hit rate can be achieved by selecting compounds in consensus with one or both of the VS functions. For example, HTS hits ranked in the top 2% by GOLD included 42% of the true hits, but only 8% of the false positives; this represents a sixfold enrichment over the HTS hit rate. Notably, the HTS/VS method was effective in selecting out inhibitors with midmicromolar dissociation constants typical of leads commonly obtained in primary screens.

• Comparative study of several algorithms for flexible ligand docking.
Bursulaya, Badry D and Totrov, Maxim and Abagyan, Ruben and Brooks, Charles L
Journal of computer-aided molecular design, 2003, 17(11), 755-763
PMID: 15072435

We have performed a comparative assessment of several programs for flexible molecular docking: DOCK 4.0, FlexX 1.8, AutoDock 3.0, GOLD 1.2 and ICM 2.8. This was accomplished using two different studies: docking experiments on a data set of 37 protein-ligand complexes and screening a library containing 10,037 entries against 11 different proteins. The docking accuracy of the methods was judged based on the corresponding rank-one solutions. We have found that the fraction of molecules docked with acceptable accuracy is 0.47, 0.31, 0.35, 0.52 and 0.93 for, respectively, AutoDock, DOCK, FlexX, GOLD and ICM. Thus ICM provided the highest accuracy in ligand docking against these receptors. The results from the other programs are found to be less accurate and of approximately the same quality. A speed comparison demonstrated that FlexX was the fastest and AutoDock was the slowest among the tested docking programs. The database screening was performed using DOCK, FlexX and ICM. ICM was able to identify the original ligands within the top 1% of the total library in 17 cases. The corresponding number for DOCK and FlexX was 7 and 8, respectively. We have estimated that in virtual database screening, 50% of the potentially active compounds will be found among approximately 1.5% of the top scoring solutions found with ICM and among approximately 9% of the top scoring solutions produced by DOCK and FlexX.

• LigandFit: a novel method for the shape-directed rapid docking of ligands to protein active sites
Venkatachalam, CM and Jiang, X and Oldfield, T and Waldman, M
Journal of molecular graphics & modelling, 2003, 21(4), 289-307
PMID: 12479928

We present a new shape-based method, LigandFit, for accurately docking ligands into protein active sites. The method employs a cavity detection algorithm for detecting invaginations in the protein as candidate active site regions. A shape comparison filter is combined with a Monte Carlo conformational search for generating ligand poses consistent with the active site shape. Candidate poses are minimized in the context of the active site using a grid-based method for evaluating protein-ligand interaction energies. Errors arising from grid interpolation are dramatically reduced using a new non-linear interpolation scheme. Results are presented for 19 diverse protein-ligand complexes. The method appears quite promising, reproducing the X-ray structure ligand pose within an RMS of 2 Å in 14 out of the 19 complexes. A high-throughput screening study applied to the thymidine kinase receptor is also presented in which LigandFit, when combined with LigScore, an internally developed scoring function [1], yields very good hit rates for a ligand pool seeded with known actives.

• Improved protein-ligand docking using GOLD.
Verdonk, Marcel L and Cole, Jason C and Hartshorn, Michael J and Murray, Christopher W and Taylor, Richard D
Proteins, 2003, 52(4), 609-623
PMID: 12910460     doi: 10.1002/prot.10465

The Chemscore function was implemented as a scoring function for the protein-ligand docking program GOLD, and its performance compared to the original Goldscore function and two consensus docking protocols, "Goldscore-CS" and "Chemscore-GS," in terms of docking accuracy, prediction of binding affinities, and speed. In the "Goldscore-CS" protocol, dockings produced with the Goldscore function are scored and ranked with the Chemscore function; in the "Chemscore-GS" protocol, dockings produced with the Chemscore function are scored and ranked with the Goldscore function. Comparisons were made for a "clean" set of 224 protein-ligand complexes, and for two subsets of this set, one for which the ligands are "drug-like," the other for which they are "fragment-like." For "drug-like" and "fragment-like" ligands, the docking accuracies obtained with Chemscore and Goldscore functions are similar. For larger ligands, Goldscore gives superior results. Docking with the Chemscore function is up to three times faster than docking with the Goldscore function. Both combined docking protocols give significant improvements in docking accuracy over the use of the Goldscore or Chemscore function alone. "Goldscore-CS" gives success rates of up to 81% (top-ranked GOLD solution within 2.0 Å of the experimental binding mode) for the "clean list," but at the cost of long search times. For most virtual screening applications, "Chemscore-GS" seems optimal; search settings that give docking speeds of around 0.25-1.3 min/compound have success rates of about 78% for "drug-like" compounds and 85% for "fragment-like" compounds. In terms of producing binding energy estimates, the Goldscore function appears to perform better than the Chemscore function and the two consensus protocols, particularly for faster search settings. Even at docking speeds of around 1-2 min/compound, the Goldscore function predicts binding energies with a standard deviation of approximately 10.5 kJ/mol.
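The two consensus protocols described above share one pattern: poses are generated under one scoring function and then re-ranked with a second one. The sketch below shows only that re-ranking step. It does not call GOLD; the pose dictionaries, the toy scoring function, and the `higher_is_better` default are assumptions chosen for illustration.

```python
def rescore_and_rank(poses, rescoring_fn, higher_is_better=True):
    """Re-rank poses produced by one scoring function using a second one.

    `poses` is any list of pose objects; `rescoring_fn` maps a pose to a
    number. Both are hypothetical stand-ins for real docking output.
    """
    scored = [(rescoring_fn(pose), pose) for pose in poses]
    # Sort on the new score only, so pose objects never need to be comparable.
    scored.sort(key=lambda pair: pair[0], reverse=higher_is_better)
    return [pose for _, pose in scored]


# Toy poses: "search_score" stands for the function used during docking,
# "toy_rescore" for the value a second function would assign.
toy_poses = [
    {"id": "a", "search_score": 60.0, "toy_rescore": 1.0},
    {"id": "b", "search_score": 55.0, "toy_rescore": 3.0},
    {"id": "c", "search_score": 50.0, "toy_rescore": 2.0},
]
ranked = rescore_and_rank(toy_poses, lambda pose: pose["toy_rescore"])
print([pose["id"] for pose in ranked])
```

Keeping the search function and the ranking function as separate arguments makes it easy to swap their roles, which is the only difference between the "Goldscore-CS" and "Chemscore-GS" protocols as described in the abstract.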

## 2002

• A new test set for validating predictions of protein-ligand interaction.
Nissink, J Willem M and Murray, Chris and Hartshorn, Mike and Verdonk, Marcel L and Cole, Jason C and Taylor, Robin
Proteins, 2002, 49(4), 457-471
PMID: 12402356     doi: 10.1002/prot.10232

We present a large test set of protein-ligand complexes for the purpose of validating algorithms that rely on the prediction of protein-ligand interactions. The set consists of 305 complexes with protonation states assigned by manual inspection. The following checks have been carried out to identify unsuitable entries in this set: (1) assessing the involvement of crystallographically related protein units in ligand binding; (2) identification of bad clashes between protein side chains and ligand; and (3) assessment of structural errors, and/or inconsistency of ligand placement with crystal structure electron density. In addition, the set has been pruned to assure diversity in terms of protein-ligand structures, and subsets are supplied for different protein-structure resolution ranges. A classification of the set by protein type is available. As an illustration, validation results are shown for GOLD and SuperStar. GOLD is a program that performs flexible protein-ligand docking, and SuperStar is used for the prediction of favorable interaction sites in proteins. The new CCDC/Astex test set is freely available to the scientific community (http://www.ccdc.cam.ac.uk).

## 2001

• Evaluation of docking functions for protein-ligand docking.
Perez, C and Ortiz, AR
Journal of medicinal chemistry, 2001, 44(23), 3768-3785
doi: 10.1021/jm010141r

Docking functions are believed to be the essential component of docking algorithms. Both physically and statistically based functions have been proposed, but there is no consensus about their relative performances. Here, we propose an evaluation approach based on exhaustive enumeration of all possible docking solutions obtained with a discretized description of a rigid docking process. We apply the approach to study both molecular mechanics and statistical potentials. It is found that the statistical potential evaluated is less effective than the AMBER molecular mechanics function to provide an accurate description of the docking process when the exact experimental coordinates are used. However, when coordinates of crystal structures obtained with analogous ligands are used, similar performances are obtained in both cases. Possible reasons for the successes and failures of both docking schemes have been uncovered using linear discriminant analysis, on the basis of a set of physicochemical descriptors capturing the main physical effects at play during protein-ligand docking. In both types of potentials steric effects appear critical to obtain a successful docking. Our results also indicate that neglecting desolvation effects and the explicit treatment of hydrogen bonds are the main source of the failures observed with the molecular mechanics potential. On the other hand, detailed consideration of steric interactions, with a careful treatment of dispersive forces, seems to be needed when using statistical potentials derived from a structural database. The possibility of filtering combinatorial libraries in order to maximize the probability of correct docking is discussed.

## 2000

• Protein-based virtual screening of chemical databases. 1. Evaluation of different docking/scoring combinations.
Bissantz, C and Folkers, G and Rognan, D
Journal of medicinal chemistry, 2000, 43(25), 4759-4767
PMID: 11123984

Three different database docking programs (Dock, FlexX, Gold) have been used in combination with seven scoring functions (Chemscore, Dock, FlexX, Fresno, Gold, Pmf, Score) to assess the accuracy of virtual screening methods against two protein targets (thymidine kinase, estrogen receptor) of known three-dimensional structure. For both targets, it was generally possible to discriminate about 7 out of 10 true hits from a random database of 990 ligands. The use of consensus lists common to two or three scoring functions clearly enhances hit rates among the top 5% scorers from 10% (single scoring) to 25-40% (double scoring) and up to 65-70% (triple scoring). However, in all tested cases, no clear relationships could be found between docking and ranking accuracies. Moreover, predicting the absolute binding free energy of true hits was not possible whatever docking accuracy was achieved and scoring function used. As the best docking/consensus scoring combination varies with the selected target and the physicochemistry of target-ligand interactions, we propose a two-step protocol for screening large databases: (i) screening of a reduced dataset containing a few known ligands for deriving the optimal docking/consensus scoring scheme, (ii) applying the latter parameters to the screening of the entire database.
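The consensus lists mentioned above keep only those compounds that appear among the top scorers of every scoring function considered. The following is a minimal sketch of that idea under stated assumptions: scores are held in dictionaries keyed by compound identifier, a lower score is better, and all names and toy data are hypothetical rather than taken from the paper.

```python
import math


def top_fraction(scores, fraction=0.05):
    """Set of compound ids in the best `fraction` of one score table (lower = better)."""
    n_keep = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get)
    return set(ranked[:n_keep])


def consensus_hit_rate(score_tables, actives, fraction=0.05):
    """Intersect the top lists of several scoring functions and compute the hit rate.

    Returns (consensus_set, hit_rate), where hit_rate is the share of
    consensus compounds that are known actives.
    """
    if not score_tables:
        raise ValueError("at least one score table is required")
    consensus = set.intersection(*(top_fraction(t, fraction) for t in score_tables))
    if not consensus:
        return consensus, 0.0
    return consensus, len(consensus & set(actives)) / len(consensus)


# Toy library of 20 compounds scored by two hypothetical functions.
ids = [f"c{i}" for i in range(20)]
table_one = {cid: float(i) for i, cid in enumerate(ids)}
table_two = dict(table_one)
table_two.update({"c2": 100.0, "c3": 101.0, "c5": -2.0, "c6": -1.0})
consensus, rate = consensus_hit_rate([table_one, table_two], {"c0", "c4", "c9"}, fraction=0.25)
print(sorted(consensus), rate)
```

Adding a third score table to the list gives the triple-scoring case; the consensus set can only shrink as more functions are required to agree, which is why hit rates within it tend to rise.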