# Bibliography of Computer-Aided Drug Design

Updated on 7/18/2014. Currently 2130 references

## Visualization / All

## 2014

• HTS navigator: freely accessible cheminformatics software for analyzing high-throughput screening data.
Fourches, Denis and Sassano, Maria F and Roth, Bryan L and Tropsha, Alexander
Bioinformatics (Oxford, England), 2014, 30(4), 588-589
PMID: 24376084     doi: 10.1093/bioinformatics/btt718

SUMMARY: We report on the development of the high-throughput screening (HTS) Navigator software to analyze and visualize the results of HTS of chemical libraries. The HTS Navigator processes output files from different plate readers' formats, computes the overall HTS matrix, automatically detects hits and has different types of baseline navigation and correction features. The software incorporates advanced cheminformatics capabilities such as chemical structure storage and visualization, fast similarity search and chemical neighborhood analysis for retrieved hits. The software is freely available for academic laboratories.

• DiSCuS: an open platform for (not only) virtual screening results management.
Wójcikowski, Maciej and Zielenkiewicz, Piotr and Siedlecki, Pawel
Journal of chemical information and modeling, 2014, 54(1), 347-354
PMID: 24364790     doi: 10.1021/ci400587f

DiSCuS, a "Database System for Compound Selection", has been developed. The primary goal of DiSCuS is to aid researchers in the steps subsequent to generating high-throughput virtual screening (HTVS) results, such as selection of compounds for further study, purchase, or synthesis. To do so, DiSCuS provides (1) a storage facility for ligand-receptor complexes (generated with external programs), (2) a number of tools for validating these complexes, such as scoring functions, potential energy contributions, and med-chem features with ligand similarity estimates, and (3) powerful searching and filtering options with logical operators. DiSCuS supports multiple receptor targets for a single ligand, so it can be used either to evaluate different variants of an active site or for selectivity studies. DiSCuS documentation, installation instructions, and source code can be found at http://discus.ibb.waw.pl .

• iview: an interactive WebGL visualizer for protein-ligand complex.
Li, Hongjian and Leung, Kwong-Sak and Nakane, Takanori and Wong, Man-Hon
BMC Bioinformatics, 2014, 15, 56
PMID: 24564583     doi: 10.1186/1471-2105-15-56

BACKGROUND: Visualization of protein-ligand complex plays an important role in elaborating protein-ligand interactions and aiding novel drug design. Most existing web visualizers either rely on slow software rendering, or lack virtual reality support. The vital feature of macromolecular surface construction is also unavailable.

## 2013

• A Searchable Map of PubChem
Deursen, Ruud van and Blum, Lorenz C and Reymond, Jean-Louis
Journal of chemical information and modeling, 2013, 50(11), 1924-1934

The database PubChem was classified using 42 integer value descriptors of molecular structure, here called molecular quantum numbers (MQNs), which count atoms and bond types, polar groups, and topological features. Principal component analysis of the MQN data set shows that PubChem compounds occupy a partially filled elliptical cone in the (PC1,PC2,PC3) space whose axis is the first principal component PC1 (65% variability) representing molecular size, and the ellipse axes are PC2 (18% variability, representing structural flexibility) and PC3 (7% variability, representing polarity). A visual overview of PubChem is provided by color-coded representations of the (PC2,PC3) plane. The MQNs form a scalar fingerprint which can be used to measure the similarity between pairs of molecules and enable ligand-based virtual screening, as illustrated for the enrichment of bioactives from the DUD data set from PubChem. An MQN-annotated version of PubChem with an MQN-similarity search tool is available at www.gdb.unibe.ch .
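
The idea of comparing molecules through a fingerprint of integer counts can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of city-block (Manhattan) distance as the dissimilarity measure is an assumption here, and the four-element vectors and compound names are hypothetical stand-ins for the 42 MQN descriptors.

```python
from typing import Dict, List, Sequence


def city_block_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of absolute differences between two integer count vectors."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_similarity(query: Sequence[int],
                       library: Dict[str, Sequence[int]]) -> List[str]:
    """Return library names ordered from most to least similar to the query."""
    return sorted(library, key=lambda name: city_block_distance(query, library[name]))


# Hypothetical count vectors (e.g. atoms, bonds, polar groups, rings).
library = {
    "mol_1": [12, 13, 2, 1],
    "mol_2": [30, 33, 6, 4],
    "mol_3": [11, 12, 3, 1],
}
query = [12, 12, 2, 1]
ranking = rank_by_similarity(query, library)
print(ranking)
```

A smaller distance means a more similar molecule, so the head of the ranking plays the role of the hit list in a ligand-based screen.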

• Scaffold Explorer: An Interactive Tool for Organizing and Mining Structure−Activity Data Spanning Multiple Chemotypes
Agrafiotis, Dimitris K and Wiener, John J M
Journal of medicinal chemistry, 2013, 53(13), 5002-5011

We introduce Scaffold Explorer, an interactive tool that allows medicinal chemists to define hierarchies of chemical scaffolds and use them to explore their project data. Scaffold Explorer allows the user to construct a tree, where each node corresponds to a specific scaffold. Each node can have multiple children, each of which represents a more refined substructure relative to its parent node. Once the tree is defined, it can be mapped onto any collection of compounds and be used as a navigational tool to explore structure-activity relationships (SAR) across different chemotypes. The rich visual analytics of Scaffold Explorer afford the user a "bird's-eye" view of the chemical space spanned by a particular data set, map any physicochemical property or biological activity of interest onto the individual scaffold nodes, serve as an aggregator for the properties of the compounds represented by these nodes, and quickly distinguish promising chemotypes from less interesting or problematic ones. Unlike previous approaches, which focused on automated extraction and classification of scaffolds, the utility of the new tool rests on its interactivity and ability to accommodate the medicinal chemists' intuition by allowing the use of arbitrary substructures containing variable atoms, bonds, and/or substituents such as those employed in substructure search.

• Toward an Efficient Approach to Identify Molecular Scaffolds Possessing Selective or Promiscuous Compounds
Yongye, Austin B and Medina-Franco, José L
Chemical biology & drug design, 2013, 82(4), 367-375
PMID: 23659738     doi: 10.1111/cbdd.12162

The concept of a recurrent scaffold present in a series of structures is common in medicinal drug discovery. We present a scaffold analysis of compounds screened across 100 sequence-unrelated proteins to identify scaffolds that drive promiscuity or selectivity. Selectivity and promiscuity play a major role in traditional and poly-pharmacological drug design considerations. The collection employed here is the first publicly available data set containing the complete screening profiles of more than 15 000 compounds from different sources. In addition, no scaffold analysis of this data set has been reported. The protocol described here employs the Molecular Equivalence Index tool to facilitate the selection of Bemis-Murcko frameworks in the data set, which contain at least five compounds and Scaffold Hunter to generate a hierarchical tree of scaffolds. The annotation of the scaffold tree with protein-binding profile data enabled the successful identification of mostly highly specific compounds, due to data set constraints. We also applied this approach to a public set of 1497 small molecules screened non-uniformly across a panel of 172 protein kinases. The approach is general and can be applied to any other data sets and activity readout.

• Conditional probabilities of activity landscape features for individual compounds.
Vogt, Martin and Iyer, Preeti and Maggiora, Gerald M and Bajorath, Jürgen
Journal of chemical information and modeling, 2013, 53(7), 1602-1612
PMID: 23789585     doi: 10.1021/ci400288r

Activity landscape representations aid in the analysis of structure-activity relationships (SARs) of large compound data sets. Landscapes are characterized by features with different SAR information content such as, for example, regions formed by structurally diverse compounds having similar activity or, alternatively, structurally similar compounds with large activity differences, so-called activity cliffs. Modeling of activity landscapes typically requires pairwise comparisons of molecular similarity and potency relationships of compounds in a data set. Consequently, landscape features are generally resolved at the level of compound pairs. Herein, we introduce a methodology to assign feature probabilities to individual compounds. This makes it possible to organize compounds comprising activity landscapes into well-defined SAR categories. Specifically, the calculation of conditional feature probabilities of active compounds provides a balanced and further refined view of activity landscapes with a focus on individual molecules.

• Quantifying the Fingerprint Descriptor Dependence of Structure-Activity Relationship Information on a Large Scale
Dimova, Dilyana and Stumpfe, Dagmar and Bajorath, Jürgen
Journal of chemical information and modeling, 2013, 53(9), 2275-2281
PMID: 23968259     doi: 10.1021/ci4004078

It is well-known that different molecular representations, e.g., graphs, numerical descriptors, fingerprints, or 3D models, change the numerical results of molecular similarity calculations. Because the assessment of structure-activity relationships (SARs) requires similarity and potency comparisons of active compounds, this representation dependence inevitably also affects SAR analysis. But to what extent? How exactly does SAR information change when alternative fingerprints are used as descriptors? What is the proportion of active compounds with substantial changes in SAR information induced by different fingerprints? To provide answers to these questions, we have quantified changes in SAR information across many different compound classes using six different fingerprints. SAR profiling was carried out on 128 target-based data sets comprising more than 60 000 compounds with high-confidence activity annotations. A numerical measure of SAR discontinuity was applied to assess SAR information on a per compound basis. For ∼70% of all test compounds, changes in SAR characteristics were detected when different fingerprints were used as molecular representations. Moreover, the SAR phenotype of ∼30% of the compounds changed, and distinct fingerprint-dependent local SAR environments were detected. The fingerprints we compared were found to generate SAR models that were essentially not comparable. Atom environment and pharmacophore fingerprints produced the largest differences in compound-associated SAR information. Taken together, the results of our systematic analysis reveal larger fingerprint-dependent changes in compound-associated SAR information than would have been anticipated.

• Progress in the Visualization and Mining of Chemical and Target Spaces
Medina-Franco, José L and Aguayo-Ortiz, Rodrigo
Molecular Informatics, 2013, 32(11-12), 942-953
doi: 10.1002/minf.201300041

Chemogenomics is a growing field that aims to integrate the chemical and target spaces. As part of a multi-disciplinary effort to achieve this goal, computational methods initially developed to visualize the chemical space of compound collections and mine single-target structure-activity relationships, are being adapted to visualize and mine complex relationships in chemogenomics data sets. Similarly, the growing evidence that clinical effects are many times due to the interaction of single or multiple drugs with multiple targets, is encouraging the development of novel methodologies that are integrated in multi-target drug discovery endeavors. Herein we review advances in the development and application of approaches to generate visual representations of chemical space with particular emphasis on methods that aim to explore and uncover relationships between chemical and target spaces. Also, progress in the data mining of the structure-activity relationships of sets of compounds screened across multiple targets are discussed in light of the concept of activity landscape modeling.

• Extracting SAR Information from a Large Collection of Anti-Malarial Screening Hits by NSG-SPT Analysis
Wawer, Mathias and Bajorath, Jürgen
ACS Medicinal Chemistry Letters, 2013, 2(3), 201-206

We combine two graphical SAR analysis methods, Network-like Similarity Graphs (NSGs) and Similarity-Potency Trees (SPTs), to search for SAR information in a large and heterogeneous compound data set containing more than 13,000 antimalarial screening hits that was recently released by GlaxoSmithKline (GSK). The NSG-SPT approach first identifies subsets of compounds inducing local SAR discontinuity in data sets and then extracts available SAR information from these subsets in a graphically intuitive manner. Applying the NSG-SPT analysis scheme, we have identified in the GSK collection compound subsets of high local SAR information content including both known and previously unknown antimalarial chemotypes, which yielded interpretable SAR patterns. This information should be helpful to prioritize and select antimalarial candidate compounds for further chemical exploration. Furthermore, the NSG-SPT tools are publicly available, and our study also shows how to practically apply these SAR analysis methods to study large compound data sets.

## 2012

• CheS-Mapper - Chemical Space Mapping and Visualization in 3D.
Gütlein, Martin and Karwath, Andreas and Kramer, Stefan
Journal of cheminformatics, 2012, 4(1), 7
PMID: 22424447     doi: 10.1186/1758-2946-4-7

Analyzing chemical datasets is a challenging task for scientific researchers in the field of chemoinformatics. It is important, yet difficult to understand the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. To that respect, visualization tools can help to better comprehend the underlying correlations. Our recently developed 3D molecular viewer CheS-Mapper (Chemical Space Mapper) divides large datasets into clusters of similar compounds and consequently arranges them in 3D space, such that their spatial proximity reflects their similarity. The user can indirectly determine similarity, by selecting which features to employ in the process. The tool can use and calculate different kind of features, like structural fragments as well as quantitative chemical descriptors. These features can be highlighted within CheS-Mapper, which aids the chemist to better understand patterns and regularities and relate the observations to established scientific knowledge. As a final function, the tool can also be used to select and export specific subsets of a given dataset for further analysis.

• Chemotography for multi-target SAR analysis in the context of biological pathways.
Lounkine, Eugen and Kutchukian, Peter and Petrone, Paula and Davies, John W and Glick, Meir
Bioorganic & Medicinal Chemistry, 2012, 20(18), 5416-5427
PMID: 22405595     doi: 10.1016/j.bmc.2012.02.034

The increasing amount of chemogenomics data, that is, activity measurements of many compounds across a variety of biological targets, allows for better understanding of pharmacology in a broad biological context. Rather than assessing activity at individual biological targets, today understanding of compound interaction with complex biological systems and molecular pathways is often sought in phenotypic screens. This perspective poses novel challenges to structure-activity relationship (SAR) assessment. Today, the bottleneck of drug discovery lies in the understanding of SAR of rich datasets that go beyond single targets in the context of biological pathways, potential off-targets, and complex selectivity profiles. To aid in the understanding and interpretation of such complex SAR, we introduce Chemotography (chemotype chromatography), which encodes chemical space using a color spectrum by combining clustering and multidimensional scaling. Rich biological data in our approach were visualized using spatial dimensions traditionally reserved for chemical space. This allowed us to analyze SAR in the context of target hierarchies and phylogenetic trees, two-target activity scatter plots, and biological pathways. Chemotography, in combination with the Kyoto Encyclopedia of Genes and Genomes (KEGG), also allowed us to extract pathway-relevant SAR from the ChEMBL database. We identified chemotypes showing polypharmacology and selectivity-conferring scaffolds, even in cases where individual compounds have not been tested against all relevant targets. In addition, we analyzed SAR in ChEMBL across the entire Kinome, going beyond individual compounds. Our method combines the strengths of chemical space visualization for SAR analysis and graphical representation of complex biological data. 
Chemotography is a new paradigm for chemogenomic data visualization and its versatile applications presented here may allow for improved assessment of SAR in biological context, such as phenotypic assay hit lists.

• The Molecule Cloud - compact visualization of large collections of molecules.
Ertl, Peter and Rohde, Bernhard
Journal of cheminformatics, 2012, 4(1), 12
PMID: 22769057     doi: 10.1186/1758-2946-4-12

BACKGROUND:Analysis and visualization of large collections of molecules is one of the most frequent challenges cheminformatics experts in pharmaceutical industry are facing. Various sophisticated methods are available to perform this task, including clustering, dimensionality reduction or scaffold frequency analysis. In any case, however, viewing and analyzing large tables with molecular structures is necessary. We present a new visualization technique, providing basic information about the composition of molecular data sets at a single glance.

• ChemBioServer: a web-based pipeline for filtering, clustering and visualization of chemical compounds used in drug discovery
Athanasiadis, Emmanouil and Cournia, Zoe and Spyrou, George
Bioinformatics (Oxford, England), 2012, 28(22), 3002-3003
PMID: 22962344     doi: 10.1093/bioinformatics/bts551

Summary: ChemBioServer is a publicly available web application for effectively mining and filtering chemical compounds used in drug discovery. It provides researchers with the ability to (i) browse and visualize compounds along with their properties, (ii) filter chemical compounds for a variety of properties such as steric clashes and toxicity, (iii) apply perfect match substructure search, (iv) cluster compounds according to their physicochemical properties providing representative compounds for each cluster, (v) build custom compound mining pipelines and (vi) quantify through property graphs the top ranking compounds in drug discovery procedures. ChemBioServer allows for pre-processing of compounds prior to an in silico screen, as well as for post-processing of top-ranked molecules resulting from a docking exercise with the aim to increase the efficiency and the quality of compound selection that will pass to the experimental test phase. Availability: The ChemBioServer web application is available at: http://bioserver-3.bioacademy.gr/Bioserver/ChemBioServer/. Contact: gspyrou@bioacademy.gr

• Directed R-group combination graph: a methodology to uncover structure-activity relationship patterns in a series of analogues.
Wassermann, Anne Mai and Bajorath, Jürgen
Journal of medicinal chemistry, 2012, 55(3), 1215-1226
PMID: 22248436     doi: 10.1021/jm201362h

A graphical method is introduced to study details of structure-activity relationships (SARs) in analogue series that further extends conventional analysis of analogues using R-group tables or related approaches and that provides additional and more differentiated SAR information. The newly designed graph structure represents entire series of analogues in a consistent manner, regardless of their size and complexity of substitution patterns. The approach is specifically tailored toward a systematic exploration and intuitive interpretation of SAR features involving different R-groups and their combinations. Analogues and their potency information are systematically organized on the basis of R-group combinations that are present in a series. This organization scheme results in graph components that represent well-defined SAR patterns. Analysis of these patterns provides an immediate access to critical substitution sites and R-group combinations, favorable and unfavorable R-groups, or nonadditive potency effects of multisite substitutions. Furthermore, the data structure makes it possible to design new analogues by combining favorable R-group combinations derived from different compounds.

• Data mining of protein-binding profiling data identifies structural modifications that distinguish selective and promiscuous compounds.
Yongye, Austin B and Medina-Franco, José L
Journal of chemical information and modeling, 2012, 52(9), 2454-2461
PMID: 22856455     doi: 10.1021/ci3002606

Activity profiling of compound collections across multiple targets is increasingly being used in probe and drug discovery. Herein, we discuss an approach to systematically analyzing the structure-activity relationships of a large screening profile data with emphasis on identifying structural changes that have a significant impact on the number of proteins to which a compound binds. As a case study, we analyzed a recently released public data set of more than 15 000 compounds screened across 100 sequence-unrelated proteins. The screened compounds have different origins and include natural products, synthetic molecules from academic groups, and commercial compounds. Similar synthetic structures from academic groups showed, overall, greater promiscuity differences than do natural products and commercial compounds. The method implemented in this work readily identified structural changes that differentiated highly specific from promiscuous compounds. This approach is general and can be applied to analyze any other large-scale protein-binding profile data.

• Exploring SAR continuity in the vicinity of activity cliffs.
Namasivayam, Vigneshwaran and Iyer, Preeti and Bajorath, Jürgen
Chemical biology & drug design, 2012, 79(1), 22-29
PMID: 21985661     doi: 10.1111/j.1747-0285.2011.01256.x

Activity cliffs are formed by structurally similar compounds with significant differences in potency and represent an extreme form of structure-activity relationships discontinuity. By contrast, regions of structure-activity relationships continuity in compound data sets result from the presence of structurally increasingly diverse compounds retaining similar activity. Previous studies have revealed that structure-activity relationships information extracted from large compound data sets is often heterogeneous in nature containing both continuous and discontinuous structure-activity relationships components. Structure-activity relationships discontinuity and continuity are often represented by different compound series, independent of each other. Here, we have searched different compound data sets for the presence of structure-activity relationships continuity within the vicinity of prominent activity cliffs. For this purpose, we have designed and implemented a computational approach utilizing particle swarm optimization to examine the structural neighborhood of activity cliffs for continuous structure-activity relationships components. Structure-activity relationships continuity in the structural neighborhood of activity cliffs was relatively rarely observed. However, in a number of cases, notable structure-activity relationships continuity was detected in the vicinity of prominent activity cliffs. Exemplary local structure-activity relationships environments displaying these characteristics were analyzed in detail. Thus, the structure-activity relationships environment of activity cliffs must not necessarily be discontinuous in nature, and local structure-activity relationships continuity and discontinuity can occur in a concerted manner in series of structurally related compounds.
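
The definition of an activity cliff given above (structurally similar compounds with a large potency difference) can be made concrete with a small sketch. Everything specific in it is an assumption for illustration: fingerprints are modeled as plain sets of feature identifiers, similarity is the Tanimoto coefficient, and the thresholds (similarity of at least 0.7, potency gap of at least 2 log units) and compound names are hypothetical. It does not reproduce the particle swarm optimization used in the paper.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two feature sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_activity_cliffs(compounds: Dict[str, Tuple[Set[int], float]],
                         min_similarity: float = 0.7,
                         min_potency_gap: float = 2.0) -> List[Tuple[str, str]]:
    """Return compound pairs that are similar yet differ strongly in potency."""
    cliffs = []
    for (n1, (fp1, p1)), (n2, (fp2, p2)) in combinations(compounds.items(), 2):
        if tanimoto(fp1, fp2) >= min_similarity and abs(p1 - p2) >= min_potency_gap:
            cliffs.append((n1, n2))
    return cliffs


# Hypothetical data: name -> (feature set, potency as pKi).
compounds = {
    "cpd_A": ({1, 2, 3, 4, 5, 6, 7, 8}, 8.5),
    "cpd_B": ({1, 2, 3, 4, 5, 6, 7, 9}, 5.9),   # near-identical to A, much weaker
    "cpd_C": ({10, 11, 12, 13}, 8.4),           # potent but structurally unrelated
}
cliffs = find_activity_cliffs(compounds)
print(cliffs)
```

Pairs that pass the similarity threshold but fail the potency-gap threshold would instead belong to the continuous SAR regions the abstract contrasts with cliffs.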

• Graph mining for SAR transfer series.
Gupta-Ostermann, Disha and Wawer, Mathias and Wassermann, Anne Mai and Bajorath, Jürgen
Journal of chemical information and modeling, 2012, 52(4), 935-942
PMID: 22436016     doi: 10.1021/ci300071y

The transfer of SAR information from one analog series to another is a difficult, yet highly attractive task in medicinal chemistry. At present, the evaluation of SAR transfer potential from a data mining perspective is still in its infancy. Only recently, a first computational approach has been introduced to evaluate SAR transfer events. Here, a substructure relationship-based molecular network representation has been used as a starting point to systematically identify SAR transfer series in large compound data sets. For this purpose, a methodology is introduced that consists of two stages. For graph mining, an algorithm has been designed that extracts all parallel series from compound data sets. A parallel series is formed by two series of analogs with different core structures but pairwise corresponding substitution patterns. The SAR transfer potential of identified parallel series is then evaluated using a scoring function that emphasizes corresponding potency progression over many analog pairs and large potency ranges. The substructure relationship-based molecular network in combination with the graph mining algorithm currently represents the only generally applicable approach to systematically detect SAR transfer events in large compound data sets. The combined approach has been evaluated on a large number of compound data sets and shown to systematically identify SAR transfer series.

• Introducing the LASSO graph for compound data set representation and structure-activity relationship analysis.
Gupta-Ostermann, Disha and Hu, Ye and Bajorath, Jürgen
Journal of medicinal chemistry, 2012, 55(11), 5546-5553
PMID: 22571406     doi: 10.1021/jm3004762

A graphical method is introduced for compound data mining and structure-activity relationship (SAR) data analysis that is based upon a canonical structural organization scheme and captures a compound-scaffold-skeleton hierarchy. The graph representation has a constant layout, integrates compound activity data, and provides direct access to SAR information. Characteristic SAR patterns that emerge from the graph are easily identified. The molecular hierarchy enables "forward-backward" analysis of compound data and reveals both global and local SAR patterns. For example, in heterogeneous data sets, compound series are immediately identified that convey interpretable SAR information in isolation or in the structural context of related series, which often define SAR pathways through data sets.

• SAR Matrices: Automated Extraction of Information-Rich SAR Tables from Large Compound Data Sets.
Wassermann, Anne Mai and Haebel, Peter and Weskamp, Nils and Bajorath, Jürgen
Journal of chemical information and modeling, 2012, 52(7), 1769-1776
PMID: 22657271     doi: 10.1021/ci300206e

We introduce the SAR matrix data structure that is designed to elucidate SAR patterns produced by groups of structurally related active compounds, which are extracted from large data sets. SAR matrices are systematically generated and sorted on the basis of SAR information content. Matrix generation is computationally efficient and enables processing of large compound sets. The matrix format is reminiscent of SAR tables, and SAR patterns revealed by different categories of matrices are easily interpretable. The structural organization underlying matrix formation is more flexible than standard R-group decomposition schemes. Hence, the resulting matrices capture SAR information in a comprehensive manner.

• Multiobjective particle swarm optimization: automated identification of structure-activity relationship-informative compounds with favorable physicochemical property distributions.
Namasivayam, Vigneshwaran and Bajorath, Jürgen
Journal of chemical information and modeling, 2012, 52(11), 2848-2855
PMID: 23039232     doi: 10.1021/ci300402g

The selection of active compounds for chemical optimization efforts typically requires the consideration of multiple properties beyond potency. Herein we introduce a multiobjective particle swarm optimization approach to automatically extract compound subsets from large data sets that reveal structure-activity relationship (SAR) information and display physicochemical property distributions that are indicative of favorable absorption, distribution, metabolism, and excretion (ADME) characteristics. The approach is based on Pareto optimization of multiple objectives and does not require subjective intervention. It is automated and can be easily modified. We have applied the method to screen 10 compound data sets of different composition and global SAR phenotypes. In five of these data sets, between one and more than hundred compound subsets were identified that represented discontinuous local SARs and had desirable property distributions.

• Design of a three-dimensional multitarget activity landscape.
de la Vega de León, Antonio and Bajorath, Jürgen
Journal of chemical information and modeling, 2012, 52(11), 2876-2883
PMID: 23113585     doi: 10.1021/ci300444p

The design of activity landscape representations is challenging when compounds are active against multiple targets. Going beyond three or four targets, the complexity of underlying activity spaces is difficult to capture in conventional activity landscape views. Previous attempts to generate multitarget activity landscapes have predominantly utilized extensions of molecular network representations or plots of activity versus chemical similarity for pairs of active compounds. Herein, we introduce a three-dimensional multitarget activity landscape design that is based upon principles of radial coordinate visualization. Circular representations of multitarget activity and chemical reference space are combined to generate a spherical view into which compound sets are projected for interactive analysis. Interpretation of landscape content is facilitated by following three canonical views of activity, chemical, and combined activity/chemical space, respectively. These views focus on different planes of the underlying coordinate system. From the activity and combined views, compounds with well-defined target selectivity and structure-activity profile relationships can be extracted. In the activity landscape, such compounds display characteristic spatial arrangements and target activity patterns.

• Systematic assessment of compound series with SAR transfer potential.
Zhang, Bijun and Wassermann, Anne Mai and Vogt, Martin and Bajorath, Jürgen
Journal of chemical information and modeling, 2012, 52(12), 3138-3143
PMID: 23186159     doi: 10.1021/ci300481d

Compound series with different core structures that contain pairs of analogs with corresponding substitution patterns and similar activity represent structure-activity relationship (SAR) transfer events. On the basis of the matched molecular pair (MMP) formalism and linear regression analysis of compound potencies, a general approach is introduced for the identification of SAR transfer series (SAR-TS) and SAR-TS with regular potency progression (SAR-TS-RP). We have systematically extracted such series from public domain compound data and analyzed their size distribution and structural characteristics. More than 900 SAR-TS and 500 SAR-TS-RP with high-confidence potency annotations were identified in various compound activity classes. These series provide a substantial knowledge base for the analysis and prediction of SAR transfer and are made publicly available.
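
The role of linear regression in judging potency progression across two analog series can be sketched as below. This is only an illustration under stated assumptions: the analog pairs are taken as already matched by substituent, the potency values are hypothetical, and reading a correlation coefficient near 1 as "regular potency progression" is a simplification rather than the criterion defined in the paper.

```python
from math import sqrt
from typing import Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares fit y = slope * x + intercept, plus Pearson r."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical pKi values; position i in both lists carries the same substituent.
series_a = [5.0, 6.0, 7.0, 8.0]   # analogs built on core A
series_b = [5.5, 6.5, 7.5, 8.5]   # corresponding analogs built on core B
slope, intercept, r = linear_fit(series_a, series_b)
print(slope, intercept, r)
```

In this toy case the potencies of the second series track those of the first with a constant offset, which is the kind of pattern that would suggest SAR information transfers between the two cores.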

• Identifying Activity Cliff Generators of PPAR Ligands Using SAS Maps
Méndez-Lucio, Oscar and Pérez-Villanueva, Jaime and Castillo, Rafael and Medina-Franco, José L
Molecular Informatics, 2012, 31(11-12), 837-846
doi: 10.1002/minf.201200078

Structure-activity relationships (SAR) of compound databases play a key role in hit identification and lead optimization. In particular, activity cliffs, defined as a pair of structurally similar molecules that present large changes in potency, provide valuable SAR information. Herein, we introduce the concept of activity cliff generator, defined as a molecular structure that has a high probability to form activity cliffs with molecules tested in the same biological assay. To illustrate this concept, we discuss a case study where Structure-Activity Similarity maps were used to systematically identify and analyze activity cliff generators present in a dataset of 168 compounds tested against three peroxisome-proliferator-activated receptor (PPAR) subtypes. Single-target and dual-target activity cliff generators for PPARα and δ were identified. In addition, docking calculations of compounds that were classified as cliff generators helped to suggest a hot spot in the target protein responsible of activity cliffs and to analyze its implication in ligand-enzyme interaction.

## 2011

• Lessons learned from molecular scaffold analysis
Hu, Y and Stumpfe, D and Bajorath, J
Journal of chemical information and …, 2011

• Mining for bioactive scaffolds with scaffold networks: improved compound set enrichment from primary screening data.
Varin, Thibault and Schuffenhauer, Ansgar and Ertl, Peter and Renner, Steffen
Journal of chemical information and modeling, 2011, 51(7), 1528-1538
PMID: 21615076     doi: 10.1021/ci2000924

Identification of meaningful chemical patterns in the increasing amounts of high-throughput-generated bioactivity data available today is an increasingly important challenge for successful drug discovery. Herein, we present the scaffold network as a novel approach for mapping and navigation of chemical and biological space. A scaffold network represents the chemical space of a library of molecules consisting of all molecular scaffolds and smaller "parent" scaffolds generated therefrom by the pruning of rings, effectively leading to a network of common scaffold substructure relationships. This algorithm provides an extension of the scaffold tree algorithm that, instead of a network, generates a tree relationship between a heuristically rule-based selected subset of parent scaffolds. The approach was evaluated for the identification of statistically significantly active scaffolds from primary screening data for which the scaffold tree approach has already been shown to be successful. Because of the exhaustive enumeration of smaller scaffolds and the full enumeration of relationships between them, about twice as many statistically significantly active scaffolds were identified compared to the scaffold-tree-based approach. We suggest visualizing scaffold networks as islands of active scaffolds.

• Visualization of molecular fingerprints.
Owen, John R and Nabney, Ian T and Medina-Franco, José L and López-Vallejo, Fabian
Journal of chemical information and modeling, 2011, 51(7), 1552-1563
PMID: 21696145     doi: 10.1021/ci1004042

A visualization plot of a data set of molecular data is a useful tool for gaining insight into a set of molecules. In chemoinformatics, most visualization plots are of molecular descriptors, and the statistical model most often used to produce a visualization is principal component analysis (PCA). This paper takes PCA, together with four other statistical models (NeuroScale, GTM, LTM, and LTM-LIN), and evaluates their ability to produce clustering in visualizations not of molecular descriptors but of molecular fingerprints. Two different tasks are addressed: understanding structural information (particularly combinatorial libraries) and relating structure to activity. The quality of the visualizations is compared both subjectively (by visual inspection) and objectively (with global distance comparisons and local k-nearest-neighbor predictors). On the data sets used to evaluate clustering by structure, LTM is found to perform significantly better than the other models. In particular, the clusters in LTM visualization space are consistent with the relationships between the core scaffolds that define the combinatorial sublibraries. On the data sets used to evaluate clustering by activity, LTM again gives the best performance but by a smaller margin. The results of this paper demonstrate the value of using both a nonlinear projection map and a Bernoulli noise model for modeling binary data.

• Single R-Group Polymorphisms (SRPs) and R-Cliffs: An Intuitive Framework for Analyzing and Visualizing Activity Cliffs in a Single Analog Series.
Agrafiotis, Dimitris K and Wiener, John J M and Skalkin, Andrew and Kolpak, Jeremy
Journal of chemical information and modeling, 2011, 51(5), 1122-1131
PMID: 21504183     doi: 10.1021/ci200054u

We introduce Single R-Group Polymorphisms (SRPs, pronounced 'sharps'), an intuitive framework for analyzing substituent effects and activity cliffs in a single congeneric series. A SRP is a pair of compounds that differ only in a single R-group position. Because the same substituent pair may occur in multiple SRPs in the series (i.e., with different combinations of substituents at the other R-group positions), SRP analysis makes it easy to identify systematic substituent effects and activity cliffs at each point of variation (R-cliffs). SRPs can be visualized as a symmetric heatmap where each cell represents a particular pair of substituents color-coded by the average difference in activity between the compounds that contain that particular SRP. SRP maps offer several advantages over existing techniques for visualizing activity cliffs: 1) the chemical structures of all the substituents are displayed simultaneously on a single map, thus directly engaging the pattern recognition abilities of the medicinal chemist; 2) it is based on R-group decomposition, a natural paradigm for generating and rationalizing SAR; 3) it uses a heatmap representation that makes it easy to identify systematic trends in the data; 4) it generalizes the concept of activity cliffs beyond similarity by allowing the analyst to sort the substituents according to any property of interest or place them manually in any desired order.
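
The SRP definition above (two compounds that differ at exactly one R-group position) and the averaged heatmap cell can be sketched as follows. This is a minimal illustration that assumes the R-group decomposition has already been done; the data layout and the names `find_srps` and `srp_map` are hypothetical and are not taken from the authors' software.

```python
from collections import defaultdict
from itertools import combinations


def find_srps(compounds):
    """Return (name_a, name_b, position) for every pair of compounds
    whose R-group tuples differ at exactly one position."""
    srps = []
    for (name_a, (rg_a, _)), (name_b, (rg_b, _)) in combinations(
        sorted(compounds.items()), 2
    ):
        if len(rg_a) != len(rg_b):
            continue
        diff = [i for i, (x, y) in enumerate(zip(rg_a, rg_b)) if x != y]
        if len(diff) == 1:
            srps.append((name_a, name_b, diff[0]))
    return srps


def srp_map(compounds, position):
    """Average activity difference per substituent pair at one position.

    Keys are alphabetically ordered pairs (sub_1, sub_2); values are the
    mean of activity(compound with sub_2) - activity(compound with sub_1).
    """
    diffs = defaultdict(list)
    for a, b, pos in find_srps(compounds):
        if pos != position:
            continue
        (rg_a, act_a), (rg_b, act_b) = compounds[a], compounds[b]
        sub_a, sub_b = rg_a[pos], rg_b[pos]
        if sub_a > sub_b:  # orient every pair the same way
            sub_a, sub_b, act_a, act_b = sub_b, sub_a, act_b, act_a
        diffs[(sub_a, sub_b)].append(act_b - act_a)
    return {pair: sum(v) / len(v) for pair, v in diffs.items()}


# Hypothetical series: name -> ((R1, R2), activity such as pIC50).
compounds = {
    "c1": (("H", "Me"), 6.0),
    "c2": (("Cl", "Me"), 7.5),
    "c3": (("H", "Et"), 6.2),
    "c4": (("Cl", "Et"), 7.9),
}

srps = find_srps(compounds)
r1_map = srp_map(compounds, 0)
```

Each entry of `r1_map` corresponds to one cell of the heatmap for the first R-group position; a consistently large value across several SRPs is what the abstract calls an R-cliff.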

• Consensus models of activity landscapes with multiple chemical, conformer, and property representations.
Yongye, Austin B and Byler, Kendall and Santos, Radleigh and Martínez-Mayorga, Karina and Maggiora, Gerald M and Medina-Franco, José L
Journal of chemical information and modeling, 2011, 51(6), 1259-1270
PMID: 21609014     doi: 10.1021/ci200081k

We report consensus Structure-Activity Similarity (SAS) maps that address the dependence of activity landscapes on molecular representation. As a case study, we characterized the activity landscape of 54 compounds with activities against human cathepsin B (hCatB), human cathepsin L (hCatL), and Trypanosoma brucei cathepsin B (TbCatB). Starting from an initial set of 28 descriptors we selected ten representations that capture different aspects of the chemical structures. These included four 2D (MACCS keys, GpiDAPH3, pairwise, and radial fingerprints) and six 3D (4p and piDAPH4 fingerprints with each including three conformers) representations. Multiple conformers are used for the first time in consensus activity landscape modeling. The results emphasize the feasibility of identifying consensus data points that are consistently formed in different reference spaces generated with several fingerprint models, including multiple 3D conformers. Consensus data points are not meant to eliminate data, disregarding, for example, "true" activity cliffs that are not identified by some molecular representations. Instead, consensus models are designed to prioritize the SAR analysis of activity cliffs and other consistent regions in the activity landscape that are captured by several molecular representations. Systematic description of the SARs of two targets gives rise to the identification of pairs of compounds located in the same region of the activity landscape of hCatL and TbCatB suggesting similar mechanisms of action for the pairs involved. We also explored the relationship between property similarity and activity similarity and found that property similarities are suitable to characterize SARs. We also introduce the concept of structure-property-activity (SPA) similarity in SAR studies.

• SAR monitoring of evolving compound data sets using activity landscapes.
Iyer, Preeti and Hu, Ye and Bajorath, Jürgen
Journal of chemical information and modeling, 2011, 51(3), 532-540
PMID: 21322535     doi: 10.1021/ci100505m

In pharmaceutical research, collections of active compounds directed against specific therapeutic targets usually evolve over time. Small molecule discovery is an iterative process. New compounds are discovered, alternative compound series explored, some series discontinued, and others prioritized. The design of new compounds usually takes into consideration prior chemical and structure-activity relationship (SAR) knowledge. Hence, historically grown compound collections represent a viable source of chemical and SAR information that might be utilized to retrospectively analyze roadblocks in compound optimization and further guide discovery projects. However, SAR analysis of large and heterogeneous sets of active compounds is also principally complicated. We have subjected evolving compound data sets to SAR monitoring using activity landscape models in order to evaluate how composition and SAR characteristics might change over time. Chemotype and potency distributions in evolving data sets directed against different therapeutic targets were analyzed and alternative activity landscape representations generated at different points in time to monitor the progression of global and local SAR features. Our results show that the evolving data sets studied here have predominantly grown around seed clusters of active compounds that often emerged early on, while other SAR islands remained largely unexplored. Moreover, increasing scaffold diversity in evolving data sets did not necessarily yield new SAR patterns, indicating a rather significant influence of "me-too-ism" (i.e., introducing new chemotypes that are similar to already known ones) on the composition and SAR information content of the data sets.

• Activity profile sequences: a concept to account for the progression of compound activity in target space and to extract SAR information from analogue series with multiple target annotations.
Hu, Ye and Bajorath, Jürgen
Chemmedchem, 2011, 6(12), 2150-2154
PMID: 22052747     doi: 10.1002/cmdc.201100395

• Assessing the confidence level of public domain compound activity data and the impact of alternative potency measurements on SAR analysis.
Stumpfe, Dagmar and Bajorath, Jürgen
Journal of chemical information and modeling, 2011, 51(12), 3131-3137
PMID: 22059677     doi: 10.1021/ci2004434

Publicly available compound activity data have been analyzed to distinguish between compounds for which single or multiple potency measurements were available and gain insight into data confidence levels. Different potency measurements with defined end points and alternative ways to represent multiple potency values for active compounds have been evaluated in the context of SAR analysis. Approximately 78% of all compounds with multiple potency measurements were found to represent high-confidence data, which corresponded to ∼10% of all activity data. The use of different types of potency measurements and alternative representations of multiple potency values changed the SAR information content of compound data sets and resulted in different activity cliff distributions. Thus, the types of activity measurements that were available and how they were used substantially impacted SAR analysis. Compounds with multiple K(i) measurements provided the most reliable basis for SAR exploration.

• Design of multitarget activity landscapes that capture hierarchical activity cliff distributions.
Dimova, Dilyana and Wawer, Mathias and Wassermann, Anne Mai and Bajorath, Jürgen
Journal of chemical information and modeling, 2011, 51(2), 258-266
PMID: 21275393     doi: 10.1021/ci100477m

An activity landscape model of a compound data set can be rationalized as a graphical representation that integrates molecular similarity and potency relationships. Activity landscape representations of different design are utilized to aid in the analysis of structure-activity relationships and the selection of informative compounds. Activity landscape models reported thus far focus on a single target (i.e., a single biological activity) or at most two targets, giving rise to selectivity landscapes. For compounds active against more than two targets, landscapes representing multitarget activities are difficult to conceptualize and have not yet been reported. Herein, we present a first activity landscape design that integrates compound potency relationships across multiple targets in a formally consistent manner. These multitarget activity landscapes are based on a general activity cliff classification scheme and are visualized in graph representations, where activity cliffs are represented as edges. Furthermore, the contributions of individual compounds to structure-activity relationship discontinuity across multiple targets are monitored. The methodology has been applied to derive multitarget activity landscapes for compound data sets active against different target families. The resulting landscapes identify single-, dual-, and triple-target activity cliffs and reveal the presence of hierarchical cliff distributions. From these multitarget activity landscapes, compounds forming complex activity cliffs can be readily selected.

• Rationalizing the role of SAR tolerance for ligand-based virtual screening.
Ripphausen, Peter and Nisius, Britta and Wawer, Mathias and Bajorath, Jürgen
Journal of chemical information and modeling, 2011, 51(4), 837-842
PMID: 21438544     doi: 10.1021/ci200064c

It is well appreciated that the results of ligand-based virtual screening (LBVS) are much influenced by methodological details, given the generally strong compound class dependence of LBVS methods. It is less well understood to what extent structure-activity relationship (SAR) characteristics might influence the outcome of LBVS. We have assessed the hypothesis that the success of prospective LBVS depends on the SAR tolerance of screening targets, in addition to methodological aspects. In this context, SAR tolerance is rationalized as the ability of a target protein to specifically interact with series of structurally diverse active compounds. In compound data sets, SAR tolerance articulates itself as SAR continuity, i.e., the presence of structurally diverse compounds having similar potency. In order to analyze the role of SAR tolerance for LBVS, activity landscape representations of compounds active against 16 different target proteins were generated for which successful LBVS applications were reported. In all instances, the activity landscapes of known active compounds contained multiple regions of local SAR continuity. When analyzing the location of newly identified LBVS hits and their SAR environments, we found that these hits almost exclusively mapped to regions of distinct local SAR continuity. Taken together, these findings indicate the presence of a close link between SAR tolerance at the target level, SAR continuity at the ligand level, and the probability of LBVS success.

• From Virtual Screening to Bioactive Compounds by Visualizing and Clustering of Chemical Space
Klenner, Alexander and Hähnke, Volker and Geppert, Tim and Schneider, Petra and Zettl, Heiko and Haller, Sarah and Rodrigues, Tiago and Reisen, Felix and Hoy, Benjamin and Schaible, Anja Maria and Werz, Oliver and Wessler, Silja and Schneider, Gisbert
Molecular Informatics, 2011, 31(1), 21-26
doi: 10.1002/minf.201100147

• Local Structural Changes, Global Data Views: Graphical Substructure−Activity Relationship Trailing
Wawer, Mathias and Bajorath, Jürgen
Journal of medicinal chemistry, 2011, 54(8), 2944-2951
PMID: 21443196     doi: 10.1021/jm200026b

## 2010

• SARANEA: a freely available program to mine structure-activity and structure-selectivity relationship information in compound data sets.
Lounkine, Eugen and Wawer, Mathias and Wassermann, Anne Mai and Bajorath, Jürgen
Journal of chemical information and modeling, 2010, 50(1), 68-78
PMID: 20053000     doi: 10.1021/ci900416a

We introduce SARANEA, an open-source Java application for interactive exploration of structure-activity relationship (SAR) and structure-selectivity relationship (SSR) information in compound sets of any source. SARANEA integrates various SAR and SSR analysis functions and utilizes a network-like similarity graph data structure for visualization. The program enables the systematic detection of activity and selectivity cliffs and corresponding key compounds across multiple targets. Advanced SAR analysis functions implemented in SARANEA include, among others, layered chemical neighborhood graphs, cliff indices, selectivity trees, editing functions for molecular networks and pathways, bioactivity summaries of key compounds, and markers for bioactive compounds having potential side effects. We report the application of SARANEA to identify SAR and SSR determinants in different sets of serine protease inhibitors. It is found that key compounds can influence SARs and SSRs in rather different ways. Such compounds and their SAR/SSR characteristics can be systematically identified and explored using SARANEA. The program and source code are made freely available under the GNU General Public License.

• 2D Depiction of Fragment Hierarchies
Clark, Alex M
Journal of chemical information and modeling, 2010, 50(1), 37-46
doi: 10.1021/ci900350h

Drug discovery projects often involve organizing compounds in the form of a hierarchical tree, where each node is a substructure fragment shared by all of its descendent nodes. A method is described for producing 2D depiction layout coordinates for each of the nodes in such a tree, ensuring that common fragments within molecular structures are drawn in an identical way, and arranged with a consistent orientation. This is achieved by first deriving a common numbering scheme for common fragments, then using this scheme to redepict each of the molecules, one fragment at a time, so that common fragments have common depiction motifs. Once complete, the distinct root branches can be overlaid onto each other, after which all of the fragments and whole molecules have a common layout and orientation. Several methods are described for preparing visual representations of molecular structure hierarchies alongside activity information. Combining high level tree display and structure depiction showing common features readily facilitates insight into structure-activity relationships.

• Bioactivity-Guided Navigation of Chemical Space
Bon, Robin S and Waldmann, Herbert
Accounts of chemical research, 2010, 43(8), 1103-1114
PMID: 20481515     doi: 10.1021/ar100014h

• Data structures and computational tools for the extraction of SAR information from large compound sets.
Wawer, Mathias and Lounkine, Eugen and Wassermann, Anne M and Bajorath, Jürgen
Drug discovery …, 2010, 15(15-16), 630-639
PMID: 20547243     doi: 10.1016/j.drudis.2010.06.004

Computational data mining and visualization techniques play a central part in the extraction of structure-activity relationship (SAR) information from compound sets including high-throughput screening data. Standard statistical and classification techniques can be used to organize data sets and evaluate the chemical neighborhood of potent hits; however, such methods are limited in their ability to extract complex SAR patterns from data sets and make them readily accessible to medicinal chemists. Therefore, new approaches and data structures are being developed that explicitly focus on molecular structure and its relationship to biological activity across multiple targets. Here, we review standard techniques for compound data analysis and describe new data structures and computational tools for SAR mining of large compound data sets.

• Scaffold Hunter - Interactive Exploration of Chemical Space
Klein, Karsten and Kriege, Nils and Mutzel, Petra and Waldmann, Herbert and Wetzel, Stefan
2010, 426-427
doi: 10.1007/978-3-642-11805-0_47

Scaffold Hunter is a Java-based software tool for the analysis of structure-related biochemical data. It facilitates the interactive exploration of chemical space by enabling generation of and navigation in a scaffold tree hierarchy annotated with various data. The graphical visualization of structural relationships allows the analysis of large data sets, e.g., to correlate chemical structure and biochemical activity.

• Cheminformatics approaches to analyze diversity in compound screening libraries.
Akella, Lakshmi B and DeCaprio, David
Current opinion in chemical biology, 2010, 14(3), 325-330
PMID: 20457001     doi: 10.1016/j.cbpa.2010.03.017

As high-throughput screening matures as a discipline, cheminformatics is playing an increasingly important role in selecting new compounds for diverse screening libraries. New visualization techniques such as multi-fusion similarity maps, scaffold trees, and principal moments of inertia plots provide complementary information on compound libraries and enable identification of unexplored regions of chemical space with potential biological relevance. Quantitative metrics have been developed to analyze libraries for properties such as natural product-likeness and shape complexity. Analysis of high-throughput screening results and drug discovery programs identify compounds problematic for screening. Taken together these approaches allow us to increase the diversity of biological outcomes available in compound screening libraries and improve the success rates of high-throughput screening against new targets without making significant increases in the size of compound libraries.

• Compound set enrichment: a novel approach to analysis of primary HTS data.
Varin, Thibault and Gubler, Hanspeter and Parker, Christian N and Zhang, Ji-Hu and Raman, Pichai and Ertl, Peter and Schuffenhauer, Ansgar
Journal of chemical information and modeling, 2010, 50(12), 2067-2078
PMID: 21073183     doi: 10.1021/ci100203e

The main goal of high-throughput screening (HTS) is to identify active chemical series rather than just individual active compounds. In light of this goal, a new method (called compound set enrichment) to identify active chemical series from primary screening data is proposed. The method employs the scaffold tree compound classification in conjunction with the Kolmogorov-Smirnov statistic to assess the overall activity of a compound scaffold. The application of this method to seven PubChem data sets (containing between 9389 and 263679 molecules) is presented, and the ability of this method to identify compound classes with only weakly active compounds (potentially latent hits) is demonstrated. The analysis presented here shows how methods based on an activity cutoff can distort activity information, leading to the incorrect activity assignment of compound series. These results suggest that this method might have utility in the rational selection of active classes of compounds (and not just individual active compounds) for followup and validation.
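
The core idea of scoring a scaffold by a Kolmogorov-Smirnov statistic rather than by an activity cutoff can be sketched in a few lines. This is only an illustration: it compares the compounds carrying a scaffold against all remaining compounds, which is an assumption about the comparison set, and the scaffold assignments are given by hand rather than derived from a scaffold tree. The names `ks_statistic` and `rank_scaffolds` are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample, background):
    """Two-sample Kolmogorov-Smirnov D: the largest gap between the
    empirical cumulative distributions of the two samples."""
    a, b = sorted(sample), sorted(background)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def rank_scaffolds(activity, scaffold_members):
    """Score each scaffold by how strongly the activity distribution of
    its compounds deviates from that of all other compounds."""
    scores = {}
    for scaffold, members in scaffold_members.items():
        inside = [activity[c] for c in members]
        outside = [v for c, v in activity.items() if c not in members]
        if inside and outside:
            scores[scaffold] = ks_statistic(inside, outside)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical primary screening data: compound -> percent inhibition.
activity = {"a": 80, "b": 75, "c": 70, "d": 10,
            "e": 12, "f": 8, "g": 15, "h": 11}
scaffold_members = {
    "indole": {"a", "b", "c"},
    "pyridine": {"d", "e", "f"},
}

ranked = rank_scaffolds(activity, scaffold_members)
```

Note that D is unsigned, so a high score only says that a scaffold's activities are distributed differently from the rest; the direction of the shift (more or less active) has to be checked separately, and a p-value would be needed before calling a scaffold significantly enriched.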

• Scaffold explorer: an interactive tool for organizing and mining structure-activity data spanning multiple chemotypes.
Agrafiotis, Dimitris K and Wiener, John J M
Journal of medicinal chemistry, 2010, 53(13), 5002-5011
PMID: 20524668     doi: 10.1021/jm1004495

We introduce Scaffold Explorer, an interactive tool that allows medicinal chemists to define hierarchies of chemical scaffolds and use them to explore their project data. Scaffold Explorer allows the user to construct a tree, where each node corresponds to a specific scaffold. Each node can have multiple children, each of which represents a more refined substructure relative to its parent node. Once the tree is defined, it can be mapped onto any collection of compounds and be used as a navigational tool to explore structure-activity relationships (SAR) across different chemotypes. The rich visual analytics of Scaffold Explorer afford the user a "bird's-eye" view of the chemical space spanned by a particular data set, map any physicochemical property or biological activity of interest onto the individual scaffold nodes, serve as an aggregator for the properties of the compounds represented by these nodes, and quickly distinguish promising chemotypes from less interesting or problematic ones. Unlike previous approaches, which focused on automated extraction and classification of scaffolds, the utility of the new tool rests on its interactivity and ability to accommodate the medicinal chemists' intuition by allowing the use of arbitrary substructures containing variable atoms, bonds, and/or substituents such as those employed in substructure search.

• Systematic analysis of public domain compound potency data identifies selective molecular scaffolds across druggable target families.
Hu, Ye and Wassermann, Anne Mai and Lounkine, Eugen and Bajorath, Jürgen
Journal of medicinal chemistry, 2010, 53(2), 752-758
PMID: 20000355     doi: 10.1021/jm9014229

Molecular scaffolds that yield target family-selective compounds are of high interest in pharmaceutical research. There continues to be considerable debate in the field as to whether chemotypes with a priori selectivity for given target families and/or targets exist and how they might be identified. What do currently available data tell us? We present a systematic and comprehensive selectivity-centric analysis of public domain target-ligand interactions. More than 200 molecular scaffolds are identified in currently available active compounds that are selective for established target families. A subset of these scaffolds is found to produce compounds with high selectivity for individual targets among closely related ones. These scaffolds are currently underrepresented in approved drugs.

• Similarity-potency trees: a method to search for SAR information in compound data sets and derive SAR rules.
Wawer, Mathias and Bajorath, Jürgen
Journal of chemical information and modeling, 2010, 50(8), 1395-1409
PMID: 20726598     doi: 10.1021/ci100197b

An intuitive and generally applicable analysis method, termed similarity-potency tree (SPT), is introduced to mine structure-activity relationship (SAR) information in compound data sets of any source. Only compound potency values and nearest-neighbor similarity relationships are considered. Rather than analyzing a data set as a whole, in part overlapping compound neighborhoods are systematically generated and represented as SPTs. This local analysis scheme simplifies the evaluation of SAR information and SPTs of high SAR information content are easily identified. By inspecting only a limited number of compound neighborhoods, it is also straightforward to determine whether data sets contain only little or no interpretable SAR information. Interactive analysis of SPTs is facilitated by reading the trees in two directions, which makes it possible to extract SAR rules, if available, in a consistent manner. The simplicity and interpretability of the data structure and the ease of calculation are characteristic features of this approach. We apply the methodology to high-throughput screening and lead optimization data sets, compare the approach to standard clustering techniques, illustrate how SAR rules are derived, and provide some practical guidance how to best utilize the methodology. The SPT program is made freely available to the scientific community.

• Activity Landscape Representations for Structure−Activity Relationship Analysis
Wassermann, Anne Mai and Wawer, Mathias and Bajorath, Jürgen
Journal of medicinal chemistry, 2010, 53(23), 8209-8223
PMID: 20845971     doi: 10.1021/jm100933w

• Computational characterization of SAR microenvironments in high-throughput screening data
Wawer, M and Sun, S and Bajorath, J
International Journal of High Throughput Screening, 2010, 15
doi: 10.2147/IJHTS.S7534

Purpose: A computational approach is described to analyze structure-activity relationship (SAR) information contained in compound and screening data sets. The methodology is designed to explore SAR information in a systematic and compound-centric manner in ...

## 2009

• Chemical biology: Branching out into chemical space
Harrison, Charlotte
Nature reviews. Drug discovery, 2009, 8(8), 615-615

• Staring off into chemical space
Irwin, John J
Nature chemical biology, 2009, 5(8), 536-537
doi: 10.1038/nchembio0809-536

New software to browse chemical space, with structures organized by rings, will enable chemical insight.

• Interactive exploration of chemical space with Scaffold Hunter
Wetzel, Stefan and Klein, Karsten and Renner, Steffen and Rauh, Daniel and Oprea, Tudor I and Mutzel, Petra and Waldmann, Herbert
Nature chemical biology, 2009, 5(8), 581-583
PMID: 19561620     doi: 10.1038/nchembio.187

We describe Scaffold Hunter, a highly interactive computer-based tool for navigation in chemical space that fosters intuitive recognition of complex structural relationships associated with bioactivity. The program reads compound structures and ...

• Bioactivity-guided mapping and navigation of chemical space.
Renner, Steffen and van Otterlo, Willem A L and Dominguez Seoane, Marta and Möcklinghoff, Sabine and Hofmann, Bettina and Wetzel, Stefan and Schuffenhauer, Ansgar and Ertl, Peter and Oprea, Tudor I and Steinhilber, Dieter and Brunsveld, Luc and Rauh, Daniel and Waldmann, Herbert
Nature chemical biology, 2009, 5(8), 585-592
PMID: 19561619     doi: 10.1038/nchembio.188

The structure- and chemistry-based hierarchical organization of library scaffolds in tree-like arrangements provides a valid, intuitive means to map and navigate chemical space. We demonstrate that scaffold trees built using bioactivity as the key selection criterion for structural simplification during tree construction allow efficient and intuitive mapping, visualization and navigation of the chemical space defined by a given library, which in turn allows correlation of this chemical space with the investigated bioactivity and further compound design. Brachiation along the branches of such trees from structurally complex to simple scaffolds with retained yet varying bioactivity is feasible at high frequency for the five major pharmaceutically relevant target classes and allows for the identification of new inhibitor types for a given target. We provide proof of principle by identifying new active scaffolds for 5-lipoxygenase and the estrogen receptor ERalpha.

• Detection and assignment of common scaffolds in project databases of lead molecules.
Clark, Alex M and Labute, Paul
Journal of medicinal chemistry, 2009, 52(2), 469-483
PMID: 19093885     doi: 10.1021/jm801098a

A method is presented for the detection and analysis of multiple common scaffolds for small collections of pharmaceutically relevant molecules that share a set of common structural motifs. The input consists of the molecules themselves, possibly some of the scaffolds, and possibly information about the relation between the substitution points of these scaffolds. Three new algorithms are presented: multiple scaffold detection, common scaffold alignment, and scaffold substructure assignment. Each of these steps is relevant for cases when either none, some, or all information about the common scaffolds and their substitution patterns is available. Each of these problems must be solved in an optimal way in order to produce useful structure-activity correlations. The output consists of a collection of scaffolds, a common numbering system, and a unique mapping of each molecule to a single scaffold substructure. This information can then be used to produce data for structure-activity analysis of medicinal chemistry project databases.

• Elucidation of structure-activity relationship pathways in biological screening data.
Wawer, Mathias and Peltason, Lisa and Bajorath, Jürgen
Journal of medicinal chemistry, 2009, 52(4), 1075-1080
PMID: 19140668     doi: 10.1021/jm8014102

A computational molecular network analysis of various high-throughput screening (HTS) data sets including inhibition assays and cell-based screens organizes screening hits according to different local structure-activity relationships (SARs). The resulting network representations make it possible to focus on different local SAR environments in screening data. We have designed a simple scoring function accounting for similarity and potency relationships among hits that identifies SAR pathways leading from active compounds in different SAR contexts to key compounds forming activity cliffs. From these pathways, SAR information can be extracted and utilized to select hits for further analysis. In clusters of hits related by different local SARs, alternative pathways can be systematically explored and ranked according to SAR information content, which makes it possible to prioritize hits in a consistent manner.

• Navigating structure-activity landscapes.
Bajorath, Jürgen and Peltason, Lisa and Wawer, Mathias and Guha, Rajarshi and Lajiness, Michael S and Van Drie, John H
Drug discovery today, 2009, 14(13-14), 698-705
PMID: 19410012     doi: 10.1016/j.drudis.2009.04.003

The problem of how to explore structure-activity relationships (SARs) systematically is still largely unsolved in medicinal chemistry. Recently, data analysis tools have been introduced to navigate activity landscapes and to assess SARs on a large scale. Initial investigations reveal a surprising heterogeneity among SARs and shed light on the relationship between 'global' and 'local' SAR features. Moreover, insights are provided into the fundamental issue of why modeling tools work well in some cases, but not in others.

• Systematic extraction of structure-activity relationship information from biological screening data.
Wawer, Mathias and Bajorath, Jürgen
ChemMedChem, 2009, 4(9), 1431-1438
PMID: 19621333     doi: 10.1002/cmdc.200900222

A data mining approach is introduced that automatically extracts SAR information from high-throughput screening data sets and that helps to select active compounds for chemical exploration and hit-to-lead projects. SAR pathways are systematically identified consisting of sequences of similar active compounds with gradual increases in potency. Fully enumerated SAR pathway sets are subjected to pathway scoring, filtering, and mining, and pathways with the most significant SAR information content are prioritized. High-scoring SAR pathways often reveal activity cliffs contained in screening data. Subsets of SAR pathways are analyzed in SAR trees that make it possible to identify microenvironments of significant SAR discontinuity from which hits are preferentially selected. SAR trees of alternative pathways leading to activity cliffs identify key compounds and help to develop chemically intuitive SAR hypotheses.
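
The pathway idea in this abstract (sequences of similar active compounds with gradually increasing potency) can be sketched as a walk over a directed graph. This is a minimal sketch under stated assumptions, not the authors' procedure: the similarity cutoff, the minimum path length, the precomputed similarity matrix, and the name `sar_pathways` are all hypothetical, and the scoring, filtering, and mining steps the abstract mentions are not reproduced.

```python
def sar_pathways(sim, potency, sim_cutoff=0.6, min_len=3):
    """Enumerate maximal paths of similar compounds with strictly rising potency.

    sim        -- square matrix (list of lists) of pairwise similarities
    potency    -- one potency value per compound (higher = more potent)
    sim_cutoff -- hypothetical threshold for calling two compounds "similar"
    min_len    -- hypothetical minimum number of compounds in a reported path
    """
    n = len(potency)
    # Edge i -> j when j is similar to i and more potent than i.
    # Strictly increasing potency guarantees the graph has no cycles.
    succ = {
        i: [j for j in range(n)
            if j != i and sim[i][j] >= sim_cutoff and potency[j] > potency[i]]
        for i in range(n)
    }
    paths = []

    def extend(path):
        followers = succ[path[-1]]
        if not followers:
            # Dead end: the path cannot be extended any further.
            if len(path) >= min_len:
                paths.append(path)
            return
        for j in followers:
            extend(path + [j])

    for i in range(n):
        # Start only at compounds with no predecessor so each path is maximal.
        if not any(i in succ[k] for k in range(n)):
            extend([i])
    return paths


# Toy data: compounds 0, 1, 2 form a chain; compound 3 is dissimilar to all.
toy_sim = [
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.7, 0.1],
    [0.3, 0.7, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
toy_potency = [5.0, 6.0, 7.0, 9.0]
toy_paths = sar_pathways(toy_sim, toy_potency)
```

In this sketch the last compound of a path is the most potent one reached, which is where an activity cliff would show up if the final potency step were large.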

• Systematic computational analysis of structure-activity relationships: concepts, challenges and recent advances
Peltason, Lisa and Bajorath, Jürgen
Future medicinal chemistry, 2009, 1(3), 451-466
doi: 10.4155/fmc.09.41

The exploration of structure-activity relationships (SARs) of small molecules is a central aspect of medicinal chemistry. Typically, SARs are analyzed on a one-by-one basis, and chemical intuition and experience play an important role in this process. Since the 1960s, computational approaches have been developed to aid in SAR exploration that largely, but not exclusively, rely on the quantitative (Q)SAR paradigm. Accordingly, QSAR analysis has long been a mainstay of compound optimization efforts. However, the strong compound class dependence of SAR features and their intrinsic heterogeneity often pose severe constraints on the applicability of these methods. In addition to QSAR approaches, conceptually different molecular similarity methods are also applied to identify novel active compounds. In order to complement and further extend the current repertoire of computational methods, SAR analysis functions have recently been introduced that evaluate and compare SAR features on a large scale, extract SAR in...

## 2008

• Structure-activity landscape index: identifying and quantifying activity cliffs.
Guha, Rajarshi and Van Drie, John H
Journal of chemical information and modeling, 2008, 48(3), 646-658
PMID: 18303878     doi: 10.1021/ci7004093

A new method for analyzing a structure-activity relationship is proposed. By use of a simple quantitative index, one can readily identify "structure-activity cliffs": pairs of molecules which are most similar but have the largest change in activity. We show how this provides a graphical representation of the entire SAR, in a way that allows the salient features of the SAR to be quickly grasped. In addition, the approach allows us to view the SARs in a data set at different levels of detail. The method is tested on two data sets that highlight its ability to easily extract SAR information. Finally, we demonstrate that this method is robust using a variety of computational control experiments and discuss possible applications of this technique to QSAR model evaluation.
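
As a rough illustration of such an index: the SALI value for a pair of molecules is commonly written as |A_i - A_j| / (1 - sim(i, j)), so that similar pairs with a large activity gap score highest. The abstract does not give the formula, so this form, the fingerprints represented as sets of on-bits, the `eps` guard for identical fingerprints, and all function names are assumptions of this sketch, not the authors' implementation.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def sali(act_i, act_j, similarity, eps=1e-6):
    """Activity difference scaled by structural dissimilarity.

    eps is a hypothetical guard that avoids division by zero when two
    molecules have identical fingerprints (similarity == 1).
    """
    return abs(act_i - act_j) / max(1.0 - similarity, eps)


def rank_cliffs(fingerprints, activities):
    """Return (sali, i, j) for every pair, largest SALI first."""
    scored = []
    for i, j in combinations(range(len(activities)), 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        scored.append((sali(activities[i], activities[j], sim), i, j))
    return sorted(scored, reverse=True)


# Toy data: molecules 0 and 1 are similar but differ strongly in activity.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
toy_acts = [5.0, 8.0, 5.2]
ranked = rank_cliffs(toy_fps, toy_acts)
```

Here the pair (0, 1) ranks first: it combines a Tanimoto similarity of 0.6 with a three-unit activity gap, whereas the other pairs share no bits at all.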

• Assessing how well a modeling protocol captures a structure-activity landscape.
Guha, Rajarshi and Van Drie, John H
Journal of chemical information and modeling, 2008, 48(8), 1716-1728
PMID: 18686944     doi: 10.1021/ci8001414

We introduce the notion of structure-activity landscape index (SALI) curves as a way to assess a model and a modeling protocol, applied to structure-activity relationships. We start from our earlier work [J. Chem. Inf. Model., 2008, 48, 646-658], where we show how to study a structure-activity relationship pairwise, based on the notion of "activity cliffs": pairs of molecules that are structurally similar but have large differences in activity. There, we also introduced the SALI parameter, which allows one to identify cliffs easily, and which allows one to represent a structure-activity relationship as a graph. This graph orders every pair of molecules by their activity. Here, we introduce the new idea of a SALI curve, which tallies how many of these orderings a model is able to predict. Empirically, testing these SALI curves against a variety of models, ranging over two-dimensional quantitative structure-activity relationship (2D-QSAR), three-dimensional quantitative structure-activity relationship (3D-QSAR), and structure-based design models, the utility of a model seems to correspond to characteristics of these curves. In particular, the integral of these curves, denoted as SCI and being a number ranging from -1.0 to 1.0, approaches a value of 1.0 for two literature models, which are both known to be prospectively useful.
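
The tallying described here can be sketched as follows. This is an illustrative sketch only: it assumes that the curve value at a SALI threshold is (correctly ordered pairs minus incorrectly ordered pairs) divided by the number of pairs at or above that threshold, which is one way to obtain a score between -1.0 and 1.0, and that the integral is normalized by the threshold range. The function names and the toy numbers are hypothetical.

```python
def sali_curve(pairs, observed, predicted, thresholds):
    """Score how well predictions reproduce observed pairwise orderings.

    pairs      -- list of (sali_value, i, j) tuples
    observed   -- measured activities, indexed by molecule
    predicted  -- model predictions, indexed by molecule
    thresholds -- SALI cutoffs; only pairs at or above a cutoff are counted
    """
    curve = []
    for cutoff in thresholds:
        score, n_edges = 0, 0
        for value, i, j in pairs:
            if value < cutoff or observed[i] == observed[j]:
                continue  # pair below the cutoff, or no ordering to predict
            n_edges += 1
            agreement = (observed[i] - observed[j]) * (predicted[i] - predicted[j])
            if agreement > 0:
                score += 1   # model orders the pair correctly
            elif agreement < 0:
                score -= 1   # model reverses the pair
        curve.append(score / n_edges if n_edges else 0.0)
    return curve


def normalized_integral(xs, ys):
    """Trapezoidal area under the curve, divided by the x-range."""
    area = sum((xs[k + 1] - xs[k]) * (ys[k + 1] + ys[k]) / 2
               for k in range(len(xs) - 1))
    return area / (xs[-1] - xs[0])


# Toy data: three molecules, three scored pairs, one well-ordered model.
toy_pairs = [(7.5, 0, 1), (2.8, 1, 2), (0.2, 0, 2)]
toy_observed = [5.0, 8.0, 5.2]
toy_predicted = [5.1, 7.0, 5.5]
toy_thresholds = [0.0, 1.0, 3.0]
toy_curve = sali_curve(toy_pairs, toy_observed, toy_predicted, toy_thresholds)
toy_sci = normalized_integral(toy_thresholds, toy_curve)
```

A model that ranks every pair correctly gives a flat curve at 1.0, and one that reverses every pair gives -1.0, which matches the range quoted for SCI.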

## 2007

• Exploring the other side of biologically relevant chemical space: insights into carboxylic, sulfonic and phosphonic acid bioisosteric relationships.
Macchiarulo, Antonio and Pellicciari, Roberto
Journal of molecular graphics & modelling, 2007, 26(4), 728-739
PMID: 17544772     doi: 10.1016/j.jmgm.2007.04.010

Bioisosteric replacements have been widely and successfully applied to develop bioisosteric series of biologically active compounds in medicinal chemistry. In this work, the concept of bioisosterism is revisited using a novel approach based on charting the "other side" of biologically relevant chemical space. This space is composed of the ensemble of binding sites of protein structures. Explorations into the "other side" of biologically relevant chemical space are exploited to gain insight into the principles that rule molecular recognition and bioisosteric relationships of molecular fragments. We focused, in particular, on the construction of the "other side" of chemical space covered by binding sites of small molecules containing carboxylic, sulfonic, and phosphonic acidic groups. The analysis of differences in the occupation of that space by distinct types of binding sites unveils how evolution has worked in assessing principles that rule the selectivity of molecular recognition, and improves our knowledge on the molecular basis of bioisosteric relationships among carboxylic, sulfonic, and phosphonic acidic groups.

• 2D depiction of protein-ligand complexes.
Clark, Alex M and Labute, Paul
Journal of chemical information and modeling, 2007, 47(5), 1933-1944
PMID: 17715911     doi: 10.1021/ci7001473

A method is presented for the automated preparation of schematic diagrams for protein-ligand complexes, in which the ligand is displayed in conventional 2D form, and the interactions to and between the residues in its vicinity are summarized in a concise and information-rich manner. The structural entities are arranged to maximize aesthetic ideals and to properly convey important distance relationships. The diagram is annotated with calculated hydrogen bonds, a substitution contour, solvent exposure, chelated metals, covalently bound linkages, pi-pi and pi-cation interactions, and, for series of complexes, conserved residues and interactions. Residues, cofactors, ions, and solvent components are drawn in cartoon form as adjuncts to the ligand. The method can be applied to aligned sets which contain multiple ligands, or multiple members of a protein family, in which case the ligand orientations and protein residue placement will show consistent trends throughout the series.

• A similarity-based data-fusion approach to the visual characterization and comparison of compound databases.
Medina-Franco, José L and Maggiora, Gerald M and Giulianotti, Marc A and Pinilla, Clemencia and Houghten, Richard A
Chemical biology & drug design, 2007, 70(5), 393-412
PMID: 17927720     doi: 10.1111/j.1747-0285.2007.00579.x

A low-dimensional method, based on the use of multiple fusion-based similarity measures, is described for graphically depicting and characterizing relationships among molecules in compound databases. The measures are used to construct multi-fusion similarity maps that characterize the relationship of a set of 'test' molecules to a set of 'reference' molecules. The reference set is very general and can be made of molecules from, for example, the set of test molecules itself (the self-referencing case), from a small library or large compound collection, or from actives in a given assay or group of assays. The test set is any collection of compounds to be analyzed with respect to the specified reference set. Multiple fusion similarity measures tend to provide more information than single fusion-based measures, including information on the nature of the chemical-space neighborhoods surrounding reference-set molecules. A general discussion is presented on how to interpret multi-fusion similarity maps, and several examples are given that illustrate how these maps can be used to compare compound libraries or collections, to select compounds for screening or acquisition, and to identify new active molecules using ligand-based virtual screening.
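
One way to picture such a map is to give each test molecule two coordinates obtained by fusing its similarities to all reference molecules in two different ways. The sketch below assumes mean fusion and max fusion as the two measures, and Tanimoto similarity on fingerprints stored as sets of on-bits. The abstract does not name the fusion rules or the similarity measure, so these choices and the function names are assumptions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def fusion_coordinates(test_fps, reference_fps):
    """Return one (mean_similarity, max_similarity) point per test molecule.

    In the self-referencing case a molecule would normally be excluded from
    its own reference set; that bookkeeping is left out of this sketch.
    """
    points = []
    for fp in test_fps:
        sims = [tanimoto(fp, ref) for ref in reference_fps]
        points.append((sum(sims) / len(sims), max(sims)))
    return points


# Toy data: the first test molecule duplicates a reference molecule,
# the second shares no bits with any reference molecule.
toy_reference = [{1, 2, 3}, {1, 2, 4}]
toy_test = [{1, 2, 3}, {9}]
toy_points = fusion_coordinates(toy_test, toy_reference)
```

Plotting the points with mean similarity on one axis and maximum similarity on the other separates molecules that resemble the reference set as a whole from those that have only a single close neighbor in it.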

## 2006

• Molecular complexes at a glance: automated generation of two-dimensional complex diagrams.
Stierand, Katrin and Maass, Patrick C and Rarey, Matthias
Bioinformatics (Oxford, England), 2006, 22(14), 1710-1716
PMID: 16632493     doi: 10.1093/bioinformatics/btl150

MOTIVATION:In this paper a new algorithmic approach is presented, which automatically generates structure diagrams of molecular complexes. A complex diagram contains the ligand, the amino acids of the protein interacting with the ligand and the hydrophilic interactions schematized as dashed lines between the corresponding atoms. The algorithm is based on a combinatorial optimization strategy which solves parts of the layout problem non-heuristically. The depicted molecules are represented as structure diagrams according to the chemical nomenclature. Due to the frequent usage of complex diagrams in the scientific literature as well as in text books dealing with structural biology, biochemistry and medicinal chemistry, the new algorithm is a key element for computer applications in these areas.

## 2005

• Charting biologically relevant chemical space: a structural classification of natural products (SCONP).
Koch, Marcus A and Schuffenhauer, Ansgar and Scheck, Michael and Wetzel, Stefan and Casaulta, Marco and Odermatt, Alex and Ertl, Peter and Waldmann, Herbert
Proceedings of the National Academy of Sciences of the United States of America, 2005, 102(48), 17272-17277
PMID: 16301544     doi: 10.1073/pnas.0503647102

The identification of small molecules that fall within the biologically relevant subfraction of vast chemical space is of utmost importance to chemical biology and medicinal chemistry research. The prerequirement of biological relevance to be met by such molecules is fulfilled by natural product-derived compound collections. We report a structural classification of natural products (SCONP) as organizing principle for charting the known chemical space explored by nature. SCONP arranges the scaffolds of the natural products in a tree-like fashion and provides a viable analysis- and hypothesis-generating tool for the design of natural product-derived compound collections. The validity of the approach is demonstrated in the development of a previously undescribed class of selective and potent inhibitors of 11beta-hydroxysteroid dehydrogenase type 1 with activity in cells guided by SCONP and protein structure similarity clustering. 11beta-hydroxysteroid dehydrogenase type 1 is a target in the development of new therapies for the treatment of diabetes, the metabolic syndrome, and obesity.

## 2003

• A Chemical Class-Based Approach to Predictive Model Generation
Miller, D W
Journal of chemical information and modeling, 2003, 43(2), 568-578
doi: 10.1021/ci025606g

We make a quantitative comparison of two distinct approaches to predictive model generation in the context of diverse screening data. In the default approach, a single recursive partitioning model is constructed using all of the training data at one time. In the "class-based" approach, the same data are first partitioned into homogeneous, scaffold-based classes, and models are constructed within each class independently. Both approaches are tested on the identical set of hold-out data, using a formal protocol that includes consensus scoring to handle the multiple class-based models. The entire process is performed using three different descriptor sets and is repeated using five separate random trials, such that the trial-averaged prediction rates for the two approaches can be quantitatively compared. We find that although the predictive performances of the class-based and default approaches are similar, the former has at least two distinct advantages. The first is greater interpretability, in that chemist...