Machine Readable Descriptions of Chemical Structure

In order to undertake virtual screening, it is first necessary to convert chemical structure into an easy-to-interpret, machine-readable format. While a number of methods have been proposed for two-dimensional (2D) structure depiction of chemical entities,10 connection tables are the most important representation to emerge. Perhaps the most widely applied connection table format is the structure data (SD) file, which was designed to permit the movement of large numbers of molecules and their associated data between databases. Chemical structures are stored in a connection table that holds x and y atom coordinates based on bond lengths (z coordinates can be added when three-dimensional (3D) data are to be stored) together with the associated atom type, chirality, and bond connection data.11 A connection table is in essence a graph containing the complete and explicit description of molecular topology, and it forms an easily analyzable repository of 2D chemical data for VS. Graph theory forms the mathematical model at the core of topology description,12-15 and many of the key concepts of molecular graph theory are highlighted in Figure 1.
Although SD files and other extended connection table formats (e.g., mol2 files16) provide a perfectly usable means of transporting structure data, their inflexible format requirements and somewhat inefficient storage needs led to efforts to devise other methods for chemical structure interpretation. The most widely applied of these is the Simplified Molecular Input Line Entry System (SMILES).17,18 SMILES is a line notation (a typographical method using printable characters) for entering and representing molecules. While a SMILES string contains the same information as an extended connection table, it is in essence a chemical language with a vocabulary (atom and bond symbols) and grammatical rules (e.g., for substitution pattern recognition).
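As a rough illustration of the "vocabulary" idea, the Python sketch below splits a SMILES string into atom, bond, branch, and ring-closure symbols. It is a minimal sketch under stated assumptions, not a full SMILES parser: it recognizes only the organic-subset atoms, treats anything in square brackets as one opaque atom symbol, and does not check the grammar (for example, that ring closures are paired). The `tokenize` function, the token category names, and the melatonin SMILES string are introduced here for illustration and are not taken from Figure 2.

```python
import re

# Simplified SMILES vocabulary (an assumption of this sketch, not the full grammar).
TOKEN_RE = re.compile(
    r"(?P<bracket>\[[^\]]+\])"               # bracket atom, e.g. [nH]
    r"|(?P<atom>Cl|Br|[BCNOPSFI]|[bcnops])"  # organic-subset atoms, aliphatic or aromatic
    r"|(?P<bond>[-=#:/\\])"                  # explicit bond symbols
    r"|(?P<branch>[()])"                     # branch open / close
    r"|(?P<ring>%\d{2}|\d)"                  # ring-closure labels
    r"|(?P<dot>\.)"                          # disconnection between fragments
)


def tokenize(smiles):
    """Return a list of (category, text) pairs for a SMILES string."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN_RE.match(smiles, pos)
        if match is None:
            raise ValueError(
                f"unrecognized symbol {smiles[pos]!r} at position {pos}"
            )
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


# One valid SMILES string for melatonin (written for this example).
MELATONIN_SMILES = "CC(=O)NCCc1c[nH]c2ccc(OC)cc12"
TOKENS = tokenize(MELATONIN_SMILES)

# Atom symbols are the "nouns" of the language; everything else is punctuation
# that encodes bonding, branching, and ring closure.
ATOM_TOKEN_COUNT = sum(1 for kind, _ in TOKENS if kind in ("atom", "bracket"))

if __name__ == "__main__":
    for kind, text in TOKENS:
        print(f"{kind:8s} {text}")
    print("atom symbols:", ATOM_TOKEN_COUNT)
```

Under these assumptions, the string for melatonin contains 17 atom symbols, one for each non-hydrogen atom; lowercase letters mark aromatic atoms, and the digits `1` and `2` label the two ring closures of the indole system.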
SMILES representations of structure can in turn be used as 'words' in the vocabulary of other languages designed for chemical storage. A typical SMILES string takes 50% to 70% less space than an equivalent connection table, even one in binary format, and typically needs under two bytes per atom. Other chemical languages include Sybyl Line Notation (SLN),19 which was designed as an extension of SMILES capable of substructure searching and Markush20 structure specification. Figure 2 shows an example of melatonin converted to a number of machine-readable connectivity formats.
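To make the connection-table view of a molecule concrete, the following Python sketch stores the non-hydrogen atoms and bonds of melatonin as plain lists and treats them as a graph. It is only a sketch under stated assumptions: the `ATOMS` and `BONDS` layout, the bond-order labels (`"1"`, `"2"`, `"ar"`), and the helper functions are hypothetical and do not follow the SD or mol2 file layouts; coordinates, chirality, and hydrogens are omitted; and the atom numbering simply follows the order of the SMILES string written for this example.

```python
from collections import deque

# Non-hydrogen atoms of melatonin; the index is the atom number.
ATOMS = ["C", "C", "O", "N", "C", "C", "C", "C", "N",
         "C", "C", "C", "C", "O", "C", "C", "C"]

# Bonds as (atom_i, atom_j, order); "ar" marks an aromatic bond.
BONDS = [
    (0, 1, "1"), (1, 2, "2"), (1, 3, "1"),       # acetyl group and amide N
    (3, 4, "1"), (4, 5, "1"), (5, 6, "1"),       # ethyl linker to the indole
    (6, 7, "ar"), (7, 8, "ar"), (8, 9, "ar"),    # five-membered ring
    (9, 10, "ar"), (10, 11, "ar"), (11, 12, "ar"),
    (12, 15, "ar"), (15, 16, "ar"),              # six-membered ring
    (16, 6, "ar"), (16, 9, "ar"),                # ring closures
    (12, 13, "1"), (13, 14, "1"),                # methoxy group
]


def adjacency(n_atoms, bonds):
    """Build a neighbor list (the graph) from the bond records."""
    neighbors = [[] for _ in range(n_atoms)]
    for i, j, _order in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return neighbors


def topological_distances(neighbors, start):
    """Breadth-first search: number of bonds from `start` to every atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nxt in neighbors[atom]:
            if nxt not in dist:
                dist[nxt] = dist[atom] + 1
                queue.append(nxt)
    return dist


NEIGHBORS = adjacency(len(ATOMS), BONDS)
DEGREES = [len(n) for n in NEIGHBORS]          # heavy-atom connectivity
RING_COUNT = len(BONDS) - len(ATOMS) + 1       # valid for one connected graph
DISTANCES_FROM_0 = topological_distances(NEIGHBORS, 0)

# Crude size comparison: records in this table versus characters in a SMILES.
MELATONIN_SMILES = "CC(=O)NCCc1c[nH]c2ccc(OC)cc12"
RECORD_COUNT = len(ATOMS) + len(BONDS)
SMILES_LENGTH = len(MELATONIN_SMILES)
```

Even in this stripped-down form the table needs one record per atom and one per bond, whereas the line notation encodes the same connectivity in a single short string, which gives some intuition for the storage difference discussed above.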
