Representations of Molecules

Chemists communicate about chemical structures using structure diagrams. While many software packages exist for entering structures so that they can be viewed on the screen and subsequently imported into documents, the images created by these drawing packages are not suitable for database applications. Various systems for chemical nomenclature have been developed that allow structures to be stored as text; however, the naming conventions tend to be rather complex and are therefore difficult to apply. Figure 1a shows the structure of aspirin and Figure 1b illustrates a systematic name for aspirin. Line notations have also been developed to provide a concise representation of structures based on alphanumeric characters and can be used to communicate structures, to enter structures into databases or structure drawing packages, or to transfer compounds between systems. The most well known line representation is SMILES (simplified molecular input line entry specification).9'10 A SMILES representation of aspirin is shown in Figure 1c. Although line notations are widely used they must be converted into chemically aware computer-readable representations to allow structural searches to be carried out.

The specialized nature of chemical structural data means that specialized software is required in order to be able to store and search for it within databases. Such computer-based methods for processing databases of structures were first developed in the 1960s. Nowadays, a typical corporate database could consist of 106 compounds and the largest publicly available database contains over 26 million small organic and inorganic compounds.11 Current systems are able to search for structures and data in these databases and have the results returned within seconds. These databases contain compounds that are real in the sense that they have already been synthesized. A more recent development in chemoinformatics involves the creation of databases containing 'virtual' molecules. These are compounds that do not exist yet but which could be synthesized readily, often using combinatorial chemistry techniques. The sizes of these virtual collections can run to billions of structures.

3.12.2.1 Structure and Substructure Searching

The simplest type of search that a chemist may wish to carry out is to find information about a particular compound. For example, has the compound been synthesized already, either within the company or elsewhere, what quantities (if any) are available within the company stores, and what properties have been measured for it. This type of query is known as an exact search. Substructure searching, on the other hand, provides a more general type of search in which a chemist may search for related compounds of interest. It is used to find all compounds within a database that contain a given substructure, for example, all compounds that contain a penicillin ring system.

Most database search systems encode chemical structures as molecular graphs and make use of graph theory to provide tools for searching the databases. A graph is a mathematical object consisting of nodes that are connected by edges. In a molecular graph, the nodes correspond to the atoms and the edges correspond to the bonds. The graphs are typically labeled: the nodes are labeled by atom type, for example, carbon, oxygen, nitrogen etc.; and the edges are

Acetyl salicylic acid

-ISIS- 09270222202D

The first atom is a carbon

-ISIS- 09270222202D

13

0 0

0

0 0

0

0 0999

00

-3.

4639

-1.

5375

0.

0000

C

0

0

0

0

0

0

0

0

0

0

0

-3.

4651

-2.

3648

0.

0000

C

0

0

0

0

0

0

0

0

0

0

0

0

-2.

7503

-2.

7777

0.

0000

C

0

0

0

0

0

0

0

0

0

0

0

0

-2.

0338

-2.

3644

0.

0000

C

0

0

0

0

0

0

0

0

0

0

0

0

-2.

0367

-1.

5338

0.

0000

C

0

0

0

0

0

0

0

0

0

0

0

0

-2.

7521

-1.

1247

0.

0000

C

0

0

0

0

0

0

0

0

0

0

0

0

-2.

7545

-0.

2997

0.

0000

C

0

0

0

0

0

0

0

0

0

0

0

0

-2.

0413

0.

. 1149

0.

0000

O

0

0

0

0

0

0

0

0

0

0

0

0

-3.

4702

0.

1107

0.

0000

O

0

0

0

0

0

0

0

0

0

0

0

0

-1.

3238

-1.

1186

0.

0000

O

0

0

0

0

0

0

0

0

0

0

0

0

-0.

6125

-1.

5292

0.

0000

C

0

0

0

0

0

0

0

0

0

0

0

0

-0.

6167

-2.

3542

0.

0000

O

0

0

0

0

0

0

0

0

0

0

0

0

0.

1000

-1.

. 1125

0.

0000

C

0

0

0

0

0

0

0

0

0

0

0

0

1

2

2

0

0

0

0

6

7

1

0

0

0

0

3

4

2

0

0

0

0

7

8

1

0

0

0

0

7

9

2

0

0

0

0

4

5

1

0

0

0

0

5

10

1

0

0

0

0

2

3

1

0

0

0

0

10

11

1

0

0

0

0

5

6

2

0

0

0

0

11

12

2

0

0

0

0

6

1

1

0

0

0

0

11

13

1

0

0

0

The first bond is between atoms 1 and 2 and has order 2

M END

Figure 1 Different representations of aspirin: (a) structure diagram; (b) systematic name; (c) SMILES notation; (d) graph representation; and (e) connection table in MDL format.

labeled by bond type, for example, single, double, triple, or aromatic. The graph representation of aspirin is shown in Figure 1d.

A molecular graph is represented internally within a computer as a connection table that records the atoms present in the molecule, together with any associated properties, and the bonds between the atoms. Connection tables are usually hydrogen suppressed, which means that the hydrogen atoms are omitted; they can, however, be easily inferred from chemical valency rules. They can also be redundant, whereby each bond is listed twice, for example, the bond between atoms a and b is associated with both atom a and again with atom b, or nonredundant in which each bond is given once only. Connection tables can include x, y coordinates of the atoms, which enable the graph to be drawn on a page or screen as the familiar chemical structure diagram. Many different connection table formats exist. Figure 1e illustrates a connection table representation for aspirin in MDL format.12

Exact structure searching is made complex by the fact that there are many different ways in which the atoms in a connection table can be ordered. For a structure consisting of N atoms, there are N different numberings that are possible; therefore, comparing two connection tables to see if they represent the same structure is a nontrivial task. However, an algorithm to generate a unique numbering of a structure, also known as a canonical representation, was described by Morgan as long ago as 1965.13 The algorithm is based on calculating the connectivities of atoms in an iterative fashion in an attempt to discriminate the atoms as far as possible. In the first iteration, the connectivity of an atom is the sum of its immediate neighbors. In subsequent iterations, a new connectivity value is calculated for each atom by summing the connectivities of its neighbors. This process continues until the number of distinct connectivity values has reached a maximum. The atom with maximum connectivity is then listed first in the connection table. Its neighbors are listed next in order of decreasing connectivity, then their neighbors are listed, and so on. Various rules have been devised to resolve ties when two atoms have the same connectivity values and subsequent algorithms have been developed to deal with stereochemistry.14

If all of the structures in a database are stored in their canonical representations, then an exact structure search can be carried out by converting the query structure to its canonical form and comparing its connection table with those in the database. Hashing algorithms have also been developed that allow structures to be retrieved directly. A hash key is a number that is calculated from a canonical connection table and is used to determine the physical location on disk where a structure is stored. For example, the Augmented Connectivity Molecular Formula used in the Chemical Abstracts Service Registry System15 is based on a procedure similar in concept to the Morgan algorithm and is used to calculate a hash key. Ideally, each structure will produce a unique hash key, however, there is always a chance that two different structures will produce the same key. Thus, it is necessary to compare the connection table representations for the structures mapped to the same hash key in order to resolve such clashes.

Substructure searching, which aims to identify all structures within a database that contain a structural fragment embedded within a larger structure, is a more complex problem. In graph theoretic terms it is equivalent to searching for a subgraph isomorphism between two graphs. While efficient subgraph isomorphism algorithms do exist, they are not sufficiently fast to allow searching through the large databases that typically exist today. Subgraph isomorphism belongs to a class of problems that are known as NP-complete, which means the time required to find a solution varies exponentially with the size of the problem (in this case, with the number of nodes in the graphs). Thus, substructure search is implemented as a two-stage process with a fast screening stage used to eliminate structures that cannot possibly match the query so that the more expensive graph matching algorithm need only be applied to a subset of the database.

Screening systems are based on fragment bitstrings, which are represented as binary vectors. In a dictionary-based screening system, each position in the bitstring corresponds to a particular fragment, such as a carbonyl group or an amino group. For a given molecule, a bit is set to on or '1' if the molecule contains the fragment it represents, otherwise it is set to off or '0' (see Figure 2). The screening stage of substructure search involves generating a bitstring for the query substructure and comparing it with bitstring representations of all the molecules in the database. If a database structure contains a '0' at a position set to '1' in the query this indicates that a fragment present in the query is absent from the database structure, which, therefore, cannot possibly match the query. The database structure can then be eliminated from further consideration. Only those structures that pass the screening stage proceed to the graph matching stage. Ideally, a well-designed screening system should screen out as much as 95% of the database, although this obviously depends on the level of specificity of the query; for example, for a benzene ring as query the screen out is likely to be much lower since a large proportion of the structures in the database will contain it.

ITTTT

Figure 2 Fragment bitstring.

Key as: acyclic single bond cs: cyclic single bond cd: cyclic double bond *: any atom

(c) Bond sequence (d) Ring composition

Figure 3 Examples of substructural fragment types included in dictionary-based screening systems.

Examples of the different types of substructural fragments that are included in screening systems are illustrated in Figure 3.16'17 Typically, they include: augmented atoms, which consist of a central atom and its neighboring atoms; atom sequences, which consist of a linear path of atoms with or without the bonds differentiated; bond sequences in which the atom types are ignored; and various types of ring fragments such as ring sizes and bond patterns around rings. The most widely used dictionary-based systems are the MACCS and ISIS systems from MDL Information Systems12 and the BCI fingerprints from Barnard Chemical Information.18

To be effective at screen out, a fragment dictionary needs to be carefully designed since fragments that occur in most molecules are unlikely to be discriminating and, conversely, fragments that occur infrequently are unlikely to be useful since they will rarely occur in queries. The optimum dictionary is, therefore, data set dependent and requires a statistical analysis of the database to be performed.19 This presents difficulties when new compounds are added to the database, especially if they represent new or previously underrepresented chemistries.

Hashed fingerprints provide an alternative approach to generating fragment bitstrings. These methods involve the exhaustive enumeration of the fragments in a molecule and do not rely on a dictionary of fragments (they therefore avoid the problem of data set dependency). For example, the Daylight fingerprint is constructed by enumerating all linear paths up to a specified length (typically from 2 to 7 atoms).20 Each path is then hashed to generate a small number of bits (typically four or five), which are set to '1' in the fingerprint bitstring. Since the paths are mapped into a fixed length bitstring collisions can occur whereby a given bit is set by more than one path and, in contrast to the dictionary-based bitstrings, there is no longer a one-to-one mapping between bit position and fragment.

The second stage of substructure searching involves the application of a subgraph isomorphism algorithm to the database structures that pass the screening stage. The brute force approach to substructure search would involve investigating every possible way of mapping the query atoms onto the database structure atoms. For a query substructure consisting of Nq atoms and a database structure consisting of Nd atoms there are Nq!/Nd! — Nq! possible mappings, which is clearly computationally infeasible for all but the smallest of structures. More efficient methods have therefore been developed that make use of heuristics (i.e., rules) aimed at either improving the chances of finding a match early on or efficiently rejecting candidates that cannot give rise to a match.21 For example, back-tracking methods attempt to reject potential mappings as early as possible.22 The search begins by mapping a node in the query substructure to a node of the same atom type in the molecule graph. An attempt is then made to extend the mapping by matching neighboring nodes from the query onto corresponding neighboring nodes in the molecule graph. This process continues until either all the nodes have been successfully matched or until it proves impossible to find a match, at which point the algorithm returns (or back-tracks) to the last successful match and attempts an alternative mapping, by, for example, trying a different assignment of the neighboring nodes. This back-tracking procedure continues until a match of all query nodes has been found or until all possible mappings of the initial node have been tried, at which point the algorithm terminates with a mismatch recorded.

The Ullmann algorithm has been found to be particularly efficient for chemical graph matching. The algorithm is a combination of a back-tracking procedure and a refinement or relaxation step in which neighboring atoms are taken into account when considering possible mappings (i.e., an atom in the query is not permitted to match an atom in the molecule unless each of their neighboring atoms also match). A matching matrix is constructed between the query and database structure, which has one row for each query atom and a column for each database structure atom. The elements of the matching matrix take the value '1' if a match is possible between the corresponding atoms (based on atom type and connectivity where the connectivity of the database atom must be equal to or greater than the connectivity of the query atom), otherwise the value is '0.' The aim is to find a matrix in which there is one element per row that is equal to '1' with each mapping occurring in a different column; this then specifies a unique mapping for every query atom onto a database structure atom. The search begins by assigning the first query atom to the first potentially matching molecule atom; all other entries in the row are set to '0.' The refinement step is then applied, which involves checking that the neighbors of the query atom and database atom match. If it fails then the algorithm back-tracks and the next potential match is attempted. If it succeeds then the mapping is extended to the neighboring atoms.

While graph theoretical techniques are widely applied in the processing of chemical structures, the analogy between a graph and a chemical structure is not perfect and special procedures need to be implemented to handle some features of molecules such as aromaticity, tautomerism, and multicentre bonds, etc.23

Healthy Chemistry For Optimal Health

Healthy Chemistry For Optimal Health

Thousands Have Used Chemicals To Improve Their Medical Condition. This Book Is one Of The Most Valuable Resources In The World When It Comes To Chemicals. Not All Chemicals Are Harmful For Your Body – Find Out Those That Helps To Maintain Your Health.

Get My Free Ebook


Post a comment