Search Engine Traffic Guide
The program BLAST, which is available from the NCBI, conducts rapid searching of sequence databases (Altschul et al., 1990). The program searches for similarity between a query sequence and sequences in a nucleotide database (using BLASTN) or peptide database (using BLASTP). Two versions of the program are available: (1) a version that will run on a local workstation and (2) client software that connects a local computer via the Internet to a high-speed search engine located at the NCBI.
Other BLAST programs are also available from the NCBI Web page. BLAST 2 sequences uses the BLAST search engine to produce an alignment of two sequences entered by the user. On the Specialized BLAST pages, researchers can use the BLAST engine to search sequences that are not in GenBank. At present, these databases include unfinished microbial genomes, P. falciparum (the human malaria parasite), and tentative
Other sequence-based protein family databases draw on multiple sources. The ProClass database (Wu et al., 1999) is a nonredundant protein database organized according to family relationships as defined collectively by PROSITE patterns and PIR superfamilies. The MEGACLASS server (States et al., 1993) provides classifications by different methods, including Pfam, BLOCKS, PRINTS, ProDom, and SBASE. The MOTIF search engine at http://motif.genome.ad.jp includes PROSITE, BLOCKS, ProDom, and PRINTS.
PubMed attempts to slightly broaden the scope of MEDLINE and address some of its shortfalls by including life-science citations from general science and chemistry journals, adding roughly one million more entries to the MEDLINE set. PubMed also attempts to completely index journals back to 1966, regardless of the date of the journal's inclusion in MEDLINE. For papers published prior to 1966, the user will need to access OLDMEDLINE, which must be done through a different search engine, called the NLM Gateway (http://gateway.nlm.nih.gov). This
Other popular search engines include Lycos (http://www.lycos.com), one of the earliest search engines to appear on the web, which is consistently updated and still one of the most popular engines, and Ask Jeeves, a search engine to which questions may be posed and which portrays the fictitious character Jeeves, the knowledgeable butler created by P. G. Wodehouse.
The preceding sections have identified many reviews and commentaries on the use of risk assessment methods and HACCP in developing pathogen management plans. Many additional resources are available from government and private Internet sites around the world, and are readily located using Internet search engines. Some sites of particular interest are presented below. Risk World provides many links to risk analysis-related sites
A more general tool is Google's Desktop Search, a version of Google's popular search engine that can be applied to the local search of files on a user's machine. The personal version of Desktop Search is available for free (desktop.google.com), but greatly enhanced functionality is given by Google Mini and the Google Search Appliance (www.google.com/enterprise), both of which allow intranet-accessible documents and files across an organization to be searched.
Once potent compounds are identified that impart a desired phenotype in a biological assay, the target of the compound must be determined. If the active compounds are from a complex extract, the active agents of the extract must be purified and identified. More typically, active compounds are from a library of known small molecules. In this case, one can perform database searches to determine whether an active compound has previously been tested in biological assays. Several chemical structure databases and structure-based search engines are available. Structure-based search engines allow investigators to assess analogs of the compound as well as the compound itself. One can determine whether the compound or its analogs have previously been determined to have activity in biological assays. If a candidate target is identified, the ability of the compound to affect the target in an in vitro assay is assessed. Some purified proteins are available from commercial suppliers. In addition, a...
Other useful software tools have been directed toward keeping up with clinical trial data via structure searching as well as text-based searching. Software packages such as Pharma Projects [2] and Prous Science Integrity [3] and, of course, Internet search engines such as Google and Yahoo have all contributed to tracking of clinical data and research trends. Some resources such as
The Genome Browser works interactively with the Celera Discovery System (CDS), a powerful Web interface that allows the user to submit sequence, domain, pattern, and HMM searches and to parse biomolecule classifications, such as the Panther database of over 40,000 hidden Markov models and the Gene Ontology (GO) classifications, as well as variation and gene expression data. CDS draws on over 40 databases, more than any other biosequence application. CDS also performs high-performance text searches using the LION search engine, and it supports saving, storing, and retrieval of BLAST and other analysis results. For additional information, see http://cds.celera.com and Kerlavage et al. (2002).
With the improvements in performance of the internet and in computing speed, communications and computational resources are no longer barriers to creating databases derived from the PDB and other archival databases. There are many production systems and many experimental systems, and many more can be expected in the future. The RCSB (RCSB database list), EBI (EBI services web page), and NCBI (NCBI cross-database search page) provide extensive lists of searchable databases, and web searches (see e.g. the Google web page) will yield other resources. The following are examples.
The Distance Matrix Alignment (Dali) Fold Classification Database (Holm and Sander, 1994, 1996) provides a mapping of similarity of folding patterns from the PDB, organizing the known proteins into a tree of fold families. The Dali search engine keeps the database updated and can be used to compare a probe structure to the PDB (Dali web page).
A major advantage of the digital revolution has been in storage and retrieval of information. Storage in notebooks and filing cabinets previously meant that searching for specific data or experiments was a tedious manual process. With digital information, modern search engines can quickly find specific information in a fraction of the time usually required for a manual search. Making backup copies of nondigital data can be difficult, expensive, and time-consuming since it requires copying, retyping, or photographic reproduction. Copies of digital data can be generated more easily and at reduced costs.
Information in categories 1 and 2 is structured, and entirely (1), or mostly (2) reliable, because the sources are well known. On the other hand, this information in the 'deep web' [136] is only accessible via the individual database interfaces, not via the usual search engines. Access is often controlled, demanding at least registration, often payment (almost entirely so for 1). Almost everything in 3, and a lot in 2, is free (e.g., all databases listed above). However, much of the information on the internet (almost everything in category 3) is not reviewed or verified, as is the case in the 'established' literature and databases. Thus, the quality of this information is extremely variable, and continuing access, one of the hallmarks of the traditional publication system, is by no means guaranteed - we all keep getting these infamous 404 messages for information displaced or removed from the web altogether.
The Java-based [46] JChem Base, from ChemAxon Ltd., is another small-enterprise solution that allows the query of mixed structural and nonstructural data. It can integrate a variety of database systems (Oracle, MS SQL Server, IBM DB2, MS Access) with web interfaces and offers a fast similarity, exact-structure, and (sub)structure search engine. Using the JChem Cartridge for Oracle, the user can acquire additional functionalities from within Oracle's SQL. The system includes Marvin, a Java-based chemical editor and viewer [47].
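To make the idea of a similarity search concrete, the following is a minimal sketch in Python. It assumes that each compound is represented by a fingerprint (here, the set of bit positions that are switched on) and that similarity is scored with the Tanimoto coefficient, a widely used measure. This is an illustration only, not a description of how JChem Base works internally; the compound names, fingerprints, and threshold are made up.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # positions of the "on" bits in a structural fingerprint


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared on-bits divided by on-bits set in either fingerprint."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def similarity_search(query: Fingerprint,
                      library: Dict[str, Fingerprint],
                      threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, best match first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy library with made-up fingerprints (hypothetical data, for illustration only).
LIBRARY = {
    "cmpd_A": frozenset({1, 2, 3, 4}),
    "cmpd_B": frozenset({1, 2, 3, 8}),
    "cmpd_C": frozenset({7, 8, 9}),
}
QUERY = frozenset({1, 2, 3, 4})

HITS = similarity_search(QUERY, LIBRARY, threshold=0.5)
for name, score in HITS:
    print(f"{name}: {score:.2f}")
```

An exact-structure search corresponds to the special case of a score of 1.0 on a sufficiently discriminating fingerprint, whereas a substructure search asks a different question (whether the query is contained in the candidate) and would need a graph-matching step that this sketch does not attempt.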
Additional text fields may be defined to register full-text search capability (if the DBMS supports this, acting like a web search engine) and error data. If some inconsistency between miscellaneous sources is detected, the user needs to be notified. Because molecular biology has an extremely dynamic character, it is easy to find discrepancies between sources. For example, one Target table record stores five external identifiers, but one of them may point to an outdated link; that identifier should be flagged, and an alert should be generated. Updating the Target unit on a weekly or monthly basis ensures that information stays as current as possible.
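The flag-and-alert step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the record layout, the field names, and the idea of checking against a snapshot of currently valid identifiers per source are all hypothetical and are not taken from any particular DBMS or schema. The source names and identifiers are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ExternalId:
    source: str          # name of the external database (hypothetical)
    identifier: str      # identifier stored in the Target record
    flagged: bool = False


@dataclass
class TargetRecord:
    name: str
    external_ids: List[ExternalId] = field(default_factory=list)
    error_notes: List[str] = field(default_factory=list)  # stands in for the "error data" field


def flag_outdated(record: TargetRecord, current_ids: Dict[str, Set[str]]) -> List[str]:
    """Flag identifiers missing from the latest snapshot of their source; return alert messages."""
    alerts = []
    for ext in record.external_ids:
        valid = current_ids.get(ext.source)
        if valid is None:
            continue  # no snapshot for this source, so nothing can be concluded
        if ext.identifier not in valid:
            ext.flagged = True
            message = f"{record.name}: {ext.source} identifier {ext.identifier} is no longer current"
            record.error_notes.append(message)
            alerts.append(message)
    return alerts


# Made-up record and snapshot, as might be checked during a weekly or monthly update.
record = TargetRecord("TargetX", [ExternalId("SRC_A", "A001"), ExternalId("SRC_B", "B777")])
snapshot = {"SRC_A": {"A001", "A002"}, "SRC_B": {"B778"}}

alerts = flag_outdated(record, snapshot)
for alert in alerts:
    print("ALERT:", alert)
```

Keeping the alert text in the record itself as well as returning it lets the periodic update job both notify the user and leave an audit trail alongside the data.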
In this chapter, we have presented how computers have had an impact on the life of the medicinal chemist in the past several decades. The topics ranged from computerized chemical information systems to computer-assisted drug design and cheminformatics technologies. The former topic covered scientific literature and patent searches on research topics, reactions, and structures. It also covered computer systems for the management of compound collections and the storage and retrieval of biological data. The impact of Web technologies on the medicinal chemist was also noted, especially in the area of using the Web to keep up with competitive intelligence information. Online search systems include PubMed and the more general search engines (Yahoo, Google). On the science side, we covered the development of CADD and cheminformatics technologies, especially the ligand-based technologies, such as the pharmacophore perception methods (Catalyst, GASP, pharmacophore keys). We also explained the...
The primary advantage of the computer is that it can deal with work so large and so complex that it could not otherwise readily be done. One example of such a need is the highly complicated chemical, toxicological, and biomedical literature that exists. A computer can search and mine the literature, and it can organize it into mutually relevant collections of articles. Data clustering is one example of such an intelligent organization of the literature [36]. Other examples include directory searches, keyword searches, and database searches. (See types of search engines [37].) There are several different clustered search engines available. A case in point is Clusty.com's Vivisimo [38]. The default setting of this engine is to search the recent literature. Repeated searches, say at monthly intervals, enable one to keep up with topics of interest. A recent (June 2006) clustered search on Computational Toxicology, the topic of this book, gave the results described in Table 1.1. TABLE 1.1 Results...
The UltraLink has been implemented on our Knowledge Space Portal (KSP), which is a Web-based application deployed at the Novartis Institutes for Biomedical Research (NIBR). The KSP is an information integration environment that enables scientists to search a diverse collection of internal and external sources, including the Internet (through Google). The system allows for the integration of diverse sources and applications and provides new ways to navigate information in a seamless manner. Databases are organized in clusters that are defined by the information domain (chemistry, biology, medicine, etc.) to which they belong. Individual databases or whole clusters can be combined and searched with a natural language query. A query interpreter enriches and transforms the queries to match the syntax of the corresponding search engines and normalizes and transforms the queries to our representation standards. The resulting list of documents is ranked by relevance. In contrast to standard...
The Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB; http://www.rcsb.org/pdb/) maintains a structural archive for atomic-resolution protein structures determined by X-ray crystallography and NMR techniques. The structures are stored as sets of three-dimensional (x,y,z) positional coordinates for all the atoms within each structure. These coordinate files are written in the PDB format, which can be read and displayed by a number of existing graphics programs. The simplest method for retrieving a structural coordinate file or viewing the protein structure is by accessing the PDB World Wide Web site at the above URL using browser software such as Netscape or Microsoft Internet Explorer. As an example, let us retrieve the structure of a hemoglobin from the PDB Web site. We first connect to the PDB home page (http://www.rcsb.org/pdb/) and choose Search-Lite simple keyword search. Input the word hemoglobin in the keyword query window and click on Search. This...
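As an illustration of how the (x,y,z) coordinates are laid out in a PDB-format file, the following minimal Python sketch reads the fixed-width ATOM and HETATM records. It relies on the standard column positions of the PDB format (for example, x, y, and z in columns 31-38, 39-46, and 47-54) and ignores everything else in the file. The sample records are synthetic and are generated by a helper (`make_atom_line`, a hypothetical name) so that the columns line up; they are not taken from a real hemoglobin entry.

```python
from typing import List, NamedTuple, Optional


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Optional[Atom]:
    """Parse one fixed-width ATOM/HETATM record; return None for any other record type."""
    if not line.startswith(("ATOM  ", "HETATM")):
        return None
    return Atom(
        serial=int(line[6:11]),         # columns 7-11
        name=line[12:16].strip(),       # columns 13-16
        res_name=line[17:20].strip(),   # columns 18-20
        chain=line[21],                 # column 22
        res_seq=int(line[22:26]),       # columns 23-26
        x=float(line[30:38]),           # columns 31-38
        y=float(line[38:46]),           # columns 39-46
        z=float(line[46:54]),           # columns 47-54
    )


def parse_pdb(text: str) -> List[Atom]:
    """Collect all atoms from the text of a PDB-format file."""
    parsed = (parse_atom_line(line) for line in text.splitlines())
    return [atom for atom in parsed if atom is not None]


def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z) -> str:
    """Build a synthetic ATOM record with correctly aligned columns (for the demo only)."""
    return ("ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    {:8.3f}{:8.3f}{:8.3f}"
            .format(serial, name, res_name, chain, res_seq, x, y, z))


# Synthetic two-atom example with made-up coordinates.
SAMPLE = "\n".join([
    "HEADER    SYNTHETIC EXAMPLE",
    make_atom_line(1, " N", "VAL", "A", 1, 1.5, -2.25, 10.125),
    make_atom_line(2, " CA", "VAL", "A", 1, 2.0, -1.0, 11.5),
    "END",
])

ATOMS = parse_pdb(SAMPLE)
for atom in ATOMS:
    print(atom.name, atom.res_name, atom.chain, atom.res_seq, atom.x, atom.y, atom.z)
```

A coordinate file downloaded from the PDB could be passed to `parse_pdb` as a string in the same way; for serious work, an established structure-handling library is a better choice than hand-written column slicing.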
Cirrhosis of the liver is one of the most well-known adverse effects of chronic alcohol abuse. Patients taking the cholesterol-lowering, life-prolonging statin drugs must be monitored routinely for hepatotoxicity and rhabdomyolysis. A Google search on the terms statins, hepatotoxicity, and review produced over 22,000 hits, indicating this is a very active field of interest.
The potential for biomarkers to enhance biological understanding of disease processes and drug targets, coupled with their potential to improve the efficiency of decision making during the drug discovery and development process, has led to heightened interest in biomarker applications in the current environment. The impact of this interest may be noted by the observation that, in early 2005, a search of the World Wide Web using the Google search engine yielded approximately 1,450,000 matches for the term 'biomarkers.' Similarly, a search for the text word 'biomarker' or 'biomarkers' in the PubMed biomedical literature database at that time yielded a total of more than 12,000 citations, dating back to the late 1970s. Of the 12,000, more than 8,300 were dated in the year 2000 or later. The increased interest can be linked to breakthroughs in molecular technologies in areas such as genomics and proteomics as well as bioinformatics, and also to the increasing sophistication of drug discovery...
Protein databases are especially powered by the Internet. Unlike traditional media, such as the CD-ROM, the Internet allows databases to be easily maintained and frequently updated with minimum cost. Researchers with limited resources can afford to set up their own databases and disseminate their data quickly. Notably, many small databases on specific types of proteins, such as the EF-Hand Calcium-Binding Proteins Data Library and O-GlycBase (http://www.cbs.dtu.dk/databases/OGLYCBASE/), became widely available. Users worldwide can easily access the most up-to-date version through a user-friendly interface. Most protein databases have interactive search engines so that users can specify their needs and obtain the related information interactively. Many protein databases also allow submitters to deposit data interactively, and allow database servers to check the format of the data and provide immediate feedback.
If users are inexperienced in searching for information they should first consult search engines, meta-databases, or portals. The large search engines generally provide a larger number of hits, but often from commercial, and possibly dubious, sources. Yet if information on a new or rare compound is needed these can be recommended as a first choice. The smaller subject engines provide more reliable data, but vary considerably in their results. All the methods presented above of obtaining information via the internet bear one risk - dead links. Although a search term could be found by a search engine in its own website-meta-data database, the original link to the website could be broken and the information is lost. A more stable source of information is represented by databases.
Started by eMolecules, Inc., this [160] is a new free open-access search engine for chemistry-related information. Its mission is to discover, curate, and index all of the public chemical information in the world, and make it available to the public. eMolecules distinguishes itself by extremely fast searches, an appealing presentation of results, and high-quality chemical drawings. Millions of molecules from hundreds of sources are merged into a single, searchable chemical database. It lets the user run searches by entering text or by drawing molecular structures via the Java Molecular Editor. eMolecules also provides code that users can embed into their own web sites for direct access to it, as well as hosted cheminformatics systems and full web sites for chemical suppliers, pharmaceutical, and other chemical industries.
Abstract: Most journals now state their preferred abstract format, covering background, methods, results and conclusions, often with a word limit. Abstracts deserve a great deal of thought and time, as many readers will read no further, yet still draw conclusions from what you say in only 200-300 words. The provision of abstracts, but not full text, on many literature search engines increases the chance that a high proportion of those who see the abstract will never see the full paper. It is therefore better to state specific aspects of the results (e.g. the main outcome measure) in some detail, and others sufficiently vaguely that the reader will clearly have to look further to gain the full picture. Avoid statements such as 'there was no difference p 0.3' in an abstract just as you would in a full report - it is better to give estimates and confidence intervals as well as p-values, or say nothing.
We are all confronted daily with the complexity of finding and retrieving information relevant to our profession. Although Internet search tools (such as Google) have greatly contributed to the simplification of this process, there is still a long way to go before such tasks become really efficient. Indeed, a typical session using such search engines yields a number of hits, generally ranked by relevance. The users can then follow the hyperlinks and explore the top hits returned for their query. Although in simple cases the first few hits are relevant to the query, more refined searches are needed to disambiguate terms that have several meanings or are used in different contexts. The final assessment of the relevance of any link can only be done by reading the content of the target page. The users must therefore often follow several links and iteratively refine their query before the relevant answer is found. Furthermore, any new concept found within these pages will trigger a new...
Text mining is a relatively new technology for the life sciences that enables the retrieval and extraction of information contained in unstructured texts. The basic tasks of text mining can be defined as the identification of the entities in the universe of discourse and the detection of their relationships. Particular examples of such identified entities are protein names, their functions, and their interactions with other molecules [1-4]. Identification means that we assign semantic values to the retrieved entities and relationships, in contrast to common search engines, which only match strings or sequences of strings in a given text. In our case the domains under consideration include medicine, biology, chemistry, and their related documents and databases.
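The difference between string matching and entity identification can be illustrated with a toy Python sketch. It assumes a tiny hand-written lexicon that maps names to semantic types and treats co-occurrence within a sentence as a crude stand-in for relationship detection; real text-mining systems use far richer dictionaries, grammars, and statistical models. The lexicon, function names, and example text are all chosen for illustration.

```python
import re
from itertools import combinations
from typing import List, Tuple

# Toy lexicon assigning a semantic type to each known name (illustrative only).
LEXICON = {"p53": "protein", "mdm2": "protein", "aspirin": "chemical"}


def string_match(text: str, term: str) -> bool:
    """What a plain search engine does: report whether the string occurs at all."""
    return term.lower() in text.lower()


def identify_entities(text: str) -> List[Tuple[str, str, int]]:
    """Return (surface form, semantic type, character offset) for each lexicon hit."""
    entities = []
    for match in re.finditer(r"[A-Za-z0-9-]+", text):
        semantic_type = LEXICON.get(match.group().lower())
        if semantic_type is not None:
            entities.append((match.group(), semantic_type, match.start()))
    return entities


def cooccurring_pairs(text: str) -> List[Tuple[str, str]]:
    """Candidate relationships: pairs of entities mentioned in the same sentence."""
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        names = [entity[0] for entity in identify_entities(sentence)]
        pairs.extend(combinations(names, 2))
    return pairs


TEXT = "MDM2 binds p53. Aspirin was not tested."

FOUND = string_match(TEXT, "p53")      # only says that the string is present
ENTITIES = identify_entities(TEXT)     # says what each mention is
PAIRS = cooccurring_pairs(TEXT)        # suggests which entities may be related

print(FOUND)
print(ENTITIES)
print(PAIRS)
```

The string match returns only a yes or no, whereas the entity step attaches a type to each mention and the pairing step proposes a candidate relationship that a later stage would need to confirm.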
Today, much of the utilization of computerized data is still predicated upon the paper paradigm. Desktop applications rely on the file folder construct of a graphical user interface for storing documents and email. This was a step up in sophistication from the 1980s model of file system storage, in which users needed to know the exact location within directories and subdirectories (i.e., the full path name) to access and manipulate files. But with the typical end user now sitting in front of a computer with 40-100 GB of disk space (it is predicted that by 2010 most computers could have a terabyte of local storage), maintaining an efficient file system can be challenging and time consuming. Retrieving misplaced files is difficult at best, as the typical search functions of most operating systems perform a perfunctory search, querying files individually, looking for matches. In many instances, it is actually easier and faster to query the Internet for the needed information, rather than...
We recommend running a thorough search on one of the various Web-based patent search engines (e.g., www.delphion.com, www.uspto.gov) for patents that may be related to the device or technique that you are developing. In addition, the U.S. Patent and Trademark Office offers free information about patents, trademarks, and copyrights, and every state has a Patent and Trademark Depository Library that maintains collections of current and previously issued patents and patent and trademark reference materials.