Statistical Modeling In Pharmaceutical Research And Development

Andréa de Gaetano, Simona Panunzi, Benoit Beck, and Bruno Boulanger


3.1 Introduction

3.2 Descriptive versus Mechanistic Modeling

3.3 Statistical Parameter Estimation

3.4 Confidence Regions

3.4.1 Nonlinearity at the Optimum

3.5 Sensitivity Analysis

3.6 Optimal Design

3.7 Population Modeling

References


3.1 Introduction

The new major challenge that the pharmaceutical industry is facing in the discovery and development of new drugs is to reduce costs and time needed from discovery to market, while at the same time raising standards of quality. If the pharmaceutical industry cannot find a solution to reduce both costs and time, then its whole business model will be jeopardized: The market will hardly be able, even in the near future, to afford excessively expensive drugs, regardless of their quality.

Computer Applications in Pharmaceutical Research and Development, Edited by Sean Ekins. ISBN 0-471-73779-8 Copyright © 2006 John Wiley & Sons, Inc.

In parallel to this growing challenge, technologies are also dramatically evolving, opening doors to opportunities never seen before. Some of the best examples of new technologies available in the life sciences are microarray technologies and high-throughput screening. These new technologies are certainly routes that all pharmaceutical companies will follow. But these new technologies are themselves expensive, time is needed to master them, and success is in any case not guaranteed. So, by mere application of new technology, costs have not been reduced, and global cycle time continues to extend while the probability of success remains unchanged.

One key consideration that should be kept in mind is that the whole paradigm for discovering and developing new drugs has not changed at all in the mind of the scientists in the field. The new technologies have been integrated to do the same things as before, but faster, deeper, smaller, with more automation, with more precision, and by collecting more data per experimental unit. However, the standard way to plan experiments, to handle new results, to make decisions has remained more or less unchanged, except that the volume of data, and the disk space required to store it, has exploded exponentially.

This standard way to discover new drugs is essentially by trial and error. The "new technologies" approach has given rise to new hope in that it has permitted many more attempts per unit time, increasing proportionally, however, also the number of errors. Indeed, no breakthrough strategy has been adopted to drastically increase the rate of successes per trial and to integrate the rich data into an evolving system of knowledge accumulation, which would allow companies to become smarter with time. For most new projects initiated, scientists start data production from scratch: The lessons they have learned, or they think they have learned, from previous projects are used only as a general cultural influence; they do not materially determine the continuing development of successive projects.

This possibly slightly pessimistic portrait of the current status of research in the life sciences contrasts sharply with the progression of technology and development changes achieved in other industrial areas. As an example, consider aeronautics. New airplanes today are completely conceived, designed, optimized, and built with computer models (in fact mathematical and statistical models), through intensive simulations. Once a new plane is constructed, it will almost surely fly on its first trial, even if fine-tuning may still be needed. In this industry, each attempt produces one success. If we were to translate the current paradigm of discovery and development of new drugs into aeronautics terms, we could think of many metallurgists with great personal expertise in metallurgy, who, using vague notions of aerodynamics and resistance of materials, assemble large numbers of "candidate planes," each a complex arrangement of metal pieces. Each "candidate plane" is then tested under real conditions, by attempting to fly it from a number of likely take-off surfaces and in different meteorological conditions: The very few that do not crash are finally called "planes." The configuration of the many candidate planes that crashed is examined, so as to avoid repeating the same kinds of error in the future, but each metallurgist has his or her own way to read the facts and draw conclusions for future assemblage instead of consulting or hiring a specialist in aerodynamics or materials. The theory here would be that a plane is, finally, a large collection of pieces of metal, all assembled together! So why would other kinds of expertise be needed, besides those closely linked to metallurgy? In this vision of the business, the more new planes one wants to launch, the more metallurgists one needs, and the process could even be accelerated if one could buy new-technology machines that automatically build and assemble large numbers of different pieces of metal.

In the aeronautics industry, when an experiment is envisaged, for example, testing the resistance of a particular piece, the goal of the experiment is first of all that of verifying and refining the computer model of that piece to answer a fundamental question: Does the model behave like the real piece, or what changes are needed to make the model behave like the piece? Once adequately tuned, the model forecasts will then be used to understand how to optimize the resistance of the piece itself, until the next comparison between model and reality is done. After a few such iterations, the final piece is fully checked for quality purposes and will almost surely be found to be the right one for the job at hand. Translating the pharmaceutical approach to an experiment into aeronautics terms produces a somewhat different picture: A piece is built, which should satisfy quality checks, and an experiment is done to evaluate the resistance of the piece. If the test fails, as it is very likely to do, the piece is thrown away and the metallurgist is asked to propose a new piece by next week.

This caricaturized image of the process of discovery and development of new drugs has been drawn to highlight the pivotal role that models (simplified mathematical descriptions of real-life mechanisms) play in many R&D activities. In the pharmaceutical industry, however, in-depth use of models for efficient optimization and continuous learning is not generally made. In some areas of pharmaceutical research, like pharmacokinetics/pharmacodynamics (PK/PD), models are built to characterize the kinetics and action of new compounds or platforms of compounds, knowledge crucial for designing new experiments and optimizing drug dosage. Models are also developed in other areas, as for example in medicinal chemistry with QSAR-related models. These can all be defined as mechanistic models, and they are useful. But in these models, the stochastic noise inherent in the data, the variability that makes biology so different from the physical sciences, is not as a general rule appropriately taken into account.

On the other side, many models of a different type are currently used in the biological sciences: These can be envisaged as complicated (mathematical) extensions of commonsense ways to analyze results when these results are partially hidden behind noise, noise being inescapable when dealing with biological matters. This is the area currently occupied by most statisticians: Using empirical models, universally applicable, whose basic purpose is to appropriately represent the noise, but not the biology or the chemistry, statisticians give whenever possible a denoised picture of the results, so that field scientists can gain better understanding and take more informed decisions. In the ideal case, as in regulated clinical trials, the statistician is consulted up front to help in designing the experiment, to ensure that the necessary denoising process will be effective enough to lead to a conclusion, positive or negative. This is the kingdom of empirical models.

The dividing line between empirical models and mechanistic models is not as clear and obvious as some would claim. Mechanistic models are usually based on chemical or biological knowledge, or the understanding we have of chemistry or biology. These models are considered interpretable or meaningful, but their inherent nature (nonlinearity, high number of parameters) poses other challenges, particularly once several sources of noise are also to be adequately modeled. For these reasons empirical approaches have been largely preferred in the past. Today, however, the combination of mathematics, statistics, and computing allows us to effectively use more and more complex mechanistic models directly incorporating our biological or chemical knowledge.

The development of models in the pharmaceutical industry is certainly one of the significant breakthroughs proposed to face the challenges of cost, speed, and quality, somewhat imitating what happens in the aeronautics industry. The concept, however, is not that of adopting just another new technology, "modeling." The use of models in the experimental cycle changes the cycle itself. Without models, the final purpose of an experiment was one single drug or its behavior; with the use of models, the objective of experiments will be the drug and the model at the same level. Improving the model will help in understanding this and other drugs, and the experiments on successive drugs will help improve the model's ability to represent reality. In addition, as is well known in the theory of experimental design, the way to optimally conceive an experiment depends on the a priori model you have. If you have very little a priori usable information (i.e., a poor model), then you will need many experiments and samples, making your practice not very cost effective. This is a bonus few realize from having models supporting the cycle: The cost, speed, and effectiveness of studies can be dramatically improved, while the information collected from those optimized experiments is in turn used to update the model. Modeling is the keystone to installing a virtuous cycle in the pharmaceutical industry, in order to successfully overcome approaching hurdles. This, of course, requires us to network with or to bring on board modelers that are able to closely collaborate with confirmed drug hunters.

Using the mathematically simple example of Gompertz tumor growth, this chapter discusses the relationship between empirical and mechanistic models, the difficulties and advantages that theoretical or mechanistic models offer, and how they permit us to make safe decisions and also to optimize experiments. We believe there is an urgent need to promote biomathematics in drug discovery, as a tool for meaningfully combining the scientific expertise of the different participants in the discovery process and to secure results for continuing development. The key is to move, whenever meaningful, to mechanistic models with adequate treatment of noise.


3.2 Descriptive versus Mechanistic Modeling


According to Breiman [1], there are two cultures in the use of statistical modeling to reach conclusions from data. The first culture, namely, the data modeling culture, assumes that the data are generated by a given stochastic data model, whereas the other, the algorithmic modeling culture, uses algorithmic models and treats the data mechanism as unknown. Statistics thinks of the data as being generated by a black box into which a vector of input variables x (independent variables) enters and out of which a vector of response variables y (dependent variables) exits. Two of the main goals of performing statistical investigations are to be able to predict what the responses are going to be to future input variables and to extract some information about how nature is associating the response variables to the input variables.

We believe that a third possible goal for running statistical investigations might be to understand the foundations of the mechanisms from which the data are generated or going to be generated, and the present chapter is focused on this goal.

To understand the mechanism, the use of modeling concepts is essential. The purpose of the model is essentially that of translating the known properties about the black box as well as some new hypotheses into a mathematical representation. In this way, a model is a simplifying representation of the data-generating mechanism under investigation. The identification of an appropriate model is often not easy and may require thorough investigation. It is usual to restrict the investigation to a parametric family of models (i.e., to a set of models that differ from one another only in the value of some parameter) and then use standard statistical techniques either to select the most appropriate model within the family (i.e., the most appropriate parameter value) with respect to a given criterion or to identify the most likely subfamily of models (i.e., the most likely set of parameter values). In the former case the interest is in getting point estimates for the parameters, whereas in the latter case the interest is in getting confidence regions for them.

The way in which the family of models is selected depends on the main purpose of the exercise. If the purpose is just to provide a reasonable description of the data in some appropriate way without any attempt at understanding the underlying phenomenon, that is, the data-generating mechanism, then the family of models is selected based on its adequacy to represent the data structure. The net result in this case is only a descriptive model of the phenomenon. These models are very useful for discriminating between alternative hypotheses but are totally useless for capturing the fundamental characteristics of a mechanism. On the contrary, if the purpose of the modeling exercise is to get some insight into or to increase our understanding of the underlying mechanism, the family of models must be selected based on reasonable assumptions with respect to the nature of the mechanism. As the fundamental characteristics of the mechanism are often given in terms of rates of change, it is not unusual to link the definition of the family to a system of differential equations. As the mechanisms in biology and medicine are relatively complex, the systems of differential equations used to characterize some of the properties of their behavior often contain nonlinear or delay terms. It is then rarely possible to obtain analytical solutions, and thus numerical approximations are used.
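To make the idea of a numerical approximation concrete, the following sketch integrates a nonlinear growth equation with the classical fourth-order Runge-Kutta scheme. It is only an illustration under stated assumptions: the Gompertz law (named in the introduction) is written here in one common parameterization, dV/dt = a V ln(K/V), and the names `a`, `K`, `V0` and their values are hypothetical, not taken from the chapter. This particular equation happens to have a closed-form solution, which is used here purely to check the numerical result.

```python
import math

def gompertz_rhs(v, a, k):
    """Right-hand side dV/dt = a * V * ln(K / V) (assumed parameterization)."""
    return a * v * math.log(k / v)

def rk4(rhs, v0, t_end, n_steps):
    """Integrate dV/dt = rhs(V) from t = 0 to t_end with classical RK4."""
    h = t_end / n_steps
    v = v0
    for _ in range(n_steps):
        k1 = rhs(v)
        k2 = rhs(v + 0.5 * h * k1)
        k3 = rhs(v + 0.5 * h * k2)
        k4 = rhs(v + h * k3)
        v += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return v

def gompertz_exact(t, v0, a, k):
    """Closed-form solution of the same equation, used only as a check."""
    return k * math.exp(math.log(v0 / k) * math.exp(-a * t))

# Hypothetical values for illustration (not estimates from the chapter's data).
A, K, V0 = 0.08, 3000.0, 10.0

numeric = rk4(lambda v: gompertz_rhs(v, A, K), V0, t_end=80.0, n_steps=800)
exact = gompertz_exact(80.0, V0, A, K)
print(f"RK4: {numeric:.6f}  closed form: {exact:.6f}")
```

For models with delay terms or several coupled equations no closed form is usually available, and the numerical scheme is then the only way to evaluate the model; a production analysis would more likely rely on an established ODE solver than on a hand-written integrator.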

Whenever the interest lies in the understanding of the mechanisms of action, it is critical to be able to count on a strong collaboration between scientists, specialists in the field, and statisticians or mathematicians. The former must provide updated, rich, and reliable information about the problem, whereas the latter are trained for translating scientific information into mathematical models and for appropriately describing probabilistic/stochastic components indispensable to handling the variability inherently contained in the data generation processes. In other words, when faced with a scientific problem, statisticians and biomathematicians cannot construct suitable models in isolation, without detailed interaction with the scientists. On the other hand, many scientists have insufficient mathematical background to translate their theories into equations susceptible to confrontation with empirical data. Thus the first element of any model selection process within science must be based on close cooperation and interaction among the cross-functional team involved.

When there is a relative consensus about the family of models to use, the data must be retrieved from available repositories or generated with a well-designed experiment. In this chapter, animal tumor growth data are used for the representation of the different concepts encountered during the development of a model and its after-identification use. The data represent the tumor growth in rats over a period of 80 days. We are interested in modeling the growth of experimental tumors subcutaneously implanted in rats to be able to differentiate between treatment regimens. Two groups of rats have received different treatments, placebo and a new drug at a fixed dose. So in addition to the construction of an appropriate model for representing the tumor growth, there is an interest in the statistical significance of the effect of treatment. The raw data for one subject who received placebo are represented as open circles in Figure 3.1. For the considered subject, the tumor volume grows from nearly 0 to about 3000 mm³.

Figure 3.1 Time course of implanted tumor volume for one experimental subject (Control) and associated fitted model curves (solid line, exponential model; dashed line, nonparametric kernel estimate).

A first evaluation of the data can be done by running nonparametric statistical estimation techniques like, for example, the Nadaraya-Watson kernel regression estimate [2]. These techniques have the advantage of being relatively cost-free in terms of assumptions, but they do not provide any possibility of interpreting the outcome and are not at all reliable when extrapolating. The fact that these techniques do not require a lot of assumptions makes them relatively close to what algorithm-oriented people try to do. These techniques are essentially descriptive by nature and are useful for summarizing the data by smoothing them and providing interpolated values. The fit obtained by using the Nadaraya-Watson estimate on the set of data previously introduced is represented by the dashed line in Figure 3.1. This approach, although often useful for practical applications, does not quite agree with the philosophical goal of science, which is to understand a phenomenon as completely and generally as possible. This is why a parametric mechanistic modeling approach to approximate the data-generating process must be used.
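A minimal sketch of the Nadaraya-Watson estimate may help show why it smooths and interpolates well but cannot be trusted for extrapolation. The estimate at a time point is a kernel-weighted average of the observed volumes, so it can never leave the range of the observations. The Gaussian kernel, the bandwidth of 8 days, and the data values below are assumptions made for illustration; they are not the chapter's data or settings.

```python
import math

def nadaraya_watson(t_query, times, volumes, bandwidth):
    """Kernel-weighted average of the observed volumes around t_query.

    Uses a Gaussian kernel (an assumption; other kernels are possible).
    """
    weights = [math.exp(-0.5 * ((t_query - t) / bandwidth) ** 2) for t in times]
    return sum(w * v for w, v in zip(weights, volumes)) / sum(weights)

# Synthetic tumor volumes (mm^3) over 80 days, invented for this example.
times = [0, 10, 20, 30, 40, 50, 60, 70, 80]
volumes = [5, 20, 60, 150, 350, 700, 1300, 2100, 3000]

interpolated = nadaraya_watson(45.0, times, volumes, bandwidth=8.0)
extrapolated = nadaraya_watson(120.0, times, volumes, bandwidth=8.0)

print(f"estimate at day 45:  {interpolated:.1f}")
print(f"estimate at day 120: {extrapolated:.1f}")
```

Because the estimate is a weighted average, the value at day 120 stays at or below the largest observed volume however fast the tumor was growing at day 80, which illustrates the unreliability under extrapolation noted above. The bandwidth is the only tuning constant, and it has no biological interpretation.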

When looking at the presented data, it would be reasonable, as a first approximation, to imagine using a parametric family of models capturing the potential exponential growth of the tumor volumes. Although certainly reasonable from a physiological point of view, the selection of the exponential family is, at this stage, only based on the visual identification of a specific characteristic exhibited by the data, in this case, exponential growth. The exponential parametric family is mathematically fully characterized by the family of equations V(t) = a exp(λt). A particular model is fully specified by fixing the values for its two parameters a and λ. Note that it is particularly important to quantitatively study the change in behavior of the different models in terms of the parameters to have a good understanding of constraints existing on the parameters. In this case, for example, both parameters must be positive. To fit the model on the observed data, statistical techniques must be applied. These techniques attempt to optimize the selection of the parameter value with respect to a certain criterion. The ordinary least-squares optimization algorithm (see Section 3.3) has been used to get parameter estimates. Although this model has been selected essentially on the basis of the observed data structure, it is possible to try to give an interpretation of the model parameters. However, the interpretation of the parameters is only done after fitting the curve, possibly because of similar experiences with the same model used on other phenomena, which generate similar types of data. Up to this point, the interpretation is not at all based on known scientific properties of the data-producing mechanism built into the model. Note that again a similar a posteriori interpretability search is obviously not possible in the case of a nonparametric fit.

For the exponential family of models, the first parameter might be interpreted as the tumor volume at time zero whereas the second might likely represent the tumor growth rate. The problem with the model identified from the exponential family is that mathematically the tumor growth will continue up to infinity, which from a physiological point of view is very difficult to accept and to justify. In other words, the very form of the mathematical model as such, independently of any recorded data, is incompatible with physiology as we know it. The mathematical analysis of the model behavior, abstracting from any recorded data, should be part of any serious modeling effort directed to the understanding of a physiological mechanism and should precede the numerical fitting of the model to the available data. This qualitative model analysis seeks to establish, first of all, that the model equations do admit a solution (even if we cannot explicitly derive one) and that this solution is unique. Secondly, the solution must have a set of desirable properties that are typical of the behavior of physiological systems: for example, it should be bounded, positive, of bounded variation, and stable with respect to the parameters and to the initial conditions. Finally, the solution must exhibit or fail to exhibit some characteristic patterns, like oscillations whose period may depend on some parameter, or, more interestingly, may become established or change regime depending on some "bifurcation" parameter value. As noted before, in the absence of the possibility of actually deriving an explicit solution, given the complexity of the differential formulation, qualitative analysis seeks to characterize the unknown analytical solution, leaving to numerical techniques the actual computation of a close approximation to the unknown solution.
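The least-squares fit of the exponential family and its unbounded behavior can be sketched as follows. This is an illustration under assumptions, not the chapter's procedure from Section 3.3: the data are synthetic and noise-free (generated from a = 10, λ = 0.07 so the answer is known), and the optimizer is a simple golden-section search chosen here for self-containment. The sketch exploits the fact that, for a fixed growth rate, the model is linear in the first parameter, so its least-squares value has a closed form and only the rate has to be searched.

```python
import math

def best_a(lam, times, volumes):
    """For a fixed rate lam, the least-squares value of a is available in closed form."""
    num = sum(v * math.exp(lam * t) for t, v in zip(times, volumes))
    den = sum(math.exp(2 * lam * t) for t in times)
    return num / den

def sse(lam, times, volumes):
    """Sum of squared residuals of V(t) = a * exp(lam * t), with a profiled out."""
    a = best_a(lam, times, volumes)
    return sum((v - a * math.exp(lam * t)) ** 2 for t, v in zip(times, volumes))

def golden_section_min(f, lo, hi, tol=1e-10):
    """Minimize a unimodal function f on [lo, hi] (unimodality is assumed)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) < f(d):
            hi = d
        else:
            lo = c
    return (lo + hi) / 2

# Synthetic, noise-free volumes (mm^3) from a = 10, lambda = 0.07 (hypothetical values).
times = list(range(0, 81, 10))
volumes = [10.0 * math.exp(0.07 * t) for t in times]

lam_hat = golden_section_min(lambda lam: sse(lam, times, volumes), 0.0, 0.2)
a_hat = best_a(lam_hat, times, volumes)
one_year = a_hat * math.exp(lam_hat * 365)

print(f"a = {a_hat:.4f}, lambda = {lam_hat:.6f}")
print(f"predicted volume at day 365: {one_year:.3e} mm^3")
```

The fit reproduces the 80-day data, yet the same curve extrapolated to one year predicts a volume many orders of magnitude beyond anything physiologically possible. This is the point of the qualitative analysis: the defect lies in the form of the model and is visible before any data are fitted.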

After having used a (simple) model formulation with some plausible meaning and a behavior matching the observed data structure, the next step in the quest for a good model is to go back to the selection of an appropriate family, this time operating a selection not only with reference to the apparent data structure but also incorporating some known or presumed quantitative properties of the mechanism under investigation. The investigation of tumor growth on which we concentrate in this chapter falls in fact into the broad topic of growth curve analysis, which is one of the most common types of studies in which nonlinear regression functions are employed. The special characteristics of the growth curves are that the exhibited growth profile generally is a nonlinear function of time with an asymptote; that random variability associated with the data is likely to increase with size, so that the dispersion is not constant; and finally, that successive responses are measured on the same subject so that they will generally not be independent [3]. Note that different individuals may have different tumor growth rates, either inherently or because of environmental effects or treatment. This will justify the population approach presented in Section 3.7.

The growth rate of a living organism or tissue can often be characterized by two competing processes. The net increase is then given by the difference between anabolism and catabolism, between the synthesis of new body matter and its loss. Catabolism is often assumed to be proportional to the quantity chosen to characterize the size of the living being, namely, weight or volume, whereas anabolism is assumed to have an allometric relationship to the same quantity. These assumptions on the competing processes are translated into mathematics by the following differential equation:
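In generic notation, a balance of this kind can be sketched as below. The symbols are placeholders chosen for illustration and need not match the chapter's own: V is the size (weight or volume), η and γ are positive rate constants for anabolism and catabolism, and κ is the allometric exponent.

```latex
\frac{dV}{dt} \;=\; \underbrace{\eta\, V^{\kappa}}_{\text{anabolism (allometric)}}
\;-\; \underbrace{\gamma\, V}_{\text{catabolism (proportional)}},
\qquad \eta > 0,\ \gamma > 0 .
```

Under this form, and assuming κ < 1, growth stops where the two terms balance, that is, at the size V* satisfying η V*^κ = γ V*, which gives the asymptote that the exponential model lacks.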
