As already mentioned, problems pertaining to the reliability and validity of diagnosis are of definite importance for research in psychopathology. Research results based on inadequately diagnosed patients are of dubious value. However, there are other considerations that are equally important. One frequent and serious problem pertains to the specific sample of subjects or patients used in particular research investigations. Investigators can almost never study all individuals who supposedly manifest a particular form or type of psychopathology. Rather, they must usually settle for a much smaller number of cases that purportedly represent or typify the disorder under investigation. A variety of sampling problems may be encountered in this process, and it is worth examining some of these problems.
As indicated, investigators usually can study only small samples of the population they are interested in evaluating. An immediate critical issue is how well the sample selected for study actually represents the population or group it supposedly represents. How typical, for example, is a sample of thirty or fifty patients diagnosed as cases of schizophrenia in a state hospital in New Jersey, of all "schizophrenics" or of those in other types of settings? How far can one generalize from the findings obtained with this specific sample? Ostensibly, at least, findings obtained from a particular sample representative of a given category of disorder presumably have relevance for other comparable samples. The problem, therefore, is how does one define both the sample studied and the other samples of the population or diagnostic group to which the results are supposedly applicable?
Unfortunately, there are no true standard reference measures which can be used to define a particular clinical group of subjects, and the terms used for clinical diagnosis leave much to be desired. Consequently, it is exceedingly important to select the sample with great care and to provide as much useful descriptive information as possible concerning the sample. However, some published studies fail to provide adequate descriptive data for their samples. Such items as age, length of hospitalization, frequency of hospitalization, marital status, work history, previous treatments, family resources, education, intelligence, type of psychiatric ward or institutional setting, maintenance on medications, and the like, are all potentially important variables which may influence test performance, treatment outcome, duration of improvement, and similar variables. It should be apparent that significant variation on some or many of these variables between studies limits the reliability of the results secured and the drawing of conclusions that may have broad applicability.
The problems mentioned are particularly apparent when samples of modest size are drawn from institutional settings that vary widely in a number of dimensions. Patients in university or private hospitals generally are quite different from those in state or VA psychiatric hospitals, and generalizations from one setting do not necessarily fit other settings, even though the patients may all carry an official diagnosis of schizophrenia. For example, in a previous study of prognostic scales used in research on schizophrenic patients, we were unable to obtain an adequate number of "reactive" or good prognosis patients in the state hospital where we were conducting our study and had to secure patients from a city hospital that had fewer chronic patients (Garfield & Sundland, 1966).
Although sample size is of some importance because small-scale studies are more prone to produce findings limited in reliability, size alone is not a sufficient criterion. The selection of subjects and sample specifications are clearly of prime importance. If one is studying or comparing selected types of disordered behavior, the criteria used in selecting subjects should be explicit, and the procedures used should be those for which the reliability and validity are known or available and which meet commonly accepted standards. Particularly where studies are conducted of groups based largely on psychiatric diagnosis, it is essential to state clearly how the diagnoses were derived and to use other supporting selection criteria. One large recent study examined the extent to which diagnostic instruments for assessing axis II personality disorders diverge from clinical diagnostic procedures. "Whereas current instruments rely primarily on direct questions derived from DSM-IV, clinicians of every theoretical persuasion found direct questions useful for assessing axis I disorders but only marginally so for axis II" (Westen, 1997, p. 895). Axis II diagnoses were made by "listening to patients describe interpersonal interactions and observing their behavior with the interviewer. In contrast to findings with current research instruments, most patients with personality disorders in clinical practice receive only one axis II diagnosis" (Westen, 1997, p. 895).
Diagnoses based on old records and provided by different psychiatrists also do not constitute a sound or reliable basis for subject selection. Diagnostic criteria and nomenclature change over the years, and it is a difficult task to compare and equate diagnoses based on different diagnostic schemes, as several studies have indicated (Fenton, Mosher, & Matthews, 1981; Goldstein & Anthony, 1988; Hill et al., 1996; Klein, 1982; Rutherford, Alterman, Cassiola, & Snider, 1995).
Another problem is the randomization of subject selection. For example, were the subjects selected at random from a previously selected pool of available subjects or were they selected because they were not on drugs, happened to be available to the investigator, were in a special ward, were judged to be cooperative, or in some other manner were really not "typical subjects?" Some studies use moderately complex tasks that necessitate some screening or selection of subjects to comply with the demands of the investigation. Such research compliance leads to a rather selected sample of subjects. The Vigotsky Test of conceptual thinking was at times used as a diagnostic test for schizophrenia, although even non-college-educated normal subjects could not perform satisfactorily on it (Garfield, 1974). Clearly, such selectivity can bias the results and limit the extent to which generalized statements can be made.
Selection of subjects on the basis of a single scale may also not be adequate, particularly if they are then considered to represent a particular diagnostic group. For example, not all subjects who score at 70 or higher on a scale of the MMPI may resemble groups diagnosed on other criteria as schizophrenic, depressed, etc. College students who obtain such scores may or may not be clinically depressed, psychotic, etc., and therefore comparisons with actual clinical populations may not be warranted.
It is clearly desirable to use more than one procedure or method for establishing the diagnosis of the subjects to be used in any research study in which the diagnosis is considered a significant variable. Psychiatric or clinical diagnosis should be supplemented by other criteria such as scores on appropriate tests or standardized rating scales. In depression, for example, scores on the MMPI, the Hamilton Rating Scale (1960), and the Beck Depression Inventory (1972) could be, and should be, used in addition to clinical diagnosis. Other instruments that were developed to aid in obtaining more reliable diagnoses are the Schedule for Affective Disorders and Schizophrenia (SADS) (Endicott & Spitzer, 1978) and the Research Diagnostic Criteria (Spitzer, Endicott, & Robins, 1975).
The former is a structured interview guide with rating scales relevant to the specific diagnostic categories, whereas the latter provides criteria whereby investigators can select relatively homogeneous groups of subjects who meet specified criteria of diagnosis. Although the Research Diagnostic Criteria were used to obtain more precise diagnostic designations, as noted earlier, such diagnostic criteria tended to be quite selective and to leave a significant number of patients undiagnosed. In the NIMH Treatment of Depression Collaborative Research Program, uniform inclusion criteria, including a diagnosis of Major Depressive Disorder based on the Research Diagnostic Criteria and a score of 14 or greater on a modified seventeen-item Hamilton Rating Scale for Depression, were used, "and uniform exclusion criteria were used across sites" (Elkin, 1994, p. 116). Data were also obtained on a number of patient clinical, demographic, and personality variables that might be related to eventual outcome.
With the advent of DSM-III-R, a new structured interview for use with the new diagnostic system was developed with somewhat separate forms for diagnosing psychiatric inpatients, outpatients, and nonpatients (Spitzer, Williams, & Gibbon, 1987). Although it is highly structured, like most interview schedules of this type, it also uses a number of open-ended questions. Other more specialized interview schedules have also been developed (McReynolds, 1989).
Various experimental or laboratory tests of thought disorder and communication have also been used in studies of patients with schizophrenia. However, none of these have attained any general recognition as diagnostic measures. On the other hand, investigators do believe that the use of psychometric data along with fixed diagnostic criteria can lead to a more valid definition of schizophrenia (Moldin, Gottesman, & Erlenmeyer-Kimling, 1987).
In addition, clinical diagnoses for research purposes should be based on diagnoses from two or more clinicians with a reasonably high indication of reliability between them. In a study on the use of many raters to increase the validity of judges' ratings, groups of one to ten judges rated a patient's mood from speech samples taken at various times during psychotherapy (Horowitz, Inouye, & Siegelman, 1979). These ratings were averaged and then correlated with an objective measure of anxiety. The correlations increased noticeably, up to a theoretical asymptote, as the number of judges increased. Thus, using five or six judges will tend to increase both the reliability and validity of ratings.
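One way to see why averaged ratings approach an asymptote is the Spearman-Brown formula from classical test theory. The sketch below is an illustration only: it is not the analysis reported by Horowitz et al., it concerns reliability rather than validity, and the single-judge reliability of .40 is a hypothetical value chosen for the example.

```python
def spearman_brown(single_judge_reliability: float, n_judges: int) -> float:
    """Reliability of the mean of n_judges parallel ratings (Spearman-Brown)."""
    r = single_judge_reliability
    return n_judges * r / (1 + (n_judges - 1) * r)


# Hypothetical reliability of one judge's ratings (an assumption, not a source value).
SINGLE_JUDGE_RELIABILITY = 0.40

# Reliability of the averaged rating for panels of one to ten judges.
pooled = {k: spearman_brown(SINGLE_JUDGE_RELIABILITY, k) for k in range(1, 11)}

for k, reliability in pooled.items():
    print(f"{k:2d} judges: reliability of mean rating = {reliability:.2f}")
```

Under these assumptions the gain from each added judge shrinks as the panel grows, so most of the improvement is already achieved by the first five or six judges.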
If such procedures are followed, there is greater assurance of the reliability of diagnosis, and there are several external or operational reference points in defining the samples used. To be sure, extra effort is required to carry out such procedures and, to a certain extent, such procedures might reduce the size of potential samples and raise new issues of selectivity and limited generalizability. However, the samples would be more clearly defined, and the dangers of relying exclusively on somewhat haphazard means of classification would be lessened. In the long run, the conflicting results obtained with the use of diffuse, unclear, and unreliable diagnostic categorization might lessen.
Another issue involves the possible selectivity of a patient sample to match it with an available control group. For example, if the control group has an average IQ of 98, an average educational attainment of tenth grade, or is composed primarily of ward attendants, a subject sample selected to match them on one or more of these variables may be a highly selective and unrepresentative sample of the patient group they are supposed to represent. They are not patients typical of a particular diagnostic group, and broad generalizations to other more randomly selected groups would not be justified. For example, in my previous study of hospitalized patients diagnosed as schizophrenic, very different patterns of performance on the Wechsler-Bellevue Scale were obtained for patients who differed in education and IQ (Garfield, 1949). The most striking difference was noted on the arithmetic subtest for individuals who differed in educational levels. Schizophrenics from the lowest educational group performed at their lowest level on this subtest, whereas the group with some college education obtained its highest score on this subtest. Furthermore, comparisons of samples of schizophrenic patients studied in different investigations, who differed noticeably in mean IQ, also revealed significant differences on test patterns among these samples (Garfield, 1949). In other words, there were as many significant differences among the various samples of patients as there were between a given sample and a normal group of control subjects.
The selection and specification of subjects used in research on clinical diagnosis is thus a matter of primary importance, and would-be investigators, as well as those who read the research reports, should give careful attention to the issues discussed in the preceding paragraphs. In the final analysis, the results can be no better than the type and representativeness of the subjects used.
Another problem frequently encountered in research reports involves the appropriateness of the control groups used. Although we are more sophisticated about such matters now than in the past, problems of appropriate controls are still evident. Obviously, a group of patients that has been hospitalized for some time should not usually be compared with a group of apparently normal college students, but comparisons like this have been made.
The problem becomes especially significant when a particular investigation focuses on differential clinical diagnosis or on the specification and appraisal of patterns of test performance or other diagnostic indicators for a specific category of patients. If the results of the study are to have any clinical significance in the practical sense, then the experimental group must be compared to other clinical groups with which they would normally be compared in the actual clinical situation. For example, comparisons of test performance or other measures of a sample of patients with a diagnosis of schizophrenia should be made with other clinical groups normally seen in that type of clinical situation and in approximately the usual proportions. In any clinical or hospital setting where a clinical diagnosis is sought, the problem is rarely one of comparing the given patient to a normal population. Rather, the issue is one of differential diagnosis and appraisal in which a number of specific diagnostic questions, such as the following may be raised: Is there any suggestion of psychotic disturbance or of possible brain damage? How serious is the thought disturbance or depression manifested by the patient? In trying to reach answers to such questions, clinicians must consider and compare the patterns of various types of psychopathology. They do not simply compare or contrast the patient's performance to the performance of nonpathological or normally adjusted individuals.
Thus, in studying a diagnostic pattern for possible use in clinical diagnosis, the investigator should compare the results of the particular clinical group of interest with those of other diagnostic groups that are usually encountered in practice and from which the group of interest is to be differentiated. For clinical diagnosis, it is efficacious to have a control group made up of the proper mix of the other diagnostic groups that are normally encountered in the particular setting or of several groups representing the types of disorders that are most frequently confused with the group under study.
As emphasized earlier, the investigator should provide adequate information on all groups of subjects, indicating how they were selected and from what subpopulation they were drawn. For example, if a control group of thirty patients was selected for comparison with the group under study, why and how were these particular subjects selected, and from what number of comparable subjects were they drawn? If they were the only ones who had certain data or test scores available, what were the reasons for this particular state of affairs? How selective is the group, and how representative is it of supposedly similar subjects? Selective bias may greatly impair the kinds of conclusions that may be drawn and the extent to which the findings may be generalized.
It should be stressed that whatever control groups are used, they should be comparable in the variables of importance for a particular investigation. Where cognitive tests are used, the groups should show some comparability in level of ability, education, and age, because these may affect performance. Obviously, however, if one's particular focus is the diagnosis of mental retardation, such comparability may not be necessary or meaningful. Length of institutionalization, medications, type of ward, degree of cooperation, and other such attributes may also be important variables.
In essence, therefore, considerable attention should be given to selecting adequate and appropriate control groups in research on clinical diagnosis. Although this is particularly relevant to differential diagnosis and related problems, the issue of appropriate control groups also applies to experimental or theoretical studies of psychopathology. Numerous studies of various psychological functions in schizophrenic subjects have compared the latter quite frequently with normal controls. This may have value at a certain stage of investigation, but it has limited value if one is interested in ascertaining or demonstrating particular patterns of response or thought in patients with this specific clinical disorder. The investigator is actually trying to discover or demonstrate response patterns that characterize a particular pathological group. Whether the patterns obtained are distinctive of the particular type of pathology in question can be demonstrated only if the performance of the pathological group is compared with other pathological groups of comparable severity. In this particular instance, other psychotic groups would be the most adequate control group.
Although many of the points stressed in this section may appear obvious, they have been overlooked and continue to be disregarded by investigators in psychopathology. For example, a recent study of borderline personality disorder, with several positive features, also exhibited some of the deficiencies previously discussed (Nurnberg, Hurt, Feldman, & Suh, 1988). Patients for this study were selected from consecutive admissions to a twenty-three bed adult inpatient unit of a university teaching hospital. The criteria used specified an age range of 16 to 45, "no evidence or history of organic mental disorder, neurologic disorder, substantial concurrent medical illness, mental retardation, or alcohol or drug addiction as a primary diagnosis; no DSM-III-R diagnosis of schizophrenia, major affective disorder, paranoid disorder, or schizoaffective disorder; and an independent clinical diagnosis of borderline personality disorder by the treatment team" (p. 1280). All patients also were rated independently before discharge by the senior investigator to confirm the diagnosis of borderline personality disorder according to DSM-III, an appropriate score on the Diagnostic Interview for Borderline Patients (Gunderson, Kolb, & Austin, 1981), and the absence of exclusion criteria. In this way seventeen patients (ten women and seven men) ages 17 to 35 years were selected.
Let us now evaluate what has been presented thus far about this study. A number of inclusion and exclusion criteria have been specified to obtain what might be viewed as clear or "pure" cases of borderline personality disorder. The procedures and the age and sex of the subjects have been clearly indicated so that an attempt at replication is possible. However, no data are provided on how many patients were evaluated, how many were rejected for not meeting study criteria, and the educational or socioeconomic status of the sample selected. In other words, is this group highly selective, for example, 17 out of 1,000, or is such a group easily obtained from newly admitted inpatients? There is also no information on the reliability of the diagnosis.
So much for the experimental group. The control group with which it was compared consisted of twenty subjects (twelve women and eight men) selected from the hospital staff. These individuals were subjected to the same exclusion criteria that were used for the patients. In addition, they had to have no previous or present psychiatric treatment and be free from emotional disturbance or an acute major life event change within the past year. They ranged in age from 18 to 40 years. Again there is no information on the size of the group from which this control sample was selected or on educational and social class indexes.
The two groups were compared on the basis of a seventeen-item criteria instrument. By using the five best items, it was found that the presence of any two was a good predictor of borderline personality disorder in the patient group but produced a false positive rate of 30% in the controls. However, the use of four out of the best five predictors reduced diagnostic errors significantly.
As already pointed out, the use of a normal group or, in this case, a rather selective normal control group, greatly increases the possibility of obtaining significant differences. Comparisons need to be made with other comparable diagnostic groups, and attempts at cross-validation would be required if the criteria are to have diagnostic utility in the clinical situation. This is particularly true because many patients have more than one diagnosis (Wolf, Schubert, Patterson, Grande, et al., 1988) and, as pointed out earlier, the rate of comorbidity is especially high for personality disorders.
In a number of articles, significant correlations or differences at the .05 level of significance are reported with little concern shown over the number of significance tests performed. Although it should be apparent that the interpretation of the significant findings obtained is directly related to the number of statistical tests performed, this stricture is not always observed. If thirty-five comparisons or correlations are performed and two are at the .05 level of significance, it seems clear that the results obtained are so close to a chance occurrence that little should be made of them.
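The arithmetic behind this point can be sketched briefly. The example below assumes that all thirty-five tests are independent and that every null hypothesis is true, which real studies seldom guarantee; the Bonferroni threshold is added only as one familiar correction and is not a procedure discussed in this chapter.

```python
from math import comb


def prob_at_least(k: int, n_tests: int, alpha: float = 0.05) -> float:
    """P(at least k of n_tests reach significance) when every null is true.

    Assumes the tests are independent of one another.
    """
    p_fewer = sum(
        comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i)
        for i in range(k)
    )
    return 1 - p_fewer


N_TESTS = 35
ALPHA = 0.05

expected_by_chance = N_TESTS * ALPHA              # mean number of chance "hits"
p_two_or_more = prob_at_least(2, N_TESTS, ALPHA)  # chance of seeing two or more
bonferroni_alpha = ALPHA / N_TESTS                # stricter per-test threshold

print(f"Expected significant results by chance: {expected_by_chance:.2f}")
print(f"P(two or more by chance alone): {p_two_or_more:.2f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
```

Under these assumptions, between one and two "significant" results are expected by chance, and two or more would turn up in roughly half of such studies.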
Similarly, post hoc analyses are sometimes treated as if explicit hypotheses were being tested. Obviously, considerations enter into such analyses that differ when a specific hypothesis, stated in advance of the investigation, is being tested. Although these matters are rather basic in experimental design and statistical analysis, the lessons on these topics are either not learned well or are quickly forgotten. When the investigators emphasize significant findings that they have not predicted in advance and that are noted after the investigation has been completed, they essentially are capitalizing on chance occurrences. For the results to be taken seriously, they should be replicated on a new sample of subjects.
A related issue involves the practical or clinical significance of findings that are significant statistically at the .05 or even at the .001 level of significance. It appears from numerous reports that researchers have been led to worship at the shrine of statistical significance. Very likely, this may result from the emphasis individuals place on obtaining "positive" results and the fear of not being able to reject the null hypothesis. Many investigators appear content, and some even euphoric, at obtaining "positive results" at the .05 level of significance, regardless of whether the results appear to have potential diagnostic use. Of all the limitations the present writer has encountered in reviewing journal manuscripts, the emphasis on statistical significance and the disregard of practical psychological significance is probably the most frequent.
Clearly, researchers must examine their data to see if the results obtained are due to chance. A statistical test of the differences obtained between selected clinical groups or of the correlations obtained on certain measures and designated criteria for specific diagnostic categories is a necessary procedure for estimating the influence of chance on the results. No criticism of this procedure is intended here. However, one should recognize that for clinical research this represents only the first of the necessary procedures for appraising the results obtained. Such a statistical test informs us how probable results like ours would be if only chance were operating. If our results are significant at the .05 level, the interpretation should be that results at least this extreme would be expected only about five times in a hundred if chance alone were responsible. The findings, of course, could be due to chance, but the odds are against it. However, all one can reasonably conclude from such findings is that they do not appear to be due to chance, and that if the study were repeated, we might expect comparable findings. This, of course, does not guarantee that the new findings will be comparable to those in the original study, for conflicting findings are by no means uncommon in studies of psychopathology. However, whether the results obtained have any potential clinical usefulness cannot be determined by these statistical tests alone. Other appraisals must be made.
Some factors that influence statistical results also need to be considered. Statistical tests are very much influenced by the size of the samples used and by their variability. Large samples are generally less influenced by selective and chance variables than very small samples. Thus, findings that are small may be statistically significant in the former instance, whereas they would fail to attain significance with smaller samples. For example, with samples of around thirty subjects, correlations have to be in the neighborhood of .35 or so to reach significance. In contrast to this, with a sample of several hundred subjects, a much lower correlation, in the neighborhood of .10, may be statistically significant, although the amount of predicted variance in the scores has no practical significance. For example, in a study that I conducted of 855 clinical psychologists, a correlation of .086 was obtained between career satisfaction and satisfaction with the American Psychological Association (Garfield & Kurtz, 1976). This correlation was significant at the .01 level. However, such a correlation is clearly a "low correlation" and has little or no practical significance.
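The dependence of the significance threshold on sample size can be made concrete. The sketch below uses Fisher's z approximation, which is my choice for the illustration; exact tables based on the t distribution give slightly different values, and the two-tailed .05 criterion is an assumption.

```python
from math import sqrt, tanh
from statistics import NormalDist


def critical_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| reaching two-tailed significance (Fisher z approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return tanh(z_crit / sqrt(n - 3))


# Sample sizes echoing the text: about thirty, several hundred, and 855.
for n in (30, 400, 855):
    print(f"n = {n:3d}: |r| must exceed about {critical_r(n):.3f}")
```

The threshold falls steadily as the sample grows, which is why a correlation too small to matter in practice can still pass a significance test in a large sample.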
Thus, besides the level of significance, we must also consider the size of the sample and the actual amount of variance accounted for by the correlation coefficient if we are to interpret the findings in terms of their utility. Moderately high correlations that are not statistically significant have little value and are unreliable.
Similarly, low variability within groups of subjects generally increases the probability of securing statistically significant differences. However, though small standard deviations increase the possibility of useful discrimination between different clinical groups, the actual utility of the measures or comparisons used requires further analysis of the actual data obtained.
Although there have been some discussions of the difference between clinical and statistical significance in the published literature, the importance of this topic has received attention only recently (Jacobson & Revenstorf, 1988; Jacobson & Truax, 1991; Kazdin, 1994; Lambert & Hill, 1994). Many investigators appear content to rest on their statistical laurels and manifest little concern about the practical value of their results. However, there are several aspects of the research data which should be examined for their potential clinical significance. For example, regardless of the mathematical procedures used to test hypotheses within a statistical model, it is also important to know how much of the variance is accounted for by the particular variables studied. For practical as well as theoretical purposes this is an important consideration, but it is frequently omitted in the discussion of results. In the case of correlational data, simply squaring the coefficient of correlation provides an estimate of the amount of variance accounted for by the particular set of correlates. With other methods of analysis, the implications may not be as readily apparent, but it is equally important to provide an estimate of the variance accounted for by the experimental manipulation.
The importance of such analyses can be illustrated briefly here. In the report of the activities and preferences of a sample of 855 clinical psychologists (Garfield & Kurtz, 1976), numerous findings at the .01 level of significance or better were obtained. Some may have been the result of the numerous comparisons made, but the sample size was very likely of some importance. Consequently, correlations of .10 which were highly significant statistically were judged of little practical importance because they accounted for only 1% of the variance. In another study of more than 1000 clients in a number of community mental health centers, several correlations of around .10 were reported as highly significant, even though they were obviously of little significance clinically or socially (Sue, McKinney, Allen, & Hall, 1974). For example, the correlation between diagnosis and premature termination was .10, and this was significant at the .001 level of confidence. However, by itself, such a significant finding accounts for a negligible amount of the variance. Particularly where large samples are used, authors should feel obligated to stress the implications of their findings in terms of the variance accounted for by the variables under study.
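The conversion from a correlation to variance accounted for is a single squaring step, sketched here. The value of .10 comes from the studies just described; the other two correlations are hypothetical values included only for comparison.

```python
def variance_explained(r: float) -> float:
    """Proportion of variance shared by two variables correlated at r."""
    return r * r


# 0.10 is taken from the text; 0.30 and 0.50 are hypothetical comparison values.
for r in (0.10, 0.30, 0.50):
    share = variance_explained(r)
    print(f"r = {r:.2f}: {share:.1%} of variance accounted for, "
          f"{1 - share:.1%} unaccounted for")
```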
In addition to estimating the amount of variance accounted for by the variables studied, other analyses can also be performed. The means and standard deviations for the clinical samples studied are of potential value to other clinicians and investigators. Consequently, they should be reported, and the extent of overlap of the distributions also should be stated clearly. Also of value for clinical diagnosis are the numbers of subjects who would be correctly diagnosed by the diagnostic procedures evaluated and those who would be misdiagnosed, that is, data on false positives and false negatives. Such data are clearly important in evaluating any diagnostic procedures, but they are not always obtained or provided. It should be apparent that a particular diagnostic technique may differentiate two or more clinical groups at the .05 level of significance but yet produce so many false positives or negatives that it has very limited clinical utility. For such reasons, it is important that investigators analyze their data in terms of clinically relevant diagnostic considerations and present their analyses clearly.
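A hypothetical case shows how a highly significant group difference can coexist with poor diagnostic accuracy. Every number below (means, standard deviations, group size, and cutoff) is an assumption made for the illustration, and the scores are assumed to be normally distributed.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical score distributions; none of these values come from a real study.
clinical = NormalDist(mu=60, sigma=10)     # group with the disorder of interest
comparison = NormalDist(mu=55, sigma=10)   # another diagnostic group
N_PER_GROUP = 200
CUTOFF = 57.5                              # scores above the cutoff count as "positive"

# Large-sample z test on the difference between the two group means.
standard_error = sqrt(clinical.variance / N_PER_GROUP
                      + comparison.variance / N_PER_GROUP)
z = (clinical.mean - comparison.mean) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

false_negative_rate = clinical.cdf(CUTOFF)        # clinical cases below the cutoff
false_positive_rate = 1 - comparison.cdf(CUTOFF)  # comparison cases above the cutoff
overlap = clinical.overlap(comparison)            # shared area of the two curves

print(f"z = {z:.1f}, p = {p_value:.1e}")
print(f"False negatives: {false_negative_rate:.0%}")
print(f"False positives: {false_positive_rate:.0%}")
print(f"Overlap of distributions: {overlap:.0%}")
```

With these assumed values the mean difference is significant far beyond the .001 level, yet about four in ten subjects in each group would be misclassified.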
Another related problem in research on clinical diagnosis concerns the old issue of base rates. Although this important matter was raised some years ago and is not an unfamiliar topic in either the areas of diagnostic assessment or psychotherapy (Gathercole, 1968; Meehl & Rosen, 1955; McNair, Lorr, & Callahan, 1963; Milich, Widiger, & Landau, 1987), a number of studies seemingly have disregarded it even when it was clearly relevant. Consequently, base rates are worth some discussion.
I can begin by referring to an earlier experience of mine. Working as a clinical psychologist in a VA Hospital more than 50 years ago, I collected psychological test data on a moderate sized sample of patients who had been referred for psychological evaluation. I compared my diagnostic impressions based on test data with the clinical diagnoses made at the staff conferences on these patients and, among other things, noted that my diagnosis of schizophrenia agreed with the staff diagnosis in 67% of the cases. On the whole, I was not displeased with this degree of agreement. I prepared an article on this study and submitted it to the Branch Chief Psychologist for his evaluation. The Branch Chief, David Shakow, returned it to me with his comments. He particularly raised the issue of how my diagnosis of schizophrenia compared with the rates of such admissions to the hospital, and suggested that I secure the base rates for diagnoses of schizophrenia in my hospital for comparison. I reluctantly carried out a survey of hospital admissions for a specific period and discovered to my surprise that the number of patients officially diagnosed as cases of schizophrenia was just about 67%. In other words, my diagnostic work did not exceed the base rate for diagnoses of schizophrenia, and automatically diagnosing every admitted patient as a schizophrenic would have been as accurate as the diagnoses derived from my psychological test evaluation. Needless to say, this was a sobering experience which has remained rather vivid in my memory.
The matter of base rates, thus, is a factor that must be considered in evaluating certain kinds of research on clinical diagnosis. To be worthwhile clinically, the accuracy of a diagnostic procedure must clearly exceed what could be achieved from the base rate for a particular disorder alone. Otherwise, one may be wasting time and effort. Attention to such matters, along with attention to false positives and false negatives, indicates the potential difficulties in using diagnostic procedures for disorders where the incidence is very high, or where it is very low, as in the case of suicide. In the latter instance, a diagnostic or predictive measure must be extremely effective in discriminating the cases being evaluated if it is to be clinically useful.
Thus, for specific kinds of problems pertaining to differential diagnosis or prediction, it is essential for the researcher to provide data on the base rates for the disorders of interest and to clearly show the advantages as well as disadvantages for the procedures being evaluated.
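The arithmetic behind these points can be made concrete with a brief sketch. The function names below are my own, and the figures used for the rare disorder (a base rate of 1%, with sensitivity and specificity of 90% each) are hypothetical values chosen only for illustration; they are not drawn from any study cited here.

```python
def majority_rule_accuracy(base_rate: float) -> float:
    """Accuracy obtained by always assigning the more common outcome."""
    return max(base_rate, 1.0 - base_rate)


def procedure_accuracy(base_rate: float, sensitivity: float, specificity: float) -> float:
    """Overall proportion of correct decisions made by a diagnostic procedure."""
    return base_rate * sensitivity + (1.0 - base_rate) * specificity


def positive_predictive_value(base_rate: float, sensitivity: float, specificity: float) -> float:
    """Proportion of positive calls that are true cases."""
    true_positives = base_rate * sensitivity
    false_positives = (1.0 - base_rate) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # High base rate, as in the hospital example: calling every admission
    # a case of schizophrenia is already correct 67% of the time.
    print(majority_rule_accuracy(0.67))

    # Hypothetical rare outcome (assumed values, for illustration only).
    rare_base_rate, sens, spec = 0.01, 0.90, 0.90
    print(procedure_accuracy(rare_base_rate, sens, spec))
    print(majority_rule_accuracy(rare_base_rate))
    print(positive_predictive_value(rare_base_rate, sens, spec))
```

Under these assumed values, the procedure is correct in about 90% of cases, whereas simply predicting that no one has the disorder is correct in 99%, and only about 8% of the cases flagged by the procedure are true cases. This is why a measure applied to a very rare outcome must discriminate extremely well before it is clinically useful.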
In this section, I discuss some of the deficiencies noted in some research reports on clinical diagnosis. These include such problems as not providing basic information on the measures used, necessary information on important variables that might influence results, how diagnoses were secured, how the subjects were selected, and similar aspects. The examples discussed will illustrate the kinds of problems encountered when such information is not included in the report.
At present, when a large number of patients, both inpatients and outpatients, are receiving medication of various kinds, it is extremely important to provide this information in the research report. If the medications used have any potency, they are bound to influence the behavior and mental functioning of the subjects studied. The fact that subjects are taking medication must be clearly stated, and the medication used, the dosages, and in many cases the duration of medication should also be specified. If part of a group of subjects is on medication and part is not, there are clearly problems in mixing the results of these two subgroups and treating them as one relatively homogeneous group. In a related fashion, comparing a clinical group of subjects receiving medication with a control group that is not receiving it must obviously limit the kinds of conclusions that can be drawn from such comparisons. If one is attempting to compare the effects of drugs on two groups of comparable patients, then, of course, the previous comparison would be feasible, provided that a placebo was used with the control group. However, if the purpose is to compare the mental functioning, behavior, or personality characteristics of a given diagnostic group with those of some other group, then such a comparison would provide results that were contaminated by the influence of the medication.
Another more common problem involves the lack of adequate information presented on the particular techniques or methods of appraisal used in the research study. This is of particular importance when the procedures used are not standardized or have been constructed by the investigators for their specific research without adequate preliminary reliability or validity studies. It is the responsibility of investigators to describe clearly the procedures and techniques they have used in their study so that others may fully understand what has taken place and be able to attempt replications of the study. Readers of such reports should also evaluate the adequacy of the research before attempting to apply it to their own clinical or research work.
With the pressures for publication and the corresponding desire to use journal space efficiently, it is understandable that editors want manuscripts to be as brief and concise as possible. Nevertheless, this does not mean that significant information about a research project should be omitted. For example, when relatively unknown tests or rating scales are used, the investigator should describe them in sufficient detail so that the reader clearly understands the techniques used and can appraise their suitability for the samples and the problem under investigation. The use of such measures is not an infrequent occurrence. Lambert and Hill (1994) stated, "The proliferation of outcome measures (a sizable portion of which were unstandardized scales) is overwhelming if not disheartening" (p. 74). The investigator should also provide pertinent data concerning the reliability and validity of the procedures used in these instances. Because the value of the results obtained in any research study depends on the adequacy of the measures used, such information is essential.
A final point concerns inadequate or apparently biased citing of the relevant research literature. I cannot recall seeing any discussion of this topic in presentations of research on either clinical diagnosis or treatment. Perhaps it is assumed that all investigators are aware of the need to review carefully the existing research on the topic under investigation. Nevertheless, this issue is mentioned in critiques of research reports or reviews of research (Garfield, 1977, 1978). The issue is of some importance where there is conflicting literature on a specific topic and investigators refer primarily to those published reports that support their findings or position and omit mention of findings that fail to do so. Such a practice violates accepted standards of scholarship, misleads uninformed readers of the research reports, and may tend to perpetuate the use of fallible diagnostic techniques.
One of the concerns about much past research, not limited to clinical diagnosis alone, has been the large number of conflicting findings in the literature and the difficulty in replicating published findings. This has become such a frequent occurrence in research in clinical psychology that many review articles tend to present a summary of the number of positive and negative studies on a given topic (e.g., Garfield, 1994; Luborsky, Singer, & Luborsky, 1975). If more than half of the studies are favorable to a given proposition, then that view may be judged to have some support, even where the studies may vary greatly in quality. This is a rather risky means of drawing conclusions about some technique or finding. If a particular clinical treatment for schizophrenia, for example, is found helpful in eight investigations and harmful in four, should we conclude that overall the treatment is helpful? The fact that we have such conflicting findings should make us suspect some possible limitations in the research reported and withhold judgment until we are able to explain the discrepant findings.
Most likely, the discrepancies among research reports are due to subject variables, sample size, and variations in diagnostic and treatment procedures, as well as to chance variables. Because of the kinds of problems already mentioned concerning variations in assigning clinical diagnoses, in selecting subject samples, in the kind of settings used, and in the procedures used, the findings from any single investigation are probably best viewed as suggestive. Although attempted replications by other investigators in different settings are essential for appraising the value of any investigation, there is one procedure that most researchers could use to improve the reliability of their findings, and I am surprised that it is not used more frequently. This is simply to secure enough subjects in the experimental and control groups so that each group can be randomly divided into two subgroups. The study is then conducted so that one set of subgroups serves as the initial sample and the second set serves as an attempt at cross-validation. This is a relatively straightforward procedure which has been used in some past studies, but relatively infrequently (Garfield & Wolpin, 1963; Lorr, Katz, & Rubenstein, 1958; Sullivan, Miller, & Smelser, 1958).
In the study by Sullivan et al. (1958), two cross-validations were carried out on MMPI profiles of patients who terminated prematurely from psychotherapy; the results of the cross-validation attempts were clearly very important for the final conclusions drawn. Significant differences between premature terminators and those who continued in therapy were found for several MMPI Scales for each of the several groups of subjects investigated. However, none of the scales that differentiated the two groups under study in one sample showed a consistent pattern. For each separate appraisal, different scales were found significant in their differentiation. Consequently, the findings secured in the first appraisal were not supported in the subsequent cross-validations, and the final conclusions reached were different from what they would have been if the cross-validations had not been attempted. This old but well-conducted study clearly illustrates the importance and the necessity of cross-validation.
Although attempted replications or cross-validations by other investigators are also required to evaluate fully the significance and utility of findings reported in any single investigation, the procedure suggested above is useful for increasing the potential value of single studies.
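The random division into subgroups described above can be sketched briefly as follows. This is only a minimal illustration under my own assumptions: the function names, the subject identifiers, the group sizes, and the scale labels are all hypothetical, and the sketch does not reproduce the procedure of any of the studies cited.

```python
import random


def split_for_cross_validation(subject_ids, seed=0):
    """Randomly divide one group into an initial half and a cross-validation half."""
    rng = random.Random(seed)      # fixed seed so the split can be reported and repeated
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def replicated_findings(initial_significant, cross_validation_significant):
    """Keep only the findings that were significant in both subsamples."""
    return sorted(set(initial_significant) & set(cross_validation_significant))


if __name__ == "__main__":
    # Hypothetical subject identifiers for the two groups.
    experimental = range(1, 41)
    control = range(101, 141)

    exp_initial, exp_crossval = split_for_cross_validation(experimental, seed=1)
    ctl_initial, ctl_crossval = split_for_cross_validation(control, seed=2)

    # The analysis is run first on (exp_initial, ctl_initial) and then
    # repeated, unchanged, on (exp_crossval, ctl_crossval).
    print(len(exp_initial), len(exp_crossval), len(ctl_initial), len(ctl_crossval))

    # Hypothetical outcome: only findings appearing in both runs are retained.
    print(replicated_findings(["scale_A", "scale_B"], ["scale_B", "scale_C"]))
```

The essential design point is that the split is made at random and before any analysis is carried out, so that the second set of subgroups provides an independent check on whatever is found in the first.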