Reliability and Validity of Clinical Diagnosis

As indicated, clinical diagnosis usually results in a specific categorization or label. After the patient has been appraised, the clinician offers a diagnosis (e.g., schizophrenia, paranoid type). If the diagnosis reflects a specific disease process that is known and understood, there is a reasonable probability that most diagnosticians will agree on the correct diagnosis. However, if the diagnostic category lacks precision and covers a moderate variety of behaviors, the reliability of the diagnosis may be impaired. This is of some importance for both practice and research. For example, if a specific treatment is indicated for a given disorder, a misdiagnosis may lead to improper treatment. In research investigations, unreliable diagnoses for the subjects studied may lead to unreliable or invalid results. Thus, the reliability of the diagnosis is of some consequence, yet most past research has not secured high agreement among psychiatrists who provided diagnoses on the same group of subjects (Ash, 1949; Beck, Ward, Mendelson, Mock, & Erbaugh, 1962; Blashfield, 1973; Hunt, Wittson, & Hunt, 1953; Kreitman, 1961; Kuriansky, Deming, & Gurland, 1974; Malamud, 1946; Spitzer & Fleiss, 1974). Although several of these studies have flaws, they do illustrate the problem of the lack of reliability in psychiatric diagnosis. The percentage of agreement between two psychiatrists for specific subtype diagnosis, excluding organic brain syndromes, varied from 6 to 57% in one study (Schmidt & Fonda, 1956). In a study of three psychiatrists who worked in the same hospital with comparable groups of patients, one of the psychiatrists diagnosed two-thirds of his patients as schizophrenic, compared with 22 and 29% by the other two psychiatrists (Pasamanick, Dinitz, & Lefton, 1959). Anyone who has worked for a time in a psychiatric hospital has had the opportunity to observe the diagnostic preferences and biases of other staff members.
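The percentage-agreement figures cited in these studies rest on a simple computation, which can be sketched as follows. This is only an illustrative sketch: the function name and the two lists of diagnoses are hypothetical and are not data from any of the studies cited.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of cases on which two raters assign the same diagnosis."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must diagnose the same cases")
    if not rater_a:
        raise ValueError("at least one case is required")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


# Hypothetical diagnoses of the same five patients (illustrative only).
psychiatrist_1 = ["schizophrenia", "depression", "anxiety", "schizophrenia", "depression"]
psychiatrist_2 = ["schizophrenia", "anxiety", "anxiety", "depression", "depression"]

agreement = percent_agreement(psychiatrist_1, psychiatrist_2)
print(f"{agreement:.0f}% agreement")
```

Raw percentage agreement of this kind makes no correction for the agreement two raters would reach by chance, so it tends to look more favorable when one diagnosis is very common.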

Other examples of the possible unreliability of psychiatric diagnosis can be found in differences in proportions or distribution of diagnoses over time and in different countries or geographic locations. Kramer (1965), for example, collected data that showed large variations over time in first admission rates to state mental hospitals in the United States for different diagnoses. The rates of schizophrenic admissions were 17.2, 22.0, and 25.3 per 100,000 for the years 1940, 1950, and 1960, respectively. The 1960 rate of admission for patients diagnosed as schizophrenic exceeded the 1940 rate by 47%. The rates of affective psychosis, on the other hand, showed a steady decline from 11.2 to 9.5 to 7.4 for the same periods, or a decline of 33.9%. Does this represent a real increase in rates of admission for schizophrenia and a real decrease in such rates for affective psychosis, or does this reflect inconsistency in applying diagnostic criteria? Similar questions can be raised concerning other diagnoses for which large differences in rates of admission are observed. "The large increases in the rates for the psychoneuroses, personality disorders, and alcoholic addictions raise questions as to whether they represent true increases in rates of admission for these diagnostic groups or differences in diagnostic fads and criteria or other factors that lead mental hospital psychiatrists to place nonpsychotic diagnoses on increasing numbers of patients" (Kramer, 1965, pp. 104-105).
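The percentage changes quoted from Kramer's admission rates can be checked with a few lines of arithmetic. The rates below are those given in the text; the function and variable names are merely illustrative.

```python
def percent_change(start, end):
    """Relative change from start to end, expressed as a percentage of start."""
    return 100.0 * (end - start) / start


# First-admission rates per 100,000, as quoted in the text (Kramer, 1965).
schizophrenia = {1940: 17.2, 1950: 22.0, 1960: 25.3}
affective_psychosis = {1940: 11.2, 1950: 9.5, 1960: 7.4}

schizophrenia_change = percent_change(schizophrenia[1940], schizophrenia[1960])
affective_change = percent_change(affective_psychosis[1940], affective_psychosis[1960])

print(f"schizophrenia, 1940 to 1960: {schizophrenia_change:+.1f}%")
print(f"affective psychosis, 1940 to 1960: {affective_change:+.1f}%")
```

The arithmetic confirms the size of the shifts, but it cannot distinguish a true change in admissions from a change in how the diagnostic labels were applied.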

It is reasonably clear from the kinds of data mentioned before, as well as from studies comparing psychiatric diagnoses obtained in different countries, that unreliability in diagnosis has constituted a problem of some importance. It is certainly a serious problem for research on psychopathology. If there is limited reliability for the diagnostic groups studied, the results secured for any investigation may lack stability, and replication and generalization of results will be difficult. Clinically, lack of reliability may lead to incorrect diagnosis and treatment. As a result of faulty diagnosis, individuals may be socially stigmatized or inappropriately institutionalized. Thus, reliability of diagnosis is not merely an academic issue, even though our concern here with methodological issues may pertain more directly to problems of research in psychopathology.

Although the reliability of clinical diagnosis is important, there is also the matter of the validity of the diagnoses secured. Reliability, although necessary, does not guarantee validity. As emphasized in a critical review, "Even if the reliability of schizophrenia could be assured, therefore, the validity of the concept would require further demonstration" (Bentall, Jackson, & Pilgrim, 1988, p. 306). The importance of the validity of clinical diagnoses in psychiatry, however, has not received very much research attention, perhaps because of the difficulties in securing adequate criteria for evaluative purposes. How does one decide if a diagnosis of anxiety disorder or schizophrenia is valid? In essence, in evaluating disorders without physical components or clearly known etiologies, one must rely on clinical judgments or psychological tests, which in turn have been based largely on clinical judgments. In the absence of some standard or accepted criteria against which to evaluate clinical diagnoses, the process of validation is difficult and thus somewhat neglected. Reliability, by contrast, is much easier to appraise and has received increased attention in the preparation of DSM-IV.

Nevertheless, the problem of validity remains. It may not be viewed as of great importance when there are no treatments of proven worth or preventive strategies for different diagnostic categories. However, where different and effective treatments or methods of prevention are available, the validity of differential diagnosis becomes important. It is, however, always significant for studies of particular types of disorders and may be particularly so for certain types of research.

One kind of research in which the validity of clinical diagnoses is of great importance is that conducted with subjects at high risk for schizophrenia. Although the problem of low reliability of psychiatric diagnoses can be overcome with competent and specially trained diagnosticians who are provided with accurate and detailed information and with structured interview guidelines, the matter of validity is more complex and difficult. Nevertheless, "For high-risk researchers, the issue of the predictive validity of the diagnoses of schizophrenia is of central importance" (Hanson, Gottesman, & Meehl, 1977, p. 576).

The Recent DSMs: III, III-R, and IV

As already noted, there has been considerable activity in recent years in forming new psychiatric diagnostic systems. Three revised diagnostic manuals have been published in just 14 years. Furthermore, these manuals were considerably larger than their predecessors, and the approach and methods for producing them differed in important ways. These new diagnostic manuals manifest a definite sensitivity to the criticisms made of the older systems. The authors have been particularly sensitive to issues of definition and reliability, and the new diagnostic systems have stimulated a considerable amount of research (McReynolds, 1989; Millon & Klerman, 1986; Sutker & Adams, 1993). Some of these studies are reviewed here briefly, mainly to illustrate methodological issues.

In many respects, DSM-III was a radical departure from its predecessors. Its authors attempted to avoid any theoretical partisanship or controversies, and they also attempted to emphasize operational criteria and descriptive psychopathology. "These criteria are based, for the most part, on manifest descriptive psychopathology, rather than inferences or criteria from presumed causation or etiology, whether this causation be psychodynamic, social, or biological. The exception to this is the category of organic disorders whose etiology is established as caused by central nervous system pathology" (Klerman, 1986, p. 5). However, this distinction for organic disorders was omitted in DSM-IV. "The term 'organic mental disorders' has been eliminated from DSM-IV because it implies that the other disorders in the manual do not have an 'organic' component" (American Psychiatric Association, 1994, p. 776). This change appears to reflect political-guild issues more than diagnostic ones (Follette & Houts, 1996), and I will say no more about it here.

Although some have criticized the deliberate avoidance of a theoretical approach to diagnosis (Follette & Houts, 1996; Skinner, 1986), the authors of DSM-III, as well as DSM-III-R and IV, emphasized the need for accurate description and reliability of diagnosis and even carried out reliability studies, particularly for DSM-IV. Let us now turn to some studies of reliability and related issues.

In one study, twenty psychiatrists made independent diagnoses on twenty-four actual case histories of childhood psychiatric disorders (Cantwell, Russell, Mattison, & Will, 1979). The average agreement of these clinicians with the authors' consensus on the expected DSM-III diagnosis was just less than 50%. In another report on this study, the average agreement between the psychiatrists on their most common diagnosis (they were allowed more than one) was 57% for DSM-II and 54% for DSM-III (Mattison, Cantwell, Russell, & Will, 1979). Interrater agreement for DSM-III reached 80% for only four of the twenty-four cases; the best results were obtained for diagnoses of mental retardation. Notable disagreement was found in both systems for anxiety disorders, complex cases, and the subtypes of depression.

The lack of agreement among the different diagnostic systems and the fact that they tend to select different samples of subjects for the same diagnostic category have also been noted by others (Endicott, Nee, Fleiss, Cohen, Williams, & Simon, 1982; Fenton, Mosher, & Matthews, 1981). In the study by Endicott et al. (1982) of diagnostic criteria for schizophrenia, six well-known systems were used to evaluate newly admitted patients, including the Research Diagnostic Criteria (RDC) (Spitzer, Endicott, & Robins, 1975), the Feighner Criteria (Feighner, Robins, Guze, Woodruff, Winokur, & Munoz, 1972), and DSM-III. "The most salient finding of the study is that the systems vary greatly in the rates at which they make the diagnosis of schizophrenia" (Endicott et al., 1982, p. 888). The percentage of cases diagnosed as schizophrenia ranged from 3.6 to 26%, a sevenfold difference, although most systems showed acceptable rater reliability. "The disparity illustrates the degree of difficulty associated with the diagnosis of schizophrenia and in the concept of schizophrenia ..." (Endicott et al., 1982, p. 888).

A somewhat similar study of forty-six cases of schizophrenia was reported by Klein (1982) who compared seven diagnostic systems including most of those evaluated by Endicott et al. (1982). These patients had to have a hospital diagnosis of schizophrenia based on DSM-II, a score of four or more on the New Haven Schizophrenic Index (Astrachan et al., 1972), and be under age 56 with no evidence of organic brain damage, toxic psychosis, drug abuse, and the like. The DSM-III correlated .89 with the Feighner Criteria and .84 with the RDC, diagnostic systems which were models for it. However, its correlations with the remaining four scales were considerably lower. Use of the DSM-III led to diagnosis of 28% of the sample as cases of schizophrenia; the range was 24 to 63% for the other diagnostic systems. Furthermore, only nine of the forty-four patients were diagnosed by all seven systems as either cases of schizophrenia (N = 3) or as not such cases (N = 6).

Somewhat comparable findings were reported in a more recent study (Hill et al., 1996). "The aim of this study was to determine the extent to which diagnoses of schizophrenia from forensic sources can be seen to meet formal diagnostic criteria through use of both a structured undiagnostic approach and a multidiagnostic chart review based on case histories" (p. 534). Each of the eighty-three subjects had a recorded diagnosis of schizophrenia at coronal autopsy. Thirty-one percent did not meet the criteria for any of the five diagnostic systems, 68.7% met criteria for at least one system, and 20.5% met the criteria for all five diagnostic systems. Agreement ranged from 42.2% for the Feighner criteria to 63.9% for DSM-III-R.

Although it seems reasonably clear that DSM-III, DSM-III-R, and DSM-IV are more precise in their delineation of many mental disorders than was true of DSM-II and that the reliability of diagnosis has been enhanced in many instances, some important problems remain. The lack of comparability of diagnostic systems has already been noted, and this clearly presents problems for both research and practice. If everyone adhered to one diagnostic system exclusively, then perhaps this problem would be less serious. However, when new official systems are introduced within a period as brief as seven years, the comparability and meaningfulness of clinical diagnosis become more problematic. DSM-IV, for example, was being prepared while research studies on DSM-III-R were still underway. Thus, although the DSM-IV Task Force profited from the research and data sets resulting from DSM-III, it could have learned more by awaiting additional research based on DSM-III-R.

In the introductory section of the DSM-III-R manual, it is stated that the American Psychiatric Association decided in 1983 to work on a revision of DSM-III for several reasons. Data from new studies were inconsistent with some of the diagnostic criteria. "In addition, despite extensive field testing of the DSM-III diagnostic criteria before their official adoption, experience with them since their publication had revealed, as expected, many instances in which the criteria were not entirely clear, were inconsistent across categories, or were even contradictory" (American Psychiatric Association, 1987, p. XVII). Therefore, a thorough review was instituted, and the required modifications were made. Although the revised DSM-III essentially follows the same overall scheme and rationale as the original, some modifications were made and comparisons of the two diagnostic systems have been reported. Some examples of the problems encountered follow.

One study attempted to evaluate the reliability, sensitivity, and specificity of DSM-III and DSM-III-R criteria for the category of autism in relation to each other and to the clinical diagnoses made (Volkmar, Bregman, Cohen, & Cicchetti, 1988). The subjects were fifty-two individuals diagnosed as autistic and sixty-two considered developmentally disordered but not autistic. The reliability of the specific criteria tended to be high. The DSM-III criteria were judged more specific but less sensitive than the DSM-III-R criteria. As a result, the investigators concluded that the diagnostic concept of autism has been greatly broadened in the revised system.
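The trade-off between sensitivity and specificity described here can be made concrete with a small sketch. The group sizes (52 and 62) follow the text, but the cell counts are hypothetical, chosen only to show how a broader criteria set can gain sensitivity at the cost of specificity; they are not the results reported by Volkmar et al.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of clinically diagnosed cases that the criteria also identify."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of non-cases that the criteria correctly leave undiagnosed."""
    return true_neg / (true_neg + false_pos)


# Group sizes follow the text (52 autistic, 62 not autistic).
# The cell counts are HYPOTHETICAL and serve only to show the trade-off.
criteria_sets = {
    "narrower criteria": {"tp": 39, "fn": 13, "tn": 59, "fp": 3},
    "broader criteria": {"tp": 49, "fn": 3, "tn": 43, "fp": 19},
}

results = {}
for name, c in criteria_sets.items():
    results[name] = (sensitivity(c["tp"], c["fn"]), specificity(c["tn"], c["fp"]))
    sens, spec = results[name]
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this invented example the broader set identifies more of the clinically diagnosed cases but also labels more of the comparison group, which is the pattern a broadened diagnostic concept would be expected to produce.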

In the light of the preceding paragraph, it is of interest to list the changes made for autistic disorder in DSM-IV:

Autistic Disorder. The DSM-III-R defining features (impaired social interaction, communication, and stereotyped patterns of behavior) are retained in DSM-IV, but the individual items and the overall diagnostic algorithm have been modified to (1) improve clinical utility by reducing the number of items from 16 to 12 and by increasing the clarity of individual items; (2) increase compatibility with the ICD-10 Diagnostic Criteria for Research; and (3) narrow the definition of caseness so that it conforms more closely with clinical judgment, DSM-III, and ICD-10. In addition an "age of onset" requirement (before age 3 years in DSM-IV), which had been dropped in DSM-III-R, has been reinstated to conform to clinical usage and to increase the homogeneity of this category. (American Psychiatric Association, 1994, p. 774)

Another problem in clinical diagnosis is illustrated in a study of the differential diagnosis of attention deficit disorder (ADD) and conduct disorder using conditional probabilities (Milich, Widiger, & Landau, 1987). Although these two disorders are considered separate disorders, there has been a substantial overlap in symptoms. Using a standardized interview designed to represent the diagnostic criteria contained in DSM-III, seventy-six boys referred to a psychiatric outpatient clinic were evaluated and the conditional probabilities and base rates of the symptoms for both disorders were ascertained. The results indicated that the symptom with the highest covariation with the specific disorder was not always the most useful in diagnosis. Furthermore, some symptoms are most useful as inclusion criteria, whereas some are most useful as exclusion criteria. The authors also point out that the interview used was based on DSM-III and that the application of different diagnostic criteria could change the pattern of results obtained. This, of course, is always a problem when diagnostic criteria are revised and new systems instituted. A final important point made by these investigators is that the symptom criteria offered for ADD in DSM-III-R are weighted equally "whereas the results of the present study suggest that some symptoms are more effective inclusion criteria than others. In addition, the DSM-III-R offers only inclusion criteria and makes no attempt to use symptoms as exclusion criteria" (Milich, Widiger, & Landau, 1987, p. 766).
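The distinction between inclusion and exclusion criteria can be illustrated with conditional probabilities computed from a two-by-two table of symptom by disorder. The sketch below is an assumption-laden illustration, not the authors' analysis: the class name, the two symptoms, and all cell counts are hypothetical (only the total of 76 boys follows the text), and the probability that the disorder is present given the symptom is taken here as a stand-in for usefulness as an inclusion criterion, with the probability that the disorder is absent given that the symptom is absent standing in for usefulness as an exclusion criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymptomTable:
    """Two-by-two counts for one symptom against one disorder (hypothetical)."""
    both: int           # symptom present, disorder present
    symptom_only: int   # symptom present, disorder absent
    disorder_only: int  # symptom absent, disorder present
    neither: int        # symptom absent, disorder absent

    @property
    def total(self) -> int:
        return self.both + self.symptom_only + self.disorder_only + self.neither

    def base_rate(self) -> float:
        """P(disorder) in the sample."""
        return (self.both + self.disorder_only) / self.total

    def p_symptom_given_disorder(self) -> float:
        """How often the symptom appears among those with the disorder."""
        return self.both / (self.both + self.disorder_only)

    def p_disorder_given_symptom(self) -> float:
        """Inclusion value: P(disorder | symptom present)."""
        return self.both / (self.both + self.symptom_only)

    def p_no_disorder_given_no_symptom(self) -> float:
        """Exclusion value: P(no disorder | symptom absent)."""
        return self.neither / (self.neither + self.disorder_only)


# HYPOTHETICAL counts for two symptoms in a sample of 76 boys, 40 with the disorder.
symptom_a = SymptomTable(both=36, symptom_only=24, disorder_only=4, neither=12)
symptom_b = SymptomTable(both=20, symptom_only=2, disorder_only=20, neither=34)

for label, table in (("A", symptom_a), ("B", symptom_b)):
    print(
        f"symptom {label}: P(S|D)={table.p_symptom_given_disorder():.2f}, "
        f"P(D|S)={table.p_disorder_given_symptom():.2f}, "
        f"P(not D|not S)={table.p_no_disorder_given_no_symptom():.2f}"
    )
```

With these invented counts, symptom A is far more common among boys with the disorder, yet symptom B is the better inclusion criterion because it rarely occurs without the disorder, while the absence of symptom A is the better exclusion criterion.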

Because of space limitations, only a few other studies can be mentioned. The introduction of many new diagnoses obviously created many new potential problems, among them, estimating the incidence of specific disorders. For example, "In the years since 1980, bulimia has gone from being virtually unknown to being described by some medical investigators as a 'major public health problem' (Pope, Hudson, & Yurgelun-Todd, 1984) and being designated by one prominent nonmedical leader of contemporary female opinion as a disorder of 'epidemic proportions'" (Ben-Tovim, 1988, p. 1000). This author also states that "The use of DSM-III-R seems likely to lead to a dramatic decline in the diagnosis and prevalence of bulimia" (Ben-Tovim, 1988, p. 1002).

Somewhat comparable comparisons have been reported by others. In one study of the definitions of schizophrenia for 532 inpatients treated and reevaluated 15 years later, the use of DSM-III-R reduced the number of patients diagnosed with schizophrenia by 10%. However, the DSM-III-diagnosed patients included and excluded by DSM-III-R did not differ in terms of demographic, premorbid, or long-term outcome characteristics. The authors of this report emphasized that in the absence of improved validity, frequent changes in diagnostic systems were likely to impede research progress (Fenton, McGlashan, & Heinssen, 1988). Zimmerman (1988) also expressed his doubts that new changes in such a short time would actually improve the practice of psychiatry.

A number of clinical researchers have published various critiques of some of the diagnostic categories listed in the new classification systems. Aronson (1987) stated that the definition of panic attack in DSM-III lacks precision and that the overlap with other disorders raises questions about what is a distinct psychiatric disorder. Leavitt and Tsuang (1988) reviewed the literature on schizoaffective disorder and concluded that "Until there is greater agreement on the criteria for and the meaning of schizoaffective disorder, reports on treatment results will not be generalizable" (p. 935). In DSM-IV, the criteria set for schizoaffective disorder "has been changed to focus on an uninterrupted episode of illness rather than on the lifetime pattern of symptoms" (American Psychiatric Association, 1994, p. 779).

Obviously, there are significant differences in validity and reliability of diagnosis among diagnostic categories. In recent years, some of the personality diagnoses such as narcissistic personality disorder or borderline personality disorder have been popularized by several psychoanalytically oriented clinicians and, in part due to this, have been included as distinct disorders in the official nomenclature. Although a definite diagnosis of a borderline condition has always seemed rather illogical to me, apparently it is no problem to many people. However, as some have noted, the category of borderline personality disorder has been used to include a variety of pathological behaviors. "Exhibiting almost all of the clinical attributes known to descriptive psychopathology, borderline conditions lend themselves to a simplistic, if not perverse, form of diagnostic logic, that is, patients who display a potpourri of clinical indices, especially where symptomatic relationships are unclear or seem inconsistent, must perforce be borderlines" (Millon, 1988).

The use of a multiaxial system has also led to an increase in what has been termed "comorbidity," having two or more concurrent diagnoses. Axis I in DSM-IV, Clinical Disorders, contains most of the more traditional psychiatric diagnoses, whereas Axis II includes only Personality Disorders and Mental Retardation. In DSM-III, it was originally stated that "This separation ensures that consideration is given to the possible presence of disorders that are frequently overlooked when attention is directed to the usually more florid Axis I disorder" (American Psychiatric Association, 1980, p. 23). However, it appears that personality disorders are diagnosed quite frequently as either a primary diagnosis or as a secondary diagnosis. In a recent critical appraisal of the use of the terms of comorbidity or comorbid in psychopathology research, Lilienfeld, Waldman, & Israel (1994) indicated the growth in their use. After appearing only twice in 1986, "the number of journal abstracts or titles containing these terms increased to 21 in 1987, 43 in 1988, 97 in 1989, 147 in 1990, 192 in 1991, 191 in 1992, and 243 in 1993" (Lilienfeld et al., 1994, p. 71). This trend raises some question concerning the use of traditional views of medical diagnosis in psychopathology and "implicitly assumes a categorical model of diagnosis that may be inappropriate for personality disorders ..." (Lilienfeld et al., 1994, p. 79).

In a study of more than 200 adults at risk of AIDS, multiple diagnoses of personality disorder were recorded for most individuals with any DSM-III-R Axis II diagnosis. Almost half of the subjects with a diagnosis in one personality cluster also had a concurrent diagnosis in another cluster (Jacobsberg, Frances, & Perry, 1995). A study of the comorbidity of alcoholism and personality disorders in a clinical population of 366 patients also obtained comparable findings. There was extensive overlap between Axis I disorders and personality disorders, as well as among personality disorders themselves (Morgenstern, Langenbucher, Labouvie, & Miller, 1997). In another study of 118 gay men conducted to investigate the stability of personality disorder, it was reported that diagnoses of personality disorders had low stability over a 2-year period (Johnson et al., 1997). A study of seventy-eight adult outpatients with attention deficit hyperactivity disorder evaluated by standard tests showed high comorbidity with current depressive disorder, antisocial personality disorder, and alcohol and drug abuse dependence (Downey, Stanton, Pomerleau, & Giordani, 1997). In a sample of 716 opioid abusers, psychiatric comorbidity was documented in 47% of the sample based on a DSM-III-R diagnostic assessment. Such comorbidity was especially noted for personality and mood disorders for both sexes (Brooner, King, Kidorf, Schmidt, & Bigelow, 1997). There have been additional studies published on this issue, but there is no need to review them here. As Robins (1994) commented, "When standardized interviews demonstrate that a single patient qualifies for an unreasonable number of diagnoses, that should motivate the field to rethink this proliferation of categories" (p. 94).

Thus, the categorical delineation of psychiatric disorders presents problems for meaningful diagnosis. Dumont (1984), for example, feels strongly that psychiatry errs in attempting to divide all abnormal behavior into discrete illness categories. He believes that labels such as "hyperactivity" and "learning disorders" for children "are a capricious and arbitrary drawing of lines through a spectrum of behavioral, intellectual, emotional, and social disabilities" (Dumont, 1984, p. 326). Marmor (1983) in discussing systems thinking in psychiatry also emphasizes "that the growing tendency to think in terms of distinct and sharply demarcated phenomenological entities, as exemplified in DSM-III, deserves some skeptical evaluation despite its usefulness pragmatically" (p. 834). Although categorization may be simpler to handle than a dimensional approach, a combination of the two suggested by Lorr (1986) and by McReynolds (1989) may be more meaningful in the long run and provide more information as well as potentially greater predictive power. Others also suggest that the use of both psychometric and fixed diagnostic criteria could lead to a better and more valid definition of schizophrenia (Moldin, Gottesman, & Erlenmeyer-Kimling, 1987). However, despite such criticisms, as well as others (Carson, 1996, 1997; Follette & Houts, 1996; Millon, 1991; Sarbin, 1997), the neo-Kraepelinian model has continued to be used.

