Investigating subsumption in SNOMED CT: An exploration into large Description Logic-based biomedical terminologies

Olivier Bodenreider a, Barry Smith b,c, Anand Kumar b, Anita Burgun d

a U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD, USA, olivier@nlm.nih.gov
b Institute for Formal Ontology and Medical Information Science, Saarland University, Germany
c Department of Philosophy, University at Buffalo, New York, USA
d EA 3888 Laboratoire d'Informatique Médicale, Université de Rennes I, France

Abstract

Formalisms based on one or another flavor of Description Logic (DL) are sometimes put forward as helping to ensure that terminologies and controlled vocabularies comply with sound ontological principles. The objective of this paper is to study the degree to which one DL-based biomedical terminology (SNOMED CT) does indeed comply with such principles. We defined seven ontological principles (for example: each class must have at least one parent, each class must differ from its parent) and examined the properties of SNOMED CT classes with respect to these principles. Our major results are: 31% of the classes with children have a single child; 27% of classes have multiple parents; 51% of parent-child relations do not exhibit any differentiae between the description of the parent and that of the child. The applications of this study to quality assurance for ontologies are discussed and suggestions are made for dealing with the phenomenon of multiple inheritance. The advantages and limitations of our approach are also discussed.

Keywords: biomedical ontologies, SNOMED CT, description logics, ontological analysis.

1. Introduction

Biomedical terminologies and ontologies are increasingly taking advantage of Description Logic (DL)-based formalisms in representing knowledge. GALEN (footnote 1) and SNOMED Clinical Terms® (in what follows SNOMED CT) (footnote 2) were both developed in a native DL formalism.
Several other groups have worked at converting existing terminologies into terminologies with a DL formalism, including the UMLS® Metathesaurus® [1-3] and Semantic Network [4], the Medical Subject Headings (MeSH) [5], the Gene Ontology™ [6] and the National Cancer Institute Thesaurus [7]. The Web Ontology Language (OWL) plug-in developed for the ontology editor Protégé now also allows developers of frame-based resources to export their ontologies into DL formalism.

Footnote 1: http://www.opengalen.org/
Footnote 2: http://www.snomed.org/snomedct_txt.html

Submitted to Artificial Intelligence in Medicine (Special issue on Formal Biomedical Knowledge Representation). Draft material - please do not cite. Artificial Intelligence in Medicine, 2007, 39, 183-195. PMC2442845

The validation of an ontology by a DL-based classifier serves to ensure compliance with certain rules of classification (e.g., absence of terminological cycles), and it also brings other benefits in terms of coherence checking and query optimization [8,9]. However, neither a DL formalism nor the use of a classifier can ensure compliance with all principles of a sound ontology [10].

The objective of this paper is to study the degree to which one DL-based biomedical terminology complies with a basic set of ontological principles. We selected SNOMED CT as the target for this evaluation because it is the most comprehensive biomedical terminology recently developed in native DL formalism. Another reason for our choice is that SNOMED CT is now available as part of the UMLS (footnote 3) at no charge for UMLS licensees in the U.S. It is therefore likely to become widely used in medical information systems.

The paper is organized as follows. We first define a limited number of basic ontological principles with which biomedical ontologies are expected to be compliant. (These are in effect principles of good classification.)
We then give a brief description of SNOMED CT, we present the methods used to test the compliance of SNOMED CT with these principles, and we summarize our results. Finally, we discuss the application of this method to quality assurance in ontologies and terminologies in general, laying special emphasis on the role of creating partitions in ontologies. The advantages and limitations of our approach are also discussed.

Footnote 3: http://umlsinfo.nlm.nih.gov/

2. Background

2.1. Terms, classes, and instances

We shall refer to the nodes in SNOMED CT not as concepts but rather on the one hand as terms (where we are interested in the hierarchy itself, as a syntactic structure), and on the other hand as classes (where we are interested in the biological entities to which these terms refer). It is classes, not concepts, which stand in IS A, PART OF and similar relations in biomedical ontologies. Classes have instances. In the biomedical domain, instances are generally represented in health information systems (e.g., electronic patient records) or in reports of biomedical experiments (e.g., in the form of microarray data), while biomedical terminologies and ontologies are focused on what is general, on classes and their relations.

2.2. Relations among classes

The possible relations of class A to class B which are relevant to our purposes here are defined in Table 1. A is the root of a given taxonomy if and only if every class in the taxonomy is a child of A; conversely, A is a leaf of a given taxonomy if and only if A has no children.

2.3. Principles of classification

Scientific classification has evolved from Aristotle to Linnaeus to the large and varied classifications of modern times. Along the way, classification principles were elaborated.
One such principle, resulting from the use of a unique fundamentum divisionis or single classificatory principle in differentiating the species of each successive genus, is that subclasses be mutually exclusive and jointly exhaustive [11]. Some other highly general organization and classification principles – which we believe rest on a wide consensus among those working on terminologies in biomedicine and elsewhere [12,13] – are:

• Each hierarchy must have a single root.
• Each class (except for the root) must have at least one parent.
• Non-leaf classes must have at least two children.
• Each class must differ from each other class in its definition. In particular: each child must differ from its parent and siblings must differ from one another.

2.4. Principles of subsumption

Principles can also be derived from the study of the way subsumption is in fact treated in biomedical terminologies and ontologies. As noted by Bernauer [14], two major types of difference can be observed between a parent and its child: the introduction in the child of a new "criterion" (introduction of a role in DL parlance), and the refinement of an already existing criterion (corresponding to DL's refinement of a role value (footnote 4)). For example, the introduction of the role CAUSATIVE AGENT with value Infectious agent explains the subsumption relation of Meningitis to Infective meningitis. Similarly, the subsumption relation of Infective meningitis to Viral meningitis is explained by the refinement of the role value for CAUSATIVE AGENT since Infectious agent subsumes Virus.

Footnote 4: Also called role filler in DL parlance.

Such refinement can be a matter of specialization as in the previous example, where the role value for the parent is more generic than that for the child. Less frequently, partitive refinement can occur.
For example, Neuropathy subsumes Peripheral motor neuropathy because the value in the parent of the role FINDING SITE (Nerve structure) includes as part the corresponding value in the child (Peripheral motor neuron).

The following inheritance principle is standardly taken for granted in work on ontologies and terminologies:

• If A is a child of B then all properties of B are also properties of A.

As a corollary, and assuming that A and B are distinct, we have the principle:

• No cycles are allowed in an IS A hierarchy.

Additionally, one inheritance principle based on Bernauer's approach to subsumption can be expressed as follows:

• All roles of a parent class must either be inherited by each child or refined in the child.

This principle can also be formulated from the perspective of the child as follows:

• Differentiae between child and parent should in every case result solely from either the refinement of the value of a common role or the introduction of a new role.

2.5. Single vs. multiple inheritance

Some of the principles presented above enjoy a large degree of consensus (e.g., that each class must have at least one parent is needed if a terminology is to have a proper hierarchical structure). Others, however, still spur debate among terminology developers. This is the case in regard to the issue of single vs. multiple inheritance, i.e., of whether classes should be allowed to have more than one parent. As noted by Cimino [15]: "There seems to be almost universal agreement that controlled medical vocabularies should have hierarchical arrangements. [...] There is some disagreement, however, as to whether concepts should be classified according to a single taxonomy (strict hierarchy) or if multiple classifications (polyhierarchy) can be allowed." While it is beyond the scope of this paper to argue for or against multiple inheritance, we will make some suggestions for dealing with this issue in the discussion.

3. Materials

SNOMED CT was formed by the convergence of SNOMED RT and Clinical Terms Version 3 (formerly known as the Read Codes). The version used in this study (January 31, 2004) contains 269,864 classes (footnote 5), named by 407,510 names (footnote 6). The first level is subdivided into eighteen classes listed in Table 2 with their frequency distribution.

Each SNOMED CT class has a description (footnote 7) consisting of a variable number of elements. For example, the class Viral meningitis has a unique identifier (58170007), two parents (Infective meningitis and Viral infections of the central nervous system), and several names (Viral meningitis, Abacterial meningitis, and Aseptic meningitis, viral). The roles present in the description of this class are listed in Table 3.

In addition to a unique identifier, each class is assigned a unique, fully specified name consisting of a regular name suffixed (in parentheses) with a reference to what SNOMED CT calls the "primary hierarchy" of the class, the latter corresponding roughly to one of the top-level classes in the hierarchy. The list and frequency distribution of the primary hierarchies in SNOMED CT are presented in Table 4, along with their corresponding top-level classes. For example, the fully specified name for Viral meningitis is Viral meningitis (disorder) (footnote 8). This assignment to a primary hierarchy is not explicitly recognized as a property of the class in the SNOMED CT representation. However, because the corresponding high-level category can be easily extracted from the fully specified name of the class, we found it useful to use it for purposes of categorizing SNOMED CT classes. We use sans serif font to distinguish category names. Thus for example we use disorder as the category for Viral meningitis.

Inheritance in SNOMED CT is indicated by the presence of IS A relationships among classes. For example, the class Fracture of calcaneus subsumes two classes (Closed fracture of calcaneus and Open fracture of calcaneus).
The difference between the descriptions of the classes Fracture of calcaneus and Open fracture of calcaneus lies in the presence of a specialized value for the role ASSOCIATED MORPHOLOGY in the child (Fracture, open (footnote 9)) compared to that of the parent (Fracture). Also of note, the class Fracture subsumes Fracture, open. The refinement of the value of the role ASSOCIATED MORPHOLOGY between the two classes constitutes the differentia, while the other roles are all inherited from the parent class.

Footnote 5: SNOMED CT has a total of 357,135 classes, of which 269,864 are "current".
Footnote 6: Among the 957,349 names in SNOMED CT, 407,510 correspond to the 269,864 "current" classes, excluding fully specified names and keeping only names whose status is "current".
Footnote 7: Throughout this paper, we use 'description' with the common meaning that is also standard in the DL context, i.e., to refer to the list of properties of a given class (more precisely: of its instances), expressed by roles. In SNOMED CT parlance, however, a description corresponds to a name for a class.
Footnote 8: The primary hierarchy for Viral meningitis is Clinical finding, while the category mentioned in parentheses in the fully specified name is disorder.
Footnote 9: Despite similarities in their names, Fracture, open (morphologic abnormality) and Open fracture (disorder) are distinct classes in SNOMED CT.

4. Methods

The methods presented below were developed for testing the compliance of SNOMED CT with the seven principles listed in Table 5.

4.1. Quantitative analysis: Number of children, parents and roots

By simply counting the number of parents and children for each class, we verify the degree of compliance with P1, P2, and P3. Additionally, the existence of a path between each class and the eighteen top-level classes is tested by traversing the graph of all classes in SNOMED CT from each class upwards. We use this method for verifying P4. As illustrated in Figure 1, the top-level class subsuming Viral meningitis is Clinical finding.

4.2. Qualitative analysis of differentiae

In order to verify SNOMED CT's compliance with P5, we analyze the differentiae in pairs of parent-child classes by comparing the roles and role values for each class in the pair. First, we verify that at least one role or one role value is present in the description of the child but not in that of the parent. The second step consists in examining the roles shared by the two classes and those specific to each class. All roles of the parent are searched for in the description of the child in order to verify compliance with P6. The relationship between the values of a role shared by the parent and child classes is examined and, when the values differ, is expected to be either specialization (IS A) or partitive refinement (PART OF). The presence of roles specific to the child is also examined. The number of differentiae (i.e., the number of role values refined and of roles introduced in the child) is recorded. This step is used to verify P7.

5. Results

5.1. Quantitative analysis: Number of children, parents and roots

5.1.1. Number of children

The number of children per class ranges from 0 to 2532. The frequency distribution of the number of children is presented in Figure 2. 196,237 classes (73%) have no children. These classes are leaf nodes in the SNOMED CT hierarchy. Examples of such classes include the substance Tartrate dehydratase, the finding Anuria, the organism Trypanosoma evansi, and the body structure Upper left third premolar tooth.

Out of 73,627 classes with children, 23,174 classes (31.5%) have a single child. As shown in Table 6, this proportion is relatively constant across SNOMED CT categories.
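The counting and upward traversal described in Section 4.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the `ROOT` identifier, the function names, and the toy `IS_A` pairs are hypothetical, and the class names merely echo examples from the text.

```python
from collections import defaultdict

# Toy (child, parent) IS A pairs. Illustrative only: the class names echo
# examples from the text, but this is not actual SNOMED CT content.
ROOT = "ROOT"  # hypothetical identifier for the single root
IS_A = [
    ("Clinical finding", ROOT),
    ("Meningitis", "Clinical finding"),
    ("Infective meningitis", "Meningitis"),
    ("Viral infection of the central nervous system", "Clinical finding"),
    ("Viral meningitis", "Infective meningitis"),
    ("Viral meningitis", "Viral infection of the central nervous system"),
]


def build_maps(pairs):
    """Index IS A pairs as class -> parents and class -> children."""
    parents, children = defaultdict(set), defaultdict(set)
    for child, parent in pairs:
        parents[child].add(parent)
        children[parent].add(child)
    return parents, children


def top_level_ancestors(cls, parents, root=ROOT):
    """Traverse upwards from cls and return the classes directly under root."""
    tops, seen, stack = set(), set(), [cls]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        node_parents = parents.get(node, set())
        if root in node_parents:
            tops.add(node)
        stack.extend(node_parents)
    return tops


def audit(pairs, root=ROOT):
    """Count-based checks: classes without parent, with one child, with several parents."""
    parents, children = build_maps(pairs)
    classes = (set(parents) | set(children)) - {root}
    return {
        "no_parent": sorted(c for c in classes if not parents.get(c)),
        "single_child": sorted(c for c in classes if len(children.get(c, ())) == 1),
        "multiple_parents": sorted(c for c in classes if len(parents.get(c, ())) > 1),
        "top_level": {c: sorted(top_level_ancestors(c, parents, root)) for c in sorted(classes)},
    }


report = audit(IS_A)
print(report["single_child"])
print(report["multiple_parents"])
print(report["top_level"]["Viral meningitis"])
```

In this toy hierarchy the only class with multiple parents is Viral meningitis, and the upward traversal links it to a single top-level class, Clinical finding, which mirrors the situation illustrated in Figure 1.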
Examples of classes with a single child include {Cervical secretion sample, child: Cervical mucus specimen} (specimen), {Deferoxamine, child: Deferoxamine mesylate} (substance), {Multiple polyps, child: Multiple adenomatous polyps} (morphologic abnormality), and {Referral to general medical service, child: General medical self-referral} (procedure).

8,034 classes (11%) have ten children or more and 150 have more than 99 children. The median number of children is 2. Examples of classes with a large number of children include Infectious gastroenteritis (10 children), Operation on heart valve (25 children), Sodium compound (51 children), and Disorder of eye proper (100 children). Some classes have an unusually large number of children, including Veterinary proprietary drug AND/OR biological (2532 children), Biochemical test (996 children), the substance Oxidoreductase (580 children), the organism Bos taurus (551 children), and Congenital malformation (505 children). Although these classes often correspond to large collections of drugs, tests, or disorders, the large number of children in these classes may point to issues such as a lack of organization or incomplete descriptions.

5.1.2. Number of parents

Except for the root, every class of SNOMED CT has at least one parent. The number of parents per class ranges from 1 to 13. (The three classes with 13 parents are Anoscopy with coagulation for control of hemorrhage of mucosal lesion, Mandibuloacral dysostosis, and Entire sternocleidomastoid muscle.) The frequency distribution of the number of parents is presented in Figure 3. 195,053 classes (72.3%) have a single parent, 53,517 classes (19.8%) have two parents, 13,969 classes (5.2%) have three, 4,692 classes (1.7%) have four, and 2,632 classes (1.0%) have five or more.

Overall, the proportion of classes having multiple parents, i.e., exhibiting multiple inheritance, is 27.7%.
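As a check on this figure, the proportion can be recomputed from the frequency counts just given, taking as denominator the 269,863 classes that have at least one parent (i.e., all 269,864 classes except the root):

```latex
\frac{53{,}517 + 13{,}969 + 4{,}692 + 2{,}632}
     {195{,}053 + 53{,}517 + 13{,}969 + 4{,}692 + 2{,}632}
= \frac{74{,}810}{269{,}863} \approx 0.277
```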
As shown in Table 6, this proportion tends to be higher in some categories (e.g., around 45% for body structure, disorder, and procedure) and lower in others (e.g., around 5-17% for cell, organism, and substance).

5.1.3. Number of roots

Except for the root and for the eighteen top-level classes themselves, each class of SNOMED CT can be linked hierarchically to exactly one top-level class. This means that SNOMED CT consists of eighteen independent hierarchies.

5.2. Qualitative analysis of differentiae

5.2.1. Existence of a differentia between parent and child

Out of the 377,681 parent-child relations examined, 193,957 (51%) do not exhibit any differentiae between the description of the parent and that of the child. However, as shown in Table 6, the presence or absence of differentiae in children varies considerably across categories. In most categories – including geographical location, organism, and substance – no differentiae are ever mentioned. In the other categories, the proportion of children exhibiting differentiae in their description ranges from 29% (cell) to 86% (specimen).

5.2.2. Number and nature of differentiae

When there does exist a differentia between a child and its parent, i.e., when their descriptions are not identical, the difference in the descriptions can affect one role or multiple roles, and one or more values within each role.

Single differentia. Out of the 183,724 parent-child relations where there is at least one differentia between the child and its parent, 102,426 (56%) exhibit exactly one differentia. For example, the classes Fracture of calcaneus and Open fracture of calcaneus presented earlier differ only by the value of their common role ASSOCIATED MORPHOLOGY. In 60% of the cases, the differentia comes from the refinement of the value for a given role; in 40% of the cases, it comes from the introduction of a new role in the child.
The example above (Fracture of calcaneus) illustrates the refinement (from Fracture to Fracture, open) of the role ASSOCIATED MORPHOLOGY. Conversely, the introduction of the role FINDING SITE (with value Ear structure) differentiates the class Otitis from its parent Inflammatory disorder.

Multiple differentiae. In case of multiple differentiae, the differentiae involved reflect the introduction of several roles (34%), the refinement of several values (20%), or the combination of introducing at least one role and refining at least one value (46%). For example, as illustrated in Figure 4, Endoscopy of jejunum differs from Procedure on jejunum by 1) the introduction of two roles (METHOD, with value Inspection – action, and ACCESS INSTRUMENT, with value Endoscope, device) and 2) the refinement of the role ACCESS (from Surgical access values to Endoscopic approach – access). Multiple differentiae are often associated with multiple inheritance. In the example above, the role METHOD is actually inherited from Gastrointestinal investigation, the second parent of Endoscopy of jejunum, and its value refined from Evaluation – action to Inspection – action. The role ACCESS INSTRUMENT, however, is truly specific to Endoscopy of jejunum (i.e., not present in any of its parents).

Our analysis of differentiae reveals a number of other potentially problematic issues. In 7,226 cases, some role or value present in the parent is not inherited or refined in the child. For example, the role ONSET has two possible values in the class Subjective visual disturbance (Sudden onset and Gradual onset), of which Gradual onset is not inherited by its child class Sudden visual loss. The role ONSET – called a qualifier in SNOMED CT – is involved in roughly half of the cases where some role is specific to a parent class, but eleven other roles are also involved in this phenomenon.
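The comparison of parent and child descriptions discussed above can be sketched as follows. This is an illustrative sketch, not the procedure actually used in the study. The data structures (`DESCRIPTIONS`, `BROADER`) and function names are hypothetical, the toy descriptions contain only the roles and values mentioned in the text (so they are deliberately incomplete), and only specialization (IS A) among role values is modeled; partitive refinement (PART OF) would need a second map of the same shape.

```python
# Hypothetical toy data: class -> {role: set of values}. Only roles and
# values mentioned in the text are included; real descriptions are richer.
DESCRIPTIONS = {
    "Fracture of calcaneus": {"ASSOCIATED MORPHOLOGY": {"Fracture"}},
    "Open fracture of calcaneus": {"ASSOCIATED MORPHOLOGY": {"Fracture, open"}},
    "Inflammatory disorder": {},
    "Otitis": {"FINDING SITE": {"Ear structure"}},
    "Subjective visual disturbance": {"ONSET": {"Sudden onset", "Gradual onset"}},
    "Sudden visual loss": {"ONSET": {"Sudden onset"}},
}

# Value hierarchy: value -> directly broader values (IS A among role values).
BROADER = {"Fracture, open": {"Fracture"}}


def is_narrower(value, ancestor, broader=BROADER):
    """True if ancestor can be reached from value by following broader links."""
    stack, seen = list(broader.get(value, ())), set()
    while stack:
        v = stack.pop()
        if v == ancestor:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(broader.get(v, ()))
    return False


def compare(parent, child, descriptions=DESCRIPTIONS):
    """Classify the differences between a parent and a child description."""
    p, c = descriptions[parent], descriptions[child]
    result = {"introduced": [], "refined": [], "unrelated": [], "dropped": []}
    # Roles present only in the child: introduction of a new role.
    result["introduced"] = sorted(set(c) - set(p))
    # Roles present only in the parent: neither inherited nor refined.
    for role in sorted(set(p) - set(c)):
        result["dropped"].extend((role, v) for v in sorted(p[role]))
    for role in sorted(set(p) & set(c)):
        # Child values absent from the parent: refinement, or no hierarchical link.
        for v in sorted(c[role] - p[role]):
            refined = any(is_narrower(v, pv) for pv in p[role])
            result["refined" if refined else "unrelated"].append((role, v))
        # Parent values that the child neither keeps nor refines.
        for pv in sorted(p[role] - c[role]):
            if not any(is_narrower(v, pv) for v in c[role]):
                result["dropped"].append((role, pv))
    return result


print(compare("Fracture of calcaneus", "Open fracture of calcaneus"))
print(compare("Inflammatory disorder", "Otitis"))
print(compare("Subjective visual disturbance", "Sudden visual loss"))
```

Under these assumptions, the number of differentiae for a pair is the number of entries under `introduced` plus those under `refined`; a pair with no entries in any list has identical descriptions, and entries under `dropped` or `unrelated` flag the potentially problematic cases.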
In 21,799 cases, although the parent and child classes share a role, the values of this role are neither identical (inherited by the child from the parent) nor such as to stand in any taxonomic relation (with the specialized value in the child) or meronomic relation (with the part in the child). For example, as illustrated in Figure 5, the class Diabetic retinopathy and its child Diabetic retinal microaneurysm share the role FINDING SITE, but their values for this role (Retinal structure for the parent and Visual pathway structure and Structure of retinal artery for the child) do not stand in a hierarchical relation. Typically, this problem is associated with multiple inheritance. The role value which does not stand in hierarchical relation with corresponding role values in one parent most often does in one of its other parents. In the example above, Retinal structure and Structure of retinal artery are actually inherited from Retinal microaneurysm, the other parent of Diabetic retinal microaneurysm.

6. Discussion

The work described in this paper is in the tradition of studies auditing large medical terminologies such as [16]. SNOMED CT itself has recently been investigated for inconsistencies and related types of errors [10,17]. However, we are interested here not in errors and inconsistencies in general but rather, more positively, in the question of compliance of the terminological structure with general classification principles. We found SNOMED CT to be fully compliant with principles such as each class must have at least one parent and each hierarchy must have a single root. In contrast, we observed non-compliance with many other principles, and we will present the consequences of such non-compliance together with a discussion of the advantages and limitations of our approach. Finally, we will revisit the problem of single vs. multiple inheritance and outline a possible solution thereto.

6.1. Application to quality assurance for ontologies

6.1.1. Classes with a single child

The recognition by biologists of the phylum Chordata rests on the distinction of several subphyla: Vertebrata (or Vertebrates), Cephalochordata, and Urochordata. Compared to Vertebrates, the latter two might be of lesser relevance to clinical medicine. However, Vertebrates is defined in opposition to the two other subphyla and all three should therefore be represented in a well-formed ontology of organisms. Moreover, in a world in which Chordata had only one child, the distinction between parent and child would not be made by biologists. Therefore, the presence of classes with just one child is reason to suspect the presence of error.

The review of a limited number of such classes suggests the following possible issues. One is the incompleteness of the hierarchy (e.g., Subphylum Vertebrata is the only subphylum recorded in SNOMED CT for Phylum Chordata). Another issue is the presence of hybrid classes, resulting from the intersection of two parent classes and appearing as the single child of at least one of these (e.g., Closure of abdominothoracic fistula, hybrid child of Closure of fistula of thorax and Abdomen closure, and single child of Closure of fistula of thorax). Finally, the presence of redundant classes, where a parent and a child class bear no differences, can also be at the origin of the phenomenon of single child classes. This issue is discussed in detail in the next section.

Among the 23,174 single child classes, 12,928 (56%) have a single parent and therefore do not correspond to hybrid classes. Examples of such classes can be found in virtually every category and include the procedure Arthroscopy of toe (single child of Arthroscopy of foot), the disorder Periappendicitis (single child of Atypical appendicitis), and the substance Urine (single child of Urinary tract fluid). Except when they are the product of hybrid classes, classes with a single child should be reviewed.
For example, the classes Congenital absence of lobe of liver and its parent Congenital absence of liver do not look suspicious at first sight. However, knowing that Congenital absence of lobe of liver is the single child of Congenital absence of liver raises the question of a possible confusion between a total absence of the liver and an absence of liver whose degree on the partial/total axis is not specified. If Congenital absence of liver is treated as a total absence of liver (hypothesis 1), it cannot subsume the absence of a lobe of liver (partial absence). Therefore the subsumption link is inaccurate. Conversely, if Congenital absence of liver is treated as unspecified absence of liver (hypothesis 2), the degree of the absence – total or partial – is expected to be reflected in its children, and having only one child makes the description incomplete. In this particular case, SNOMED CT lists Congenital absence of liver, total as a synonym for Congenital absence of liver (hypothesis 1). Therefore, Congenital absence of liver cannot subsume Congenital absence of lobe of liver.

6.1.2. Absence of difference in the description between children and parents

Beyond hierarchy, one of the major reasons for interest in DL-based systems is that they promise to make detailed descriptions for each class available for use by formal reasoning tools, representing through roles the class's defining characteristics. However, DL systems can also accommodate classes with minimal descriptions (i.e., restricted to bare subsumption links). We reviewed a small number of classes (in the domain of disorders) for which no difference was provided between the parent and the child in terms of roles or role values. The major issue brought to light by this limited analysis is the incompleteness of many descriptions.
For example, while no difference is provided between the descriptions of Bullous lichen planus and Lichen planus, such a difference is provided for Bullous dermatosis (ASSOCIATED MORPHOLOGY with value Blister) and Skin lesion. In other cases, the representation of some characteristics seems to have been purposely omitted (e.g., COURSE for acute and subacute variants of diseases, although there exists a class Courses whose children include Acute and Subacute). Generally, morphologic distinctions seem better represented than physiological ones.

Also of note, some classes represent what are in fact mere collections (e.g., Extrapyramidal disease). These classes are defined in extension (i.e., via a list of their subclasses) rather than in intension (i.e., via a list of characteristics). Such extensional definitions are less desirable for a number of reasons, including: (1) they imply an unsatisfactory heterogeneity in the classification; (2) they imply missing information, which is not available, e.g., for automatic information extraction, and whose absence also creates obstacles to correct coding (why are these subclasses grouped together in this way?); (3) they imply the need for revisions with each discovery of new types of cases.

Finally, in some cases, there is actually no difference between the parent and the child class (e.g., Closed fracture of skull without intracranial injury vs. Closed fracture of skull). The issue, in this case, is the presence of two terms naming two distinct classes in SNOMED CT for one and the same entity in reality. The distinction lies not on the side of the biomedical entities these terms represent (i.e., the skull is fractured, but not open), but rather merely in the associated knowledge on the part of the physician (that intracranial injuries might be associated with such fractures). In other words, this distinction is epistemological in nature and, arguably, should not be represented in an ontology [18].
It would be a valuable extension of the current DL in SNOMED CT if ways could be found to do justice to operators, such as 'with' and 'without,' which are characteristic of such epistemologically motivated admixtures and which play an important role in the organization of SNOMED CT's term hierarchy. As things stand, the information conveyed by such operators is not accessible in ways which would support reasoning with terminological knowledge in medicine. This means that in this respect, too, much of the information conveyed by the compositional structure of SNOMED CT's terms is at the moment not available for automatic retrieval.

6.1.3. Presence of roles specific to the parent class

In most of the cases we examined, the presence in a parent's description of roles not inherited by its children has to do with the representation of specialization in DL-based structures. As noted earlier, Subjective visual disturbance is described as being such that it can have either a Sudden onset or a Gradual onset. However, the only valid onset for its child Sudden visual loss is Sudden onset. Therefore, Sudden visual loss can be seen as a specialization of Subjective visual disturbance. This could be represented in DL form by '∀(HAS-ONSET Onsets)' for Subjective visual disturbance and '∃(HAS-ONSET Sudden onset)' for Sudden visual loss [19].

6.2. Advantages and limitations

The principles presented in this study are simple, and assessing the degree to which SNOMED CT complies with them can be easily implemented. Although a Description Logic (DL) was used in its development, SNOMED CT is not distributed through the UMLS in a way which would allow users to perform automatic classification by appealing to the DL structure. Instead, SNOMED CT classes appear as regular Metathesaurus concepts. Source transparency in the UMLS allows users to extract SNOMED CT information in the form of triples for relations, e.g., (Viral meningitis, IS A, Infective meningitis).
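To make the triple-based view concrete, the sketch below indexes such triples into a hierarchy (IS A links) and a set of role descriptions (all other relations). It is a minimal illustration under stated assumptions: the function name and the in-memory tuple format are hypothetical and do not reflect the actual UMLS distribution files, and the role triples are toy data based on the examples given in Section 2.4.

```python
from collections import defaultdict

# Toy (subject, relation, object) triples based on examples from the text.
# The tuple layout is an assumption, not the UMLS file format.
TRIPLES = [
    ("Viral meningitis", "IS A", "Infective meningitis"),
    ("Viral meningitis", "IS A", "Viral infections of the central nervous system"),
    ("Viral meningitis", "CAUSATIVE AGENT", "Virus"),
    ("Infective meningitis", "IS A", "Meningitis"),
    ("Infective meningitis", "CAUSATIVE AGENT", "Infectious agent"),
]


def index_triples(triples, is_a="IS A"):
    """Split triples into class -> parents and class -> {role: values}."""
    parents = defaultdict(set)
    descriptions = defaultdict(lambda: defaultdict(set))
    for subject, relation, obj in triples:
        if relation == is_a:
            parents[subject].add(obj)
        else:
            descriptions[subject][relation].add(obj)
    return parents, descriptions


parents, descriptions = index_triples(TRIPLES)
print(sorted(parents["Viral meningitis"]))
print(dict(descriptions["Viral meningitis"]))
```

Once the triples are indexed in this way, both the counting of parents and children and the pairwise comparison of descriptions operate on plain dictionaries, with no reliance on a DL classifier.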
Although we investigated a terminology developed in a DL environment, our method did not rely on any DL-specific feature. Therefore, it would be applicable not only to other DL-based terminologies, but also to terminologies whose relations are represented as triples, provided that the description of the classes is sufficiently rich.

Compliance with the seven principles investigated in this study is no guarantee of complete ontological soundness. Non-compliance with the principles should be interpreted rather as indicative of possible problems and so used to trigger the review of the classes and relations involved by the editors of the terminology in the way described in [20]. In some cases, there is an indication of an error that is at best tenuous, e.g., when a relation is in compliance with one principle, but violates another principle. In the example presented earlier in the discussion, except for the fact that Congenital absence of lobe of liver is the single child of Congenital absence of liver, our method provides no indication that the latter represents a total absence and can therefore not subsume the former, which represents a partial absence of the liver. The values for the roles ASSOCIATED MORPHOLOGY and FINDING SITE in Congenital absence of lobe of liver do refine those of the corresponding roles in Congenital absence of liver. The only indication of a possible problem is given by the fact that Congenital absence of lobe of liver is the single child of Congenital absence of liver. Similarly, the existence of multiple differentiae between Endoscopy of jejunum and Gastrointestinal investigation (Figure 4) – namely the refinement of both ACCESS and PROCEDURE SITE roles – should raise the possibility of a missing intermediary class or a missing subsumption link.
For example, although the duodenum and the jejunum are adjacent segments of the small intestine, Duodenoscopy is linked to Gastrointestinal investigation through three intermediary classes (Enteroscopy, Endoscopy of intestine, Gastrointestinal tract endoscopy), while the link is direct for Endoscopy of jejunum. A careful review of these classes and their relations is required to identify issues such as inaccurate subsumption links and missing intermediary classes. In the two examples above, the review could have been prompted by failure to comply with the principle that no class should have a single child or by the presence of several differentiae between a parent and its child.

Conversely, some of our principles may be too strict and may benefit from relaxation in some circumstances. More precisely, they may be refined in order to exploit implicit information. The principle of single differentia between a child and its parent, for example, rests on the assumption that roles are independent, which is not always the case. Although not explicitly related, the roles ACCESS (Endoscopic approach – access) and ACCESS INSTRUMENT (Endoscope, device) are indeed not independent. This explains in part why, as illustrated in Figure 4, there are several differentiae related to the endoscope between Endoscopy of jejunum and Gastrointestinal investigation: the introduction of ACCESS INSTRUMENT with value Endoscope, device accompanies the refinement of the value of ACCESS from Surgical access values to Endoscopic approach – access.

6.3. Characterizing inheritance

The uncontrolled use of IS A to signify a variety of different sorts of relations (including PART OF, IS AN INSTANCE OF, and so on) results in what Guarino has called 'IS A overloading', which is often associated in turn with examples of incorrect subsumption [21].
Examples of this phenomenon in SNOMED CT include Both testes IS A Testis Structure, Deferoxamine mesylate IS A Deferoxamine, and Urine sediment IS A Urine. IS A overloading, which is often associated with multiple inheritance, may be alleviated by making explicit which sort of subsumption link is involved in each specific type of case – for example by replacing IS A as it occurs between Viral meningitis and Infective meningitis with IS A_AGENT and as it occurs between Viral meningitis and Viral infection of the central nervous system with IS A_SITE. The use of such explicit subsumption links also enables a large taxonomy such as SNOMED CT to be divided into partitions within and between which taxonomic reasoning can be more reliably performed [8]. Through a locative partition, for example, which we can think of as a window or view on reality with a specific type of focus, Viral meningitis would appear in its locative guise, as a Viral infection of the central nervous system, and inferences could be performed safely along the IS A_SITE relationship within this partition. Analogously, in a causative partition, Viral meningitis would be linked to Infective meningitis and subsumption could be performed safely along the IS A_AGENT relationship. The locative and causative partitions would then yield complementary views of different aspects of one and the same reality. This view is illustrated in Figure 6, and the underlying formal theory is presented in [22].

7. Conclusions

SNOMED CT is the most comprehensive biomedical terminology recently developed in native DL formalism and it is expected to play an important role in clinical information systems in the future. Unlike thesauri built for information retrieval purposes, SNOMED CT should enable reasoning about biomedical classes and relations of a sort which can support intelligent retrieval of biomedical information.
We have listed some principles, mostly related to classification, and tested the degree to which SNOMED CT complies with them. While SNOMED CT appears to be more coherent than many other terminologies, we also found the description of many of its classes to be minimal or incomplete, with possible detrimental consequences for inheritance.

Description logics provide formalisms suitable for representing many features of a variety of different domains – including the biomedical domain – in ways that can support automatic reasoning and information retrieval. In and of themselves, however, DLs do not systematically ensure compliance with the principles of classification required if reasoning is to be performed accurately. More than the use of any formalism, we believe that compliance with sound ontological principles is what guarantees the accuracy of reasoning.

Acknowledgements

Smith and Kumar are supported by the Wolfgang Paul Program of the Alexander von Humboldt Foundation, by the EU FP6 Network of Excellence "Semantic Datamining" and by the Volkswagen Foundation project "Forms of Life".

References

[1] D.M. Pisanelli, A. Gangemi, G. Steve, An ontological analysis of the UMLS Methathesaurus, Proc AMIA Symp (1998) 810-814.
[2] R. Cornet, A. Abu-Hanna, Usability of expressive description logics--a case study in UMLS, Proc AMIA Symp (2002) 180-184.
[3] U. Hahn, S. Schulz, Towards a broad-coverage biomedical ontology based on description logics, Pac Symp Biocomput (2003) 577-588.
[4] V. Kashyap, A. Borgida, Representing the UMLS Semantic Network using OWL: (Or "What's in a Semantic Web link?"), in: D. Fensel, K. Sycara, J. Mylopoulos (Eds.), The Semantic Web - ISWC 2003, Vol. 2870 (Springer-Verlag, Heidelberg, 2003) 1-16.
[5] L. Soualmia, C. Golbreich, S. Darmoni, Representing the MeSH in OWL: Towards a semi-automatic migration, Proceedings of the KR 2004 Workshop on Formal Biomedical Knowledge Representation (2004) 81-87.
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS//Vol-102/soualmia.pdf
[6] C.J. Wroe, R. Stevens, C.A. Goble, M. Ashburner, A methodology to migrate the gene ontology to a description logic environment using DAML+OIL, Pac Symp Biocomput (2003) 624-635.
[7] J. Golbeck, G. Fragoso, F. Hartel, J. Hendler, J. Oberthaler, B. Parsia, The National Cancer Institute's Thesaurus and Ontology, Journal of Web Semantics 1 (2003). http://www.websemanticsjournal.org/volume1/issue1/Golbecketal2003/index.html
[8] I. Horrocks, A. Rector, C. Goble, A Description Logic based schema for the classification of medical data, in: F. Baader, M. Buchheit, M.A. Jeusfeld, W. Nutt (Eds.), Proceedings of the 3rd Workshop KRDB'96 (1996) 24-28.
[9] R. Stevens, P. Baker, S. Bechhofer, G. Ng, A. Jacoby, N.W. Paton, C.A. Goble, A. Brass, TAMBIS: transparent access to multiple bioinformatics information sources, Bioinformatics 16 (2000) 184-185.
[10] W. Ceusters, B. Smith, J. Flanagan, Ontology and medical terminology: Why Description Logics are not enough, Proceedings of TEPR 2003 Towards an Electronic Patient Record. San Antonio, Texas, May 10-14, 2003 (2003) (CD-ROM publication).
[11] A. Marradi, Classification, Typology, Taxonomy, Quality & Quantity 24 (1990) 129-157.
[12] B. Smith, The Logic of Biological Classification and the Foundations of Biomedical Ontology, in: D. Westerståhl (Ed.), Invited Papers from the 10th International Conference in Logic Methodology and Philosophy of Science, Oviedo, Spain, 2003 (Elsevier-North-Holland, 2004) (to appear).
[13] J. Michael, J.L. Mejino, Jr., C. Rosse, The role of definitions in biomedical concept representation, Proc AMIA Symp (2001) 463-467.
[14] J. Bernauer, Subsumption principles underlying medical concept systems and their formal reconstruction, Proc Annu Symp Comput Appl Med Care (1994) 140-144.
[15] J.J. Cimino, Desiderata for controlled medical vocabularies in the twenty-first century, Methods Inf Med 37 (1998) 394-403.
[16] J.J. Cimino, Auditing the Unified Medical Language System with semantic methods, J Am Med Inform Assoc 5 (1998) 41-51.
[17] W. Ceusters, B. Smith, A. Kumar, C. Dhaen, Ontology-Based Error Detection in SNOMED-CT, Proceedings of MEDINFO 2004 (2004) (to appear).
[18] O. Bodenreider, B. Smith, A. Burgun, The ontology-epistemology divide: A case study in medical terminology, Proceedings of the Third International Conference on Formal Ontology in Information Systems (FOIS 2004) (2004) (in press).
[19] A. Rector, Defaults, context, and knowledge: Alternatives for OWL-indexed knowledge bases, Pac Symp Biocomput (2004) 226-237.
[20] M.C. dos Santos, C. Dhaen, M. Fielding, W. Ceusters, Philosophical scrutiny for run-time support of application ontology development, Proceedings of the Third International Conference on Formal Ontology in Information Systems (FOIS 2004) (2004) (in press).
[21] N. Guarino, Some ontological principles for designing upper level lexical resources, in: A. Rubio, N. Gallardo, R. Castro, A. Tejada (Eds.), Proceedings of First International Conference on Language Resources and Evaluation. ELRA European Language Resources Association, Granada, Spain (1998) 527-534.
[22] T. Bittner, B. Smith, A theory of Granular Partitions, in: M. Duckham, M.F. Goodchild, M.F. Worboys (Eds.), Foundations of Geographic Information Science (Taylor & Francis, London, 2003) 117-151.

| Relation | Definition |
|---|---|
| A = B | A and B are the same entity (i.e., they have the same definition, and thus also the same family of instances at any given time) |
| A IS A B | 1. A and B are classes and 2. all instances of A are instances of B |
| A is a child of B | 1. A IS A B, 2. A ≠ B, and 3. if A IS A C and C IS A B then A = C or C = B |
| A and B are siblings | 1. there is some C of which A and B are both children and 2. A ≠ B |
| A is a parent of B | B is a child of A |
| C is a differentia of A with respect to B | 1. A IS A B, 2. A ≠ B, and 3. instances of A are marked out within the wider class B by the fact that they exemplify C |

Table 1 – Definition of the relations between classes A and B

| Top-level class | Frequency |
|---|---|
| Attribute | 991 |
| Body structure | 30,652 |
| Clinical finding | 95,605 |
| Context-dependent categories | 3,649 |
| Environments and geographical locations | 1,620 |
| Events | 87 |
| Observable entity | 7,274 |
| Organism | 25,026 |
| Pharmaceutical / biologic product | 16,867 |
| Physical force | 199 |
| Physical object | 4,201 |
| Procedure | 46,066 |
| Qualifier value | 8,134 |
| Social context | 4,896 |
| Special concept | 178 |
| Specimen | 1,053 |
| Staging and scales | 1,098 |
| Substance | 22,267 |

Table 2 – The eighteen top-level classes in SNOMED CT and their frequency distribution

| Role | Value |
|---|---|
| CAUSATIVE AGENT | Virus |
| ASSOCIATED MORPHOLOGY | Inflammation |
| FINDING SITE | Meninges structure |
| ONSET | Sudden onset; Gradual onset |
| SEVERITY | Severities |
| EPISODICITY | Episodicities |
| COURSE | Courses |

Table 3 – Roles present in the description of Viral meningitis

| Category | Freq. | Corresponding top-level class |
|---|---|---|
| administrative concept | 54 | Qualifier value |
| assessment scale | 870 | Staging and scales |
| attribute | 991 | Attribute |
| body structure | 25,395 | Body structure |
| cell | 603 | Body structure |
| cell structure | 501 | Body structure |
| context-dependent category | 3,649 | Context-dependent categories |
| disorder | 62,301 | Clinical finding |
| environment | 1,007 | Environments and geographical locations |
| environment / location | 1 | Environments and geographical locations |
| ethnic group | 254 | Social context |
| event | 87 | Events |
| finding | 33,304 | Clinical finding |
| geographic location | 612 | Environments and geographical locations |
| inactive concept | 7 | Special concept |
| life style | 21 | Social context |
| morphologic abnormality | 4,153 | Body structure |
| namespace concept | 5 | Special concept |
| navigational concept | 165 | Special concept |
| observable entity | 7,274 | Observable entity |
| occupation | 4,153 | Social context |
| organism | 25,026 | Organism |
| person | 302 | Social context |
| physical force | 199 | Physical force |
| physical object | 4,201 | Physical object |
| procedure | 42,782 | Procedure |
| product | 16,867 | Pharmaceutical / biologic product |
| qualifier value | 8,080 | Qualifier value |
| regime/therapy | 3,284 | Procedure |
| religion/philosophy | 145 | Social context |
| social concept | 21 | Social context |
| special concept | 1 | Special concept |
| specimen | 1,053 | Specimen |
| staging scale | 15 | Staging and scales |
| substance | 22,267 | Substance |
| tumor staging | 213 | Staging and scales |

Table 4 – The list of high-level categories ("primary hierarchies") in SNOMED CT with their frequency distribution and corresponding top-level class

| Principle | Statement |
|---|---|
| P1 | Each class must have at least one parent |
| P2 | Non-leaf classes must have at least two children |
| P3 | Children should have exactly one parent |
| P4 | Each hierarchy must have a single root |
| P5 | Each child's description must differ from its parent's description |
| P6 | All roles of a parent class must either be inherited by each child or refined in the child |
| P7 | Differentia from child to parent should uniquely result in every case either from refinement of the value of a common role or introduction of a new role |

Table 5 – Ontological principles studied in SNOMED CT

| Category | Children: Med | Children: Max | Children: % Mul | Parents: Med | Parents: Max | Parents: % Mul | Differentiae: None | Differentiae: Single | Differentiae: Mult. |
|---|---|---|---|---|---|---|---|---|---|
| administrative concept | 2 | 13 | 57.1% | 1 | 1 | 0.0% | 100.0% | 0.0% | 0.0% |
| assessment scale | 2 | 724 | 55.0% | 1 | 1 | 0.0% | 100.0% | 0.0% | 0.0% |
| attribute | 3 | 142 | 69.7% | 1 | 2 | 1.2% | 100.0% | 0.0% | 0.0% |
| body structure | 2 | 295 | 53.9% | 1 | 13 | 45.5% | 46.3% | 29.8% | 23.9% |
| cell | 3 | 206 | 75.0% | 1 | 3 | 16.7% | 71.4% | 21.8% | 6.8% |
| cell structure | 2 | 98 | 76.1% | 1 | 4 | 27.5% | 52.8% | 40.8% | 6.4% |
| context-dependent category | 3 | 150 | 78.7% | 1 | 2 | 0.1% | 60.9% | 38.6% | 0.5% |
| disorder | 3 | 505 | 72.9% | 1 | 13 | 45.9% | 24.3% | 43.3% | 32.4% |
| environment | 3 | 39 | 79.1% | 1 | 2 | 0.6% | 100.0% | 0.0% | 0.0% |
| environment / location | 2 | 2 | 100.0% | 1 | 1 | 0.0% | 100.0% | 0.0% | 0.0% |
| ethnic group | 3 | 54 | 84.6% | 1 | 2 | 1.6% | 100.0% | 0.0% | 0.0% |
| event | 3 | 17 | 81.0% | 1 | 2 | 1.1% | 100.0% | 0.0% | 0.0% |
| finding | 3 | 251 | 78.1% | 1 | 5 | 15.2% | 67.9% | 23.1% | 9.0% |
| geographic location | 5 | 46 | 94.6% | 1 | 3 | 2.3% | 100.0% | 0.0% | 0.0% |
| inactive concept | 6 | 6 | 100.0% | 1 | 1 | 0.0% | 100.0% | 0.0% | 0.0% |
| life style | 3.5 | 6 | 83.3% | 1 | 1 | 0.0% | 100.0% | 0.0% | 0.0% |
| morphologic abnormality | 3 | 410 | 70.4% | 1 | 4 | 30.2% | 99.3% | 0.5% | 0.2% |
| namespace concept | 4 | 4 | 100.0% | 1 | 1 | 0.0% | 100.0% | 0.0% | 0.0% |
| navigational concept | 164 | 164 | 100.0% | 1 | 1 | 0.0% | 100.0% | 0.0% | 0.0% |
| observable entity | 2 | 77 | 73.8% | 1 | 3 | 4.9% | 99.8% | 0.2% | 0.0% |
| occupation | 3 | 34 | 81.1% | 1 | 3 | 15.7% | 100.0% | 0.0% | 0.0% |
| organism | 2 | 551 | 64.5% | 1 | 4 | 4.9% | 100.0% | 0.0% | 0.0% |
| person | 2 | 23 | 83.8% | 1 | 2 | 23.2% | 100.0% | 0.0% | 0.0% |
| physical force | 2 | 21 | 66.7% | 1 | 2 | 6.5% | 100.0% | 0.0% | 0.0% |
| physical object | 2 | 118 | 74.3% | 1 | 4 | 7.0% | 100.0% | 0.0% | 0.0% |
| procedure | 2 | 996 | 67.7% | 1 | 13 | 45.6% | 22.6% | 34.9% | 42.5% |
| product | 2 | 2532 | 69.2% | 1 | 4 | 7.6% | 65.4% | 30.8% | 3.8% |
| qualifier value | 3 | 359 | 79.6% | 1 | 3 | 6.9% | 100.0% | 0.0% | 0.0% |
| regime/therapy | 2 | 51 | 69.1% | 1 | 7 | 26.0% | 60.9% | 23.6% | 15.6% |
| religion/philosophy | 2 | 29 | 74.1% | 1 | 2 | 1.4% | 100.0% | 0.0% | 0.0% |
| social concept | 2 | 10 | 71.4% | 1 | 1 | 0.0% | 100.0% | 0.0% | 0.0% |
| special concept | 3 | 3 | 100.0% | 1 | 1 | 0.0% | 100.0% | 0.0% | 0.0% |
| specimen | 2 | 82 | 70.3% | 1 | 4 | 17.2% | 13.8% | 68.0% | 18.1% |
| staging scale | 6 | 6 | 100.0% | 1 | 1 | 0.0% | 100.0% | 0.0% | 0.0% |
| substance | 2 | 763 | 64.8% | 1 | 6 | 13.8% | 100.0% | 0.0% | 0.0% |
| tumor staging | 3 | 23 | 91.7% | 1 | 2 | 0.5% | 100.0% | 0.0% | 0.0% |
| total | 2 | 2532 | 68.5% | 1 | 13 | 27.7% | 51.4% | 27.1% | 21.5% |

Table 6 – Distribution of the number of children and parents per class (Med: median, Max: maximum, % Mul: proportion of classes with multiple children/parents) and of the presence of differentiae between parents and children (proportion of parent-child pairs with no differentia [None], a single differentia [Single] and multiple differentiae [Mult.])

[Figure 1: graph of the ancestors of Viral meningitis. Node labels: viral infections of the CNS; infectious disease of CNS; infectious disease of NS; disorder of body system; finding by site; clinical finding; infection by site; disorder by body site; disorder of the CNS; disorder of NS; viral infection by site; viral disease; infectious disease; infective meningitis; meningitis; disorder of meninges; inflammation of specific body organs; inflammation of specific body structures or tissue; inflammatory disorder; inflammatory disease of the CNS; inflammation of specific body systems; viral meningitis; disease]

Figure 1 – Ancestors of Viral meningitis in SNOMED CT

[Figure 2: histogram. x-axis: Number of children (1 to ≥ 20); y-axis: Number of classes (0 to 25,000)]

Figure 2 – Distribution of the number of children

[Figure 3: histogram. x-axis: Number of parents (1 to 5); y-axis: Number of classes (0 to 200,000)]

Figure 3 – Distribution of the number of parents

[Figure 4: role descriptions of three classes]
Procedure on jejunum: ACCESS = Surgical access values; PRIORITY = Priorities; PROCEDURE SITE = Jejunal structure
Gastrointestinal investigation: ACCESS = Surgical access values; PRIORITY = Priorities; PROCEDURE SITE = GI tract structure; METHOD = Evaluation – action
Endoscopy of jejunum: ACCESS = Endoscopic approach – access; PRIORITY = Priorities; PROCEDURE SITE = Jejunal structure; METHOD = Inspection – action; ACCESS INSTRUMENT = Endoscope, device

Figure 4 – Inheritance of role values for Endoscopy of jejunum.
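
The comparison of role descriptions summarized in Figure 4 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each role is given a single value, a changed value is reported without checking against the value hierarchy that it is a genuine refinement, and the function names are hypothetical rather than those of the implementation used in this study.

```python
# Role descriptions transcribed from Figure 4.
GASTROINTESTINAL_INVESTIGATION = {
    "ACCESS": "Surgical access values",
    "PRIORITY": "Priorities",
    "PROCEDURE SITE": "GI tract structure",
    "METHOD": "Evaluation – action",
}
ENDOSCOPY_OF_JEJUNUM = {
    "ACCESS": "Endoscopic approach – access",
    "PRIORITY": "Priorities",
    "PROCEDURE SITE": "Jejunal structure",
    "METHOD": "Inspection – action",
    "ACCESS INSTRUMENT": "Endoscope, device",
}


def differentiae(parent_roles, child_roles):
    """Return the roles on which the child's description departs from the parent's."""
    diffs = {}
    for role, value in child_roles.items():
        if role not in parent_roles:
            diffs[role] = "introduced"      # new role in the child
        elif parent_roles[role] != value:
            diffs[role] = "value changed"   # candidate refinement
    for role in parent_roles:
        if role not in child_roles:
            diffs[role] = "missing in child"  # role specific to the parent
    return diffs


def classify(diffs):
    """Bucket a parent-child pair as in the Differentiae columns of Table 6."""
    if not diffs:
        return "None"
    return "Single" if len(diffs) == 1 else "Mult."


diffs = differentiae(GASTROINTESTINAL_INVESTIGATION, ENDOSCOPY_OF_JEJUNUM)
print(diffs)
print(classify(diffs))
```

A pair classified as "Mult." would, as discussed in section 6.2, be a candidate for review rather than proof of an error, since roles such as ACCESS and ACCESS INSTRUMENT are not independent.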
[Figure 5: role descriptions of three classes]
Diabetic retinopathy: FINDING SITE = Retinal structure; ASSOCIATED ETIOLOGIC FINDING = Diabetes mellitus
Retinal microaneurysm: FINDING SITE = Visual pathway structure; FINDING SITE = Structure of retinal artery
Diabetic retinal microaneurysm: FINDING SITE = Visual pathway structure; FINDING SITE = Structure of retinal artery; ASSOCIATED ETIOLOGIC FINDING = Diabetes mellitus

Figure 5 – Inheritance of role values for Diabetic retinal microaneurysm (partial representation).

[Figure 6: diagram. Labels: Disease ontology; Anatomy ontology; Organism ontology; Locative window; Causative window; (Other window); Viral meningitis; Viral infection of CNS; Infective meningitis; Meninges structure; Central nervous system structure; Virus; Infective agent. Link labels: IS A; IS A; IS A_SITE; IS A_AGENT]

Figure 6 – Two views (locative and causative) on Viral meningitis.
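
The idea of restricting taxonomic reasoning to one partition at a time can be illustrated with a minimal Python sketch. The relation names IS_A_SITE and IS_A_AGENT are a hypothetical spelling of the typed subsumption links proposed in section 6.3, and the function is an assumption made for this example, not part of SNOMED CT or of the formal theory in [22].

```python
# Typed subsumption links for the Viral meningitis example of Figure 6.
TYPED_LINKS = [
    ("Viral meningitis", "IS_A_AGENT", "Infective meningitis"),
    ("Viral meningitis", "IS_A_SITE", "Viral infection of the central nervous system"),
]


def ancestors_in_partition(cls, link_type, links):
    """Collect the ancestors of cls reachable through links of one type only."""
    found = []
    seen = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for child, relation, parent in links:
            if child == current and relation == link_type and parent not in seen:
                seen.add(parent)
                found.append(parent)
                frontier.append(parent)
    return found


# Locative view: only site-based subsumption is followed.
print(ancestors_in_partition("Viral meningitis", "IS_A_SITE", TYPED_LINKS))
# Causative view: only agent-based subsumption is followed.
print(ancestors_in_partition("Viral meningitis", "IS_A_AGENT", TYPED_LINKS))
```

Within each view the class has a single parent, which is the sense in which typed links could help to contain multiple inheritance.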