DRAFT Jan. 26, 2005

 

Guidelines for Semantic Interoperability

 

Introduction

It has been said that any two objects, slammed together with sufficient force, can be made to fit. So too with information systems. If librarians and programmers push hard enough, multiple controlled vocabularies and knowledge organization systems can be forced to interoperate. If too much pressure is applied, though, the damaged parts will be ‘non-semantic’ and of little value to users. So, given that converging information systems—with their idiosyncratic histories and social functions—are likely to produce overlaps, seams, and gaps in the composite whole, what techniques are currently being employed by developers to minimize damage and create true semantic interoperability (SI)?

The ALA/ALCTS/SAC Subcommittee on Semantic Interoperability is attempting, as part of its charge, to devise ways to evaluate a wide array of SI projects. Based on opinions expressed during the recent ALA conference in Boston, it seems that whatever measuring device we decide upon should include items that can be tested empirically. The current consensus seems to favor putting together some sort of checklist, which could then be test-deployed against six of the more interesting projects currently under subcommittee review. The checklist itself could then be evaluated for how well it captures the most important SI details, and how well it characterizes each project according to objective criteria as well as on its own terms.

Before going any further, it is worth recalling the definition we are currently using for semantic interoperability: 

 

“The ability of two or more systems or components to exchange or harmonize cognate subject vocabularies and/or knowledge organization schemes to be used for the purpose of effective and efficient resource discovery without significant loss of lexical or connotative meaning and without special effort by the user”

 

Looking to the Literature for Ideas on How to Proceed

 

Whatever rating system we decide upon should be informed by the current research literature. Some researchers have been making close examinations of individual projects, while others focus mainly on theoretical issues. Recent noteworthy articles of both types in the library and information science domain include Chan & Zeng (2002), Tennis (2004), and Zeng & Chan (2004), while those in the computer science and database design domain include Dhamankar et al. (2004), Park & Ram (2004), and Parsons & Wand (1997).

The work of Chan and Zeng is particularly useful for breaking down the many variables that make up semantic interoperability. One major variable involves the selection of data types, systems, or standards to be made interoperable. There are projects, for example, that harmonize different controlled vocabularies in the same language, e.g., Northwestern University’s concordance of LCSH and MeSH (Olson, 2001), the Wilson Megathesaurus (Kuhr, 2001), and CARMEN’s integration of multiple German thesauri (CARMEN, 2004); ones that aggregate subject vocabularies from among different languages and classification systems, e.g., the Unified Medical Language System (UMLS) (National Library of Medicine, 2005), the High Level Thesaurus (HILT, 2005; Nicholson, Wake, & Currier, 2001), and the DARPA Unfamiliar Metadata Project (Buckland et al., 1999); ones that map a controlled vocabulary to a universal classification system, e.g., OCLC’s correlation of LCSH with DDC (Vizine-Goetz et al., 2004) and the mapping of UDC to General Finnish Subject Headings (Himanka & Kautto, 1992); and ones that harmonize heterogeneous classification schemes, e.g., the American Mathematical Society’s mapping of Mathematics Subject Classification to Schedule 510 of the DDC (Iyer & Giguere, 1995).

Other SI variables are more methodological in nature. Following (again) the work of Chan and Zeng (2002), these may be sorted into six categories: (1) “Derivation/Modeling,” where a relatively simple vocabulary is derived from a more complicated pre-existing source (the way FAST is extracted from LCSH, for example); (2) “Translation/Adaptation” (e.g., the Bibliothèque Nationale’s Rameau system, generated through translation and adaptation of LCSH and CSH); (3) “Satellite and Leaf Node Linking,” where specialized thesauri (such as LIV, TGIM, GLIN) are treated as satellites of a larger entity (LCSH) or conceptualized as leaves (specialized thesauri) attached to a tree structure (the larger thesaurus or vocabulary list); (4) “Direct mapping,” where equivalences between differently sourced terms and classification numbers are established, usually requiring intensive intellectual effort; (5) linking through a “temporary union list”; and (6) linking through a “thesaurus server protocol,” as with the Alexandria Digital Library project.

Other variables discussed in the literature include: How are SI links stored and managed? Do they rely on authority records, concordance tables, a central switching language, semantic networks, lexical databases, semantic layers (Tennis, 2004), or some other structure? How are data and metadata in general stored? That is to say, are they gathered into a union catalog (e.g., American Memory Project, NSDL), or do they live in a distributed system? How are data structured? For example, do they rely on XML, MARC, Dublin Core, or some other framework standard?
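To make the storage question concrete, one of the structures named above, the concordance table, can be pictured as a set of typed links between terms in two vocabularies. The following Python fragment is only a minimal sketch and is not drawn from any of the projects cited; the class names, field names, and relation labels are assumptions made for illustration. The sample link uses the LCSH heading Tumors and the MeSH heading Neoplasms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One row of a hypothetical concordance table."""
    source_scheme: str
    source_term: str
    target_scheme: str
    target_term: str
    relation: str  # assumed labels, e.g. "exact", "broader", "narrower"


class Concordance:
    """Index links by (scheme, case-folded term) for simple lookup."""

    def __init__(self, links):
        self._by_source = {}
        for link in links:
            key = (link.source_scheme, link.source_term.casefold())
            self._by_source.setdefault(key, []).append(link)

    def lookup(self, scheme, term):
        """Return every link recorded for this term; empty list if none."""
        return list(self._by_source.get((scheme, term.casefold()), []))


# Illustrative data only: a single LCSH-to-MeSH link.
table = Concordance([Link("LCSH", "Tumors", "MeSH", "Neoplasms", "exact")])
matches = table.lookup("LCSH", "tumors")
```

Because each term maps to a list of links, the same structure can hold the one-to-many correlations that arise when a single heading has several partial equivalents in the other vocabulary.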

Yet another set of variables involves differences in degree of granularity and in logical structure. In the chapter “Compatibility and Convertibility” (pp. 179-216) of his Vocabulary Control for Information Retrieval, F. W. Lancaster points out several difficulties with which anyone attempting semantic interoperability (or “vocabulary reconciliation,” as he puts it) must contend: how to reconcile vocabularies that have different degrees of specificity, different degrees of precoördination, overlap in subject matter, and different arrangements of hierarchy (Lancaster, 1986, p. 211ff.). Vizine-Goetz et al. (2004) paraphrase Lancaster’s observations and add to them more recently discussed problems: common versus scientific names (Doerr, 2001; Olson & Strawn, 1997) and “differences in meaning resulting from different classifications of terms” (Doerr, 2001; Whitehead, 1990). In an automated environment there is also the problem of different methods and standards for encoding and preserving metadata.

In the following provisional checklist, I try to capture concerns and observations from the research literature discussed above, and incorporate some of the draft criteria proposed by Shelby Harken (2005) and those of Joseph Tennis (derived from Elaine Svenonius, 2000).

 

Provisional Checklist

 

1. Types of data being integrated

 

(a) different controlled vocabularies in same language?

(b) different controlled vocabularies in different languages?

(c) different classification schemas (e.g., DDC, UDC, LCC)?

(d) controlled vocabularies combined with classification schemas?

(e) different metadata framework schemas (e.g., XML, MARC, Dublin Core)?

(f) different communication protocols?

(g) other …

2. Autonomy and Integrity of Constituent Parts

(a) Is standardization, reconciliation, or conversion of semantic data reversible? Can precoördinated strings, once filtered or deconstructed for semantic matching, later be put back together again?

(b) Is full complement of metadata and indigenous subject hierarchies preserved? If so, how?

(c) Does project rely on principle of least common denominator? If so, many data sets may be able to coexist in the database, but the resulting stripped-down or ‘dumbed-down’ resource descriptions may no longer serve the interests of readers (cf. recently cited problems with Dublin Core (Tennant, 2004)).

(d) How are data stored: gathered into a union catalog (e.g., American Memory Project, NSDL), vs. distributed database?

(e) How are metadata (including SI links) stored?  (e.g., via authority records, concordance tables, a central switching language, semantic networks, lexical databases, semantic layers, etc.)
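
The reversibility asked about in item 2(a) can be illustrated with a small round trip. This is a sketch under stated assumptions rather than a description of any project's method: it assumes that subdivisions are separated by a double hyphen, that the position of each element is stored alongside it, and that the sample heading is purely illustrative. The function names are hypothetical.

```python
SEPARATOR = "--"  # assumed subdivision separator


def decompose(heading):
    """Split a precoordinated string into (position, element) pairs."""
    return [(i, part.strip()) for i, part in enumerate(heading.split(SEPARATOR))]


def recompose(elements):
    """Rebuild the string from stored pairs, in whatever order they arrive."""
    return SEPARATOR.join(part for _, part in sorted(elements))


heading = "Tumors--Treatment--History"  # illustrative string only
parts = decompose(heading)

# Simulate the elements being handled out of order during semantic matching.
shuffled = list(reversed(parts))
restored = recompose(shuffled)
```

The round trip succeeds here only because the original order is kept with each element. A project that discards the positions, or that normalizes the elements themselves, would not be able to restore the original string in this way.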

3. Reconciliation of heterogeneous vocabularies

(a) How are correlations established when a single term in one source has no equivalent term in the other, but instead, say, three quasi-subordinate terms?

(b) Certain vocabularies are highly structured and hierarchical, while others contain terms lacking any structure at all aside from serial numbers or other unique identifiers. How are these differences reconciled?

(c) How are conflicts resolved when an established heading in one vocabulary matches a cross reference in other vocabularies? (E.g., Tumors is an established LCSH heading, but in MeSH it is a cross reference to Neoplasms; and vice versa.)

(d) If multiple vocabularies are used in a single bibliographic record, and the headings from such vocabularies are identical (after normalization), how are duplicate retrievals handled? (This situation may not occur in all SI projects, but it will occur in a few.)
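
The duplicate problem raised in item 3(d) can be sketched as follows. The normalization rule used here (case folding, replacing punctuation with spaces, collapsing whitespace) is an assumption chosen for illustration, as are the function names and the sample record; actual projects may normalize quite differently.

```python
import re


def normalize(heading):
    """Case-fold, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", heading.casefold())
    return " ".join(cleaned.split())


def merge_headings(headings):
    """Group (scheme, heading) pairs by normalized form, keeping every source."""
    merged = {}
    for scheme, heading in headings:
        merged.setdefault(normalize(heading), []).append(scheme)
    return merged


# Illustrative record carrying headings from two vocabularies.
record = [
    ("LCSH", "Heart--Diseases"),
    ("MeSH", "Heart Diseases"),
    ("LCSH", "Cardiology"),
]
merged = merge_headings(record)
```

In this sketch the two forms of the heart disease heading collapse to one retrieval key, while the list of contributing vocabularies is retained so that the record is returned once without losing track of where each heading came from.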

4. Effective and Efficient Resource Discovery (Precision and Recall), Satisfying User Needs

(a) Does project provide high or satisfactory levels of precision and recall?

(b) To what extent does project rely on precoördination? If mostly post-coordinate, then:

i) by what means is recall maximized?

ii) by what means is precision maximized?

(c) Does project provide faceted approach (facilitating polysemy) while retaining option for browsable hierarchy (facilitating navigation)?

(d) Are the following objectives and functions supported in the SI environment (Tennis; Svenonius, 2000)?

i) Locate entities in the system via surrogates (find)

ii) Identify a surrogate that matches an entity (collocate)

iii) Select an entity appropriate to a user’s need via surrogates (choice facilitation)

iv) Obtain access to the entity via the system and its surrogates (acquisition)

v) Navigate the system and its surrogates (navigation)

(e) Has developer released beta version for general testing?

(f) Have user satisfaction surveys been conducted?
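
Since item 4(a) calls for empirical testing, it may help to state the two measures explicitly. Precision is the share of retrieved items that are relevant, and recall is the share of relevant items that are retrieved. The following is a minimal sketch; the function name and the sample identifiers are hypothetical, and the relevance judgments are assumed to have been made in advance.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two sets of record IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so that no division by zero can occur.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data: four records retrieved, three judged relevant, two in common.
retrieved = {"r1", "r2", "r3", "r4"}
relevant = {"r2", "r3", "r5"}
precision, recall = precision_recall(retrieved, relevant)
```

With this sample, two of the four retrieved records are relevant (precision 0.5) and two of the three relevant records are retrieved (recall of two thirds). Running the same queries before and after vocabularies are mapped would give one comparable figure per project.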

5. Ease of Use (this is actually part of our definition, i.e., SI should function “without special effort by the user,” where “users” include information creators and managers as well as end-users).

(a) Intuitive interface for data entry, searching, browsing, etc.?

(b) Are validation, mapping, metadata extraction, etc., automated as much as possible?

(c) Availability of documentation?

6. Long-term viability

(a) Master plan for life-cycle management and data migration?

(b) Reliance on open-source international standards versus proprietary standards?

(c) Viable business model? Based on research grant with likely expiration?

 

 

Conclusions

 

The need for improved semantic interoperability between and among vocabularies and knowledge organization schemas is undeniable and growing in importance. In order to understand the emerging options in SI, knowledge workers, be they practitioners or scholars, need to experiment with a wide variety of projects and stay current with the literature. However, trying to understand, never mind evaluate, the large number of projects currently under development can be a daunting and even disorienting experience. A glance at the subcommittee’s growing list of projects (http://www.und.nodak.edu/dept/library/Departments/abc/SACSEM-ResearchProjects.htm) gives an indication of the challenge before us. My hope is, therefore, that this checklist might at least simplify the process, and help us frame the key issues in a manner that can facilitate comprehension, comparison, and perhaps even evaluation. In its current form the checklist is of admittedly limited value, but perhaps with further refinement it can become a truly useful tool for assessing the state of the art in semantic interoperability.

References

Buckland, M., Chen, A., Chen, H., Kim, Y., Lam, B., Larson, R., et al. (1999). Mapping entry vocabulary to unfamiliar metadata vocabularies. D-Lib Magazine, 5(1). Retrieved January 18, 2005.

CARMEN. (2004). WP12: Cross concordances of classifications and thesauri. Retrieved 2004 from http://www.bibliothek.uniregensburg.de/projects/carmen12/index.html.en.

Chan, L. M., & Zeng, M. L. (2002). Ensuring interoperability among subject vocabularies and knowledge organization schemes: A methodological analysis. IFLA Journal, 28(5/6), 323-327.

Dhamankar, R., Lee, Y., Doan, A., Halevy, A., & Domingos, P. (2004). iMAP: Discovering complex semantic matches between database schemas. SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on management of data, Paris, France, 383-394.

Doerr, M. (2001). Semantic problems of thesaurus mapping. Journal of Digital Information, 1(8). Retrieved January 26, 2005.

Harken, S. (2005). SAC subcommittee on semantic interoperability: Draft criteria. Retrieved 2005 from http://www.und.nodak.edu/dept/library/Departments/abc/SACSEM-Criteria.htm

HILT. (2005). High-level thesaurus project proposal. Retrieved January 7, 2005 from http://hilt.cdlr.strath.ac.uk/AboutHILT/proposal.html

Himanka, J., & Kautto, V. (1992). Translation of the Finnish abridged edition of UDC into General Finnish Subject Headings. International Classification, 19(3), 131-4+.

Iyer, H., & Giguere, M. D. (1995). Towards designing an expert system to map mathematics classificatory structures. Knowledge Organization, 22(3-4), 141-147.

Kuhr, P. S. (2001). Putting the world back together: Mapping multiple vocabularies into a single thesaurus. Subject retrieval in a networked environment: Papers presented at an IFLA satellite meeting sponsored by the IFLA section on classification and indexing & IFLA section on information technology, Dublin, Ohio.

Lancaster, F. W. (1986). Vocabulary control for information retrieval. 2nd ed.

National Library of Medicine. (2005). Fact sheet: UMLS metathesaurus. Retrieved January 7, 2005 from http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html

Nicholson, D., Wake, S., & Currier, S. (2001). High-level thesaurus project: Investigating the problem of subject cross searching and browsing between communities. In C. C. Chen (Ed.), Global digital library development in the new millennium: Fertile ground for distributed cross-disciplinary collaboration. Beijing: Tsinghua University Press.

Olson, T. (2001). Integrating LCSH and MeSH in information systems. Subject retrieval in a networked environment: Papers presented at an IFLA satellite meeting sponsored by the IFLA section on classification and indexing & IFLA section on information technology, Dublin, Ohio.

Olson, T., & Strawn, G. L. (1997). Mapping the LCSH and MeSH systems. Information Technology and Libraries, 16, 5-19.

Park, J., & Ram, S. (2004). Information systems interoperability: What lies beneath? ACM Transactions on Information Systems, 22(4), 595-632.

Parsons, J., & Wand, Y. (1997). Choosing classes in conceptual modeling. Communications of the ACM, 40, 63-69.

Svenonius, E. (2000). The intellectual foundation of information organization. Cambridge, Mass.: MIT Press.

Tennant, R. (2004). Metadata's bitter harvest. Library Journal, 129(12), 32.

Tennis, J. (2004). Layers of meaning: Disentangling subject access interoperability. Advances in Classification Research, 12.

Vizine-Goetz, D., Hickey, C., Houghton, A. H., & Thompson, R. (2004). Vocabulary mapping for terminology services. Journal of Digital Information, 4(4).

Whitehead, C. (1990). Mapping LCSH into thesauri: The AAT model. In T. Petersen & P. Molholt (Eds.), Beyond the book: Extending MARC for subject access (p. 81). Boston: G.K. Hall.

Zeng, M. L., & Chan, L. M. (2004). Trends and issues in establishing interoperability among knowledge organization systems. Journal of the American Society for Information Science and Technology, 55(5), 377-395.