Introduction

 

To improve subject retrieval across different domains using a variety of subject vocabularies, the Subject Analysis Committee's Subcommittee on Semantic Interoperability has prepared a number of documents that together are intended to provide guidance to creators of semantically interoperable vocabularies and systems. Most online library systems worldwide use some type of controlled vocabulary. From a librarian's point of view, keyword searching on the Internet has its limitations. Yet online catalogs exist in the Internet environment along with other remotely accessed databases. The Subcommittee recognizes some of the same problems identified by the Bates report, in particular the myriad thesauri presented to users during an information-seeking experience. Although some Internet search engines function fairly well, the Subcommittee felt it needed to limit its focus to environments using some type of structured subject-based metadata or embedded metatags, rather than random or weighted keywords.

 

In the ALCTS report "Subject data in the metadata record" (1), functional requirements for subject access to Internet resources include: a) assist searchers in identifying the most efficient paths for resource discovery and retrieval; b) help users focus their searches; c) enable optimal recall; d) enable optimal precision; e) assist searchers in developing alternative search strategies; f) provide all of the above in the most efficient, effective, and economical manner.

 

In a networked environment, interoperability among disparate systems is necessary to allow users to search among resources from multiple sources generated and organized according to different standards and approaches. Lois Chan, in her paper for the Bicentennial Conference on Bibliographic Control for the New Millennium (2000), summarizes the interoperability requirements as follows (2): a) interoperability among different systems, metadata standards, and languages; b) flexibility and adaptability to different information communities, not only different types of libraries, but also other communities such as museums, archives, corporate information systems, etc.; c) extensibility and scalability to accommodate the need for different degrees of depth and different subject domains; d) simplicity in application, i.e. easy to use and to comprehend; e) versatility, i.e. the ability to perform different functions; and f) amenability to computer application.

 

Doerr (2001) notes that terminological resources are increasingly important for information retrieval in the networked environment for retrieving documents by querying databases and metadata employing controlled vocabularies. There is a growing interest in developing automated intermediaries to negotiate the differences between controlled vocabulary schemes so that a user can use a familiar set of terms to search collections using other vocabulary schemes. (3)

 

Hunter (2001) (4) points out that networked knowledge organization systems typically contain objects of mixed media types which are described using a multitude of diverse metadata schemas. Hence, machine understanding of metadata descriptions which conform to schemas from different domains is a fundamental requirement for access. Yet problems arise from the differences in terminological semantics and hierarchical relationships within various subject schemes.

 

Bella Hass-Weinberg (5), at the seminar Thesaurus Design for Semantic Information Management, suggested that "semantic information management" really just means vocabulary control; that "ontology" usually just means classification scheme but is sometimes used as a synonym for thesaurus; and that "taxonomy" is just a synonym for classification. Subject heading lists, such as LCSH, are essential tools for managing information in a print environment, while true thesauri are often more useful in the online environment (where they can be viewed hierarchically or combined in Boolean searches). Thesauri often run into the problem of needing to distinguish homographs. The problem in the selection of thesaurus terms is largely one of determining a set of appropriate lexemes, that is, the smallest units of the lexicon that can be understood on their own terms. Synonymy is a common problem, though easily managed, e.g. "Cancer, see Neoplasm". Other problems include having to choose between singular and plural forms, parts of speech, etc.
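
The handling of synonyms and homographs mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table names, the resolve function, and the "Mercury" qualifiers are hypothetical and are not drawn from any particular thesaurus. Only the "Cancer, see Neoplasm" reference comes from the text.

```python
# Minimal sketch of entry-vocabulary control.
# All names and the Mercury entries are illustrative assumptions.

# "See" references: non-preferred term -> preferred term.
USE_REFERENCES = {
    "cancer": "Neoplasm",   # example from the text: Cancer, see Neoplasm
}

# Homographs are disambiguated with parenthetical qualifiers.
HOMOGRAPHS = {
    "mercury": ["Mercury (planet)", "Mercury (element)"],
}

PREFERRED_TERMS = {"Neoplasm", "Mercury (planet)", "Mercury (element)"}


def resolve(user_term: str) -> list:
    """Return the preferred term(s) that a searcher's term leads to."""
    key = user_term.strip().lower()
    if key in USE_REFERENCES:
        return [USE_REFERENCES[key]]
    if key in HOMOGRAPHS:
        # Several meanings share one spelling; let the user choose.
        return list(HOMOGRAPHS[key])
    # Otherwise accept a case-insensitive match on a preferred term.
    return [t for t in PREFERRED_TERMS if t.lower() == key]
```

Under these assumptions, resolve("Cancer") leads to the single preferred term, while resolve("mercury") returns both qualified forms so that an interface can ask the user which sense is meant.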

 

A subject portal connects users to a site focusing on a particular subject, with access to high-quality information resources, and allows aggregated cross-searching, streamlined account management, user profiling, or additional services (6). However, the user has to know to go to the portal, and the number of portals is growing.

 

Renardus is an example of a subject gateway/portal project that provides users with integrated access, by searching or browsing through a single interface, to partners' quality-controlled subject gateways. Further goals are to develop and define organizational models, business models, technical solutions, and metadata standards (Renardus Application Profile, Renardus Namespaces, Renardus Collection Level Description). The following elements can be used to define a quality-controlled subject gateway: a) selection and collection development, b) collection management, c) record creation, d) resource description and metadata, e) subject access, f) search and browse access, g) standards, h) value-adding features. Each participating partner is responsible for mapping its metadata format to the common Renardus metadata format, derived from Dublin Core. A generic normalization toolkit with Z39.50 configuration files and a conversion script were provided. Each participant set up a Renardus server with its content normalized to the Renardus data model. A set of screens was built for the user interface: a) homepage, b) advanced search screen, c) index scan window, d) advanced search page after index scan, e) browse by subject screen, f) (preliminary) result screen, g) sorted result screen, h) participating gateways screen, and i) help (index) screen. To support subject browsing, the various systems are mapped to a common classification system. The Renardus service provides access to resources on all kinds of subjects, published worldwide and in many languages, and it is intended to be offered to an international, multidisciplinary community of users. Dewey Decimal Classification was chosen because of its online availability and tools, global usage, suitability and functionality as a classification system, the frequency and character of its updates, and research and methodological development efforts. (7)
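
The two mapping steps described above (local metadata format to a common format, and local classification to a common classification) might look roughly like the following Python sketch. It is not the Renardus toolkit: the local field names, the local class codes, and the function names are hypothetical. Only the Dublin Core element names and the three DDC class numbers are real.

```python
# Sketch of a metadata crosswalk plus a classification mapping.
# Local field names and local class codes are invented for illustration.

# Local field name -> Dublin Core element.
FIELD_CROSSWALK = {
    "heading": "title",
    "author": "creator",
    "summary": "description",
    "url": "identifier",
}

# Local classification code -> DDC class number.
CLASS_TO_DDC = {
    "MATH": "510",   # Mathematics
    "PHYS": "530",   # Physics
    "MED": "610",    # Medicine and health
}


def normalize(record: dict) -> dict:
    """Convert one local gateway record to the common format."""
    common = {}
    for local_field, value in record.items():
        if local_field in FIELD_CROSSWALK:
            common[FIELD_CROSSWALK[local_field]] = value
    local_class = record.get("class_code")
    # Unmapped classes are left out rather than guessed.
    if local_class in CLASS_TO_DDC:
        common["subject_ddc"] = CLASS_TO_DDC[local_class]
    return common


def browse_by_ddc(records: list) -> dict:
    """Group normalized records by DDC number for subject browsing."""
    groups = {}
    for rec in map(normalize, records):
        groups.setdefault(rec.get("subject_ddc", "unclassified"), []).append(rec)
    return groups
```

Fields with no crosswalk entry are silently dropped in this sketch; a production mapping would more likely log them for review so that partners can extend the crosswalk.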

 

About the same time the SAC Subcommittee on Semantic Interoperability was formed, NISO decided Z39.19 Guidelines for the Construction, Format, and Management of Monolingual Thesauri needed changing to meet the needs of the changing information environment. Their rationale included, "Developers of Internet and Intranet-accessible Web pages, databases, and information systems need better metadata to support non-expert information searches, and metadata developers are recognizing the value of incorporating high-quality, interoperable controlled vocabularies and taxonomies into their schemes."(8)

 

The accompanying objectives and guidelines are intended to enable developers to create an environment, system, or method by which even multiple portals could be accessed via subject metadata, using software that is neutral, available ubiquitously or directly to the user, and able to be copied by libraries for use in their own environments.

 

Footnotes

 

1. American Library Association (1999). "Subject data in the metadata record", Association for Library Collections & Technical Services, Cataloging and Classification Section, Subcommittee on Metadata and Subject Analysis. http://www.ala.org/ala/alctscontent/catalogingsection/catcommittees/subjectanalysis/metadataandsubje/subjectdata.htm

 

2. Chan, Lois (2000) "Exploiting LCSH, LCC, and DDC to retrieve networked resources: issues and challenges", Library of Congress. http://lcweb.loc.gov/catdir/bibcontrol/chan_paper.html

 

3. Doerr, M. (2001) "Semantic problems of thesauri mapping," Journal of Digital Information, vol. 1, no. 8 http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Doerr/

 

4. Hunter, Jane  (2001) "MetaNet: a metadata term thesaurus to enable interoperability between metadata domains", Journal of Digital Information, vol. 1, no. 8 http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Hunter/

 

5. Lovins, Daniel. "Summaries and reflections of Thesaurus Design for Semantic Information Management", a day-long seminar led by Prof. Bella Hass-Weinberg in New York, April 16, 2002. [email May 6, 2002]

 

6. Resource Discovery Network (2002). "Renardus"; "Subject Portals Development Project". http://rdn.ac.uk/projects/#Euro

 

7. Neuroth, Heike and Koch, Traugott (2001) "Cross-browsing and cross-searching in a distributed network of subject gateways: architecture, data model, and classification", at European Library Automation Group’s 25th library systems seminar, Integrating Heterogeneous Resources, Prague, 6-8 June 2001 (http://www.kbr.be/elag/) http://www.stk.cz/elag2001/Papers/HeikeNeuroth/HeikeNeuroth.htm Accessed Aug. 8, 2002

 

8. NISO. Developing the Next Generation of Standards for Controlled Vocabularies and Thesauri. http://www.niso.org/committees/MT-info.html Accessed Feb. 15, 2005

 

 

Subject Semantic Interoperability

 

Subcommittee charge:

Specific tasks include, but are not necessarily limited to:

a) An inventory of known semantic interoperability projects, with descriptions;

b) An evaluation of selected projects, in terms of those projects' stated objectives;

c) An investigation of the various concepts involved in the harmonization of indexing languages, such as switching languages, concordance tables, front-end thesauri, meta-thesauri, and mapping.

 

Issues to examine with regard to the above tasks may include:

d) Conditions which optimize the effectiveness of harmonization, both among indexing languages of the same type, and among languages of different types;

e) Simplification of existing indexing languages, in the context of interoperability;

f) Approaches to integration and harmonization of subject vocabularies and knowledge organization schemes used in various metadata standards for the purposes of effective and efficient resource discovery.

 

Definition of Subject Semantic Interoperability: The ability of two or more systems or components to exchange or harmonize cognate subject vocabularies and/or knowledge organization schemes to be used for the purposes of effective and efficient resource discovery without significant loss of lexical or connotative meaning and without special effort by the user.

 

Goals

 

Work began with the assumption that we would try to provide recommendations to:

a) Serve as guidelines in structuring a system that supports semantic interoperability among vocabularies by employing one or several of the methods listed below:

            1. harmonization of indexing languages (assumes simplification)

            2. switching languages

            3. concordance tables

            4. front-end thesauri or front-end "cluster"

            5. metathesaurus

            6. semantic networks

            7. multilingual thesauri

            8. mapping (methodologies)/types

                        a) among multiple vocabularies in different languages and classification systems

                        b) between a controlled vocabulary and a universal classification system

                        c) between classification systems

                        d) to a new system/metathesaurus

                        e) to another thesaurus or classification not used by the participants

                        f) among controlled vocabularies in the same language: thesauri, controlled lists of keywords, ontologies, clustering approaches, taxonomies, lexical databases, concept maps/spaces, semantic road maps, etc.
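
To make the difference between two of the listed methods concrete, the following Python sketch translates terms through a switching language (a shared hub notation) rather than through pairwise concordance tables. The vocabulary names, terms, and hub codes are invented for illustration, and the sketch assumes exact one-to-one or many-to-one equivalence, which real mappings often cannot guarantee.

```python
# Sketch: translating terms via a switching language (hub).
# Vocabulary names, terms, and hub codes are illustrative assumptions.

# Each vocabulary is mapped once to the hub notation.
TO_HUB = {
    "vocab_a": {"Neoplasms": "H001", "Heart diseases": "H002"},
    "vocab_b": {"Cancer": "H001", "Cardiac disorders": "H002"},
    "vocab_c": {"Tumours": "H001"},
}


def from_hub(vocab: str) -> dict:
    """Invert a vocabulary's hub mapping: hub code -> local terms."""
    inverted = {}
    for term, code in TO_HUB[vocab].items():
        inverted.setdefault(code, []).append(term)
    return inverted


def translate(term: str, source: str, target: str) -> list:
    """Translate a term from one vocabulary to another via the hub."""
    code = TO_HUB[source].get(term)
    if code is None:
        return []   # no mapping recorded; do not guess
    return from_hub(target).get(code, [])


def tables_needed(n: int) -> tuple:
    """Directed mapping tables for n vocabularies: (pairwise, via hub)."""
    return n * (n - 1), 2 * n
```

With five vocabularies, tables_needed(5) gives 20 directed pairwise tables against 10 hub mappings, which is the usual argument for a switching language once more than a few vocabularies are involved. The trade-off is that every translation passes through the hub, so any loss of meaning in the hub affects all pairs.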

 

b) Guide development of database management structures that allow automated (artificial intelligence) and manual methods to create the appropriate relational links for:

            1. multilinguality

            2. synonyms

            3. homographs

            4. singulars and plurals

            5. parts of speech

            6. cultural differences affecting meaning

            7. narrower and broader and related terms

            8. syntactical differences
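
One possible shape for a database structure carrying these relational links is sketched below in Python. The Term and Vocabulary classes and their field names are assumptions made for illustration, not a recommended schema; cultural and syntactical differences are not modeled here.

```python
# Sketch of a term record holding several of the link types listed above.
# Class and field names are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    language: str = "en"
    used_for: set = field(default_factory=set)   # synonyms and variant forms
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT


class Vocabulary:
    def __init__(self):
        self.terms = {}

    def add(self, label: str, **kwargs) -> Term:
        term = Term(label, **kwargs)
        self.terms[label] = term
        return term

    def link_broader(self, narrow: str, broad: str) -> None:
        """Record BT and its reciprocal NT in one step."""
        self.terms[narrow].broader.add(broad)
        self.terms[broad].narrower.add(narrow)

    def link_related(self, a: str, b: str) -> None:
        """RT is symmetric, so store it on both records."""
        self.terms[a].related.add(b)
        self.terms[b].related.add(a)

    def ancestors(self, label: str) -> set:
        """All broader terms reachable from a label."""
        found, stack = set(), [label]
        while stack:
            for parent in self.terms[stack.pop()].broader:
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found
```

Writing reciprocal links in a single method keeps broader and narrower references consistent, whether the link is proposed by an automated process or entered manually.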

  

c) Guide development of interfaces for entry of semantically appropriate terms by:

            1. trained institutional staff

            2. novice, non-institutional creators

            3. method to alert trained staff to new terms for enhancement

 

Objectives

 

A set of objectives was identified to assist users in identifying and selecting appropriate target resources.

 

To provide users with integrated access through a single interface (or recommended interface format) to distributed quality services, web pages or catalogs.

To enable a single search interface across heterogeneous metadata descriptions (i.e., the ability for users to search descriptive metadata in multiple metadata forms).

To enable users to determine the target resources most useful to their research     

 

To provide for cross-browsing and cross-searching. To enable the integration or merging of descriptions which are based on complementary but possibly overlapping metadata schemas or standards

            a) across multiple controlled vocabularies

            b) across multiple domains (archives, art works, etc.)

            c) multiple subject (topic) areas

            d) multiple metadata schemes

            e) multiple internet-based resources
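
The integration or merging of descriptions based on overlapping metadata schemas could be approached along the following lines. This Python sketch assumes two hypothetical schemas ("schema_x" and "schema_y") whose field names are invented, and it assumes that records describing the same resource share an identifier, which is often not true in practice.

```python
# Sketch: merging descriptions of one resource that arrive in different,
# partly overlapping metadata schemas. Schemas and fields are hypothetical.

# Per-schema field names mapped to a shared set of elements.
SCHEMA_MAPS = {
    "schema_x": {"title": "title", "topic": "subject", "link": "identifier"},
    "schema_y": {"name": "title", "keywords": "subject", "uri": "identifier",
                 "lang": "language"},
}


def to_shared(record: dict, schema: str) -> dict:
    """Rename fields to the shared elements; every value becomes a list."""
    shared = {}
    for field_name, value in record.items():
        element = SCHEMA_MAPS[schema].get(field_name)
        if element is None:
            continue   # field has no shared equivalent in this sketch
        values = value if isinstance(value, list) else [value]
        shared.setdefault(element, []).extend(values)
    return shared


def merge(descriptions: list) -> dict:
    """Merge (record, schema) pairs by identifier, keeping distinct values."""
    merged = {}
    for record, schema in descriptions:
        shared = to_shared(record, schema)
        for identifier in shared.get("identifier", []):
            target = merged.setdefault(identifier, {})
            for element, values in shared.items():
                bucket = target.setdefault(element, [])
                for v in values:
                    if v not in bucket:
                        bucket.append(v)
    return merged
```

Keeping every distinct value, rather than choosing one source as authoritative, preserves subject terms from each contributing vocabulary so that they remain available for cross-searching.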

 

To enable users to access information in the language, script, or form they prefer. To enable different views of the one underlying and complete metadata description, depending on the user's particular interest, perspective, or requirements.

 

To bring precision to searches and response content

            a) assist users in identifying the most efficient paths for resource discovery

            b) help users focus their searches

            c) enable optimal recall

            d) enable optimal precision

            e) assist searchers in developing alternative search strategies
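
Recall and precision, as named in the list above, have standard definitions that can be computed directly from sets of record identifiers. The Python sketch below uses invented identifiers and invented result sets purely to show the arithmetic; it is not evidence about how any real search method performs.

```python
# Sketch: precision and recall for a single search.
# Record IDs and result sets are invented for illustration.

def precision(retrieved: set, relevant: set) -> float:
    """Share of retrieved records that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved: set, relevant: set) -> float:
    """Share of relevant records that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


relevant = {"r1", "r2", "r3", "r4"}

# Hypothetical free-text search: many hits, few of them relevant.
keyword_hits = {"r1", "r2", "x1", "x2", "x3", "x4"}

# Hypothetical controlled-vocabulary search over the same collection.
controlled_hits = {"r1", "r2", "r3", "x1"}
```

In this made-up case the keyword search has a precision of 2/6 and a recall of 2/4, while the controlled search reaches 3/4 on both measures. Reporting the two figures together is what allows alternative search strategies to be compared.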

             

 

To provide consistency for users by controlling forms used for access and displays

 

To be able to support groupings and rich descriptions of resources through search interfaces and architecture to:

            a) enable navigation

            b) provide explanations for variations and inconsistencies

            c) show relationships (broader, narrower, etc.)

 

To help collocate similar words or phrases

 

To be able to use controlled forms of names, titles, and subjects to link to the authorized forms of names, titles, and subjects that are used in various tools, such as directories, biographies, and abstracting and indexing services, and to achieve goals and objectives in the most effective, efficient, and economical manner

 

To facilitate sharing to reduce cataloguing costs to libraries, museums, archives, rights management agencies, etc.

 

To simplify creation and maintenance of subject-related databases or authority records internationally for institutions participating in managed projects

 

Issues

 

Working from the definition, a number of issues were identified that need to be addressed.

 

Subject Semantic Interoperability: The ability of two or more systems or components to exchange or harmonize cognate subject vocabularies and/or knowledge organization schemes to be used for the purposes of effective and efficient resource discovery without significant loss of lexical or connotative meaning and without special effort by the user.

 

 

"two or more systems or components"

            - implies the systems remain independent at some level, with each maintaining its own metadata standard

            - individual resources, each internally constructed in its own semantically consistent fashion

            - can support multiple languages

            - an operator interface for data entry

            - able to accommodate differences in technical approach and working practices

            - flexible and adaptable to different information communities

            - extensible and scalable to accommodate the need for different degrees of depth and different subject domains

 

"to exchange or harmonize cognate subject vocabularies and/or knowledge organization schemes"

            - able to support interoperability across diverse information sectors

            - able to employ one or more functional mechanisms (e.g. cross-search functionality)

                        <list, e.g. Z39.50, XML, etc.>

            - a structured database where harmonization is pre-coordinated, or --

            - a scripted protocol that harvests data on the fly

            - distributed use of thesauri, vocabularies, etc.

            - able to harmonize the semantics of hierarchical relations and term overlap among multiple vocabularies

 

"to be used for the purposes of effective and efficient resource discovery"

            - amenable to computer application

            - able to be made available for gateways/portals or bibliographic catalogs

            - improve retrievability of web resources

            - able to alert the user to different terms used to describe similar concepts, or to identical terms used to mean very different things

            - able to provide the user with a means for selecting the appropriate resource(s)

            - versatile, i.e. able to perform different functions

            - search interfaces and architecture that support groupings and rich descriptions of resources

            - able to save and export search results

            - able to link search results to full-text or other content delivery options

            - able to manage access to target resources

            - able to search by specific fields in advanced searches

            - support keyword and browse searches, including:

                        a) ability to browse a list of targets

                        b) ability to search target descriptions by keyword

                        c) ability to present different views of targets (e.g. by subject, user group, etc.)

                        d) ability to browse target resources in hierarchical displays

                        e) ability to browse a composite list of target resources (aggregated databases)

            - able to present different views of the target resources

            - able to integrate metadata for target resources from more than one source

 

 

"without significant loss of lexical or connotative meaning"

            - method of identifying the meaning of one word/phrase with another        

            - able to employ one or more of these methods to achieve semantic interoperability

                        <list, e.g. mapping>

 

"without special effort by the user" (including creators, institutional staff, and general users)

            - system with structural specifications for searching and navigation

            - system with structural specification for display

            - automated checking against specified controlled heading database(s) for linking/updating

            - relatively simple to apply, use, and comprehend
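
Automated checking against a controlled heading database, as listed above, might work along these lines. The Python sketch uses a tiny in-memory authority list; the headings, the variant table, and the report format are illustrative assumptions rather than any existing system's behavior.

```python
# Sketch: checking a record's subject headings against an authority list.
# Authority entries and the report layout are illustrative assumptions.

AUTHORIZED = {"Neoplasms", "Heart diseases"}
VARIANTS = {"Cancer": "Neoplasms", "Tumors": "Neoplasms"}   # variant -> authorized


def check_headings(headings: list) -> dict:
    """Sort headings into valid, updatable, and unmatched groups."""
    report = {"valid": [], "update": {}, "unmatched": []}
    for heading in headings:
        if heading in AUTHORIZED:
            report["valid"].append(heading)
        elif heading in VARIANTS:
            report["update"][heading] = VARIANTS[heading]
        else:
            report["unmatched"].append(heading)   # refer to trained staff
    return report


def apply_updates(headings: list) -> list:
    """Replace variant forms with authorized forms, dropping duplicates."""
    updated = []
    for heading in headings:
        form = VARIANTS.get(heading, heading)
        if form not in updated:
            updated.append(form)
    return updated
```

Unmatched headings are reported rather than altered, so the user is not asked for special effort and staff can decide whether a new term should be added to the controlled list.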

 

To achieve interoperability between systems, those responsible for the systems need to:

            - develop and define organizational models

            - develop business models

            - develop shareable technical solutions and metadata standards (e.g. Renardus Application Profile, Renardus Namespaces, Renardus Collection Level Description)

            - establish (publicly) guidelines for:

                        - Selection and collection development

                        - Collection management

                        - Record creation

                        - Resource description and metadata

                        - Subject access

                        - Search and browse access

                        - Standards

                        - Value-adding features

 

Levels

 

There are three levels to be approached: 

a) the vocabulary level

            - thesauri

            - vocabularies

            - word lists

            - subject heading lists    

            - classification systems

            - concept maps

            - lexical databases

            - etc.

 

b) the system level

            - structured database of subject-related terms or authority records controlled by one or more agencies following policies and procedures for a defined project; implies workforms, manuals, standards, etc.

            - dispersed input by local agencies into a larger database

            - automated harvesting, with minimal human input

            - user-generated indexes based on natural language

            - management software

            - protocols

            - registries

            - metadata and record format

            - query language

            - etc.

 

c) the user level

            - easy to understand presentation

            - several options available to devise a search strategy, e.g. search and browse, basic and advanced

            - several options available to manipulate search response, e.g. limit, re-sort, etc.

            - etc.

 

Modes

 

There are several modes in which vocabularies may interoperate:

 

Global, structured, or a mixture:

Global: any Internet-based resource could be identified, harvested, and presented to the user on the fly, based on artificial intelligence algorithms.

Structured, as in a subject gateway

Subject gateway: a subject-based resource discovery guide which provides links to information resources (documents, collections, sites, or services), predominantly accessible via the Internet, and applies a documented set of quality measures to support systematic resource discovery. It is also managed: resources are collected by humans according to documented selection criteria, with maintenance criteria, a fixed metadata set, and controlled subject classification.

Mixture

End-user thesauri are developed by humans and machines from analysis of search transactions; commonly used natural-language terms thereby become descriptors based on user warrant.
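
The machine side of deriving descriptors from user warrant could start from a simple frequency count over search transactions, as in the Python sketch below. The log format, the threshold of three searches, and the sample queries are all assumptions; deciding which candidates actually become descriptors would remain a human task.

```python
# Sketch: proposing end-user thesaurus candidates from search transactions.
# Log format, threshold, and sample queries are illustrative assumptions.
from collections import Counter

CONTROLLED_TERMS = {"neoplasms", "myocardial infarction"}


def candidate_terms(queries: list, min_count: int = 3) -> list:
    """Return frequently searched terms that are not yet in the vocabulary."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return [
        (term, n)
        for term, n in counts.most_common()
        if n >= min_count and term not in CONTROLLED_TERMS
    ]


log = ["cancer", "Cancer", "heart attack", "cancer ", "neoplasms",
       "heart attack", "neoplasms", "neoplasms", "heart attack", "flu"]
```

Here candidate_terms(log) proposes "cancer" and "heart attack", each searched three times, while "neoplasms" is skipped because it is already controlled and "flu" falls below the threshold. Each candidate could then be reviewed as a new descriptor or as an entry term pointing to an existing one.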