Introduction
To improve subject retrieval across different domains using
a variety of subject vocabularies, the Subject Analysis Committee's
Subcommittee on Semantic Interoperability has prepared a number of documents
that together are intended to provide guidance to creators of semantically
interoperable vocabularies and systems. Most online library systems worldwide
utilize some type of controlled vocabulary.
From a librarian's point of view, keyword searching on the Internet has
its limitations. Yet, online catalogs exist in the Internet environment along
with other remotely accessed databases. The Subcommittee recognizes some of the
same problems identified by the Bates report, in particular the myriad thesauri
presented to users during an information-seeking experience. Although some
Internet search engines function fairly well, the committee felt it needed to
limit its focus to environments using some type of structured subject-based
metadata or embedded metatags, rather than random or
weighted keywords.
In the ALCTS report "Subject data in the metadata
record" (1), functional requirements for subject access to Internet
resources include: a) assist searchers in identifying the most efficient
paths for resource discovery and retrieval; b) help users focus their searches;
c) enable optimal recall; d) enable optimal precision; e) assist searchers in
developing alternative search strategies; f) provide all of the above in the
most efficient, effective, and economical manner.
In a networked environment, interoperability among
disparate systems is necessary to allow users to search among resources from
multiple sources generated and organized according to different standards and
approaches. Lois Chan, in her paper for the Bicentennial Conference on
Bibliographic Control for the New Millennium (2000) (2),
summarizes the interoperability requirements as follows: a) interoperability
among different systems, metadata standards, and languages; b) flexibility and
adaptability to different information communities, not only different types of
libraries, but also other communities such as museums, archives, corporate
information systems, etc.; c) extensibility and scalability to accommodate the
need for different degrees of depth and different subject domains; d)
simplicity in application, i.e. easy to use and to comprehend; e) versatility, i.e. the ability to perform different
functions; and f) amenability to computer application.
Doerr (2001) notes that terminological resources are increasingly important
for information retrieval in the networked environment for retrieving documents
by querying databases and metadata employing controlled vocabularies. There is
a growing interest in developing automated intermediaries to negotiate the
differences between controlled vocabulary schemes so that a user can use a
familiar set of terms to search collections using other vocabulary schemes. (3)
Hunter (2001) (4) points out that networked knowledge
organization systems typically contain objects of mixed media types which are
described using a multitude of diverse metadata schemas. Hence machine
understanding of metadata descriptions which conform to schemas from different
domains is a fundamental requirement for access. Yet, problems arise from the
differences in terminological semantics and hierarchical relationships within
various subject schemes.
Bella Hass-Weinberg (5) at Thesaurus Design for Semantic
Information Management suggested that "semantic information
management" really just means vocabulary control; that ontology usually just means
classification scheme, but sometimes gets used as a synonym for thesaurus; and
that taxonomy is just a synonym for classification. Subject heading lists,
such as LCSH, are essential tools for managing information in a print
environment, while true thesauri are often more useful in the online
environment (where they can be viewed hierarchically or combined in Boolean
searches). Thesauri often run into the problem of needing to distinguish
homographs. The problem in the selection of thesaurus terms is largely one of
determining a set of appropriate lexemes, that is, the smallest units of
lexicon that can be understood on their own terms. Synonymy is a common
problem, though easily managed, e.g. Cancer, see Neoplasm. Other problems
include choosing between singular and plural forms and among parts of speech.
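The synonym and homograph handling described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method taken from the seminar: the lookup tables, the "Mercury" qualifiers, and the function name resolve are all assumptions made for the example. Only the "Cancer, see Neoplasm" reference comes from the text.

```python
# Hypothetical mini-thesaurus; all entries are illustrative assumptions
# except the "Cancer, see Neoplasm" reference quoted in the text.
USE_REFERENCES = {
    "cancer": "Neoplasm",      # synonym control: Cancer, see Neoplasm
    "neoplasms": "Neoplasm",   # plural variant folded into the singular form
}

# Homographs are kept apart with parenthetical qualifiers (assumed examples).
HOMOGRAPHS = {
    "mercury": ["Mercury (Element)", "Mercury (Planet)"],
}


def resolve(term):
    """Return the candidate preferred terms for a user's entry term."""
    key = term.strip().lower()
    if key in HOMOGRAPHS:
        # Ambiguous entry term: return every qualified sense so the user can choose.
        return list(HOMOGRAPHS[key])
    if key in USE_REFERENCES:
        # Non-preferred term: follow the see-reference to the preferred term.
        return [USE_REFERENCES[key]]
    # Unknown term: pass it through unchanged.
    return [term.strip()]
```

A synonym resolves silently to one preferred term, whereas a homograph returns more than one candidate, which is the point at which a system would need to ask the user to disambiguate.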
A subject portal connects users to a site focusing on a
particular subject, with access to high-quality information resources, allowing
aggregated cross-searching, streamlined account management, user profiling, or
additional services (6). However, the user has to know to go to the portal.
There is a growing number of portals.
Renardus is an example of a subject
gateway/portal project to provide users with integrated access by searching or
browsing, through a single interface, to partners' quality-controlled subject
gateways. Further goals are to develop and define organizational models,
business models, technical solutions and metadata standards (Renardus Application Profile, Renardus
Namespaces, Renardus Collection
Level Description). The following elements can be used to define a
quality-controlled subject gateway: a) selection and collection development, b)
collection management, c) creation, d) resource description and metadata, e)
subject access, f) search and browse access, g) standards, h) value-adding
features. Each participating partner is responsible for mapping its metadata
format to the common Renardus metadata format,
derived from Dublin Core. A generic normalization toolkit with Z39.50
configuration files and a conversion script were provided. Each participant set
up a Renardus server with their content normalized to
the Renardus data model. A
set of screens was built for the user interface: a) homepage, b) advanced
search screen, c) index scan window, d) advanced search page after index scan,
e) browse by subject screen, f) (preliminary) result screen, g) sorted result
screen, h) participating gateways screen, and i) help (index) screen. In order
to accomplish subject browsing, the various systems are mapped to a common
classification system. The Renardus service provides
access to resources on all kinds of subjects, published world-wide and in
many languages, and it is intended to be offered to an international,
multi-disciplinary community of users. Dewey Decimal
Classification was chosen because of its online availability and tools,
global usage, the suitability of the classification system and its functionality,
the frequency and character of its updates, and research and methodological
development efforts. (7)
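The two mappings described in this paragraph, local metadata fields to a common Dublin Core-derived format and local classes to a common classification, can be illustrated with a minimal Python sketch. It does not reproduce the Renardus normalization toolkit. The local field names, the local class codes, and the function names are assumptions made for the example; only the Dublin Core element names (title, creator, subject, identifier) and the DDC class numbers 510 (Mathematics) and 530 (Physics) are real.

```python
# Hypothetical crosswalk: local field name -> Dublin Core element name.
FIELD_CROSSWALK = {
    "doc_title": "title",
    "author": "creator",
    "keywords": "subject",
    "link": "identifier",
}

# Hypothetical mapping: local class code -> DDC class number.
LOCAL_TO_DDC = {
    "MATH": "510",
    "PHYS": "530",
}


def normalize(record):
    """Map a local gateway record onto common element names plus a DDC class."""
    common = {}
    for field, value in record.items():
        if field in FIELD_CROSSWALK:
            common[FIELD_CROSSWALK[field]] = value
    local_class = record.get("class")
    if local_class in LOCAL_TO_DDC:
        common["ddc"] = LOCAL_TO_DDC[local_class]
    return common


def browse_by_ddc(records):
    """Group normalized records by DDC class to support cross-browsing."""
    groups = {}
    for rec in records:
        groups.setdefault(rec.get("ddc", "unmapped"), []).append(rec.get("title", ""))
    return groups
```

Records whose local class has no DDC equivalent fall into an "unmapped" group rather than being dropped, so gaps in a partner's mapping stay visible.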
At about the same time that the SAC Subcommittee on Semantic
Interoperability was formed, NISO decided that Z39.19, Guidelines for the
Construction, Format, and Management of Monolingual Thesauri, needed
revision to meet the needs of the changing information environment. Their
rationale included, "Developers of Internet and Intranet-accessible Web
pages, databases, and information systems need better metadata to support
non-expert information searches, and metadata developers are recognizing the
value of incorporating high-quality, interoperable controlled vocabularies and
taxonomies into their schemes."(8)
The accompanying objectives and guidelines are intended to
enable developers to create an environment, system, or method
by which even multiple portals could be accessed via subject metadata, using
software that is neutral, available ubiquitously or directly to the user,
and able to be copied by libraries for use in their own environment.
Footnotes
1. American Library Association (1999). "Subject
data in the metadata record", Association for Library Collections &
Technical Services, Cataloging and Classification Section, Subcommittee on
Metadata and Subject Analysis. http://www.ala.org/ala/alctscontent/catalogingsection/catcommittees/subjectanalysis/metadataandsubje/subjectdata.htm
2. Chan, Lois (2000) "Exploiting LCSH, LCC, and DDC to
retrieve networked resources: issues and challenges", Library of Congress. http://lcweb.loc.gov/catdir/bibcontrol/chan_paper.html
3. Doerr, M. (2001) "Semantic
problems of thesauri mapping," Journal of Digital Information, vol. 1, no.
8 http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Doerr/
4. Hunter, Jane (2001) "MetaNet:
a metadata term thesaurus to enable interoperability between metadata domains",
Journal of Digital Information, vol. 1, no. 8 http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Hunter/
5. Lovins, Daniel. "Summaries and reflections of Thesaurus Design for Semantic
Information Management", a day-long
seminar led by Prof. Bella Hass-Weinberg in
6. Resource Discovery Network (2002). "Renardus"; "Subject Portals Development
Project". http://rdn.ac.uk/projects/#Euro
7. Neuroth, Heike and Koch, Traugott (2001) Cross-browsing and cross-searching in a
distributed network of subject gateways: architecture, data model, and
classification, at European Library Automation Group’s 25th library
systems seminar, Integrating Heterogeneous Resources, Prague 6-8, June 2001 (http://www.kbr.be/elag/) http://www.stk.cz/elag2001/Papers/HeikeNeuroth/HeikeNeuroth.htm
Accessed Aug. 8, 2002
8. NISO. Developing the Next
Generation of Standards for Controlled Vocabularies and Thesauri.
http://www.niso.org/committees/MT-info.html
Accessed
Subject Semantic Interoperability
Subcommittee charge:
Specific tasks include, but are not necessarily limited to:
a) An inventory of known semantic interoperability
projects, with descriptions;
b) An evaluation of selected projects, in terms of those
projects' stated objectives;
c) An investigation of the various concepts involved in the
harmonization of indexing languages, such as switching languages, concordance
tables, front-end thesauri, meta-thesauri, and mapping.
Issues to examine with regard to the above tasks may
include:
d) Conditions which optimize the effectiveness of
harmonization, both among indexing languages of the same type, and among
languages of different types;
e) Simplification of existing indexing languages, in the
context of interoperability;
f) Approaches to integration and harmonization of subject
vocabularies and knowledge organization schemes used in various metadata
standards for the purposes of effective and efficient resource discovery.
Definition of Subject Semantic Interoperability: The
ability of two or more systems or components to exchange or harmonize cognate
subject vocabularies and/or knowledge organization schemes to be used for the
purposes of effective and efficient resource discovery without significant loss
of lexical or connotative meaning and without special effort by the user.
Goals
Work began with the assumption that we would try to
provide recommendations to:
a) serve as guidelines in structuring a system that supports
semantic interoperability among vocabularies by employing one or several of the
methods listed below:
   1. harmonization of indexing languages (assumes simplification)
   2. switching languages
   3. concordance tables
   4. front-end thesauri or front-end "cluster"
   5. metathesaurus
   6. semantic networks
   7. multilingual thesauri
   8. mapping (methodologies)/types
      a) among multiple vocabularies in different languages and classification systems
      b) between a controlled vocabulary and a universal classification system
      c) between classification systems
      d) to a new system/metathesaurus
      e) to another thesaurus or classification not used by the participants
      f) among controlled vocabularies in the same language: thesauri, controlled
         lists of keywords, ontologies, clustering approaches, taxonomies, lexical
         databases, concept maps/spaces, semantic road maps, etc.
b) guide development of database management structures to allow automated artificial
intelligence and manual methods to create the appropriate relational links for:
   1. multilinguality
   2. synonyms
   3. homographs
   4. singulars and plurals
   5. parts of speech
   6. cultural differences affecting meaning
   7. narrower, broader, and related terms
   8. syntactical differences
c) guide development of interfaces for entry
of semantically appropriate terms by:
   1. trained institutional staff
   2. novice, non-institutional creators
   3. a method to alert trained staff to new terms for enhancement
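Of the methods listed under goal a), the switching language lends itself to a short illustration. The Python sketch below is hypothetical: the two vocabularies, their terms, the concept identifiers, and the function name switch are assumptions made for the example and are not drawn from any project inventoried by the Subcommittee. Each vocabulary is mapped once to a neutral set of concept identifiers, and a term is translated by passing through that hub.

```python
# Hypothetical concordances: each vocabulary's terms -> a neutral hub concept ID.
TO_HUB = {
    "vocab_a": {"Neoplasm": "C001", "Motion pictures": "C002"},
    "vocab_b": {"Cancer": "C001", "Films": "C002", "Cinema": "C002"},
}


def switch(term, source, target):
    """Translate a term from one vocabulary to another through the hub."""
    concept = TO_HUB[source].get(term)
    if concept is None:
        # The source term has no hub concept, so nothing can be offered.
        return []
    # Collect every target term mapped to the same hub concept.
    return sorted(t for t, c in TO_HUB[target].items() if c == concept)
```

With a hub, each additional vocabulary needs only one new mapping to the hub, whereas direct concordance tables need one table for every pair of vocabularies. The sketch treats every mapping as an exact equivalence; a real system would also have to represent partial and one-to-many matches.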
Objectives
A set of objectives was identified to assist users in identifying and
selecting appropriate target resources.
To provide users with integrated access through a single
interface (or recommended interface format) to distributed quality services,
web pages or catalogs.
To enable a single search interface across heterogeneous metadata
descriptions (i.e. the ability for a user to search descriptive metadata in
multiple metadata forms).
To enable users to determine the target resources most useful to their research.
To provide for cross-browsing and cross-searching. To enable
the integration or merging of descriptions which are based on complementary but
possibly overlapping metadata schemas or standards
a) across multiple controlled vocabularies
b) across multiple domains (archives, art works, etc.)
c) across multiple subject (topic) areas
d) across multiple metadata schemes
e) across multiple internet-based resources
To enable users to access information in the language,
script, or form they prefer. To enable different views of the one underlying and
complete metadata description, depending on the user's particular interest,
perspective or requirements.
To bring precision to searches and response content by:
a) assisting users in identifying the most efficient paths for resource discovery
b) helping users focus their searches
c) enabling optimal recall
d) enabling optimal precision
e) assisting searchers in developing alternative search strategies
To provide consistency for users by controlling
forms used for access and displays
To be able to support groupings and rich
descriptions of resources through search interfaces and architecture to:
a) enable navigation
b) provide explanations for variations and inconsistencies
c) show relationships (broader, narrower, etc.)
To help collocate similar words or phrases
To be able to utilize controlled forms of names,
titles, and subjects to link to the authorized forms of names, titles, and
subjects that are used in various tools, such as directories, biographies, and
abstracting and indexing services.
To achieve goals and objectives in the most
effective, efficient, and economical manner.
To facilitate sharing to reduce cataloguing costs to
libraries, museums, archives, rights management agencies, etc.
To simplify creation and maintenance of subject-related
databases or authority records internationally for institutions participating
in managed projects
Issues
Working from the definition, a number of issues were
identified that need to be addressed.
Subject Semantic Interoperability: The ability of two or
more systems or components to exchange or harmonize cognate subject
vocabularies and/or knowledge organization schemes to be used for the purposes
of effective and efficient resource discovery without significant loss of
lexical or connotative meaning and without special effort by the user.
"two or more systems or
components"
- implies the systems remain independent at some level, with each maintaining
its own metadata standard
- individual resources, each internally constructed in its own semantically
consistent fashion
- can support multiple languages
- an operator interface for data entry
- able to accommodate differences in technical approach and working practices
- flexible and adaptable to different information communities
- extensible and scalable to accommodate the need for different degrees of
depth and different subject domains
"to exchange or harmonize
cognate subject vocabularies and/or knowledge organization schemes"
- able to support interoperability across diverse information
sectors
- able to employ one or more functional mechanisms (e.g. cross-search functionality)
<list, e.g. Z39.50, XML, etc.>
- a structured database where harmonization is pre-coordinated, or
- a scripted protocol that harvests data on the fly
- distributed use of thesauri, vocabularies, etc.
- able to harmonize the semantics of hierarchical relations and term overlap
among multiple vocabularies
"to be used for the purposes
of effective and efficient resource discovery"
- amenable to computer application
- able to be made available for gateways/portals or bibliographic catalogs
- improve retrievability of web resources
- able to alert the user to different terms used to describe similar concepts,
or even identical terms used to mean very different things
- able to provide the user with a means for selecting the
appropriate resource(s)
- versatile, i.e. able to perform different functions
- search interfaces and architecture that support groupings and rich
descriptions of resources
- able to save and export search results
- able to link search results to full-text or other content delivery options
- able to manage access to target resources
- able to search by specific fields in advanced searches
- support keyword and browse searches, including:
   a) ability to browse a list of targets
   b) ability to search target descriptions by keyword
   c) ability to present different views of targets (e.g. by subject, user group, etc.)
   d) ability to browse target resources in hierarchical displays
   e) ability to browse a composite list of target resources (aggregated databases)
- able to present different views of the target resources
- able to integrate metadata for target resources from more than one source
"without significant loss of
lexical or connotative meaning"
- method of identifying the meaning of one word/phrase with
another
- able to employ one or more of these methods to achieve
semantic interoperability
<list, e.g. mapping>
"without special effort by
the user" (including creators, institutional staff, and general
users)
- system with structural specifications for searching and
navigation
- system with structural specification for display
- automated checking against
specified controlled heading database(s) for linking/updating
- relatively simple to apply, use, and comprehend
To achieve interoperability between systems, those responsible for the
systems need to:
- develop and define organizational models
- develop business models
- develop shareable technical solutions and metadata
standards (e.g. Renardus Application Profile, Renardus Namespaces, Renardus
Collection Level Description)
- establish (publicly) guidelines for:
   o Selection and collection development
   o Collection management
   o Record creation
   o Resource description and metadata
   o Subject access
   o Search and browse access
   o Standards
   o Value-adding features
Levels
There are three levels to be approached:
a) the vocabulary level
- thesauri
- vocabularies
- word lists
- subject heading lists
- classification systems
- concept maps
- lexical databases
- etc.
b) the system level
- structured database of subject-related terms or authority records controlled
by one or more agencies following policies and procedures for a defined
project; implies workforms, manuals, standards, etc.
- dispersed input by local agencies into a larger database
- automated harvesting, with minimal human input
- user-generated indexes based on natural language
- management software
- protocols
- registries
- metadata and record format
- query language
- etc.
c) the user level
- easy to understand presentation
- several options available to devise a search strategy, e.g. search and
browse, basic and advanced
- several options available to manipulate search response,
e.g. limit, re-sort, etc.
- etc.
Modes
There are several modes in which vocabularies may
interoperate: global, structured, or a mixture.
Global
Any internet-based resources could be identified, harvested, and presented to the
user on the fly, based on artificial intelligence algorithms.
Structured
As in a subject gateway.
Subject gateway: a subject-based resource discovery guide
which provides links to information resources (documents, collections, sites or
services), predominantly accessible via the internet, and applies a documented
set of quality measures to support systematic resource discovery. It is also
managed: collected by humans according to documented selection criteria, with
maintenance criteria, and with a fixed metadata set and controlled subject
classification.
Mixture
End-user thesauri are developed from analysis, by humans and machines, of
search transactions; commonly used natural language terms
thereby become descriptors based on user warrant.
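The machine side of that analysis can be sketched briefly in Python. This is an assumption-laden illustration, not a described implementation: the function name candidate_descriptors, the min_count threshold, and the idea of treating each whole query as one phrase are choices made for the example. It counts how often each query phrase occurs in a search log and reports frequent phrases that are not yet controlled terms, leaving the decision to adopt them to a human editor.

```python
from collections import Counter


def candidate_descriptors(queries, controlled_terms, min_count=2):
    """Return frequent query phrases that are not yet controlled terms."""
    known = {t.lower() for t in controlled_terms}
    # Normalize case and whitespace so trivial variants are counted together.
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return [
        (phrase, n)
        for phrase, n in counts.most_common()
        if n >= min_count and phrase not in known
    ]
```

For example, a log in which users repeatedly search "heart attack" while the vocabulary only holds "Myocardial infarction" would surface "heart attack" as a candidate entry term or descriptor.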