Text Information Retrieval Systems

3rd Edition - December 19, 2006
Latest edition
Authors: Charles T. Meadow, Bert R. Boyce, Donald H. Kraft, Carol L. Barry
Language: English

This will be the third edition of the highly successful ‘Text Information Retrieval Systems’. The book's purpose is to teach people who will be searching or designing text retrieval systems how the systems work. For designers, it covers problems they will face and reviews currently available solutions to provide a basis for more advanced study. For searchers, its purpose is to describe why such systems work as they do. The book is primarily about computer-based retrieval systems, but the principles apply to nonmechanized ones as well. The book covers the nature of information, how it is organized for use by a computer, how search functions are carried out, and some of the theory underlying these functions. It also discusses the interaction between user and system and how retrieved items, users, and complete systems are evaluated. A limited knowledge of mathematics and of computing is assumed.

This third edition will be updated to include coverage of the WWW and current search engines. In many cases, examples of non-web searching will be replaced with web-based illustrations. Interfaces, the various features available to assist searchers, and areas in which search assistance is not available will also be covered. In addition, the book will have a web dimension, which will include relevant material available online to be used in conjunction with the text.

Contents

Preface

1
Introduction

1.1 What Is Information?

1.2 What Is Information Retrieval?

1.3 How Does Information Retrieval Work?

1.3.1 The User Sequence

1.3.2 The Database Producer Sequence

1.3.3 System Design and Functioning

1.3.4 Why the Process Is Not Perfect

1.4 Who Uses Information Retrieval?

1.4.1 Information Specialists

1.4.2 Subject Specialist End Users

1.4.3 Non-Subject Specialist End Users

1.5 What Are the Problems in IRS Design and Use?

1.5.1 Design

1.5.2 Understanding User Behavior

1.6 A Brief History of Information Retrieval

1.6.1 Traditional Information Retrieval Methods

1.6.2 Pre-computer IR Systems

1.6.3 Special Purpose Computer Systems

1.6.4 General Purpose Computer Systems

1.6.5 Online Database Services

1.6.6 The World Wide Web

Recommended Reading

2
Data, Information, and Knowledge

2.1 Introduction

2.2 Definitions

2.2.1 Data

2.2.2 Information

2.2.3 News

2.2.4 Knowledge

2.2.5 Intelligence

2.2.6 Meaning

2.2.7 Wisdom

2.3 Metadata

2.4 Knowledge Base

2.5 Credence, Justified Belief, and Point of View

2.6 Summary

3
Representation of Information

3.1 Information to Be Represented

3.2 Types of Representation

3.2.1 Natural Language

3.2.2 Restricted Natural Language

3.2.3 Artificial Language

3.2.4 Codes, Measures, and Descriptors

3.2.5 Mathematical Models of Text

3.3 Characteristics of Information Representations

3.3.1 Discriminating Power

3.3.2 Identification of Similarity

3.3.3 Descriptiveness

3.3.4 Ambiguity

3.3.5 Conciseness

3.4 Relationships Among Entities and Attribute Values

3.4.1 Hierarchical Codes

3.4.2 Measurements

3.4.3 Nominal Descriptors

3.4.4 Inflected Language

3.4.5 Full Text

3.4.6 Explicit Pointers and Links

3.5 Summary

4
Attribute Content and Values

4.1 Types of Attribute Symbols

4.1.1 Numbers

4.1.2 Character Strings—Names

4.1.3 Other Character Strings

4.2 Class Relationships

4.2.1 Hierarchical Classification

4.2.2 Network Relationships

4.2.3 Class Membership—Binary, Probabilistic, or Fuzzy

4.3 Transformation of Values

4.3.1 Transformation of Words by Stemming

4.3.2 Sound-Based Transformation of Words

4.3.3 Transformation of Words by Meaning

4.3.4 Transformation of Graphics

4.3.5 Transformation of Sound

4.4 Uniqueness of Values

4.5 Ambiguity of Attribute Values

4.6 Indexing of Text

4.7 Control of Vocabulary

4.7.1 Elements of Control

4.7.2 Dissemination of Controlled Vocabularies

4.8 The Importance of Point of View

4.9 Summary

5
Models of Virtual Data Structure

5.1 The Concept of Models of Data

5.2 Basic Data Elements and Structures

5.2.1 Scalar Variables and Constants

5.2.2 Vector Variables

5.2.3 Structures

5.2.4 Arrays

5.2.5 Tuples

5.2.6 Relations

5.2.7 Text

5.3 The Common Structural Models

5.3.1 The Linear Sequential Model

5.3.2 The Relational Model

5.3.3 Hierarchical and Network Models

5.4 Applications of the Basic Models

5.4.1 Hypertext

5.4.2 Spreadsheet Files

5.5 The Entity-Relationship Model

5.6 Summary

6
The Physical Structure of Data

6.1 Introduction to Physical Structures

6.2 Record Structures and Their Effects

6.2.1 Basic Structures

6.2.2 Space-Time and Transaction Rate

6.3 Basic Concepts of File Structure

6.3.1 The Order of Records

6.3.2 Finding Records

6.4 Organizational Methods

6.4.1 Sequential Files

6.4.2 Index-File Structures

6.4.3 Lists

6.4.4 Trees

6.4.5 Direct-Access Structures

6.5 Parsing of Data Elements

6.5.1 Phrase Parsing

6.5.2 Word Parsing

6.5.3 Word and Phrase Parsing

6.6 Combination Structures

6.6.1 Nested Indexes

6.6.2 Direct Structure with Chains

6.6.3 Indexed Sequential Access Method

6.7 Summary

7
Querying the Information Retrieval System

7.1 Introduction

7.2 Language Types

7.3 Query Logic

7.3.1 Sets and Subsets

7.3.2 Relational Statements

7.3.3 Boolean Query Logic

7.3.4 Ranked and Fuzzy Sets

7.3.5 Similarity Measures

7.4 Functions Performed

7.4.1 Connect to a Remote IRS

7.4.2 Select Database

7.4.3 Search the Inverted File or Thesaurus

7.4.4 Create a Subset of the Database

7.4.5 Search for Strings

7.4.6 Analyze a Set

7.4.7 Sort, Display, and Format Records

7.4.8 Handle the Unstructured Record

7.4.9 Download

7.4.10 Order Documents

7.4.11 Save, Recall, and Edit Searches

7.4.12 Current Awareness Search

7.4.13 Cost Summary

7.4.14 Terminate a Session

7.5 The Basis for Charging for Searches

8
Interpretation and Execution of Query Statements

8.1 Problems of Query Language Interpretation

8.1.1 Parsing Command Language

8.1.2 Parsing Natural Language

8.1.3 Processing Menu Choices

8.2 Executing Retrieval Commands

8.2.1 Database Selection

8.2.2 Inverted File Search

8.2.3 Set or Subset Creation

8.2.4 Truncation and Universal Characters

8.2.5 Left-Hand Truncation

8.3 Executing Record Analysis and Presentation Commands

8.3.1 Set Analysis Functions

8.3.2 Display, Format, and Sort

8.3.3 Offline Printing

8.4 Executing Other Commands

8.4.1 Ordering

8.4.2 Save, Recall, and Edit Searches

8.4.3 Current Awareness

8.4.4 Cost Summation and Billing

8.4.5 Terminate a Session

8.5 Feedback to Users and Error Messages

8.5.1 Response to Command Errors

8.5.2 Set-Size Indication

8.5.3 Record Display

8.5.4 Set Analysis

8.5.5 Cost

8.5.6 Help

9
Text Searching

9.1 The Special Problems of Text Searching

9.1.1 Note on Terminology and Symbols

9.1.2 The Semantic Web

9.2 Some Characteristics of Text and Their Applications

9.2.1 Components of Text

9.2.2 Significant Words—Indexing

9.2.3 Significant Sentences—Abstracting

9.2.4 Measures of Complete Texts

9.3 Command Language for Text Searching

9.3.1 Set Membership Statements

9.3.2 Word or String Occurrence Statements

9.3.3 Proximity Statements

9.3.4 Web-Based Text Search

9.4 Term Weighting

9.4.1 Indexing with Weights

9.4.2 Automated Assignment of Weights

9.4.3 Improving Weights

9.5 Word Association Techniques

9.5.1 Dictionaries and Thesauri

9.5.2 Mini-Thesauri

9.5.3 Word Co-occurrence Statistics

9.5.4 Stemming and Conflation

9.6 Text or Record Association Techniques

9.6.1 Similarity Measures

9.6.2 Clustering

9.6.3 Signature Matching

9.6.4 Discriminant Methods

9.7 Other Processes with Words of a Text

9.7.1 Stop Words

9.7.2 Replacement of Words with Roots or Associated Words

9.7.3 Varying Significance as a Function of Frequency

9.7.4 Comments on the Computation of the Strength of Document Association

10
System-Computed Relevance and Ranking

10.1 The Retrieval Status Value (rsv)

10.2 Ranking

10.3 Methods of Evaluating the rsv

10.3.1 The Vector Space Model

10.3.2 The Probabilistic Model

10.3.3 The Extended Boolean Model

10.4 The rsv in Operational Retrieval

11
Search Feedback and Iteration

11.1 Basic Concepts of Feedback and Iteration

11.2 Command Sequences

11.3 Information Available as Feedback

11.3.1 File or Database Selection

11.3.2 Terms Search or Browsing

11.3.3 Record Search or Set Formation

11.3.4 Record Display and Browsing

11.3.5 Record Acquisition

11.3.6 Requests for Information about the Retrieval System

11.3.7 Establishing Communication Parameters

11.3.8 Trends over Sequences and Cycles

11.4 Adjustments in the Search

11.4.1 Improve Term Selection

11.4.2 Improve Set Formation Logic

11.4.3 Improve Final Set Size

11.4.4 Improve Precision, Recall, or Total Utility

11.5 Feedback from User to System

12
Multi-Database Searching and Mapping

12.1 Basic Concepts

12.2 Multi-Database Search

12.2.1 The Nature of Duplicate Records

12.2.2 Detection of Duplicates

12.2.3 Scanning Multiple Databases

12.3 Mapping

12.4 Value of Mapping

13
Search Strategy

13.1 The Nature of Searching Reconsidered

13.1.1 Known Item Search

13.1.2 Specific Information Search

13.1.3 General Information Search

13.1.4 Exploration of the Database

13.2 The Nature of Search Strategy

13.2.1 Search Objective

13.2.2 General Plan of Operation

13.2.3 The Essential Information Elements of a Search

13.2.4 Specific Plan of Operation

13.3 Types of Strategies

13.3.1 Categorizing by Objective

13.3.2 Categorizing by Plan of Operation

13.4 Tactics

13.4.1 Monitoring Tactics

13.4.2 File Structure Tactics

13.4.3 Search Formulation Tactics

13.4.4 Term Tactics

13.5 Summary

14
The Information Retrieval System Interface

14.1 General Model of Message Flow

14.2 Sources of Ambiguity

14.3 The Role of a Search Intermediary

14.3.1 Establishing the Information Need

14.3.2 Development of a Search Strategy

14.3.3 Translation of the Need Statement into a Query

14.3.4 Interpretation and Evaluation of Output

14.3.5 Search Iteration Within the Strategic Plan

14.3.6 Change of Strategy when Necessary

14.3.7 Help Using an IRS

14.4 Automated Search Mediation

14.4.1 Early Development

14.4.2 Fully Automatic Intermediary Functions

14.4.3 Interactive Intermediary Functions

14.5 The User Interface as a Component of All Systems

14.6 The User Interface in Web Search Engines

15
A Sampling of Information Retrieval Systems

15.1 Introduction

15.2 Dialog

15.2.1 A Command Language Using Boolean Logic

15.2.2 Target

15.2.3 DIALOGWeb: A Web Adaptation

15.3 AltaVista

15.3.1 Default Query Entry Form

15.3.2 Advanced Search Form

15.4 Google

15.4.1 The Web Crawler

15.4.2 Searching

15.4.3 Google Advanced Search

15.5 PubMed

15.6 EBSCO Host

15.7 Summary

16
Measurement and Evaluation

16.1 Basics of Measurement

16.1.1 The Data Manager

16.1.2 The Query Manager

16.1.3 The Query Composition Process

16.1.4 Deriving the Information Need

16.1.5 The Database

16.1.6 Users

16.2 Relevance, Value, and Utility

16.2.1 Relevance as Relatedness

16.2.2 Aspects of Value

16.2.3 Relevance as Utility

16.2.4 Retaining Two Separate Relevance Measures

16.2.5 The Relevance Measurement Scale

16.2.6 Taking the Measurements

16.2.7 Questions About Relevance as a Measure

16.3 Measures Based on Relevance

16.3.1 Precision (Pr)

16.3.2 Recall (Re)

16.3.3 Relationship of Recall and Precision

16.3.4 Overall Effectiveness Measures Based on Re and Pr

16.4 Measures of Process

16.4.1 Query Translation

16.4.2 Errors in a Query Statement

16.4.3 Average Time per Command or per User Decision

16.4.4 Elapsed Time of a Search

16.4.5 Number of Commands or Steps in a Search

16.4.6 Cost of a Search

16.4.7 Size of Final Set Formed

16.4.8 Number of Records Reviewed by the User

16.4.9 Patterns of Language Use

16.4.10 Measures of Rank Order

16.5 Measures of Outcome

16.5.1 Precision

16.5.2 Recall

16.5.3 Efficiency

16.5.4 Overall User Evaluation

16.6 Measures of Environment

16.6.1 Database Record Selection

16.6.2 Record Content

16.6.3 Measures of Users

16.7 Conclusion

Bibliography

Index

Charles T. Meadow

Charles T. Meadow is professor emeritus at the University of Toronto and has been visiting professor at the Universities of North Carolina and the West Indies. He edited the Journal of the American Society for Information Science and the Canadian Journal of Information Science and was president of the Canadian Association for Information Science. He received the Research Award and shared the Annual Information Science Book Award from ASIS&T.

Affiliations and expertise

University of Toronto, Ontario, Canada

Bert R. Boyce

Bert Boyce has been an Information System Research Analyst for the Information Systems Office at the Library of Congress; a faculty member and acting Dean of the School of Library and Information Science, University of Missouri, Columbia, Missouri; and Dean of the School of Library and Information Science, Louisiana State University, where he is now Professor and Dean Emeritus. He is currently Editor of the Academic Press Library and Information Science Series. He received the ASIS&T Outstanding Information Science Teacher Award in 1989 and has shared the Annual Information Science Book Award from ASIS&T.

Affiliations and expertise

Louisiana State University, Baton Rouge, U.S.A.

Donald H. Kraft

Donald Kraft is a professor at LSU and Distinguished Visiting Professor at the U.S. Air Force Academy. He is a fellow of IEEE and AAAS and editor of the Journal of the American Society for Information Science and Technology. He received the Research Award and the Watson Davis Award and shared the Annual Information Science Book Award from ASIS&T, and he received the LSU Distinguished Faculty Award.

Affiliations and expertise

Louisiana State University, Baton Rouge, U.S.A.

Carol L. Barry

Carol Barry is an associate professor in the School of Library and Information Science, Louisiana State University. She has received the Best JASIS Paper Award, 1995; the LSU Alumni Association Teaching Award, 1995; and the American Society for Information Science Doctoral Forum Award, 1993. She is associate editor of JASIS&T, a member of the Board of ASIS&T, and a member of the LSU Faculty Senate, of which she was vice president in 2000-2001. She has authored or co-authored over 30 research papers.

Affiliations and expertise

Associate Professor at Louisiana State University, USA.
