ILS506 - Information Analysis and Organization



Cataloging the Internet for the Sake of the User

__________________________

____________________________________________________________________________

Introduction

Historically, libraries and librarians have sought to collect, organize, preserve, and disseminate the collective knowledge of the world. Although the Diamond Sutra holds the distinction of being the earliest dated printed book, it was not until Johannes Gutenberg invented a printing press with wooden (and later metal) movable type in 1436 that printed material became accessible, although not affordable, to the masses. Primarily due to economic factors, book ownership is a fairly recent luxury; previously, people sought out books from their local academic or public libraries. Initially, libraries kept track of their collections by creating lists in books. In 1901 the Library of Congress began to sell printed cards of bibliographic information to libraries. Filing these bibliographic cards in a card catalog allowed users to find information through multiple access points, as opposed to the single point of access available when materials were cataloged in books.
   
With the subsequent development of the Dewey Decimal and Library of Congress Classification systems, the Anglo-American Cataloging Rules (AACR2), and the MARC format, users were able to access information locally that adhered to consistent standards. With the development of MARC records, card catalogs in individual libraries gave way to electronic cataloging systems, further streamlining the information-seeking process. In 1967, the Ohio College Library Center (OCLC), a consortium of 54 colleges in Ohio, formed a network to share their collections cataloged in MARC records. Since then, the name has changed to the Online Computer Library Center, and membership is open to all libraries through WorldCat.
   
Today, using a computer with Internet access, a user is able to locate items through any one of the 138 million bibliographic records held by WorldCat (OCLC). O’Daniel (1999) points out that “the ability to operate as a collective requires consistent standards for precise communication” (p. 2). The Library of Congress developed these consistent standards by creating Authority Headings for subject, name, title, and name/title combinations and by permitting users to harvest this data at http://authorities.loc.gov/. This controlled vocabulary gives users consistent and accurate bibliographic records and increases the likelihood that they will reach the information they seek.

Resources on the World Wide Web

Over the past quarter century, however, there has been a staggering increase in the availability of information in digital format on the World Wide Web (hereafter referred to as the Web). The most recent study conducted by the faculty and students at the School of Information Management and Systems at the University of California at Berkeley (Lyman et al, 2003) sought to quantify how much new information is created each year. The growth in information recorded from 2000 to 2003 is astounding. In terms of information on the Internet, they estimated that in the year 2000 there were 20 to 50 terabytes of information on the Surface Web. In three short years that number had more than tripled, from 20 to 50 terabytes to 167 terabytes. Lyman et al (2003) cite BrightPlanet, the industry leader in harvesting information from the Deep Web, which estimated that the Deep Web holds 400 to 550 times the information of the Surface Web. Applied to 167 terabytes, that estimate places information on the Deep Web at between 66,800 and 91,850 terabytes. As a point of reference, it would require 10 terabytes to contain all the information in the entire print collections of the U.S. Library of Congress. In 2003, the Web therefore held the equivalent of roughly 6,700 to 9,200 times the entire LOC print collections, most of it on the Deep Web.
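
The arithmetic behind these estimates can be checked with a short sketch. It uses only the figures quoted above; the upper bound of 91,850 terabytes corresponds to a multiplier of 550 applied to 167 terabytes. The function and variable names are illustrative assumptions, not part of the Lyman et al. study.

```python
# Figures quoted in the paragraph above (Lyman et al., 2003).
SURFACE_WEB_TB = 167               # Surface Web in 2003, in terabytes
DEEP_WEB_MULTIPLIERS = (400, 550)  # Deep Web size as a multiple of the Surface Web
LOC_PRINT_TB = 10                  # print collections of the Library of Congress


def deep_web_range_tb(surface_tb, multipliers):
    """Return the (low, high) Deep Web estimate in terabytes."""
    low, high = multipliers
    return surface_tb * low, surface_tb * high


low_tb, high_tb = deep_web_range_tb(SURFACE_WEB_TB, DEEP_WEB_MULTIPLIERS)
print(low_tb, high_tb)                                # 66800 91850
print(low_tb / LOC_PRINT_TB, high_tb / LOC_PRINT_TB)  # 6680.0 9185.0
```

In other words, the Deep Web alone works out to about 6,680 to 9,185 times the Library of Congress print collections.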

Looking at these figures, it is not difficult to understand the complexity of the task of organizing and making sense of all this available information. A review of literature published between 1996 and 1999 (Morgan 1996, Beall 1997, Shafer 1997, Porter and Bayard 1999) highlights the critical issues of Internet resources and the desire of many in the Information and Library Science field to catalog websites and include them in online catalogs. One has only to peruse some of those titles (Why Catalog Internet Resources?; Cataloging world wide web sites consisting mainly of links; Including Web Sites in the Online Catalog: Implications for Cataloging, Collection Development and Access) and read their abstracts to realize that the issues facing the field on this topic have remained fairly constant over the past decade.

Why Catalog When There is Google?

One may wonder why there is a need to catalog the information on the Web when people are able to access information through generalized search engines such as Google. In fact, “Google's mission is to organize the world's information and make it universally accessible and useful” (Google 2009). Looking at the statistics, it appears Google is doing just that. Vogelstein (2009) notes that Google’s “millions of servers process about 1 petabyte of user-generated data every hour. It conducts hundreds of millions of searches every day” (64). While Google is able to process the equivalent of roughly 100 times the print collections of the Library of Congress every hour (1 petabyte against 10 terabytes), volume and speed cannot be construed as true indicators of the accuracy of information or the relevancy of items retrieved. Bergman (2001) points out that traditional search engines like Google build their indices by crawling Web pages. For this form of indexing to work, pages need to be static and linked to other pages. Content in the Deep Web cannot be indexed this way because the majority of it doesn’t exist in a static format. The information is stored in searchable databases, and results are created “on the fly” in response to a specific query.

According to Bergman (2001), the key findings in a study conducted by BrightPlanet include:
  • The deep Web contains nearly 550 billion individual documents compared to the one billion of the surface Web.
  • More than 200,000 deep Web sites presently exist.
  • Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information -- sufficient by themselves to exceed the size of the surface Web forty times.
  • On average, deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public.
  • The deep Web is the largest growing category of new information on the Internet.
  • Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
  • Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.
  • Deep Web content is highly relevant to every information need, market, and domain.
  • More than half of the deep Web content resides in topic-specific databases.
  • A full ninety-five per cent of the deep Web is publicly accessible information -- not subject to fees or subscriptions.                 

Electronic Resources

Today’s academic libraries recognize the importance of providing their students and faculty with access to the high quality, peer-reviewed journals found on the Deep Web. Kyrillidou (2000) reports that “experimental data collected by ARL libraries over the last decade indicate that the portion of the library materials budget that is spent on electronic resources is indeed growing rapidly, from an estimated 3.6% in 1992-93 to 10.56% in 1998-99” (p. 1). With colleges and universities spending over 10% of their materials budget on subscription databases and journals, it would seem that students would express a high level of satisfaction with their access to them. Yet the Online Computer Library Center’s (OCLC) College Students’ Perceptions of Libraries and Information Resources, 2005 (Perceptions) survey did not confirm this. Instead, DeRosa et al. found that while students show “high levels of awareness of library electronic resources” (6-3), they were dissatisfied with their library websites. In fact, only 2% of college students began their information searches on their library websites (6-2), and 90% were dissatisfied with the information they found when a general search engine directed them to electronic resources in their library’s collection (6-3). High on the list of “Ten Things Google has Found to be true” is “Fast is better than slow. Google believes in instant gratification. You want answers and you want them right now. Who are we to argue?” Who indeed. In fact, the OCLC survey documented the students’ perception that “search engines deliver better quality and quantity of information than librarian-assisted searching-and at greater speed” (DeRosa et al. 6-4).

Information Seeking Behaviors

Carol C. Kuhlthau (1991) at the School of Communication, Information, and Library Studies at Rutgers has conducted extensive research on information seeking behavior from the user’s perspective. She states that in the process of seeking information, “[t]he bibliographic paradigm is based on certainty and order, whereas users’ problems are characterized by uncertainty and confusion” (p. 361). This uncertainty frequently causes feelings of anxiety in the student, and a search that returns broad, generalized information compounds this state. Kuhlthau’s (1991) research demonstrated that “a clear formulation reflecting a personal view of the information encountered is the turning point of the search… confusion decreases, and interest intensifies” (p. 370). It is exactly because of this uncertainty and confusion that generalized search engines are detrimental to the information seeking process.

In a basic search with generalized search engines like Google, a 1995 study by Taylor and Clemson (1996) found:
  • There is much duplication of entries within the same set of retrieved hits.
  • Results are unpredictable.
  • Results can be quite misleading--the same search can retrieve no hits by one engine and many hits by another.
  • Search engines do not readily disclose the contents of their databases nor do they provide a description of the criteria used to include a document in their files.
  • Vocabulary is not controlled, and punctuation and capitalization rules are not standardized.
  • Relationships and relevance often cannot be analyzed without actually examining each item--that is, there is not enough information in the "index entry" to allow one to make educated choices. (p. 1)

John Lubans’ (1998) research at Duke University on freshman Internet use found that, of the students who responded to the survey, only 7% ranked their ability to use the Web as “best”, while 23% rated it as “better” and 29% as “good”. Clearly these students could benefit from the organization and cataloging of Internet resources. Librarian and educator Rita Vine (2004) supports these findings and believes “search engines, which enable users to keyword-pattern-match against billions of web pages, are very good at finding distinctive phrases” (p. 20). The problems arise when users are in the beginning stages of discovery and are unsure exactly what they are looking for.

Porter and Bayard (1999) believe these weaknesses, added to the cumbersome and labor-intensive nature of subject-organized URL lists on websites, are important reasons for cataloging the Web. In addition, they note that complaints from librarians “about Web resources invariably center on the difficulties in organizing and archiving them… inconsistent quality… disappearance of their…URLs, resulting in the dreaded “404” message” (p. 1). Oder (1998) posits that while subject-organized lists are not the same as cataloged Internet resources, the Michigan Electronic Library (MEL), Internet Public Library (IPL), and INFOMINE are a few excellent examples of the ability of individual librarians to organize small portions of the Internet. The main problems these individual indexes or catalogs face, though, are their size and the frequent redundancy of items.

OCLC's Internet Cataloging Project

In 1991, OCLC’s Internet Cataloging Project began to address the need for a consortial approach to the problem, with 30 catalogers spearheading the movement to catalog Internet resources. Jul (1992) states that the findings at the end of the project demonstrated three things: overall, MARC/AACR2 cataloging supported cataloging Internet resources; a method to link the record to the resource was beneficial for the user; and instructional materials should be developed. In response to these findings, OCLC published a manual, library system vendors embraced the 856 MARC field for electronic location and access, and the Web OPAC was introduced. By 1998, Jul reported that over 18,000 Internet resources had been cataloged by over 5,000 OCLC libraries.
   
In spite of this initial success, there are many inherent difficulties in cataloging Internet resources. O’Daniel (1999) cites three: “lack of universally accepted controlled vocabulary; the lack of stability due to frequency of change of data; and the lack of quality standards” (p. 3). On the subject of cataloging electronic serials, Hawkins (1998) cites these difficulties: prior issues are hard to locate for descriptive information; publishers frequently update digital information, including titles; HTML and ASCII versions may have subtle differences; paper and digital versions vary; and many serials lack a table of contents. Further difficulties are encountered when websites move or even disappear. This causes time-consuming updates, as each local record holder must update its own OPAC. To address this problem, OCLC’s Office of Research developed persistent URLs, or PURLs. These aliases are assigned so that if a URL changes for any reason, any number of times, it need only be changed once, on the PURL server.
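
The indirection that a PURL provides can be sketched in a few lines. This is a toy model under stated assumptions, not OCLC's implementation: the `PurlServer` class, the `/net/example-journal` alias, and the example.org addresses are all hypothetical.

```python
class PurlServer:
    """Toy stand-in for a PURL server: maps a stable alias to the current URL."""

    def __init__(self):
        self._targets = {}

    def assign(self, purl, url):
        """Create the alias, or point an existing alias at a new URL."""
        self._targets[purl] = url

    def resolve(self, purl):
        """Return the URL the alias currently points to."""
        return self._targets[purl]


server = PurlServer()
server.assign("/net/example-journal", "http://old-host.example.org/journal/")

# Two libraries' catalog records store the PURL, not the raw URL.
opac_a = {"title": "Example Journal", "location": "/net/example-journal"}
opac_b = {"title": "Example Journal", "location": "/net/example-journal"}

# The site moves: one change on the server, and no catalog record is edited.
server.assign("/net/example-journal", "http://new-host.example.org/journal/")

resolved = [server.resolve(record["location"]) for record in (opac_a, opac_b)]
print(resolved)
```

Both records resolve to the new address even though neither OPAC was touched, which is the maintenance saving described above.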

The development of the Dublin Core Metadata Element Set addressed another problem with cataloging Internet resources: the need to standardize the metadata found on websites and to streamline the complexity of the MARC format. Using 15 predetermined but flexible elements, metatags are created and embedded in the documents. MARCit software was developed specifically to pull the metadata from the title and URL fields and place it in the 245 and 856 MARC fields. Although cataloging is a time-consuming and often cost-prohibitive activity, Internet cataloging will succeed only through these efforts to mesh Internet resources with local systems.
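
The kind of mapping described here, from embedded metadata to the 245 and 856 MARC fields, might look roughly like the sketch below. It is not MARCit's actual code. The dictionary keys follow Dublin Core element names (`title`, and `identifier` for the URL), and the indicator values, subfield codes, and sample record are assumptions chosen for illustration.

```python
def dublin_core_to_marc(dc):
    """Map two Dublin Core elements onto MARC-style (tag, indicators, subfields) tuples."""
    fields = []
    if "title" in dc:
        # 245 = title statement; subfield $a carries the title proper.
        fields.append(("245", "00", {"a": dc["title"]}))
    if "identifier" in dc:
        # 856 = electronic location and access; subfield $u carries the URL.
        fields.append(("856", "40", {"u": dc["identifier"]}))
    return fields


def format_field(field):
    """Render one field in a common human-readable MARC display style."""
    tag, indicators, subfields = field
    body = " ".join(f"${code} {value}" for code, value in subfields.items())
    return f"{tag} {indicators} {body}"


# Hypothetical metadata harvested from a web page's metatags.
metadata = {
    "title": "Example Web Resource",
    "identifier": "http://www.example.org/",
    "creator": "Example Author",  # ignored by this two-field sketch
}

lines = [format_field(f) for f in dublin_core_to_marc(metadata)]
for line in lines:
    print(line)
```

Only two of the fifteen elements are mapped, mirroring the title and URL fields named above; a fuller crosswalk would cover the remaining elements.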

Academic Library Projects

Oder (1998) believes that “subject gateways” may appeal more to academic libraries, whose mission is to support the academic curricula and research needs of their students and faculty. This mission differs from that of public libraries, which find that smaller, more localized resources meet the information needs of their patrons. INFOMINE, developed in 1994 at the University of California, Riverside, combines 100,000 librarian-created links with 75,000 web crawler links. It uses modified LC subject headings and “focused, automatic Internet crawling as well as automatic text extraction and metadata creation functions to assist our experts in content creation and users in searching” (http://infomine.ucr.edu/).

Cross-searching vs Local Indexing

When students complain about the speed of searches through their library databases, the cause is frequently a search across multiple databases at one time. Because these databases have not been indexed locally, the results of each search query are created on the fly. This is the same critical flaw behind general search engines’ inability to access information that resides on the Deep Web. In this instance, library search engines do have access to the subscription databases; it is just that the search is too cumbersome due to the lack of indexing. At the 1999 digital libraries conference in Santa Fe, several inherent problems with Z39.50 cross-search were identified, namely that the tools are too slow, results are limited, and searches frequently time out. These are problems familiar to any freshman who has attempted even one database search. Rochkind (2007) believes “none of these problems are inherent to metasearch, but they are inherent challenges to the cross-search approach” (28). The “Open Archives Initiative – Protocol for Metadata Harvesting” (OAI-PMH), used for harvesting metadata (commonly referred to as “local indexing”), was a result of the Santa Fe conference. This harvesting, or local indexing, is the type used by Google Scholar and is what makes partnering with Google so appealing for academic libraries.
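
A minimal sketch of harvesting followed by local indexing is given below. It builds an OAI-PMH `ListRecords` request URL and then indexes an embedded sample response, so nothing is fetched over the network. The repository address, the sample record, and the helper names are hypothetical; only the `verb` and `metadataPrefix` parameters and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the harvesting request a local indexer would send to a provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hypothetical response, trimmed to one record for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Cataloging the Net</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest_titles(xml_text):
    """Extract {record identifier: title} from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = {}
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records[identifier] = title
    return records


def build_index(records):
    """Local index: each lowercased title word maps to the records containing it."""
    index = {}
    for identifier, title in records.items():
        for word in title.lower().split():
            index.setdefault(word, set()).add(identifier)
    return index


url = list_records_url("https://repository.example.org/oai")
index = build_index(harvest_titles(SAMPLE_RESPONSE))
print(url)
print(index["cataloging"])
```

Once the metadata sits in a local index, a user's query becomes a dictionary lookup rather than a live request to every remote database, which is why harvesting avoids the timeouts associated with cross-search.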

A major roadblock for many academic libraries that want to index locally is a lack of cooperation and permissions from their content providers. Rochkind (2007) notes, though, that content providers are beginning to give Google and Google Scholar access to their metadata in the hope of better placement in search results. EBSCOhost and Gale have followed suit and allowed Google and other web crawlers to index their metadata. The problem with partnering with Google Scholar is that libraries still don’t know what Google has or has not indexed. Rochkind (2007) recommends that “if libraries licensed full text or metadata by cooperating with the content provider, they could know exactly what they have in their index and be assured of its completeness” (p. 29).

Conclusion

In the course of a few short decades, most libraries have become or will become digital libraries on one scale or another. At this point Google and Google Scholar, with their local indexing approach, have set the stage for academic and public libraries alike to adopt the technology that allows their users to access information quickly, efficiently, and with verified authority. Today’s student wants to access information in a seamless environment and in a timely fashion. It will be through the cooperative efforts of libraries, librarians, catalogers, and content providers, transferring licensed content from providers to indexers by means of OAI-PMH harvesting, that today’s patrons will be able to access information in the time and format that they require.

___________________________________________________________

References



Baruth, B. E. Is your catalog big enough to handle the web? American Libraries, 31(7), 56-60.

Bergman, M. K. (2001). The deep web: Surfing hidden value. Retrieved July 12, 2009, from http://brightplanet.com/

Dowling, T. P. (1997). The world wide web meets the OPAC. OhioLINK central catalog web interface. ALCTS Newsletter, 8(2), A-D.

Glaser, R. Internet sites in the library catalog: Where are we now? Alabama Librarian, 56(2), 10-12.

Hawkins, L. (1997). Serials published on the world wide web: Cataloging problems and decisions. The Serials Librarian, 33(1-2), 123.

Jul, E. (1996). Why catalog internet resources? Computers in Libraries, 16(1), 8.

Kuhlthau, C. C. (1991). Inside the search process: Information seeking from the user's perspective. Journal of the American Society for Information Science (1986-1998), 42(5), 361.

Kyrillidou, M. (2000). Research Library Spending on Electronic Scholarly Information is on the Rise. The Association of Research Libraries. Retrieved July 11, 2009, from http://tinyurl.com/l5qm3c

Lubans, J. (1998, April). How first-year university students use and regard Internet resources. Retrieved July 10, 2009, from http://www.lubans.org/docs/1styear/firstyear.html

Nichols releases MARCit for cataloging internet resources. (1998). Information Today, 15(3), 51.

OCLC Internet Cataloging Colloquium, & OCLC. (1996). Proceedings of the OCLC internet cataloging colloquium.

OCLC., Weitz, J., Greene, R. O., & OCLC. (1998). Cataloging electronic resources OCLC-MARC coding guidelines.

O'Daniel, H. B. (1999). Cataloguing the internet. Retrieved July 12, 2009, from http://associates.ucr.edu/heather399.htm

Oder, N. (1998). Cataloging the net: Can we do it? Library Journal, 123(16), 47-51.

Porter, G. M., & Bayard, L. (1999). Including web sites in the online catalog: Implications for cataloging, collection development, and access. The Journal of Academic Librarianship, 25(5), 390-394.

Rochkind, J. (2007). (Meta)search like google. Library Journal, 132(3), 28-30.

Shafer, K. E. (1997). Scorpion helps catalog the web. research project at OCLC. Bulletin of the American Society for Information Science, 24(1), 28-29.

Taylor, A. & Clemson, P. (1998). Access to networked documents: Catalogs? Search engines? Both? Retrieved July 11, 2009, from http://worldcat.org/arcviewer/1/OCC/2003/07/21/0000003889/viewer/file9.html

Vine, R. (2004). Going beyond google for faster and smarter web searching. Teacher Librarian, 32(1), 19.

Vogelstein, F. (2009, August). Keyword: Monopoly. Wired, 58-59-65.