Review #4: Anatomy of Aggregate Collections
Since D-Lib seems to be the premier e-journal for digital library research, and especially metadata-related research, I have tried to make a habit of checking its website regularly. That is how I came across an article by Brian Lavoie, Lynn Silipigni Connaway and Lorcan Dempsey in which they analyze the Google Print collection using a traditional collection development model.
They answer four important questions. First, they seek to determine the collection's coverage of unique books. Second, they try to find out the distribution of languages represented. Third, they estimate the percentage of books that are out of copyright and therefore freely available. Finally, the most interesting question of all: how does the Google Print collection compare when the same experiment is done using five different libraries that more accurately reflect the typical libraries in North America? With this study, the authors develop an important model for determining the most effective digitization programs.
First, the article seeks to determine the amount of coverage provided by the Google 5, as Lavoie, Connaway and Dempsey term the 5 premier research university libraries that are contributing to the Google Print program. In doing so, they make several intriguing discoveries. To begin with, they find that duplication in collecting among these 5 libraries has been steadily decreasing over the past thirty years, at a rate of approximately 1-2% every five years. They also find that, out of a total of 32 million books cataloged in OCLC WorldCat, only 10.5 million unique books are covered in the Google Print system by the 5 contributing libraries. That is, 33% of the system-wide collection is represented in the Google Print program. They do, however, attach one caveat to this data: they have used the FRBR definitions of expression and manifestation to arrive at these numbers, so that 2 different imprints of a single title count as 2 different manifestations. Also, because duplication in collecting among these 5 libraries is steadily decreasing, current materials are more likely than older materials to be held uniquely within the 5.
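The 33% figure follows directly from the two counts quoted above. As a minimal sketch of that arithmetic (the function and variable names are my own, and the counts are the rounded figures from this review rather than exact values from the article):

```python
def coverage_share(covered: int, total: int) -> float:
    """Return the fraction of `total` that is represented by `covered`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return covered / total


# Rounded counts as quoted in this review (assumed, not exact).
WORLDCAT_BOOKS = 32_000_000   # books cataloged in OCLC WorldCat
GOOGLE5_BOOKS = 10_500_000    # unique books held by the Google 5

share = coverage_share(GOOGLE5_BOOKS, WORLDCAT_BOOKS)
print(f"Google 5 share of WorldCat: {share:.1%}")
```

Rounded to the nearest whole percent, this gives the 33% reported above.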
Second, the authors seek to determine how many languages are represented in the Google 5's combined collection. They find that, just as in OCLC WorldCat as a whole, just under fifty percent of this North American-centric collection is in English, while French, German, and Spanish materials make up 25% of what's left. That is, while there is an English-language bias, as one would expect from libraries in an English-speaking nation, a significant number of languages are still represented.
Third, the authors take up the question of how many of the books in the combined Google 5 collection are in or out of copyright. They determine that about 6.5% of the combined collection is out of copyright, and that approximately 70% of that portion is uniquely held by a single library. That is, the out-of-copyright materials, which can be made available immediately, are for the most part held uniquely by one of the individual collections.
The most intriguing question, for me, is how the Google 5 collection compares to a different hypothetical combined collection studied in the same way. For this comparison the authors use a small liberal arts college, a large American public university, a large Canadian public university and a large metropolitan American public library. They hoped to determine how coverage differs for a more typical sample of libraries. Since over 40% of the 32 million print books represented in WorldCat are held by only a single library, the authors sought to discover how coverage might be increased by digitizing collections from other libraries and not merely those of the 5 premier research universities.
They discovered that the most effective digitization effort would be to enlist a large proportion of the libraries in the OCLC WorldCat system in providing digital texts from their unique holdings, since this yields the highest percentage of unique titles. There were 5.6 million unique titles in the new collection, which roughly equalled 74% of the total collection, while only 58% of the Google 5 collection consists of unique titles.
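To put the two uniqueness rates side by side, here is a rough sketch that backs out the approximate size of the alternative collection from the figures quoted above. The total is my own derivation from the rounded 5.6 million and 74% figures, not a number reported in the article, and the names are hypothetical:

```python
# Rounded figures as quoted in this review (assumed, not exact).
ALT_UNIQUE_TITLES = 5_600_000   # unique titles in the alternative collection
ALT_UNIQUE_RATE = 0.74          # share of that collection that is unique
GOOGLE5_UNIQUE_RATE = 0.58      # share of the Google 5 collection that is unique

# If unique titles are 74% of the collection, the total is unique / 0.74.
implied_alt_total = ALT_UNIQUE_TITLES / ALT_UNIQUE_RATE

# Difference between the two uniqueness rates, in percentage points.
rate_gap_points = (ALT_UNIQUE_RATE - GOOGLE5_UNIQUE_RATE) * 100

print(f"Implied size of alternative collection: {implied_alt_total:,.0f}")
print(f"Uniqueness gap: {rate_gap_points:.0f} percentage points")
```

On these rounded inputs the alternative collection comes out at roughly 7.6 million items, with a uniqueness rate about 16 percentage points higher than that of the Google 5.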
Thus, the best model for digital libraries to implement is one of collaboration between different types of collections. Such collaboration would best serve the users of all the libraries involved by making the maximum amount of unique material available to everyone.