Monday, December 05, 2005

Technical metadata can be fun too!

I'm in the LETRS lab for the last Monday morning, working on the last project I'll do for this internship. Rather bittersweet, actually. I'll still have one hour left to fill on Thursday morning to round out to an even 180, but this is my last real protracted period of work.

This last assignment is to flesh out a processHistory XML template, started by my supervisor, for the technical metadata that the DLP is using. I'm supposed to create the process history information for encoding the digibeta file to MPEG, which in turn is the source for the streaming video that will eventually be put up online for end users to view. It's a useful assignment and a natural extension of what I was working on Thursday and Friday last week.
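
Since the template is local to the DLP and still in progress, all I can give here is a rough, hypothetical sketch of the kind of processHistory block I have in mind; every element name and value below is a placeholder for illustration, not the DLP's actual schema.

  <!-- Hypothetical sketch only: element names and values are illustrative
       placeholders, not the DLP's actual processHistory template. -->
  <processHistory>
    <process>
      <description>Encode Digital Betacam source tape to MPEG-2 file</description>
      <sourceFormat>Digital Betacam</sourceFormat>
      <outputFormat>MPEG-2</outputFormat>
      <processDate>2005-12-05</processDate>
      <agent>DLP media encoding workstation</agent>
    </process>
  </processHistory>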

I was given the task of reviewing the MIX metadata scheme that they plan on using for still-image technical metadata and writing up my recommendations for implementing it. Judging from the info they were previously collecting, I went with a light level of description. I included recommendations for image height, width, compressionLevel, targetID, encoding, resolution information, and that sort of thing. I don't think they want exhaustive Level 3 records for every image they create, especially since most of the images they amass at the DLP are scans of book pages: fairly simple images where the important information is the contrast between light and dark and how crisp the scan is. That's the argument I made, and I suppose it was the right one, since my supervisor just told me it looked great.
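
For the curious, the light level of description I recommended amounts to roughly the fragment below. This is only a sketch built from the element names in my notes; the exact container structure and spellings would have to be checked against the NISO Z39.87/MIX schema before anyone actually implements it.

  <!-- Sketch only: a light-level set of technical elements for a page scan.
       Element names follow my notes; actual MIX nesting and spellings need
       to be verified against the NISO Z39.87 / MIX schema. -->
  <mix>
    <imageWidth>2400</imageWidth>
    <imageHeight>3600</imageHeight>
    <compressionScheme>uncompressed</compressionScheme>
    <targetID>DLP-scanning-target-01</targetID>
    <samplingFrequencyUnit>inch</samplingFrequencyUnit>
    <xSamplingFrequency>400</xSamplingFrequency>
    <ySamplingFrequency>400</ySamplingFrequency>
  </mix>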

This new assignment should keep me busy until 5 at least. As with most metadata tasks, it won't necessarily be the data entry that is time-consuming but fixing the validation errors.

I'm looking forward to presenting on this wonderful adventure as an intern metadata librarian on Thursday night. I still haven't written up what I'm going to say, but I've got the general idea. I don't want to scare everybody with a screenful of angle brackets, so I might go low-tech and just present from note cards. Five minutes really isn't a very long time, and if I use PowerPoint for that I'll have all of five slides, if that.

That's it for now. Will post one last time on Thursday to wrap up this project.

Friday, December 02, 2005

Review #9: the DLESE model for digital library planning

In an article for D-Lib Magazine this month, Kim Kastens et al. present a model for framing the initial planning of a digital library initiative. Written as a Q&A, the article should be valuable to a digital librarian just beginning a new collection initiative or reworking an existing project to make it more effective and efficient.

There are two main themes to the questions in the article: knowing the audience and knowing the collection being digitized. Together these two themes define the need the project is being created to fulfill. While many of DLESE's (Digital Library for Earth System Education) answers will resemble those other organizations would give, the most interesting and compelling aspect of the article is not the answers but the questions themselves. These progress from the very general (what is the goal? who is your audience?) to the very specific (will there be sub-collections? what kind of metadata is necessary to fulfill the goal of the collection?). The core twelve questions are intended to focus the initiative on the collection itself and on providing the best level of access for the needs of the end users.

Still, one thing some readers might find surprising is the emphasis on evaluating resources for inclusion in the digital collection. For some collections this might not be as important as it was for DLESE, but it ought to be at least one of the concerns addressed during planning. Often, knowing how to evaluate a resource will drive the metadata creation process, since your evaluation criteria may mirror the end user's own (in some ways, though not all). Evaluation of the resource is also tied to cataloging workflows, as Kastens et al. argue. Adding evaluation of resources to the planning process for a digital initiative will recursively improve every other aspect, keeping the project focused on the users and the information needs they bring to the collection.

In the final third of the article, Kastens et al. shift from presenting a model for planning a digital initiative to discussing existing and future challenges that still need to be solved in the digital library field. It might be troubling for some readers to realize that many of these unsolved challenges are the same problems the field has wrestled with since the beginning of digital libraries. But given how young the field still is, it is not a source of dismay that these issues remain open: creating completely accessible interfaces and resources, mobilizing research communities to participate in digital library initiatives, and, most importantly, balancing end users' need for simplicity against library administrators' need for precise, rich metadata. The DLESE organization does not claim to have answers to these challenges, though one answer might lie in standardized metadata schemes and extensively collaborative environments that make the digital library program an integral part of the campus on which it resides.

See Kastens, Kim, et al. (2005). "Questions and challenges arising in building the collection for a digital library for education." D-Lib Magazine, 11(11). Last accessed at http://www.dlib.org/dlib/november05/kastens/11kastens.html on December 2, 2005.

Monday, November 28, 2005

The end is in sight. Or, is it?

As I approach the end of the semester and the end of this internship, I feel like my brain has been wrung out over a keyboard. I have stared at infinite angle brackets and deciphered countless needlessly abbreviated error messages, but for all the aggravations I feel as if I have learned an enormous amount in a very short period of time. The LETRS lab has become my home, for better or worse, this semester, and I've become quite unafraid of asking endless questions when I get stumped (which has happened a lot during this internship). My brain has been wrung out, yes, but in the process of that wringing it has absorbed an awful lot of new knowledge.

The last two projects I've done are proof of this. The Hohenberger stylesheet was a Sisyphean task of sorts, in the sense that it was interminable, not pointless. The United Nations/Nobel Peace Prize winners stylesheet was less so, although since that project is just barely at the lifting-off point for the DLP, it has been more an exercise in planning than anything else.

For the Hohenberger stylesheet, I learned at least two valuable lessons (though I'm sure I learned three or four more that I'm just not aware of yet). First, I learned the power and utility of the XPath toolkit. The trick for this project was learning to use recursive processing along the parent and child axes and to really grasp the concept of hierarchy. I last encountered this during a crash course in Java programming, and though I had forgotten most of it since, hierarchical inheritance is the crux of contemporary Web development. The other trick to completing this project was, of course, judicious use of xsl:if and xsl:for-each, the latter of which took some doing to fully grasp despite the prevalence of foreach loops in Perl.
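
To give a flavor of what I mean, here is a simplified, hypothetical fragment (not the actual Hohenberger stylesheet) that walks the components of an EAD file and emits a stub MODS record for each, using xsl:for-each and xsl:if:

  <!-- Simplified, hypothetical fragment: walk each EAD component description
       and emit a stub MODS record, using xsl:for-each and xsl:if. -->
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                  xmlns:mods="http://www.loc.gov/mods/v3">

    <xsl:template match="/ead/archdesc/dsc">
      <xsl:for-each select=".//did">
        <mods:mods>
          <mods:titleInfo>
            <mods:title><xsl:value-of select="unittitle"/></mods:title>
          </mods:titleInfo>
          <!-- Only emit a date when the component actually has one. -->
          <xsl:if test="unitdate">
            <mods:originInfo>
              <mods:dateCreated><xsl:value-of select="unitdate"/></mods:dateCreated>
            </mods:originInfo>
          </xsl:if>
        </mods:mods>
      </xsl:for-each>
    </xsl:template>

  </xsl:stylesheet>
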
The primary problem of this project has been the sheer size of the file: it contains over 8,000 records, which translate to even more individual items. Since MODS, the scheme I'm mapping the original EAD records into, is designed to present individual records for complete items, the final output is quite large. I never quite believed the ancient geek lament for more power until last week, when I watched in dismay as my trusty Dell Optiplex workstation was brought to its metaphorical knees by the combined demands of my stylesheet and the original EAD file.
Related to the size issue, it has been nearly impossible to examine the complete file carefully, so I initially missed a whole slew of records buried in the middle of the file that were set off as sub-subgroups of some record groups but not of others. This is both the problem and the blessing of EAD: hierarchical levels of description give the archivist flexibility but make the job of the developer transforming those records into a new standard much more difficult. I also managed to miss the subject elements that appear in a handful of records in the EAD file. Creating a transformation for these was itself an issue because of the nature of recursive processing, which continually goes up and down throughout the file; at first it was retrieving every subject element in the entire file and placing them all in each MODS record output. This was solved by the wonderful little tool <xsl:if>.
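
The fix, roughly, was to select subjects with a path relative to the current component rather than an absolute path over the whole document, guarded by xsl:if. Something like this simplified, hypothetical snippet:

  <!-- Hypothetical, simplified: pull subjects only from the current
       component, not from the whole EAD file, and emit nothing when
       none exist. -->
  <xsl:if test="controlaccess/subject">
    <xsl:for-each select="controlaccess/subject">
      <mods:subject>
        <mods:topic><xsl:value-of select="."/></mods:topic>
      </mods:subject>
    </xsl:for-each>
  </xsl:if>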

Recursive processing was not an issue for the United Nations/Nobel Peace Prize winners project, since the input metadata came from Excel spreadsheets, which import nicely into XML as small, neat packages of whole and complete metadata per row. The problems with this project relate instead to data quality and the subjective use of elements. The individuals who created these Excel files (35 workbooks and about 60 worksheets) made them more or less human-readable with seemingly little thought toward machine-readability, so before they could be made into XML files some fiddling and roundabout steps were required to tidy up the data (removing extraneous spaces and titles). The other issue has been that while most of the elements in the original data seem required, there are a few optional elements. Since I'm not completely certain what each element is there for, or what it says about the items being described, I had to create a series of <xsl:if> statements and <xsl:choose>/<xsl:when> branches, along with many assumptions about documents versus images, titles versus abstracts, and what constitutes technical information. In all, though, this second project was a treat to do after the slog that was transforming the Hohenberger EAD records into MODS.
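
The document-versus-image guesswork, for instance, ends up looking roughly like the hypothetical fragment below; the Format field is a simplified stand-in for whatever column heading the Excel export actually produces:

  <!-- Hypothetical fragment: branch on a simplified "Format" field from the
       Excel export to decide the MODS resource type. -->
  <xsl:choose>
    <xsl:when test="contains(Format, 'photograph') or contains(Format, 'image')">
      <mods:typeOfResource>still image</mods:typeOfResource>
    </xsl:when>
    <xsl:when test="contains(Format, 'letter') or contains(Format, 'document')">
      <mods:typeOfResource>text</mods:typeOfResource>
    </xsl:when>
    <xsl:otherwise>
      <mods:typeOfResource>mixed material</mods:typeOfResource>
    </xsl:otherwise>
  </xsl:choose>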

Review #8: Dennis Meissner on finding aid implementation for EAD

In his 1997 article for American Archivist, Dennis Meissner presents a compelling argument for caution regarding legacy data when converting to a new standard. He discusses the situation the Minnesota Historical Society faced when converting to EAD and how it took advantage of the opportunity to become even more customer-focused than before.
Meissner explains how, in the process of converting their finding-aids to EAD, the MHS faced two problems. First, they discovered that many of the elements in their legacy finding-aids did not fit neatly into EAD's logical organization of those elements. He cites the identifier number, which the MHS had traditionally treated as a string of numbers at the bottom of every finding-aid page. Second, they discovered that the actual structure of their finding-aids was woefully difficult to understand, since archival expertise and jargon had gone into creating them. Rather than simply convert these problematic finding-aids into EAD and forget about them, Meissner and his colleagues took the opportunity to "reengineer" them, hence the article's title, "First things first: reengineering finding-aids for implementation of EAD."

For Meissner, the primary problem with traditional finding-aids is their inherent bias toward users of the physical archives, while EAD finding-aids are meant to serve remote users as well as the patrons who actually visit. With the traditional finding-aid, any problems of understanding could be remedied through user education, and while attempts had been made to carry that education over to the Web, there was very little evidence they were successful. With the conversion to EAD, the MHS decided to make the finding-aids transparent and readable enough that a remote user could quickly retrieve the information he or she needed from the document on the computer screen.

To do this, the MHS adopted a customer-centered approach to the creation of the finding-aid that took precedence over the traditional archivist-centered approach. With this as their vision, Meissner and his colleagues used the structure of the EAD document to create an HTML page from the EAD records that would allow the remote user, as well as the physically present patron, to quickly read and interpret the finding-aids. Foremost, they presented the information from the general (name of the institution and logo) to the specific (item-level descriptions of the collection being described, with the administrative information tightly packed into a single part of the document).

Meissner argues that EAD conversion provided the impetus for making cleaner, more transparent finding-aids, and he implies that without this conversion such an exercise (onerous as it was) would not have been pursued. The article carries a useful message for metadata specialists: the ultimate goal of metadata is the output, and before creating new metadata records it is important to have a vision of what one wants to present to the user through that metadata and, most importantly, how that output is going to aid the user in his or her search for information.

Review #7: Janice Ruth on EAD development

As part of my archives and management class, I've been researching Encoded Archival Description (EAD), since it seems to be a very significant part of the metadata world. Back in 1997, shortly after EAD was officially released to the archival community, the American Archivist did a special two-part issue on this most complicated and most useful scheme. While most of the articles are boringly technical (detailed descriptions of what the <frontmatter> element versus the <eadheader> element is and is supposed to do), a couple are extremely interesting and bring up compelling points about just why EAD ought to, by all rights, take the archival community by storm. Here is where it becomes funny for me, reading these arguments eight years after EAD has taken archives, shaken them up, and sparked countless debates about the actual usefulness of this not-so-humble metadata scheme.

Janice Ruth verges on the boringly technical as she discusses the finer points of EAD's development and why the working group made the decisions it did. She begins by arguing the merits of SGML as an open, nonproprietary "technique for defining and expressing the logical structure of documents." Ruth then goes on to discuss the more significant decisions of the EAD working group and attempts to show why the EAD DTD is ideally suited to archival use.
At the beginning of her article, Ruth presents an overview of SGML. She explains that the working group created EAD to reflect the content, not the structure, of traditional finding-aids: local practices vary widely in how finding-aids actually look, and SGML is better suited to establishing logical context than to prescribing the strict physical structure (the look and feel) of a document. That the SGML DTD specifies where and when a particular element may appear in the EAD record is not physical structure but logical context: certain bits of information belong in one element and not another so that the machine can process them. Finally, she explains that one powerful value of SGML is the ability to create attributes that provide further machine-readable granularity. As a broad summary of the technical merits of SGML, this page and a half is an invaluable justification for the EAD metadata scheme over, say, a legacy Access database or, worse still, traditional paper finding-aids.

With this knowledge in hand, Ruth goes on to justify the decisions that the EAD working group made. First, she defends the decision to minimize the number of elements created: very early on it was decided not to create an element for every structural choice that might appear in various local practices, and instead to allow for local practice with the <odd> and <add> elements (Other Descriptive Data and Adjunct Descriptive Data, respectively). Each has its own use: <odd> provides a space for local-practice data, while <add> allows an archivist to present further information as he or she sees fit, enclosing a series of <p> tags within this poncho-style element. The other major decision, according to Ruth's article, was to use generic terms rather than more specific vocabulary that may be familiar in some institutions but not in others.

When she gets to the discussion of hierarchy in EAD, Ruth reaches the heart and soul of its value for the archival community, which is probably why she spends the bulk of her essay on this topic. The EAD DTD is divided into the <eadheader>, the <frontmatter> (with its <titlepage>), and the <archdesc> elements. Each encloses information relevant to the overall collection of records, but for hierarchy <archdesc> is the most important. Using the <dsc> and <did> elements, according to Ruth, the EAD working group allowed for nearly infinite subordinate components (record series, record groups, subgroups, sub-subgroups, and so on) to be presented in a single EAD file, thereby reflecting the hierarchical and non-item-specific nature of an archival repository. In this way, the intellectual arrangement of the finding-aids is reflected in the EAD DTD but not the physical structure of the documents themselves; stylesheets can be used to transform the EAD records into human-readable HTML documents, so EAD serves as an intermediary, like most metadata records in the digital era, rather than the final product presented to the user, as metadata was in the pre-digital era.
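
A bare-bones illustration of that nesting (my own toy example, not one of Ruth's samples) might look like this:

  <!-- Toy example of EAD hierarchy, not taken from Ruth's article: a series
       containing a subseries containing a file-level component. -->
  <archdesc level="collection">
    <did><unittitle>Example Manuscript Collection</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="subseries">
          <did><unittitle>Outgoing letters, 1920-1930</unittitle></did>
          <c03 level="file">
            <did><unittitle>Letters to family, 1925</unittitle></did>
          </c03>
        </c02>
      </c01>
    </dsc>
  </archdesc>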

Ruth spends the rest of the article presenting a broad overview (with samples) of the EAD record itself. Among the highlights is an example of the hierarchical work of the <dsc> element, which allows for varying levels of description, from record series to group to subgroup to item, with the lower levels inheriting from the higher ones. She concludes with a summary of the history of EAD development, which from the beginning has been driven by the needs of the archival community in order to benefit its users.

The article is an interesting glimpse, for me, into the development of a major metadata scheme, as well as a revealing look at SGML, of which I know very little. It is helpful to see how and why certain decisions were made, from an individual who was present when they happened. It is also helpful to see that many decisions were made not for short-term convenience but for the long-term quality and ease of use of the finished products. The EAD DTD was developed to facilitate intellectual access to all archival collections, and while this is an ambitious, long-term project, it is a worthy goal to pursue; for that reason the DTD was developed with a mind toward making the highest-quality finding-aids possible for the digital age. It has its problems, as many have pointed out, but EAD is an admirable if complicated metadata scheme.

Friday, November 18, 2005

musing on what it's all about...

My supervisor posted a very interesting question to the metadata librarians listserv that generated a good deal of discussion for a relatively quiet list. The question, what does metadata do for the whole information industry, and more importantly, what does a metadata specialist do for metadata and for that industry, got me thinking. Metadata is a relatively new word for a very old concept, information that structures still other information, but it is doing something that has never been done before, by necessity, in a digital world of exploding formats and volumes of information. I put some thought to the question, and this is what I came up with.

Metadata is digital information. It is more than merely scattered bits of data, since if it were scattered it would be of little use to anybody or anything. It is more than a series of fields pulled from multiple tables in a relational database, yet it is less than content. Rather, metadata is more like grammar and syntax: it is the structure for the digital sentence. This by itself might be simple enough, but digital sentences reflect our increasingly specialized era, and mere periods, question marks, and semicolons are no longer enough to create the kind of structure that different information communities need. Yes, there is still a place for the plain but reliable period or the awkward yet useful comma, and that is why we have Dublin Core, but there is also a need for more specialized marks: the preferred citation form for a manuscript collection, the instrument for a piece of sheet music, the location on the network, or the point in a streaming video file where a resource begins. All of this, both traditional syntax and newer, more specific, more granular structural elements, is metadata, and it makes sense of all those digital sentences floating out there that would otherwise be meaningless, unfindable, and of no use to anybody, much less to the individuals in the information communities that need them.

A metadata specialist, therefore, has at least two primary purposes in the organization in which he or she works. First, and most significant, is to know almost all of the schemes (nobody can know every metadata scheme or digital sentence structure) and how to use them to produce effective and findable information. Knowing how to use them most effectively is not a separate purpose, because without it, knowing what the different digital sentence structures are is of little use at all. Yet uncovering the most effective uses of these structures is more than merely reading a set of documentation about a particular structure; it is very much problem-solving: knowing what to use when and how to use it. Second, the metadata specialist is an ambassador for better digital sentence structure. These structures are complex, their meanings and reasoning obscure to most except those trained to decipher them, and often the other information professionals, as well as the professionals who create the information in the first place, need convincing before they grasp the full scope and value of the digital sentence structure to themselves and to others. A metadata specialist therefore walks in two worlds: the world of code, angle brackets, and machine-readability, and the world of human emotions, insecurities, and anxieties. Being able to switch back and forth between those worlds is essential to being a good metadata diplomat.

The problem metadata specialists encounter is that most people are not this way, or at least they don't think of themselves this way. They are either "people persons" or they are "gearheads," and ne'er the two shall meet. This is the way we have learned to think about the world and the people in it. The metadata specialist is out there beating the bushes, trying to draw everybody into the new digital reality, and the truth that this dichotomy is non-existent, that we are all both emotional and intuitive as well as computer-literate and computer-minded, sometimes scares people.

The trick over the next few years will be assuaging that fear and showing it to be baseless anxiety. I'm still not certain how I will do that; I only know that this is the root of the resistance metadata specialists encounter.

Monday, November 14, 2005

Baby steps towards knowing XSLT even better...

This morning I worked on an XSLT stylesheet to transform a slew of Excel spreadsheets into a collection of MODS records. This itself is no difficult task; transforming anything into MODS is practically second nature at this point in the semester, and I think that turning Excel spreadsheets into something as simple as MODS ought to be a couple-of-hours-on-a-Tuesday-afternoon job. In this case, however, there are a couple of complications that will slow things down a bit.

The most important is the state of the data that has been received. Across 35 individual spreadsheets for approximately that many Nobel Peace Prize winners (plus a couple of random collections of photographs just to spice things up), there are six different description schemes, with some interchangeable elements and some elements unique to each scheme. The other, much more problematic obstacle to easy transformation is that while the spreadsheets are human-readable (with all sorts of nifty spacing, padding, and big bold titles), all of that human-readability makes them nearly impossible for a machine to read. This adds an extra step: I have to go through all 35 Excel workbooks, break them up into separate worksheets, and eliminate all those extra spaces that mess with the XML export.

I suppose this is the big lesson of this project: I ought to just expect messy data. The second lesson is that messy data is why things like <xsl:if> and <xsl:choose> exist. Of course, there are myriad other things to do with these wonderfully versatile tools (the hammers of the XSL language), but for my purposes they exist to deal with messy original data.
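
As a small example of what I mean, a guard like the following keeps padding rows and blank cells from turning into empty MODS elements; the Row and Title names here are hypothetical stand-ins for whatever the Excel XML export actually produces:

  <!-- Hypothetical guard: skip rows whose Title cell is empty or only
       spaces, so padding rows from the Excel export don't become empty
       records. -->
  <xsl:for-each select="Row">
    <xsl:if test="normalize-space(Title) != ''">
      <mods:titleInfo>
        <mods:title><xsl:value-of select="normalize-space(Title)"/></mods:title>
      </mods:titleInfo>
    </xsl:if>
  </xsl:for-each>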

Thursday, November 10, 2005

METS: useful but no fun at all

Since my supervisor is away at a conference learning all sorts of interesting stuff this week, I'm my own boss with a list of projects to work on. One of these projects was to learn about METS, a highly structured metadata scheme used not so much for descriptive metadata as for the structural and technical metadata that runs the pageturner application, which lets a user read a digitized text online in much the same way (page by page) that he or she would read a print book.

I've created the XML record in METS that I was instructed to create. It's...interesting. I can immediately see how it's useful. Anything that will tie individual URLs together to form a cohesive whole, like a book, for example, or a series, is an extremely valuable tool for a digital library to have. However, my supervisor was right: it's no fun to create a METS document. It's a matter of entering URI references over and over again and building mapped tables of contents for the document or book that METS puts back together into a single unified whole, instead of the scattered collection of image files the digital document would be without it. The true power of METS is its ability to create a human-readable table of contents and a machine-readable table of contents simultaneously. METS is truly a wrapper scheme, meant to coordinate various other metadata schemes within a tidy application that ties everything together so that the general public never knows or sees the complications going on behind the scenes. Truly, METS is the wizard behind the curtain of the metadata world.
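
To make that concrete, here is a stripped-down sketch of the kind of METS structure I mean, with made-up file identifiers and paths, pairing a fileSec of page images with a structMap that orders them:

  <!-- Stripped-down METS sketch: IDs and paths are made up for illustration. -->
  <mets xmlns="http://www.loc.gov/METS/"
        xmlns:xlink="http://www.w3.org/1999/xlink">
    <fileSec>
      <fileGrp USE="master">
        <file ID="PAGE001" MIMETYPE="image/tiff">
          <FLocat LOCTYPE="URL" xlink:href="http://example.edu/scans/page001.tif"/>
        </file>
        <file ID="PAGE002" MIMETYPE="image/tiff">
          <FLocat LOCTYPE="URL" xlink:href="http://example.edu/scans/page002.tif"/>
        </file>
      </fileGrp>
    </fileSec>
    <structMap TYPE="physical">
      <div TYPE="book" LABEL="Example digitized book">
        <div TYPE="page" ORDER="1" LABEL="Page 1">
          <fptr FILEID="PAGE001"/>
        </div>
        <div TYPE="page" ORDER="2" LABEL="Page 2">
          <fptr FILEID="PAGE002"/>
        </div>
      </div>
    </structMap>
  </mets>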

Saturday, November 05, 2005

Review #6: Peter Morville on Findability

In a recent article for Information Today, Peter Morville, who has always been a lightning rod for bleeding-edge information and library science theories and ideas, makes a controversial argument for a new theory of these two joint areas of scholarship and industry. He proposes that rather than focusing only on organizing knowledge for users who approach it through the traditional or non-traditional venues that we as librarians have come to expect and build architectures around, we should focus on making information findable through the variety of paths and directions that we cannot possibly predict or hope to index in total.

Mr. Morville offers two examples as evidence of the need for this new model of information and library science. First, he cites the continuing use of popular search engines like Google, which wreak havoc on our carefully constructed information architecture by ripping apart our web sites and caching individual pages that users then retrieve through keyword-relevancy searches. This causes difficulties for users who click on dynamically created links to pages that, once reached, present little or no information about where the information comes from, or information only peripherally relevant to their search. It also wreaks havoc with the concept of authoritative information: users find pages on sites developed by commercial interests or less savory individuals distributing incorrect or outright false information, rather than pages on sites developed by reputable, distinguished organizations presenting carefully researched, proven, well-written information of high relevance to the user's actual information need. The popular search engines, even as they make information easily retrievable by all people regardless of information-retrieval skill, also disrupt the traditional notions of information authority on which our concepts of information architecture, cataloging, and library access are built. The users of the Web are instead left with a hodge-podge of files containing bits and pieces of information that may or may not be helpful to them.

According to Mr. Morville, it is the duty of information and library professionals to adapt: to bring our historically proven models of authority, information organization, and architecture into the Web search realm by optimizing our architecture for random entrances to pages deep within a site, not merely for users coming in through the front door or home page, while also optimizing our pages for keyword relevance and link algorithms like Google's.

To do this, Mr. Morville proposes a model based on three questions that any information architect needs to ask before beginning development of a site. First: can the user find the website? Second: can the user navigate the site? And finally: can the user find the content despite the site? While all three combine search engine optimization with the traditional information architecture and organization on which library and information science is built, the third is the most significant for the new era of findability that Mr. Morville is proposing. He argues that by developing web sites whose content is easily and intuitively available to users seeking it out with a variety of keywords and terms, we can bring our notions of information authority, organization, and access into the age of the Web search engine.