ACLS Humanities E-Book XML Conversion Experiment:
Report on Workflow, Costs, and User Preferences

ACLS Humanities E-Book
White Paper No. 2

By Nina Gielen
Editor for Digital Content and Production, ACLS Humanities E-Book
Contact: ngielen@hebook.org

DATE: February 11, 2009


Contents

Executive Summary

I. Introduction
A. Purpose of and target audience for this report
B. Brief overview of ACLS Humanities E-Book
C. HEB's two e-book formats

II. XML Backlist Conversion Experiment
A. Description of experiment
B. Initial considerations
C. Production process
D. Addressing unforeseen issues
E. Hits received by XML versus Page-Image titles
F. Production cost analysis

III. User Survey
A. Description of survey
B. Outcome and implementation of results

IV. Desirability and Viability of Extensive XML Conversion
A. Practical considerations for undertaking mass conversion
B. Future formats for digital publishing

APPENDIX: Survey Results

Notes

Illustrations
Figure 1: ACLS Humanities E-Book home page
Figure 2: Table of contents for frontlist XML title, with only lowest-level subdivisions in each section accessible
Figure 3: Table of contents for backlist XML title, with all section levels accessible
Figure 4: Tables in Newman, Voice of the Living Light, page-image version
Figure 5: The same tables in the XML version of Newman, Voice of the Living Light
Figure 6: First page of HEB survey
Figure 7: Example of a print page number in an XML title (see circled text)
Figure 8: Example of pull-down menu for page-image title viewing options (see circled text)



Executive Summary

In 2008, ACLS Humanities E-Book (HEB)—a subscription-based online collection of over 2,200 digital titles in the humanities—undertook an experiment to investigate the possibility of a future mass conversion of e-books preexisting in a scanned, page-image format into XML-encoded files. HEB hypothesized that such a conversion, if doable, would create an overall more user-friendly collection, as the resulting books would be text-based, allowing for easier access and export of contents, and would include interactive features such as links and enlargeable figures.

HEB had 20 sample page-image titles from its backlist converted to XML, using OCR-derived text files that had been created during the initial scanning process to enable searching. The books were tagged using a simplified version of HEB's standard specifications, to reduce the need for editorial intervention. The results included a significant number of errors due to imperfect OCR and other factors. Several titles were proofread for this reason, but in the end HEB found that such errors were impossible to resolve without devoting additional staff time and handwork to the effort. The cost of creating the XML titles was considerably greater than that associated with scanning (about $400 versus $170 per title).

The XML books were presented in the HEB collection side by side with their page-image counterparts. Despite any conversion-related flaws, our subsequent user survey indicated that readers preferred the XML format by a margin of about two to one, the most relevant factors cited in this regard being readability, accessible text, and additional features and functions not available in the page-image version.

In addition to cost-related concerns, HEB's stance is that, before retroactively converting its entire backlist, a study of alternate digital formatting options is needed, in conjunction with an assessment of further possible uses and applications for our titles. Our preliminary conclusions are as follows. Due to its extensibility and flexibility, XML is still well suited to web-based presentation, and therefore may well remain viable for use with online archival collections and extensive research-oriented projects. For non-web-based applications such as handheld electronic readers, however, a number of alternate encoded-text formats, among them ePub as well as certain proprietary formats, are likely to predominate in the future.


I. Introduction

A. Purpose of and target audience for this report

This report is largely derived from an experiment in digital publishing formats recently undertaken by ACLS Humanities E-Book (HEB). It will describe the parameters and outcome of this experiment, results of a survey HEB conducted upon its completion, and some conclusions regarding the viability of XML and various other e-book formats drawn from these. We hope our findings will be of interest to other publishing entities involved (or considering involvement) in electronic publishing, funding organizations interested in supporting e-publishing ventures, and librarians trying to determine how to best serve their respective institutions through use of digital resources.

B. Brief overview of ACLS Humanities E-Book

ACLS Humanities E-Book (see figure 1) is a subscription-based, online collection of over 2,200 digital titles in the humanities spanning various disciplines, including history (all areas), women's studies, methods and theory, science and technology, literature, and the arts. It also offers several discrete series (Gutenberg-e, the John Harvard Library, Records of Civilization, the American Historical Association's Guide to Historical Literature, the Catalogus Translationum et Commentariorum: Mediaeval and Renaissance Latin Translations and Commentaries, monographs published by the College Art Association, and the Collected Writings of Walt Whitman).[1] In order to select and obtain these titles, HEB collaborates with 14 learned societies and about 100 contributing publishers. The books are hosted and disseminated by the Scholarly Publishing Office (SPO) of the University of Michigan Library, which also provides extensive technical support.

Figure 1: ACLS Humanities E-Book home page

HEB's mission is to further digital scholarship by providing publishers with an incentive and a forum for electronic publication, in addition to encouraging libraries and other institutions to offer access to these high-quality works in the humanities for reference and research within their own collections. HEB is a not-for-profit venture, funded upon inception in 1999 by grants from The Andrew W. Mellon and Gladys Krieble Delmas Foundations, but self-sustaining since 2005 through income from subscriptions and print-on-demand sales.

C. HEB's two e-book formats

The HEB collection includes two distinct e-book formats. Most of our books (about 97%) are presented as scanned page images and thus represent an exact replica of the print book from which they originate. Underlying each image is an OCR (optical character recognition)-derived text document, which allows users to search these titles. This technology is similar to that long employed by JSTOR for offering its digitized archive of journal articles.

The collection also includes several dozen text-encoded titles, tagged using HEB's own specifications for XML (extensible markup language) encoding. The books are dynamically transformed into HTML for web delivery, and this is the format users see when they access the title. These titles differ noticeably in appearance from the print version (some are in fact born digital and thus lack a print counterpart altogether) and allow for the inclusion of links and other interactive features that aren't possible for the scanned books.[2]

Within our browse lists and search interfaces, we use different icons to designate each type of e-book: a red circle with a stylized book image for page-image titles, and a blue circle with a computer for XML titles.


II. XML Backlist Conversion Experiment

A. Description of experiment

In 2007 HEB decided to undertake an experiment to determine whether it might be desirable and feasible in the future to offer a greater number of titles in XML. There seemed to be a number of potential benefits to this. While our scanned books are relatively static, the encoded titles are able to make use of the same interactive elements that users are accustomed to finding all over the web: references can be hyperlinked, images can be enlarged, supplemental audio and video files can be offered alongside the text, and the text itself can be copied and pasted. In addition, HEB had been approached a number of times over the years by subscribers who pointed out that, as a text-based format, XML was more advantageous for visually impaired users employing electronic screen readers. Bearing all this in mind, we also had some reservations about the prospect of XML conversion on a large scale. For one thing, this process normally requires a higher degree of editorial input to establish how certain elements are to be handled. But in contrast to our regular frontlist XML titles, which are developed in consultation with the originating presses, no actual production editor would be involved this time.[3] Since a number of different files need to be assembled to make up the complete book (the properly tagged XML file, figure entities, and any other media) there is also generally a higher cost associated with XML conversion. For these reasons it seemed essential to find a way of gathering additional data on whether or not such a project was actually worth undertaking.

We picked 20 titles that were already available in the HEB collection as page-image books in order to retroactively convert them into encoded text, with the intention of offering the two versions side by side for direct comparison. Nine of the titles were originally published by the University of California Press and four by Cambridge University Press (both presses had been asked to participate in the selection process). The remaining seven titles were selected by HEB in accordance with factors such as diversity of formatting and layout, number of images, use of various non-text elements (such as musical scores), and inclusion of special characters, to test how all this would translate to the encoded-text version. Once the XML versions had been released into production, HEB planned to carry out a reader survey, asking users to access and evaluate each title in its two different incarnations. The idea was to complete the conversion process by early 2008 and conduct the survey for the duration of several months, through the spring and into summer, giving participants ample time to test the books and log in their responses.

B. Initial considerations

One of the first questions we needed to address before getting started was who would handle the actual production and to what extent this could be incorporated into the preexisting workflow at SPO. We consulted with SPO director Maria Bonn to ascertain whether it seemed practicable for their team to take over this process entirely, which is in effect how the regular page-image backlist titles are processed: rights and physical copies are secured by HEB staff at our New York office, after which the books are shipped to the University of Michigan, where SPO has them scanned and performs any other tasks associated with preparing them for upload. Eventually it was determined that it made more sense for HEB to communicate directly with the vendor, in order to address any editorial queries that might come up along the way. HEB elected to work with Aptara, a data-conversion service provider with whom SPO had an established relationship. In addition, Aptara was already somewhat familiar with the collection through their previous work on the backlist and was competitively priced.

A second question that immediately presented itself was how to obtain the text that was to be tagged. The two possibilities were a.) using the preexisting OCR files generated for the purpose of searching the page-image titles, as described above, or b.) having each book rekeyed in its entirety (using a double-key process to minimize errors). The OCR-derived text files had the advantage of already including minimal tagging that could be built on for XML encoding. However, we knew they were on average considered to be only 99.99% accurate (with 19th-century titles slightly less so, about 99%)—enough for performing searches, but meaning the books would ultimately require greater scrutiny to correct inevitable errors. Double-keying, on the other hand, would result in fairly reliable output that could also include additional structural and formatting cues for deeper tagging purposes. After obtaining quotes for this process, we decided that the expense of rekeying was probably greater than that associated with the more extensive quality control potentially required if we chose to work with the existing text files.[4] We therefore went with what we believed was the more practical and cost-effective choice of using the OCR texts already on file at SPO.[5]
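To give a sense of scale for these accuracy figures, the following is a minimal sketch that converts an accuracy rate into an approximate error count. It assumes the rates are per-character and uses a hypothetical book length of 600,000 characters; the report itself gives neither detail.

```python
def expected_errors(char_count: int, accuracy: float) -> int:
    """Estimate the number of misread characters at a given per-character accuracy."""
    return round(char_count * (1.0 - accuracy))


# Hypothetical book length (assumption; the report gives no character counts).
BOOK_CHARS = 600_000

typical_title = expected_errors(BOOK_CHARS, 0.9999)  # accuracy quoted for most titles
older_title = expected_errors(BOOK_CHARS, 0.99)      # accuracy quoted for 19th-century titles

print(f"Typical title: about {typical_title} character errors")
print(f"19th-century title: about {older_title} character errors")
```

Under these assumptions, even a very high accuracy rate leaves dozens of errors scattered through a single book, which is consistent with the need for closer scrutiny described above.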

Due to the aforementioned lack of press oversight for this project, it made sense to streamline the conversion process by applying a pared-down version of the DTD (Document Type Definition) and specifications normally used for tagging HEB's XML titles. Normally, two versions of each title are created: a "master" XML that's more deeply tagged for future use outside of HEB, and the slightly simplified version submitted for upload to SPO, conforming to an alternate DTD. In this case there was no master, only the simplified version intended for online display. These no-frills XML titles wouldn't include any elements that did not preexist in print or could not be readily derived from the page-image / print version (e.g., a list of illustrations would be included only if its counterpart already existed in print, whereas normal HEB protocol would call for its addition to any XML book with illustrations)—nor would any unnecessary formatting changes be performed, regardless of whether or not the titles matched the collection's predominant style—again, all deliberate omissions to circumvent the need for significant editorial intervention.

Probably the most crucial deviation from HEB's regular XML protocol involved the books' structure: XML titles normally suppress all higher-level "container" sections, so that users always access only the smallest available text chunk in each overarching section. For example, if a chapter is divided into a number of subsections, the chapter head would be displayed in the table of contents, but only the subsection heads would include active links for pulling up the text chunk in question. (See figure 2.) It then often becomes necessary to create additional introductory sections (for which a header must also be created) that don't exist in the print version for the opening paragraphs of suppressed container sections, which could otherwise not be accessed by the reader. Since this seemed difficult to accomplish without case-by-case examination, for this set of titles, we would simply make all section levels accessible. (See figure 3.) This would affect the process of tallying hits for these titles—something needed in order to calculate royalties for publishers and usage statistics for libraries—as users could now potentially read an entire book by accessing only a small number of chapter-level sections (which in turn would generate fewer hits than reading the page-image version). However, the schema for tallying hits in XML titles had always been a formula-based approximation of print pages accessed, and by making an adjustment to this formula (applying an average of fewest possible hits and largest possible number of hits) we hoped to account for the new system of access.[6]
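The averaging adjustment can be illustrated with a short sketch. This is only one possible reading of the approach described above, not HEB's actual formula, which is not given in this report. It assumes that the fewest possible hits correspond to opening only top-level sections, that the largest number corresponds to opening every section at every level, and that print pages are then spread evenly over the averaged figure. The `Section` class and all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """A node in a book's table of contents (hypothetical structure)."""
    title: str
    children: List["Section"] = field(default_factory=list)


def count_all(sections: List[Section]) -> int:
    """Count every accessible section at every level."""
    return sum(1 + count_all(s.children) for s in sections)


def averaged_section_hits(top_level: List[Section]) -> float:
    """Average of the fewest and the largest number of section hits (assumed reading)."""
    fewest = len(top_level)       # reader opens only top-level sections
    most = count_all(top_level)   # reader opens every section at every level
    return (fewest + most) / 2


def pages_per_hit(print_pages: int, top_level: List[Section]) -> float:
    """Print pages credited per section hit under the averaged assumption."""
    return print_pages / averaged_section_hits(top_level)


# Hypothetical 300-page book: 10 chapters, each with 4 subsections.
book = [
    Section(f"Chapter {c}", [Section(f"{c}.{s}") for s in range(1, 5)])
    for c in range(1, 11)
]
print(averaged_section_hits(book))  # (10 + 50) / 2
print(pages_per_hit(300, book))
```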

Figure 2: Table of contents for frontlist XML title, with only lowest-level subdivisions in each section accessible

Figure 3: Table of contents for backlist XML title, with all section levels accessible

Finally, there were cataloging and display concerns to be addressed with these books. Readers needed to be able to differentiate between the two formats, which was ensured within the collection's browse and search interfaces by means of the icons described above. But we also had to make sure the two electronic editions would be adequately differentiated for cataloging and registration purposes. In consultation with our MARC (machine-readable cataloging) specialist, David Richtmyer, we decided to simply add one consistent statement to the copyright page for each title ("ACLS Humanities E-Book XML edition 2008"), which would otherwise feature the same copy as the page-image books, even though some of this copy might pertain specifically to the print edition.

C. Production process

(The following two sections, C. and D., contain detailed production-related information of interest primarily to publishing professionals. Other readers may wish to move on to section E, "Hits received by XML versus Page-Image titles.")

Before commencing production, the various materials needed for each title had to be assembled and prepared. The OCR-derived text would be transmitted to the vendor by SPO for further encoding. While each page of every title already existed as a scan, HEB wanted to apply "error diffusion," a higher-quality scanning procedure, to figures for this set of titles, since we would be offering our standard XML image-enlargement option for most. This meant HEB had to go over each physical copy again and flag the pages in question. We also flagged line art, including elements such as musical scores and complex tables that could not easily be presented as encoded text, with the understanding that it would ultimately be necessary for the vendor to refer back to the print books in order to insert the figures in the right order. All figures thus marked were scanned by the vendor as needed. They were then cropped to trim away any text, rotated for correct orientation, and the final entities renamed by SPO in keeping with HEB's standard naming protocol, under the guidance of Digital Projects Librarian (and SPO's HEB project manager) Terri Geitgey. As the final component, HEB renamed the preexisting book-cover files from the set of page-image titles already online in accordance with the new encoded set's ID numbers, so that cover images would be available for each version.

HEB sent the 20 flagged books to the vendor's Ohio facilities, from which they were subsequently shipped to the actual hands-on production team in Delhi, India. SPO sent CDs with the digital text files and images via the same route. Next, HEB followed up with a set of instructions specific to these titles, as well as sending the more reductive DTD referenced above against which to parse the XML output,[7] along with a tagging sample. Finally, we provided a list of URLs and access info for the 20 page-image counterparts already available in the collection, to be used for additional reference as the vendor saw fit.

HEB asked for one particular title to be encoded and reviewed as a test case before proceeding with the remaining 19. This title, Voice of the Living Light: Hildegard of Bingen and Her World by Barbara Newman,[8] was a collection of essays and contained several of the aforementioned elements whose effective conversion we were eager to test: tables, bylines, line art, et cetera. (For a sample of how these were rendered, compare figures 4 and 5.)

Figure 4: Tables in Newman, Voice of the Living Light, page-image version

Figure 5: The same tables in the XML version of Newman, Voice of the Living Light

D. Addressing unforeseen issues

HEB had assumed that most conversion steps we requested could be performed programmatically, or, if this were not the case, that we would be notified that certain tasks could not be performed as requested at all. This, in turn, would help us establish guidelines for efficiently encoding these backlist titles. When files for the sample book were delivered back to us we saw that differentiating between numbered and unnumbered paragraphs had been highly successful.[9] When queried whether it had been possible to arrive at this distinction based on the pre-tagged OCR-derived text alone, the vendor replied, "We checked it thoroughly against the print book." Referring back to HEB's online tagging specifications, the vendor had also added a number of hyperlinks for internal cross-references, although this was not among our list of requests. The problem with programmatically adding these links was that for certain elements, such as figures, the labeling in the text might not directly correspond to the figure entity name itself, especially as HEB may have also flagged tables or other graphic components to be treated as figures that hadn't been treated as such in the print book. For example, "Figure 3" might not refer to figure entity "heb90001.0003" but could in fact mean entity "heb90001.0005". Thus the target could only be correctly identified by hand-checking or through use of a previously established chart, something that seemed rather labor-intensive and which we were not planning on asking for.
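The mismatch between printed figure labels and entity names can be shown in a few lines. This is an illustrative sketch only: the chart contents and function names are hypothetical, and only the "Figure 3" example and the entity-name pattern come from the text above.

```python
from typing import Dict, Optional

# Hypothetical hand-built chart: printed label -> figure entity name.
# The single entry mirrors the example in the text.
FIGURE_CHART: Dict[str, str] = {
    "Figure 3": "heb90001.0005",
}


def naive_entity(book_id: str, label: str) -> str:
    """Guess the entity name from the printed figure number alone."""
    number = int(label.split()[-1])
    return f"{book_id}.{number:04d}"


def resolve_entity(label: str, chart: Dict[str, str]) -> Optional[str]:
    """Look the label up in the chart; None means it still needs hand-checking."""
    return chart.get(label)


print(naive_entity("heb90001", "Figure 3"))    # the programmatic guess
print(resolve_entity("Figure 3", FIGURE_CHART))  # the entity the chart points to
```

The programmatic guess and the chart disagree for this label, which is why a purely automatic approach to cross-reference links would have produced wrong targets.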

In the end all this handwork was beneficial for the 20 experimental titles in question, because it made them somewhat more user-friendly, but less beneficial for tracking actual effort expended with the intent of establishing whether or not the experiment was ultimately replicable and scalable. Of course, one might also argue that if this extra effort didn't result in an exorbitant cost increase, we were in fact getting everything we needed out of the experiment, and consequently there was no need to establish a routine that avoided most handwork.

Another post-conversion issue we encountered was an unexpected number of OCR-related errors. We received notification from the vendor that they'd found a number of typos resulting from imperfect OCR in the sample title (e.g., "i" or "I" substituted for "1" and extraneous spaces between numerals, both problematic for a historical text making extensive use of dates). On closer inspection we realized there were quite a lot of these types of errors, and that there would be no means of correcting them without proofreading the entire book. Based on these findings, we decided that HEB would spend a short amount of time (about 10 or 15 minutes) upon receipt checking each title for OCR-related problems and typos and accordingly decide, on a case-by-case basis, whether word-for-word proofing was warranted. We discovered eventually that the extent of typos depended in part on the font used in the print edition and that some books therefore turned out much cleaner. Proofreading and implementation of any resulting corrections would be handled by the vendor as well. The other option of hiring freelance proofreaders seemed less practical: since in-house production work is not the norm for HEB, we did not have enough freelancers at our disposal to check multiple titles at the same time.
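The two error types mentioned above lend themselves to a quick automated screen. The sketch below is an assumption about how such a check might look, not a description of the procedure HEB or the vendor used. It only flags candidates for a human to review, since a pattern match cannot tell a misread date from a legitimate sequence of numbers.

```python
import re

# A token made up only of digits and "i"/"I", containing at least one of each,
# e.g. "I848" or "18i2" where "1848" or "1812" was probably intended.
LETTER_FOR_DIGIT = re.compile(r"\b(?=\w*\d)(?=\w*[iI])[\diI]{2,}\b")

# Two short digit groups separated by a single space, e.g. "18 52".
SPLIT_NUMBER = re.compile(r"\b\d{1,3} \d{1,3}\b")


def flag_ocr_suspects(text):
    """Return (kind, matched_text) pairs in order of appearance, for human review."""
    hits = []
    for kind, pattern in (
        ("letter-for-digit", LETTER_FOR_DIGIT),
        ("split-number", SPLIT_NUMBER),
    ):
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group()))
    hits.sort()
    return [(kind, found) for _, kind, found in hits]


# Made-up sample sentence, not taken from any HEB title.
sample = "Born in I848, she moved in 18 52 and died in 1901."
for kind, found in flag_ocr_suspects(sample):
    print(kind, repr(found))
```

A screen of this kind could support the brief per-title check described above, but it would not catch errors outside these two patterns.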

Five books in total among the sample set were selected for proofing. Beyond the numeral/letter confusion already noted, HEB also instructed the vendor to be on the lookout for missing italics and diacritics. When the corrected books were returned to us, improvements had been made in all these areas, and in addition paragraph numbering had been corrected (as certain paragraph breaks had been missing from the OCR source). But HEB still found a fairly large number of oversights involving other items that hadn't been explicitly mentioned for review. Examples of these are the use of italics in phrases and titles where italic and roman font alternate, use of hyphens versus em dashes, and additional or missing spaces that were overlooked by the proofreader. Even upon resubmitting the files to the vendor for multiple rounds of proofing, the results did not significantly improve. HEB therefore concluded that this particular step was not something that could easily be performed outside of a traditional publishing set-up; that is, without drawing on an established copyediting and proofreading review cycle and its associated personnel, accustomed to making these types of editorial distinctions.

Before submitting these books to SPO for upload, HEB opted to correct a number of other oversights not addressed by proofing, with the intention of devoting at least the same amount of time and attention to these as to the regular page-image books, which undergo a standardized quality-control assessment before being released to the live collection site. In keeping with the general protocol for this set of books, the plan was to tackle only those items that would otherwise interfere with usability and reader comprehension, leaving alone anything deemed merely a question of aesthetics or formatting preference. An example of the former might be missing indentation for nested lists in the index, since leaving this uncorrected could have the effect of obfuscating the relationship between main entry and sub-entries. An example of the latter might be switching to title case for text originally styled in small caps—something our XML titles can't include—and converted by default as regular caps. In the end, HEB did devote somewhat more time to streamlining the XML backlist than is normally spent on checking page-image books, spending between four and eight hours on each experimental title.

E. Hits received by XML versus Page-Image titles

Once the books were uploaded, HEB compared the number of hits received by each version. For this report we examined the entire time span for which the experimental XML titles had been available online, mid-January through mid-November 2008. During this initial ten-month period, seven XML titles outperformed their page-image counterparts, with another three receiving a comparable number of hits.

There are some caveats here, though: for one thing, MARC records for the XML versions were not disseminated to subscribers until May, as HEB generally distributes MARCs for new titles in one large batch only once a year. Hence, these books were less likely to be found by readers using the subscribing institution's digital catalog (the most common route of access) until this point. Also, for the duration of our reader survey (see section III below), April through July, users were explicitly instructed to examine both versions—meaning, numbers for both are artificially inflated during this period, and the two versions were also more likely to get equal attention. We therefore assume that the numbers become more reliable only starting in August, after which point neither format was more likely to be found or clicked than the other. The majority of page-image books outperformed XML titles during this latter period (August through November 2008) by a margin somewhere between 2:1 and 20:1, with only three XML titles receiving the greater number of hits.[10] Of these three, the total number of hits received ranged from 17 to 180, indicating they probably do not constitute the most reliable sample.[11]

It is also important to remember that the approximation of "page" hits in XML books—based on print page count and the number of accessible sections—is complicated further in this case by the fact that the XML backlist titles can be accessed at any level, although, as previously mentioned, HEB has attempted to compensate for this by averaging out all possible levels of access. Taking all this into account, it may simply be necessary to monitor usage for a longer period of time before any sort of definitive pattern can be established.

F. Production cost analysis

Based on other price quotes as well as past experience, HEB did not find the XML conversion from OCR text to be overpriced. Still, at an average cost of about $285 per title, this was considerably higher than the average $170 scanning and processing fee associated with our page-image titles.[12] Added to this was the cost of proofreading the books and entering corrections, about $470 per title for the five books we elected to have proofed. For this experiment, the average price for the 20 books in question therefore came to about $400 per title.[13]

 

                          TOTAL        AVERAGE COST PER TITLE
Conversion only           $5,724.75    $286.24* (for 20 books)
Proofing only             $2,339.75    $467.95 (for 5 books)
Conversion and proofing   $8,064.50    $403.23** (for 20 books)

* If developing books from scratch, the cost of scanning, OCR, and other processing fees would need to be added for an accurate total, bringing the average conversion cost to about $455 per title.

** If developing books from scratch, the cost of scanning, OCR, and other processing fees would need to be added for an accurate total, bringing the average cost of conversion and proofing to about $570 per title.
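The figures in the table can be re-derived with a few lines of arithmetic. The sketch below uses only the totals and title counts given above, plus the average $170 scanning and processing fee quoted for page-image titles; the function and variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

CONVERSION_TOTAL = Decimal("5724.75")  # 20 books
PROOFING_TOTAL = Decimal("2339.75")    # 5 books
SCANNING_FEE = Decimal("170")          # approximate average per page-image title


def per_title(total: Decimal, titles: int) -> Decimal:
    """Average cost per title, rounded to the nearest cent."""
    return (total / titles).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


combined_total = CONVERSION_TOTAL + PROOFING_TOTAL

conversion_avg = per_title(CONVERSION_TOTAL, 20)
proofing_avg = per_title(PROOFING_TOTAL, 5)
combined_avg = per_title(combined_total, 20)

print("Conversion only:", conversion_avg)
print("Proofing only:", proofing_avg)
print("Conversion and proofing:", combined_avg)

# From-scratch estimates (see the two notes above): add the scanning fee.
print("Conversion from scratch:", conversion_avg + SCANNING_FEE)
print("Conversion and proofing from scratch:", combined_avg + SCANNING_FEE)
```

Adding the scanning fee gives roughly $456 and $573 per title, in line with the rounded figures of about $455 and $570 in the notes.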

Not factored into this was the overhead for post-proofing in-house review and corrections. However, whether or not this should be added to the total is open to debate; as noted above, a certain amount of quality control goes into every title published by HEB and is part of our regular workflow. Using a professional freelance proofreader would probably have resulted in less subsequent review time for HEB staff, but this consultant's fee would undoubtedly have inflated overall proofing costs. The final verdict remains that producing XML-encoded titles is in any case significantly more expensive than producing page-image titles.


III. User Survey

A. Description of survey

In order to help us assess the relative usefulness of the two e-book formats, HEB planned from the outset to conduct a user survey asking our readers to compare versions. We were especially interested in polling readers on the following: formatting, as pertaining to readability and user-friendliness; accuracy of text; and navigation. Accordingly, we devised a survey that covered these areas in several questions. For accuracy, we asked specifically about OCR-related errors spotted in XML books and whether these had an impact on readability; for format and appearance, we asked how important it was to replicate print formatting exactly; for navigation, we asked users to evaluate the scanned books' page-by-page structure versus the encoded books' multi-level subdivision structure. We also asked for an evaluation of individual features offered only by one format or the other (e.g., enlargeable images, copying and pasting, complete fidelity to original text, etc.). Where appropriate, we asked readers to elaborate on their answers in open-format questions. Finally, we requested an overall assessment of which format seemed more satisfactory and why.

While we had certain expectations regarding which format would be preferred for specific features—e.g., page-image books for accuracy and formatting, XML for navigation, etc.—the challenge was to pose these questions to survey-takers without inadvertently swaying them one way or the other. HEB associate editor Brooke Belott coordinated the effort and assembled these questions using the online service surveymonkey.com, which could also be used for easily tracking results. (See figure 6.)

Figure 6: First page of HEB survey

Groups to be targeted were all primary contacts at subscribing institutions, as well as individual subscribers and contacts at our collaborating ACLS learned societies. We sent out initial invitations to participate in April 2008, then two more follow-up e-mails in subsequent months. As an incentive for responding we offered an iPod Touch giveaway to be raffled off among all participants. The survey was available online from April through the end of July; a total of 226 responses were received, about 78% from participants who identified as librarians. Other categories of responders included faculty members at subscribing institutions, independent scholars, and students. While this response rate was somewhat lower than HEB would have desired, in the context of this particular survey, it was sufficient to provide us with a good idea of user preferences.[14]

B. Outcome and implementation of results

Taking all factors into consideration, XML was considered the preferred format by 69% of survey participants. In addition, even users who preferred the page-image books overall often mentioned liking the XML encoded version better for certain uses. A number of respondents pointed out that they would consider it ideal to be able to access all titles in both formats.

The top reasons that the encoded version was preferred were readability, the benefits of having access to a text-based format (entailing, for example, the ability to cut and paste, as well as the option of using automated screen readers), and, as a surprisingly distant third, the interactive features offered in the XML books. While the latter two factors were somewhat anticipated, HEB did not expect that the notion of "readability" would take on such importance; in fact, about a third of all encoded-text proponents cited this as a factor influencing their preference. Presumably the books' uniform appearance with a consistently legible font was of relevance here (some pointed out that type in the page-image books appeared "fuzzy" or "blurry" to them). Perhaps the no-frills page layout was also significant, with wide margins and elements such as figures and tables centered rather than encompassed by wrap-around text, along with access to longer text sections. Since the necessity of conserving page real estate generally comes into play much more forcefully in print-book production, the same type of generous layout usually doesn't exist for HEB's page-image titles.

Conversely, a significant number of respondents preferred the page-image books precisely for preserving the original page layout, as well as for mirroring the print format and being more "book-like" in general. One factor prominently cited in this regard was the notion of navigating through an electronic title as though turning the pages of an actual print book. These were the most prominent reasons given for choosing this format, followed by accuracy and fidelity to the original text. In a similar vein, the page-image proponent group reported feeling dissatisfied with the formatting used for print-derived page numbers in the XML books—rendered in small font and bracketed, as we encourage paragraph-based citation for our encoded titles—and feared this would prevent easy cross-referencing between the print and electronic editions. (See figure 7.)

Figure 7: Example of a print page number in an XML title (see circled text)

From our survey we were also able to gather some additional feedback on the collection at large, enabling us to make a number of improvements. Foremost among user requests was a desire for better printing options. Printing of HEB titles has always been restricted to fair-use provisions, and for this reason there had never been any immediate way of printing out pages without prior browser adjustment to accommodate frames—the intention being to discourage printing out long sections of copyrighted text at once. Because this was an issue for so many of our respondents, HEB decided to make page-image books available as PDF (Portable Document Format) files in the future to facilitate printouts of up to three pages at a time. We also now make the original (unformatted) OCR-derived text files available for viewing highlighted search terms and application of screen readers. Both these options now appear in a pull-down menu in the toolbar at the top of each page. (See figure 8.)

Figure 8: Example of pull-down menu for page-image title viewing options (see circled text)

Another improvement to the collection resulting from survey feedback will be development work on a module allowing for easy citation of whichever title is currently being viewed, rather than users having to compile bibliographical information by hand. Finally, while we will not be breaking XML titles down by print pages in the future, as some respondents requested, we will be revising formatting of page-number references to alleviate the difficulty some users encountered with immediately locating these.


IV. Desirability and Viability of Extensive XML Conversion

A. Practical considerations for undertaking mass conversion

Is a mass retroactive conversion of page-image titles into XML desirable? The survey results, if not yet the user stats, do seem to bear this out. For providers such as HEB, with the number of prospective backlist titles to be converted in the thousands, however, there are several factors to take into consideration first.

In-house staff size directly affects the efficiency of the preproduction and review process. At HEB, two part-time staff members were available to work on this for the duration of our conversion experiment, with additional input from other staff at various junctures. We have in effect concluded that reasonably error-free books of a uniform quality cannot be produced without the involvement of a production editor (or equivalent personnel); a larger core group of production staff would therefore be a prerequisite for successfully processing HEB's entire collection. Securing adequate funding would also be a necessity, since HEB's initial findings indicate a cost for converting to XML of nearly 2.5 times (3.5 times if developing books from scratch) the approximate expense of a page-image title. This could probably be reduced if a larger number of titles were processed simultaneously. However, the conversion price we started out with was already relatively low, and our assumption is therefore that the additional savings to be had here are limited.

An entirely different approach could be to relax XML guidelines in order to facilitate a completely programmatic conversion. However, such a move would entail a lowering of standards that would almost certainly result in a less robust and therefore less desirable (if not unusable) product.

Whether or not mass conversion makes sense in the end depends on the size, scope, and resources of the organization in question. Before an effort of this magnitude is actually considered, however, publishers may also want to investigate further the longevity of XML as an optimal format for their digital publishing needs, and consider alternatives that might offer similar benefits without calling for such extensive time and financial investments. This topic could be covered at length in another white paper; some initial thoughts are presented below in section B.

B. Future formats for digital publishing

HEB decided on XML over other encoding options prior to launching its collection in 2002 for a number of reasons, primarily related to versatility and flexibility: "XML allows us to create our own set of tags so we can mark up specific elements in the e-book and style and manipulate them as we choose.… For more efficient development and maintenance, books encoded in XML can be processed as a collection. Designs can be managed through templates, and new functions can be programmed for books across the collection."[15] But other publishers are now beginning to move away from XML. Erich van Rijn, Director of Publishing Operations at the University of California Press, reports that the press's new digital titles are currently being released as web-optimized PDFs only.[16] This process is also being applied to its catalog of backlist titles slated for online release. The press is, however, considering reintroducing the format by generating an additional XML version of each title, minimally tagged to comply with the DTBook (Digital Talking Book) DTD.[17] In the past, a more elaborate DTD had been used for encoding the press's California Digital Library collection, but this deeper tagging was eventually dropped because the final application of these files, destined for online presentation only, did not appear to warrant the considerable effort involved.

Project MUSE, an online archive of digitized journals based at Johns Hopkins University Press, currently does make use of XML, in accordance with NLM's (National Library of Medicine) Journal Archiving and Interchange Tag Suite.[18] In fact, this approach was adopted only last year and a number of archived texts originally encoded in HTML are now undergoing retroactive conversion, according to project director Mary Rose Muccie. Johns Hopkins University Press is also preparing a new e-book pilot project for launch, however, and these titles, for the time being, will not be encoded at all; instead, they will be offered as universally accessible PDFs.

There seems to be a growing consensus that hand-held reading devices, such as Amazon's Kindle and the Sony Reader, represent the next big wave of electronic publishing for the commercial book trade. Apple's iPhone can also download electronic-reader software, and further developments in this area appear to be imminent for traditional cell phones. While portability itself may be a big draw, the innovation of the user-friendly electronic paper display employed by several hand-helds has proven popular among readers, quelling one common complaint about e-books, namely, that they can't comfortably be read on a computer screen. AZW, Amazon's proprietary format, represents a variant of the Mobipocket standard, an encoded-text format designed specifically for e-books. Also commonly supported by portable readers and PDAs are plain text, MSWord documents, and HTML. Conversion programs are available allowing for the transformation of PDF into AZW for use with the Kindle; PDF is also supported as is by a number of other devices, for example Sony's Reader. One might therefore conclude that PDF is especially useful for accommodating both elaborate design elements and text-accessibility requirements, thus covering all bases for print and digital output. However, PDF has the disadvantage of offering neither reflowable text, a requirement for small-screen devices unable to legibly display the original image-based layout, nor cross-title searchability.

Publishers seem to be in agreement that, as yet, no universal industry standard suited to all prospective applications and devices is on the horizon. However, ePub, the successor to the Open eBook standard, may currently be the strongest contender for the e-book trade. EPub titles are composed of multiple compressed XHTML files and related metadata. They can be used with, or transformed for use with, several handheld devices and are likely to be even more widely supported in the future. The Johns Hopkins University Press e-book pilot program may eventually make use of this technology, though this option still needs to be fully vetted, as Muccie explains.
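To make the description of ePub as "multiple compressed XHTML files and related metadata" more concrete, the following is a minimal sketch in Python of how such a package might be assembled as a ZIP archive. It is an illustration only, not HEB's or any publisher's actual workflow: the file names (OEBPS/content.opf, OEBPS/chapter1.xhtml), the function names, and the simplified metadata are all assumptions, and the output has not been validated against the ePub specification.

```python
import io
import zipfile

# All file names and contents below are hypothetical, for illustration only.
CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Simplified package file: title metadata, a manifest of files, and reading order.
PACKAGE_OPF = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
    <dc:identifier id="id">sample-id</dc:identifier>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
  </spine>
</package>
"""

CHAPTER_XHTML = """<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>{title}</title></head>
  <body>{body}</body>
</html>
"""


def build_minimal_epub(title, body):
    """Return the bytes of a ZIP archive laid out like a one-chapter ePub."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry is written first and left uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        # The remaining entries (metadata and XHTML content) are compressed.
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("OEBPS/content.opf", PACKAGE_OPF.format(title=title),
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("OEBPS/chapter1.xhtml",
                    CHAPTER_XHTML.format(title=title, body=body),
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def list_entries(epub_bytes):
    """Return the entry names stored in the archive, in order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return zf.namelist()
```

Calling list_entries(build_minimal_epub("Sample Title", "<p>Text</p>")) returns the four entry names, which shows that the text itself lives in ordinary XHTML files that a reading device can reflow to fit its screen.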

In the scholarly domain, on the other hand, XML is likely to remain more relevant as a tool for effectively managing collections that depend on cross-searchability and uniform online presentation. Rather than catering to casual readers, such projects are often research-oriented, and therefore tend to incorporate a range of web-based features such as search modules and "save" options to keep track of searches and of individual titles being viewed, not to mention extensive archives of supplementary materials. Since XML is still considered to be highly compatible with HTML and well-suited to such online applications, we can assume it will persist here for some time.

But the question for e-publishing initiatives, in particular those producing titles not intended exclusively for online publication, is whether XML may eventually become an unnecessary intermediate step on the way to implementing a final, even more serviceable format. Any publishing entity considering a large-scale conversion of print books should probably explore this issue further before forging ahead.



APPENDIX: Survey Results

Below is a full list of questions and responses, excluding open-response comments.

 

1. Please identify yourself as one of the following, in terms of your primary function in accessing the HEB collection.
  Response Percent Response Count
Faculty 6.2% 14
Librarian 77.8% 176
Scholar/Researcher 5.3% 12
Student 2.7% 6
Other 8.0% 18
answered question 226
skipped question 0

 

2. From which location do you primarily access the HEB collection?
  Response Percent Response Count
U.S.A. - On Campus 61.5% 139
U.S.A. - Off Campus 10.6% 24
International - On Campus 21.7% 49
International - Off Campus 6.2% 14
answered question 226
skipped question 0

 

3. What type of institution are you affiliated with?
  Response Percent Response Count
College/ University 90.2% 204
Secondary School 2.7% 6
Library (Public or Private) 2.2% 5
Professional Institution, Society or Foundation 2.2% 5
Other or Not Applicable 2.7% 6
answered question 226
skipped question 0

 

4. What size is your institution?
  Response Percent Response Count
Doctoral Institution 45.6% 103
Masters Institution 19.0% 43
Liberal Arts/Baccalaureate College 16.4% 37
Associate's College 6.2% 14
Secondary School 2.7% 6
Public Research Library 2.2% 5
Private Research Library 0.4% 1
Other 7.5% 17
answered question 226
skipped question 0

 

5. How do you access the ACLS Humanities Book collection?
  Response Percent Response Count
College, university or library subscription 78.4% 177
Individual subscription (such as through AHA, RSA, or MESA) 7.5% 17
Trial subscription 8.8% 20
Other 5.3% 12
answered question 226
skipped question 0

 

6. While using the XML books, approximately how many typos or similar errors did you encounter?
  Response Percent Response Count
I didn't see any errors. 83.9% 183
I saw occasional errors. 16.1% 35
I saw errors in every XML book that I accessed. 0.0% 0
answered question 218
skipped question 8

 

7. How did the number of errors you encountered affect your experience using the XML books?
  Response Percent Response Count
It didn't cause any problems. 94.0% 205
It bothered me somewhat. 5.0% 11
It was extremely distracting. 1.0% 2
answered question 218
skipped question 8

 

8. Keeping the above in mind, which format do you prefer?
  Response Percent Response Count
Page-Image books 34.1% 73
XML books 65.9% 141
answered question 214
skipped question 12

 

9. What level of text do you prefer to access?
  Response Percent Response Count
One print page at a time 17.9% 38
One chapter or section at a time 49.6% 105
As much text as possible at once 32.5% 69
answered question 212
skipped question 14

 

10. Which format do you prefer for navigation?
  Response Percent Response Count
XML book 73.6% 156
Page-Image book 26.4% 56
answered question 212
skipped question 14

 

11. How important do you consider the following features?
 | Not Important | Somewhat Important | Very Important | Essential | Response Count
Text enlargement | 13 | 67 | 84 | 45 | 209
Pop-up notes | 56 | 95 | 45 | 13 | 209
Copy/paste | 11 | 45 | 87 | 66 | 209
Direct access to individual pages | 11 | 34 | 89 | 75 | 209
Enlargeable images | 8 | 50 | 102 | 49 | 209
Fidelity to original text | 8 | 33 | 57 | 111 | 209
Interactive index | 9 | 45 | 105 | 50 | 209
Highlighted search terms within e-book | 10 | 67 | 81 | 51 | 209
answered question 209
skipped question 17

 

12. Please rate your satisfaction with HEB's XML format e-books in the following areas:
 | Very Dissatisfied [Rating: 1] | Dissatisfied [Rating: 2] | Satisfied [Rating: 3] | Very Satisfied [Rating: 4] | Rating Average | Response Count
Accuracy | 1 | 3 | 153 | 49 | 3.21 | 206
Format & Appearance | 5 | 21 | 100 | 80 | 3.24 | 206
Structure & Navigation | 6 | 14 | 106 | 80 | 3.26 | 206
Features | 3 | 12 | 130 | 61 | 3.21 | 206
answered question 206
skipped question 20
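
The "Rating Average" column in questions 12 through 15 appears to be a count-weighted mean of the rating values, although the survey export does not state the formula. The short Python sketch below works under that assumption and reproduces the published figures from the response counts; the function name is hypothetical.

```python
def rating_average(counts):
    """Weighted mean of ratings 1..n, where counts[i] is the number of
    respondents who chose rating i + 1 (an assumed formula, see above)."""
    total = sum(counts)
    weighted = sum(rating * n for rating, n in enumerate(counts, start=1))
    return weighted / total


# Question 12, "Accuracy" row for XML books: counts for ratings 1 through 4.
xml_accuracy = round(rating_average([1, 3, 153, 49]), 2)

# Question 13, "Structure & Navigation" row for page-image books.
page_image_navigation = round(rating_average([7, 58, 116, 25]), 2)
```

Under this assumption the two rows come out at 3.21 and 2.77, matching the averages reported in the tables.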

 

13. Please rate your satisfaction with HEB's Page-Image format e-books in the following areas:
 | Very Dissatisfied [Rating: 1] | Dissatisfied [Rating: 2] | Satisfied [Rating: 3] | Very Satisfied [Rating: 4] | Rating Average | Response Count
Accuracy | 0 | 2 | 101 | 103 | 3.49 | 206
Format & Appearance | 2 | 36 | 117 | 51 | 3.05 | 206
Structure & Navigation | 7 | 58 | 116 | 25 | 2.77 | 206
Features | 8 | 60 | 120 | 18 | 2.72 | 206
answered question 206
skipped question 20

 

14. What is your overall impression of HEB's XML books?
 | Below Expectations [Rating: 1] | Meets Expectations [Rating: 2] | Exceeds Expectations [Rating: 3] | Rating Average | Response Count
Functionality/Navigation | 15 | 129 | 61 | 2.22 | 205
Appearance | 28 | 123 | 54 | 2.13 | 205
Print/Download Capability | 29 | 144 | 32 | 2.01 | 205
Ease of Citation Method | 30 | 146 | 29 | 2.00 | 205
answered question 205
skipped question 21

 

15. What is your overall impression of HEB's Page-Image books?
 | Below Expectations [Rating: 1] | Meets Expectations [Rating: 2] | Exceeds Expectations [Rating: 3] | Rating Average | Response Count
Functionality/Navigation | 40 | 150 | 15 | 1.88 | 205
Appearance | 31 | 152 | 22 | 1.96 | 205
Print/Download Capability | 65 | 133 | 7 | 1.72 | 205
Ease of Citation Method | 37 | 155 | 13 | 1.88 | 205
answered question 205
skipped question 21

 

16. Taking all of the survey topics into account, which type of e-book do you prefer?
  Response Percent Response Count
Page-Image book 31.2% 64
XML book 68.8% 141
answered question 205
skipped question 21



Notes

1. Further information on titles and series can be found at http://www.humanitiesebook.org/titlelist.html.

2. For details on functionality and why HEB originally chose to work with XML rather than encoding directly in HTML or XHTML, see HEB's first white paper: Nancy Lin, "Report on Technology Development and Production Workflow for XML Encoded E-Books" (New York: ACLS, 2003), available online at http://www.humanitiesebook.org/heb-whitepaper-1.html.

3. Even for frontlist titles, editorial supervision on the part of collaborating presses has not always been forthcoming. In fact, for the duration of HEB's existence as an e-publishing initiative, presses were found to be somewhat reluctant to adopt our tested production workflow, in spite of the financial support provided to facilitate this.

4. Price quotes for double keying varied from vendor to vendor, increasing the total per-title conversion cost by at least 200%.

5. Note that this equation would inevitably change if dealing instead with a large quantity of e-books being developed from scratch, in which case the cost of the OCR scanning process and its associated management fees would need to be figured in separately. This would mean adding in most of the total processing cost for regular page-image books, on average about $170 per title. Also see section II.F, "Production cost analysis".

6. There was an additional motive behind this, namely, our desire to test how readers would respond to being able to access as much text as they wanted at a time without HEB controlling this factor. If the response were positive, and royalties were not adversely affected, we would consider implementing this policy for all future XML titles.

7. The standard version is available for download on our website: http://www.humanitiesebook.org/xml/doc/acls-hebook-doc.html.

8. Berkeley: University of California Press, 1998.

9. HEB specifies that most paragraphs be numbered for reference purposes, with the exception of extracts, epigraphs, source information for elements such as tables, and a few other cases.

10. These were Elvin Hatch, Respectable Lives: Social Standing in Rural New Zealand (Berkeley and Los Angeles, California: University of California Press, 1992); Ruby Hart Phillips, Cuba: Island of Paradox (New York: McDowell, Obolensky, 1959); and Paul Stephenson, Byzantium's Balkan Frontier: A Political Study of the Northern Balkans, 900-1204 (New York: Cambridge University Press, 2000).

11. In contrast, the highest number of hits received by an individual title within the experiment was a combined total of about 8,000 hits for the two versions of Stephen Alford, Kingship and Politics in the Reign of Edward VI (Cambridge: Cambridge University Press, 2002).

12. This figure is derived from per-page processing fees in conjunction with average page count for HEB titles.

13. As previously noted, these numbers are based specifically on HEB's conversion experiment, which included use of preexisting OCR-derived text files. In order to gauge production costs outside of this particular context, but drawing on a similar workflow, the amount required for XML conversion would be $570 (this includes the prorated cost of proofing), nearly 3.5 times the average scanning cost of $170. HEB wishes to emphasize that these should be considered ballpark figures only and may fluctuate dramatically from situation to situation depending on a number of variables, such as the quantity of titles being converted, the quality of the source texts, and the extent of post-conversion quality assessment and subsequent revisions.

14. Also see the Appendix for detailed survey results.

15. Lin, "Report on Technology Development," p. 6.

16. Several observations and considerations regarding digital formats in this section were derived from a conversation with van Rijn in December 2008.

17. This conforms in large part to the World Wide Web Consortium's HTML 4.0 specifications while simultaneously assuring better access for automated readers.

18. Details on this can be found at: http://dtd.nlm.nih.gov/index.html.


rev. 11/30/2016