Mining the Historian Archives 07

Digitizing the Published Volumes

I plan to finish scanning the seven published volumes this summer so that we can continue to publish the books on CD. We have just about run out of print copies. I had expected to OCR the full text, which takes far longer than plain scanning but makes it searchable. However, OCRing the Seven-Volume Index instead would be a far superior solution. We can place the Seven-Volume Index online as part of our project, and also produce it as a searchable text document.

Every entry in the Seven-Volume Index points to a specific volume and page number. I could picture re-scanning the pages in, say, 50-page chunks. If we had sufficient storage space, we could store those chunks online. Thus the Seven-Volume Index item could link to the relevant 50-page chunk.
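The page-to-chunk lookup is simple arithmetic. Here is a minimal sketch, assuming pages are numbered from 1 within each volume and chunks are aligned on 50-page boundaries (1-50, 51-100, and so on). The function names and the file-naming pattern are hypothetical, chosen only for illustration.

```python
from typing import Tuple

CHUNK_SIZE = 50  # pages per chunk, the figure suggested above


def chunk_for_page(page: int, chunk_size: int = CHUNK_SIZE) -> Tuple[int, int]:
    """Return (first_page, last_page) of the chunk that contains `page`."""
    if page < 1:
        raise ValueError("page numbers are assumed to start at 1")
    first = ((page - 1) // chunk_size) * chunk_size + 1
    return first, first + chunk_size - 1


def chunk_filename(volume: int, page: int) -> str:
    """Build a hypothetical file name for the chunk holding this page."""
    first, last = chunk_for_page(page)
    return f"vol{volume}_p{first:04d}-{last:04d}.pdf"
```

Under these assumptions, an index entry pointing to volume 3, page 127 would resolve to the chunk covering pages 101 through 150. The last chunk of a volume would usually be shorter than 50 pages, which this sketch does not account for.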

Given that we want the revenue from selling copies of our published books, we may not want to make those 50-page chunks freely available. However, those chunks might be awfully useful for the SFAA historians doing research. We could perhaps make them a benefit of Association membership or part of an add-on subscription.

The digitized volumes are outside the scope of this project. However, it would make sense to allow for this possibility in the data design.

The question arises when we decide what to show for a Seven-Volume Index entry. We know the name, possibly the Dwight number, the descendant line, and the volume and page number. We could display search results in a grid, with columns containing links to each of those things if available. That is, the columns could optionally link to:

  • The page with all resources for that name
  • The page with all resources for that Dwight number
  • The page with all resources for that descendant line
  • The 50-page chunk of the actual digitized volume
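To show how the data design could allow for these optional links, here is a rough sketch of one index entry and the grid columns derived from it. Everything in it is an assumption made for illustration: the field names, the URL paths, and the treatment of the Dwight number as free text. A column is set to None when there is nothing to link to, so the chunk link can stay switched off until the digitized volumes are available.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import quote


@dataclass
class IndexEntry:
    name: str
    volume: int
    page: int
    dwight_number: Optional[str] = None    # not every entry has one
    descendant_line: Optional[str] = None


def link_columns(entry: IndexEntry,
                 chunks_available: bool = False,
                 chunk_size: int = 50) -> Dict[str, Optional[str]]:
    """Return one link (or None) per grid column. URL paths are hypothetical."""
    # First page of the 50-page chunk that holds this entry's page.
    first_page = ((entry.page - 1) // chunk_size) * chunk_size + 1
    return {
        "name": f"/name/{quote(entry.name)}",
        "dwight_number": (f"/dwight/{quote(entry.dwight_number)}"
                          if entry.dwight_number else None),
        "descendant_line": (f"/line/{quote(entry.descendant_line)}"
                            if entry.descendant_line else None),
        "chunk": (f"/chunk/{entry.volume}/{first_page}"
                  if chunks_available else None),
    }
```

With this shape, the grid can render a plain cell wherever the value is None and a link everywhere else, and adding the digitized volumes later means flipping one flag rather than changing the stored data.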

The Manuscripts in Progress

I have a large number of Microsoft Word documents. They range from a few lines to hundreds of pages each. Nearly all of them tie to a specific Dwight number (and therefore specific descendant line).

Because they are in digital form, they can be placed online and/or published on CD. That work is outside the scope of this project.

What’s important is that they can be ingested to create a name index. I spent a year trying to turn such documents (or OCR’d scans of printed documents) into genealogical databases, with little success.

However, it might well be feasible to scan these documents to form a list of primary names. Since we don’t want to just give away the content, I don’t see any reasonable way to have Google scan and index the documents for us. But if we can create a list of the primary names, we can add those names to our central repository.

It might in fact be feasible and sensible to capture some of the surrounding context. Thus it might make sense to provide for storing and displaying additional text to go with the name. That could be invaluable for searchers to identify the exact item they seek.
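As a rough illustration of capturing context, the sketch below takes the plain text of a document and a name already identified in it, and returns a short snippet of the surrounding text. It assumes the document has already been converted to plain text and that the name appears verbatim; identifying the primary names in the first place is the harder problem and is not attempted here. The function name and the default width are hypothetical.

```python
from typing import Optional


def context_snippet(text: str, name: str, width: int = 40) -> Optional[str]:
    """Return up to `width` characters on each side of the first
    occurrence of `name`, or None if the name does not appear."""
    pos = text.find(name)
    if pos == -1:
        return None
    start = max(0, pos - width)
    end = min(len(text), pos + len(name) + width)
    # Collapse line breaks and repeated spaces so the snippet fits on one line.
    return " ".join(text[start:end].split())
```

The snippet could be stored alongside the name in the central repository and shown in search results. Keeping it short would limit how much of the manuscript content is given away.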