Data Design

Project Design 03

Seven-Volume Index

What is the best way to achieve a usable result which any member of the Association can see as clearly beneficial? I think the answer is the Seven-Volume Index. An ideal result would be to connect it to the Langbehn database, but I’m not sure there’s time for that to happen.

One key challenge is digitizing the Index itself. I have much of the original index on floppy disks in a custom file format, written under MS-DOS, probably on a PC/AT. There’s no way of knowing whether those disks are complete, or whether they represent the final copy after proofing.

I therefore think the best approach is to scan and OCR the text. The index is on 8.5×11″ (standard letter size) paper. There are 861 pages. It is printed in four columns per page. This means there may be a major challenge in converting this to a usable text file.
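
To get a feel for the scale, here is a minimal sketch of how each scanned page might be cut into column images before OCR. The 300 DPI resolution, the equal column widths, and the half-inch margins are my assumptions for illustration; the real pages would need to be measured.

```python
# Sketch: crop boxes for a four-column letter-size page scan.
# Assumptions (not measured from the real Index): 300 DPI scans,
# equal-width columns, and a uniform half-inch margin on every side.

def column_boxes(dpi=300, width_in=8.5, height_in=11.0, columns=4, margin_in=0.5):
    """Return one (left, top, right, bottom) pixel box per column."""
    page_w = round(width_in * dpi)
    page_h = round(height_in * dpi)
    margin = round(margin_in * dpi)
    usable = page_w - 2 * margin
    boxes = []
    for i in range(columns):
        left = margin + usable * i // columns
        right = margin + usable * (i + 1) // columns
        boxes.append((left, margin, right, page_h - margin))
    return boxes

PAGES = 861                      # page count of the printed Index
boxes = column_boxes()
total_columns = PAGES * len(boxes)   # column images to OCR in total
```

Each box could be handed to whatever image tool does the cropping, and the columns OCR'd one at a time so the text comes out in reading order.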

It seems to me, though, that the best approach is to start with the digitized index. Getting a working text file is independent of the Data Design, and can therefore be started right away.

Posted by admin - June 28, 2014 at 2:05 pm

Categories: Data Design   Tags:

Project Design 02

I view our project (the Web site) as having two distinct phases:

  • We have the User Experience. I believe this is what we want to design first.
  • We have the import process. The data creation/import process is undoubtedly different for each type of artifact or collection.

Project Design 01 talked about the first item. This post talks about the second item.

Posted by admin at 1:45 pm

Project Design 01

Project Constraints

The solution is dictated to be MySQL with the CakePHP framework for PHP. My self-study is aimed in that direction. I am assuming this won’t create a barrier to implementing the data design in Oracle at the same time.

CakePHP has some slightly weird table-naming expectations. I have found that if you follow the CakePHP expectations to the letter, you can get things up and running in extremely rapid fashion. If not, you don’t. I am therefore dictating that the actual MySQL implementations follow the naming conventions documented at http://book.cakephp.org/2.0/en/models/associations-linking-models-together.html.

CakePHP does have a good way of building a solid Web site based on highly normalized tables. CakePHP thrives on a correctly-normalized design. It’s very convenient, given the curriculum sequence, that I intend to create the data design first and build the Web site out from the schema.

I would ideally like to have some sort of useful demonstration up and running for the Strong Family Reunion.

This means that I need to find a way to articulate what I’m trying to accomplish, and figure out what this means for people willing to help. These would be genealogy people, not hi-tech computer people.

Posted by admin at 1:25 pm

Mining the Historian Archives 08

The Project

Can we boil all of this background down to a reasonable project? I have hopes that we can.

I envision the following primary ways of looking at the Historian Archives:

  • Browse and search our various online genealogy databases. We receive a file in GEDCOM format and import it to TNG for Web display.
  • For a given Dwight number, browse a list of collections tied to that Dwight number. You can click any item in the list to get to that collection. The Dwight number is meaningless in itself; it is in effect a surrogate key. All Dwight-based links need to have suitable descriptive text.
  • For a given name, browse a list of resources tied to that name. This may be tricky to display in useful and coherent fashion. This may be a good place to separate presentation from content.
  • For a given descendant line, browse all resources tied to that descendant line.
  • Each display page should likely contain links to each of the above, tuned to that specific page.
  • We have a table of contents page providing links to each collection.
  • We provide a means (probably an XML site map) for search engines to correctly index the site(s).
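
As an illustration of the last item, here is a minimal sketch of generating an XML site map with Python's standard library. The example.org URLs are placeholders I made up; only the sitemaps.org namespace is standard.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical page addresses: one collection listing, one Dwight-number page.
pages = [
    "https://example.org/collections/",
    "https://example.org/dwight/1",
]
sitemap_xml = build_sitemap(pages)
```

In practice the list of URLs would come from the database, one per collection page and one per Dwight-number page.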

Posted by admin - June 27, 2014 at 8:21 pm

Mining the Historian Archives 07

Digitizing the Published Volumes

I plan to finish scanning the seven published volumes this summer so that we can continue to publish the books on CD. We have just about run out of print copies. I had expected to OCR the text, which takes a LOT longer but makes it searchable. However, if we were to OCR the Seven-Volume Index, that would be a far superior solution. We can place the Seven-Volume Index online as part of our project, and also produce it as a searchable text document.

Every entry in the Seven-Volume Index points to a specific volume and page number. I could picture re-scanning the pages in, say, 50-page chunks. If we had sufficient storage space, we could store those chunks online. Thus the Seven-Volume Index item could link to the relevant 50-page chunk.
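
A minimal sketch of that lookup, assuming chunks start at page 1 of each volume and run in fixed 50-page blocks. The file-naming pattern is hypothetical; nothing has been scanned in this form yet.

```python
def chunk_for(volume, page, chunk_size=50):
    """Map an index entry (volume, page) to the scan chunk that holds it.

    The file name pattern is a made-up convention for illustration.
    """
    if not 1 <= volume <= 7:
        raise ValueError("volume must be between 1 and 7")
    if page < 1:
        raise ValueError("page must be 1 or greater")
    first = (page - 1) // chunk_size * chunk_size + 1
    last = first + chunk_size - 1
    return f"vol{volume}_p{first:04d}-{last:04d}.pdf"

# An index entry pointing at volume 3, page 127 falls in pages 101-150.
example = chunk_for(3, 127)
```

Because the chunk name is computed from the volume and page alone, the Index entries themselves would not need to store a link.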

Posted by admin at 7:34 pm

Mining the Historian Archives 06

Twenty-Four Shelf Feet

I have a very large number of file folders. Each file folder is labeled with a name and a Dwight number. It has proven to be completely impractical to process or even inspect each item within each file folder.

However, it might be possible to process the file folders themselves. I can envision a simple data-entry mechanism such as a spreadsheet to use as a means of inventorying the file folders.

Each file folder has a Dwight number and a name. Those are our two key lookup mechanisms.
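
Here is a minimal sketch of what such an inventory might look like once the spreadsheet is saved as CSV. The column names, the shelf codes, and the sample rows are invented for illustration; they are not real inventory data.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows. The "shelf" column is a hypothetical extra field.
SAMPLE = """dwight_number,folder_name,shelf
1,Elder John Strong,A1
27454,Joseph Barnard,C4
27454,Joseph Barnard (correspondence),C5
"""

def load_inventory(text):
    """Build the two lookups: by Dwight number and by folder name."""
    by_number = defaultdict(list)
    by_name = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        by_number[int(row["dwight_number"])].append(row)
        by_name[row["folder_name"].lower()].append(row)
    return by_number, by_name

by_number, by_name = load_inventory(SAMPLE)
```

Both lookups hold lists because several folders can carry the same Dwight number.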

Posted by admin at 6:58 pm

Mining the Historian Archives 05

Contributed GEDCOM Files

There is something which I have been trying to figure out for five years now. This project approach may well provide the solution.

The basic problem which I have been trying to figure out is, How do I fold all of this disparate information together in some coherent way so that we can publish it for our membership? It’s proven to be an impossible task. We have collected genealogical information which has not seen the light of day for more than a quarter century. Many of our information submitters never lived to see their submissions published or otherwise made available!

To be sure, our Association has been actively working with the information, curating and combining it. But it’s never been in a form to be published.

1 comment - Posted by admin at 5:58 pm

Mining the Historian Archives 04

Getting to the Specifics

The point of this project is to make information available in usable form. We’ll now look at each collection and see what I envision us doing with it.

Posted by admin at 4:56 pm

Mining the Historian Archives 03

Genealogy Databases

I personally have a number of genealogy databases, in actual database form:

  1. The John Langbehn database. John, for many years, has been typing the contents of Dwight’s History of the Strong Family into his Family Tree Maker program, a Windows-based genealogy program. Each person, in this Langbehn database, has the Dwight number typed into the “User Reference Number” field.
  2. I have previous editions of the same database from John Langbehn.
  3. I have two versions of my own research database.
  4. I have other databases contributed from related persons.

I have commercial software (TNG, The Next Generation of Genealogy Site Building) capable of importing each of the above databases into its MySQL database and displaying the information as web pages.
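
Because the Langbehn database carries the Dwight number on every person, a GEDCOM export of it could be indexed by Dwight number. The sketch below assumes the “User Reference Number” field is exported as a level-1 REFN line (the GEDCOM tag for a user reference number) and that NAME appears before REFN within each record. I have not verified either against an actual export, and the sample records are made up.

```python
# Made-up sample in GEDCOM line format: "<level> [<xref>] <tag> [<value>]".
SAMPLE_GEDCOM = """0 HEAD
0 @I1@ INDI
1 NAME John /Strong/
1 REFN 1
0 @I2@ INDI
1 NAME Joseph /Barnard/
1 REFN 27454
0 TRLR
"""

def index_by_dwight(gedcom_text):
    """Return {dwight_number: name} for individuals with a numeric REFN."""
    index = {}
    name = None
    in_indi = False
    for line in gedcom_text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == "0":
            # In "0 @I1@ INDI" the second field is the xref, the third the tag.
            in_indi = value == "INDI"
            name = None
        elif in_indi and level == "1" and tag == "NAME":
            name = value.replace("/", "").strip()
        elif in_indi and level == "1" and tag == "REFN" and value.isdigit():
            index[int(value)] = name
    return index

dwight_index = index_by_dwight(SAMPLE_GEDCOM)
```

Keeping this as a read-only index means each database stays separate; nothing here merges trees.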

Since TNG is capable of maintaining each family tree database separately, there is no specific need to merge databases. You will recall that I have concluded it is ALWAYS a bad idea to merge groups of genealogy data. The result is a tangled mess. That in fact is why I have two versions of my own research database: I made a mess of the first to the point that I started over from scratch.

Posted by admin at 4:15 pm

Mining the Historian Archives 02

I don’t have a good handle on the exact project description. Instead let me describe what I am visualizing. From this, I trust, we can define what the project actually is. This post is about classification; then we’ll move on to details.

The Dwight Number

Benjamin Dwight, in 1871, published a two-volume History of the Strong Family. Each person in the books was individually numbered, with a total of about 29,000 numbered persons. Elder John Strong (1610-1699), the Progenitor of the family, is person #1. Sergeant Joseph Barnard (1641-1695), my ancestor, is person #27454.

Posted by admin at 3:28 pm
