Data Design Proposal

This is a consolidation of the “Second Design” posts into a single document. It’s the same text, just put together into a single page for convenience. If I do any editing, it will be to THIS document.

There is a second area which might be a good place for formal Data Design. This area is not time critical, and is separate from the paper document solutions being discussed elsewhere. This design has the aim of revenue generation.

I will describe the specific artifacts which currently exist, and work towards the end-user experience. Additional background material is at: http://otscripts.com/category/data-design/.

The Langbehn Database

This section is detailed and boring. Please bear with me because this is the entire basis of our classification system, and therefore the central point of any possible Data Design.

I receive copies of The Langbehn Database periodically on CD. It is a complete transcription of Dwight’s History of the Strong Family, containing about 29,000 names. John Langbehn, over the decades, has been transcribing the books into Family Tree Maker. This means that it’s possible to browse through family relationships and so on.

I have software for importing and displaying the database online via PHP and MySQL. The software is quite advanced, well supported, and I am quite comfortable with the underlying database table structure.

I receive new copies of the Langbehn Database as John Langbehn continues his work in progress. Online, the new version completely replaces the old version.

Each person in Dwight’s History of the Strong Family, and therefore every person in The Langbehn Database, has a Reference Number, running from 1 to about 29,000. We call this the Dwight Number after the author of the books.

This Dwight Number is the basis of our classification system. Every person in Dwight’s History of the Strong Family is a descendant, naturally, of person #1 in the book, Elder John Strong. The book was published in 1871 and therefore, of course, contains no persons born more recently than 1871.

Our genealogy focuses on all persons who descend from Elder John Strong. Everyone, therefore, can be classified as descending from someone in Dwight’s History.

In my own case, the person furthest “down the tree” is Joseph Barnard (1681-1736), with Dwight reference #27454. All descendants of that Joseph Barnard, including myself, get classified with that number, #27454.

From the end-user perspective, the Langbehn Database is a way to search and explore Dwight’s History of the Strong Family, with everything structured and linked along family lines. This is how people can discover that they do indeed connect to the Strong family.

From the data design perspective, this is how we can fit all other data into the right place. Suppose our end user has clicked around and found Joseph Barnard, #27454. The plan, then, is that the Web page could now provide a list of all other information we have related to Joseph Barnard’s family line (descendants).

This has a revenue implication in that we’ll be showing resources available for purchase on CD.

Reverse Classification

Consider a master names list, or various documents. Each of them contains a link back to the specific person in the Langbehn Database. You find a person you seek via the master names list. This gets you to a list of all resources we have related to that person, and links you to the relevant person or ancestor in the Langbehn Database for further research and exploration.

We’ll now look at each artifact and see what we should do with it.

The Langbehn Database, Again

Let’s look at The Langbehn Database from a data design perspective.

The Langbehn Database is represented in a collection of tables defined by TNG (The Next Generation of Genealogy Sitebuilding). The TNG product site is here: http://lythgoes.net/genealogy/software.php. The software itself is well supported with an extremely active mailing list.

As part of our data design, I intend to add a link to each TNG-generated page showing our other resources related to that person. The information could also be part of a tooltip popup. The implementation details are outside this data design; just be assured that I’ve done this sort of thing with TNG before.

What we DO need from this data design is provision for an SQL query that returns that information, given the relevant Dwight reference number.

We also need to design the reverse direction. For a given Dwight reference/classification number, we need to be able to provide a Web link to the page in The Langbehn Database (as displayed by TNG).

Many people have the same name. Our published genealogies contain a LOT of different persons all named John Strong! We therefore need to extract from the Langbehn database distinguishing information such as:

  • Birth/Marriage/Death dates where known
  • Alternate names (nicknames) where known
  • Spouse name(s) where known
  • Descendant line. We need to indirectly derive this; see below
  • The Web link (URL) to this person’s page in the TNG display area

Note that as Langbehn Database updates are received, this information gets replaced and re-imported.

The import process is outside of this data design. This data design needs to provide for storing (and deriving) the above information, and allow for replacement of the information by re-import.
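As a rough sketch of "store the distinguishing information, and replace it on re-import," the following uses Python's built-in sqlite3 module as a stand-in for MySQL. The table name person_summary, its columns, the reimport and tng_link helpers, and the example.org URLs are all assumptions made for illustration; they are not TNG's actual tables or URL format. Spouse names, alternate names, and marriage dates are left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute("""
    CREATE TABLE person_summary (          -- hypothetical table name
        dwight_number INTEGER PRIMARY KEY,
        full_name     TEXT NOT NULL,
        birth_year    INTEGER,             -- NULL where not known
        death_year    INTEGER,             -- NULL where not known
        tng_url       TEXT NOT NULL        -- link to the TNG display page
    )
""")

def reimport(rows):
    """Replace the whole table with a fresh extract, in one transaction."""
    with conn:  # commits on success, rolls back if an insert fails
        conn.execute("DELETE FROM person_summary")
        conn.executemany(
            "INSERT INTO person_summary VALUES (?, ?, ?, ?, ?)", rows
        )

def tng_link(dwight_number):
    """Reverse direction: Dwight number -> Web link, or None if unknown."""
    row = conn.execute(
        "SELECT tng_url FROM person_summary WHERE dwight_number = ?",
        (dwight_number,),
    ).fetchone()
    return row[0] if row else None

# Placeholder URLs; the real links would come from the TNG import.
ELDER = (1, "Elder John Strong", None, None,
         "https://example.org/tng/person/1")
JOSEPH = (27454, "Joseph Barnard", 1681, 1736,
          "https://example.org/tng/person/27454")

reimport([ELDER])          # an earlier copy of the database
reimport([ELDER, JOSEPH])  # a newer copy completely replaces it
```

Because the delete and the inserts share one transaction, a failed import leaves the previous copy in place rather than an empty table.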

The descendant line means which child of Elder John Strong the person descends from. That’s an important classification from a revenue standpoint, because we sell material according to descendant line.

By scanning Dwight’s History, I can make a list of which number ranges attach to which descendant line. There is an accidental overlap of about a hundred numbers, due to an error on Dwight’s part. We likely don’t have additional information for any persons in that overlapping range, so we can arbitrarily assign that range to one of those two lines.

We can therefore construct a table which maps descendant line to number ranges. It’s worthwhile having a separate table of Descendant Line definitions. This can include abbreviations and descriptive text. It should not contain the Dwight ranges because there are multiple ranges per descendant line.

For example, the Sarah line includes #27-#28 (Sarah Strong and husband Joseph Barnard), and #27449-#27542 (their descendants). My own classification number, #27454, fits in that range.

Thus one requirement is: Given a Dwight classification number, determine the Descendant Line.
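One possible shape for this requirement is sketched below, again with Python's sqlite3 standing in for MySQL. The table names descendant_line and dwight_range, the column names, and the abbreviation 'SARAH' are assumptions for illustration. The only data loaded are the two Sarah-line ranges quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript("""
    -- One row per descendant line (hypothetical names throughout).
    CREATE TABLE descendant_line (
        line_id      INTEGER PRIMARY KEY,
        abbreviation TEXT NOT NULL,
        description  TEXT NOT NULL
    );
    -- Several ranges may belong to the same descendant line.
    CREATE TABLE dwight_range (
        line_id      INTEGER NOT NULL REFERENCES descendant_line(line_id),
        first_number INTEGER NOT NULL,
        last_number  INTEGER NOT NULL,
        CHECK (first_number <= last_number)
    );
""")
conn.execute(
    "INSERT INTO descendant_line VALUES (1, 'SARAH', 'Sarah (Strong) Barnard line')"
)
conn.executemany(
    "INSERT INTO dwight_range VALUES (?, ?, ?)",
    [(1, 27, 28), (1, 27449, 27542)],
)

def descendant_line(dwight_number):
    """Return the line abbreviation for a Dwight number, or None."""
    row = conn.execute(
        """
        SELECT dl.abbreviation
        FROM dwight_range AS dr
        JOIN descendant_line AS dl ON dl.line_id = dr.line_id
        WHERE ? BETWEEN dr.first_number AND dr.last_number
        ORDER BY dr.first_number
        LIMIT 1
        """,
        (dwight_number,),
    ).fetchone()
    return row[0] if row else None

print(descendant_line(27454))
```

Assigning the overlapping range to just one line when the ranges are loaded keeps the lookup unambiguous; the ORDER BY and LIMIT are only a safeguard in case an overlap slips through.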

The Printed Volumes

This project is not about the already-published books. However, there is a significant revenue opportunity which can be supported by a simple piece of the data design.

Our artifact is the Table of Contents of each book. Each Table of Contents is 1-3 pages long and easily transcribed into a spreadsheet (for example). Here is an excerpt from Volume Three:

THE JOHN STRONG, JR. LINE (below is Dwight #, the person, and page number in this book):

  • #3, John Strong, Jr., Page 1
  • #39, Hannah Strong, Page 3
  • #41, John Strong, Page 12

THE ABIGAIL STRONG CHAUNCEY LINE

  • #15, Abigail Strong, Page 199
  • #23954, Abigail Brewer, Page 199
  • #24042, Nathaniel Chauncey, III, Page 215

and so on.

In other words, the book is divided into sections. Each section has descendants of ONE person in the Langbehn Database, with ONE Dwight number applying to that whole section of the book.

This means that we can create a database (speaking loosely) which tells us, for each Dwight number, whether we have additional information on that family line.

To continue with myself as the example:

  1. By exploring the Langbehn Database, I discover my progenitor Joseph Barnard, #27454.
  2. We now provide the information, attached to the above page, that we have published additional information on the descendants of Joseph Barnard #27454 in Updates Volume Three, pages 445-477. (We derive this fact from the above-mentioned table of contents for the book.)
  3. I purchase Updates Volume Three from the SFAA, confidently knowing that the volume contains 33 pages of additional genealogy.

We potentially have a strong sales boost, based merely on mining our tables of contents. This works because of our classification system based on Dwight numbers.
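The lookup in step 2 could be sketched roughly as follows, with Python's sqlite3 standing in for MySQL. The table name volume_section, its columns, and the published_sections helper are assumptions for illustration. The tables of contents list only each section's starting page, so the last_page column is assumed to be filled in during transcription.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute("""
    CREATE TABLE volume_section (      -- hypothetical table name
        volume_title  TEXT    NOT NULL,
        dwight_number INTEGER NOT NULL,
        person_name   TEXT    NOT NULL,
        first_page    INTEGER NOT NULL,
        last_page     INTEGER NOT NULL -- assumed to be added when transcribing
    )
""")
conn.execute(
    "INSERT INTO volume_section VALUES (?, ?, ?, ?, ?)",
    ("Updates Volume Three", 27454, "Joseph Barnard", 445, 477),
)

def published_sections(dwight_number):
    """List the printed sections we have for one Dwight number."""
    rows = conn.execute(
        """
        SELECT volume_title, first_page, last_page
        FROM volume_section
        WHERE dwight_number = ?
        ORDER BY volume_title, first_page
        """,
        (dwight_number,),
    ).fetchall()
    return [
        {
            "volume": title,
            "pages": f"{first}-{last}",
            "page_count": last - first + 1,  # inclusive page count
        }
        for title, first, last in rows
    ]

for section in published_sections(27454):
    print(section)
```

For #27454 this returns a single entry for Updates Volume Three, pages 445-477, with a page count of 33, which matches the figure in step 3.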

The Manuscripts in Progress

The SFAA Historians worked on updates to our published books, from 1990-2004. I have most of this work in the form of Microsoft Word documents. These documents are organized the same way as the published books. That is, each separate Word document covers descendants of a specific person. Each can be tied to one specific Dwight number and person.

Because these are computer files, I believe that I can mine each document for names. I believe I can catch the name at the beginning of each formatted paragraph, and capture an excerpt such as the paragraph itself.

Since each of these documents was typed by hand, there is a lot of “fuzzy logic” involved in successfully capturing the names. Thus this information would be added to the database slowly, over time.

However, you can probably see how this now fits into the scheme of things. As with the table of contents of the published book, each document can be registered with a descendant line, Dwight number and name, and file name/location.

When I mine the document for names, these can become part of a searchable master names list. I expect to capture an excerpt of surrounding text, so that the name has some sort of context useful to the end user / researcher.
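A first pass at the name mining might look something like the sketch below. It assumes the Word documents have already been converted to plain text, that paragraphs are separated by blank lines, and that a primary name is a run of capitalized words followed by a comma. All three are assumptions, as are the mine_names function and the sample text, which is invented wording rather than a quotation from any manuscript. Hand-typed documents will certainly need looser rules than this.

```python
import re

# Assumed pattern: two to five capitalized words at the very start of a
# paragraph, ended by a comma. Real manuscripts will need more variants.
NAME_AT_START = re.compile(
    r"^\s*((?:[A-Z][A-Za-z.'-]*\s+){1,4}[A-Z][A-Za-z.'-]*)\s*,"
)

def mine_names(text, excerpt_chars=80):
    """Return one {name, excerpt} entry per paragraph that opens with a name."""
    entries = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = " ".join(paragraph.split())  # collapse line breaks
        match = NAME_AT_START.match(paragraph)
        if match:
            entries.append({
                "name": match.group(1),
                "excerpt": paragraph[:excerpt_chars],
            })
    return entries

# Invented sample text, for illustration only.
SAMPLE = (
    "Joseph Barnard, born 1681, died 1736.\n\n"
    "See the preceding section for sources.\n\n"
    "Sarah Strong, married Joseph Barnard."
)

for entry in mine_names(SAMPLE):
    print(entry["name"], "|", entry["excerpt"])
```

Paragraphs that do not match are simply skipped, so they can be collected separately and reviewed by hand as the rules improve over time.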

This provides us with two revenue opportunities.

  • First, the names list with text excerpts may allow a person to realize that they have found the right place. They can figure out exactly which volumes of genealogy they’d like to purchase from the SFAA.
  • Second, we can collect and publish (on CD) all of the Manuscripts in Progress for a given descendant line. Thanks to this data design I might be able to generate a coherent Table of Contents for the CD, giving Dwight Number, Name, and file name/link, just like the printed books’ tables of contents.

The ideal solution would be to turn these Manuscripts in Progress into coherent single genealogies. A quarter century later, we haven’t yet managed this. This proposal, at least, gets the material out to the persons most interested in that material.

The Seven-Volume Index

I do have a (printed on paper) index of all names in all seven printed volumes, including married names. It may be that, once the index is commercially scanned, I can reprocess the information to create a usable names index. Each entry contains descendant line, volume and page number where the name appears. From the page number, in most cases, we can infer the relevant Dwight number.

So, if the digitization ever comes to pass, this presents a revenue opportunity. We can publish a complete index of everything in our seven published volumes.

We currently have scans of this index, one PDF file per letter of the alphabet. It might actually be worthwhile to put those online and make them public. People know how to use an index, and might find what information they can purchase as a book.

If we create a proper database from the Seven-Volume Index (which is far in the future), this would also allow us to show all names for a given page. Our researcher finds the right name, but doesn’t know if it’s the right person of that name. Our researcher could then look at the list of ALL names on that page. If he or she recognizes several of those names, our researcher will know it’s the right page, and be inclined to purchase the book.

The Walther Barnard Manuscript

This set of Microsoft Word documents would print out to several thousand pages. It covers the Sarah (Strong) Barnard line. It’s so large that it gets its own CD.

As with the other Manuscripts in Progress, I believe it’s feasible to mine the documents for primary names, with excerpt text to give context and distinguishing information.

I don’t think it requires anything separate in the way of data design. It will require separate mining software.

Other GEDCOM Database Files

If people submit genealogical information to us, they may be able to submit it in the form of GEDCOM files. All personal genealogy software supports data export to GEDCOM format.

Our TNG software can import any number of family trees (GEDCOM files), including attached images and documents if handled properly.

TNG’s advanced search capability can do searches covering a specific tree, or return results for all trees combined. Thus, rather than creating a huge “master database,” I believe it makes far more sense to maintain distinct trees as submitted. There are very strong reasons for not attempting to combine or merge information (though they are outside the scope of this document). TNG allows us to keep things intact as submitted, but do combined searches over all trees at once.

We now have the ability to accept member-submitted information. This is a huge business point in our favor. TNG software automatically redacts information on living persons.

Each user-submitted GEDCOM database will normally relate to a SINGLE person in Dwight’s History. Thus every tree as a whole can be classified with one (or a small list of) Dwight numbers, and with one (or a small list of) descendant line(s). Thanks to intermarriages, many Strong genealogies have ancestors from multiple descendant lines.

From a data design standpoint, then, we want to be able to record:

  • A link to the collection (the GEDCOM file as displayed by TNG)
  • A description
  • Submitter information
  • Repository information. This is different from the submitter information. The submitter sent the file and, for example, I received it. The repository information describes how and where I am keeping the file
  • Dwight classification number or numbers
  • Descendant line or lines
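The list above could map onto tables along these lines. This is only a sketch, using Python's sqlite3 as a stand-in for MySQL; the table names collection, collection_dwight, and collection_line, the column names, and the sample row (including its example.org link) are all made up for illustration. The point is that Dwight numbers and descendant lines each get their own linking table, since one tree may carry several of either.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript("""
    CREATE TABLE collection (              -- one row per submitted GEDCOM tree
        collection_id INTEGER PRIMARY KEY,
        tng_link      TEXT NOT NULL,       -- the tree as displayed by TNG
        description   TEXT NOT NULL,
        submitter     TEXT NOT NULL,       -- who sent the file
        repository    TEXT NOT NULL        -- how and where the file is kept
    );
    CREATE TABLE collection_dwight (       -- one or more Dwight numbers per tree
        collection_id INTEGER NOT NULL REFERENCES collection(collection_id),
        dwight_number INTEGER NOT NULL,
        PRIMARY KEY (collection_id, dwight_number)
    );
    CREATE TABLE collection_line (         -- one or more descendant lines per tree
        collection_id   INTEGER NOT NULL REFERENCES collection(collection_id),
        descendant_line TEXT NOT NULL,
        PRIMARY KEY (collection_id, descendant_line)
    );
""")

# A made-up tree that connects to two lines through an intermarriage.
conn.execute(
    "INSERT INTO collection VALUES (1, ?, ?, ?, ?)",
    ("https://example.org/tng/tree/sample", "Sample member-submitted tree",
     "Sample Submitter", "Sample archive location"),
)
conn.executemany("INSERT INTO collection_dwight VALUES (1, ?)", [(27454,), (41,)])
conn.executemany(
    "INSERT INTO collection_line VALUES (1, ?)",
    [("Sarah",), ("John Strong, Jr.",)],
)

def collections_for(dwight_number):
    """Descriptions of every submitted tree classified under a Dwight number."""
    rows = conn.execute(
        """
        SELECT c.description
        FROM collection AS c
        JOIN collection_dwight AS cd ON cd.collection_id = c.collection_id
        WHERE cd.dwight_number = ?
        ORDER BY c.description
        """,
        (dwight_number,),
    ).fetchall()
    return [description for (description,) in rows]

print(collections_for(27454))
```

The descendant lines could instead be derived from the Dwight numbers through the range table; storing them directly is just the simpler of the two options for a sketch.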

The User Experience

Our data design centers around these features:

  • How/where does this item fit into the Strong genealogy? Internally we discern this with the Dwight classification number.
  • From this item, we can click to the Langbehn Database, which allows us to explore the early Strong genealogy and family relationships.
  • From a given point in the Langbehn Database (or anywhere else, for that matter) we can bring up links to all related resources we have online with the SFAA.
  • We can create a master names list as the alternate way of exploring the above.

I don’t think we need to design in any advanced searching capability. That can come later as experience warrants. The Langbehn Database does have advanced search capability.

Multi-Part Data Design

The first part of the data design is the natural representation of the artifacts which I’ve described. We can base that design on samples of each existing artifact.

The second part of the data design supports the user experience. In other words we need to define the end-user Web site:

  • What information should be on each type of Web page?
  • How should things be interlinked?

We don’t need to design implementation details such as pagination, site layout, or navigation. The implementation is dictated to be in MySQL and CakePHP.

CakePHP allows for extremely rapid development if the data design supports the Web presentation. That is, if we design the properly-normalized tables around how we want to present the information.

These tables are read-only. We only update with data imports.

Of course how we present our information may be quite different from how we naturally represent our artifacts in a database. That’s why I suggest a multi-part data design. We have one design to support the artifacts and data import, and we have the other design supporting the User Experience.

We would then have some process which imports/replaces data from the “artifact” area to the “presentation” area. I see that as an implementation detail which need not be addressed in the data design.

There is potentially a third part of the data design. Some of these artifact data imports may be tricky, and will be handled over time. I will probably create ad-hoc tables supporting those imports. From those ad-hoc tables I could then transfer the data to the first part of this data design, which “properly” represents that artifact.

This third part need not be part of the data design. In my mind, it’s an implementation detail to be determined later.