Mining the Historian Archives 05

Contributed GEDCOM Files

For five years now I have been trying to solve one problem. The approach proposed for this project may well provide the solution.

The basic problem is this: how do I fold all of this disparate information together in a coherent way so that we can publish it for our membership? So far it has proven to be an impossible task. We have collected genealogical information that has not seen the light of day for more than a quarter century. Many of the people who submitted information never lived to see their submissions published or otherwise made available!

To be sure, our Association has been actively working with the information, curating and combining it. But it’s never been in a form to be published.


To Merge or Not to Merge

My own research is in the form of two separate databases. The first one got so messed up from ill-advised data merges that I started over from scratch. For the last year and a half I have been manually typing information from the first database into the second.

Thanks to this proposed project, that re-typing is no longer necessary.

Remember that we classify everyone according to Dwight number, and the Dwight numbers stop around the year 1870. Anyone submitting new or updated information to us now, 144 years later, is likely submitting information related to a single Dwight classification.

My tens of thousands of names, for example, all relate to descendants of Joseph Barnard Jr., Dwight #27454. Therefore each of my two databases as a whole can be classified with the single Dwight number (#27454).

What does this mean for our project?

I can place each of the databases up as-is, in a TNG database. Our project can record each of the two databases, as a whole, as items related to Dwight #27454. Anyone looking for all information related to Dwight #27454 therefore sees links to each of these two databases, each with a suitable description.

We can meanwhile scan the TNG database, mining each name in the database, and attach it to our master names list. Each name has its own web page and we can therefore produce a clickable link.

Many people have the same name, and therefore our master reference lists need to include whatever distinguishing information is available, such as birth/death dates, descendant line, spouse name, etc.

Unique Identifier

We can, and should, expect to receive updated versions of the same database. It would then make sense to REPLACE the existing TNG database with the new one. This becomes a problem for our data-mining operation because genealogical relationships may have changed (e.g., someone was attached to the wrong parents), sequences could have been renumbered, and so on.

As I import data to TNG, I can assign sequence numbers during the import. This does not help, because when I receive an updated GEDCOM file, it won’t have those assigned sequence numbers.

The sequence numbers need to be assigned at the source. Most genealogy programs have some sort of “reference number” generator. Ideally, then, the owner of the original data source has his or her program generate reference numbers for all persons in the database. With each export to a GEDCOM file, the same person then retains the same reference number.
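As a rough illustration of why source-assigned reference numbers help, here is a minimal sketch that reads them out of a GEDCOM export. It assumes the exporting program writes each person's reference number as a level-1 REFN line inside the INDI record, which is the tag GEDCOM 5.5 provides for user reference numbers; whether a given program actually does this needs to be checked. The function name and the sample data are made up for the example.

```python
def read_reference_numbers(gedcom_text):
    """Return {REFN value: NAME value} for every INDI record that has a REFN."""
    records = []
    current = None  # the INDI record currently being read, if any
    for raw in gedcom_text.splitlines():
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue  # blank or malformed line
        if parts[0] == "0":
            # A level-0 line starts a new record; only INDI records are people.
            current = None
            if len(parts) == 3 and parts[2] == "INDI":
                current = {"NAME": None, "REFN": None}
                records.append(current)
        elif parts[0] == "1" and current is not None and len(parts) == 3:
            tag, value = parts[1], parts[2]
            if tag in current and current[tag] is None:
                current[tag] = value  # keep the first NAME and first REFN
    return {r["REFN"]: r["NAME"] for r in records if r["REFN"] is not None}


# Two made-up exports of the same file. The program renumbered its internal
# record ids (@I1@ became @I7@), but the REFN value stayed the same.
FIRST_EXPORT = """0 HEAD
0 @I1@ INDI
1 NAME Ann /Example/
1 REFN 1001
0 @I2@ INDI
1 NAME Bob /Sample/
0 TRLR"""

SECOND_EXPORT = """0 HEAD
0 @I7@ INDI
1 REFN 1001
1 NAME Ann /Example/
2 GIVN Ann
0 TRLR"""

first = read_reference_numbers(FIRST_EXPORT)
second = read_reference_numbers(SECOND_EXPORT)
```

In this sketch a person without a REFN line (the second person in the first export) is simply left out, which is one way to spot records that still need a reference number assigned at the source.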

Each data source, therefore, also needs its own unique identifier. (For completeness, we record both a file identifier and version information, since we may receive updated versions of the same data source.)

This means that each item can be uniquely identified by the data source identifier together with its reference number within that data source. This “combined natural key” then becomes our reference (foreign key) for this project.
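One possible way to picture the combined natural key is sketched below. The class and function names (PersonKey, index_import, compare_versions) and the source identifier "barnard-2" are hypothetical, chosen only for the example. The version of the file is deliberately not part of the key, so a replaced import still resolves to the same keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonKey:
    """Combined natural key: which data source, and which person within it."""
    source_id: str         # identifier we assign to the contributed file
    reference_number: str  # reference number assigned by the owner's program


def index_import(source_id, names_by_reference_number):
    """Build {PersonKey: name} for one import of one data source."""
    return {
        PersonKey(source_id, refn): name
        for refn, name in names_by_reference_number.items()
    }


def compare_versions(old_index, new_index):
    """Report which reference numbers were added, removed, or kept."""
    old_keys, new_keys = set(old_index), set(new_index)
    return {
        "added": sorted(k.reference_number for k in new_keys - old_keys),
        "removed": sorted(k.reference_number for k in old_keys - new_keys),
        "kept": sorted(k.reference_number for k in old_keys & new_keys),
    }


# Made-up data: version 2 of the same file drops one person and adds another.
version_1 = index_import("barnard-2", {"1001": "Ann Example", "1002": "Bob Sample"})
version_2 = index_import("barnard-2", {"1001": "Ann Example", "1003": "Cy Sample"})
changes = compare_versions(version_1, version_2)
```

Because the key is frozen and hashable, it can serve directly as a dictionary key or as the stored reference from the master name list back to a record in the replaced import.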

Data Source: GEDCOM File

We can now define a business process for handling any contributed GEDCOM file. I am assuming that we can either assign a Dwight number to the file as a whole, or a small collection of Dwight numbers (bearing in mind that I personally have three Dwight numbers), or can split the file into parts which can each be assigned a single Dwight number.

Most genealogy programs are capable of exporting part of a file, i.e., splitting the file into descendant lines. We can, if necessary, use this process to divide up a file so it can be classified by Dwight number.

Some data, of course, can’t be related to a specific Dwight number. This could be because we simply don’t yet know the connection to Elder John Strong, or because we explicitly know the data do NOT connect to Elder John Strong’s descendants. Another example is a database we have of descendants of Elder John Strong’s uncle.

We therefore need a non-Dwight system of classification. It doesn’t matter what the system is, as long as it is defined and can work alongside the Dwight system.

Note that any data, or data set, can have multiple classifications. That’s the nature of genealogy. I personally have three Dwight classifications, and each is as biologically valid as the others. Our data design must take this into account.
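A data design that allows multiple classifications, Dwight or otherwise, might look something like the sketch below. This is only an assumption about how it could be modeled; the names Classification and Catalog, the scheme labels "dwight" and "other", and the collection identifiers are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Classification:
    scheme: str  # "dwight", or the name of some other defined scheme
    code: str    # e.g. the Dwight number, kept as text


class Catalog:
    """Many-to-many link between collections and classifications."""

    def __init__(self):
        self._collections = defaultdict(set)      # classification -> collections
        self._classifications = defaultdict(set)  # collection -> classifications

    def classify(self, collection_id, classification):
        self._collections[classification].add(collection_id)
        self._classifications[collection_id].add(classification)

    def collections_for(self, classification):
        """All collections recorded under one classification."""
        return sorted(self._collections.get(classification, ()))

    def classifications_of(self, collection_id):
        """All classifications recorded for one collection."""
        return sorted(self._classifications.get(collection_id, ()))


catalog = Catalog()
dwight_27454 = Classification("dwight", "27454")
uncle_line = Classification("other", "uncle-line")  # placeholder non-Dwight code

catalog.classify("barnard-1", dwight_27454)
catalog.classify("barnard-2", dwight_27454)
catalog.classify("uncle-db", uncle_line)
catalog.classify("barnard-2", uncle_line)  # one collection, two classifications
```

Asking the catalog for everything under Dwight #27454 returns collections, not persons, which matches the intent of listing what resources are available for a given classification.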

Our business process includes these points:

  • The GEDCOM file can be imported to TNG and displayed as Web pages.
  • Updated versions of the file can replace the current TNG import.
  • We can classify the file as a whole by Dwight number or numbers. The file as a whole (i.e., a link to its root web page with description) becomes a resource within this project. This appears on the list of all resources available for a specific Dwight number.
  • Each individual name becomes part of our master name list. Each such name has a unique link within the TNG Web pages. We make the master name list more useful by including distinguishing information such as birth/death year, spouse, and descendant line. Each individual name has a Dwight classification, i.e., the classification of the GEDCOM file as a whole. It would not be particularly useful to display the Dwight number itself, since a number such as #27454 conveys nothing to most readers. However, it would be useful to display a LINK to all resources with that Dwight number.

Note that 35,000 names could have the same Dwight number. It would therefore be pointless to list all PERSONS with that Dwight number. Our intent is to display all COLLECTIONS related to that Dwight number. If we have two dozen COLLECTIONS related to that Dwight number, we’d surely like to see that list of what’s available.

Also note that if the same person is in two dozen different COLLECTIONS (e.g., submitted GEDCOM files), the by-name list will have two dozen listings for that name, each listing pointing to the record in a different collection. It’s important for each of those two dozen listings to include the distinguishing information (birth, death, spouse, descendant line, etc.) so that it’s clear at a glance if we’re looking at the same or different persons.
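To make the by-name list concrete, here is a small sketch of how one listing per collection could be rendered with its distinguishing information. The NameEntry fields, the label format, and all of the sample people and collection names are invented for illustration; the real list would use whatever fields the contributed data actually supply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NameEntry:
    name: str
    collection_id: str     # which contributed collection this record is in
    reference_number: str  # the person's reference number in that collection
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    spouse: Optional[str] = None

    def label(self):
        """One line of the by-name list, with whatever details are known."""
        birth = self.birth_year if self.birth_year is not None else "?"
        death = self.death_year if self.death_year is not None else "?"
        parts = [f"{self.name} ({birth}-{death})"]
        if self.spouse:
            parts.append(f"spouse: {self.spouse}")
        parts.append(f"in {self.collection_id}")
        return ", ".join(parts)


def listings_for(entries, name):
    """All listings for one name, one per collection record, oldest first."""
    matches = [e for e in entries if e.name == name]
    matches.sort(key=lambda e: (e.birth_year or 0, e.collection_id))
    return [e.label() for e in matches]


# Made-up entries: the same name appears in three collections.
entries = [
    NameEntry("Ann Example", "collection-b", "77", 1801, 1870, "Bob Sample"),
    NameEntry("Ann Example", "collection-a", "1001", 1801, 1870, "Bob Sample"),
    NameEntry("Ann Example", "collection-c", "5", 1845, None),
    NameEntry("Cy Sample", "collection-a", "1002", 1820, 1899),
]
ann_listings = listings_for(entries, "Ann Example")
```

With the details shown side by side, the first two listings read as the same person recorded in two collections, while the third is evidently a different person with the same name.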