Mining the Historian Archives 08
Can we boil all of this background down to a reasonable project? I have hopes that we can.
I envision the following primary ways of looking at the Historian Archives:
- Browse and search our various online genealogy databases. We receive a file in GEDCOM format and import it to TNG for Web display.
- For a given Dwight number, browse a list of collections tied to that Dwight number. You can click any item in the list to get to that collection. The Dwight number is meaningless in itself; it is in effect a surrogate key. All Dwight-based links need to have suitable descriptive text.
- For a given name, browse a list of resources tied to that name. This may be tricky to display in a useful and coherent fashion. This may be a good place to separate presentation from content.
- For a given descendant line, browse all resources tied to that descendant line.
- Each display page should likely contain links to each of the above, tuned to that specific page.
- We have a table of contents page providing links to each collection.
- We provide a means (probably an XML site map) for search engines to correctly index the site(s).
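The sitemap item above can be sketched with Python's standard library. This is a minimal sketch only; the collection URLs below are hypothetical placeholders, and a real sitemap would be generated from the table of contents data.

```python
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    """Build a minimal XML sitemap (sitemaps.org schema) from a list of page URLs."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for loc in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical collection pages; the real list comes from the table of contents.
pages = [
    "https://example.org/collections/seven-volume-index",
    "https://example.org/collections/gedcom-database",
]
xml = build_sitemap(pages)
print(xml)
```

Search engines accept exactly this format, so regenerating the file whenever a collection is added keeps indexing current.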
This project may have privileged areas. I therefore intend to place this within a Role-Based Access Control framework. Fortunately I already have such a thing written, and we may place that consideration outside the scope of this project.
The implementation is to be with MySQL and CakePHP, for a specific reason: I am stepping into the Data Architect role at my company, where all implementations are with MySQL and CakePHP (a PHP framework). I therefore want this project to align with the direction in which I need to gain experience.
If we choose to use this as a basis for “members only” features, the access control will take care of that. It doesn’t matter, for design purposes, whether we choose to go with a single membership login, or generate and issue individual logins to all members.
I expect to use multiple MySQL databases (schemas). MySQL is capable of joining across multiple databases so long as they reside on the same server. I have a nightly backup strategy for my other MySQL databases, and my intention is to add these databases to that strategy. I see no need for replication, partitioning, or similar Enterprise-level considerations.
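In MySQL, a cross-database join simply qualifies each table as `database.table`. The same idea can be sketched portably with SQLite's `ATTACH`, standing in for two schemas on one server; the table and column names here are hypothetical.

```python
import sqlite3

# Two in-memory databases stand in for two MySQL schemas on the same server.
main = sqlite3.connect(":memory:")
main.execute("ATTACH DATABASE ':memory:' AS tng")  # the second "schema"

main.execute("CREATE TABLE archive_index (dwight_no TEXT, collection TEXT)")
main.execute("CREATE TABLE tng.people (dwight_no TEXT, fullname TEXT)")

main.execute("INSERT INTO archive_index VALUES ('D-1234', 'Seven-Volume Index')")
main.execute("INSERT INTO tng.people VALUES ('D-1234', 'John Dwight')")

# Join across the two databases, just as MySQL joins db1.table to db2.table.
rows = main.execute(
    "SELECT p.fullname, a.collection "
    "FROM tng.people AS p JOIN archive_index AS a ON p.dwight_no = a.dwight_no"
).fetchall()
print(rows)  # [('John Dwight', 'Seven-Volume Index')]
```

Because the join is ordinary SQL, keeping the TNG data and the archive data in separate databases costs nothing at query time.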
If I choose to store significant amounts of material online (e.g., book scans), that will need a strategy of its own, perhaps on my server or perhaps through a cloud service such as Amazon S3.
I see the data design as having two phases or areas.
The first phase is what we use for final display. We need to represent each of the above primary ways of looking at the Historian Archives. If there is a standard format for storing and displaying that information regardless of source, the user site design ought to be straightforward.
The second phase is a separate data design for each type of artifact. This is where we maintain things in an artifact-specific manner. Each of these artifact-specific areas then has some process for transferring its data into the first-phase design.
However, I hope that a detailed analysis and design will turn up enough commonality that we can store things in a common manner. I suspect the opposite may well be the case: it may be far simpler to implement collections of area-specific tables. It may be the application code, rather than the tables, that takes advantage of commonality.
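The commonality-in-code idea can be sketched as per-type adapters that map artifact-specific rows into one common display record. Every field name here is a hypothetical placeholder, not a settled design.

```python
from dataclasses import dataclass

@dataclass
class DisplayItem:
    """The common, source-independent record the user site displays."""
    collection: str
    item_ref: str
    title: str
    dwight_numbers: list

# Artifact-specific data keeps its own natural shape...
def from_seven_volume_index(row):
    return DisplayItem("Seven-Volume Index", f"{row['volume']}:{row['page']}",
                       f"Volume {row['volume']}, page {row['page']}",
                       row["dwight_numbers"])

def from_gedcom(row):
    return DisplayItem("GEDCOM", row["refn"], row["name"], [row["dwight_no"]])

# ...and a per-type adapter carries the commonality in application code.
items = [
    from_seven_volume_index({"volume": 2, "page": 117, "dwight_numbers": ["D-1234"]}),
    from_gedcom({"refn": "I42", "name": "John Dwight", "dwight_no": "D-1234"}),
]
print([i.item_ref for i in items])  # ['2:117', 'I42']
```

The tables stay area-specific; only the adapter layer needs to know how each artifact type becomes a displayable item.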
Initial Implementation Scope
I think the data design, if practicable, should take in all the territory described. We may well identify things which appear too difficult to undertake in the near future, and we should feel free to declare those things outside the project scope.
As for Phase One, I would like to see the following support:
- Implement the concept of a combination unique key. Each artifact or collection has its own unique identifier, and each item in the collection has a fixed identifier unique within that collection. We need to require that reference numbers be added to GEDCOM files so that new versions of the same file will have unchanged identifiers for specific items in the file.
- I already do GEDCOM imports, so consider this out of scope (already accomplished).
- I am experienced in mining the TNG databases for the necessary information. Our task therefore is to define what data items need to be imported, and in what manner, from the TNG databases. Once the requirements are written, this task is complete. I’ll implement those requirements as an out-of-scope item.
- Implement the Seven-Volume Index, which includes mapping page numbers to Dwight numbers.
- Implement the above primary ways of looking at this data, that is, looking at the GEDCOM/TNG databases and the Seven-Volume Index information.
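The combination unique key and the Seven-Volume Index mapping can both be sketched as one table with a two-column unique constraint. The table and column names are hypothetical, and SQLite is used here for portability; the MySQL DDL would be essentially identical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE items (
        collection_id TEXT NOT NULL,   -- unique identifier of the collection
        item_ref      TEXT NOT NULL,   -- fixed identifier within that collection
        dwight_no     TEXT,            -- e.g. Seven-Volume Index page -> Dwight number
        UNIQUE (collection_id, item_ref)
    )
""")

db.execute("INSERT INTO items VALUES ('7VI', 'v2-p117', 'D-1234')")
# The same item_ref is fine in a different collection:
db.execute("INSERT INTO items VALUES ('GEDCOM', 'v2-p117', NULL)")

try:
    # A duplicate within one collection violates the combination key.
    db.execute("INSERT INTO items VALUES ('7VI', 'v2-p117', 'D-9999')")
except sqlite3.IntegrityError:
    print("duplicate (collection_id, item_ref) rejected")
```

The constraint is what makes GEDCOM reference numbers matter: as long as an item keeps its reference number across file versions, its combination key, and therefore every link to it, stays stable.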
Phase Two implementation should be a simple but useful step:
- Implement adding any particular artifact to the database. This includes Dwight classification(s), name(s), descendant line(s). Provide for entering repository information.
- Implement incorporating this artifact or collection into our primary ways of looking at this data.
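Since one artifact may carry several Dwight classifications, names, and descendant lines, the registration step suggests an artifacts table plus link tables. This is a minimal sketch with hypothetical names, again in SQLite for portability.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE artifacts (
        id         INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        repository TEXT              -- where the physical item is held
    );
    -- One artifact may have many Dwight numbers, names, and descendant lines.
    CREATE TABLE artifact_dwight (artifact_id INTEGER, dwight_no TEXT);
    CREATE TABLE artifact_names  (artifact_id INTEGER, person_name TEXT);
    CREATE TABLE artifact_lines  (artifact_id INTEGER, descendant_line TEXT);
""")

cur = db.execute("INSERT INTO artifacts (title, repository) VALUES (?, ?)",
                 ("Family Bible", "Example Repository"))
aid = cur.lastrowid
db.execute("INSERT INTO artifact_dwight VALUES (?, ?)", (aid, "D-1234"))
db.execute("INSERT INTO artifact_names VALUES (?, ?)", (aid, "John Dwight"))

# "Browse by Dwight number" is then a join from the link table to artifacts.
rows = db.execute("""
    SELECT a.title FROM artifacts AS a
    JOIN artifact_dwight AS d ON d.artifact_id = a.id
    WHERE d.dwight_no = 'D-1234'
""").fetchall()
print(rows)  # [('Family Bible',)]
```

The same join pattern serves the name and descendant-line views, which is one place the hoped-for commonality may actually appear.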
Just the above may prove to be an unreasonably massive project. As we consider this project, we need to identify the smallest and simplest implementation that produces a useful result. My gut feeling is that the answer is the Seven-Volume Index: by itself it is a useful resource not presently available to the public, and potentially a powerful sales tool.