The Old Database

From sokwedb
Jump to navigation Jump to search

The database named old contains a copy of the data stored in the original Gombe Chimp Database, the database which was developed prior to 2023. It can be used to query the old data, obtaining the data from the original database's table structures.

The idea behind the old database is to keep from cluttering up the sokwedb database with content that is rarely used.

Overview

The old data exists in various forms in the old database. The most-like-the-original form has only the data and table names obtained from a dump of the original Gombe Chimp Database. (The table names, columns, and data were dumped as SQL statements. It would be prudent of us to attach that file to this page when we have the final data dump we ultimately use to produce the SokweDB database's content.)

There is no feasible way to store other features of the old database in the new database, things like the old database's macros. It may be prudent to store as much as possible about the old database, especially database elements like the macros that have not moved into the old database on the SokweDB server but also things like the old db's documentation, either here on this wiki or somewhere else that is available to the SokweDB team.

The "newest" form of the old database content kept in the old database consists of copies of the SokweDB's various schemas, as converted from the old database. This is the final converted data as of the date the new system replaced the old.

Should there be questions about the data transformations performed to convert the old data and data structures to the new, queries can be written that compare the original data to the final forms of the converted data.

The Schemas/Steps Through Which The Data is Converted

The data passes through the following schemas, in the order given, as it is converted into the form 'SowkeDB expects. Some of these forms may be easier to query than others.

raw

The raw contains the data, tables, and indexes restored from the Gombe Chimp Database dump.

Because most table and column names in the old database contain both upper and lower case, and sometimes spaces or other problematic characters, when writing SQL it is best to always put the table and column names in double quotes. An example would be:

SELECT "BIOGRAPHY"."B_AnimID"
  FROM "BIOGRAPHY";

tidy

The tidy schema is exactly like the raw schema but the data types of the tables have been modified to better conform with the data. Specifically, the following changes are made to the tables:

  • All TIMESTAMP data columns that contain dates are re-typed as DATE columns. This removes the time portion of the data, which is expected to always be 00:00:00 -- midnight.
  • All TIMESTAMP data columns that contain times are re-typed as TIME WITHOUT TIME ZONE columns. This removes the date portion of the data, which is expected to always be 1899-12-30.

easy

This schema is like the tidy schema but table and column names are altered to conform to a form that no longer requires quoting in written SQL. All characters are converted to lower case and all spaces are converted to underscores.

This is the easiest schema to write SQL for, but pre-existing queries written to run in the Gombe Chimp Database can no longer be used to query the easy schema because of the change to the table and column names.

clean

This schema is like the easy schema but has had "automated" data cleanup done to the data. These are transformations that could be done when moving the data to the final form, but it is easier to track each transformation with the data in a separate schema. Otherwise all transformation must occur in the one statement that loads data into the final table, which can intermingle unrelated changes to disparate columns.

This schema contains the "final form" of the data, in a form organized like that of the old MS Access database.

The PEOPLE table

The clean schema has a PEOPLE table, in order to make it easier to query for missing/incorrect people codes.

This table is manually constructed during the conversion process. Then, the observer codes from the follow are added to the new table, with differences of character case removed -- choosing a mixed-case version when there is one.

The follow table

The observer columns of the follow table, fol_am_observer_1, fol_am_observer_2, fol_pm_observer_1, fol_pm_observer_2, have spaces removed from the start of their strings, any empty strings are then converted to NULL, then, if a mixed-case version of the person code exists that version is used.

The fol_b_animid column, the focal, is made upper-case and has trailing spaces removed.

The follow_arrival.fa_type_of_cycle column cleanup

All the follow_arrivals.fa_type_of_cycle values that were NULL were changed to MISS.

All the follow_arrivals.fa_type_of_cycle values that were NA were changed to n/a, because this is the preferred value.

All the follow_arrivals.fa_type_of_cycle values that were n/a, for females, were changed to MISS.

When there is a mis-match in the MS Access database as to whether or not the focal began or ended the follow in a nest, a mis-match between what the FOLLOW table says and what the FOLLOW_ARRIVAL table's first (or last) arrival of the focal, then if either says the focal was in a nest, change the relevant FOLLOW_ARRIVAL row to indicate the focal was in a nest. If there are multiple first(/last) FOLLOW_ARRIVAL rows recording the focal's arrival that all have the same start(/end) time then the one with the lowest(/highest) FA_Seq_num is examined and its nesting values are used and updated. See:

 Conversion Data Issues#(#26) Mismatch of start-in-nest on follow and follow_arrival
 Conversion Data Issues#(#27) Mismatch of end-in-nest on follow and follow_arrival

It is the FOLLOW_ARRIVAL table's value that the conversion code puts in SokweDB.

Notes On Data Clean Up and Other Data Transformation Steps

At every point at which data is discarded, for example when discarding time information as part of converting TIMESTAMP columns to DATE values, the data is tested to ensure that no meaningful data is discarded.

Some data originally dumped from the old database, at the start of the data conversion process, was found to have problems. The particulars of the data problems and the solutions found, whether data alteration was done, whether it was decided that some data was invalid and should be discarded or whatever the solution, is documented per-problem on the Conversion Data Issues page.

The Conversion Data Issues page should be reviewed when there is interest in the data cleanup specifics which occurred during the transition from the Gombe Chimp Database to SokweDB.

You can read the SQL which converts the data. The conversion code is the ultimate source of truth regarding data transformation and cleanup.

Again, for more detail, see:

 The Conversion Data Issues page

BIOGRAPHY.DadID and BIOGRAPHY.DadIDPrelim

The DadID column in the original data contains some values that are dad ids suffixed with "_prelim", which do not validate as existing AnimIDs. To resolve this, create the DadIDPrelim column in a BIOGRAPHY_DATA table, and make a BIOGRAPHY view that combines Dad_ID and DadIDPrelim into DadID -- adding the '_prelim' suffix as expected.

When the original DadID is the empty string or NULL then both columns are made NULL.

When the data ends with '_prelim' the '_prelim' is removed, the result is the DadID, and DadIDPrelim is made TRUE.

Otherwise, the original data value is used for DadID and DadIDPrelim is FALSE.

BIOGRAPHY.DadIDPub

When the data is the empty string and the new DadID is NULL, then this is made NULL.

When the data is the empty string and the new DadID is not NULL, then this is made Unknown.

Otherwise, the original data value is used.

Comparing Old Data to Current Data

PostgreSQL does not support queries that cross databases. Consequently, in order to write queries against current SokweDB data it is necessary to copy tables (possibly all the tables of a schema) from one database to another. (This is generally done by means of data dump and restore.) The "best" (for some value of "best", possibly taking data safety into account) way to accomplish this is probably to copy tables of old data into your personal schema, although if there is persistent need we could copy various schemas from the old database into the sokwedb database.