Conversion Data Issues

From sokwedb
Jump to navigation Jump to search

This page lists all the problems with the data that were encountered during the data conversion process, and how the issue was resolved.

The problems are numbered, in the order in which they were encountered during the conversion.

Given a choice, earlier problems should be solved before later problems. This allows the later steps in the conversion to receive "correct" data, which help eliminate spurious problems, and aids the discovery of problems hidden by bad data.

Unsolved problems are marked with an *.

(#1) FOLLOW_MAP_TIME duplicate keys

Problem

The data dump says that the FOLLOW_MAP_TIME table has a primary key consisting of, in order, the columns: FMT_FOL_date, FMT_FOL_B_focal_AnimID, FMT_time. But these columns contain duplicate values.

Bad Data

The duplicate values can be listed (from the raw schema) with:

This query no longer reports results because the data was changed in the original MS Access data.

SELECT *
  FROM "FOLLOW_MAP_TIME"
    JOIN (SELECT "FMT_FOL_date" AS the_date
               , "FMT_FOL_B_focal_AnimID" AS the_animid
               , "FMT_time" AS the_time
            FROM "FOLLOW_MAP_TIME"
            GROUP BY "FMT_FOL_date", "FMT_FOL_B_focal_AnimID", "FMT_time"
            HAVING count(*) > 1
         ) AS fmt
      ON ("FOLLOW_MAP_TIME"."FMT_FOL_date" = fmt.the_date
          AND "FOLLOW_MAP_TIME"."FMT_FOL_B_focal_AnimID" = fmt.the_animid
          AND "FOLLOW_MAP_TIME"."FMT_time" = fmt.the_time);

Solution

Fixed in MS Access 11/17/23 by ICG. Checked all against Tikis.

(#2) SUBADULT_ARRIVALS_LOG has a textual SA_first_tiki_date column

Problem

The SUBADULT_ARRIVALS_LOG table has a column that is supposed to contain a date, but instead contains the string "TEXTY".

Likely, the entire row is bad. The row contains:

       SA_B_AnimID | SA_first_tiki_date | SA_notes
       -------------+--------------------+----------
	TEXTY       | TEXTY              | TEXTY

Bad Data

In the raw schema run:

This query no longer reports results because the data was changed in the original MS Access data.

select * from  "SUBADULT_ARRIVALS_LOG" where "SA_first_tiki_date" = 'TEXTY';

Solution

Fixed in MS Access, 11/16/2023, ICG. Deleted row

(#3) BRECORD_NOTES contains rows where BREC_FOL_date has values that are not just a date

Problem

There are 4 rows in BRECORD_NOTES where the BREC_FOL_date column, supposedly a date, contains time values that are not '00:00:00'.

Bad Data

In the raw schema run:

This query no longer reports results because the data was changed in the original MS Access data.

select * from "BRECORD_NOTES" where "BREC_FOL_date"::TIME <> '00:00:00';

Solution

Fixed in MS Access, 11/16/2023, ICG Deleted time stamp

(#4) MATING_EVENT contains rows where M_FOL_date has values that are not just a date

Problem

The MATING_EVENT table contains 1 row where the time portion of M_FOL_date is not '00:00:00'.

Bad Data

In the raw schema run:

This query no longer reports results because the data was changed in the original MS Access data.

select * from "MATING_EVENT" where "M_FOL_date"::TIME WITHOUT TIME ZONE <> '00:00:00';

Solution

Fixed in MS Access, 11/16/2023, ICG Deleted time stamp

(#5) BIOGRAPHY.DepartdateError data discarded

Problem

The BIOGRAPHY.DepartdateError column contains data that was, after discussion, determined to be unusable.

Bad Data

In the raw schema run:

This query no longer reports results because the data was changed in the original MS Access data.

select * from raw."BIOGRAPHY" where "B_DepartdateError" <> 0;

Solution

Do not convert the data. There is no corresponding column in the new db design.

(#6) BIOGRAPHY.B_AnimID_num column contains the empty string

Problem

The BIOGRAPHY.B_AnimID_num column contains the empty string, instead of NULL.

Bad Data

In the raw schema run:

This query no longer reports results because the data was changed in the original MS Access data.

select * from raw."BIOGRAPHY" where "B_AnimID_num" = '';

Solution

Changed 11 empty string values to NULL in MS Access. ICG 12/6/2023

(#7) BIOGRAPHY.B_AnimID_num column is textual

Problem

The BIOGRAPHY.B_AnimID_num column contains has a data type of TEXT. The data values all begin with "CH".

Bad Data

In the raw schema run:

This query no longer reports results because the data was changed in the original MS Access data.

select "B_AnimID", "B_AnimID_num"
  from raw."BIOGRAPHY"
  where "B_AnimID_num" <> ''
        AND "B_AnimID_num" IS DISTINCT FROM NULL
  order by "B_AnimID";

Solution

Removed the "CH" prefix and made the column an integer in MS Access. ICG 12/6/2023

(#8) BIOGRAPHY.DadID_publication_info column contains NULL values

No longer a problem as NULLs are now allowed.

(Not) Problem

The BIOGRAPHY.DadID_publication_info column has a data type that allows NULL values. SokweDB wants only text

Bad Data

In the raw schema run:

This query no longer reports results because the data was changed in the original MS Access data.

select *
  from raw."BIOGRAPHY"
  where "B_DadID" IS NULL
  order by "B_AnimID";

Solution

Changed the NULL to the empty string ("") in MS Access. ICG 12/6/2023

(#9) BIOGRAPHY.DadID column contains non-AnimID values

Problem

The BIOGRAPHY.DadID column has a non-AnimID values.

Bad Data

In the raw schema run:

select "B_AnimID", "B_DadID"
  from raw."BIOGRAPHY"
  where "B_DadID" is not null
        and "B_DadID" <> ''
        and not exists (select 1
                          from raw."BIOGRAPHY" as search
                          where search."B_AnimID" = raw."BIOGRAPHY"."B_DadID");

Solution

Create the DadIDPrelim column in a BIOGRAPHY_DATA table, and make a BIOGRAPHY view that combines Dad_ID and DadIDPrelim into DadID -- adding the '_prelim' suffix as expected.

(#10) BRECORD_NOTES contains rows where BREC_time has values that are not just a time

Problem

There are 56 rows in BRECORD_NOTES where the BREC_time column, supposedly a time, contains date values that are not '1899-12-30'.

Bad Data

In the raw schema run:

select * from raw."BRECORD_NOTES" where "BREC_time"::DATE <> '1899-12-30';

Solution

The data is fixed in the conversion process.

* (#11) KAZ has b_dadid_publication_info, but b_dadid is NULL

Problem

KAZ has dad id publication info of 'Rudicell et al. 2010', but a NULL dadid.

Bad Data

In the clean schema run:

select * 
  from clean.biography
  where b_dadid is NULL
        and (b_dadid_publication_info <> ''
             or b_dadid_publication_info is null);

Solution

(#12) 9 BIOGRAPHY rows have BirthComm values that are not COMM_IDS.CommID values

Problem

There are 9 BIOGRAPHY rows with BirthComm values that are not valid COMM_IDS, their communities do not exist.

Bad Data

In the clean schema run:

select *
  from easy.biography
  where b_birthgroup is not NULL
        and not exists (select 1
                          from easy.community_lookup
                          where community_lookup.cl_community_id
                                = biography.b_birthgroup);

Solution

Added KL and KL_KK to CommIds. KL = Kalande, KL_KK= Kasekela/kalande codes are created during conversion

Fixed in commit 5d85b83bfe1a.

* (#13) 234 COMM_MEMBS rows place individuals in a community before their BIOGRAPHY.EntryDate

Problem

234 COMM_MEMBS rows place individuals into a community before BIOGRAPHY says they entered the community.

Bad Data

In the clean schema run:

select b.b_animid, b.b_entrydate, cm.cm_start_date
  from clean.community_membership as cm
    join clean.biography as b
         on (b.b_animid = cm.cm_b_animid)
  where cm.cm_start_date < b.b_entrydate;

Solution

* (#14) 6 COMM_MEMBS rows place individuals in a community after their BIOGRAPHY.EndDate

Problem

6 COMM_MEMBS rows place individuals into a community after BIOGRAPHY says they left the community.

Bad Data

In the clean schema run:

select b.b_animid, b.b_departdate, cm.cm_end_date
  from clean.community_membership as cm
    join clean.biography as b
         on (b.b_animid = cm.cm_b_animid)
  where cm.cm_end_date > b.b_departdate;

Solution

ICG EVL fixed BH, HAI, TZB2 in Access. Need to talk to Karl about MG, RO, WD. Treat like KL chimps in Biography?

(#15) TT is placed in a community twice on the same day

Problem

An individual may not be in more than one community (or even twice in the same community) on any given day.

Bad Data

In the clean schema run:

select first.cm_b_animid as anim_id
     , first.cm_start_date as first_start_date
     , first.cm_end_date as first_end_date
     , first.cm_cl_community_id as first_community_id
     , first.cm_start_source as first_start_source
     , first.cm_end_source as first_end_source
     , second.cm_start_date as second_start_date
     , second.cm_end_date as second_end_date
     , second.cm_cl_community_id as second_community_id
     , second.cm_start_source as second_start_source
     , second.cm_end_source as second_end_source
  from clean.community_membership as first
    join clean.community_membership as second
         on (first.cm_b_animid = second.cm_b_animid
             and first.cm_start_date < second.cm_start_date)
  where first.cm_end_date >= second.cm_start_date;

Solution

Fixed in MS Access 2/1/2024 Changed end date of KK_P1 membership to 8/14/2022

ICG

(#16) There are 31 rows in COMM_MEMB_LOG where MadeBy is NULL

Problem

There are 31 rows in COMM_MEMB_LOG where MadeBy is NULL, and the column does not allow NULL values.

Bad Data

In the clean schema run:

select * from clean.community_membership_update_log where made_by is null;

Solution

See problem #18. Fixed in commit 7a48fb2c4aad3b5bd4

(#17) There are 64 rows in COMM_MEMB_LOG where the chimp_id is not on BIOGRAPHY

Problem

There are 64 rows in COMM_MEMB_LOG where the chimp_id is not a BIOGRAPHY.AnimID.

Bad Data

In the clean schema run:

select date_of_update, chimp_id
  from clean.community_membership_update_log as log
  where not exists (select 1
                      from clean.biography
                      where biography.b_animid = log.chimp_id);

Solution

Make this a "soft" error. Fixed in commit 4c9626b304.

(#18) There are 31 rows in COMM_MEMB_LOG where the MadeBy is NULL

Problem

There are 31 rows in COMM_MEMB_LOG where MadeBy is NULL, and the column does not allow NULLs.


Bad Data

In the clean schema run:

select * from clean.community_membership_update_log where made_by is null;

Solution

Create an unknown person (UNK), and when the person is NULL, use the unkonwn person.

This allows us to make the unknown person "inactive", preventing them from being used in newly entered data. The alternative, allowing NULL MadeBy values, would allow new "bad data", and NULL values make querying harder.

Fixed in commit 7a48fb2c4aad3b5b.

(#19) There are 18 rows in BIOGRAPHY_LOG where the MadeBy is NULL

Problem

There are 18 rows in BIOGRAPHY_LOG where MadeBy is NULL, and the column does not allow NULLs.


Bad Data

In the clean schema run:

select * from clean.biography_update_log where made_by is null;

Solution

Create an unknown person (UNK), and when the person is NULL, use the unkonwn person. (See Problem #18)== * (#19) There are 18 rows in BIOGRAPHY_LOG where the MadeBy is NULL ==

Fixed in commit 8a528cc7d1a9fb.

(#20) There are 3 rows in BIOGRAPHY_LOG where the Rationale is NULL

Problem

There are 3 rows in BIOGRAPHY_LOG where Rationale is NULL. The column does not allow NULLs; normally the conversion program would convert NULL to the empty string. But the Rationale column requires there be (non-empty) textual data.

Bad Data

In the clean schema run:

select * from clean.biography_update_log where update_rationale is null;

Solution

ICG fixed in MS Access 2/8/2024. Updated rationale to 'routine update'.

(#21) There are 6 rows in BIOGRAPHY_LOG where the update_escription is NULL

Problem

There are 6 rows in BIOGRAPHY_LOG where Description is NULL. It appears that the description was put into the Rationale column, sometimes along with some rationale.

Bad Data

In the clean schema run:

select * from clean.biography_update_log where update_description is null;

Solution

ICG fixed in MS Access 2/8/2024, using information in update_rationale

(#22) There are 85 rows in BIOGRAPHY_LOG where the chimp_id is not on BIOGRAPHY

Problem

There are 85 rows in BIOGRAPHY_LOG where the chimp_id is not a BIOGRAPHY.AnimID.

Bad Data

In the clean schema run:

select *
  from clean.biography_update_log as log
  where not exists (select 1
                      from clean.biography
                      where biography.b_animid = log.chimp_id);

Solution

Make this a "soft" error. Fixed in commit c96555f9f326.

(#23) There are 65 rows in BIOGRAPHY_LOG where the MadeBy is not on PEOPLE

Problem

There are 65 rows in BIOGRAPHY_LOG where the MadeBy is not on PEOPLE, but there is MadeBy data.

These are "combination" ID errors, where multiple people are entered instead of a single people code.

Bad Data

In the clean schema run:

select *
  from clean.biography_update_log as log
  where not exists (select 1
                      from clean.people

                     where people.person = log.made_by)
       and made_by is not null;

Solution

Fixed in MS Access by ICG 2/8/2024. Changed multiple IDs to the one who made the final change.

* (#24) Follow starts are not first arrivals

Problem

There are 6,397 cases where the old FOLLOW table's columns "fol_time_begin" is not the first arrival time, fa_time_start, on FOLLOW_ARRIVAL.

The The Gombe Chimpanzee Database Handbook says that this should never happen, because the follow start is supposed to be the first arrival. Is there additional data in the old fol_time_begin, like the actual start the observers started working? If not, which data is correct, the one in FOLLOW or the one in FOLLOW_ARRIVAL?

Bad Data

WITH spans AS
  (SELECT fa_fol_date, fa_fol_b_focal_animid
        , MIN(fa_time_start) AS min_start
     FROM clean.follow_arrival
     WHERE fa_b_arr_animid = fa_fol_b_focal_animid
     GROUP BY fa_fol_date, fa_fol_b_focal_animid)
SELECT spans.*, follow.fol_time_begin
  FROM spans
    JOIN clean.follow
      ON (follow.fol_date = spans.fa_fol_date
          AND follow.fol_b_animid = spans.fa_fol_b_focal_animid)
  WHERE NOT EXISTS
        (SELECT 1
           FROM clean.follow
           WHERE follow.fol_date = spans.fa_fol_date
                 AND follow.fol_b_animid = spans.fa_fol_b_focal_animid
                 AND spans.min_start = follow.fol_time_begin)
  ORDER BY spans.fa_fol_date, spans.fa_fol_b_focal_animid;

Solution

TENTATIVE: This is data entry error, ignore the problem and ignore the data in the follow.fol_time_begin.

* (#25) Follow ends are not last arrivals

Problem

There are 6,570 cases where the old FOLLOW table's columns "fol_time_end" is not the last arrival time, fa_time_end, on FOLLOW_ARRIVAL.

The The Gombe Chimpanzee Database Handbook says that this should never happen, because the follow end is supposed to be the first arrival/last departure. Is there additional data in the old fol_time_end, like the actual end-time the observers started working? If not, which data is correct, the one in FOLLOW or the one in FOLLOW_ARRIVAL?

Bad Data

WITH spans AS
  (SELECT fa_fol_date, fa_fol_b_focal_animid
          , MAX(fa_time_end) AS max_end
     FROM clean.follow_arrival
     WHERE fa_b_arr_animid = fa_fol_b_focal_animid
     GROUP BY fa_fol_date, fa_fol_b_focal_animid)
SELECT spans.*, follow.fol_time_end
  FROM spans
    JOIN clean.follow
      ON (follow.fol_date = spans.fa_fol_date
          AND follow.fol_b_animid = spans.fa_fol_b_focal_animid)
  WHERE NOT EXISTS
        (SELECT 1
           FROM clean.follow
           WHERE follow.fol_date = spans.fa_fol_date
                 AND follow.fol_b_animid = spans.fa_fol_b_focal_animid
                 AND spans.max_end = follow.fol_time_end)
  ORDER BY spans.fa_fol_date, spans.fa_fol_b_focal_animid;

Solution

TENTATIVE: This is data entry error, ignore the problem and ignore the data in the follow.fol_time_end.

* (#26) Mismatch of start-in-nest on follow and follow_arrival

Problem

There are 1,848 follows where the follow arrival says the focal started in the nest but the follow does not, or vice-versa.

Bad Data

WITH spans AS
  (SELECT fa_fol_date, fa_fol_b_focal_animid
        , MIN(fa_time_start) AS min_start
     FROM clean.follow_arrival
     WHERE fa_b_arr_animid = fa_fol_b_focal_animid
     GROUP BY fa_fol_date, fa_fol_b_focal_animid)
SELECT spans.*, follow.fol_time_begin, follow.fol_flag_begin_in_nest
     , first_arrivals.fa_type_of_nesting
  FROM clean.follow
    JOIN spans
      ON (follow.fol_date = spans.fa_fol_date
          AND follow.fol_b_animid = spans.fa_fol_b_focal_animid)
    JOIN clean.follow_arrival
      AS first_arrivals
      ON (first_arrivals.fa_fol_date = follow.fol_date
          AND first_arrivals.fa_fol_b_focal_animid = follow.fol_b_animid
          AND first_arrivals.fa_time_start = spans.min_start)
  WHERE (((first_arrivals.fa_type_of_nesting = 1
           OR first_arrivals.fa_type_of_nesting = 3)
          AND follow.fol_flag_begin_in_nest = 0)
         OR (first_arrivals.fa_type_of_nesting <> 1
             AND first_arrivals.fa_type_of_nesting <> 3
             AND follow.fol_flag_begin_in_nest = 1))
  ORDER BY spans.fa_fol_date, spans.fa_fol_b_focal_animid;

* (#27) Mismatch of end-in-nest on follow and follow_arrival

Problem

There are 3,804 follows where the follow arrival says the focal ended in the nest but the follow does not, or vice-versa.

Bad Data

WITH spans AS
  (SELECT fa_fol_date, fa_fol_b_focal_animid
        , MAX(fa_time_end) AS max_end
     FROM clean.follow_arrival
     WHERE fa_b_arr_animid = fa_fol_b_focal_animid
     GROUP BY fa_fol_date, fa_fol_b_focal_animid)
SELECT spans.*
     , follow.fol_time_end, follow.fol_flag_end_in_nest
     , last_arrivals.fa_type_of_nesting
  FROM clean.follow
    JOIN spans
      ON (follow.fol_date = spans.fa_fol_date
          AND follow.fol_b_animid = spans.fa_fol_b_focal_animid)
    JOIN clean.follow_arrival
      AS last_arrivals
      ON (last_arrivals.fa_fol_date = follow.fol_date
          AND last_arrivals.fa_fol_b_focal_animid = follow.fol_b_animid
          AND last_arrivals.fa_time_end = spans.max_end)
  WHERE (((last_arrivals.fa_type_of_nesting = 2
           OR last_arrivals.fa_type_of_nesting = 3)
          AND follow.fol_flag_end_in_nest = 0)
         OR (last_arrivals.fa_type_of_nesting <> 2
             AND last_arrivals.fa_type_of_nesting <> 3
             AND follow.fol_flag_end_in_nest = 1))
  ORDER BY spans.fa_fol_date, spans.fa_fol_b_focal_animid;


(#28) The FOLLOW.FOL_distance_traveled column is not converted

Problem

The FOLLOW.FOL_distance_traveled column has no corresponding column in the new database design. The conversion process does not check that the value of this column is consistent with the other data in the database from which it is computed.

Solution

This is a computed column and does not need to be converted. The desired value is computed in the design of the new database.

The assumption is that the "raw" data from which this value is computed in the MS Access database is correct.


(#29) The FOLLOW.Brecord_notes column is not converted

Problem

The FOLLOW.Brecord_notes column has no corresponding column in the new database design.

Solution

This column is used for administrative purposes and does not need to be in the new database design.


* (#30) Some follows have no community

Problem

There are 13 follows with no community.

Bad Data

select *
  from clean.follow
  where fol_cl_community_id is null
  order by fol_date, fol_b_animid;


* (#32) Some follows have an animid with trailing spaces

Problem

There are 9 follows with animids that don't exist, because they have trailing spaces. These are comprised of 9 distinct animids.

Bad Data

select rtrim(fol_b_animid)
  from clean.follow
  where rtrim(fol_b_animid) <> fol_b_animid
  order by fol_b_animid;

Solution

Remove the trailing spaces in the conversion process.


* (#32) Some follows have an animid that is lower-case

Problem

There is 1 follows with an animids that don't exist, because it is lower-case.

Bad Data

select *
  from clean.follow
  where upper(fol_b_animid) <> fol_b_animid
  order by fol_b_animid;

Solution

Convert the animid to upper-case in the conversion process.


* (#34) Some follows have an animid that does not exist, even after animid cleanup

Problem

There are 14 follows with animids that don't exist, even after cleanup that removes trailing spaces and forces upper-case. (Some of these may be due to prior errors.) These are comprised of 3 distinct animids.

Bad Data

-- The 14 follows with bad animids
select *
  from clean.follow
  where not exists
          (select 1
             from clean.biography
             where biography.b_animid = upper(rtrim(follow.fol_b_animid)))
  order by fol_date, fol_b_animid;
-- The 3 animids involved
select distinct follow.fol_b_animid
  from clean.follow
  where not exists
          (select 1
             from clean.biography
             where biography.b_animid = upper(rtrim(follow.fol_b_animid)))
  order by follow.fol_b_animid;


* (#33) There are duplicate animid, date combinations on FOLLOW

Problem

There are 2 sets of duplicate animid, date combinations on the FOLLOW table, for a total of 4 rows.

Bad Data

-- The duplicate animid, date combinations
select follow.fol_b_animid, follow.fol_date, count(*)
  from clean.follow
  group by follow.fol_b_animid, follow.fol_date
  having count(*) > 1
  order by fol_b_animid, fol_date;
-- The duplicate rows
with dups as (
  select follow.fol_b_animid, follow.fol_date
    from clean.follow
    group by follow.fol_b_animid, follow.fol_date
    having count(*) > 1)
select *
  from clean.follow
       join dups on (follow.fol_b_animid = dups.fol_b_animid
                     and follow.fol_date = dups.fol_date)
  order by follow.fol_b_animid, follow.fol_date;


* (#35) There are follows done before a focal was under study

Problem

There are 5 follows that are done before their focal was under study, before the focal's EntryDate.

Bad Data

select follow.*, biography.b_entrydate
  from clean.follow
    join clean.biography on (biography.b_animid = follow.fol_b_animid)
  where biography.b_entrydate > follow.fol_date
  order by follow.fol_date, follow.fol_b_animid;