Conversion Data Issues
This page lists all the problems with the data that were encountered during the data conversion process, and how the issue was resolved.
The problems are numbered, in the order in which they were encountered during the conversion.
Given a choice, earlier problems should be solved before later problems. This allows the later steps in the conversion to receive "correct" data, which help eliminate spurious problems, and aids the discovery of problems hidden by bad data.
Unsolved problems are marked with an *
.
(#1) FOLLOW_MAP_TIME duplicate keys
Problem
The data dump says that the FOLLOW_MAP_TIME
table has a primary key consisting of, in order, the columns: FMT_FOL_date
, FMT_FOL_B_focal_AnimID
, FMT_time
.
But these columns contain duplicate values.
Bad Data
The duplicate values can be listed (from the raw
schema) with:
This query no longer reports results because the data was changed in the original MS Access data.
SELECT * FROM "FOLLOW_MAP_TIME" JOIN (SELECT "FMT_FOL_date" AS the_date , "FMT_FOL_B_focal_AnimID" AS the_animid , "FMT_time" AS the_time FROM "FOLLOW_MAP_TIME" GROUP BY "FMT_FOL_date", "FMT_FOL_B_focal_AnimID", "FMT_time" HAVING count(*) > 1 ) AS fmt ON ("FOLLOW_MAP_TIME"."FMT_FOL_date" = fmt.the_date AND "FOLLOW_MAP_TIME"."FMT_FOL_B_focal_AnimID" = fmt.the_animid AND "FOLLOW_MAP_TIME"."FMT_time" = fmt.the_time);
Solution
Fixed in MS Access 11/17/23 by ICG. Checked all against Tikis.
(#2) SUBADULT_ARRIVALS_LOG has a textual SA_first_tiki_date column
Problem
The SUBADULT_ARRIVALS_LOG table has a column that is supposed to contain a date, but instead contains the string "TEXTY".
Likely, the entire row is bad. The row contains:
SA_B_AnimID | SA_first_tiki_date | SA_notes -------------+--------------------+---------- TEXTY | TEXTY | TEXTY
Bad Data
In the raw
schema run:
This query no longer reports results because the data was changed in the original MS Access data.
select * from "SUBADULT_ARRIVALS_LOG" where "SA_first_tiki_date" = 'TEXTY';
Solution
Fixed in MS Access, 11/16/2023, ICG. Deleted row
(#3) BRECORD_NOTES contains rows where BREC_FOL_date has values that are not just a date
Problem
There are 4 rows in BRECORD_NOTES where the BREC_FOL_date column, supposedly a date, contains time values that are not '00:00:00'.
Bad Data
In the raw
schema run:
This query no longer reports results because the data was changed in the original MS Access data.
select * from "BRECORD_NOTES" where "BREC_FOL_date"::TIME <> '00:00:00';
Solution
Fixed in MS Access, 11/16/2023, ICG Deleted time stamp
(#4) MATING_EVENT contains rows where M_FOL_date has values that are not just a date
Problem
The MATING_EVENT table contains 1 row where the time portion of M_FOL_date is not '00:00:00'.
Bad Data
In the raw
schema run:
This query no longer reports results because the data was changed in the original MS Access data.
select * from "MATING_EVENT" where "M_FOL_date"::TIME WITHOUT TIME ZONE <> '00:00:00';
Solution
Fixed in MS Access, 11/16/2023, ICG Deleted time stamp
(#5) BIOGRAPHY.DepartdateError data discarded
Problem
The BIOGRAPHY.DepartdateError column contains data that was, after discussion, determined to be unusable.
Bad Data
In the raw
schema run:
This query no longer reports results because the data was changed in the original MS Access data.
select * from raw."BIOGRAPHY" where "B_DepartdateError" <> 0;
Solution
Do not convert the data. There is no corresponding column in the new db design.
(#6) BIOGRAPHY.B_AnimID_num column contains the empty string
Problem
The BIOGRAPHY.B_AnimID_num column contains the empty string, instead of NULL.
Bad Data
In the raw
schema run:
This query no longer reports results because the data was changed in the original MS Access data.
select * from raw."BIOGRAPHY" where "B_AnimID_num" = '';
Solution
Changed 11 empty string values to NULL in MS Access. ICG 12/6/2023
(#7) BIOGRAPHY.B_AnimID_num column is textual
Problem
The BIOGRAPHY.B_AnimID_num column contains has a data type of TEXT. The data values all begin with "CH".
Bad Data
In the raw
schema run:
This query no longer reports results because the data was changed in the original MS Access data.
select "B_AnimID", "B_AnimID_num" from raw."BIOGRAPHY" where "B_AnimID_num" <> '' AND "B_AnimID_num" IS DISTINCT FROM NULL order by "B_AnimID";
Solution
Removed the "CH" prefix and made the column an integer in MS Access. ICG 12/6/2023
(#8) BIOGRAPHY.DadID_publication_info column contains NULL values
Problem
The BIOGRAPHY.DadID_publication_info column has a data type that allows NULL
values.
SokweDB wants only text
Bad Data
In the raw
schema run:
This query no longer reports results because the data was changed in the original MS Access data.
select * from raw."BIOGRAPHY" where "B_DadID" IS NULL order by "B_AnimID";
Solution
Changed the NULL
to the empty string ("") in MS Access. ICG 12/6/2023
(#9) BIOGRAPHY.DadID column contains non-AnimID values
Problem
The BIOGRAPHY.DadID column has a non-AnimID values.
Bad Data
In the raw
schema run:
select "B_AnimID", "B_DadID" from raw."BIOGRAPHY" where "B_DadID" is not null and "B_DadID" <> '' and not exists (select 1 from raw."BIOGRAPHY" as search where search."B_AnimID" = raw."BIOGRAPHY"."B_DadID");
Solution
Create the DadIDPrelim column in a BIOGRAPHY_DATA table, and make a BIOGRAPHY view that combines Dad_ID and DadIDPrelim into DadID -- adding the '_prelim' suffix as expected.
(#10) BRECORD_NOTES contains rows where BREC_time has values that are not just a time
Problem
There are 56 rows in BRECORD_NOTES where the BREC_time column, supposedly a time, contains date values that are not '1899-12-30'.
Bad Data
In the raw
schema run:
select * from raw."BRECORD_NOTES" where "BREC_time"::DATE <> '1899-12-30';
Solution
The data is fixed in the conversion process.
* (#11) KAZ has b_dadid_publication_info, but b_dadid is NULL
Problem
KAZ
has dad id publication info of 'Rudicell et al. 2010', but a NULL
dadid.
Bad Data
In the clean
schema run:
select * from clean.biography where b_dadid is NULL and (b_dadid_publication_info <> '' or b_dadid_publication_info is null);
Solution
(#12) 9 BIOGRAPHY rows have BirthComm values that are not COMM_IDS.CommID values
Problem
There are 9 BIOGRAPHY rows with BirthComm values that are not valid COMM_IDS, their communities do not exist.
Bad Data
In the clean
schema run:
select * from easy.biography where b_birthgroup is not NULL and not exists (select 1 from easy.community_lookup where community_lookup.cl_community_id = biography.b_birthgroup);
Solution
Added KL and KL_KK to CommIds. KL = Kalande, KL_KK= Kasekela/kalande codes are created during conversion
Fixed in commit 5d85b83bfe1a.
* (#13) 234 COMM_MEMBS rows place individuals in a community before their BIOGRAPHY.EntryDate
Problem
234 COMM_MEMBS rows place individuals into a community before BIOGRAPHY says they entered the community.
Bad Data
In the clean
schema run:
select b.b_animid, b.b_entrydate, cm.cm_start_date from clean.community_membership as cm join clean.biography as b on (b.b_animid = cm.cm_b_animid) where cm.cm_start_date < b.b_entrydate;
Solution
* (#14) 6 COMM_MEMBS rows place individuals in a community after their BIOGRAPHY.EndDate
Problem
6 COMM_MEMBS rows place individuals into a community after BIOGRAPHY says they left the community.
Bad Data
In the clean
schema run:
select b.b_animid, b.b_departdate, cm.cm_end_date from clean.community_membership as cm join clean.biography as b on (b.b_animid = cm.cm_b_animid) where cm.cm_end_date > b.b_departdate;
Solution
ICG EVL fixed BH, HAI, TZB2 in Access. Need to talk to Karl about MG, RO, WD. Treat like KL chimps in Biography?
(#15) TT is placed in a community twice on the same day
Problem
An individual may not be in more than one community (or even twice in the same community) on any given day.
Bad Data
In the clean
schema run:
select first.cm_b_animid as anim_id , first.cm_start_date as first_start_date , first.cm_end_date as first_end_date , first.cm_cl_community_id as first_community_id , first.cm_start_source as first_start_source , first.cm_end_source as first_end_source , second.cm_start_date as second_start_date , second.cm_end_date as second_end_date , second.cm_cl_community_id as second_community_id , second.cm_start_source as second_start_source , second.cm_end_source as second_end_source from clean.community_membership as first join clean.community_membership as second on (first.cm_b_animid = second.cm_b_animid and first.cm_start_date < second.cm_start_date) where first.cm_end_date >= second.cm_start_date;
Solution
Fixed in MS Access 2/1/2024 Changed end date of KK_P1 membership to 8/14/2022
ICG
(#16) There are 31 rows in COMM_MEMB_LOG where MadeBy is NULL
Problem
There are 31 rows in COMM_MEMB_LOG where MadeBy is NULL, and the column does not allow NULL values.
Bad Data
In the clean
schema run:
select * from clean.community_membership_update_log where made_by is null;
Solution
See problem #18. Fixed in commit 7a48fb2c4aad3b5bd4
* (#17) There are 64 rows in COMM_MEMB_LOG where the chimp_id is not on BIOGRAPHY
Problem
There are 64 rows in COMM_MEMB_LOG where the chimp_id is not a BIOGRAPHY.AnimID.
Bad Data
In the clean
schema run:
select date_of_update, chimp_id from clean.community_membership_update_log as log where not exists (select 1 from clean.biography where biography.b_animid = log.chimp_id);
Solution
Make this a "soft" error.
(#18) There are 31 rows in COMM_MEMB_LOG where the MadeBy is NULL
Problem
There are 31 rows in COMM_MEMB_LOG where MadeBy is NULL, and the column does not allow NULLs.
Bad Data
In the clean
schema run:
select * from clean.community_membership_update_log where made_by is null;
Solution
Create an unknown person (UNK), and when the person is NULL, use the unkonwn person.
This allows us to make the unknown person "inactive", preventing them from being used in newly entered data. The alternative, allowing NULL MadeBy values, would allow new "bad data", and NULL values make querying harder.
Fixed in commit 7a48fb2c4aad3b5b.
(#19) There are 18 rows in BIOGRAPHY_LOG where the MadeBy is NULL
Problem
There are 18 rows in BIOGRAPHY_LOG where MadeBy is NULL, and the column does not allow NULLs.
Bad Data
In the clean
schema run:
select * from clean.biography_update_log where made_by is null;
Solution
Create an unknown person (UNK), and when the person is NULL, use the unkonwn person. (See Problem #18)== * (#19) There are 18 rows in BIOGRAPHY_LOG where the MadeBy is NULL ==
Fixed in commit 8a528cc7d1a9fb.
(#20) There are 3 rows in BIOGRAPHY_LOG where the Rationale is NULL
Problem
There are 3 rows in BIOGRAPHY_LOG where Rationale is NULL. The column does not allow NULLs; normally the conversion program would convert NULL to the empty string. But the Rationale column requires there be (non-empty) textual data.
Bad Data
In the clean
schema run:
select * from clean.biography_update_log where update_rationale is null;
Solution
ICG fixed in MS Access 2/8/2024. Updated rationale to 'routine update'.
(#21) There are 6 rows in BIOGRAPHY_LOG where the update_escription is NULL
Problem
There are 6 rows in BIOGRAPHY_LOG where Description is NULL. It appears that the description was put into the Rationale column, sometimes along with some rationale.
Bad Data
In the clean
schema run:
select * from clean.biography_update_log where update_description is null;
Solution
ICG fixed in MS Access 2/8/2024, using information in update_rationale
* (#22) There are 85 rows in BIOGRAPHY_LOG where the chimp_id is not on BIOGRAPHY
Problem
There are 85 rows in BIOGRAPHY_LOG where the chimp_id is not a BIOGRAPHY.AnimID.
Bad Data
In the clean
schema run:
select * from clean.biography_update_log as log where not exists (select 1 from clean.biography where biography.b_animid = log.chimp_id);
Solution
Make this a "soft" error.
(#23) There are 65 rows in BIOGRAPHY_LOG where the MadeBy is not on PEOPLE
Problem
There are 65 rows in BIOGRAPHY_LOG where the MadeBy is not on PEOPLE, but there is MadeBy data.
These are "combination" ID errors, where multiple people are entered instead of a single people code.
Bad Data
In the clean
schema run:
select * from clean.biography_update_log as log where not exists (select 1 from clean.people where people.person = log.made_by) and made_by is not null;
= Solution
Fixed in MS Access by ICG 2/8/2024. Changed multiple IDs to the one who made the final change.