Harvest for protisten.de, created 24 Nov 12:38

Stage: completed
Fetched: 24 Nov 12:38
Validated: 24 Nov 12:38
Deltas Created: 24 Nov 12:38
Units Normalized: 24 Nov 12:40
Ancestry Built: 24 Nov 12:38
Nodes Matched: 24 Nov 12:40
Names Parsed: 24 Nov 12:38
New Models Stored: 24 Nov 12:38
Indexed: 24 Nov 12:40
Completed: 24 Nov 12:42
Time to Harvest: about 4.5 minutes

Harvesting Log (oldest first)

[INFO] [2021-11-24 12:38:05] Created harvest instance #4077
[STOP] [2021-11-24 12:38:05] create_harvest_instance
[START] [2021-11-24 12:38:05] fetch_files
[STOP] [2021-11-24 12:38:05] fetch_files
[START] [2021-11-24 12:38:05] validate_each_file
[INFO] [2021-11-24 12:38:05] Looping over 3 formats...
[INFO] [2021-11-24 12:38:05] ...agents (/app/public/data/ptgdp/agent.tab)
[INFO] [2021-11-24 12:38:05] Valid: /app/public/converted_csv/ptgdp_agents_4077.csv (1 lines)
[INFO] [2021-11-24 12:38:05] ...nodes (/app/public/data/ptgdp/taxon.tab)
[INFO] [2021-11-24 12:38:05] Valid: /app/public/converted_csv/ptgdp_nodes_4077.csv (1165 lines)
[INFO] [2021-11-24 12:38:05] ...media (/app/public/data/ptgdp/media_resource.tab)
[INFO] [2021-11-24 12:38:05] Valid: /app/public/converted_csv/ptgdp_media_4077.csv (2078 lines)
[STOP] [2021-11-24 12:38:05] validate_each_file
[START] [2021-11-24 12:38:05] convert_to_csv
[INFO] [2021-11-24 12:38:05] Looping over 3 formats...
[INFO] [2021-11-24 12:38:05] ...agents (/app/public/data/ptgdp/agent.tab)
[CMD] [2021-11-24 12:38:05] /usr/bin/sort /app/public/converted_csv/ptgdp_agents_4077.csv > /app/public/converted_csv/ptgdp_agents_4077.csv_sorted
[INFO] [2021-11-24 12:38:07] Converted: /app/public/converted_csv/ptgdp_agents_4077.csv (1 lines)
[INFO] [2021-11-24 12:38:07] ...nodes (/app/public/data/ptgdp/taxon.tab)
[CMD] [2021-11-24 12:38:07] /usr/bin/sort /app/public/converted_csv/ptgdp_nodes_4077.csv > /app/public/converted_csv/ptgdp_nodes_4077.csv_sorted
[INFO] [2021-11-24 12:38:08] Converted: /app/public/converted_csv/ptgdp_nodes_4077.csv (1165 lines)
[INFO] [2021-11-24 12:38:08] ...media (/app/public/data/ptgdp/media_resource.tab)
[CMD] [2021-11-24 12:38:08] /usr/bin/sort /app/public/converted_csv/ptgdp_media_4077.csv > /app/public/converted_csv/ptgdp_media_4077.csv_sorted
[INFO] [2021-11-24 12:38:10] Converted: /app/public/converted_csv/ptgdp_media_4077.csv (2078 lines)
[STOP] [2021-11-24 12:38:10] convert_to_csv
[START] [2021-11-24 12:38:10] calculate_delta
[INFO] [2021-11-24 12:38:10] Looping over 3 formats...
[INFO] [2021-11-24 12:38:10] ...agents (/app/public/data/ptgdp/agent.tab)
[CMD] [2021-11-24 12:38:10] echo "0a" > /app/public/diff/ptgdp_agents_4077.diff
[CMD] [2021-11-24 12:38:12] tail -n +1 /app/public/converted_csv/ptgdp_agents_4077.csv >> /app/public/diff/ptgdp_agents_4077.diff
[CMD] [2021-11-24 12:38:13] echo "." >> /app/public/diff/ptgdp_agents_4077.diff
[INFO] [2021-11-24 12:38:15] Created diff: /app/public/diff/ptgdp_agents_4077.diff (3 lines)
[INFO] [2021-11-24 12:38:15] ...nodes (/app/public/data/ptgdp/taxon.tab)
[CMD] [2021-11-24 12:38:15] echo "0a" > /app/public/diff/ptgdp_nodes_4077.diff
[CMD] [2021-11-24 12:38:17] tail -n +1 /app/public/converted_csv/ptgdp_nodes_4077.csv >> /app/public/diff/ptgdp_nodes_4077.diff
[CMD] [2021-11-24 12:38:18] echo "." >> /app/public/diff/ptgdp_nodes_4077.diff
[INFO] [2021-11-24 12:38:20] Created diff: /app/public/diff/ptgdp_nodes_4077.diff (1167 lines)
[INFO] [2021-11-24 12:38:20] ...media (/app/public/data/ptgdp/media_resource.tab)
[CMD] [2021-11-24 12:38:20] echo "0a" > /app/public/diff/ptgdp_media_4077.diff
[CMD] [2021-11-24 12:38:21] tail -n +1 /app/public/converted_csv/ptgdp_media_4077.csv >> /app/public/diff/ptgdp_media_4077.diff
[CMD] [2021-11-24 12:38:23] echo "." >> /app/public/diff/ptgdp_media_4077.diff
[INFO] [2021-11-24 12:38:25] Created diff: /app/public/diff/ptgdp_media_4077.diff (2080 lines)
[STOP] [2021-11-24 12:38:25] calculate_delta
[START] [2021-11-24 12:38:25] parse_diff_and_store
[INFO] [2021-11-24 12:38:25] Handling diff: /app/public/diff/ptgdp_agents_4077.diff (3 lines)
[INFO] [2021-11-24 12:38:26] Loading agents diff file into memory (3 /app/public/diff/ptgdp_agents_4077.diff lines)...
[INFO] [2021-11-24 12:38:28] Handling diff: /app/public/diff/ptgdp_nodes_4077.diff (1167 lines)
[INFO] [2021-11-24 12:38:30] Loading nodes diff file into memory (1167 /app/public/diff/ptgdp_nodes_4077.diff lines)...
[WARN] [2021-11-24 12:38:32] Filtered Scientific Name `Bryozoa/ Moostierchen` to `Bryozoa Moostierchen`
[WARN] [2021-11-24 12:38:32] Filtered Scientific Name `Rhodophyta/Rotalge` to `RhodophytaRotalge`
[WARN] [2021-11-24 12:38:32] Filtered Scientific Name `Chlorophyta-Cyste/Chlorophyta cyst` to `Chlorophyta-CysteChlorophyta cyst`
[WARN] [2021-11-24 12:38:32] Filtered Scientific Name `Cyanobacteria/Melainabacteria group` to `CyanobacteriaMelainabacteria group`
[INFO] [2021-11-24 12:38:32] Handling diff: /app/public/diff/ptgdp_media_4077.diff (2080 lines)
[INFO] [2021-11-24 12:38:33] Loading media diff file into memory (2080 /app/public/diff/ptgdp_media_4077.diff lines)...
[WARN] [2021-11-24 12:38:36] skipping invalid medium (missing format or subtype) with resource_pk 8547ce48764a154fc078a638d5d2b31c, subclass image (from http://purl.org/dc/dcmitype/stillimage), format  (from )
[WARN] [2021-11-24 12:38:36] skipping invalid medium (missing format or subtype) with resource_pk a26be8c67c42b448a3e2d4d7595d59e1, subclass image (from http://purl.org/dc/dcmitype/stillimage), format  (from )
[WARN] [2021-11-24 12:38:37] skipping invalid medium (missing format or subtype) with resource_pk f1a880a4d1ad767e0d7870dd5443d093, subclass image (from http://purl.org/dc/dcmitype/stillimage), format  (from )
[INFO] [2021-11-24 12:38:37] Storing 1 Attributions
[INFO] [2021-11-24 12:38:37] Processing group of 1 in 1 groups of 1000
[INFO] [2021-11-24 12:38:37] Average Time: 0.03
[INFO] [2021-11-24 12:38:37] Total Time: 1s
[INFO] [2021-11-24 12:38:37] Storing 1165 ScientificNames
[INFO] [2021-11-24 12:38:37] Processing group of 1165 in 2 groups of 1000
[INFO] [2021-11-24 12:38:37] Average Time: 0.165
[INFO] [2021-11-24 12:38:37] Total Time: 1s
[INFO] [2021-11-24 12:38:37] Storing 1165 Nodes
[INFO] [2021-11-24 12:38:37] Processing group of 1165 in 2 groups of 1000
[INFO] [2021-11-24 12:38:38] Average Time: 0.265
[INFO] [2021-11-24 12:38:38] Total Time: 1s
[INFO] [2021-11-24 12:38:38] Storing 2075 ContentAttributions
[INFO] [2021-11-24 12:38:38] Processing group of 2075 in 3 groups of 1000
[INFO] [2021-11-24 12:38:38] Average Time: 0.133
[INFO] [2021-11-24 12:38:38] Total Time: 1s
[INFO] [2021-11-24 12:38:38] Storing 2075 Media
[INFO] [2021-11-24 12:38:38] Processing group of 2075 in 3 groups of 1000
[INFO] [2021-11-24 12:38:39] Average Time: 0.407
[INFO] [2021-11-24 12:38:39] Total Time: 2s
[STOP] [2021-11-24 12:38:39] parse_diff_and_store
[START] [2021-11-24 12:38:39] resolve_keys
[INFO] [2021-11-24 12:38:48] Occurrences to nodes (through scientific_names)...
[INFO] [2021-11-24 12:38:48] traits to occurrences...
[INFO] [2021-11-24 12:38:48] traits to nodes (through occurrences)...
[INFO] [2021-11-24 12:38:48] Traits to sex term...
[INFO] [2021-11-24 12:38:48] Traits to lifestage term...
[INFO] [2021-11-24 12:38:48] MetaTraits to traits...
[INFO] [2021-11-24 12:38:48] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2021-11-24 12:38:48] Assocs to occurrences...
[INFO] [2021-11-24 12:38:48] Assocs to nodes...
[INFO] [2021-11-24 12:38:48] Assoc to sex term...
[INFO] [2021-11-24 12:38:48] Assoc to lifestage term...
[INFO] [2021-11-24 12:38:48] MetaAssoc to assocs...
[STOP] [2021-11-24 12:38:48] resolve_keys
[START] [2021-11-24 12:38:48] hold_for_later_1
[STOP] [2021-11-24 12:38:48] hold_for_later_1
[START] [2021-11-24 12:38:48] hold_for_later_2
[STOP] [2021-11-24 12:38:48] hold_for_later_2
[START] [2021-11-24 12:38:48] resolve_missing_parents
[STOP] [2021-11-24 12:38:48] resolve_missing_parents
[START] [2021-11-24 12:38:48] rebuild_nodes
[START] [2021-11-24 12:38:48] Flattener#flatten
[START] [2021-11-24 12:38:48] Flattener#study_resource
[START] [2021-11-24 12:38:48] Flattener#build_ancestry
[STOP] [2021-11-24 12:38:48] Flattener#build_ancestry
[INFO] [2021-11-24 12:38:48] 1165 ancestry keys
[START] [2021-11-24 12:38:48] build_node_ancestors
[INFO] [2021-11-24 12:38:48] old ancestors deleted.
[STOP] [2021-11-24 12:38:49] build_node_ancestors
[START] [2021-11-24 12:38:49] Flattener#propagate_ancestor_ids
[STOP] [2021-11-24 12:38:49] Flattener#propagate_ancestor_ids
[STOP] [2021-11-24 12:38:49] Flattener#flatten
[STOP] [2021-11-24 12:38:49] rebuild_nodes
[START] [2021-11-24 12:38:49] resolve_missing_media_owners
[STOP] [2021-11-24 12:38:49] resolve_missing_media_owners
[START] [2021-11-24 12:38:49] sanitize_media_verbatims
[STOP] [2021-11-24 12:38:49] sanitize_media_verbatims
[START] [2021-11-24 12:38:49] queue_downloads
[STOP] [2021-11-24 12:38:49] queue_downloads
[START] [2021-11-24 12:38:49] parse_names
[WARN] [2021-11-24 12:38:49] I see 1165 names which still need to be parsed.
[WARN] [2021-11-24 12:38:50] Names to parse: 1165 formatted: 1165 learned: 1144 parsed: 1165
[STOP] [2021-11-24 12:38:51] parse_names
[START] [2021-11-24 12:38:51] denormalize_canonical_names_to_nodes
[STOP] [2021-11-24 12:38:51] denormalize_canonical_names_to_nodes
[START] [2021-11-24 12:38:51] match_nodes
[START] [2021-11-24 12:38:51] map_all_nodes_to_pages
[ERR] [2021-11-24 12:39:31][hdls] download_and_prep FAILED for Medium.find(13940197): 404 Not Found
[STOP] [2021-11-24 12:40:10] map_all_nodes_to_pages
[INFO] [2021-11-24 12:40:10] 66 Unmatched nodes (of 1165)! That's too many to output. Full list in /app/public/data/ptgdp/unmatched_nodes.txt ; First 10: Canonical: Siderocapsa; Node#101269878; ResourceID: 976a61ae86cac16ed1137606867d2dce; Canonical: Chroococcus turgidus with; Node#101269688; ResourceID: 4e0938c921dba380e294e261c264563a; Canonical: Eucaryota; Node#101270301; ResourceID: eucaryota; Canonical: Nassulophorea; Node#101270426; ResourceID: intramacronucleata-nassulophorea; Canonical: Peniculiada; Node#101270481; ResourceID: oligohymenophorea-peniculiada; Canonical: Protostomatea; Node#101270430; ResourceID: intramacronucleata-protostomatea; Canonical: Alveolates; Node#101269492; ResourceID: sar_(stramenopiles,_alveolates,_rhizaria)-alveolates; Canonical: Tecofilosea; Node#101270091; ResourceID: cercozoa-tecofilosea; Canonical: Microgromia socialis; Node#101270362; ResourceID: f9686b84a29a24c9287e57cdb7df43ba; Canonical: Bicosida; Node#101270031; ResourceID: bicoecea-bicosida
[START] [2021-11-24 12:40:10] update_nodes
[STOP] [2021-11-24 12:40:10] update_nodes
[STOP] [2021-11-24 12:40:10] match_nodes
[START] [2021-11-24 12:40:10] reindex_search
[STOP] [2021-11-24 12:40:12] reindex_search
[START] [2021-11-24 12:40:12] normalize_units
[STOP] [2021-11-24 12:40:12] normalize_units
[START] [2021-11-24 12:40:12] calculate_statistics
[STOP] [2021-11-24 12:40:12] calculate_statistics
[START] [2021-11-24 12:40:12] complete_harvest_instance
[START] [2021-11-24 12:40:12] overall_tsv_creation
[INFO] [2021-11-24 12:40:12] Processing group of 1165 in 1 batches of 10000
[ERR] [2021-11-24 12:40:32][hdls] download_and_prep FAILED for Medium.find(13940537): 404 Not Found
[ERR] [2021-11-24 12:41:18][hdls] download_and_prep FAILED for Medium.find(13940945): 404 Not Found
[ERR] [2021-11-24 12:41:58][hdls] download_and_prep FAILED for Medium.find(13941318): 404 Not Found
[INFO] [2021-11-24 12:42:24] Average Time: 16.29
[INFO] [2021-11-24 12:42:24] Total Time: 2m13s
[STOP] [2021-11-24 12:42:24] overall_tsv_creation
[INFO] [2021-11-24 12:42:24] Done. Check your files:
[INFO] [2021-11-24 12:42:26] (1165 lines) /app/public/data/ptgdp/publish_nodes.tsv
[INFO] [2021-11-24 12:42:27] (8085 lines) /app/public/data/ptgdp/publish_node_ancestors.tsv
[INFO] [2021-11-24 12:42:29] (1165 lines) /app/public/data/ptgdp/publish_scientific_names.tsv
[INFO] [2021-11-24 12:42:30] (2075 lines) /app/public/data/ptgdp/publish_media.tsv
[INFO] [2021-11-24 12:42:32] (211 lines) /app/public/data/ptgdp/publish_image_info.tsv
[INFO] [2021-11-24 12:42:34] (2075 lines) /app/public/data/ptgdp/publish_attributions.tsv
[STOP] [2021-11-24 12:42:34] complete_harvest_instance
[START] [2021-11-24 12:42:34] completed
[STOP] [2021-11-24 12:42:34] completed
[STOP] [2021-11-24 12:42:34] logged process, took 270.49
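
To see where the time in a run like this one goes, the [START]/[STOP] pairs in the log can be matched up and timed. The following is a minimal sketch, not part of the harvester itself: the function name `stage_durations` and the regular expression are assumptions made for illustration, based only on the line layout visible above. [STOP] lines with no matching [START] (such as `create_harvest_instance` and the final `logged process` line) are skipped.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the log above: [LEVEL] [YYYY-MM-DD HH:MM:SS] message
# \s* after the timestamp tolerates lines such as "[ERR] [..][hdls] ..." with no space.
LINE_RE = re.compile(r"^\[(\w+)\]\s*\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s*(.*)$")


def stage_durations(lines):
    """Return (stage, seconds) pairs for every START that has a matching STOP."""
    open_stages = {}  # stage name -> time its START line was logged
    durations = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a log line (headers, blank lines)
        level, stamp, message = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if level == "START":
            open_stages[message] = when
        elif level == "STOP" and message in open_stages:
            started = open_stages.pop(message)
            durations.append((message, (when - started).total_seconds()))
    return durations


# Lines copied from the log above.
sample = [
    "[STOP] [2021-11-24 12:38:05] create_harvest_instance",
    "[START] [2021-11-24 12:38:51] match_nodes",
    "[START] [2021-11-24 12:38:51] map_all_nodes_to_pages",
    "[ERR] [2021-11-24 12:39:31][hdls] download_and_prep FAILED for Medium.find(13940197): 404 Not Found",
    "[STOP] [2021-11-24 12:40:10] map_all_nodes_to_pages",
    "[STOP] [2021-11-24 12:40:10] match_nodes",
]

result = stage_durations(sample)
for stage, seconds in result:
    print(f"{stage}: {seconds:.0f}s")
```

Because stages are tracked by name rather than on a stack, nested stages such as `map_all_nodes_to_pages` inside `match_nodes` are each timed independently. A stage name that appears more than once while still open would overwrite its earlier start time, which does not occur in this log.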
