Harvest for vimeo
Created: 03 Apr 15:27
Stage: completed
Fetched: 03 Apr 15:27
Validated: 03 Apr 15:27
Deltas Created: 03 Apr 15:27
Units Normalized: 03 Apr 15:27
Ancestry Built: 03 Apr 15:27
Nodes Matched: 03 Apr 15:27
Names Parsed: 03 Apr 15:27
New Models Stored: 03 Apr 15:27
Indexed: 03 Apr 15:27
Completed: 03 Apr 15:28
Time to Harvest: less than a minute
Harvesting Log (153 lines)
[INFO] [2023-04-03 15:27:33] Created harvest instance #4328
[STOP] [2023-04-03 15:27:33] create_harvest_instance
[START] [2023-04-03 15:27:33] fetch_files
[STOP] [2023-04-03 15:27:33] fetch_files
[START] [2023-04-03 15:27:33] validate_each_file
[INFO] [2023-04-03 15:27:33] Looping over 4 formats...
[INFO] [2023-04-03 15:27:33] ...agents (/app/public/data/DwCA/agent.tab)
[INFO] [2023-04-03 15:27:33] Valid: /app/public/data/DwCA/converted_csv/DwCA_agents_30273.csv (22 lines)
[INFO] [2023-04-03 15:27:33] ...nodes (/app/public/data/DwCA/taxon.tab)
[INFO] [2023-04-03 15:27:33] Valid: /app/public/data/DwCA/converted_csv/DwCA_nodes_30274.csv (208 lines)
[INFO] [2023-04-03 15:27:33] ...media (/app/public/data/DwCA/media_resource.tab)
[INFO] [2023-04-03 15:27:33] Valid: /app/public/data/DwCA/converted_csv/DwCA_media_30276.csv (290 lines)
[INFO] [2023-04-03 15:27:33] ...vernaculars (/app/public/data/DwCA/vernacular_name.tab)
[INFO] [2023-04-03 15:27:33] Valid: /app/public/data/DwCA/converted_csv/DwCA_vernaculars_30275.csv (22 lines)
[STOP] [2023-04-03 15:27:33] validate_each_file
[START] [2023-04-03 15:27:33] convert_to_csv
[INFO] [2023-04-03 15:27:33] Looping over 4 formats...
[INFO] [2023-04-03 15:27:33] ...agents (/app/public/data/DwCA/agent.tab)
[CMD] [2023-04-03 15:27:33] /usr/bin/sort /app/public/data/DwCA/converted_csv/DwCA_agents_30273.csv > /app/public/data/DwCA/converted_csv/DwCA_agents_30273.csv_sorted
[INFO] [2023-04-03 15:27:33] Converted: /app/public/data/DwCA/converted_csv/DwCA_agents_30273.csv (22 lines)
[INFO] [2023-04-03 15:27:33] ...nodes (/app/public/data/DwCA/taxon.tab)
[CMD] [2023-04-03 15:27:33] /usr/bin/sort /app/public/data/DwCA/converted_csv/DwCA_nodes_30274.csv > /app/public/data/DwCA/converted_csv/DwCA_nodes_30274.csv_sorted
[INFO] [2023-04-03 15:27:33] Converted: /app/public/data/DwCA/converted_csv/DwCA_nodes_30274.csv (208 lines)
[INFO] [2023-04-03 15:27:33] ...media (/app/public/data/DwCA/media_resource.tab)
[CMD] [2023-04-03 15:27:33] /usr/bin/sort /app/public/data/DwCA/converted_csv/DwCA_media_30276.csv > /app/public/data/DwCA/converted_csv/DwCA_media_30276.csv_sorted
[INFO] [2023-04-03 15:27:34] Converted: /app/public/data/DwCA/converted_csv/DwCA_media_30276.csv (290 lines)
[INFO] [2023-04-03 15:27:34] ...vernaculars (/app/public/data/DwCA/vernacular_name.tab)
[CMD] [2023-04-03 15:27:34] /usr/bin/sort /app/public/data/DwCA/converted_csv/DwCA_vernaculars_30275.csv > /app/public/data/DwCA/converted_csv/DwCA_vernaculars_30275.csv_sorted
[INFO] [2023-04-03 15:27:34] Converted: /app/public/data/DwCA/converted_csv/DwCA_vernaculars_30275.csv (22 lines)
[STOP] [2023-04-03 15:27:34] convert_to_csv
[START] [2023-04-03 15:27:34] calculate_delta
[INFO] [2023-04-03 15:27:34] Looping over 4 formats...
[INFO] [2023-04-03 15:27:34] ...agents (/app/public/data/DwCA/agent.tab)
[CMD] [2023-04-03 15:27:34] echo "0a" > /app/public/data/DwCA/diff/DwCA_agents_30273.diff
[CMD] [2023-04-03 15:27:34] tail -n +1 /app/public/data/DwCA/converted_csv/DwCA_agents_30273.csv >> /app/public/data/DwCA/diff/DwCA_agents_30273.diff
[CMD] [2023-04-03 15:27:34] echo "." >> /app/public/data/DwCA/diff/DwCA_agents_30273.diff
[INFO] [2023-04-03 15:27:34] Created diff: /app/public/data/DwCA/diff/DwCA_agents_30273.diff (24 lines)
[INFO] [2023-04-03 15:27:34] ...nodes (/app/public/data/DwCA/taxon.tab)
[CMD] [2023-04-03 15:27:34] echo "0a" > /app/public/data/DwCA/diff/DwCA_nodes_30274.diff
[CMD] [2023-04-03 15:27:35] tail -n +1 /app/public/data/DwCA/converted_csv/DwCA_nodes_30274.csv >> /app/public/data/DwCA/diff/DwCA_nodes_30274.diff
[CMD] [2023-04-03 15:27:35] echo "." >> /app/public/data/DwCA/diff/DwCA_nodes_30274.diff
[INFO] [2023-04-03 15:27:35] Created diff: /app/public/data/DwCA/diff/DwCA_nodes_30274.diff (210 lines)
[INFO] [2023-04-03 15:27:35] ...media (/app/public/data/DwCA/media_resource.tab)
[CMD] [2023-04-03 15:27:35] echo "0a" > /app/public/data/DwCA/diff/DwCA_media_30276.diff
[CMD] [2023-04-03 15:27:35] tail -n +1 /app/public/data/DwCA/converted_csv/DwCA_media_30276.csv >> /app/public/data/DwCA/diff/DwCA_media_30276.diff
[CMD] [2023-04-03 15:27:35] echo "." >> /app/public/data/DwCA/diff/DwCA_media_30276.diff
[INFO] [2023-04-03 15:27:36] Created diff: /app/public/data/DwCA/diff/DwCA_media_30276.diff (292 lines)
[INFO] [2023-04-03 15:27:36] ...vernaculars (/app/public/data/DwCA/vernacular_name.tab)
[CMD] [2023-04-03 15:27:36] echo "0a" > /app/public/data/DwCA/diff/DwCA_vernaculars_30275.diff
[CMD] [2023-04-03 15:27:36] tail -n +1 /app/public/data/DwCA/converted_csv/DwCA_vernaculars_30275.csv >> /app/public/data/DwCA/diff/DwCA_vernaculars_30275.diff
[CMD] [2023-04-03 15:27:36] echo "." >> /app/public/data/DwCA/diff/DwCA_vernaculars_30275.diff
[INFO] [2023-04-03 15:27:36] Created diff: /app/public/data/DwCA/diff/DwCA_vernaculars_30275.diff (24 lines)
[STOP] [2023-04-03 15:27:36] calculate_delta
[START] [2023-04-03 15:27:36] parse_diff_and_store
[INFO] [2023-04-03 15:27:36] Handling diff: /app/public/data/DwCA/diff/DwCA_agents_30273.diff (24 lines)
[INFO] [2023-04-03 15:27:36] Loading agents diff file into memory (24 lines)...
[INFO] [2023-04-03 15:27:36] Storing 22 Attributions (22/22/24)
[INFO] [2023-04-03 15:27:36] Handling diff: /app/public/data/DwCA/diff/DwCA_nodes_30274.diff (210 lines)
[INFO] [2023-04-03 15:27:37] Loading nodes diff file into memory (210 lines)...
[INFO] [2023-04-03 15:27:37] Storing 304 ScientificNames (608/208/210)
[INFO] [2023-04-03 15:27:37] Storing 304 Nodes (608/208/210)
[INFO] [2023-04-03 15:27:37] Handling diff: /app/public/data/DwCA/diff/DwCA_media_30276.diff (292 lines)
[INFO] [2023-04-03 15:27:37] Loading media diff file into memory (292 lines)...
[INFO] [2023-04-03 15:27:38] Storing 290 ContentAttributions (580/290/292)
[INFO] [2023-04-03 15:27:38] Storing 290 Media (580/290/292)
[INFO] [2023-04-03 15:27:38] Handling diff: /app/public/data/DwCA/diff/DwCA_vernaculars_30275.diff (24 lines)
[INFO] [2023-04-03 15:27:38] Loading vernaculars diff file into memory (24 lines)...
[INFO] [2023-04-03 15:27:38] Storing 22 Vernaculars (22/22/24)
[STOP] [2023-04-03 15:27:38] parse_diff_and_store
[START] [2023-04-03 15:27:38] resolve_keys
[2023-04-03 15:27:40] Resolving downloaded urls (this is not actually downloading them yet)
[INFO] [2023-04-03 15:27:47] Occurrences to nodes (through scientific_names)...
[INFO] [2023-04-03 15:27:47] traits to occurrences...
[INFO] [2023-04-03 15:27:47] traits to nodes (through occurrences)...
[INFO] [2023-04-03 15:27:47] Traits to sex term...
[INFO] [2023-04-03 15:27:47] Traits to lifestage term...
[INFO] [2023-04-03 15:27:47] MetaTraits to traits...
[INFO] [2023-04-03 15:27:47] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2023-04-03 15:27:47] Assocs to occurrences...
[INFO] [2023-04-03 15:27:47] Assocs to nodes...
[INFO] [2023-04-03 15:27:47] Assoc to sex term...
[INFO] [2023-04-03 15:27:47] Assoc to lifestage term...
[INFO] [2023-04-03 15:27:47] MetaAssoc to assocs...
[STOP] [2023-04-03 15:27:47] resolve_keys
[START] [2023-04-03 15:27:47] hold_for_later_1
[STOP] [2023-04-03 15:27:47] hold_for_later_1
[START] [2023-04-03 15:27:47] hold_for_later_2
[STOP] [2023-04-03 15:27:47] hold_for_later_2
[START] [2023-04-03 15:27:47] resolve_missing_parents
[STOP] [2023-04-03 15:27:47] resolve_missing_parents
[START] [2023-04-03 15:27:47] rebuild_nodes
[START] [2023-04-03 15:27:47] Flattener#flatten
[START] [2023-04-03 15:27:47] Flattener#study_resource
[START] [2023-04-03 15:27:47] Flattener#build_ancestry
[STOP] [2023-04-03 15:27:47] Flattener#build_ancestry
[INFO] [2023-04-03 15:27:47] 304 ancestry keys
[START] [2023-04-03 15:27:47] build_node_ancestors
[INFO] [2023-04-03 15:27:47] old ancestors deleted.
[STOP] [2023-04-03 15:27:47] build_node_ancestors
[START] [2023-04-03 15:27:47] Flattener#propagate_ancestor_ids
[STOP] [2023-04-03 15:27:47] Flattener#propagate_ancestor_ids
[STOP] [2023-04-03 15:27:47] Flattener#flatten
[STOP] [2023-04-03 15:27:47] rebuild_nodes
[START] [2023-04-03 15:27:47] resolve_missing_media_owners
[STOP] [2023-04-03 15:27:47] resolve_missing_media_owners
[START] [2023-04-03 15:27:47] sanitize_media_verbatims
[STOP] [2023-04-03 15:27:47] sanitize_media_verbatims
[START] [2023-04-03 15:27:47] queue_downloads
[STOP] [2023-04-03 15:27:47] queue_downloads
[START] [2023-04-03 15:27:47] parse_names
[WARN] [2023-04-03 15:27:47] I see 304 names which still need to be parsed.
[WARN] [2023-04-03 15:27:47] Names to parse: 304 formatted: 304 learned: 297 parsed: 304
[STOP] [2023-04-03 15:27:49] parse_names
[START] [2023-04-03 15:27:49] denormalize_canonical_names_to_nodes
[STOP] [2023-04-03 15:27:49] denormalize_canonical_names_to_nodes
[START] [2023-04-03 15:27:49] match_nodes
[START] [2023-04-03 15:27:49] map_all_nodes_to_pages
[INFO] [2023-04-03 15:27:49] 0% of media downloaded
[STOP] [2023-04-03 15:27:58] map_all_nodes_to_pages
[INFO] [2023-04-03 15:27:58] 32 Unmatched nodes (of 304)! That's too many to output. Full list in /app/public/data/DwCA/unmatched_nodes.txt ; First 10: Canonical: Aves; Node#134263255; ResourceID: 1258169b9360156b673791b4051d5aed; Canonical: Forskalia; Node#134263227; ResourceID: 0464bc0bb174d6d89a93f0c2dfc498e2; Canonical: Arthropoda; Node#134263277; ResourceID: Animalia/Arthropoda; Canonical: Clytia hemispherica; Node#134263308; ResourceID: 5312804185e4a25b11e94281a6c5bd76; Canonical: Mollusca; Node#134263320; ResourceID: Animalia/Mollusca; Canonical: Oxygirus; Node#134263324; ResourceID: Animalia/Mollusca/Gastropoda/Littorinimorpha/Atlantidae/Oxygirus; Canonical: Oxygirus; Node#134263325; ResourceID: 5af8928a4d55701181828cf424a5a2e6; Canonical: Atlanta peroni; Node#134263453; ResourceID: c686f5052aa2592eed21c48e8b2ad8b3; Canonical: Firoloida desmaresti; Node#134263502; ResourceID: f8981ca865e7b84ba7c8f4c2a956b322; Canonical: Thecosomata; Node#134263358; ResourceID: Animalia/Mollusca/Gastropoda/Thecosomata
[START] [2023-04-03 15:27:58] update_nodes
[STOP] [2023-04-03 15:27:59] update_nodes
[STOP] [2023-04-03 15:27:59] match_nodes
[START] [2023-04-03 15:27:59] reindex_search
[STOP] [2023-04-03 15:27:59] reindex_search
[START] [2023-04-03 15:27:59] normalize_units
[STOP] [2023-04-03 15:27:59] normalize_units
[START] [2023-04-03 15:27:59] calculate_statistics
[INFO] [2023-04-03 15:27:59] 0% of media downloaded
[INFO] [2023-04-03 15:28:00] Duplicate page_id count: 0
[STOP] [2023-04-03 15:28:00] calculate_statistics
[START] [2023-04-03 15:28:00] complete_harvest_instance
[START] [2023-04-03 15:28:00] overall_tsv_creation
[INFO] [2023-04-03 15:28:00] Exporting 304 nodes as TSV in batches of 10000...
[INFO] [2023-04-03 15:28:00] Processing group of 304 in 1 batches of 10000
[INFO] [2023-04-03 15:28:00] 0% of media downloaded
[INFO] [2023-04-03 15:28:01] Processed 304/304 nodes
[INFO] [2023-04-03 15:28:01] Average Time: 0.72
[INFO] [2023-04-03 15:28:01] Total Time: 1s
[STOP] [2023-04-03 15:28:01] overall_tsv_creation
[INFO] [2023-04-03 15:28:01] Done. Check your files:
[INFO] [2023-04-03 15:28:01] 0% of media downloaded
[INFO] [2023-04-03 15:28:01] (304 lines) /app/public/data/DwCA/publish_nodes.tsv
[INFO] [2023-04-03 15:28:01] (425 lines) /app/public/data/DwCA/publish_node_ancestors.tsv
[INFO] [2023-04-03 15:28:01] (304 lines) /app/public/data/DwCA/publish_scientific_names.tsv
[INFO] [2023-04-03 15:28:02] (290 lines) /app/public/data/DwCA/publish_media.tsv
[INFO] [2023-04-03 15:28:02] (22 lines) /app/public/data/DwCA/publish_vernaculars.tsv
[INFO] [2023-04-03 15:28:02] (290 lines) /app/public/data/DwCA/publish_attributions.tsv
[STOP] [2023-04-03 15:28:02] complete_harvest_instance
[START] [2023-04-03 15:28:02] completed
[STOP] [2023-04-03 15:28:02] completed
[STOP] [2023-04-03 15:28:02] logged process, took 29.6
[INFO] [2023-04-03 15:28:02] 0% of media downloaded
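The calculate_delta commands in the log write "0a", then the whole converted CSV, then "." into each .diff file. That looks like an ed-style script that appends every row after line 0, which is what one would expect when there is no earlier harvest to compare against. The sketch below reproduces that wrapping in memory. The function name build_initial_diff and the sample rows are hypothetical, and this is not the harvester's own code.

```python
def build_initial_diff(csv_lines):
    """Wrap all rows of a new file in an ed-style 'append after line 0' script.

    Hypothetical helper: mirrors the three shell commands in the log
    (echo "0a", tail -n +1 <csv>, echo ".") without touching the filesystem.
    """
    # "0a" opens an append at line 0; a lone "." ends the appended text.
    return ["0a", *csv_lines, "."]


# Made-up sample rows; the real files hold agents, nodes, media and vernaculars.
sample_rows = ["row-1,first", "row-2,second", "row-3,third"]
diff_lines = build_initial_diff(sample_rows)

# The wrapper always adds exactly two lines to the row count.
print(len(sample_rows), "rows ->", len(diff_lines), "diff lines")
```

The two extra lines account for the counts reported in the log: 22 rows become a 24-line diff, 208 become 210, and 290 become 292.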
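To see where the time in a harvest goes, the [START] and [STOP] markers can be paired up by step name. The following is a minimal sketch that assumes every line has the "[LEVEL] [timestamp] message" shape seen in the log above; the name stage_durations and the regular expression are illustrative assumptions, not part of the harvester. Lines without a level tag, and [STOP] lines with no matching [START], are skipped.

```python
import re
from datetime import datetime

# Assumed line shape: "[START] [2023-04-03 15:27:38] resolve_keys"
LINE_RE = re.compile(
    r"^\[(START|STOP)\] \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.+)$"
)


def stage_durations(lines):
    """Return {step name: seconds between its START and STOP markers}."""
    started = {}
    durations = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # INFO, WARN, CMD and untagged lines are ignored
        kind, stamp, name = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "START":
            started[name] = when
        elif name in started:
            # Timestamps have one-second resolution, so results are whole seconds.
            durations[name] = (when - started.pop(name)).total_seconds()
    return durations


# A few lines copied from the log above.
sample = [
    "[START] [2023-04-03 15:27:38] resolve_keys",
    "[INFO] [2023-04-03 15:27:47] Assocs to nodes...",
    "[STOP] [2023-04-03 15:27:47] resolve_keys",
    "[START] [2023-04-03 15:27:49] map_all_nodes_to_pages",
    "[STOP] [2023-04-03 15:27:58] map_all_nodes_to_pages",
    "[STOP] [2023-04-03 15:28:02] logged process, took 29.6",
]
result = stage_durations(sample)
print(result)
```

On these sample lines, resolve_keys and map_all_nodes_to_pages each span 9 seconds, which together cover most of the 29.6 seconds reported for the whole process.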