Harvest for protisten.de, created 19 Jul 16:16

Stage: completed
Fetched: 19 Jul 16:16
Validated: 19 Jul 16:16
Deltas Created: 19 Jul 16:16
Units Normalized: 19 Jul 16:17
Ancestry Built: 19 Jul 16:16
Nodes Matched: 19 Jul 16:17
Names Parsed: 19 Jul 16:16
New Models Stored: 19 Jul 16:16
Indexed: 19 Jul 16:17
Completed: 19 Jul 16:19
Time to Harvest: less than a minute

Harvesting Log (oldest first)

[INFO] [2021-07-19 16:16:05] Created harvest instance #4043
[STOP] [2021-07-19 16:16:05] create_harvest_instance
[START] [2021-07-19 16:16:05] fetch_files
[STOP] [2021-07-19 16:16:05] fetch_files
[START] [2021-07-19 16:16:05] validate_each_file
[INFO] [2021-07-19 16:16:05] Looping over 3 formats...
[INFO] [2021-07-19 16:16:05] ...agents (/app/public/data/ptgdp/agent.tab)
[INFO] [2021-07-19 16:16:05] Valid: /app/public/converted_csv/ptgdp_agents_4043.csv (1 lines)
[INFO] [2021-07-19 16:16:05] ...nodes (/app/public/data/ptgdp/taxon.tab)
[INFO] [2021-07-19 16:16:05] Valid: /app/public/converted_csv/ptgdp_nodes_4043.csv (1148 lines)
[INFO] [2021-07-19 16:16:05] ...media (/app/public/data/ptgdp/media_resource.tab)
[INFO] [2021-07-19 16:16:05] Valid: /app/public/converted_csv/ptgdp_media_4043.csv (1976 lines)
[STOP] [2021-07-19 16:16:05] validate_each_file
[START] [2021-07-19 16:16:05] convert_to_csv
[INFO] [2021-07-19 16:16:05] Looping over 3 formats...
[INFO] [2021-07-19 16:16:05] ...agents (/app/public/data/ptgdp/agent.tab)
[CMD] [2021-07-19 16:16:05] /usr/bin/sort /app/public/converted_csv/ptgdp_agents_4043.csv > /app/public/converted_csv/ptgdp_agents_4043.csv_sorted
[INFO] [2021-07-19 16:16:05] Converted: /app/public/converted_csv/ptgdp_agents_4043.csv (1 lines)
[INFO] [2021-07-19 16:16:05] ...nodes (/app/public/data/ptgdp/taxon.tab)
[CMD] [2021-07-19 16:16:05] /usr/bin/sort /app/public/converted_csv/ptgdp_nodes_4043.csv > /app/public/converted_csv/ptgdp_nodes_4043.csv_sorted
[INFO] [2021-07-19 16:16:05] Converted: /app/public/converted_csv/ptgdp_nodes_4043.csv (1148 lines)
[INFO] [2021-07-19 16:16:05] ...media (/app/public/data/ptgdp/media_resource.tab)
[CMD] [2021-07-19 16:16:05] /usr/bin/sort /app/public/converted_csv/ptgdp_media_4043.csv > /app/public/converted_csv/ptgdp_media_4043.csv_sorted
[INFO] [2021-07-19 16:16:05] Converted: /app/public/converted_csv/ptgdp_media_4043.csv (1976 lines)
[STOP] [2021-07-19 16:16:05] convert_to_csv
[START] [2021-07-19 16:16:05] calculate_delta
[INFO] [2021-07-19 16:16:05] Looping over 3 formats...
[INFO] [2021-07-19 16:16:05] ...agents (/app/public/data/ptgdp/agent.tab)
[CMD] [2021-07-19 16:16:05] echo "0a" > /app/public/diff/ptgdp_agents_4043.diff
[CMD] [2021-07-19 16:16:05] tail -n +1 /app/public/converted_csv/ptgdp_agents_4043.csv >> /app/public/diff/ptgdp_agents_4043.diff
[CMD] [2021-07-19 16:16:05] echo "." >> /app/public/diff/ptgdp_agents_4043.diff
[INFO] [2021-07-19 16:16:05] Created diff: /app/public/diff/ptgdp_agents_4043.diff (3 lines)
[INFO] [2021-07-19 16:16:05] ...nodes (/app/public/data/ptgdp/taxon.tab)
[CMD] [2021-07-19 16:16:05] echo "0a" > /app/public/diff/ptgdp_nodes_4043.diff
[CMD] [2021-07-19 16:16:05] tail -n +1 /app/public/converted_csv/ptgdp_nodes_4043.csv >> /app/public/diff/ptgdp_nodes_4043.diff
[CMD] [2021-07-19 16:16:05] echo "." >> /app/public/diff/ptgdp_nodes_4043.diff
[INFO] [2021-07-19 16:16:05] Created diff: /app/public/diff/ptgdp_nodes_4043.diff (1150 lines)
[INFO] [2021-07-19 16:16:05] ...media (/app/public/data/ptgdp/media_resource.tab)
[CMD] [2021-07-19 16:16:05] echo "0a" > /app/public/diff/ptgdp_media_4043.diff
[CMD] [2021-07-19 16:16:05] tail -n +1 /app/public/converted_csv/ptgdp_media_4043.csv >> /app/public/diff/ptgdp_media_4043.diff
[CMD] [2021-07-19 16:16:05] echo "." >> /app/public/diff/ptgdp_media_4043.diff
[INFO] [2021-07-19 16:16:05] Created diff: /app/public/diff/ptgdp_media_4043.diff (1978 lines)
[STOP] [2021-07-19 16:16:05] calculate_delta
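
Note on the calculate_delta step: the three [CMD] lines per format appear to build an ed-style script by hand. "0a" appends text after line 0, the whole converted CSV follows, and a lone "." closes the input. That accounts for each diff being two lines longer than its CSV (1 to 3, 1148 to 1150, 1976 to 1978). A minimal Python sketch of the same construction, assuming a first harvest with nothing to compare against; the function name and file names are hypothetical and not taken from the harvester:

```python
import tempfile
from pathlib import Path


def write_full_insert_diff(csv_path: Path, diff_path: Path) -> int:
    """Write an ed-style script that appends the whole CSV after line 0.

    Mirrors: echo "0a" > diff; tail -n +1 csv >> diff; echo "." >> diff
    Returns the number of lines written to the diff file.
    """
    body = csv_path.read_text(encoding="utf-8")
    if body and not body.endswith("\n"):
        body += "\n"  # keep the closing "." on a line of its own
    diff_path.write_text("0a\n" + body + ".\n", encoding="utf-8")
    return body.count("\n") + 2  # CSV lines plus the "0a" and "." lines


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    csv_file = workdir / "example_agents.csv"  # hypothetical one-line CSV
    csv_file.write_text("1,Example Agent\n", encoding="utf-8")
    print(write_full_insert_diff(csv_file, workdir / "example_agents.diff"))
```
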
[START] [2021-07-19 16:16:05] parse_diff_and_store
[INFO] [2021-07-19 16:16:05] Handling diff: /app/public/diff/ptgdp_agents_4043.diff (3 lines)
[INFO] [2021-07-19 16:16:05] Loading agents diff file into memory (3 /app/public/diff/ptgdp_agents_4043.diff lines)...
[INFO] [2021-07-19 16:16:05] Handling diff: /app/public/diff/ptgdp_nodes_4043.diff (1150 lines)
[INFO] [2021-07-19 16:16:05] Loading nodes diff file into memory (1150 /app/public/diff/ptgdp_nodes_4043.diff lines)...
[WARN] [2021-07-19 16:16:05] Filtered Scientific Name `Bryozoa/ Moostierchen` to `Bryozoa Moostierchen`
[WARN] [2021-07-19 16:16:05] Filtered Scientific Name `Rhodophyta/Rotalge` to `RhodophytaRotalge`
[WARN] [2021-07-19 16:16:05] Filtered Scientific Name `Chlorophyta-Cyste/Chlorophyta cyst` to `Chlorophyta-CysteChlorophyta cyst`
[WARN] [2021-07-19 16:16:06] Filtered Scientific Name `Cyanobacteria/Melainabacteria group` to `CyanobacteriaMelainabacteria group`
[INFO] [2021-07-19 16:16:06] Handling diff: /app/public/diff/ptgdp_media_4043.diff (1978 lines)
[INFO] [2021-07-19 16:16:06] Loading media diff file into memory (1978 /app/public/diff/ptgdp_media_4043.diff lines)...
[INFO] [2021-07-19 16:16:07] Storing 1 Attributions
[INFO] [2021-07-19 16:16:07] Processing group of 1 in 1 groups of 1000
[INFO] [2021-07-19 16:16:07] Average Time: 0.01
[INFO] [2021-07-19 16:16:07] Total Time: 1s
[INFO] [2021-07-19 16:16:07] Storing 1148 ScientificNames
[INFO] [2021-07-19 16:16:07] Processing group of 1148 in 2 groups of 1000
[INFO] [2021-07-19 16:16:08] Average Time: 0.26
[INFO] [2021-07-19 16:16:08] Total Time: 1s
[INFO] [2021-07-19 16:16:08] Storing 1148 Nodes
[INFO] [2021-07-19 16:16:08] Processing group of 1148 in 2 groups of 1000
[INFO] [2021-07-19 16:16:08] Average Time: 0.18
[INFO] [2021-07-19 16:16:08] Total Time: 1s
[INFO] [2021-07-19 16:16:08] Storing 1976 ContentAttributions
[INFO] [2021-07-19 16:16:08] Processing group of 1976 in 2 groups of 1000
[INFO] [2021-07-19 16:16:08] Average Time: 0.15
[INFO] [2021-07-19 16:16:08] Total Time: 1s
[INFO] [2021-07-19 16:16:08] Storing 1976 Media
[INFO] [2021-07-19 16:16:08] Processing group of 1976 in 2 groups of 1000
[INFO] [2021-07-19 16:16:09] Average Time: 0.54
[INFO] [2021-07-19 16:16:09] Total Time: 2s
[STOP] [2021-07-19 16:16:09] parse_diff_and_store
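
Note on the four "Filtered Scientific Name" warnings above: in every logged case the only visible change is that the forward slash is dropped and nothing is put in its place, so names with no space after the slash end up fused (for example `RhodophytaRotalge`). The sketch below reproduces just that behaviour. It is an assumption inferred from these four log lines; the harvester's actual filter is not shown here and may remove other characters as well. The function name is hypothetical.

```python
def filter_scientific_name(name: str) -> str:
    """Drop forward slashes; all other characters, including spaces, are kept."""
    return name.replace("/", "")


# Input/output pairs copied from the [WARN] lines in this log.
LOGGED_CASES = {
    "Bryozoa/ Moostierchen": "Bryozoa Moostierchen",
    "Rhodophyta/Rotalge": "RhodophytaRotalge",
    "Chlorophyta-Cyste/Chlorophyta cyst": "Chlorophyta-CysteChlorophyta cyst",
    "Cyanobacteria/Melainabacteria group": "CyanobacteriaMelainabacteria group",
}

if __name__ == "__main__":
    for raw, logged in LOGGED_CASES.items():
        result = filter_scientific_name(raw)
        status = "matches log" if result == logged else "differs from log"
        print(f"{raw!r} -> {result!r} ({status})")
```
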
[START] [2021-07-19 16:16:09] resolve_keys
[INFO] [2021-07-19 16:16:18] Occurrences to nodes (through scientific_names)...
[INFO] [2021-07-19 16:16:18] traits to occurrences...
[INFO] [2021-07-19 16:16:18] traits to nodes (through occurrences)...
[INFO] [2021-07-19 16:16:18] Traits to sex term...
[INFO] [2021-07-19 16:16:18] Traits to lifestage term...
[INFO] [2021-07-19 16:16:18] MetaTraits to traits...
[INFO] [2021-07-19 16:16:18] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2021-07-19 16:16:18] Assocs to occurrences...
[INFO] [2021-07-19 16:16:18] Assocs to nodes...
[INFO] [2021-07-19 16:16:18] Assoc to sex term...
[INFO] [2021-07-19 16:16:18] Assoc to lifestage term...
[INFO] [2021-07-19 16:16:18] MetaAssoc to assocs...
[STOP] [2021-07-19 16:16:18] resolve_keys
[START] [2021-07-19 16:16:18] hold_for_later_1
[STOP] [2021-07-19 16:16:18] hold_for_later_1
[START] [2021-07-19 16:16:18] hold_for_later_2
[STOP] [2021-07-19 16:16:18] hold_for_later_2
[START] [2021-07-19 16:16:18] resolve_missing_parents
[STOP] [2021-07-19 16:16:18] resolve_missing_parents
[START] [2021-07-19 16:16:18] rebuild_nodes
[START] [2021-07-19 16:16:18] Flattener#flatten
[START] [2021-07-19 16:16:18] Flattener#study_resource
[START] [2021-07-19 16:16:18] Flattener#build_ancestry
[STOP] [2021-07-19 16:16:18] Flattener#build_ancestry
[INFO] [2021-07-19 16:16:18] 1148 ancestry keys
[START] [2021-07-19 16:16:18] build_node_ancestors
[INFO] [2021-07-19 16:16:18] old ancestors deleted.
[STOP] [2021-07-19 16:16:18] build_node_ancestors
[START] [2021-07-19 16:16:19] Flattener#propagate_ancestor_ids
[STOP] [2021-07-19 16:16:19] Flattener#propagate_ancestor_ids
[STOP] [2021-07-19 16:16:19] Flattener#flatten
[STOP] [2021-07-19 16:16:19] rebuild_nodes
[START] [2021-07-19 16:16:19] resolve_missing_media_owners
[STOP] [2021-07-19 16:16:19] resolve_missing_media_owners
[START] [2021-07-19 16:16:19] sanitize_media_verbatims
[STOP] [2021-07-19 16:16:19] sanitize_media_verbatims
[START] [2021-07-19 16:16:19] queue_downloads
[STOP] [2021-07-19 16:16:19] queue_downloads
[START] [2021-07-19 16:16:19] parse_names
[WARN] [2021-07-19 16:16:19] I see 1148 names which still need to be parsed.
[WARN] [2021-07-19 16:16:21] I see 20 names which still need to be parsed.
[STOP] [2021-07-19 16:16:22] parse_names
[START] [2021-07-19 16:16:22] denormalize_canonical_names_to_nodes
[STOP] [2021-07-19 16:16:22] denormalize_canonical_names_to_nodes
[START] [2021-07-19 16:16:22] match_nodes
[START] [2021-07-19 16:16:22] map_all_nodes_to_pages
[ERR] [2021-07-19 16:16:46][hdls] download_and_prep FAILED for Medium.find(13135188): 404 Not Found
[STOP] [2021-07-19 16:17:21] map_all_nodes_to_pages
[INFO] [2021-07-19 16:17:21] 69 Unmatched nodes (of 1148)! That's too many to output. Full list in /app/public/data/ptgdp/unmatched_nodes.txt ; First 10:
    Canonical: Siderocapsa; Node#97218153; ResourceID: 976a61ae86cac16ed1137606867d2dce
    Canonical: Chroococcus turgidus with; Node#97218644; ResourceID: ff613204e7336d2555aa9ae401b4cebd
    Canonical: Eucaryota; Node#97218571; ResourceID: eucaryota
    Canonical: Nassulophorea; Node#97218696; ResourceID: intramacronucleata-nassulophorea
    Canonical: Peniculiada; Node#97218751; ResourceID: oligohymenophorea-peniculiada
    Canonical: Protostomatea; Node#97218700; ResourceID: intramacronucleata-protostomatea
    Canonical: Alveolates; Node#97217777; ResourceID: sar_(stramenopiles,_alveolates,_rhizaria)-alveolates
    Canonical: Tecofilosea; Node#97218363; ResourceID: cercozoa-tecofilosea
    Canonical: Microgromia socialis; Node#97218632; ResourceID: f9686b84a29a24c9287e57cdb7df43ba
    Canonical: Bicosida; Node#97218303; ResourceID: bicoecea-bicosida
[START] [2021-07-19 16:17:21] update_nodes
[STOP] [2021-07-19 16:17:22] update_nodes
[STOP] [2021-07-19 16:17:22] match_nodes
[START] [2021-07-19 16:17:22] reindex_search
[STOP] [2021-07-19 16:17:23] reindex_search
[START] [2021-07-19 16:17:23] normalize_units
[STOP] [2021-07-19 16:17:23] normalize_units
[START] [2021-07-19 16:17:23] calculate_statistics
[STOP] [2021-07-19 16:17:23] calculate_statistics
[START] [2021-07-19 16:17:23] complete_harvest_instance
[START] [2021-07-19 16:17:23] overall_tsv_creation
[INFO] [2021-07-19 16:17:23] Processing group of 1148 in 1 batches of 10000
[ERR] [2021-07-19 16:17:37][hdls] download_and_prep FAILED for Medium.find(13135506): 404 Not Found
[ERR] [2021-07-19 16:18:20][hdls] download_and_prep FAILED for Medium.find(13135897): 404 Not Found
[ERR] [2021-07-19 16:19:01][hdls] download_and_prep FAILED for Medium.find(13136254): 404 Not Found
[INFO] [2021-07-19 16:19:18] Average Time: 17.23
[INFO] [2021-07-19 16:19:18] Total Time: 1m55s
[STOP] [2021-07-19 16:19:18] overall_tsv_creation
[INFO] [2021-07-19 16:19:18] Done. Check your files:
[INFO] [2021-07-19 16:19:18] (1148 lines) /app/public/data/ptgdp/publish_nodes.tsv
[INFO] [2021-07-19 16:19:18] (7970 lines) /app/public/data/ptgdp/publish_node_ancestors.tsv
[INFO] [2021-07-19 16:19:18] (1148 lines) /app/public/data/ptgdp/publish_scientific_names.tsv
[INFO] [2021-07-19 16:19:18] (1976 lines) /app/public/data/ptgdp/publish_media.tsv
[INFO] [2021-07-19 16:19:18] (249 lines) /app/public/data/ptgdp/publish_image_info.tsv
[INFO] [2021-07-19 16:19:18] (1976 lines) /app/public/data/ptgdp/publish_attributions.tsv
[STOP] [2021-07-19 16:19:18] complete_harvest_instance
[START] [2021-07-19 16:19:18] completed
[STOP] [2021-07-19 16:19:18] completed
[STOP] [2021-07-19 16:19:18] logged process, took 193.55
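
Note on reading the timings: each phase is bracketed by a [START] and a [STOP] line carrying the same name, so per-phase durations can be recovered by pairing them. The sketch below does this for lines in the format shown in this log, read in chronological order. The regular expression and function name are illustrative assumptions, not part of the harvester. [STOP] lines with no matching [START] (such as the final "logged process" line) are skipped.

```python
import re
from datetime import datetime

# Matches e.g. "[START] [2021-07-19 16:16:22] match_nodes"
PHASE_LINE = re.compile(
    r"^\[(START|STOP)\] \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.+)$"
)


def phase_durations(lines):
    """Return {phase name: seconds} for every START/STOP pair found."""
    started = {}
    durations = {}
    for line in lines:
        match = PHASE_LINE.match(line.strip())
        if match is None:
            continue  # INFO, WARN, ERR, CMD and header lines are ignored
        kind, stamp, name = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "START":
            started[name] = when
        elif name in started:
            durations[name] = (when - started.pop(name)).total_seconds()
    return durations


if __name__ == "__main__":
    sample = [
        "[START] [2021-07-19 16:17:23] overall_tsv_creation",
        "[INFO] [2021-07-19 16:19:18] Total Time: 1m55s",
        "[STOP] [2021-07-19 16:19:18] overall_tsv_creation",
    ]
    for name, seconds in phase_durations(sample).items():
        print(f"{name}: {seconds:.0f}s")
```

Applied to the sample, the pairing gives 115 seconds for overall_tsv_creation, in line with the "Total Time: 1m55s" entry logged for that phase.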
