Harvest for wikipedia EN (created 02 Aug 08:07)

Stage: completed
Fetched: 02 Aug 08:07
Validated: 02 Aug 08:08
Deltas Created: 02 Aug 08:08
New Models Stored: 02 Aug 12:27
Ancestry Built: 02 Aug 13:24
Names Parsed: 02 Aug 13:30
Nodes Matched: 02 Aug 18:48
Indexed: 02 Aug 19:04
Units Normalized: 02 Aug 19:04
Completed: 02 Aug 20:40
Time to Harvest: 12 hours 33 minutes

Harvesting Log (oldest first)

[INFO] [2021-08-02 08:07:09] Created harvest instance #4047
[STOP] [2021-08-02 08:07:09] create_harvest_instance
[START] [2021-08-02 08:07:09] fetch_files
[STOP] [2021-08-02 08:07:09] fetch_files
[START] [2021-08-02 08:07:09] validate_each_file
[INFO] [2021-08-02 08:07:09] Created new folder: /app/public/converted_csv
[INFO] [2021-08-02 08:07:09] Looping over 2 formats...
[INFO] [2021-08-02 08:07:09] ...nodes (/app/public/data/wiki_english/taxon.tab)
[INFO] [2021-08-02 08:07:21] Valid: /app/public/converted_csv/wiki_english_nodes_4047.csv (430102 lines)
[INFO] [2021-08-02 08:07:21] ...media (/app/public/data/wiki_english/media_resource.tab)
[INFO] [2021-08-02 08:08:37] Valid: /app/public/converted_csv/wiki_english_media_4047.csv (814055 lines)
[STOP] [2021-08-02 08:08:37] validate_each_file
[START] [2021-08-02 08:08:37] convert_to_csv
[INFO] [2021-08-02 08:08:37] Looping over 2 formats...
[INFO] [2021-08-02 08:08:37] ...nodes (/app/public/data/wiki_english/taxon.tab)
[CMD] [2021-08-02 08:08:37] /usr/bin/sort /app/public/converted_csv/wiki_english_nodes_4047.csv > /app/public/converted_csv/wiki_english_nodes_4047.csv_sorted
[INFO] [2021-08-02 08:08:37] Converted: /app/public/converted_csv/wiki_english_nodes_4047.csv (430102 lines)
[INFO] [2021-08-02 08:08:37] ...media (/app/public/data/wiki_english/media_resource.tab)
[CMD] [2021-08-02 08:08:37] /usr/bin/sort /app/public/converted_csv/wiki_english_media_4047.csv > /app/public/converted_csv/wiki_english_media_4047.csv_sorted
[INFO] [2021-08-02 08:08:47] Converted: /app/public/converted_csv/wiki_english_media_4047.csv (814055 lines)
[STOP] [2021-08-02 08:08:47] convert_to_csv
[START] [2021-08-02 08:08:47] calculate_delta
[INFO] [2021-08-02 08:08:49] Created diff dir: /app/public/diff
[INFO] [2021-08-02 08:08:49] Looping over 2 formats...
[INFO] [2021-08-02 08:08:49] ...nodes (/app/public/data/wiki_english/taxon.tab)
[CMD] [2021-08-02 08:08:49] echo "0a" > /app/public/diff/wiki_english_nodes_4047.diff
[CMD] [2021-08-02 08:08:49] tail -n +1 /app/public/converted_csv/wiki_english_nodes_4047.csv >> /app/public/diff/wiki_english_nodes_4047.diff
[CMD] [2021-08-02 08:08:49] echo "." >> /app/public/diff/wiki_english_nodes_4047.diff
[INFO] [2021-08-02 08:08:50] Created diff: /app/public/diff/wiki_english_nodes_4047.diff (430104 lines)
[INFO] [2021-08-02 08:08:50] ...media (/app/public/data/wiki_english/media_resource.tab)
[CMD] [2021-08-02 08:08:50] echo "0a" > /app/public/diff/wiki_english_media_4047.diff
[CMD] [2021-08-02 08:08:50] tail -n +1 /app/public/converted_csv/wiki_english_media_4047.csv >> /app/public/diff/wiki_english_media_4047.diff
[CMD] [2021-08-02 08:08:53] echo "." >> /app/public/diff/wiki_english_media_4047.diff
[INFO] [2021-08-02 08:08:58] Created diff: /app/public/diff/wiki_english_media_4047.diff (814057 lines)
[STOP] [2021-08-02 08:08:58] calculate_delta
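
The three commands logged for each format above assemble the delta by hand instead of running diff: `0a` is the ed command "append after line 0", the CSV rows follow as the appended text, and a lone `.` closes the input. That is why each diff is exactly two lines longer than its CSV (430102 + 2 = 430104 for nodes, 814055 + 2 = 814057 for media). A minimal Python sketch of the same construction follows. The function name and the sample rows are hypothetical, and the reading that this is a first harvest with nothing to compare against is an assumption based on the append-everything script.

```python
def full_insert_ed_script(rows):
    """Wrap every row in a single ed 'append after line 0' command."""
    # "0a" opens append mode at the top of an empty target;
    # "." on a line by itself closes it.
    return ["0a", *rows, "."]


# Made-up sample rows standing in for the converted CSV.
rows = ["Q140\tPanthera leo", "Q19939\tPanthera tigris"]
script = full_insert_ed_script(rows)
print("\n".join(script))
```

One caveat of this format in general: ed treats any line consisting only of `.` as the end of the appended text, so a data row with that exact content would cut the script short.
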
[START] [2021-08-02 08:08:58] parse_diff_and_store
[INFO] [2021-08-02 08:08:58] Handling diff: /app/public/diff/wiki_english_nodes_4047.diff (430104 lines)
[INFO] [2021-08-02 08:08:58] Loading nodes diff file into memory (430104 /app/public/diff/wiki_english_nodes_4047.diff lines)...
[WARN] [2021-08-02 08:09:03] Filtered Scientific Name `Oligosoma aff. infrapunctatum "cobble"` to `Oligosoma aff. infrapunctatum cobble`
[INFO] [2021-08-02 08:15:15] Handling diff: /app/public/diff/wiki_english_media_4047.diff (814057 lines)
[INFO] [2021-08-02 08:15:16] Loading media diff file into memory (814057 /app/public/diff/wiki_english_media_4047.diff lines)...
[INFO] [2021-08-02 12:07:10] Storing 430102 ScientificNames
[INFO] [2021-08-02 12:07:10] Processing group of 430102 in 431 groups of 1000
[INFO] [2021-08-02 12:10:56] Average Time: 0.519
[INFO] [2021-08-02 12:10:56] Total Time: 3m46s
[INFO] [2021-08-02 12:10:56] last 3 / first 3: 0.48
[INFO] [2021-08-02 12:10:56] Std.Dev: 0.9055385138137416; Max: 12.35
[INFO] [2021-08-02 12:10:56] Storing 430107 Identifiers
[INFO] [2021-08-02 12:10:56] Processing group of 430107 in 431 groups of 1000
[INFO] [2021-08-02 12:12:00] Average Time: 0.145
[INFO] [2021-08-02 12:12:00] Total Time: 1m5s
[INFO] [2021-08-02 12:12:00] last 3 / first 3: 0.57
[INFO] [2021-08-02 12:12:00] Std.Dev: 0.2345207879911715; Max: 1.97
[INFO] [2021-08-02 12:12:00] Storing 430102 Nodes
[INFO] [2021-08-02 12:12:00] Processing group of 430102 in 431 groups of 1000
[INFO] [2021-08-02 12:15:46] Average Time: 0.519
[INFO] [2021-08-02 12:15:46] Total Time: 3m46s
[INFO] [2021-08-02 12:15:46] last 3 / first 3: 0.71
[INFO] [2021-08-02 12:15:46] Std.Dev: 1.4949916387726052; Max: 17.31
[INFO] [2021-08-02 12:15:46] Storing 814055 ArticlesSections
[INFO] [2021-08-02 12:15:46] Processing group of 814055 in 815 groups of 1000
[INFO] [2021-08-02 12:16:55] Average Time: 0.079
[INFO] [2021-08-02 12:16:55] Total Time: 1m10s
[INFO] [2021-08-02 12:16:55] last 3 / first 3: 0.59
[INFO] [2021-08-02 12:16:55] Std.Dev: 0.1673320053068151; Max: 1.62
[INFO] [2021-08-02 12:16:55] Storing 814055 Articles
[INFO] [2021-08-02 12:16:55] Processing group of 814055 in 815 groups of 1000
[INFO] [2021-08-02 12:27:42] Average Time: 0.786
[INFO] [2021-08-02 12:27:42] Total Time: 10m47s
[INFO] [2021-08-02 12:27:42] last 3 / first 3: 0.53
[INFO] [2021-08-02 12:27:42] Std.Dev: 1.5472556349873152; Max: 19.81
[STOP] [2021-08-02 12:27:42] parse_diff_and_store
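
Each "Storing" block above reports a record count, a group count, and timing statistics. The group counts are the record count divided by the group size, rounded up (430102 / 1000 gives 431, 814055 / 1000 gives 815). The Python sketch below reproduces that arithmetic and one plausible way to compute the summary lines. Treat it as a guess at the shape of the calculation, not the harvester's code: reading "last 3 / first 3" as the mean of the last three group times over the mean of the first three is an assumption, as is the use of the population standard deviation, and the timings are made up.

```python
import statistics


def batch_count(total, size):
    """Number of groups needed to cover `total` records, rounding up."""
    return (total + size - 1) // size


def summarize(times):
    """Summary figures shaped like the log's per-stage statistics."""
    first3 = statistics.mean(times[:3])
    last3 = statistics.mean(times[-3:])
    return {
        "average": statistics.mean(times),
        "total": sum(times),
        # Assumed meaning: a value below 1 says later groups ran faster.
        "last3_over_first3": last3 / first3,
        "std_dev": statistics.pstdev(times),  # assumed population std. dev.
        "max": max(times),
    }


# Hypothetical per-group durations in seconds.
times = [4.0, 2.0, 6.0, 2.0, 1.0, 3.0]
stats = summarize(times)
print(batch_count(430102, 1000), batch_count(814055, 1000))
print(stats)
```
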
[START] [2021-08-02 12:27:42] resolve_keys
[INFO] [2021-08-02 12:51:49] Occurrences to nodes (through scientific_names)...
[INFO] [2021-08-02 12:51:49] traits to occurrences...
[INFO] [2021-08-02 12:51:49] traits to nodes (through occurrences)...
[INFO] [2021-08-02 12:51:49] Traits to sex term...
[INFO] [2021-08-02 12:51:49] Traits to lifestage term...
[INFO] [2021-08-02 12:51:49] MetaTraits to traits...
[INFO] [2021-08-02 12:51:49] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2021-08-02 12:51:49] Assocs to occurrences...
[INFO] [2021-08-02 12:51:49] Assocs to nodes...
[INFO] [2021-08-02 12:51:49] Assoc to sex term...
[INFO] [2021-08-02 12:51:49] Assoc to lifestage term...
[INFO] [2021-08-02 12:51:49] MetaAssoc to assocs...
[STOP] [2021-08-02 12:51:49] resolve_keys
[START] [2021-08-02 12:51:49] hold_for_later_1
[STOP] [2021-08-02 12:51:49] hold_for_later_1
[START] [2021-08-02 12:51:49] hold_for_later_2
[STOP] [2021-08-02 12:51:49] hold_for_later_2
[START] [2021-08-02 12:51:49] resolve_missing_parents
[STOP] [2021-08-02 12:52:46] resolve_missing_parents
[START] [2021-08-02 12:52:46] rebuild_nodes
[START] [2021-08-02 12:52:46] Flattener#flatten
[START] [2021-08-02 12:52:46] Flattener#study_resource
[START] [2021-08-02 12:53:18] Flattener#build_ancestry
[STOP] [2021-08-02 12:59:09] Flattener#build_ancestry
[INFO] [2021-08-02 12:59:09] 430102 ancestry keys
[START] [2021-08-02 12:59:09] build_node_ancestors
[INFO] [2021-08-02 12:59:09] old ancestors deleted.
[STOP] [2021-08-02 13:17:50] build_node_ancestors
[START] [2021-08-02 13:17:54] Flattener#propagate_ancestor_ids
[STOP] [2021-08-02 13:24:38] Flattener#propagate_ancestor_ids
[STOP] [2021-08-02 13:24:38] Flattener#flatten
[STOP] [2021-08-02 13:24:38] rebuild_nodes
[START] [2021-08-02 13:24:38] resolve_missing_media_owners
[STOP] [2021-08-02 13:24:38] resolve_missing_media_owners
[START] [2021-08-02 13:24:38] sanitize_media_verbatims
[STOP] [2021-08-02 13:24:38] sanitize_media_verbatims
[START] [2021-08-02 13:24:38] queue_downloads
[STOP] [2021-08-02 13:24:38] queue_downloads
[START] [2021-08-02 13:24:38] parse_names
[WARN] [2021-08-02 13:24:39] I see 430102 names which still need to be parsed.
[WARN] [2021-08-02 13:30:00] I see 77 names which still need to be parsed.
[STOP] [2021-08-02 13:30:03] parse_names
[START] [2021-08-02 13:30:03] denormalize_canonical_names_to_nodes
[STOP] [2021-08-02 13:30:13] denormalize_canonical_names_to_nodes
[START] [2021-08-02 13:30:13] match_nodes
[START] [2021-08-02 13:30:14] map_all_nodes_to_pages
[STOP] [2021-08-02 18:48:33] map_all_nodes_to_pages
[INFO] [2021-08-02 18:48:34] 26549 Unmatched nodes (of 430102)! That's too many to output. Full list in /app/public/data/wiki_english/unmatched_nodes.txt ; First 10: Canonical: Hayasakaia; Node#97330359; ResourceID: Q106169081; Canonical: Fametesta; Node#97332655; ResourceID: Q106772637; Canonical: Melikaiella; Node#97332697; ResourceID: Q106784548; Canonical: Euclastaria; Node#97332700; ResourceID: Q106785034; Canonical: Eurycampta; Node#97332703; ResourceID: Q106785200; Canonical: Telaletes obscurata; Node#97333088; ResourceID: Q106945458; Canonical: Mesobaetis; Node#97333097; ResourceID: Q106950589; Canonical: Ruhooglandia; Node#97333111; ResourceID: Q106954582; Canonical: Anoma; Node#97333345; ResourceID: Q107027440; Canonical: Artigasia; Node#97333361; ResourceID: Q107029819
[START] [2021-08-02 18:48:34] update_nodes
[STOP] [2021-08-02 18:48:47] update_nodes
[STOP] [2021-08-02 18:48:47] match_nodes
[START] [2021-08-02 18:48:47] reindex_search
[STOP] [2021-08-02 19:04:56] reindex_search
[START] [2021-08-02 19:04:56] normalize_units
[STOP] [2021-08-02 19:04:56] normalize_units
[START] [2021-08-02 19:04:56] calculate_statistics
[STOP] [2021-08-02 19:04:59] calculate_statistics
[START] [2021-08-02 19:04:59] complete_harvest_instance
[START] [2021-08-02 19:04:59] overall_tsv_creation
[INFO] [2021-08-02 19:05:00] Processing group of 430102 in 44 batches of 10000
[INFO] [2021-08-02 20:40:31] Average Time: 68.022
[INFO] [2021-08-02 20:40:31] Total Time: 1h35m32s
[INFO] [2021-08-02 20:40:31] last 3 / first 3: 0.76
[INFO] [2021-08-02 20:40:31] Std.Dev: 12.129138468992759; Max: 121.79
[STOP] [2021-08-02 20:40:31] overall_tsv_creation
[INFO] [2021-08-02 20:40:31] Done. Check your files:
[INFO] [2021-08-02 20:40:31] (430102 lines) /app/public/data/wiki_english/publish_nodes.tsv
[INFO] [2021-08-02 20:40:32] (430107 lines) /app/public/data/wiki_english/publish_identifiers.tsv
[INFO] [2021-08-02 20:40:33] (10455697 lines) /app/public/data/wiki_english/publish_node_ancestors.tsv
[INFO] [2021-08-02 20:40:34] (430102 lines) /app/public/data/wiki_english/publish_scientific_names.tsv
[INFO] [2021-08-02 20:40:35] (5790866 lines) /app/public/data/wiki_english/publish_articles.tsv
[INFO] [2021-08-02 20:40:36] (814055 lines) /app/public/data/wiki_english/publish_content_sections.tsv
[STOP] [2021-08-02 20:40:37] complete_harvest_instance
[START] [2021-08-02 20:40:37] completed
[STOP] [2021-08-02 20:40:37] completed
[STOP] [2021-08-02 20:40:37] logged process, took 45207.88
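
The closing log line gives the run time as 45207.88 with no unit. Read as seconds, it agrees with the span between the first and last timestamps in the log. A small Python sketch of that check; the helper name is hypothetical, and its output format imitates the log's own "Total Time" lines such as 1h35m32s.

```python
from datetime import datetime


def format_duration(seconds):
    """Render seconds as e.g. '3m46s' or '1h35m32s', dropping fractions."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    prefix = f"{hours}h" if hours else ""
    return f"{prefix}{minutes}m{secs}s"


FMT = "%Y-%m-%d %H:%M:%S"
started = datetime.strptime("2021-08-02 08:07:09", FMT)   # first log entry
finished = datetime.strptime("2021-08-02 20:40:37", FMT)  # last log entry
wall_clock = (finished - started).total_seconds()

print(format_duration(45207.88))    # value from the closing log line
print(format_duration(wall_clock))  # span between first and last timestamps
```

Both values come out at a little over twelve and a half hours, within a second of each other.
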

Latest Process