Harvest for wikipedia IT Created 13 Jun 10:07

Stage: completed
Fetched: 13 Jun 10:07
Validated: 13 Jun 10:08
Deltas Created: 13 Jun 10:08
Units Normalized: 13 Jun 10:46
Ancestry Built: 13 Jun 10:28
Nodes Matched: 13 Jun 10:45
Names Parsed: 13 Jun 10:29
New Models Stored: 13 Jun 10:17
Indexed: 13 Jun 10:46
Completed: 13 Jun 10:58
Time to Harvest: 1 minute

Harvesting Log

(152 lines)
[INFO] [2022-06-13 10:07:45] Created harvest instance #4138
[STOP] [2022-06-13 10:07:45] create_harvest_instance
[START] [2022-06-13 10:07:45] fetch_files
[STOP] [2022-06-13 10:07:45] fetch_files
[START] [2022-06-13 10:07:45] validate_each_file
[INFO] [2022-06-13 10:07:45] Looping over 2 formats...
[INFO] [2022-06-13 10:07:45] ...nodes (/app/public/data/wiki_it_tar_gz/taxon.tab)
[INFO] [2022-06-13 10:07:47] Valid: /app/public/data/wiki_it_tar_gz/converted_csv/wiki_it_tar_gz_nodes_29380.csv (46127 lines)
[INFO] [2022-06-13 10:07:47] ...media (/app/public/data/wiki_it_tar_gz/media_resource.tab)
[INFO] [2022-06-13 10:08:03] Valid: /app/public/data/wiki_it_tar_gz/converted_csv/wiki_it_tar_gz_media_29379.csv (74824 lines)
[STOP] [2022-06-13 10:08:03] validate_each_file
[START] [2022-06-13 10:08:03] convert_to_csv
[INFO] [2022-06-13 10:08:03] Looping over 2 formats...
[INFO] [2022-06-13 10:08:03] ...nodes (/app/public/data/wiki_it_tar_gz/taxon.tab)
[CMD] [2022-06-13 10:08:03] /usr/bin/sort /app/public/data/wiki_it_tar_gz/converted_csv/wiki_it_tar_gz_nodes_29380.csv > /app/public/data/wiki_it_tar_gz/converted_csv/wiki_it_tar_gz_nodes_29380.csv_sorted
[INFO] [2022-06-13 10:08:03] Converted: /app/public/data/wiki_it_tar_gz/converted_csv/wiki_it_tar_gz_nodes_29380.csv (46127 lines)
[INFO] [2022-06-13 10:08:03] ...media (/app/public/data/wiki_it_tar_gz/media_resource.tab)
[CMD] [2022-06-13 10:08:03] /usr/bin/sort /app/public/data/wiki_it_tar_gz/converted_csv/wiki_it_tar_gz_media_29379.csv > /app/public/data/wiki_it_tar_gz/converted_csv/wiki_it_tar_gz_media_29379.csv_sorted
[INFO] [2022-06-13 10:08:13] Converted: /app/public/data/wiki_it_tar_gz/converted_csv/wiki_it_tar_gz_media_29379.csv (74824 lines)
[STOP] [2022-06-13 10:08:13] convert_to_csv
[START] [2022-06-13 10:08:13] calculate_delta
[INFO] [2022-06-13 10:08:13] Looping over 2 formats...
[INFO] [2022-06-13 10:08:13] ...nodes (/app/public/data/wiki_it_tar_gz/taxon.tab)
[CMD] [2022-06-13 10:08:13] echo "0a" > /app/public/data/wiki_it_tar_gz/diff/wiki_it_tar_gz_nodes_29380.diff
[CMD] [2022-06-13 10:08:13] tail -n +1 /app/public/data/wiki_it_tar_gz/converted_csv/wiki_it_tar_gz_nodes_29380.csv >> /app/public/data/wiki_it_tar_gz/diff/wiki_it_tar_gz_nodes_29380.diff
[CMD] [2022-06-13 10:08:13] echo "." >> /app/public/data/wiki_it_tar_gz/diff/wiki_it_tar_gz_nodes_29380.diff
[INFO] [2022-06-13 10:08:13] Created diff: /app/public/data/wiki_it_tar_gz/diff/wiki_it_tar_gz_nodes_29380.diff (46129 lines)
[INFO] [2022-06-13 10:08:13] ...media (/app/public/data/wiki_it_tar_gz/media_resource.tab)
[CMD] [2022-06-13 10:08:13] echo "0a" > /app/public/data/wiki_it_tar_gz/diff/wiki_it_tar_gz_media_29379.diff
[CMD] [2022-06-13 10:08:14] tail -n +1 /app/public/data/wiki_it_tar_gz/converted_csv/wiki_it_tar_gz_media_29379.csv >> /app/public/data/wiki_it_tar_gz/diff/wiki_it_tar_gz_media_29379.diff
[CMD] [2022-06-13 10:08:18] echo "." >> /app/public/data/wiki_it_tar_gz/diff/wiki_it_tar_gz_media_29379.diff
[INFO] [2022-06-13 10:08:20] Created diff: /app/public/data/wiki_it_tar_gz/diff/wiki_it_tar_gz_media_29379.diff (74826 lines)
[STOP] [2022-06-13 10:08:20] calculate_delta
[START] [2022-06-13 10:08:20] parse_diff_and_store
[INFO] [2022-06-13 10:08:20] Handling diff: /app/public/data/wiki_it_tar_gz/diff/wiki_it_tar_gz_nodes_29380.diff (46129 lines)
[INFO] [2022-06-13 10:08:20] Loading nodes diff file into memory (46129 lines)...
[WARN] [2022-06-13 10:08:22] Filtered Scientific Name `Cuon alpinus fumosus/javanicus` to `Cuon alpinus fumosusjavanicus`
[INFO] [2022-06-13 10:08:23] Storing 9999 ScientificNames (29997/10000/46129)
[INFO] [2022-06-13 10:08:26] Storing 9999 Identifiers (29997/10000/46129)
[INFO] [2022-06-13 10:08:27] Storing 9999 Nodes (29997/10000/46129)
[INFO] [2022-06-13 10:08:34] Storing 10000 ScientificNames (59997/20000/46129)
[INFO] [2022-06-13 10:08:36] Storing 10000 Identifiers (59997/20000/46129)
[INFO] [2022-06-13 10:08:37] Storing 10000 Nodes (59997/20000/46129)
[INFO] [2022-06-13 10:08:44] Storing 10000 ScientificNames (89997/30000/46129)
[INFO] [2022-06-13 10:08:48] Storing 10000 Identifiers (89997/30000/46129)
[INFO] [2022-06-13 10:08:49] Storing 10000 Nodes (89997/30000/46129)
[INFO] [2022-06-13 10:08:55] Storing 10000 ScientificNames (119997/40000/46129)
[INFO] [2022-06-13 10:08:58] Storing 10000 Identifiers (119997/40000/46129)
[INFO] [2022-06-13 10:08:59] Storing 10000 Nodes (119997/40000/46129)
[INFO] [2022-06-13 10:09:05] Storing 6128 ScientificNames (138381/46127/46129)
[INFO] [2022-06-13 10:09:07] Storing 6128 Identifiers (138381/46127/46129)
[INFO] [2022-06-13 10:09:08] Storing 6128 Nodes (138381/46127/46129)
[INFO] [2022-06-13 10:09:11] Handling diff: /app/public/data/wiki_it_tar_gz/diff/wiki_it_tar_gz_media_29379.diff (74826 lines)
[INFO] [2022-06-13 10:09:11] Loading media diff file into memory (74826 lines)...
[INFO] [2022-06-13 10:10:11] Storing 9999 ArticlesSections (19998/10000/74826)
[INFO] [2022-06-13 10:10:12] Storing 9999 Articles (19998/10000/74826)
[INFO] [2022-06-13 10:11:16] Storing 10000 ArticlesSections (39998/20000/74826)
[INFO] [2022-06-13 10:11:17] Storing 10000 Articles (39998/20000/74826)
[INFO] [2022-06-13 10:12:23] Storing 10000 ArticlesSections (59998/30000/74826)
[INFO] [2022-06-13 10:12:23] Storing 10000 Articles (59998/30000/74826)
[INFO] [2022-06-13 10:13:30] Storing 10000 ArticlesSections (79998/40000/74826)
[INFO] [2022-06-13 10:13:30] Storing 10000 Articles (79998/40000/74826)
[INFO] [2022-06-13 10:14:38] Storing 10000 ArticlesSections (99998/50000/74826)
[INFO] [2022-06-13 10:14:38] Storing 10000 Articles (99998/50000/74826)
[INFO] [2022-06-13 10:15:46] Storing 10000 ArticlesSections (119998/60000/74826)
[INFO] [2022-06-13 10:15:47] Storing 10000 Articles (119998/60000/74826)
[INFO] [2022-06-13 10:16:53] Storing 10000 ArticlesSections (139998/70000/74826)
[INFO] [2022-06-13 10:16:54] Storing 10000 Articles (139998/70000/74826)
[INFO] [2022-06-13 10:17:30] Storing 4825 ArticlesSections (149648/74824/74826)
[INFO] [2022-06-13 10:17:30] Storing 4825 Articles (149648/74824/74826)
[STOP] [2022-06-13 10:17:33] parse_diff_and_store
[START] [2022-06-13 10:17:33] resolve_keys
[2022-06-13 10:17:51] Resolving downloaded urls (this is not actually downloading them yet)
[INFO] [2022-06-13 10:26:35] Occurrences to nodes (through scientific_names)...
[INFO] [2022-06-13 10:26:35] traits to occurrences...
[INFO] [2022-06-13 10:26:35] traits to nodes (through occurrences)...
[INFO] [2022-06-13 10:26:35] Traits to sex term...
[INFO] [2022-06-13 10:26:35] Traits to lifestage term...
[INFO] [2022-06-13 10:26:35] MetaTraits to traits...
[INFO] [2022-06-13 10:26:35] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2022-06-13 10:26:35] Assocs to occurrences...
[INFO] [2022-06-13 10:26:35] Assocs to nodes...
[INFO] [2022-06-13 10:26:35] Assoc to sex term...
[INFO] [2022-06-13 10:26:35] Assoc to lifestage term...
[INFO] [2022-06-13 10:26:35] MetaAssoc to assocs...
[STOP] [2022-06-13 10:26:35] resolve_keys
[START] [2022-06-13 10:26:35] hold_for_later_1
[STOP] [2022-06-13 10:26:35] hold_for_later_1
[START] [2022-06-13 10:26:35] hold_for_later_2
[STOP] [2022-06-13 10:26:35] hold_for_later_2
[START] [2022-06-13 10:26:35] resolve_missing_parents
[STOP] [2022-06-13 10:26:38] resolve_missing_parents
[START] [2022-06-13 10:26:38] rebuild_nodes
[START] [2022-06-13 10:26:38] Flattener#flatten
[START] [2022-06-13 10:26:38] Flattener#study_resource
[START] [2022-06-13 10:26:38] Flattener#build_ancestry
[STOP] [2022-06-13 10:26:43] Flattener#build_ancestry
[INFO] [2022-06-13 10:26:43] 46127 ancestry keys
[START] [2022-06-13 10:26:43] build_node_ancestors
[INFO] [2022-06-13 10:26:43] old ancestors deleted.
[STOP] [2022-06-13 10:28:28] build_node_ancestors
[START] [2022-06-13 10:28:31] Flattener#propagate_ancestor_ids
[STOP] [2022-06-13 10:28:59] Flattener#propagate_ancestor_ids
[STOP] [2022-06-13 10:28:59] Flattener#flatten
[STOP] [2022-06-13 10:28:59] rebuild_nodes
[START] [2022-06-13 10:28:59] resolve_missing_media_owners
[STOP] [2022-06-13 10:28:59] resolve_missing_media_owners
[START] [2022-06-13 10:28:59] sanitize_media_verbatims
[STOP] [2022-06-13 10:28:59] sanitize_media_verbatims
[START] [2022-06-13 10:28:59] queue_downloads
[STOP] [2022-06-13 10:28:59] queue_downloads
[START] [2022-06-13 10:28:59] parse_names
[WARN] [2022-06-13 10:29:00] I see 46127 names which still need to be parsed.
[WARN] [2022-06-13 10:29:01] Names to parse: 10000 formatted: 10000 learned: 9997 parsed: 10000
[WARN] [2022-06-13 10:29:07] Names to parse: 10000 formatted: 10000 learned: 9996 parsed: 10000
[WARN] [2022-06-13 10:29:14] Names to parse: 10000 formatted: 10000 learned: 9995 parsed: 10000
[WARN] [2022-06-13 10:29:20] Names to parse: 10000 formatted: 10000 learned: 9997 parsed: 10000
[WARN] [2022-06-13 10:29:27] Names to parse: 6127 formatted: 6127 learned: 6127 parsed: 6127
[STOP] [2022-06-13 10:29:31] parse_names
[START] [2022-06-13 10:29:31] denormalize_canonical_names_to_nodes
[STOP] [2022-06-13 10:29:32] denormalize_canonical_names_to_nodes
[START] [2022-06-13 10:29:32] match_nodes
[START] [2022-06-13 10:29:32] map_all_nodes_to_pages
[STOP] [2022-06-13 10:44:50] map_all_nodes_to_pages
[INFO] [2022-06-13 10:44:50] 3796 Unmatched nodes (of 46127)! That's too many to output. Full list in /app/public/data/wiki_it_tar_gz/unmatched_nodes.txt ; First 10: Canonical: Biota; Node#116789090; ResourceID: Q2382443; Canonical: Acytota; Node#116780350; ResourceID: Q169731; Canonical: Prokaryota; Node#116783530; ResourceID: Q19081; Canonical: Korarchaeota; Node#116803199; ResourceID: Q504947; Canonical: Bacteria; Node#116769330; ResourceID: Q10876; Canonical: Negibacteria; Node#116796619; ResourceID: Q3337759; Canonical: Thermotogae; Node#116770058; ResourceID: Q1146853; Canonical: Thermotogae; Node#116791569; ResourceID: Q26869797; Canonical: Gemmatimonadetes; Node#116770062; ResourceID: Q1147292; Canonical: Gemmatimonadetes; Node#116791568; ResourceID: Q26869746
[START] [2022-06-13 10:44:50] update_nodes
[STOP] [2022-06-13 10:45:11] update_nodes
[STOP] [2022-06-13 10:45:11] match_nodes
[START] [2022-06-13 10:45:11] reindex_search
[STOP] [2022-06-13 10:46:49] reindex_search
[START] [2022-06-13 10:46:50] normalize_units
[STOP] [2022-06-13 10:46:50] normalize_units
[START] [2022-06-13 10:46:50] calculate_statistics
[INFO] [2022-06-13 10:47:16] Duplicate page_id count: 0
[STOP] [2022-06-13 10:47:16] calculate_statistics
[START] [2022-06-13 10:47:16] complete_harvest_instance
[START] [2022-06-13 10:47:16] overall_tsv_creation
[INFO] [2022-06-13 10:47:16] Processing group of 46127 in 5 batches of 10000
[INFO] [2022-06-13 10:58:07] Average Time: 53.882
[INFO] [2022-06-13 10:58:07] Total Time: 10m51s
[STOP] [2022-06-13 10:58:07] overall_tsv_creation
[INFO] [2022-06-13 10:58:07] Done. Check your files:
[INFO] [2022-06-13 10:58:07] (46127 lines) /app/public/data/wiki_it_tar_gz/publish_nodes.tsv
[INFO] [2022-06-13 10:58:07] (46127 lines) /app/public/data/wiki_it_tar_gz/publish_identifiers.tsv
[INFO] [2022-06-13 10:58:07] (1041238 lines) /app/public/data/wiki_it_tar_gz/publish_node_ancestors.tsv
[INFO] [2022-06-13 10:58:07] (46127 lines) /app/public/data/wiki_it_tar_gz/publish_scientific_names.tsv
[INFO] [2022-06-13 10:58:08] (1139057 lines) /app/public/data/wiki_it_tar_gz/publish_articles.tsv
[INFO] [2022-06-13 10:58:08] (74824 lines) /app/public/data/wiki_it_tar_gz/publish_content_sections.tsv
[STOP] [2022-06-13 10:58:08] complete_harvest_instance
[START] [2022-06-13 10:58:08] completed
[STOP] [2022-06-13 10:58:08] completed
[STOP] [2022-06-13 10:58:08] logged process, took 3023.01
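
The log above brackets each stage with a [START] and a [STOP] line carrying the same stage name, so per-stage durations can be recovered by pairing them. The following is a minimal sketch of that pairing, not part of the harvester itself: the function name `stage_durations` and the regular expression are assumptions made for illustration, and the sample lines are copied from the log above. [STOP] lines with no matching [START] (such as `create_harvest_instance`) are skipped.

```python
import re
from datetime import datetime

# Matches lines such as "[START] [2022-06-13 10:29:32] map_all_nodes_to_pages".
LINE_RE = re.compile(
    r"^\[(START|STOP)\] \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.+)$"
)


def stage_durations(lines):
    """Return {stage_name: seconds} for every START/STOP pair in `lines`."""
    started = {}    # stage name -> datetime of its [START] line
    durations = {}  # stage name -> elapsed seconds
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # [INFO], [WARN], [CMD] and untagged lines are ignored
        kind, stamp, name = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "START":
            started[name] = when
        elif name in started:
            # A [STOP] with a matching [START]: record the elapsed time.
            durations[name] = (when - started.pop(name)).total_seconds()
    return durations


sample = [
    "[START] [2022-06-13 10:29:32] map_all_nodes_to_pages",
    "[STOP] [2022-06-13 10:44:50] map_all_nodes_to_pages",
    "[START] [2022-06-13 10:17:33] resolve_keys",
    "[INFO] [2022-06-13 10:26:35] MetaAssoc to assocs...",
    "[STOP] [2022-06-13 10:26:35] resolve_keys",
]

# Print the slowest stages first.
for name, secs in sorted(stage_durations(sample).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {secs:.0f}s")
```

On these sample lines the sketch reports 918 seconds for `map_all_nodes_to_pages` and 542 seconds for `resolve_keys`. Run over the full log, the same pairing would show which stages account for most of the 3023.01 seconds reported on the final line.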

Latest Process