Harvest for wikipedia PT
Created: 13 Jun 12:30

Stage: completed
Fetched: 13 Jun 12:30
Validated: 13 Jun 12:30
Deltas Created: 13 Jun 12:31
Units Normalized: 13 Jun 13:34
Ancestry Built: 13 Jun 12:54
Nodes Matched: 13 Jun 13:30
Names Parsed: 13 Jun 12:56
New Models Stored: 13 Jun 12:44
Indexed: 13 Jun 13:34
Completed: 13 Jun 13:56
Time to Harvest: 1 minute

Harvesting Log

(213 lines)
[INFO] [2022-06-13 12:30:14] Created harvest instance #4139
[STOP] [2022-06-13 12:30:14] create_harvest_instance
[START] [2022-06-13 12:30:14] fetch_files
[STOP] [2022-06-13 12:30:14] fetch_files
[START] [2022-06-13 12:30:14] validate_each_file
[INFO] [2022-06-13 12:30:14] Looping over 2 formats...
[INFO] [2022-06-13 12:30:14] ...nodes (/app/public/data/wiki_pt_tar_gz/taxon.tab)
[INFO] [2022-06-13 12:30:19] Valid: /app/public/data/wiki_pt_tar_gz/converted_csv/wiki_pt_tar_gz_nodes_29382.csv (119398 lines)
[INFO] [2022-06-13 12:30:19] ...media (/app/public/data/wiki_pt_tar_gz/media_resource.tab)
[INFO] [2022-06-13 12:30:43] Valid: /app/public/data/wiki_pt_tar_gz/converted_csv/wiki_pt_tar_gz_media_29381.csv (209329 lines)
[STOP] [2022-06-13 12:30:43] validate_each_file
[START] [2022-06-13 12:30:44] convert_to_csv
[INFO] [2022-06-13 12:30:44] Looping over 2 formats...
[INFO] [2022-06-13 12:30:44] ...nodes (/app/public/data/wiki_pt_tar_gz/taxon.tab)
[CMD] [2022-06-13 12:30:44] /usr/bin/sort /app/public/data/wiki_pt_tar_gz/converted_csv/wiki_pt_tar_gz_nodes_29382.csv > /app/public/data/wiki_pt_tar_gz/converted_csv/wiki_pt_tar_gz_nodes_29382.csv_sorted
[INFO] [2022-06-13 12:30:44] Converted: /app/public/data/wiki_pt_tar_gz/converted_csv/wiki_pt_tar_gz_nodes_29382.csv (119398 lines)
[INFO] [2022-06-13 12:30:44] ...media (/app/public/data/wiki_pt_tar_gz/media_resource.tab)
[CMD] [2022-06-13 12:30:44] /usr/bin/sort /app/public/data/wiki_pt_tar_gz/converted_csv/wiki_pt_tar_gz_media_29381.csv > /app/public/data/wiki_pt_tar_gz/converted_csv/wiki_pt_tar_gz_media_29381.csv_sorted
[INFO] [2022-06-13 12:30:55] Converted: /app/public/data/wiki_pt_tar_gz/converted_csv/wiki_pt_tar_gz_media_29381.csv (209329 lines)
[STOP] [2022-06-13 12:30:55] convert_to_csv
[START] [2022-06-13 12:30:55] calculate_delta
[INFO] [2022-06-13 12:30:55] Looping over 2 formats...
[INFO] [2022-06-13 12:30:55] ...nodes (/app/public/data/wiki_pt_tar_gz/taxon.tab)
[CMD] [2022-06-13 12:30:55] echo "0a" > /app/public/data/wiki_pt_tar_gz/diff/wiki_pt_tar_gz_nodes_29382.diff
[CMD] [2022-06-13 12:30:55] tail -n +1 /app/public/data/wiki_pt_tar_gz/converted_csv/wiki_pt_tar_gz_nodes_29382.csv >> /app/public/data/wiki_pt_tar_gz/diff/wiki_pt_tar_gz_nodes_29382.diff
[CMD] [2022-06-13 12:30:55] echo "." >> /app/public/data/wiki_pt_tar_gz/diff/wiki_pt_tar_gz_nodes_29382.diff
[INFO] [2022-06-13 12:30:55] Created diff: /app/public/data/wiki_pt_tar_gz/diff/wiki_pt_tar_gz_nodes_29382.diff (119400 lines)
[INFO] [2022-06-13 12:30:55] ...media (/app/public/data/wiki_pt_tar_gz/media_resource.tab)
[CMD] [2022-06-13 12:30:55] echo "0a" > /app/public/data/wiki_pt_tar_gz/diff/wiki_pt_tar_gz_media_29381.diff
[CMD] [2022-06-13 12:30:55] tail -n +1 /app/public/data/wiki_pt_tar_gz/converted_csv/wiki_pt_tar_gz_media_29381.csv >> /app/public/data/wiki_pt_tar_gz/diff/wiki_pt_tar_gz_media_29381.diff
[CMD] [2022-06-13 12:31:01] echo "." >> /app/public/data/wiki_pt_tar_gz/diff/wiki_pt_tar_gz_media_29381.diff
[INFO] [2022-06-13 12:31:03] Created diff: /app/public/data/wiki_pt_tar_gz/diff/wiki_pt_tar_gz_media_29381.diff (209331 lines)
[STOP] [2022-06-13 12:31:03] calculate_delta
[START] [2022-06-13 12:31:03] parse_diff_and_store
[INFO] [2022-06-13 12:31:03] Handling diff: /app/public/data/wiki_pt_tar_gz/diff/wiki_pt_tar_gz_nodes_29382.diff (119400 lines)
[INFO] [2022-06-13 12:31:03] Loading nodes diff file into memory (119400 lines)...
[INFO] [2022-06-13 12:31:07] Storing 9999 ScientificNames (29997/10000/119400)
[INFO] [2022-06-13 12:31:09] Storing 9999 Identifiers (29997/10000/119400)
[INFO] [2022-06-13 12:31:11] Storing 9999 Nodes (29997/10000/119400)
[WARN] [2022-06-13 12:31:15] Filtered Scientific Name `Cuon alpinus fumosus/javanicus` to `Cuon alpinus fumosusjavanicus`
[INFO] [2022-06-13 12:31:17] Storing 10000 ScientificNames (59997/20000/119400)
[INFO] [2022-06-13 12:31:20] Storing 10000 Identifiers (59997/20000/119400)
[INFO] [2022-06-13 12:31:21] Storing 10000 Nodes (59997/20000/119400)
[INFO] [2022-06-13 12:31:28] Storing 10000 ScientificNames (89997/30000/119400)
[INFO] [2022-06-13 12:31:31] Storing 10000 Identifiers (89997/30000/119400)
[INFO] [2022-06-13 12:31:33] Storing 10000 Nodes (89997/30000/119400)
[INFO] [2022-06-13 12:31:40] Storing 10000 ScientificNames (119997/40000/119400)
[INFO] [2022-06-13 12:31:42] Storing 10000 Identifiers (119997/40000/119400)
[INFO] [2022-06-13 12:31:44] Storing 10000 Nodes (119997/40000/119400)
[INFO] [2022-06-13 12:31:50] Storing 10000 ScientificNames (149997/50000/119400)
[INFO] [2022-06-13 12:31:53] Storing 10000 Identifiers (149997/50000/119400)
[INFO] [2022-06-13 12:31:54] Storing 10000 Nodes (149997/50000/119400)
[INFO] [2022-06-13 12:32:01] Storing 10000 ScientificNames (179997/60000/119400)
[INFO] [2022-06-13 12:32:04] Storing 10000 Identifiers (179997/60000/119400)
[INFO] [2022-06-13 12:32:05] Storing 10000 Nodes (179997/60000/119400)
[WARN] [2022-06-13 12:32:10] Filtered Scientific Name `/Gunneridae` to `Gunneridae`
[INFO] [2022-06-13 12:32:12] Storing 10000 ScientificNames (209997/70000/119400)
[INFO] [2022-06-13 12:32:15] Storing 10000 Identifiers (209997/70000/119400)
[INFO] [2022-06-13 12:32:16] Storing 10000 Nodes (209997/70000/119400)
[INFO] [2022-06-13 12:32:23] Storing 10000 ScientificNames (239997/80000/119400)
[INFO] [2022-06-13 12:32:26] Storing 10000 Identifiers (239997/80000/119400)
[INFO] [2022-06-13 12:32:28] Storing 10000 Nodes (239997/80000/119400)
[WARN] [2022-06-13 12:32:32] Filtered Scientific Name `/Eudicotyledoneae` to `Eudicotyledoneae`
[WARN] [2022-06-13 12:32:32] Filtered Scientific Name `/Mesangiospermae` to `Mesangiospermae`
[WARN] [2022-06-13 12:32:32] Filtered Scientific Name `/Pan-Angiospermae` to `Pan-Angiospermae`
[INFO] [2022-06-13 12:32:34] Storing 10000 ScientificNames (269997/90000/119400)
[INFO] [2022-06-13 12:32:37] Storing 10000 Identifiers (269997/90000/119400)
[INFO] [2022-06-13 12:32:39] Storing 10000 Nodes (269997/90000/119400)
[WARN] [2022-06-13 12:32:44] Filtered Scientific Name `/Pentapetalae` to `Pentapetalae`
[INFO] [2022-06-13 12:32:46] Storing 10000 ScientificNames (299997/100000/119400)
[INFO] [2022-06-13 12:32:48] Storing 10000 Identifiers (299997/100000/119400)
[INFO] [2022-06-13 12:32:50] Storing 10000 Nodes (299997/100000/119400)
[INFO] [2022-06-13 12:32:57] Storing 10000 ScientificNames (329997/110000/119400)
[INFO] [2022-06-13 12:33:00] Storing 10000 Identifiers (329997/110000/119400)
[INFO] [2022-06-13 12:33:01] Storing 10000 Nodes (329997/110000/119400)
[INFO] [2022-06-13 12:33:08] Storing 9399 ScientificNames (358194/119398/119400)
[INFO] [2022-06-13 12:33:10] Storing 9399 Identifiers (358194/119398/119400)
[INFO] [2022-06-13 12:33:12] Storing 9399 Nodes (358194/119398/119400)
[INFO] [2022-06-13 12:33:15] Handling diff: /app/public/data/wiki_pt_tar_gz/diff/wiki_pt_tar_gz_media_29381.diff (209331 lines)
[INFO] [2022-06-13 12:33:16] Loading media diff file into memory (209331 lines)...
[INFO] [2022-06-13 12:33:44] Storing 9999 ArticlesSections (19998/10000/209331)
[INFO] [2022-06-13 12:33:44] Storing 9999 Articles (19998/10000/209331)
[INFO] [2022-06-13 12:34:17] Storing 10000 ArticlesSections (39998/20000/209331)
[INFO] [2022-06-13 12:34:17] Storing 10000 Articles (39998/20000/209331)
[INFO] [2022-06-13 12:34:50] Storing 10000 ArticlesSections (59998/30000/209331)
[INFO] [2022-06-13 12:34:51] Storing 10000 Articles (59998/30000/209331)
[INFO] [2022-06-13 12:35:24] Storing 10000 ArticlesSections (79998/40000/209331)
[INFO] [2022-06-13 12:35:25] Storing 10000 Articles (79998/40000/209331)
[INFO] [2022-06-13 12:35:57] Storing 10000 ArticlesSections (99998/50000/209331)
[INFO] [2022-06-13 12:35:58] Storing 10000 Articles (99998/50000/209331)
[INFO] [2022-06-13 12:36:30] Storing 10000 ArticlesSections (119998/60000/209331)
[INFO] [2022-06-13 12:36:31] Storing 10000 Articles (119998/60000/209331)
[INFO] [2022-06-13 12:37:03] Storing 10000 ArticlesSections (139998/70000/209331)
[INFO] [2022-06-13 12:37:03] Storing 10000 Articles (139998/70000/209331)
[INFO] [2022-06-13 12:37:36] Storing 10000 ArticlesSections (159998/80000/209331)
[INFO] [2022-06-13 12:37:36] Storing 10000 Articles (159998/80000/209331)
[INFO] [2022-06-13 12:38:07] Storing 10000 ArticlesSections (179998/90000/209331)
[INFO] [2022-06-13 12:38:08] Storing 10000 Articles (179998/90000/209331)
[INFO] [2022-06-13 12:38:41] Storing 10000 ArticlesSections (199998/100000/209331)
[INFO] [2022-06-13 12:38:42] Storing 10000 Articles (199998/100000/209331)
[INFO] [2022-06-13 12:39:15] Storing 10000 ArticlesSections (219998/110000/209331)
[INFO] [2022-06-13 12:39:16] Storing 10000 Articles (219998/110000/209331)
[INFO] [2022-06-13 12:39:49] Storing 10000 ArticlesSections (239998/120000/209331)
[INFO] [2022-06-13 12:39:49] Storing 10000 Articles (239998/120000/209331)
[INFO] [2022-06-13 12:40:22] Storing 10000 ArticlesSections (259998/130000/209331)
[INFO] [2022-06-13 12:40:23] Storing 10000 Articles (259998/130000/209331)
[INFO] [2022-06-13 12:40:56] Storing 10000 ArticlesSections (279998/140000/209331)
[INFO] [2022-06-13 12:40:56] Storing 10000 Articles (279998/140000/209331)
[INFO] [2022-06-13 12:41:29] Storing 10000 ArticlesSections (299998/150000/209331)
[INFO] [2022-06-13 12:41:30] Storing 10000 Articles (299998/150000/209331)
[INFO] [2022-06-13 12:42:02] Storing 10000 ArticlesSections (319998/160000/209331)
[INFO] [2022-06-13 12:42:03] Storing 10000 Articles (319998/160000/209331)
[INFO] [2022-06-13 12:42:35] Storing 10000 ArticlesSections (339998/170000/209331)
[INFO] [2022-06-13 12:42:36] Storing 10000 Articles (339998/170000/209331)
[INFO] [2022-06-13 12:43:10] Storing 10000 ArticlesSections (359998/180000/209331)
[INFO] [2022-06-13 12:43:11] Storing 10000 Articles (359998/180000/209331)
[INFO] [2022-06-13 12:43:44] Storing 10000 ArticlesSections (379998/190000/209331)
[INFO] [2022-06-13 12:43:45] Storing 10000 Articles (379998/190000/209331)
[INFO] [2022-06-13 12:44:20] Storing 10000 ArticlesSections (399998/200000/209331)
[INFO] [2022-06-13 12:44:20] Storing 10000 Articles (399998/200000/209331)
[INFO] [2022-06-13 12:44:53] Storing 9330 ArticlesSections (418658/209329/209331)
[INFO] [2022-06-13 12:44:54] Storing 9330 Articles (418658/209329/209331)
[STOP] [2022-06-13 12:44:58] parse_diff_and_store
[START] [2022-06-13 12:44:58] resolve_keys
[2022-06-13 12:46:26] Resolving downloaded urls (this is not actually downloading them yet)
[INFO] [2022-06-13 12:48:07] Occurrences to nodes (through scientific_names)...
[INFO] [2022-06-13 12:48:07] traits to occurrences...
[INFO] [2022-06-13 12:48:07] traits to nodes (through occurrences)...
[INFO] [2022-06-13 12:48:07] Traits to sex term...
[INFO] [2022-06-13 12:48:07] Traits to lifestage term...
[INFO] [2022-06-13 12:48:07] MetaTraits to traits...
[INFO] [2022-06-13 12:48:07] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2022-06-13 12:48:07] Assocs to occurrences...
[INFO] [2022-06-13 12:48:07] Assocs to nodes...
[INFO] [2022-06-13 12:48:07] Assoc to sex term...
[INFO] [2022-06-13 12:48:07] Assoc to lifestage term...
[INFO] [2022-06-13 12:48:07] MetaAssoc to assocs...
[STOP] [2022-06-13 12:48:07] resolve_keys
[START] [2022-06-13 12:48:08] hold_for_later_1
[STOP] [2022-06-13 12:48:08] hold_for_later_1
[START] [2022-06-13 12:48:08] hold_for_later_2
[STOP] [2022-06-13 12:48:08] hold_for_later_2
[START] [2022-06-13 12:48:08] resolve_missing_parents
[STOP] [2022-06-13 12:48:14] resolve_missing_parents
[START] [2022-06-13 12:48:14] rebuild_nodes
[START] [2022-06-13 12:48:14] Flattener#flatten
[START] [2022-06-13 12:48:14] Flattener#study_resource
[START] [2022-06-13 12:48:14] Flattener#build_ancestry
[STOP] [2022-06-13 12:48:46] Flattener#build_ancestry
[INFO] [2022-06-13 12:48:46] 119398 ancestry keys
[START] [2022-06-13 12:48:46] build_node_ancestors
[INFO] [2022-06-13 12:48:46] old ancestors deleted.
[STOP] [2022-06-13 12:53:22] build_node_ancestors
[START] [2022-06-13 12:53:27] Flattener#propagate_ancestor_ids
[STOP] [2022-06-13 12:54:40] Flattener#propagate_ancestor_ids
[STOP] [2022-06-13 12:54:40] Flattener#flatten
[STOP] [2022-06-13 12:54:40] rebuild_nodes
[START] [2022-06-13 12:54:40] resolve_missing_media_owners
[STOP] [2022-06-13 12:54:40] resolve_missing_media_owners
[START] [2022-06-13 12:54:40] sanitize_media_verbatims
[STOP] [2022-06-13 12:54:40] sanitize_media_verbatims
[START] [2022-06-13 12:54:40] queue_downloads
[STOP] [2022-06-13 12:54:40] queue_downloads
[START] [2022-06-13 12:54:40] parse_names
[WARN] [2022-06-13 12:54:40] I see 119398 names which still need to be parsed.
[WARN] [2022-06-13 12:54:41] Names to parse: 10000 formatted: 10000 learned: 9998 parsed: 10000
[WARN] [2022-06-13 12:54:47] Names to parse: 10000 formatted: 10000 learned: 9998 parsed: 10000
[WARN] [2022-06-13 12:54:55] Names to parse: 10000 formatted: 10000 learned: 10000 parsed: 10000
[WARN] [2022-06-13 12:55:01] Names to parse: 10000 formatted: 10000 learned: 9999 parsed: 10000
[WARN] [2022-06-13 12:55:08] Names to parse: 10000 formatted: 10000 learned: 9999 parsed: 10000
[WARN] [2022-06-13 12:55:15] Names to parse: 10000 formatted: 10000 learned: 9995 parsed: 10000
[WARN] [2022-06-13 12:55:22] Names to parse: 10000 formatted: 10000 learned: 9995 parsed: 10000
[WARN] [2022-06-13 12:55:30] Names to parse: 10000 formatted: 10000 learned: 10000 parsed: 10000
[WARN] [2022-06-13 12:55:36] Names to parse: 10000 formatted: 10000 learned: 10000 parsed: 10000
[WARN] [2022-06-13 12:55:43] Names to parse: 10000 formatted: 10000 learned: 9995 parsed: 10000
[WARN] [2022-06-13 12:55:50] Names to parse: 10000 formatted: 10000 learned: 9996 parsed: 10000
[WARN] [2022-06-13 12:55:57] Names to parse: 9398 formatted: 9398 learned: 9398 parsed: 9398
[STOP] [2022-06-13 12:56:04] parse_names
[START] [2022-06-13 12:56:04] denormalize_canonical_names_to_nodes
[STOP] [2022-06-13 12:56:05] denormalize_canonical_names_to_nodes
[START] [2022-06-13 12:56:05] match_nodes
[START] [2022-06-13 12:56:06] map_all_nodes_to_pages
[STOP] [2022-06-13 13:30:14] map_all_nodes_to_pages
[INFO] [2022-06-13 13:30:14] 11267 Unmatched nodes (of 119398)! That's too many to output. Full list in /app/public/data/wiki_pt_tar_gz/unmatched_nodes.txt ; First 10: Canonical: Artigasia; Node#116821091; ResourceID: Q107029819; Canonical: Micromphalia; Node#116821106; ResourceID: Q107052159; Canonical: Cirratulida; Node#116821174; ResourceID: Q107122642; Canonical: Opheliida; Node#116821175; ResourceID: Q107122700; Canonical: Myenchildae; Node#116821178; ResourceID: Q107126081; Canonical: Creagrocercidae; Node#116821180; ResourceID: Q107126828; Canonical: Parakaryon myojinensis; Node#116867945; ResourceID: Q22329203; Canonical: Biota; Node#116870513; ResourceID: Q2382443; Canonical: Acytota; Node#116851393; ResourceID: Q169731; Canonical: Prokaryota; Node#116858417; ResourceID: Q19081
[START] [2022-06-13 13:30:14] update_nodes
[STOP] [2022-06-13 13:30:23] update_nodes
[STOP] [2022-06-13 13:30:23] match_nodes
[START] [2022-06-13 13:30:23] reindex_search
[STOP] [2022-06-13 13:34:27] reindex_search
[START] [2022-06-13 13:34:27] normalize_units
[STOP] [2022-06-13 13:34:27] normalize_units
[START] [2022-06-13 13:34:27] calculate_statistics
[INFO] [2022-06-13 13:34:35] Duplicate page_id count: 0
[STOP] [2022-06-13 13:34:35] calculate_statistics
[START] [2022-06-13 13:34:35] complete_harvest_instance
[START] [2022-06-13 13:34:35] overall_tsv_creation
[INFO] [2022-06-13 13:34:36] Processing group of 119398 in 12 batches of 10000
[INFO] [2022-06-13 13:56:39] Average Time: 52.185
[INFO] [2022-06-13 13:56:39] Total Time: 22m4s
[INFO] [2022-06-13 13:56:39] last 3 / first 3: 1.01
[INFO] [2022-06-13 13:56:39] Std.Dev: 3.273; Max: 55.1
[STOP] [2022-06-13 13:56:39] overall_tsv_creation
[INFO] [2022-06-13 13:56:39] Done. Check your files:
[INFO] [2022-06-13 13:56:40] (119398 lines) /app/public/data/wiki_pt_tar_gz/publish_nodes.tsv
[INFO] [2022-06-13 13:56:40] (119398 lines) /app/public/data/wiki_pt_tar_gz/publish_identifiers.tsv
[INFO] [2022-06-13 13:56:40] (2684889 lines) /app/public/data/wiki_pt_tar_gz/publish_node_ancestors.tsv
[INFO] [2022-06-13 13:56:40] (119398 lines) /app/public/data/wiki_pt_tar_gz/publish_scientific_names.tsv
[INFO] [2022-06-13 13:56:40] (1513006 lines) /app/public/data/wiki_pt_tar_gz/publish_articles.tsv
[INFO] [2022-06-13 13:56:40] (209329 lines) /app/public/data/wiki_pt_tar_gz/publish_content_sections.tsv
[STOP] [2022-06-13 13:56:41] complete_harvest_instance
[START] [2022-06-13 13:56:41] completed
[STOP] [2022-06-13 13:56:41] completed
[STOP] [2022-06-13 13:56:41] logged process, took 5186.63
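
The START/STOP pairs in the log above can be turned into per-stage durations, which makes it easier to see where the run spent its time. The final line reports 5186.63, presumably seconds (about 86 minutes), which agrees with the first and last timestamps. The following is a minimal sketch in Python, not part of the harvester itself: the function name `stage_durations` and the regular expression are assumptions made for illustration, and the sample lines are copied from the log. It assumes each stage name starts at most once before it stops, and it ignores STOP lines that have no matching START (such as the closing "logged process" line).

```python
import re
from datetime import datetime

# Matches lines such as "[START] [2022-06-13 12:56:06] map_all_nodes_to_pages".
# The pattern is an assumption based on the log lines shown above.
LINE_RE = re.compile(
    r"^\[(START|STOP)\] \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.+)$"
)


def stage_durations(lines):
    """Return {stage_name: seconds} for every START/STOP pair in lines."""
    started = {}
    durations = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # INFO, WARN, CMD, and untagged lines are skipped
        kind, stamp, name = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "START":
            started[name] = when
        elif name in started:
            # STOP with a matching START: record the elapsed time
            durations[name] = (when - started.pop(name)).total_seconds()
    return durations


# A few lines copied from the log above.
sample = [
    "[START] [2022-06-13 12:56:06] map_all_nodes_to_pages",
    "[STOP] [2022-06-13 13:30:14] map_all_nodes_to_pages",
    "[START] [2022-06-13 13:34:35] overall_tsv_creation",
    "[INFO] [2022-06-13 13:34:36] Processing group of 119398 in 12 batches of 10000",
    "[STOP] [2022-06-13 13:56:39] overall_tsv_creation",
    "[STOP] [2022-06-13 13:56:41] logged process, took 5186.63",
]

durations = stage_durations(sample)
# Print the slowest stages first.
for name, seconds in sorted(durations.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {seconds:.0f} s")
```

On these sample lines the sketch reports 2048 s for map_all_nodes_to_pages and 1324 s for overall_tsv_creation; the latter matches the "Total Time: 22m4s" line in the log. Together these two stages account for roughly 56 of the 86 minutes.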

Latest Process