Harvest for wikipedia DE
Created: 09 Jun 14:26
Stage: completed
Fetched: 09 Jun 14:26
Validated: 09 Jun 14:26
Deltas Created: 09 Jun 14:27
Units Normalized: 09 Jun 15:14
Ancestry Built: 09 Jun 14:51
Nodes Matched: 09 Jun 15:12
Names Parsed: 09 Jun 14:51
New Models Stored: 09 Jun 14:41
Indexed: 09 Jun 15:14
Completed: 09 Jun 15:26
Time to Harvest: 1 minute
Harvesting Log (170 lines)
[INFO] [2022-06-09 14:26:05] Created harvest instance #4134
[STOP] [2022-06-09 14:26:05] create_harvest_instance
[START] [2022-06-09 14:26:05] fetch_files
[STOP] [2022-06-09 14:26:05] fetch_files
[START] [2022-06-09 14:26:05] validate_each_file
[INFO] [2022-06-09 14:26:05] Looping over 2 formats...
[INFO] [2022-06-09 14:26:05] ...nodes (/app/public/data/wiki_de_tar_gz/taxon.tab)
[INFO] [2022-06-09 14:26:08] Valid: /app/public/data/wiki_de_tar_gz/converted_csv/wiki_de_tar_gz_nodes_29369.csv (64999 lines)
[INFO] [2022-06-09 14:26:08] ...media (/app/public/data/wiki_de_tar_gz/media_resource.tab)
[INFO] [2022-06-09 14:26:35] Valid: /app/public/data/wiki_de_tar_gz/converted_csv/wiki_de_tar_gz_media_29368.csv (101397 lines)
[STOP] [2022-06-09 14:26:35] validate_each_file
[START] [2022-06-09 14:26:35] convert_to_csv
[INFO] [2022-06-09 14:26:35] Looping over 2 formats...
[INFO] [2022-06-09 14:26:35] ...nodes (/app/public/data/wiki_de_tar_gz/taxon.tab)
[CMD] [2022-06-09 14:26:35] /usr/bin/sort /app/public/data/wiki_de_tar_gz/converted_csv/wiki_de_tar_gz_nodes_29369.csv > /app/public/data/wiki_de_tar_gz/converted_csv/wiki_de_tar_gz_nodes_29369.csv_sorted
[INFO] [2022-06-09 14:26:35] Converted: /app/public/data/wiki_de_tar_gz/converted_csv/wiki_de_tar_gz_nodes_29369.csv (64999 lines)
[INFO] [2022-06-09 14:26:35] ...media (/app/public/data/wiki_de_tar_gz/media_resource.tab)
[CMD] [2022-06-09 14:26:35] /usr/bin/sort /app/public/data/wiki_de_tar_gz/converted_csv/wiki_de_tar_gz_media_29368.csv > /app/public/data/wiki_de_tar_gz/converted_csv/wiki_de_tar_gz_media_29368.csv_sorted
[INFO] [2022-06-09 14:26:51] Converted: /app/public/data/wiki_de_tar_gz/converted_csv/wiki_de_tar_gz_media_29368.csv (101397 lines)
[STOP] [2022-06-09 14:26:51] convert_to_csv
[START] [2022-06-09 14:26:51] calculate_delta
[INFO] [2022-06-09 14:26:51] Looping over 2 formats...
[INFO] [2022-06-09 14:26:51] ...nodes (/app/public/data/wiki_de_tar_gz/taxon.tab)
[CMD] [2022-06-09 14:26:51] echo "0a" > /app/public/data/wiki_de_tar_gz/diff/wiki_de_tar_gz_nodes_29369.diff
[CMD] [2022-06-09 14:26:51] tail -n +1 /app/public/data/wiki_de_tar_gz/converted_csv/wiki_de_tar_gz_nodes_29369.csv >> /app/public/data/wiki_de_tar_gz/diff/wiki_de_tar_gz_nodes_29369.diff
[CMD] [2022-06-09 14:26:52] echo "." >> /app/public/data/wiki_de_tar_gz/diff/wiki_de_tar_gz_nodes_29369.diff
[INFO] [2022-06-09 14:26:52] Created diff: /app/public/data/wiki_de_tar_gz/diff/wiki_de_tar_gz_nodes_29369.diff (65001 lines)
[INFO] [2022-06-09 14:26:52] ...media (/app/public/data/wiki_de_tar_gz/media_resource.tab)
[CMD] [2022-06-09 14:26:52] echo "0a" > /app/public/data/wiki_de_tar_gz/diff/wiki_de_tar_gz_media_29368.diff
[CMD] [2022-06-09 14:26:52] tail -n +1 /app/public/data/wiki_de_tar_gz/converted_csv/wiki_de_tar_gz_media_29368.csv >> /app/public/data/wiki_de_tar_gz/diff/wiki_de_tar_gz_media_29368.diff
[CMD] [2022-06-09 14:27:00] echo "." >> /app/public/data/wiki_de_tar_gz/diff/wiki_de_tar_gz_media_29368.diff
[INFO] [2022-06-09 14:27:02] Created diff: /app/public/data/wiki_de_tar_gz/diff/wiki_de_tar_gz_media_29368.diff (101399 lines)
[STOP] [2022-06-09 14:27:02] calculate_delta
[START] [2022-06-09 14:27:03] parse_diff_and_store
[INFO] [2022-06-09 14:27:03] Handling diff: /app/public/data/wiki_de_tar_gz/diff/wiki_de_tar_gz_nodes_29369.diff (65001 lines)
[INFO] [2022-06-09 14:27:03] Loading nodes diff file into memory (65001 lines)...
[WARN] [2022-06-09 14:27:06] Filtered Scientific Name `Cuon alpinus fumosus/javanicus` to `Cuon alpinus fumosusjavanicus`
[INFO] [2022-06-09 14:27:07] Storing 9999 ScientificNames (29997/10000/65001)
[INFO] [2022-06-09 14:27:13] Storing 9999 Identifiers (29997/10000/65001)
[INFO] [2022-06-09 14:27:14] Storing 9999 Nodes (29997/10000/65001)
[INFO] [2022-06-09 14:27:21] Storing 10000 ScientificNames (59997/20000/65001)
[INFO] [2022-06-09 14:27:25] Storing 10000 Identifiers (59997/20000/65001)
[INFO] [2022-06-09 14:27:26] Storing 10000 Nodes (59997/20000/65001)
[INFO] [2022-06-09 14:27:33] Storing 10000 ScientificNames (89997/30000/65001)
[INFO] [2022-06-09 14:27:37] Storing 10000 Identifiers (89997/30000/65001)
[INFO] [2022-06-09 14:27:38] Storing 10000 Nodes (89997/30000/65001)
[INFO] [2022-06-09 14:27:44] Storing 10000 ScientificNames (119998/40000/65001)
[INFO] [2022-06-09 14:27:47] Storing 10001 Identifiers (119998/40000/65001)
[INFO] [2022-06-09 14:27:48] Storing 10000 Nodes (119998/40000/65001)
[INFO] [2022-06-09 14:27:55] Storing 10000 ScientificNames (149998/50000/65001)
[INFO] [2022-06-09 14:27:58] Storing 10000 Identifiers (149998/50000/65001)
[INFO] [2022-06-09 14:27:59] Storing 10000 Nodes (149998/50000/65001)
[WARN] [2022-06-09 14:28:04] Filtered Scientific Name `Pacmanvirus A23` to `Pacmanvirus A23`
[WARN] [2022-06-09 14:28:04] Filtered Scientific Name `Homalocephala polycephala` to `Homalocephala polycephala`
[INFO] [2022-06-09 14:28:06] Storing 10000 ScientificNames (179998/60000/65001)
[INFO] [2022-06-09 14:28:10] Storing 10000 Identifiers (179998/60000/65001)
[INFO] [2022-06-09 14:28:11] Storing 10000 Nodes (179998/60000/65001)
[INFO] [2022-06-09 14:28:17] Storing 5000 ScientificNames (194998/64999/65001)
[INFO] [2022-06-09 14:28:19] Storing 5000 Identifiers (194998/64999/65001)
[INFO] [2022-06-09 14:28:19] Storing 5000 Nodes (194998/64999/65001)
[INFO] [2022-06-09 14:28:23] Handling diff: /app/public/data/wiki_de_tar_gz/diff/wiki_de_tar_gz_media_29368.diff (101399 lines)
[INFO] [2022-06-09 14:28:23] Loading media diff file into memory (101399 lines)...
[INFO] [2022-06-09 14:29:34] Storing 9999 ArticlesSections (19998/10000/101399)
[INFO] [2022-06-09 14:29:34] Storing 9999 Articles (19998/10000/101399)
[INFO] [2022-06-09 14:30:51] Storing 10000 ArticlesSections (39998/20000/101399)
[INFO] [2022-06-09 14:30:51] Storing 10000 Articles (39998/20000/101399)
[INFO] [2022-06-09 14:32:12] Storing 10000 ArticlesSections (59998/30000/101399)
[INFO] [2022-06-09 14:32:12] Storing 10000 Articles (59998/30000/101399)
[INFO] [2022-06-09 14:33:27] Storing 10000 ArticlesSections (79998/40000/101399)
[INFO] [2022-06-09 14:33:28] Storing 10000 Articles (79998/40000/101399)
[INFO] [2022-06-09 14:34:46] Storing 10000 ArticlesSections (99998/50000/101399)
[INFO] [2022-06-09 14:34:46] Storing 10000 Articles (99998/50000/101399)
[INFO] [2022-06-09 14:36:05] Storing 10000 ArticlesSections (119998/60000/101399)
[INFO] [2022-06-09 14:36:05] Storing 10000 Articles (119998/60000/101399)
[INFO] [2022-06-09 14:37:22] Storing 10000 ArticlesSections (139998/70000/101399)
[INFO] [2022-06-09 14:37:22] Storing 10000 Articles (139998/70000/101399)
[INFO] [2022-06-09 14:38:44] Storing 10000 ArticlesSections (159998/80000/101399)
[INFO] [2022-06-09 14:38:45] Storing 10000 Articles (159998/80000/101399)
[INFO] [2022-06-09 14:40:00] Storing 10000 ArticlesSections (179998/90000/101399)
[INFO] [2022-06-09 14:40:02] Storing 10000 Articles (179998/90000/101399)
[INFO] [2022-06-09 14:41:22] Storing 10000 ArticlesSections (199998/100000/101399)
[INFO] [2022-06-09 14:41:23] Storing 10000 Articles (199998/100000/101399)
[INFO] [2022-06-09 14:41:40] Storing 1398 ArticlesSections (202794/101397/101399)
[INFO] [2022-06-09 14:41:40] Storing 1398 Articles (202794/101397/101399)
[STOP] [2022-06-09 14:41:41] parse_diff_and_store
[START] [2022-06-09 14:41:41] resolve_keys
[2022-06-09 14:43:47] Resolving downloaded urls (this is not actually downloading them yet)
[INFO] [2022-06-09 14:47:43] Occurrences to nodes (through scientific_names)...
[INFO] [2022-06-09 14:47:43] traits to occurrences...
[INFO] [2022-06-09 14:47:43] traits to nodes (through occurrences)...
[INFO] [2022-06-09 14:47:43] Traits to sex term...
[INFO] [2022-06-09 14:47:43] Traits to lifestage term...
[INFO] [2022-06-09 14:47:43] MetaTraits to traits...
[INFO] [2022-06-09 14:47:43] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2022-06-09 14:47:44] Assocs to occurrences...
[INFO] [2022-06-09 14:47:44] Assocs to nodes...
[INFO] [2022-06-09 14:47:44] Assoc to sex term...
[INFO] [2022-06-09 14:47:44] Assoc to lifestage term...
[INFO] [2022-06-09 14:47:44] MetaAssoc to assocs...
[STOP] [2022-06-09 14:47:44] resolve_keys
[START] [2022-06-09 14:47:44] hold_for_later_1
[STOP] [2022-06-09 14:47:44] hold_for_later_1
[START] [2022-06-09 14:47:44] hold_for_later_2
[STOP] [2022-06-09 14:47:44] hold_for_later_2
[START] [2022-06-09 14:47:44] resolve_missing_parents
[STOP] [2022-06-09 14:47:48] resolve_missing_parents
[START] [2022-06-09 14:47:48] rebuild_nodes
[START] [2022-06-09 14:47:48] Flattener#flatten
[START] [2022-06-09 14:47:48] Flattener#study_resource
[START] [2022-06-09 14:47:48] Flattener#build_ancestry
[STOP] [2022-06-09 14:47:58] Flattener#build_ancestry
[INFO] [2022-06-09 14:47:58] 64999 ancestry keys
[START] [2022-06-09 14:47:58] build_node_ancestors
[INFO] [2022-06-09 14:47:58] old ancestors deleted.
[STOP] [2022-06-09 14:50:18] build_node_ancestors
[START] [2022-06-09 14:50:25] Flattener#propagate_ancestor_ids
[STOP] [2022-06-09 14:51:01] Flattener#propagate_ancestor_ids
[STOP] [2022-06-09 14:51:01] Flattener#flatten
[STOP] [2022-06-09 14:51:01] rebuild_nodes
[START] [2022-06-09 14:51:01] resolve_missing_media_owners
[STOP] [2022-06-09 14:51:01] resolve_missing_media_owners
[START] [2022-06-09 14:51:01] sanitize_media_verbatims
[STOP] [2022-06-09 14:51:01] sanitize_media_verbatims
[START] [2022-06-09 14:51:01] queue_downloads
[STOP] [2022-06-09 14:51:01] queue_downloads
[START] [2022-06-09 14:51:01] parse_names
[WARN] [2022-06-09 14:51:01] I see 64999 names which still need to be parsed.
[WARN] [2022-06-09 14:51:02] Names to parse: 10000 formatted: 10000 learned: 9997 parsed: 10000
[WARN] [2022-06-09 14:51:09] Names to parse: 10000 formatted: 10000 learned: 9998 parsed: 10000
[WARN] [2022-06-09 14:51:16] Names to parse: 10000 formatted: 10000 learned: 9999 parsed: 10000
[WARN] [2022-06-09 14:51:22] Names to parse: 10000 formatted: 10000 learned: 9992 parsed: 10000
[WARN] [2022-06-09 14:51:29] Names to parse: 10000 formatted: 10000 learned: 9996 parsed: 10000
[WARN] [2022-06-09 14:51:36] Names to parse: 10000 formatted: 10000 learned: 9993 parsed: 10000
[WARN] [2022-06-09 14:51:43] Names to parse: 4999 formatted: 4999 learned: 4999 parsed: 4999
[STOP] [2022-06-09 14:51:47] parse_names
[START] [2022-06-09 14:51:47] denormalize_canonical_names_to_nodes
[STOP] [2022-06-09 14:51:48] denormalize_canonical_names_to_nodes
[START] [2022-06-09 14:51:48] match_nodes
[START] [2022-06-09 14:51:48] map_all_nodes_to_pages
[STOP] [2022-06-09 15:11:30] map_all_nodes_to_pages
[INFO] [2022-06-09 15:11:30] 6266 Unmatched nodes (of 64999)! That's too many to output. Full list in /app/public/data/wiki_de_tar_gz/unmatched_nodes.txt ; First 10: Canonical: Anthradapis; Node#116347035; ResourceID: Q107479053; Canonical: Pseudolabrini; Node#116348215; ResourceID: Q111008219; Canonical: incertae sedis; Node#116380714; ResourceID: Q235536; Canonical: Aulographales; Node#116348175; ResourceID: Q110788345; Canonical: Biota; Node#116380939; ResourceID: Q2382443; Canonical: Prokaryota; Node#116373750; ResourceID: Q19081; Canonical: Nitrosopumilus limneticus; Node#116348294; ResourceID: Q111593158; Canonical: Halorubrum salsolis; Node#116397861; ResourceID: Q5643447; Canonical: Methanoliparia; Node#116348141; ResourceID: Q110623777; Canonical: Proteoarchaeota; Node#116378276; ResourceID: Q21282292
[START] [2022-06-09 15:11:30] update_nodes
[STOP] [2022-06-09 15:12:08] update_nodes
[STOP] [2022-06-09 15:12:08] match_nodes
[START] [2022-06-09 15:12:08] reindex_search
[STOP] [2022-06-09 15:14:23] reindex_search
[START] [2022-06-09 15:14:23] normalize_units
[STOP] [2022-06-09 15:14:23] normalize_units
[START] [2022-06-09 15:14:23] calculate_statistics
[INFO] [2022-06-09 15:14:32] Duplicate page_id count: 0
[STOP] [2022-06-09 15:14:32] calculate_statistics
[START] [2022-06-09 15:14:32] complete_harvest_instance
[START] [2022-06-09 15:14:32] overall_tsv_creation
[INFO] [2022-06-09 15:14:32] Processing group of 64999 in 7 batches of 10000
[INFO] [2022-06-09 15:26:45] Average Time: 54.776
[INFO] [2022-06-09 15:26:45] Total Time: 12m13s
[INFO] [2022-06-09 15:26:45] last 3 / first 3: 0.84
[INFO] [2022-06-09 15:26:45] Std.Dev: 10.043; Max: 71.61
[STOP] [2022-06-09 15:26:45] overall_tsv_creation
[INFO] [2022-06-09 15:26:45] Done. Check your files:
[INFO] [2022-06-09 15:26:45] (64999 lines) /app/public/data/wiki_de_tar_gz/publish_nodes.tsv
[INFO] [2022-06-09 15:26:45] (65000 lines) /app/public/data/wiki_de_tar_gz/publish_identifiers.tsv
[INFO] [2022-06-09 15:26:45] (1397240 lines) /app/public/data/wiki_de_tar_gz/publish_node_ancestors.tsv
[INFO] [2022-06-09 15:26:45] (64999 lines) /app/public/data/wiki_de_tar_gz/publish_scientific_names.tsv
[INFO] [2022-06-09 15:26:45] (1873414 lines) /app/public/data/wiki_de_tar_gz/publish_articles.tsv
[INFO] [2022-06-09 15:26:45] (101397 lines) /app/public/data/wiki_de_tar_gz/publish_content_sections.tsv
[STOP] [2022-06-09 15:26:46] complete_harvest_instance
[START] [2022-06-09 15:26:46] completed
[STOP] [2022-06-09 15:26:46] completed
[STOP] [2022-06-09 15:26:46] logged process, took 3640.4
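The [START]/[STOP] pairs in the log above can be turned into per-stage durations, which makes it easier to see where the 3640.4 seconds reported on the last line were spent. The following is a minimal sketch, not part of the harvester itself: the function name stage_durations and the regular expression are assumptions made for illustration, and the parser assumes every line follows the "[TAG] [timestamp] message" shape seen in this log. Lines without a START/STOP tag, and STOP lines with no matching START, are skipped.

```python
import re
from datetime import datetime

# Matches lines such as "[START] [2022-06-09 14:27:03] parse_diff_and_store".
# This pattern is an assumption based on the log format shown above.
LINE_RE = re.compile(
    r"^\[(START|STOP)\] \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.+)$"
)


def stage_durations(lines):
    """Return {stage_name: seconds} for each START/STOP pair in the log lines."""
    started = {}
    durations = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # INFO, WARN, CMD and untagged lines are ignored
        kind, stamp, name = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "START":
            started[name] = when
        elif name in started:
            # STOP with a matching START: record the elapsed time
            durations[name] = (when - started.pop(name)).total_seconds()
    return durations


# Four lines copied from the log above.
sample = [
    "[START] [2022-06-09 14:27:03] parse_diff_and_store",
    "[STOP] [2022-06-09 14:41:41] parse_diff_and_store",
    "[START] [2022-06-09 14:51:48] map_all_nodes_to_pages",
    "[STOP] [2022-06-09 15:11:30] map_all_nodes_to_pages",
]

for name, seconds in stage_durations(sample).items():
    print(f"{name}: {seconds:.0f} s")
```

On these four lines the sketch reports 878 s for parse_diff_and_store and 1182 s for map_all_nodes_to_pages. Nested stages (for example Flattener#flatten inside rebuild_nodes) are reported independently, so their durations overlap and should not be summed.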