Harvest for wikipedia MIN, created 29 May 14:55

Stage: completed
Fetched: 29 May 14:55
Validated: 29 May 14:55
Deltas Created: 29 May 14:55
Units Normalized: 30 May 00:10
Ancestry Built: 29 May 15:29
Nodes Matched: 29 May 23:55
Names Parsed: 29 May 15:31
New Models Stored: 29 May 15:12
Indexed: 30 May 00:10
Completed: 30 May 00:44
Time to Harvest: about 9 hours 49 minutes

Harvesting Log (oldest first)

# Logfile created on 2019-11-19 12:07:36 -0500 by logger.rb/56815
[START] [2019-11-19 12:07:36] logged process
[START] [2019-11-19 12:07:36] create_harvest_instance
[STOP] [2019-11-19 12:07:37] create_harvest_instance
[START] [2019-11-19 12:07:37] fetch_files
[STOP] [2019-11-19 12:07:37] fetch_files
[START] [2019-11-19 12:07:37] validate_each_file
[STOP] [2019-11-19 12:08:05] validate_each_file
[START] [2019-11-19 12:08:05] convert_to_csv
[CMD] [2019-11-19 12:08:05] /usr/bin/sort /app/public/converted_csv/wiki_min_minangk_nodes_18706.csv > /app/public/converted_csv/wiki_min_minangk_nodes_18706.csv_sorted
[CMD] [2019-11-19 12:08:05] /usr/bin/sort /app/public/converted_csv/wiki_min_minangk_media_18707.csv > /app/public/converted_csv/wiki_min_minangk_media_18707.csv_sorted
[STOP] [2019-11-19 12:08:06] convert_to_csv
[START] [2019-11-19 12:08:06] calculate_delta
[CMD] [2019-11-19 12:08:06] echo "0a" > /app/public/diff/wiki_min_minangk_nodes_18706.diff
[CMD] [2019-11-19 12:08:06] tail -n +1 /app/public/converted_csv/wiki_min_minangk_nodes_18706.csv >> /app/public/diff/wiki_min_minangk_nodes_18706.diff
[CMD] [2019-11-19 12:08:06] echo "." >> /app/public/diff/wiki_min_minangk_nodes_18706.diff
[CMD] [2019-11-19 12:08:07] echo "0a" > /app/public/diff/wiki_min_minangk_media_18707.diff
[CMD] [2019-11-19 12:08:07] tail -n +1 /app/public/converted_csv/wiki_min_minangk_media_18707.csv >> /app/public/diff/wiki_min_minangk_media_18707.diff
[CMD] [2019-11-19 12:08:07] echo "." >> /app/public/diff/wiki_min_minangk_media_18707.diff
[STOP] [2019-11-19 12:08:08] calculate_delta
[START] [2019-11-19 12:08:08] parse_diff_and_store
[INFO] [2019-11-19 12:08:08] Loading nodes diff file into memory (true lines)...
[INFO] [2019-11-19 12:09:45] Loading media diff file into memory (true lines)...
[INFO] [2019-11-19 12:20:37] Storing 175486 ScientificNames
[INFO] [2019-11-19 12:20:37] Processing group of 175486 in 176 groups of 1000
[INFO] [2019-11-19 12:22:12] Average Time: 0.54
[INFO] [2019-11-19 12:22:12] Total Time: 1m36s
[INFO] [2019-11-19 12:22:12] last 3 / first 3: 0.61
[INFO] [2019-11-19 12:22:12] Std.Dev: 0.8619744775803979; Max: 5.34
[INFO] [2019-11-19 12:22:12] Storing 175486 Identifiers
[INFO] [2019-11-19 12:22:12] Processing group of 175486 in 176 groups of 1000
[INFO] [2019-11-19 12:22:41] Average Time: 0.16
[INFO] [2019-11-19 12:22:41] Total Time: 30s
[INFO] [2019-11-19 12:22:41] last 3 / first 3: 0.75
[INFO] [2019-11-19 12:22:41] Std.Dev: 0.552268050859363; Max: 5.34
[INFO] [2019-11-19 12:22:41] Storing 175486 Nodes
[INFO] [2019-11-19 12:22:41] Processing group of 175486 in 176 groups of 1000
[INFO] [2019-11-19 12:24:14] Average Time: 0.522
[INFO] [2019-11-19 12:24:14] Total Time: 1m33s
[INFO] [2019-11-19 12:24:14] last 3 / first 3: 0.6
[INFO] [2019-11-19 12:24:14] Std.Dev: 1.0516653460107925; Max: 6.32
[INFO] [2019-11-19 12:24:14] Storing 233143 ArticlesSections
[INFO] [2019-11-19 12:24:14] Processing group of 233143 in 234 groups of 1000
[INFO] [2019-11-19 12:24:41] Average Time: 0.11
[INFO] [2019-11-19 12:24:41] Total Time: 27s
[INFO] [2019-11-19 12:24:41] last 3 / first 3: 0.33
[INFO] [2019-11-19 12:24:41] Std.Dev: 0.5531726674375732; Max: 6.08
[INFO] [2019-11-19 12:24:41] Storing 233143 Articles
[INFO] [2019-11-19 12:24:41] Processing group of 233143 in 234 groups of 1000
[INFO] [2019-11-19 12:27:21] Average Time: 0.679
[INFO] [2019-11-19 12:27:21] Total Time: 2m40s
[INFO] [2019-11-19 12:27:21] last 3 / first 3: 0.39
[INFO] [2019-11-19 12:27:21] Std.Dev: 1.3019216566291538; Max: 7.18
[STOP] [2019-11-19 12:27:21] parse_diff_and_store
[START] [2019-11-19 12:27:21] resolve_keys
[INFO] [2019-11-19 12:34:06] Occurrences to nodes (through scientific_names)...
[INFO] [2019-11-19 12:34:06] traits to occurrences...
[INFO] [2019-11-19 12:34:06] traits to nodes (through occurrences)...
[INFO] [2019-11-19 12:34:06] Traits to sex term...
[INFO] [2019-11-19 12:34:06] Traits to lifestage term...
[INFO] [2019-11-19 12:34:06] MetaTraits to traits...
[INFO] [2019-11-19 12:34:06] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2019-11-19 12:34:06] Assocs to occurrences...
[INFO] [2019-11-19 12:34:06] Assocs to nodes...
[INFO] [2019-11-19 12:34:06] Assoc to sex term...
[INFO] [2019-11-19 12:34:07] Assoc to lifestage term...
[STOP] [2019-11-19 12:34:07] resolve_keys
[START] [2019-11-19 12:34:07] hold_for_later_1
[STOP] [2019-11-19 12:34:07] hold_for_later_1
[START] [2019-11-19 12:34:07] hold_for_later_2
[STOP] [2019-11-19 12:34:07] hold_for_later_2
[START] [2019-11-19 12:34:07] resolve_missing_parents
[STOP] [2019-11-19 12:34:14] resolve_missing_parents
[START] [2019-11-19 12:34:14] rebuild_nodes
[START] [2019-11-19 12:34:14] Flattener#flatten
[START] [2019-11-19 12:34:14] Flattener#study_resource
[START] [2019-11-19 12:34:28] Flattener#build_ancestry
[STOP] [2019-11-19 12:35:33] Flattener#build_ancestry
[INFO] [2019-11-19 12:35:33] 175486 ancestry keys
[START] [2019-11-19 12:35:33] build_node_ancestors
[INFO] [2019-11-19 12:35:33] old ancestors deleted.
[STOP] [2019-11-19 12:45:40] build_node_ancestors
[START] [2019-11-19 12:45:42] Flattener#propagate_ancestor_ids
[STOP] [2019-11-19 12:49:15] Flattener#propagate_ancestor_ids
[STOP] [2019-11-19 12:49:15] Flattener#flatten
[STOP] [2019-11-19 12:49:15] rebuild_nodes
[START] [2019-11-19 12:49:15] resolve_missing_media_owners
[STOP] [2019-11-19 12:49:15] resolve_missing_media_owners
[START] [2019-11-19 12:49:15] sanitize_media_verbatims
[STOP] [2019-11-19 12:49:15] sanitize_media_verbatims
[START] [2019-11-19 12:49:15] queue_downloads
[STOP] [2019-11-19 12:49:15] queue_downloads
[START] [2019-11-19 12:49:15] parse_names
[WARN] [2019-11-19 12:49:16] I see 175486 names which still need to be parsed.
[WARN] [2019-11-19 12:51:32] I see 3 names which still need to be parsed.
[STOP] [2019-11-19 12:51:33] parse_names
[START] [2019-11-19 12:51:33] denormalize_canonical_names_to_nodes
[STOP] [2019-11-19 12:51:36] denormalize_canonical_names_to_nodes
[START] [2019-11-19 12:51:36] match_nodes
[START] [2019-11-19 12:51:37] map_all_nodes_to_pages
[STOP] [2019-11-19 15:43:33] map_all_nodes_to_pages
[INFO] [2019-11-19 15:43:33] 4930 Unmatched nodes (of 175486)! That's too many to output. First 10: Biota (#55294430); Prokaryota (#55278222); Bacteria (#55154895); Diaphoretickes (#55273860); Sar (#55162628); Phaeista (#55310830); Limnista (#55277367); Fucistia (#55309427); Fucophycidae (#55283422); Chromalveolata (#55315710)
[START] [2019-11-19 15:43:33] update_nodes
[STOP] [2019-11-19 15:44:05] update_nodes
[STOP] [2019-11-19 15:44:05] match_nodes
[START] [2019-11-19 15:44:05] reindex_search
[STOP] [2019-11-19 15:55:34] reindex_search
[START] [2019-11-19 15:55:34] normalize_units
[STOP] [2019-11-19 15:55:34] normalize_units
[START] [2019-11-19 15:55:34] calculate_statistics
[STOP] [2019-11-19 15:55:35] calculate_statistics
[START] [2019-11-19 15:55:35] complete_harvest_instance
[START] [2019-11-19 15:55:35] overall_tsv_creation
[INFO] [2019-11-19 15:55:36] Processing group of 175486 in 18 batches of 10000
[INFO] [2019-11-19 16:38:48] Average Time: 91.187
[INFO] [2019-11-19 16:38:48] Total Time: 43m13s
[INFO] [2019-11-19 16:38:48] last 3 / first 3: 0.83
[INFO] [2019-11-19 16:38:48] Std.Dev: 12.378449014315162; Max: 103.17
[STOP] [2019-11-19 16:38:48] overall_tsv_creation
[INFO] [2019-11-19 16:38:48] Done. Check your files:
[INFO] [2019-11-19 16:38:49] (175486 lines) /app/public/data/wiki_min_minangk/publish_nodes.tsv
[INFO] [2019-11-19 16:38:49] (175486 lines) /app/public/data/wiki_min_minangk/publish_identifiers.tsv
[INFO] [2019-11-19 16:38:49] (4016244 lines) /app/public/data/wiki_min_minangk/publish_node_ancestors.tsv
[INFO] [2019-11-19 16:38:49] (175486 lines) /app/public/data/wiki_min_minangk/publish_scientific_names.tsv
[INFO] [2019-11-19 16:38:50] (430143 lines) /app/public/data/wiki_min_minangk/publish_articles.tsv
[INFO] [2019-11-19 16:38:50] (233143 lines) /app/public/data/wiki_min_minangk/publish_content_sections.tsv
[STOP] [2019-11-19 16:38:50] complete_harvest_instance
[START] [2019-11-19 16:38:50] completed
[STOP] [2019-11-19 16:38:50] completed
[STOP] [2019-11-19 16:38:50] logged process, took 16274.18
[INFO] [2020-05-29 14:33:42] ## HARVEST: type = re_-harvest
[START] [2020-05-29 14:55:03] logged process
[START] [2020-05-29 14:55:04] create_harvest_instance
[STOP] [2020-05-29 14:55:06] create_harvest_instance
[START] [2020-05-29 14:55:06] fetch_files
[STOP] [2020-05-29 14:55:06] fetch_files
[START] [2020-05-29 14:55:06] validate_each_file
[STOP] [2020-05-29 14:55:26] validate_each_file
[START] [2020-05-29 14:55:26] convert_to_csv
[CMD] [2020-05-29 14:55:26] /usr/bin/sort /app/public/converted_csv/wiki_min_minangk_nodes_21030.csv > /app/public/converted_csv/wiki_min_minangk_nodes_21030.csv_sorted
[CMD] [2020-05-29 14:55:26] /usr/bin/sort /app/public/converted_csv/wiki_min_minangk_media_21031.csv > /app/public/converted_csv/wiki_min_minangk_media_21031.csv_sorted
[STOP] [2020-05-29 14:55:27] convert_to_csv
[START] [2020-05-29 14:55:27] calculate_delta
[CMD] [2020-05-29 14:55:27] echo "0a" > /app/public/diff/wiki_min_minangk_nodes_21030.diff
[CMD] [2020-05-29 14:55:27] tail -n +1 /app/public/converted_csv/wiki_min_minangk_nodes_21030.csv >> /app/public/diff/wiki_min_minangk_nodes_21030.diff
[CMD] [2020-05-29 14:55:27] echo "." >> /app/public/diff/wiki_min_minangk_nodes_21030.diff
[CMD] [2020-05-29 14:55:27] echo "0a" > /app/public/diff/wiki_min_minangk_media_21031.diff
[CMD] [2020-05-29 14:55:27] tail -n +1 /app/public/converted_csv/wiki_min_minangk_media_21031.csv >> /app/public/diff/wiki_min_minangk_media_21031.diff
[CMD] [2020-05-29 14:55:27] echo "." >> /app/public/diff/wiki_min_minangk_media_21031.diff
[STOP] [2020-05-29 14:55:27] calculate_delta
[START] [2020-05-29 14:55:27] parse_diff_and_store
[INFO] [2020-05-29 14:55:28] Loading nodes diff file into memory (true lines)...
[INFO] [2020-05-29 14:56:36] Loading media diff file into memory (true lines)...
[INFO] [2020-05-29 15:07:17] Storing 175486 ScientificNames
[INFO] [2020-05-29 15:07:17] Processing group of 175486 in 176 groups of 1000
[INFO] [2020-05-29 15:08:32] Average Time: 0.421
[INFO] [2020-05-29 15:08:32] Total Time: 1m15s
[INFO] [2020-05-29 15:08:32] last 3 / first 3: 0.64
[INFO] [2020-05-29 15:08:32] Std.Dev: 0.6913754406977441; Max: 5.72
[INFO] [2020-05-29 15:08:32] Storing 175486 Identifiers
[INFO] [2020-05-29 15:08:32] Processing group of 175486 in 176 groups of 1000
[INFO] [2020-05-29 15:08:52] Average Time: 0.106
[INFO] [2020-05-29 15:08:52] Total Time: 20s
[INFO] [2020-05-29 15:08:52] last 3 / first 3: 0.71
[INFO] [2020-05-29 15:08:52] Std.Dev: 0.03162277660168379; Max: 0.25
[INFO] [2020-05-29 15:08:52] Storing 175486 Nodes
[INFO] [2020-05-29 15:08:52] Processing group of 175486 in 176 groups of 1000
[INFO] [2020-05-29 15:10:16] Average Time: 0.476
[INFO] [2020-05-29 15:10:16] Total Time: 1m25s
[INFO] [2020-05-29 15:10:16] last 3 / first 3: 0.52
[INFO] [2020-05-29 15:10:16] Std.Dev: 0.9939818911831342; Max: 6.53
[INFO] [2020-05-29 15:10:16] Storing 233143 ArticlesSections
[INFO] [2020-05-29 15:10:16] Processing group of 233143 in 234 groups of 1000
[INFO] [2020-05-29 15:10:29] Average Time: 0.054
[INFO] [2020-05-29 15:10:29] Total Time: 14s
[INFO] [2020-05-29 15:10:29] last 3 / first 3: 0.65
[INFO] [2020-05-29 15:10:29] Std.Dev: 0.03162277660168379; Max: 0.29
[INFO] [2020-05-29 15:10:29] Storing 233143 Articles
[INFO] [2020-05-29 15:10:29] Processing group of 233143 in 234 groups of 1000
[INFO] [2020-05-29 15:12:56] Average Time: 0.619
[INFO] [2020-05-29 15:12:56] Total Time: 2m27s
[INFO] [2020-05-29 15:12:56] last 3 / first 3: 0.12
[INFO] [2020-05-29 15:12:56] Std.Dev: 1.214084016862095; Max: 7.39
[STOP] [2020-05-29 15:12:56] parse_diff_and_store
[START] [2020-05-29 15:12:56] resolve_keys
[INFO] [2020-05-29 15:16:54] Occurrences to nodes (through scientific_names)...
[INFO] [2020-05-29 15:16:54] traits to occurrences...
[INFO] [2020-05-29 15:16:54] traits to nodes (through occurrences)...
[INFO] [2020-05-29 15:16:54] Traits to sex term...
[INFO] [2020-05-29 15:16:54] Traits to lifestage term...
[INFO] [2020-05-29 15:16:54] MetaTraits to traits...
[INFO] [2020-05-29 15:16:54] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2020-05-29 15:16:54] Assocs to occurrences...
[INFO] [2020-05-29 15:16:54] Assocs to nodes...
[INFO] [2020-05-29 15:16:54] Assoc to sex term...
[INFO] [2020-05-29 15:16:54] Assoc to lifestage term...
[STOP] [2020-05-29 15:16:54] resolve_keys
[START] [2020-05-29 15:16:54] hold_for_later_1
[STOP] [2020-05-29 15:16:54] hold_for_later_1
[START] [2020-05-29 15:16:54] hold_for_later_2
[STOP] [2020-05-29 15:16:54] hold_for_later_2
[START] [2020-05-29 15:16:54] resolve_missing_parents
[STOP] [2020-05-29 15:17:04] resolve_missing_parents
[START] [2020-05-29 15:17:04] rebuild_nodes
[START] [2020-05-29 15:17:04] Flattener#flatten
[START] [2020-05-29 15:17:04] Flattener#study_resource
[START] [2020-05-29 15:17:17] Flattener#build_ancestry
[STOP] [2020-05-29 15:18:17] Flattener#build_ancestry
[INFO] [2020-05-29 15:18:17] 175486 ancestry keys
[START] [2020-05-29 15:18:17] build_node_ancestors
[INFO] [2020-05-29 15:18:17] old ancestors deleted.
[STOP] [2020-05-29 15:25:45] build_node_ancestors
[START] [2020-05-29 15:25:46] Flattener#propagate_ancestor_ids
[STOP] [2020-05-29 15:29:18] Flattener#propagate_ancestor_ids
[STOP] [2020-05-29 15:29:18] Flattener#flatten
[STOP] [2020-05-29 15:29:18] rebuild_nodes
[START] [2020-05-29 15:29:18] resolve_missing_media_owners
[STOP] [2020-05-29 15:29:18] resolve_missing_media_owners
[START] [2020-05-29 15:29:18] sanitize_media_verbatims
[STOP] [2020-05-29 15:29:18] sanitize_media_verbatims
[START] [2020-05-29 15:29:18] queue_downloads
[STOP] [2020-05-29 15:29:18] queue_downloads
[START] [2020-05-29 15:29:18] parse_names
[WARN] [2020-05-29 15:29:18] I see 175486 names which still need to be parsed.
[WARN] [2020-05-29 15:31:28] I see 3 names which still need to be parsed.
[STOP] [2020-05-29 15:31:29] parse_names
[START] [2020-05-29 15:31:29] denormalize_canonical_names_to_nodes
[STOP] [2020-05-29 15:31:32] denormalize_canonical_names_to_nodes
[START] [2020-05-29 15:31:32] match_nodes
[START] [2020-05-29 15:31:32] map_all_nodes_to_pages
[STOP] [2020-05-29 23:54:46] map_all_nodes_to_pages
[INFO] [2020-05-29 23:54:46] 5128 Unmatched nodes (of 175486)! That's too many to output. First 10: Biota (#79702187); Prokaryota (#79685979); Bacteria (#79562652); Diaphoretickes (#79681617); Sar (#79570385); Phaeista (#79718587); Limnista (#79685124); Fucistia (#79717184); Fucophycidae (#79691179); Chromalveolata (#79723467)
[START] [2020-05-29 23:54:46] update_nodes
[STOP] [2020-05-29 23:55:25] update_nodes
[STOP] [2020-05-29 23:55:25] match_nodes
[START] [2020-05-29 23:55:25] reindex_search
[STOP] [2020-05-30 00:10:25] reindex_search
[START] [2020-05-30 00:10:25] normalize_units
[STOP] [2020-05-30 00:10:25] normalize_units
[START] [2020-05-30 00:10:25] calculate_statistics
[STOP] [2020-05-30 00:10:27] calculate_statistics
[START] [2020-05-30 00:10:27] complete_harvest_instance
[START] [2020-05-30 00:10:27] overall_tsv_creation
[INFO] [2020-05-30 00:10:27] Processing group of 175486 in 18 batches of 10000
[INFO] [2020-05-30 00:44:28] Average Time: 58.683
[INFO] [2020-05-30 00:44:28] Total Time: 34m2s
[INFO] [2020-05-30 00:44:28] last 3 / first 3: 0.72
[INFO] [2020-05-30 00:44:28] Std.Dev: 8.209384873423831; Max: 77.42
[STOP] [2020-05-30 00:44:28] overall_tsv_creation
[INFO] [2020-05-30 00:44:28] Done. Check your files:
[INFO] [2020-05-30 00:44:29] (175486 lines) /app/public/data/wiki_min_minangk/publish_nodes.tsv
[INFO] [2020-05-30 00:44:29] (175486 lines) /app/public/data/wiki_min_minangk/publish_identifiers.tsv
[INFO] [2020-05-30 00:44:29] (4016244 lines) /app/public/data/wiki_min_minangk/publish_node_ancestors.tsv
[INFO] [2020-05-30 00:44:29] (175486 lines) /app/public/data/wiki_min_minangk/publish_scientific_names.tsv
[INFO] [2020-05-30 00:44:29] (430143 lines) /app/public/data/wiki_min_minangk/publish_articles.tsv
[INFO] [2020-05-30 00:44:30] (233143 lines) /app/public/data/wiki_min_minangk/publish_content_sections.tsv
[STOP] [2020-05-30 00:44:30] complete_harvest_instance
[START] [2020-05-30 00:44:30] completed
[STOP] [2020-05-30 00:44:30] completed
[STOP] [2020-05-30 00:44:30] logged process, took 35366.57
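
The [START]/[STOP] pairs above can be turned into per-stage durations, which makes it easier to see that map_all_nodes_to_pages accounts for most of the run. The following is a minimal sketch, not part of the harvester itself: the function name stage_durations and the regular expression are assumptions based only on the line format visible in this log.

```python
import re
from datetime import datetime

# Matches "[START] [2020-05-29 15:31:32] stage_name" and the STOP form,
# ignoring the optional ", took N" suffix on the final line (assumed format).
LINE_RE = re.compile(
    r"^\[(START|STOP)\] \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.+?)(?:, took [\d.]+)?$"
)

def stage_durations(lines):
    """Pair each [START] with its matching [STOP]; return (stage, seconds) in STOP order."""
    open_stages = {}  # stage name -> stack of start times, so repeated names still pair up
    durations = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # [INFO], [WARN], [CMD] and comment lines are skipped
        kind, stamp, name = m.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "START":
            open_stages.setdefault(name, []).append(when)
        elif open_stages.get(name):
            started = open_stages[name].pop()
            durations.append((name, int((when - started).total_seconds())))
    return durations

if __name__ == "__main__":
    sample = [
        "[START] [2020-05-29 15:31:32] map_all_nodes_to_pages",
        "[STOP] [2020-05-29 23:54:46] map_all_nodes_to_pages",
        "[START] [2020-05-29 23:55:25] reindex_search",
        "[STOP] [2020-05-30 00:10:25] reindex_search",
    ]
    for stage, seconds in stage_durations(sample):
        print(f"{stage}: {seconds // 3600}h {seconds % 3600 // 60}m {seconds % 60}s")
```

A STOP line without a matching START is ignored rather than raising, on the assumption that a truncated log should still yield whatever pairs it contains.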
