Harvest for wikipedia AST
Created: 31 Dec 11:03
Stage: completed
Fetched: 31 Dec 11:03
Validated: 31 Dec 11:03
Deltas Created: 31 Dec 11:03
Units Normalized: 31 Dec 11:57
Ancestry Built: 31 Dec 11:11
Nodes Matched: 31 Dec 11:56
Names Parsed: 31 Dec 11:11
New Models Stored: 31 Dec 11:08
Indexed: 31 Dec 11:57
Completed: 31 Dec 12:02
Time to Harvest: 1 minute
Harvesting Log (129 lines)
# Logfile created on 2019-12-31 11:03:46 -0500 by logger.rb/56815
[START] [2019-12-31 11:03:46] logged process
[START] [2019-12-31 11:03:46] create_harvest_instance
[STOP] [2019-12-31 11:03:46] create_harvest_instance
[START] [2019-12-31 11:03:46] fetch_files
[STOP] [2019-12-31 11:03:46] fetch_files
[START] [2019-12-31 11:03:46] validate_each_file
[STOP] [2019-12-31 11:03:55] validate_each_file
[START] [2019-12-31 11:03:55] convert_to_csv
[CMD] [2019-12-31 11:03:55] /usr/bin/sort /app/public/converted_csv/wiki_wikipedia_a_nodes_19871.csv > /app/public/converted_csv/wiki_wikipedia_a_nodes_19871.csv_sorted
[CMD] [2019-12-31 11:03:56] /usr/bin/sort /app/public/converted_csv/wiki_wikipedia_a_media_19872.csv > /app/public/converted_csv/wiki_wikipedia_a_media_19872.csv_sorted
[STOP] [2019-12-31 11:03:56] convert_to_csv
[START] [2019-12-31 11:03:56] calculate_delta
[CMD] [2019-12-31 11:03:56] echo "0a" > /app/public/diff/wiki_wikipedia_a_nodes_19871.diff
[CMD] [2019-12-31 11:03:56] tail -n +1 /app/public/converted_csv/wiki_wikipedia_a_nodes_19871.csv >> /app/public/diff/wiki_wikipedia_a_nodes_19871.diff
[CMD] [2019-12-31 11:03:56] echo "." >> /app/public/diff/wiki_wikipedia_a_nodes_19871.diff
[CMD] [2019-12-31 11:03:57] echo "0a" > /app/public/diff/wiki_wikipedia_a_media_19872.diff
[CMD] [2019-12-31 11:03:57] tail -n +1 /app/public/converted_csv/wiki_wikipedia_a_media_19872.csv >> /app/public/diff/wiki_wikipedia_a_media_19872.diff
[CMD] [2019-12-31 11:03:57] echo "." >> /app/public/diff/wiki_wikipedia_a_media_19872.diff
[STOP] [2019-12-31 11:03:57] calculate_delta
[START] [2019-12-31 11:03:57] parse_diff_and_store
[INFO] [2019-12-31 11:03:57] Loading nodes diff file into memory (true lines)...
[WARN] [2019-12-31 11:04:05] Filtered Scientific Name `Tyto capensis תנשמת עשב אפריקאית` to `Tyto capensis תנשמת עשב אפריקאית`
[INFO] [2019-12-31 11:04:09] Loading media diff file into memory (true lines)...
[INFO] [2019-12-31 11:07:30] Storing 21869 ScientificNames
[INFO] [2019-12-31 11:07:30] Processing group of 21869 in 22 groups of 1000
[INFO] [2019-12-31 11:07:39] Average Time: 0.404
[INFO] [2019-12-31 11:07:39] Total Time: 9s
[INFO] [2019-12-31 11:07:39] last 3 / first 3: 0.75
[INFO] [2019-12-31 11:07:39] Std.Dev: 0.07071067811865475; Max: 0.66
[INFO] [2019-12-31 11:07:39] Storing 21869 Identifiers
[INFO] [2019-12-31 11:07:39] Processing group of 21869 in 22 groups of 1000
[INFO] [2019-12-31 11:07:43] Average Time: 0.158
[INFO] [2019-12-31 11:07:43] Total Time: 4s
[INFO] [2019-12-31 11:07:43] last 3 / first 3: 0.98
[INFO] [2019-12-31 11:07:43] Std.Dev: 0.03162277660168379; Max: 0.22
[INFO] [2019-12-31 11:07:43] Storing 21869 Nodes
[INFO] [2019-12-31 11:07:43] Processing group of 21869 in 22 groups of 1000
[INFO] [2019-12-31 11:07:59] Average Time: 0.733
[INFO] [2019-12-31 11:07:59] Total Time: 17s
[INFO] [2019-12-31 11:07:59] last 3 / first 3: 0.1
[INFO] [2019-12-31 11:07:59] Std.Dev: 1.2; Max: 4.93
[INFO] [2019-12-31 11:07:59] Storing 30773 ArticlesSections
[INFO] [2019-12-31 11:07:59] Processing group of 30773 in 31 groups of 1000
[INFO] [2019-12-31 11:08:01] Average Time: 0.054
[INFO] [2019-12-31 11:08:01] Total Time: 2s
[INFO] [2019-12-31 11:08:01] last 3 / first 3: 1.4
[INFO] [2019-12-31 11:08:01] Std.Dev: 0.0; Max: 0.09
[INFO] [2019-12-31 11:08:01] Storing 30773 Articles
[INFO] [2019-12-31 11:08:01] Processing group of 30773 in 31 groups of 1000
[INFO] [2019-12-31 11:08:25] Average Time: 0.793
[INFO] [2019-12-31 11:08:25] Total Time: 25s
[INFO] [2019-12-31 11:08:25] last 3 / first 3: 0.23
[INFO] [2019-12-31 11:08:25] Std.Dev: 0.9316651759081692; Max: 5.8
[STOP] [2019-12-31 11:08:25] parse_diff_and_store
[START] [2019-12-31 11:08:25] resolve_keys
[INFO] [2019-12-31 11:09:25] Occurrences to nodes (through scientific_names)...
[INFO] [2019-12-31 11:09:25] traits to occurrences...
[INFO] [2019-12-31 11:09:25] traits to nodes (through occurrences)...
[INFO] [2019-12-31 11:09:25] Traits to sex term...
[INFO] [2019-12-31 11:09:25] Traits to lifestage term...
[INFO] [2019-12-31 11:09:25] MetaTraits to traits...
[INFO] [2019-12-31 11:09:25] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2019-12-31 11:09:25] Assocs to occurrences...
[INFO] [2019-12-31 11:09:25] Assocs to nodes...
[INFO] [2019-12-31 11:09:25] Assoc to sex term...
[INFO] [2019-12-31 11:09:25] Assoc to lifestage term...
[STOP] [2019-12-31 11:09:25] resolve_keys
[START] [2019-12-31 11:09:25] hold_for_later_1
[STOP] [2019-12-31 11:09:25] hold_for_later_1
[START] [2019-12-31 11:09:25] hold_for_later_2
[STOP] [2019-12-31 11:09:25] hold_for_later_2
[START] [2019-12-31 11:09:25] resolve_missing_parents
[STOP] [2019-12-31 11:09:31] resolve_missing_parents
[START] [2019-12-31 11:09:31] rebuild_nodes
[START] [2019-12-31 11:09:31] Flattener#flatten
[START] [2019-12-31 11:09:31] Flattener#study_resource
[START] [2019-12-31 11:09:31] Flattener#build_ancestry
[STOP] [2019-12-31 11:09:33] Flattener#build_ancestry
[INFO] [2019-12-31 11:09:33] 21869 ancestry keys
[START] [2019-12-31 11:09:33] build_node_ancestors
[INFO] [2019-12-31 11:09:33] old ancestors deleted.
[STOP] [2019-12-31 11:10:49] build_node_ancestors
[START] [2019-12-31 11:10:54] Flattener#propagate_ancestor_ids
[STOP] [2019-12-31 11:11:15] Flattener#propagate_ancestor_ids
[STOP] [2019-12-31 11:11:15] Flattener#flatten
[STOP] [2019-12-31 11:11:15] rebuild_nodes
[START] [2019-12-31 11:11:15] resolve_missing_media_owners
[STOP] [2019-12-31 11:11:15] resolve_missing_media_owners
[START] [2019-12-31 11:11:15] sanitize_media_verbatims
[STOP] [2019-12-31 11:11:15] sanitize_media_verbatims
[START] [2019-12-31 11:11:15] queue_downloads
[STOP] [2019-12-31 11:11:15] queue_downloads
[START] [2019-12-31 11:11:15] parse_names
[WARN] [2019-12-31 11:11:15] I see 21869 names which still need to be parsed.
[WARN] [2019-12-31 11:11:33] I see 14 names which still need to be parsed.
[STOP] [2019-12-31 11:11:35] parse_names
[START] [2019-12-31 11:11:35] denormalize_canonical_names_to_nodes
[STOP] [2019-12-31 11:11:35] denormalize_canonical_names_to_nodes
[START] [2019-12-31 11:11:35] match_nodes
[START] [2019-12-31 11:11:35] map_all_nodes_to_pages
[STOP] [2019-12-31 11:55:57] map_all_nodes_to_pages
[INFO] [2019-12-31 11:55:57] 2519 Unmatched nodes (of 21869)! That's too many to output. First 10: Biota (#62780259); Prokaryota (#62778253); Bacteria (#62770755); Negibacteria (#62784042); Flavobacteria (#62786517); Escherichia coli (#62780778); Protista (#62770779); Sarcomastigota (#62779359); Diaphoretickes (#62777024); Sar (#62773737)
[START] [2019-12-31 11:55:57] update_nodes
[STOP] [2019-12-31 11:56:05] update_nodes
[STOP] [2019-12-31 11:56:05] match_nodes
[START] [2019-12-31 11:56:05] reindex_search
[STOP] [2019-12-31 11:57:21] reindex_search
[START] [2019-12-31 11:57:21] normalize_units
[STOP] [2019-12-31 11:57:21] normalize_units
[START] [2019-12-31 11:57:21] calculate_statistics
[STOP] [2019-12-31 11:57:22] calculate_statistics
[START] [2019-12-31 11:57:22] complete_harvest_instance
[START] [2019-12-31 11:57:22] overall_tsv_creation
[INFO] [2019-12-31 11:57:22] Processing group of 21869 in 3 batches of 10000
[INFO] [2019-12-31 12:02:41] Average Time: 66.78
[INFO] [2019-12-31 12:02:41] Total Time: 5m20s
[STOP] [2019-12-31 12:02:41] overall_tsv_creation
[INFO] [2019-12-31 12:02:41] Done. Check your files:
[INFO] [2019-12-31 12:02:41] (21869 lines) /app/public/data/wiki_wikipedia_a/publish_nodes.tsv
[INFO] [2019-12-31 12:02:41] (21869 lines) /app/public/data/wiki_wikipedia_a/publish_identifiers.tsv
[INFO] [2019-12-31 12:02:42] (458159 lines) /app/public/data/wiki_wikipedia_a/publish_node_ancestors.tsv
[INFO] [2019-12-31 12:02:42] (21869 lines) /app/public/data/wiki_wikipedia_a/publish_scientific_names.tsv
[INFO] [2019-12-31 12:02:42] (468650 lines) /app/public/data/wiki_wikipedia_a/publish_articles.tsv
[INFO] [2019-12-31 12:02:42] (30773 lines) /app/public/data/wiki_wikipedia_a/publish_content_sections.tsv
[STOP] [2019-12-31 12:02:42] complete_harvest_instance
[START] [2019-12-31 12:02:42] completed
[STOP] [2019-12-31 12:02:42] completed
[STOP] [2019-12-31 12:02:42] logged process, took 3536.49
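The [START]/[STOP] pairs in the log above can be turned into per-stage durations, which makes it easier to see where the run spends its time. The following is a minimal sketch, not part of the harvester itself: the function name `stage_durations` and the regular expression are assumptions made for illustration, and it assumes each stage name is open at most once at a time, as it is in this log. The sample lines are copied from the log.

```python
import re
from datetime import datetime

# Matches lines such as "[START] [2019-12-31 11:11:35] match_nodes".
# INFO, WARN and CMD lines do not match and are skipped.
LINE_RE = re.compile(
    r"^\[(START|STOP)\] \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.+)$"
)


def stage_durations(lines):
    """Return (stage, seconds) tuples, one per START/STOP pair, in STOP order."""
    open_stages = {}  # stage name -> datetime of its START line
    durations = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        kind, stamp, name = match.groups()
        # The final STOP line has a ", took N" suffix; keep only the name.
        name = name.split(",")[0].strip()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "START":
            open_stages[name] = when
        elif name in open_stages:
            started = open_stages.pop(name)
            durations.append((name, (when - started).total_seconds()))
    return durations


sample = [
    "[START] [2019-12-31 11:11:35] match_nodes",
    "[START] [2019-12-31 11:11:35] map_all_nodes_to_pages",
    "[STOP] [2019-12-31 11:55:57] map_all_nodes_to_pages",
    "[START] [2019-12-31 11:55:57] update_nodes",
    "[STOP] [2019-12-31 11:56:05] update_nodes",
    "[STOP] [2019-12-31 11:56:05] match_nodes",
]

for stage, seconds in stage_durations(sample):
    print(f"{stage}: {seconds:.0f}s")
```

On these sample lines, map_all_nodes_to_pages accounts for 2662 of the 2670 seconds spent in match_nodes, so most of the run's wall-clock time falls in that one step.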