Harvest for wikipedia SZL Created 04 Aug 13:33

Stage: completed
Fetched: 04 Aug 13:33
Validated: 04 Aug 13:34
Deltas Created 04 Aug 13:34
Units Normalized: 04 Aug 15:15
Ancestry Built: 04 Aug 13:42
Nodes Matched: 04 Aug 15:10
Names Parsed: 04 Aug 13:43
New Models Stored: 04 Aug 13:40
Indexed: 04 Aug 15:15
Completed: 04 Aug 15:24
Time to Harvest: 2 minutes

Expected File Format Definitions

Harvesting Log (most recent first)

# Logfile created on 2020-01-16 10:35:08 -0500 by logger.rb/56815
[START] [2020-01-16 10:35:08] logged process
[START] [2020-01-16 10:35:08] create_harvest_instance
[STOP] [2020-01-16 10:35:09] create_harvest_instance
[START] [2020-01-16 10:35:09] fetch_files
[STOP] [2020-01-16 10:35:09] fetch_files
[START] [2020-01-16 10:35:09] validate_each_file
[STOP] [2020-01-16 10:35:25] validate_each_file
[START] [2020-01-16 10:35:25] convert_to_csv
[CMD] [2020-01-16 10:35:25] /usr/bin/sort /app/public/converted_csv/wiki_szl_silesia_nodes_19966.csv > /app/public/converted_csv/wiki_szl_silesia_nodes_19966.csv_sorted
[CMD] [2020-01-16 10:35:26] /usr/bin/sort /app/public/converted_csv/wiki_szl_silesia_media_19967.csv > /app/public/converted_csv/wiki_szl_silesia_media_19967.csv_sorted
[STOP] [2020-01-16 10:35:27] convert_to_csv
[START] [2020-01-16 10:35:27] calculate_delta
[CMD] [2020-01-16 10:35:27] echo "0a" > /app/public/diff/wiki_szl_silesia_nodes_19966.diff
[CMD] [2020-01-16 10:35:28] tail -n +1 /app/public/converted_csv/wiki_szl_silesia_nodes_19966.csv >> /app/public/diff/wiki_szl_silesia_nodes_19966.diff
[CMD] [2020-01-16 10:35:29] echo "." >> /app/public/diff/wiki_szl_silesia_nodes_19966.diff
[CMD] [2020-01-16 10:35:29] echo "0a" > /app/public/diff/wiki_szl_silesia_media_19967.diff
[CMD] [2020-01-16 10:35:30] tail -n +1 /app/public/converted_csv/wiki_szl_silesia_media_19967.csv >> /app/public/diff/wiki_szl_silesia_media_19967.diff
[CMD] [2020-01-16 10:35:31] echo "." >> /app/public/diff/wiki_szl_silesia_media_19967.diff
[STOP] [2020-01-16 10:35:32] calculate_delta
[START] [2020-01-16 10:35:32] parse_diff_and_store
[INFO] [2020-01-16 10:35:33] Loading nodes diff file into memory (true lines)...
[INFO] [2020-01-16 10:36:00] Loading media diff file into memory (true lines)...
[INFO] [2020-01-16 10:44:22] Storing 51729 ScientificNames
[INFO] [2020-01-16 10:44:22] Processing group of 51729 in 52 groups of 1000
[INFO] [2020-01-16 10:44:45] Average Time: 0.441
[INFO] [2020-01-16 10:44:45] Total Time: 24s
[INFO] [2020-01-16 10:44:45] last 3 / first 3: 0.83
[INFO] [2020-01-16 10:44:45] Std.Dev: 0.2345207879911715; Max: 1.74
[INFO] [2020-01-16 10:44:45] Storing 51729 Identifiers
[INFO] [2020-01-16 10:44:45] Processing group of 51729 in 52 groups of 1000
[INFO] [2020-01-16 10:44:52] Average Time: 0.138
[INFO] [2020-01-16 10:44:52] Total Time: 8s
[INFO] [2020-01-16 10:44:52] last 3 / first 3: 0.67
[INFO] [2020-01-16 10:44:52] Std.Dev: 0.1341640786499874; Max: 1.03
[INFO] [2020-01-16 10:44:52] Storing 51729 Nodes
[INFO] [2020-01-16 10:44:52] Processing group of 51729 in 52 groups of 1000
[INFO] [2020-01-16 10:45:16] Average Time: 0.444
[INFO] [2020-01-16 10:45:16] Total Time: 24s
[INFO] [2020-01-16 10:45:16] last 3 / first 3: 2.93
[INFO] [2020-01-16 10:45:16] Std.Dev: 0.4171330722922842; Max: 2.67
[INFO] [2020-01-16 10:45:16] Storing 84818 ArticlesSections
[INFO] [2020-01-16 10:45:16] Processing group of 84818 in 85 groups of 1000
[INFO] [2020-01-16 10:45:21] Average Time: 0.058
[INFO] [2020-01-16 10:45:21] Total Time: 6s
[INFO] [2020-01-16 10:45:21] last 3 / first 3: 1.76
[INFO] [2020-01-16 10:45:21] Std.Dev: 0.0; Max: 0.19
[INFO] [2020-01-16 10:45:21] Storing 84818 Articles
[INFO] [2020-01-16 10:45:21] Processing group of 84818 in 85 groups of 1000
[INFO] [2020-01-16 10:46:24] Average Time: 0.732
[INFO] [2020-01-16 10:46:24] Total Time: 1m3s
[INFO] [2020-01-16 10:46:24] last 3 / first 3: 1.58
[INFO] [2020-01-16 10:46:24] Std.Dev: 0.47853944456021597; Max: 3.26
[STOP] [2020-01-16 10:46:24] parse_diff_and_store
[START] [2020-01-16 10:46:24] resolve_keys
[STOP] [2020-01-16 10:46:31] resolve_keys
[ERR] [2020-01-16 10:46:31] ActiveRecord::StatementInvalid
[ERR] [2020-01-16 10:46:31] Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: UPDATE `nodes` t JOIN `nodes` o ON (t.`parent_resource_pk` = o.`resource_pk` AND t.harvest_id = 2664 AND o.harvest_id = t.harvest_id ) SET t.`parent_id` = o.`id`
[ERR] [2020-01-16 10:46:31] ./config/initializers/core_extensions.rb:50:in `clean_execute'
[ERR] [2020-01-16 10:46:31] ./config/initializers/core_extensions.rb:44:in `propagate_id'
[ERR] [2020-01-16 10:46:31] ../models/resource_harvester.rb:577:in `propagate_id'
[ERR] [2020-01-16 10:46:31] ../models/resource_harvester.rb:444:in `resolve_node_keys'
[ERR] [2020-01-16 10:46:31] ../models/resource_harvester.rb:427:in `resolve_keys'
[ERR] [2020-01-16 10:46:31] ../models/resource_harvester.rb:86:in `block (3 levels) in start'
[ERR] [2020-01-16 10:46:31] ../models/logged_process.rb:19:in `run_step'
[ERR] [2020-01-16 10:46:31] ../models/resource_harvester.rb:86:in `block (2 levels) in start'
[ERR] [2020-01-16 10:46:31] ../models/resource_harvester.rb:75:in `each_key'
[ERR] [2020-01-16 10:46:31] ../models/resource_harvester.rb:75:in `block in start'
[ERR] [2020-01-16 10:46:31] ../models/resource.rb:139:in `lock'
[ERR] [2020-01-16 10:46:31] ../models/resource_harvester.rb:72:in `start'
[ERR] [2020-01-16 10:46:31] ../models/resource.rb:223:in `harvest'
[STOP] [2020-01-16 10:46:31] logged process, took 682.64
[INFO] [2020-08-04 13:07:10] ## HARVEST: type = re_download_opendata_-harvest
[INFO] [2020-08-04 13:33:19] ## remove_type: ScientificName
[INFO] [2020-08-04 13:33:19] ++ Calling delete_all on 51729 instances...
[INFO] [2020-08-04 13:33:26] There was an error, retrying: Lockfile::StolenLockError: Lockfile::StolenLockError: DELETE FROM `scientific_names` WHERE `scientific_names`.`harvest_id` = 2664
[INFO] [2020-08-04 13:33:28] ## remove_type: ScientificName
[INFO] [2020-08-04 13:33:28] ++ Calling delete_all on 0 instances...
[INFO] [2020-08-04 13:33:28] [13:33:28.354] Removed 0 Scientificnames
[INFO] [2020-08-04 13:33:28] ## remove_type: Vernacular
[INFO] [2020-08-04 13:33:28] ++ Calling delete_all on 0 instances...
[INFO] [2020-08-04 13:33:28] [13:33:28.356] Removed 0 Vernaculars
[INFO] [2020-08-04 13:33:28] ## remove_type: Article
[INFO] [2020-08-04 13:33:28] ++ Calling delete_all on 84818 instances...
[INFO] [2020-08-04 13:33:37] [13:33:37.933] Removed 84818 Articles
[INFO] [2020-08-04 13:33:37] ## remove_type: Medium
[INFO] [2020-08-04 13:33:37] ++ Calling delete_all on 0 instances...
[INFO] [2020-08-04 13:33:37] [13:33:37.935] Removed 0 Media
[INFO] [2020-08-04 13:33:37] ## remove_type: Trait
[INFO] [2020-08-04 13:33:37] ++ Calling delete_all on 0 instances...
[INFO] [2020-08-04 13:33:37] [13:33:37.937] Removed 0 Traits
[INFO] [2020-08-04 13:33:37] ## remove_type: MetaTrait
[INFO] [2020-08-04 13:33:37] ++ Calling delete_all on 0 instances...
[INFO] [2020-08-04 13:33:37] [13:33:37.938] Removed 0 Metatraits
[INFO] [2020-08-04 13:33:37] ## remove_type: OccurrenceMetadatum
[INFO] [2020-08-04 13:33:37] ++ Calling delete_all on 0 instances...
[INFO] [2020-08-04 13:33:37] [13:33:37.941] Removed 0 Occurrencemetadata
[INFO] [2020-08-04 13:33:37] ## remove_type: Assoc
[INFO] [2020-08-04 13:33:37] ++ Calling delete_all on 0 instances...
[INFO] [2020-08-04 13:33:37] [13:33:37.943] Removed 0 Assocs
[INFO] [2020-08-04 13:33:37] ## remove_type: MetaAssoc
[INFO] [2020-08-04 13:33:37] ++ Calling delete_all on 0 instances...
[INFO] [2020-08-04 13:33:37] [13:33:37.944] Removed 0 Metaassocs
[INFO] [2020-08-04 13:33:37] ## remove_type: Identifier
[INFO] [2020-08-04 13:33:37] ++ Calling delete_all on 51729 instances...
[INFO] [2020-08-04 13:33:44] [13:33:44.212] Removed 51729 Identifiers
[INFO] [2020-08-04 13:33:44] ## remove_type: Reference
[INFO] [2020-08-04 13:33:44] ++ Calling delete_all on 0 instances...
[INFO] [2020-08-04 13:33:44] [13:33:44.214] Removed 0 References
[INFO] [2020-08-04 13:33:44] ## remove_type: Node
[INFO] [2020-08-04 13:33:44] ++ Calling delete_all on 51729 instances...
[INFO] [2020-08-04 13:33:54] [13:33:54.036] Removed 51729 Nodes
[START] [2020-08-04 13:33:54] logged process
[START] [2020-08-04 13:33:54] Creating resource from OpenData
[START] [2020-08-04 13:33:57] logged process
[START] [2020-08-04 13:33:57] Parse meta.xml file and create formats with fields
[STOP] [2020-08-04 13:33:58] Parse meta.xml file and create formats with fields
[STOP] [2020-08-04 13:33:58] Creating resource from OpenData
[START] [2020-08-04 13:33:58] logged process
[START] [2020-08-04 13:33:58] create_harvest_instance
[STOP] [2020-08-04 13:33:59] create_harvest_instance
[START] [2020-08-04 13:33:59] fetch_files
[STOP] [2020-08-04 13:33:59] fetch_files
[START] [2020-08-04 13:33:59] validate_each_file
[STOP] [2020-08-04 13:34:09] validate_each_file
[START] [2020-08-04 13:34:09] convert_to_csv
[CMD] [2020-08-04 13:34:09] /usr/bin/sort /app/public/converted_csv/wiki_szl_silesia_nodes_22450.csv > /app/public/converted_csv/wiki_szl_silesia_nodes_22450.csv_sorted
[CMD] [2020-08-04 13:34:09] /usr/bin/sort /app/public/converted_csv/wiki_szl_silesia_media_22451.csv > /app/public/converted_csv/wiki_szl_silesia_media_22451.csv_sorted
[STOP] [2020-08-04 13:34:10] convert_to_csv
[START] [2020-08-04 13:34:10] calculate_delta
[CMD] [2020-08-04 13:34:10] echo "0a" > /app/public/diff/wiki_szl_silesia_nodes_22450.diff
[CMD] [2020-08-04 13:34:10] tail -n +1 /app/public/converted_csv/wiki_szl_silesia_nodes_22450.csv >> /app/public/diff/wiki_szl_silesia_nodes_22450.diff
[CMD] [2020-08-04 13:34:10] echo "." >> /app/public/diff/wiki_szl_silesia_nodes_22450.diff
[CMD] [2020-08-04 13:34:10] echo "0a" > /app/public/diff/wiki_szl_silesia_media_22451.diff
[CMD] [2020-08-04 13:34:10] tail -n +1 /app/public/converted_csv/wiki_szl_silesia_media_22451.csv >> /app/public/diff/wiki_szl_silesia_media_22451.diff
[CMD] [2020-08-04 13:34:10] echo "." >> /app/public/diff/wiki_szl_silesia_media_22451.diff
[STOP] [2020-08-04 13:34:10] calculate_delta
[START] [2020-08-04 13:34:10] parse_diff_and_store
[INFO] [2020-08-04 13:34:10] Loading nodes diff file into memory (true lines)...
[INFO] [2020-08-04 13:34:27] Loading media diff file into memory (true lines)...
[INFO] [2020-08-04 13:38:36] Storing 51766 ScientificNames
[INFO] [2020-08-04 13:38:36] Processing group of 51766 in 52 groups of 1000
[INFO] [2020-08-04 13:38:55] Average Time: 0.351
[INFO] [2020-08-04 13:38:55] Total Time: 19s
[INFO] [2020-08-04 13:38:55] last 3 / first 3: 0.67
[INFO] [2020-08-04 13:38:55] Std.Dev: 0.044721359549995794; Max: 0.48
[INFO] [2020-08-04 13:38:55] Storing 51766 Identifiers
[INFO] [2020-08-04 13:38:55] Processing group of 51766 in 52 groups of 1000
[INFO] [2020-08-04 13:39:01] Average Time: 0.106
[INFO] [2020-08-04 13:39:01] Total Time: 6s
[INFO] [2020-08-04 13:39:01] last 3 / first 3: 0.76
[INFO] [2020-08-04 13:39:01] Std.Dev: 0.0; Max: 0.16
[INFO] [2020-08-04 13:39:01] Storing 51766 Nodes
[INFO] [2020-08-04 13:39:01] Processing group of 51766 in 52 groups of 1000
[INFO] [2020-08-04 13:39:21] Average Time: 0.393
[INFO] [2020-08-04 13:39:21] Total Time: 21s
[INFO] [2020-08-04 13:39:21] last 3 / first 3: 0.93
[INFO] [2020-08-04 13:39:21] Std.Dev: 0.37416573867739417; Max: 2.63
[INFO] [2020-08-04 13:39:21] Storing 84812 ArticlesSections
[INFO] [2020-08-04 13:39:21] Processing group of 84812 in 85 groups of 1000
[INFO] [2020-08-04 13:39:26] Average Time: 0.057
[INFO] [2020-08-04 13:39:26] Total Time: 6s
[INFO] [2020-08-04 13:39:26] last 3 / first 3: 1.0
[INFO] [2020-08-04 13:39:26] Std.Dev: 0.0; Max: 0.24
[INFO] [2020-08-04 13:39:26] Storing 84812 Articles
[INFO] [2020-08-04 13:39:26] Processing group of 84812 in 85 groups of 1000
[INFO] [2020-08-04 13:40:12] Average Time: 0.524
[INFO] [2020-08-04 13:40:12] Total Time: 46s
[INFO] [2020-08-04 13:40:12] last 3 / first 3: 0.95
[INFO] [2020-08-04 13:40:12] Std.Dev: 0.4483302354291979; Max: 2.95
[STOP] [2020-08-04 13:40:12] parse_diff_and_store
[START] [2020-08-04 13:40:12] resolve_keys
[INFO] [2020-08-04 13:41:13] Occurrences to nodes (through scientific_names)...
[INFO] [2020-08-04 13:41:13] traits to occurrences...
[INFO] [2020-08-04 13:41:13] traits to nodes (through occurrences)...
[INFO] [2020-08-04 13:41:13] Traits to sex term...
[INFO] [2020-08-04 13:41:13] Traits to lifestage term...
[INFO] [2020-08-04 13:41:13] MetaTraits to traits...
[INFO] [2020-08-04 13:41:13] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2020-08-04 13:41:13] Assocs to occurrences...
[INFO] [2020-08-04 13:41:13] Assocs to nodes...
[INFO] [2020-08-04 13:41:13] Assoc to sex term...
[INFO] [2020-08-04 13:41:13] Assoc to lifestage term...
[STOP] [2020-08-04 13:41:13] resolve_keys
[START] [2020-08-04 13:41:13] hold_for_later_1
[STOP] [2020-08-04 13:41:13] hold_for_later_1
[START] [2020-08-04 13:41:13] hold_for_later_2
[STOP] [2020-08-04 13:41:13] hold_for_later_2
[START] [2020-08-04 13:41:13] resolve_missing_parents
[STOP] [2020-08-04 13:41:17] resolve_missing_parents
[START] [2020-08-04 13:41:17] rebuild_nodes
[START] [2020-08-04 13:41:17] Flattener#flatten
[START] [2020-08-04 13:41:17] Flattener#study_resource
[START] [2020-08-04 13:41:17] Flattener#build_ancestry
[STOP] [2020-08-04 13:41:36] Flattener#build_ancestry
[INFO] [2020-08-04 13:41:36] 51766 ancestry keys
[START] [2020-08-04 13:41:36] build_node_ancestors
[INFO] [2020-08-04 13:41:36] old ancestors deleted.
[STOP] [2020-08-04 13:42:33] build_node_ancestors
[START] [2020-08-04 13:42:34] Flattener#propagate_ancestor_ids
[STOP] [2020-08-04 13:42:54] Flattener#propagate_ancestor_ids
[STOP] [2020-08-04 13:42:54] Flattener#flatten
[STOP] [2020-08-04 13:42:54] rebuild_nodes
[START] [2020-08-04 13:42:54] resolve_missing_media_owners
[STOP] [2020-08-04 13:42:54] resolve_missing_media_owners
[START] [2020-08-04 13:42:54] sanitize_media_verbatims
[STOP] [2020-08-04 13:42:54] sanitize_media_verbatims
[START] [2020-08-04 13:42:54] queue_downloads
[STOP] [2020-08-04 13:42:54] queue_downloads
[START] [2020-08-04 13:42:54] parse_names
[WARN] [2020-08-04 13:42:54] I see 51766 names which still need to be parsed.
[WARN] [2020-08-04 13:43:33] I see 8 names which still need to be parsed.
[STOP] [2020-08-04 13:43:34] parse_names
[START] [2020-08-04 13:43:34] denormalize_canonical_names_to_nodes
[STOP] [2020-08-04 13:43:35] denormalize_canonical_names_to_nodes
[START] [2020-08-04 13:43:35] match_nodes
[START] [2020-08-04 13:43:35] map_all_nodes_to_pages
[STOP] [2020-08-04 15:10:03] map_all_nodes_to_pages
[INFO] [2020-08-04 15:10:03] 740 Unmatched nodes (of 51766)! That's too many to output. First 10: Biota (#80698860); Chromalveolata (#80700054); Hacrobia (#80659223); Coccolithophyceae (#80698487); Prymnesiophycidae (#80698481); Dinozoa (#80696690); Embryophyta (#80698499); Spermatophytina (#80698982); Magnoliophyta (#80697010); Magnoliopsida (#80696750)
[START] [2020-08-04 15:10:03] update_nodes
[STOP] [2020-08-04 15:10:28] update_nodes
[STOP] [2020-08-04 15:10:28] match_nodes
[START] [2020-08-04 15:10:28] reindex_search
[STOP] [2020-08-04 15:15:45] reindex_search
[START] [2020-08-04 15:15:45] normalize_units
[STOP] [2020-08-04 15:15:45] normalize_units
[START] [2020-08-04 15:15:45] calculate_statistics
[STOP] [2020-08-04 15:15:46] calculate_statistics
[START] [2020-08-04 15:15:46] complete_harvest_instance
[START] [2020-08-04 15:15:46] overall_tsv_creation
[INFO] [2020-08-04 15:15:46] Processing group of 51766 in 6 batches of 10000
[INFO] [2020-08-04 15:24:47] Average Time: 46.762
[INFO] [2020-08-04 15:24:47] Total Time: 9m2s
[STOP] [2020-08-04 15:24:47] overall_tsv_creation
[INFO] [2020-08-04 15:24:47] Done. Check your files:
[INFO] [2020-08-04 15:24:47] (51766 lines) /app/public/data/wiki_szl_silesia/publish_nodes.tsv
[INFO] [2020-08-04 15:24:48] (51766 lines) /app/public/data/wiki_szl_silesia/publish_identifiers.tsv
[INFO] [2020-08-04 15:24:48] (519986 lines) /app/public/data/wiki_szl_silesia/publish_node_ancestors.tsv
[INFO] [2020-08-04 15:24:48] (51766 lines) /app/public/data/wiki_szl_silesia/publish_scientific_names.tsv
[INFO] [2020-08-04 15:24:48] (263348 lines) /app/public/data/wiki_szl_silesia/publish_articles.tsv
[INFO] [2020-08-04 15:24:48] (84812 lines) /app/public/data/wiki_szl_silesia/publish_content_sections.tsv
[STOP] [2020-08-04 15:24:48] complete_harvest_instance
[START] [2020-08-04 15:24:48] completed
[STOP] [2020-08-04 15:24:48] completed
[STOP] [2020-08-04 15:24:48] logged process, took 6650.5

Latest Process