Harvest for wikipedia BR
Created: 01 Nov 03:10
Stage: completed
Fetched: 01 Nov 03:10
Validated: 01 Nov 03:10
Deltas Created: 01 Nov 03:10
Units Normalized: 01 Nov 03:28
Ancestry Built: 01 Nov 03:24
Nodes Matched: 01 Nov 03:27
Names Parsed: 01 Nov 03:24
New Models Stored: 01 Nov 03:11
Indexed: 01 Nov 03:28
Completed: 01 Nov 03:33
Time to Harvest: about 23 minutes (the log reports 1378.92 seconds, from 03:10:51 to 03:33:50)
Harvesting Log (125 lines)
[INFO] [2024-11-01 03:10:51] Created harvest instance #4559
[STOP] [2024-11-01 03:10:51] create_harvest_instance
[START] [2024-11-01 03:10:51] fetch_files
[STOP] [2024-11-01 03:10:51] fetch_files
[START] [2024-11-01 03:10:51] validate_each_file
[INFO] [2024-11-01 03:10:51] Looping over 2 formats...
[INFO] [2024-11-01 03:10:51] ...nodes (/app/public/data/wiki_br_breton/taxon.tab)
[INFO] [2024-11-01 03:10:51] Valid: /app/public/data/wiki_br_breton/converted_csv/wiki_br_breton_nodes_31200.csv (9811 lines)
[INFO] [2024-11-01 03:10:51] ...media (/app/public/data/wiki_br_breton/media_resource.tab)
[INFO] [2024-11-01 03:10:53] Valid: /app/public/data/wiki_br_breton/converted_csv/wiki_br_breton_media_31201.csv (13101 lines)
[STOP] [2024-11-01 03:10:53] validate_each_file
[START] [2024-11-01 03:10:53] convert_to_csv
[INFO] [2024-11-01 03:10:53] Looping over 2 formats...
[INFO] [2024-11-01 03:10:53] ...nodes (/app/public/data/wiki_br_breton/taxon.tab)
[CMD] [2024-11-01 03:10:53] /usr/bin/sort /app/public/data/wiki_br_breton/converted_csv/wiki_br_breton_nodes_31200.csv > /app/public/data/wiki_br_breton/converted_csv/wiki_br_breton_nodes_31200.csv_sorted
[INFO] [2024-11-01 03:10:53] Converted: /app/public/data/wiki_br_breton/converted_csv/wiki_br_breton_nodes_31200.csv (9811 lines)
[INFO] [2024-11-01 03:10:53] ...media (/app/public/data/wiki_br_breton/media_resource.tab)
[CMD] [2024-11-01 03:10:53] /usr/bin/sort /app/public/data/wiki_br_breton/converted_csv/wiki_br_breton_media_31201.csv > /app/public/data/wiki_br_breton/converted_csv/wiki_br_breton_media_31201.csv_sorted
[INFO] [2024-11-01 03:10:54] Converted: /app/public/data/wiki_br_breton/converted_csv/wiki_br_breton_media_31201.csv (13101 lines)
[STOP] [2024-11-01 03:10:54] convert_to_csv
[START] [2024-11-01 03:10:54] calculate_delta
[INFO] [2024-11-01 03:10:54] Looping over 2 formats...
[INFO] [2024-11-01 03:10:54] ...nodes (/app/public/data/wiki_br_breton/taxon.tab)
[CMD] [2024-11-01 03:10:54] echo "0a" > /app/public/data/wiki_br_breton/diff/wiki_br_breton_nodes_31200.diff
[CMD] [2024-11-01 03:10:54] tail -n +1 /app/public/data/wiki_br_breton/converted_csv/wiki_br_breton_nodes_31200.csv >> /app/public/data/wiki_br_breton/diff/wiki_br_breton_nodes_31200.diff
[CMD] [2024-11-01 03:10:54] echo "." >> /app/public/data/wiki_br_breton/diff/wiki_br_breton_nodes_31200.diff
[INFO] [2024-11-01 03:10:54] Created diff: /app/public/data/wiki_br_breton/diff/wiki_br_breton_nodes_31200.diff (9813 lines)
[INFO] [2024-11-01 03:10:54] ...media (/app/public/data/wiki_br_breton/media_resource.tab)
[CMD] [2024-11-01 03:10:54] echo "0a" > /app/public/data/wiki_br_breton/diff/wiki_br_breton_media_31201.diff
[CMD] [2024-11-01 03:10:54] tail -n +1 /app/public/data/wiki_br_breton/converted_csv/wiki_br_breton_media_31201.csv >> /app/public/data/wiki_br_breton/diff/wiki_br_breton_media_31201.diff
[CMD] [2024-11-01 03:10:54] echo "." >> /app/public/data/wiki_br_breton/diff/wiki_br_breton_media_31201.diff
[INFO] [2024-11-01 03:10:54] Created diff: /app/public/data/wiki_br_breton/diff/wiki_br_breton_media_31201.diff (13103 lines)
[STOP] [2024-11-01 03:10:54] calculate_delta
[START] [2024-11-01 03:10:54] parse_diff_and_store
[INFO] [2024-11-01 03:10:54] Handling diff: /app/public/data/wiki_br_breton/diff/wiki_br_breton_nodes_31200.diff (9813 lines)
[INFO] [2024-11-01 03:10:54] Loading nodes diff file into memory (9813 lines)...
[INFO] [2024-11-01 03:10:58] Storing 9811 ScientificNames (29433/9811/9813)
[INFO] [2024-11-01 03:11:01] Storing 9811 Identifiers (29433/9811/9813)
[INFO] [2024-11-01 03:11:03] Storing 9811 Nodes (29433/9811/9813)
[INFO] [2024-11-01 03:11:08] Handling diff: /app/public/data/wiki_br_breton/diff/wiki_br_breton_media_31201.diff (13103 lines)
[INFO] [2024-11-01 03:11:08] Loading media diff file into memory (13103 lines)...
[INFO] [2024-11-01 03:11:33] Storing 9999 ArticlesSections (19998/10000/13103)
[INFO] [2024-11-01 03:11:35] Storing 9999 Articles (19998/10000/13103)
[INFO] [2024-11-01 03:11:54] Storing 3102 ArticlesSections (26202/13101/13103)
[INFO] [2024-11-01 03:11:54] Storing 3102 Articles (26202/13101/13103)
[STOP] [2024-11-01 03:11:55] parse_diff_and_store
[START] [2024-11-01 03:11:55] resolve_keys
[2024-11-01 03:13:08] Resolving downloaded urls (this is not actually downloading them yet)
[INFO] [2024-11-01 03:22:46] Occurrences to nodes (through scientific_names)...
[INFO] [2024-11-01 03:22:46] traits to occurrences...
[INFO] [2024-11-01 03:22:46] traits to nodes (through occurrences)...
[INFO] [2024-11-01 03:22:46] Traits to sex term...
[INFO] [2024-11-01 03:22:46] Traits to lifestage term...
[INFO] [2024-11-01 03:22:46] MetaTraits to traits...
[INFO] [2024-11-01 03:22:46] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2024-11-01 03:22:46] Assocs to occurrences...
[INFO] [2024-11-01 03:22:46] Assocs to nodes...
[INFO] [2024-11-01 03:22:46] Assoc to sex term...
[INFO] [2024-11-01 03:22:46] Assoc to lifestage term...
[INFO] [2024-11-01 03:22:46] MetaAssoc to assocs...
[STOP] [2024-11-01 03:22:46] resolve_keys
[START] [2024-11-01 03:22:46] hold_for_later_1
[STOP] [2024-11-01 03:22:46] hold_for_later_1
[START] [2024-11-01 03:22:46] hold_for_later_2
[STOP] [2024-11-01 03:22:46] hold_for_later_2
[START] [2024-11-01 03:22:46] resolve_missing_parents
[STOP] [2024-11-01 03:22:47] resolve_missing_parents
[START] [2024-11-01 03:22:47] rebuild_nodes
[START] [2024-11-01 03:22:47] Flattener#flatten
[START] [2024-11-01 03:22:47] Flattener#study_resource
[START] [2024-11-01 03:22:47] Flattener#build_ancestry
[STOP] [2024-11-01 03:22:48] Flattener#build_ancestry
[INFO] [2024-11-01 03:22:48] 9811 ancestry keys
[START] [2024-11-01 03:22:48] build_node_ancestors
[INFO] [2024-11-01 03:22:48] old ancestors deleted.
[STOP] [2024-11-01 03:23:52] build_node_ancestors
[START] [2024-11-01 03:23:54] Flattener#propagate_ancestor_ids
[STOP] [2024-11-01 03:24:20] Flattener#propagate_ancestor_ids
[STOP] [2024-11-01 03:24:20] Flattener#flatten
[STOP] [2024-11-01 03:24:20] rebuild_nodes
[START] [2024-11-01 03:24:20] resolve_missing_media_owners
[STOP] [2024-11-01 03:24:20] resolve_missing_media_owners
[START] [2024-11-01 03:24:21] sanitize_media_verbatims
[STOP] [2024-11-01 03:24:21] sanitize_media_verbatims
[START] [2024-11-01 03:24:21] queue_downloads
[STOP] [2024-11-01 03:24:21] queue_downloads
[START] [2024-11-01 03:24:21] parse_names
[WARN] [2024-11-01 03:24:21] I see 9811 names which still need to be parsed.
[WARN] [2024-11-01 03:24:22] Names to parse: 9811 formatted: 9811 learned: 9789 parsed: 9811
[STOP] [2024-11-01 03:24:28] parse_names
[START] [2024-11-01 03:24:28] denormalize_canonical_names_to_nodes
[STOP] [2024-11-01 03:24:29] denormalize_canonical_names_to_nodes
[START] [2024-11-01 03:24:29] match_nodes
[START] [2024-11-01 03:24:29] map_all_nodes_to_pages
[STOP] [2024-11-01 03:27:46] map_all_nodes_to_pages
[INFO] [2024-11-01 03:27:46] 811 Unmatched nodes (of 9811)! That's too many to output. Full list in /app/public/data/wiki_br_breton/unmatched_nodes.txt ; First 10: Canonical: Galapagos tortoise; Node#163973335; ResourceID: Q20014035; Canonical: Biota; Node#163974132; ResourceID: Q2382443; Canonical: Prokaryota; Node#163973063; ResourceID: Q19081; Canonical: Bacteria; Node#163970457; ResourceID: Q10876; Canonical: Pseudomonadati; Node#163970906; ResourceID: Q124372016; Canonical: Pseudomonadota; Node#163971364; ResourceID: Q12962137; Canonical: Caryophanales; Node#163975684; ResourceID: Q33014980; Canonical: Amorphea; Node#163976349; ResourceID: Q474156; Canonical: Obazoa; Node#163973872; ResourceID: Q22087764; Canonical: Filozoa; Node#163970589; ResourceID: Q1131559
[START] [2024-11-01 03:27:46] update_nodes
[STOP] [2024-11-01 03:27:51] update_nodes
[STOP] [2024-11-01 03:27:51] match_nodes
[START] [2024-11-01 03:27:51] reindex_search
[STOP] [2024-11-01 03:28:38] reindex_search
[START] [2024-11-01 03:28:38] normalize_units
[STOP] [2024-11-01 03:28:38] normalize_units
[START] [2024-11-01 03:28:38] calculate_statistics
[INFO] [2024-11-01 03:30:36] Duplicate page_id count: 0
[STOP] [2024-11-01 03:30:36] calculate_statistics
[START] [2024-11-01 03:30:36] complete_harvest_instance
[START] [2024-11-01 03:30:36] overall_tsv_creation
[INFO] [2024-11-01 03:30:36] Exporting 9811 nodes as TSV in batches of 10000...
[INFO] [2024-11-01 03:30:36] Processing group of 9811 in 1 batches of 10000
[INFO] [2024-11-01 03:33:49] Processed 9811/9811 nodes
[INFO] [2024-11-01 03:33:49] Average Time: 40.95
[INFO] [2024-11-01 03:33:49] Total Time: 3m14s
[STOP] [2024-11-01 03:33:49] overall_tsv_creation
[INFO] [2024-11-01 03:33:49] Done. Check your files:
[INFO] [2024-11-01 03:33:49] (9811 lines) /app/public/data/wiki_br_breton/publish_nodes.tsv
[INFO] [2024-11-01 03:33:49] (9811 lines) /app/public/data/wiki_br_breton/publish_identifiers.tsv
[INFO] [2024-11-01 03:33:49] (623128 lines) /app/public/data/wiki_br_breton/publish_node_ancestors.tsv
[INFO] [2024-11-01 03:33:50] (9811 lines) /app/public/data/wiki_br_breton/publish_scientific_names.tsv
[INFO] [2024-11-01 03:33:50] (103698 lines) /app/public/data/wiki_br_breton/publish_articles.tsv
[INFO] [2024-11-01 03:33:50] (13101 lines) /app/public/data/wiki_br_breton/publish_content_sections.tsv
[STOP] [2024-11-01 03:33:50] complete_harvest_instance
[START] [2024-11-01 03:33:50] completed
[STOP] [2024-11-01 03:33:50] completed
[STOP] [2024-11-01 03:33:50] logged process, took 1378.92
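The [START]/[STOP] pairs in the log above can be turned into per-stage durations, which makes it easier to see where the harvest spent its time. The following is a minimal sketch, not part of the harvester itself: the function name `stage_durations` and the regular expression are assumptions made for illustration, and the sample lines are copied from this log. STOP lines with no matching START (such as `create_harvest_instance`) are simply skipped.

```python
import re
from datetime import datetime

# Matches lines such as "[START] [2024-11-01 03:11:55] resolve_keys".
# The pattern is an assumption based on the log format shown above.
LINE_RE = re.compile(
    r"^\[(START|STOP)\] \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.+)$"
)


def stage_durations(lines):
    """Return {stage_name: seconds} for every START/STOP pair found."""
    started = {}    # stage name -> datetime of its most recent START
    durations = {}  # stage name -> elapsed seconds
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # INFO, WARN, CMD and untagged lines are ignored
        kind, stamp, name = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "START":
            started[name] = when
        elif name in started:
            durations[name] = (when - started.pop(name)).total_seconds()
    return durations


# Sample lines copied from the log above.
sample = [
    "[START] [2024-11-01 03:11:55] resolve_keys",
    "[INFO] [2024-11-01 03:22:46] MetaAssoc to assocs...",
    "[STOP] [2024-11-01 03:22:46] resolve_keys",
    "[START] [2024-11-01 03:24:29] map_all_nodes_to_pages",
    "[STOP] [2024-11-01 03:27:46] map_all_nodes_to_pages",
]

durations = stage_durations(sample)
# Print the slowest stages first.
for name, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {secs:.0f}s")
```

On these sample lines, `resolve_keys` accounts for 651 seconds and `map_all_nodes_to_pages` for 197 seconds, so together they cover well over half of the 1378.92 seconds reported for the whole process.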
Latest Process