Harvest for wikipedia 한국어 위키백과 (Korean Wikipedia), created 21 Jul 22:10

Stage: completed
Fetched: 21 Jul 22:10
Validated: 21 Jul 22:11
Deltas Created: 21 Jul 22:11
Units Normalized: 21 Jul 22:29
Ancestry Built: 21 Jul 22:22
Nodes Matched: 21 Jul 22:29
Names Parsed: 21 Jul 22:22
New Models Stored: 21 Jul 22:15
Indexed: 21 Jul 22:29
Completed: 21 Jul 22:36
Time to Harvest: about 25 minutes

Harvesting Log

(132 lines)
[INFO] [2022-07-21 22:10:54] Created harvest instance #4148
[STOP] [2022-07-21 22:10:54] create_harvest_instance
[START] [2022-07-21 22:10:54] fetch_files
[STOP] [2022-07-21 22:10:54] fetch_files
[START] [2022-07-21 22:10:54] validate_each_file
[INFO] [2022-07-21 22:10:54] Looping over 2 formats...
[INFO] [2022-07-21 22:10:54] ...nodes (/app/public/data/wiki_ko_tar_gz/taxon.tab)
[INFO] [2022-07-21 22:10:55] Valid: /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_nodes_29410.csv (20798 lines)
[INFO] [2022-07-21 22:10:55] ...media (/app/public/data/wiki_ko_tar_gz/media_resource.tab)
[INFO] [2022-07-21 22:11:05] Valid: /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_media_29409.csv (19104 lines)
[STOP] [2022-07-21 22:11:05] validate_each_file
[START] [2022-07-21 22:11:05] convert_to_csv
[INFO] [2022-07-21 22:11:05] Looping over 2 formats...
[INFO] [2022-07-21 22:11:05] ...nodes (/app/public/data/wiki_ko_tar_gz/taxon.tab)
[CMD] [2022-07-21 22:11:05] /usr/bin/sort /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_nodes_29410.csv > /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_nodes_29410.csv_sorted
[INFO] [2022-07-21 22:11:05] Converted: /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_nodes_29410.csv (20798 lines)
[INFO] [2022-07-21 22:11:05] ...media (/app/public/data/wiki_ko_tar_gz/media_resource.tab)
[CMD] [2022-07-21 22:11:05] /usr/bin/sort /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_media_29409.csv > /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_media_29409.csv_sorted
[INFO] [2022-07-21 22:11:07] Converted: /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_media_29409.csv (19104 lines)
[STOP] [2022-07-21 22:11:07] convert_to_csv
[START] [2022-07-21 22:11:07] calculate_delta
[INFO] [2022-07-21 22:11:07] Looping over 2 formats...
[INFO] [2022-07-21 22:11:07] ...nodes (/app/public/data/wiki_ko_tar_gz/taxon.tab)
[CMD] [2022-07-21 22:11:07] echo "0a" > /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_nodes_29410.diff
[CMD] [2022-07-21 22:11:08] tail -n +1 /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_nodes_29410.csv >> /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_nodes_29410.diff
[CMD] [2022-07-21 22:11:08] echo "." >> /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_nodes_29410.diff
[INFO] [2022-07-21 22:11:08] Created diff: /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_nodes_29410.diff (20800 lines)
[INFO] [2022-07-21 22:11:08] ...media (/app/public/data/wiki_ko_tar_gz/media_resource.tab)
[CMD] [2022-07-21 22:11:08] echo "0a" > /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_media_29409.diff
[CMD] [2022-07-21 22:11:08] tail -n +1 /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_media_29409.csv >> /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_media_29409.diff
[CMD] [2022-07-21 22:11:09] echo "." >> /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_media_29409.diff
[INFO] [2022-07-21 22:11:10] Created diff: /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_media_29409.diff (19106 lines)
[STOP] [2022-07-21 22:11:10] calculate_delta
[START] [2022-07-21 22:11:10] parse_diff_and_store
[INFO] [2022-07-21 22:11:10] Handling diff: /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_nodes_29410.diff (20800 lines)
[INFO] [2022-07-21 22:11:10] Loading nodes diff file into memory (20800 lines)...
[WARN] [2022-07-21 22:11:11] Filtered Scientific Name `Cuon alpinus fumosus/javanicus` to `Cuon alpinus fumosusjavanicus`
[INFO] [2022-07-21 22:11:13] Storing 9999 ScientificNames (29997/10000/20800)
[INFO] [2022-07-21 22:11:17] Storing 9999 Identifiers (29997/10000/20800)
[INFO] [2022-07-21 22:11:18] Storing 9999 Nodes (29997/10000/20800)
[INFO] [2022-07-21 22:11:25] Storing 10000 ScientificNames (59997/20000/20800)
[INFO] [2022-07-21 22:11:27] Storing 10000 Identifiers (59997/20000/20800)
[INFO] [2022-07-21 22:11:28] Storing 10000 Nodes (59997/20000/20800)
[INFO] [2022-07-21 22:11:32] Storing 799 ScientificNames (62394/20798/20800)
[INFO] [2022-07-21 22:11:32] Storing 799 Identifiers (62394/20798/20800)
[INFO] [2022-07-21 22:11:32] Storing 799 Nodes (62394/20798/20800)
[INFO] [2022-07-21 22:11:32] Handling diff: /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_media_29409.diff (19106 lines)
[INFO] [2022-07-21 22:11:33] Loading media diff file into memory (19106 lines)...
[INFO] [2022-07-21 22:13:30] Storing 9999 ArticlesSections (19998/10000/19106)
[INFO] [2022-07-21 22:13:30] Storing 9999 Articles (19998/10000/19106)
[INFO] [2022-07-21 22:15:26] Storing 9105 ArticlesSections (38208/19104/19106)
[INFO] [2022-07-21 22:15:27] Storing 9105 Articles (38208/19104/19106)
[STOP] [2022-07-21 22:15:33] parse_diff_and_store
[START] [2022-07-21 22:15:33] resolve_keys
[2022-07-21 22:15:41] Resolving downloaded urls (this is not actually downloading them yet)
[INFO] [2022-07-21 22:21:38] Occurrences to nodes (through scientific_names)...
[INFO] [2022-07-21 22:21:38] traits to occurrences...
[INFO] [2022-07-21 22:21:38] traits to nodes (through occurrences)...
[INFO] [2022-07-21 22:21:38] Traits to sex term...
[INFO] [2022-07-21 22:21:38] Traits to lifestage term...
[INFO] [2022-07-21 22:21:38] MetaTraits to traits...
[INFO] [2022-07-21 22:21:38] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2022-07-21 22:21:38] Assocs to occurrences...
[INFO] [2022-07-21 22:21:38] Assocs to nodes...
[INFO] [2022-07-21 22:21:38] Assoc to sex term...
[INFO] [2022-07-21 22:21:38] Assoc to lifestage term...
[INFO] [2022-07-21 22:21:38] MetaAssoc to assocs...
[STOP] [2022-07-21 22:21:38] resolve_keys
[START] [2022-07-21 22:21:38] hold_for_later_1
[STOP] [2022-07-21 22:21:38] hold_for_later_1
[START] [2022-07-21 22:21:38] hold_for_later_2
[STOP] [2022-07-21 22:21:38] hold_for_later_2
[START] [2022-07-21 22:21:38] resolve_missing_parents
[STOP] [2022-07-21 22:21:39] resolve_missing_parents
[START] [2022-07-21 22:21:39] rebuild_nodes
[START] [2022-07-21 22:21:39] Flattener#flatten
[START] [2022-07-21 22:21:39] Flattener#study_resource
[START] [2022-07-21 22:21:39] Flattener#build_ancestry
[STOP] [2022-07-21 22:21:41] Flattener#build_ancestry
[INFO] [2022-07-21 22:21:41] 20798 ancestry keys
[START] [2022-07-21 22:21:41] build_node_ancestors
[INFO] [2022-07-21 22:21:41] old ancestors deleted.
[STOP] [2022-07-21 22:22:24] build_node_ancestors
[START] [2022-07-21 22:22:30] Flattener#propagate_ancestor_ids
[STOP] [2022-07-21 22:22:43] Flattener#propagate_ancestor_ids
[STOP] [2022-07-21 22:22:43] Flattener#flatten
[STOP] [2022-07-21 22:22:43] rebuild_nodes
[START] [2022-07-21 22:22:43] resolve_missing_media_owners
[STOP] [2022-07-21 22:22:43] resolve_missing_media_owners
[START] [2022-07-21 22:22:43] sanitize_media_verbatims
[STOP] [2022-07-21 22:22:43] sanitize_media_verbatims
[START] [2022-07-21 22:22:43] queue_downloads
[STOP] [2022-07-21 22:22:43] queue_downloads
[START] [2022-07-21 22:22:43] parse_names
[WARN] [2022-07-21 22:22:43] I see 20798 names which still need to be parsed.
[WARN] [2022-07-21 22:22:44] Names to parse: 10000 formatted: 10000 learned: 9990 parsed: 10000
[WARN] [2022-07-21 22:22:51] Names to parse: 10000 formatted: 10000 learned: 9969 parsed: 10000
[WARN] [2022-07-21 22:22:57] Names to parse: 798 formatted: 798 learned: 798 parsed: 798
[STOP] [2022-07-21 22:22:59] parse_names
[START] [2022-07-21 22:22:59] denormalize_canonical_names_to_nodes
[STOP] [2022-07-21 22:22:59] denormalize_canonical_names_to_nodes
[START] [2022-07-21 22:22:59] match_nodes
[START] [2022-07-21 22:22:59] map_all_nodes_to_pages
[STOP] [2022-07-21 22:28:55] map_all_nodes_to_pages
[INFO] [2022-07-21 22:28:56] 2565 Unmatched nodes (of 20798)! That's too many to output. Full list in /app/public/data/wiki_ko_tar_gz/unmatched_nodes.txt ; First 10: Canonical: Parakaryon myojinensis; Node#118154863; ResourceID: Q22329203; Canonical: Biota; Node#118155220; ResourceID: Q2382443; Canonical: Acytota; Node#118149809; ResourceID: Q169731; Canonical: Prokaryota; Node#118152826; ResourceID: Q19081; Canonical: Proteoarchaeota; Node#118154397; ResourceID: Q21282292; Canonical: DPANN; Node#118155525; ResourceID: Q24862848; Canonical: Desulfurococcus fermentans; Node#118145720; ResourceID: Q12592886; Canonical: Desulfurococcus amylolyticus; Node#118149406; ResourceID: Q16179752; Canonical: Desulfurococcus mobilis; Node#118160879; ResourceID: Q5804547; Canonical: Thermoproteus neutrophilus; Node#118146547; ResourceID: Q13361172
[START] [2022-07-21 22:28:56] update_nodes
[STOP] [2022-07-21 22:29:05] update_nodes
[STOP] [2022-07-21 22:29:05] match_nodes
[START] [2022-07-21 22:29:05] reindex_search
[STOP] [2022-07-21 22:29:49] reindex_search
[START] [2022-07-21 22:29:49] normalize_units
[STOP] [2022-07-21 22:29:49] normalize_units
[START] [2022-07-21 22:29:49] calculate_statistics
[INFO] [2022-07-21 22:29:53] Duplicate page_id count: 0
[STOP] [2022-07-21 22:29:53] calculate_statistics
[START] [2022-07-21 22:29:53] complete_harvest_instance
[START] [2022-07-21 22:29:53] overall_tsv_creation
[INFO] [2022-07-21 22:29:53] Processing group of 20798 in 3 batches of 10000
[INFO] [2022-07-21 22:36:02] Average Time: 42.233
[INFO] [2022-07-21 22:36:02] Total Time: 6m9s
[STOP] [2022-07-21 22:36:02] overall_tsv_creation
[INFO] [2022-07-21 22:36:02] Done. Check your files:
[INFO] [2022-07-21 22:36:02] (20798 lines) /app/public/data/wiki_ko_tar_gz/publish_nodes.tsv
[INFO] [2022-07-21 22:36:02] (20798 lines) /app/public/data/wiki_ko_tar_gz/publish_identifiers.tsv
[INFO] [2022-07-21 22:36:02] (487288 lines) /app/public/data/wiki_ko_tar_gz/publish_node_ancestors.tsv
[INFO] [2022-07-21 22:36:03] (20798 lines) /app/public/data/wiki_ko_tar_gz/publish_scientific_names.tsv
[INFO] [2022-07-21 22:36:03] (354077 lines) /app/public/data/wiki_ko_tar_gz/publish_articles.tsv
[INFO] [2022-07-21 22:36:03] (19104 lines) /app/public/data/wiki_ko_tar_gz/publish_content_sections.tsv
[STOP] [2022-07-21 22:36:03] complete_harvest_instance
[START] [2022-07-21 22:36:03] completed
[STOP] [2022-07-21 22:36:03] completed
[STOP] [2022-07-21 22:36:03] logged process, took 1509.32
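
The [START] and [STOP] lines in the log above can be paired to see how long each step took. The following is a minimal sketch in Python, not part of the harvester itself: the function name `step_durations` and the regular expression are assumptions made for illustration, and the sample lines are copied from the log. It assumes each step name is started and stopped at most once at a time, and it ignores any [STOP] line that has no matching [START].

```python
import re
from datetime import datetime

# Matches lines such as "[START] [2022-07-21 22:11:07] calculate_delta".
# The pattern is an assumption based on the line layout seen in this log.
LINE_RE = re.compile(
    r"^\[(START|STOP)\] \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.+)$"
)


def step_durations(lines):
    """Return {step name: seconds} for every START/STOP pair found in lines."""
    started = {}
    durations = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # INFO, WARN, CMD and untagged lines are skipped
        kind, stamp, name = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "START":
            started[name] = when
        elif name in started:
            # A STOP with a matching START: record the elapsed seconds.
            durations[name] = (when - started.pop(name)).total_seconds()
    return durations


sample = [
    "[START] [2022-07-21 22:22:59] map_all_nodes_to_pages",
    "[STOP] [2022-07-21 22:28:55] map_all_nodes_to_pages",
    "[START] [2022-07-21 22:29:53] overall_tsv_creation",
    "[INFO] [2022-07-21 22:29:53] Processing group of 20798 in 3 batches of 10000",
    "[STOP] [2022-07-21 22:36:02] overall_tsv_creation",
]

for name, seconds in step_durations(sample).items():
    print(f"{name}: {seconds:.0f} s")
# prints:
# map_all_nodes_to_pages: 356 s
# overall_tsv_creation: 369 s
```

The 369 seconds computed for overall_tsv_creation agrees with the "Total Time: 6m9s" line that the step itself logs.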

Latest Process