Harvest for wikipedia 한국어 위키백과 (Korean Wikipedia), created 15 Jun 07:28

Stage: completed
Fetched: 15 Jun 07:28
Validated: 15 Jun 07:28
Deltas Created: 15 Jun 07:28
Units Normalized: 15 Jun 07:49
Ancestry Built: 15 Jun 07:41
Nodes Matched: 15 Jun 07:48
Names Parsed: 15 Jun 07:41
New Models Stored: 15 Jun 07:33
Indexed: 15 Jun 07:49
Completed: 15 Jun 07:56
Time to Harvest: about 28 minutes (1699.58 seconds per the harvesting log)

Harvesting Log

(132 lines)
[INFO] [2022-06-15 07:28:29] Created harvest instance #4143
[STOP] [2022-06-15 07:28:29] create_harvest_instance
[START] [2022-06-15 07:28:29] fetch_files
[STOP] [2022-06-15 07:28:29] fetch_files
[START] [2022-06-15 07:28:29] validate_each_file
[INFO] [2022-06-15 07:28:29] Looping over 2 formats...
[INFO] [2022-06-15 07:28:29] ...nodes (/app/public/data/wiki_ko_tar_gz/taxon.tab)
[INFO] [2022-06-15 07:28:29] Valid: /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_nodes_29393.csv (20908 lines)
[INFO] [2022-06-15 07:28:29] ...media (/app/public/data/wiki_ko_tar_gz/media_resource.tab)
[INFO] [2022-06-15 07:28:34] Valid: /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_media_29392.csv (19237 lines)
[STOP] [2022-06-15 07:28:34] validate_each_file
[START] [2022-06-15 07:28:34] convert_to_csv
[INFO] [2022-06-15 07:28:34] Looping over 2 formats...
[INFO] [2022-06-15 07:28:34] ...nodes (/app/public/data/wiki_ko_tar_gz/taxon.tab)
[CMD] [2022-06-15 07:28:34] /usr/bin/sort /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_nodes_29393.csv > /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_nodes_29393.csv_sorted
[INFO] [2022-06-15 07:28:34] Converted: /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_nodes_29393.csv (20908 lines)
[INFO] [2022-06-15 07:28:34] ...media (/app/public/data/wiki_ko_tar_gz/media_resource.tab)
[CMD] [2022-06-15 07:28:34] /usr/bin/sort /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_media_29392.csv > /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_media_29392.csv_sorted
[INFO] [2022-06-15 07:28:37] Converted: /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_media_29392.csv (19237 lines)
[STOP] [2022-06-15 07:28:37] convert_to_csv
[START] [2022-06-15 07:28:37] calculate_delta
[INFO] [2022-06-15 07:28:37] Looping over 2 formats...
[INFO] [2022-06-15 07:28:37] ...nodes (/app/public/data/wiki_ko_tar_gz/taxon.tab)
[CMD] [2022-06-15 07:28:37] echo "0a" > /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_nodes_29393.diff
[CMD] [2022-06-15 07:28:37] tail -n +1 /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_nodes_29393.csv >> /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_nodes_29393.diff
[CMD] [2022-06-15 07:28:37] echo "." >> /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_nodes_29393.diff
[INFO] [2022-06-15 07:28:37] Created diff: /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_nodes_29393.diff (20910 lines)
[INFO] [2022-06-15 07:28:37] ...media (/app/public/data/wiki_ko_tar_gz/media_resource.tab)
[CMD] [2022-06-15 07:28:37] echo "0a" > /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_media_29392.diff
[CMD] [2022-06-15 07:28:37] tail -n +1 /app/public/data/wiki_ko_tar_gz/converted_csv/wiki_ko_tar_gz_media_29392.csv >> /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_media_29392.diff
[CMD] [2022-06-15 07:28:38] echo "." >> /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_media_29392.diff
[INFO] [2022-06-15 07:28:39] Created diff: /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_media_29392.diff (19239 lines)
[STOP] [2022-06-15 07:28:39] calculate_delta
[START] [2022-06-15 07:28:39] parse_diff_and_store
[INFO] [2022-06-15 07:28:39] Handling diff: /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_nodes_29393.diff (20910 lines)
[INFO] [2022-06-15 07:28:39] Loading nodes diff file into memory (20910 lines)...
[INFO] [2022-06-15 07:28:43] Storing 9999 ScientificNames (29997/10000/20910)
[INFO] [2022-06-15 07:28:46] Storing 9999 Identifiers (29997/10000/20910)
[INFO] [2022-06-15 07:28:48] Storing 9999 Nodes (29997/10000/20910)
[INFO] [2022-06-15 07:28:54] Storing 10000 ScientificNames (59997/20000/20910)
[INFO] [2022-06-15 07:28:57] Storing 10000 Identifiers (59997/20000/20910)
[INFO] [2022-06-15 07:29:04] Storing 10000 Nodes (59997/20000/20910)
[INFO] [2022-06-15 07:29:07] Storing 909 ScientificNames (62724/20908/20910)
[INFO] [2022-06-15 07:29:08] Storing 909 Identifiers (62724/20908/20910)
[INFO] [2022-06-15 07:29:08] Storing 909 Nodes (62724/20908/20910)
[INFO] [2022-06-15 07:29:08] Handling diff: /app/public/data/wiki_ko_tar_gz/diff/wiki_ko_tar_gz_media_29392.diff (19239 lines)
[INFO] [2022-06-15 07:29:08] Loading media diff file into memory (19239 lines)...
[INFO] [2022-06-15 07:31:05] Storing 9999 ArticlesSections (19998/10000/19239)
[INFO] [2022-06-15 07:31:05] Storing 9999 Articles (19998/10000/19239)
[INFO] [2022-06-15 07:33:00] Storing 9238 ArticlesSections (38474/19237/19239)
[INFO] [2022-06-15 07:33:00] Storing 9238 Articles (38474/19237/19239)
[STOP] [2022-06-15 07:33:06] parse_diff_and_store
[START] [2022-06-15 07:33:06] resolve_keys
[2022-06-15 07:33:13] Resolving downloaded urls (this is not actually downloading them yet)
[INFO] [2022-06-15 07:39:54] Occurrences to nodes (through scientific_names)...
[INFO] [2022-06-15 07:39:54] traits to occurrences...
[INFO] [2022-06-15 07:39:54] traits to nodes (through occurrences)...
[INFO] [2022-06-15 07:39:54] Traits to sex term...
[INFO] [2022-06-15 07:39:54] Traits to lifestage term...
[INFO] [2022-06-15 07:39:54] MetaTraits to traits...
[INFO] [2022-06-15 07:39:54] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2022-06-15 07:39:54] Assocs to occurrences...
[INFO] [2022-06-15 07:39:54] Assocs to nodes...
[INFO] [2022-06-15 07:39:54] Assoc to sex term...
[INFO] [2022-06-15 07:39:54] Assoc to lifestage term...
[INFO] [2022-06-15 07:39:54] MetaAssoc to assocs...
[STOP] [2022-06-15 07:39:54] resolve_keys
[START] [2022-06-15 07:39:54] hold_for_later_1
[STOP] [2022-06-15 07:39:54] hold_for_later_1
[START] [2022-06-15 07:39:54] hold_for_later_2
[STOP] [2022-06-15 07:39:54] hold_for_later_2
[START] [2022-06-15 07:39:54] resolve_missing_parents
[STOP] [2022-06-15 07:39:56] resolve_missing_parents
[START] [2022-06-15 07:39:56] rebuild_nodes
[START] [2022-06-15 07:39:56] Flattener#flatten
[START] [2022-06-15 07:39:56] Flattener#study_resource
[START] [2022-06-15 07:39:56] Flattener#build_ancestry
[STOP] [2022-06-15 07:39:58] Flattener#build_ancestry
[INFO] [2022-06-15 07:39:58] 20908 ancestry keys
[START] [2022-06-15 07:39:58] build_node_ancestors
[INFO] [2022-06-15 07:39:58] old ancestors deleted.
[STOP] [2022-06-15 07:40:43] build_node_ancestors
[START] [2022-06-15 07:40:49] Flattener#propagate_ancestor_ids
[STOP] [2022-06-15 07:41:03] Flattener#propagate_ancestor_ids
[STOP] [2022-06-15 07:41:03] Flattener#flatten
[STOP] [2022-06-15 07:41:03] rebuild_nodes
[START] [2022-06-15 07:41:03] resolve_missing_media_owners
[STOP] [2022-06-15 07:41:03] resolve_missing_media_owners
[START] [2022-06-15 07:41:03] sanitize_media_verbatims
[STOP] [2022-06-15 07:41:03] sanitize_media_verbatims
[START] [2022-06-15 07:41:03] queue_downloads
[STOP] [2022-06-15 07:41:03] queue_downloads
[START] [2022-06-15 07:41:03] parse_names
[WARN] [2022-06-15 07:41:03] I see 20908 names which still need to be parsed.
[WARN] [2022-06-15 07:41:04] Names to parse: 10000 formatted: 10000 learned: 9990 parsed: 10000
[WARN] [2022-06-15 07:41:10] Names to parse: 10000 formatted: 10000 learned: 9966 parsed: 10000
[WARN] [2022-06-15 07:41:16] Names to parse: 908 formatted: 908 learned: 908 parsed: 908
[STOP] [2022-06-15 07:41:18] parse_names
[START] [2022-06-15 07:41:18] denormalize_canonical_names_to_nodes
[STOP] [2022-06-15 07:41:18] denormalize_canonical_names_to_nodes
[START] [2022-06-15 07:41:18] match_nodes
[START] [2022-06-15 07:41:18] map_all_nodes_to_pages
[STOP] [2022-06-15 07:48:12] map_all_nodes_to_pages
[INFO] [2022-06-15 07:48:12] 2527 Unmatched nodes (of 20908)! That's too many to output. Full list in /app/public/data/wiki_ko_tar_gz/unmatched_nodes.txt ; First 10: Canonical: Parakaryon myojinensis; Node#116970481; ResourceID: Q22329203; Canonical: Biota; Node#116970836; ResourceID: Q2382443; Canonical: Acytota; Node#116965391; ResourceID: Q169731; Canonical: Prokaryota; Node#116968435; ResourceID: Q19081; Canonical: Proteoarchaeota; Node#116970013; ResourceID: Q21282292; Canonical: DPANN; Node#116971141; ResourceID: Q24862848; Canonical: Desulfurococcus fermentans; Node#116961263; ResourceID: Q12592886; Canonical: Desulfurococcus amylolyticus; Node#116964985; ResourceID: Q16179752; Canonical: Desulfurococcus mobilis; Node#116976507; ResourceID: Q5804547; Canonical: Thermoproteus neutrophilus; Node#116962095; ResourceID: Q13361172
[START] [2022-06-15 07:48:12] update_nodes
[STOP] [2022-06-15 07:48:21] update_nodes
[STOP] [2022-06-15 07:48:21] match_nodes
[START] [2022-06-15 07:48:21] reindex_search
[STOP] [2022-06-15 07:49:06] reindex_search
[START] [2022-06-15 07:49:06] normalize_units
[STOP] [2022-06-15 07:49:06] normalize_units
[START] [2022-06-15 07:49:06] calculate_statistics
[INFO] [2022-06-15 07:49:57] Duplicate page_id count: 0
[STOP] [2022-06-15 07:49:57] calculate_statistics
[START] [2022-06-15 07:49:57] complete_harvest_instance
[START] [2022-06-15 07:49:57] overall_tsv_creation
[INFO] [2022-06-15 07:49:57] Processing group of 20908 in 3 batches of 10000
[WARN] [2022-06-15 07:51:58] Encountered new license, please find a logo URL and give it a name: ko
[INFO] [2022-06-15 07:56:47] Average Time: 45.713
[INFO] [2022-06-15 07:56:47] Total Time: 6m51s
[STOP] [2022-06-15 07:56:47] overall_tsv_creation
[INFO] [2022-06-15 07:56:47] Done. Check your files:
[INFO] [2022-06-15 07:56:47] (20908 lines) /app/public/data/wiki_ko_tar_gz/publish_nodes.tsv
[INFO] [2022-06-15 07:56:47] (20908 lines) /app/public/data/wiki_ko_tar_gz/publish_identifiers.tsv
[INFO] [2022-06-15 07:56:48] (490322 lines) /app/public/data/wiki_ko_tar_gz/publish_node_ancestors.tsv
[INFO] [2022-06-15 07:56:48] (20908 lines) /app/public/data/wiki_ko_tar_gz/publish_scientific_names.tsv
[INFO] [2022-06-15 07:56:48] (349657 lines) /app/public/data/wiki_ko_tar_gz/publish_articles.tsv
[INFO] [2022-06-15 07:56:48] (19237 lines) /app/public/data/wiki_ko_tar_gz/publish_content_sections.tsv
[STOP] [2022-06-15 07:56:48] complete_harvest_instance
[START] [2022-06-15 07:56:48] completed
[STOP] [2022-06-15 07:56:48] completed
[STOP] [2022-06-15 07:56:48] logged process, took 1699.58
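
The START/STOP pairs in the log above can be turned into per-stage durations to see where the harvest spent its time. The following is a minimal sketch, not part of the harvester: the function name `stage_durations` and the regular expression are assumptions made for illustration, and it only assumes the `[LEVEL] [timestamp] message` line layout visible in this log. STOP lines with no matching START (such as the final "logged process" line) are skipped.

```python
import re
from datetime import datetime

# Matches lines such as "[START] [2022-06-15 07:33:06] resolve_keys".
# Hypothetical pattern, derived only from the layout seen in this log.
LINE_RE = re.compile(
    r"^\[(START|STOP)\] \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.+)$"
)

def stage_durations(lines):
    """Return {stage_name: elapsed_seconds} for each START/STOP pair."""
    started = {}
    durations = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # INFO/WARN/CMD lines and untagged lines are ignored
        kind, stamp, name = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "START":
            started[name] = when
        elif name in started:
            # Pair this STOP with the most recent START of the same name.
            durations[name] = (when - started.pop(name)).total_seconds()
    return durations

# A few lines copied verbatim from the log above.
sample = [
    "[START] [2022-06-15 07:33:06] resolve_keys",
    "[2022-06-15 07:33:13] Resolving downloaded urls (this is not actually downloading them yet)",
    "[STOP] [2022-06-15 07:39:54] resolve_keys",
    "[START] [2022-06-15 07:41:18] match_nodes",
    "[START] [2022-06-15 07:41:18] map_all_nodes_to_pages",
    "[STOP] [2022-06-15 07:48:12] map_all_nodes_to_pages",
    "[STOP] [2022-06-15 07:56:48] logged process, took 1699.58",
]

result = stage_durations(sample)
# List the slowest stages first.
for name, seconds in sorted(result.items(), key=lambda item: -item[1]):
    print(f"{name}: {seconds:.0f} s")
```

On these sample lines, map_all_nodes_to_pages takes 414 seconds and resolve_keys takes 408 seconds, which together account for nearly half of the 1699.58 seconds reported on the final log line.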

Latest Process