Harvest for USDA PLANTS text
Created: 26 May 10:33

Stage: completed
Fetched: 26 May 10:33
Validated: 26 May 10:33
Deltas Created: 26 May 10:33
Units Normalized: 26 May 10:35
Ancestry Built: 26 May 10:33
Nodes Matched: 26 May 10:35
Names Parsed: 26 May 10:34
New Models Stored: 26 May 10:33
Indexed: 26 May 10:35
Completed: 26 May 10:37
Time to Harvest: about 4 minutes (the log reports 234.96 seconds, 10:33:17 to 10:37:12)
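
The stage timestamps above have only minute resolution, but they are enough to bound the wall-clock span of the harvest. The following is a minimal sketch, assuming the `DD Mon HH:MM` format shown above and the year 2023 (the summary omits the year; it is taken from the harvesting log). The helpers `parse_stamp` and `elapsed_minutes` are hypothetical names for illustration, not part of the harvester.

```python
from datetime import datetime

# Stage timestamps copied from the summary above.
STAGES = {
    "Fetched": "26 May 10:33",
    "Validated": "26 May 10:33",
    "Deltas Created": "26 May 10:33",
    "Units Normalized": "26 May 10:35",
    "Ancestry Built": "26 May 10:33",
    "Nodes Matched": "26 May 10:35",
    "Names Parsed": "26 May 10:34",
    "New Models Stored": "26 May 10:33",
    "Indexed": "26 May 10:35",
    "Completed": "26 May 10:37",
}


def parse_stamp(text, year=2023):
    """Parse a 'DD Mon HH:MM' stamp; the year is an assumption (not in the summary)."""
    return datetime.strptime(f"{year} {text}", "%Y %d %b %H:%M")


def elapsed_minutes(stages, year=2023):
    """Minutes between the earliest and latest stage timestamps."""
    times = [parse_stamp(value, year) for value in stages.values()]
    return (max(times) - min(times)).total_seconds() / 60


if __name__ == "__main__":
    print(f"Span between first and last stage: {elapsed_minutes(STAGES):.0f} min")
```

Because the stamps are truncated to the minute, the result is only accurate to within about a minute either way; the per-second timestamps in the harvesting log give a tighter figure.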

Harvesting Log

(145 lines)
[INFO] [2023-05-26 10:33:17] Created harvest instance #4353
[STOP] [2023-05-26 10:33:17] create_harvest_instance
[START] [2023-05-26 10:33:17] fetch_files
[STOP] [2023-05-26 10:33:17] fetch_files
[START] [2023-05-26 10:33:17] validate_each_file
[INFO] [2023-05-26 10:33:17] Created new folder: /app/public/converted_csv
[INFO] [2023-05-26 10:33:17] Looping over 3 formats...
[INFO] [2023-05-26 10:33:17] ...agents (/app/public/data/usda_plants_text/agent.tab)
[INFO] [2023-05-26 10:33:17] Valid: /app/public/data/usda_plants_text/converted_csv/usda_plants_text_agents_30368.csv (7 lines)
[INFO] [2023-05-26 10:33:17] ...nodes (/app/public/data/usda_plants_text/taxon.tab)
[INFO] [2023-05-26 10:33:17] Valid: /app/public/data/usda_plants_text/converted_csv/usda_plants_text_nodes_30370.csv (3503 lines)
[INFO] [2023-05-26 10:33:17] ...media (/app/public/data/usda_plants_text/media_resource.tab)
[INFO] [2023-05-26 10:33:18] Valid: /app/public/data/usda_plants_text/converted_csv/usda_plants_text_media_30369.csv (3503 lines)
[STOP] [2023-05-26 10:33:18] validate_each_file
[START] [2023-05-26 10:33:18] convert_to_csv
[INFO] [2023-05-26 10:33:18] Looping over 3 formats...
[INFO] [2023-05-26 10:33:18] ...agents (/app/public/data/usda_plants_text/agent.tab)
[CMD] [2023-05-26 10:33:18] /usr/bin/sort /app/public/data/usda_plants_text/converted_csv/usda_plants_text_agents_30368.csv > /app/public/data/usda_plants_text/converted_csv/usda_plants_text_agents_30368.csv_sorted
[INFO] [2023-05-26 10:33:18] Converted: /app/public/data/usda_plants_text/converted_csv/usda_plants_text_agents_30368.csv (7 lines)
[INFO] [2023-05-26 10:33:18] ...nodes (/app/public/data/usda_plants_text/taxon.tab)
[CMD] [2023-05-26 10:33:18] /usr/bin/sort /app/public/data/usda_plants_text/converted_csv/usda_plants_text_nodes_30370.csv > /app/public/data/usda_plants_text/converted_csv/usda_plants_text_nodes_30370.csv_sorted
[INFO] [2023-05-26 10:33:18] Converted: /app/public/data/usda_plants_text/converted_csv/usda_plants_text_nodes_30370.csv (3503 lines)
[INFO] [2023-05-26 10:33:18] ...media (/app/public/data/usda_plants_text/media_resource.tab)
[CMD] [2023-05-26 10:33:18] /usr/bin/sort /app/public/data/usda_plants_text/converted_csv/usda_plants_text_media_30369.csv > /app/public/data/usda_plants_text/converted_csv/usda_plants_text_media_30369.csv_sorted
[INFO] [2023-05-26 10:33:18] Converted: /app/public/data/usda_plants_text/converted_csv/usda_plants_text_media_30369.csv (3503 lines)
[STOP] [2023-05-26 10:33:18] convert_to_csv
[START] [2023-05-26 10:33:18] calculate_delta
[INFO] [2023-05-26 10:33:18] Created diff dir: /app/public/diff
[INFO] [2023-05-26 10:33:18] Looping over 3 formats...
[INFO] [2023-05-26 10:33:18] ...agents (/app/public/data/usda_plants_text/agent.tab)
[CMD] [2023-05-26 10:33:18] echo "0a" > /app/public/data/usda_plants_text/diff/usda_plants_text_agents_30368.diff
[CMD] [2023-05-26 10:33:18] tail -n +1 /app/public/data/usda_plants_text/converted_csv/usda_plants_text_agents_30368.csv >> /app/public/data/usda_plants_text/diff/usda_plants_text_agents_30368.diff
[CMD] [2023-05-26 10:33:18] echo "." >> /app/public/data/usda_plants_text/diff/usda_plants_text_agents_30368.diff
[INFO] [2023-05-26 10:33:18] Created diff: /app/public/data/usda_plants_text/diff/usda_plants_text_agents_30368.diff (9 lines)
[INFO] [2023-05-26 10:33:18] ...nodes (/app/public/data/usda_plants_text/taxon.tab)
[CMD] [2023-05-26 10:33:18] echo "0a" > /app/public/data/usda_plants_text/diff/usda_plants_text_nodes_30370.diff
[CMD] [2023-05-26 10:33:18] tail -n +1 /app/public/data/usda_plants_text/converted_csv/usda_plants_text_nodes_30370.csv >> /app/public/data/usda_plants_text/diff/usda_plants_text_nodes_30370.diff
[CMD] [2023-05-26 10:33:18] echo "." >> /app/public/data/usda_plants_text/diff/usda_plants_text_nodes_30370.diff
[INFO] [2023-05-26 10:33:18] Created diff: /app/public/data/usda_plants_text/diff/usda_plants_text_nodes_30370.diff (3505 lines)
[INFO] [2023-05-26 10:33:18] ...media (/app/public/data/usda_plants_text/media_resource.tab)
[CMD] [2023-05-26 10:33:18] echo "0a" > /app/public/data/usda_plants_text/diff/usda_plants_text_media_30369.diff
[CMD] [2023-05-26 10:33:18] tail -n +1 /app/public/data/usda_plants_text/converted_csv/usda_plants_text_media_30369.csv >> /app/public/data/usda_plants_text/diff/usda_plants_text_media_30369.diff
[CMD] [2023-05-26 10:33:18] echo "." >> /app/public/data/usda_plants_text/diff/usda_plants_text_media_30369.diff
[INFO] [2023-05-26 10:33:18] Created diff: /app/public/data/usda_plants_text/diff/usda_plants_text_media_30369.diff (3505 lines)
[STOP] [2023-05-26 10:33:18] calculate_delta
[START] [2023-05-26 10:33:18] parse_diff_and_store
[INFO] [2023-05-26 10:33:18] Handling diff: /app/public/data/usda_plants_text/diff/usda_plants_text_agents_30368.diff (9 lines)
[INFO] [2023-05-26 10:33:18] Loading agents diff file into memory (9 lines)...
[INFO] [2023-05-26 10:33:18] Storing 7 Attributions (7/7/9)
[INFO] [2023-05-26 10:33:18] Handling diff: /app/public/data/usda_plants_text/diff/usda_plants_text_nodes_30370.diff (3505 lines)
[INFO] [2023-05-26 10:33:18] Loading nodes diff file into memory (3505 lines)...
[WARN] [2023-05-26 10:33:19] Filtered Scientific Name `Canavalia  vitiensis` to `Canavalia vitiensis`
[WARN] [2023-05-26 10:33:19] Filtered Scientific Name `Digitaria  patens` to `Digitaria patens`
[WARN] [2023-05-26 10:33:19] Filtered Scientific Name `Galactia  striata` to `Galactia striata`
[WARN] [2023-05-26 10:33:19] Filtered Scientific Name `Lupinus  elatus` to `Lupinus elatus`
[WARN] [2023-05-26 10:33:19] Filtered Scientific Name `Lupinus  obtusilobus` to `Lupinus obtusilobus`
[WARN] [2023-05-26 10:33:19] Filtered Scientific Name `Panicum  subquadriparum` to `Panicum subquadriparum`
[WARN] [2023-05-26 10:33:19] Filtered Scientific Name `Poa  kelloggii` to `Poa kelloggii`
[INFO] [2023-05-26 10:33:20] Storing 4041 ScientificNames (8082/3503/3505)
[INFO] [2023-05-26 10:33:21] Storing 4041 Nodes (8082/3503/3505)
[INFO] [2023-05-26 10:33:22] Handling diff: /app/public/data/usda_plants_text/diff/usda_plants_text_media_30369.diff (3505 lines)
[INFO] [2023-05-26 10:33:22] Loading media diff file into memory (3505 lines)...
[INFO] [2023-05-26 10:33:26] Storing 10649 ContentAttributions (17655/3503/3505)
[INFO] [2023-05-26 10:33:27] Storing 3503 ArticlesSections (17655/3503/3505)
[INFO] [2023-05-26 10:33:27] Storing 3503 Articles (17655/3503/3505)
[STOP] [2023-05-26 10:33:28] parse_diff_and_store
[START] [2023-05-26 10:33:28] resolve_keys
[2023-05-26 10:33:30] Resolving downloaded urls (this is not actually downloading them yet)
[INFO] [2023-05-26 10:33:54] Occurrences to nodes (through scientific_names)...
[INFO] [2023-05-26 10:33:54] traits to occurrences...
[INFO] [2023-05-26 10:33:54] traits to nodes (through occurrences)...
[INFO] [2023-05-26 10:33:54] Traits to sex term...
[INFO] [2023-05-26 10:33:54] Traits to lifestage term...
[INFO] [2023-05-26 10:33:54] MetaTraits to traits...
[INFO] [2023-05-26 10:33:54] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2023-05-26 10:33:54] Assocs to occurrences...
[INFO] [2023-05-26 10:33:54] Assocs to nodes...
[INFO] [2023-05-26 10:33:54] Assoc to sex term...
[INFO] [2023-05-26 10:33:54] Assoc to lifestage term...
[INFO] [2023-05-26 10:33:54] MetaAssoc to assocs...
[STOP] [2023-05-26 10:33:55] resolve_keys
[START] [2023-05-26 10:33:55] hold_for_later_1
[STOP] [2023-05-26 10:33:55] hold_for_later_1
[START] [2023-05-26 10:33:55] hold_for_later_2
[STOP] [2023-05-26 10:33:55] hold_for_later_2
[START] [2023-05-26 10:33:55] resolve_missing_parents
[STOP] [2023-05-26 10:33:55] resolve_missing_parents
[START] [2023-05-26 10:33:55] rebuild_nodes
[START] [2023-05-26 10:33:55] Flattener#flatten
[START] [2023-05-26 10:33:55] Flattener#study_resource
[START] [2023-05-26 10:33:55] Flattener#build_ancestry
[STOP] [2023-05-26 10:33:55] Flattener#build_ancestry
[INFO] [2023-05-26 10:33:55] 4041 ancestry keys
[START] [2023-05-26 10:33:55] build_node_ancestors
[INFO] [2023-05-26 10:33:55] old ancestors deleted.
[STOP] [2023-05-26 10:33:56] build_node_ancestors
[START] [2023-05-26 10:33:57] Flattener#propagate_ancestor_ids
[STOP] [2023-05-26 10:33:57] Flattener#propagate_ancestor_ids
[STOP] [2023-05-26 10:33:57] Flattener#flatten
[STOP] [2023-05-26 10:33:57] rebuild_nodes
[START] [2023-05-26 10:33:57] resolve_missing_media_owners
[STOP] [2023-05-26 10:33:57] resolve_missing_media_owners
[START] [2023-05-26 10:33:57] sanitize_media_verbatims
[STOP] [2023-05-26 10:33:57] sanitize_media_verbatims
[START] [2023-05-26 10:33:57] queue_downloads
[STOP] [2023-05-26 10:33:57] queue_downloads
[START] [2023-05-26 10:33:57] parse_names
[WARN] [2023-05-26 10:33:57] I see 4041 names which still need to be parsed.
[WARN] [2023-05-26 10:33:58] Names to parse: 4041 formatted: 4041 learned: 3995 parsed: 4041
[STOP] [2023-05-26 10:34:01] parse_names
[START] [2023-05-26 10:34:01] denormalize_canonical_names_to_nodes
[STOP] [2023-05-26 10:34:01] denormalize_canonical_names_to_nodes
[START] [2023-05-26 10:34:01] match_nodes
[START] [2023-05-26 10:34:01] map_all_nodes_to_pages
[STOP] [2023-05-26 10:35:36] map_all_nodes_to_pages
[INFO] [2023-05-26 10:35:36] 337 Unmatched nodes (of 4041)! That's too many to output. Full list in /app/public/data/usda_plants_text/unmatched_nodes.txt ; First 10: Canonical: Pinus discolor; Node#134870021; ResourceID: PIDI3_Pinus_discolor; Canonical: Taxales; Node#134870629; ResourceID: Plantae/Pinopsida/Taxales; Canonical: Acacia angustissima; Node#134866999; ResourceID: ACAN_Acacia_angustissima; Canonical: Acacia neovernicosa; Node#134867049; ResourceID: ACNE4_Acacia_neovernicosa; Canonical: Acacia vogeliana; Node#134867085; ResourceID: ACVO_Acacia_vogeliana; Canonical: Astragalus austiniae; Node#134867347; ResourceID: ASAU_Astragalus_austiniae; Canonical: Astragalus gambelianus; Node#134867434; ResourceID: ASGA_Astragalus_gambelianus; Canonical: Astragalus schmolliae; Node#134867579; ResourceID: ASSC5_Astragalus_schmolliae; Canonical: Astragalus wetherillii; Node#134867637; ResourceID: ASWE2_Astragalus_wetherillii; Canonical: Baptisia alba; Node#134867702; ResourceID: BAAL_Baptisia_alba
[START] [2023-05-26 10:35:36] update_nodes
[STOP] [2023-05-26 10:35:38] update_nodes
[STOP] [2023-05-26 10:35:38] match_nodes
[START] [2023-05-26 10:35:38] reindex_search
[STOP] [2023-05-26 10:35:42] reindex_search
[START] [2023-05-26 10:35:42] normalize_units
[STOP] [2023-05-26 10:35:42] normalize_units
[START] [2023-05-26 10:35:42] calculate_statistics
[INFO] [2023-05-26 10:36:55] Duplicate page_id count: 0
[STOP] [2023-05-26 10:36:55] calculate_statistics
[START] [2023-05-26 10:36:55] complete_harvest_instance
[START] [2023-05-26 10:36:55] overall_tsv_creation
[INFO] [2023-05-26 10:36:55] Exporting 4041 nodes as TSV in batches of 10000...
[INFO] [2023-05-26 10:36:55] Processing group of 4041 in 1 batches of 10000
[INFO] [2023-05-26 10:37:12] Processed 4041/4041 nodes
[INFO] [2023-05-26 10:37:12] Average Time: 7.34
[INFO] [2023-05-26 10:37:12] Total Time: 17s
[STOP] [2023-05-26 10:37:12] overall_tsv_creation
[INFO] [2023-05-26 10:37:12] Done. Check your files:
[INFO] [2023-05-26 10:37:12] (4041 lines) /app/public/data/usda_plants_text/publish_nodes.tsv
[INFO] [2023-05-26 10:37:12] (18026 lines) /app/public/data/usda_plants_text/publish_node_ancestors.tsv
[INFO] [2023-05-26 10:37:12] (4041 lines) /app/public/data/usda_plants_text/publish_scientific_names.tsv
[INFO] [2023-05-26 10:37:12] (3503 lines) /app/public/data/usda_plants_text/publish_articles.tsv
[INFO] [2023-05-26 10:37:12] (10649 lines) /app/public/data/usda_plants_text/publish_attributions.tsv
[INFO] [2023-05-26 10:37:12] (3503 lines) /app/public/data/usda_plants_text/publish_content_sections.tsv
[STOP] [2023-05-26 10:37:12] complete_harvest_instance
[START] [2023-05-26 10:37:12] completed
[STOP] [2023-05-26 10:37:12] completed
[STOP] [2023-05-26 10:37:12] logged process, took 234.96
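
Per-stage durations can be recovered from the `[START]`/`[STOP]` pairs in the log above. The following is a minimal sketch, assuming the `[LEVEL] [YYYY-MM-DD HH:MM:SS] message` layout seen in these lines. `stage_durations` is a hypothetical helper, and `SAMPLE` is a handful of lines copied from the log, not the whole file.

```python
import re
from datetime import datetime

# Matches only START/STOP lines; INFO, WARN and CMD lines are skipped.
LINE_RE = re.compile(
    r"^\[(START|STOP)\] \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.+)$"
)

SAMPLE = """\
[STOP] [2023-05-26 10:33:17] create_harvest_instance
[START] [2023-05-26 10:34:01] match_nodes
[START] [2023-05-26 10:34:01] map_all_nodes_to_pages
[STOP] [2023-05-26 10:35:36] map_all_nodes_to_pages
[START] [2023-05-26 10:35:36] update_nodes
[STOP] [2023-05-26 10:35:38] update_nodes
[STOP] [2023-05-26 10:35:38] match_nodes
[START] [2023-05-26 10:35:42] calculate_statistics
[INFO] [2023-05-26 10:36:55] Duplicate page_id count: 0
[STOP] [2023-05-26 10:36:55] calculate_statistics
"""


def stage_durations(log_text):
    """Return {stage name: seconds} for every START that has a matching STOP."""
    open_stages = {}  # stage name -> start time, supports nested stages
    durations = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        kind, stamp, name = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "START":
            open_stages[name] = when
        elif name in open_stages:
            # A STOP with no recorded START (e.g. create_harvest_instance) is ignored.
            durations[name] = (when - open_stages.pop(name)).total_seconds()
    return durations


if __name__ == "__main__":
    for stage, seconds in stage_durations(SAMPLE).items():
        print(f"{stage}: {seconds:.0f}s")
```

On this sample, `map_all_nodes_to_pages` accounts for 95 of the 97 seconds spent in `match_nodes`, and `calculate_statistics` takes 73 seconds, so those two stages cover most of the run.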

Latest Process