Harvest for Flickr BHL
Created: 03 Apr 15:29
Stage: completed
Fetched: 03 Apr 15:29
Validated: 03 Apr 15:29
Deltas Created: 03 Apr 15:29
Units Normalized: 03 Apr 15:37
Ancestry Built: 03 Apr 15:31
Nodes Matched: 03 Apr 15:37
Names Parsed: 03 Apr 15:31
New Models Stored: 03 Apr 15:31
Indexed: 03 Apr 15:37
Completed: 03 Apr 15:41
Time to Harvest: less than a minute
Harvesting Log (198 lines)
[INFO] [2023-04-03 15:29:29] Created harvest instance #4329
[STOP] [2023-04-03 15:29:29] create_harvest_instance
[START] [2023-04-03 15:29:29] fetch_files
[STOP] [2023-04-03 15:29:29] fetch_files
[START] [2023-04-03 15:29:30] validate_each_file
[INFO] [2023-04-03 15:29:30] Looping over 4 formats...
[INFO] [2023-04-03 15:29:30] ...agents (/app/public/data/flickrBHL/agent.tab)
[INFO] [2023-04-03 15:29:30] Valid: /app/public/data/flickrBHL/converted_csv/flickrBHL_agents_30280.csv (3230 lines)
[INFO] [2023-04-03 15:29:30] ...nodes (/app/public/data/flickrBHL/taxon.tab)
[INFO] [2023-04-03 15:29:30] Valid: /app/public/data/flickrBHL/converted_csv/flickrBHL_nodes_30279.csv (14324 lines)
[INFO] [2023-04-03 15:29:30] ...media (/app/public/data/flickrBHL/media_resource.tab)
[INFO] [2023-04-03 15:29:34] Valid: /app/public/data/flickrBHL/converted_csv/flickrBHL_media_30278.csv (33118 lines)
[INFO] [2023-04-03 15:29:34] ...vernaculars (/app/public/data/flickrBHL/vernacular_name.tab)
[INFO] [2023-04-03 15:29:34] Valid: /app/public/data/flickrBHL/converted_csv/flickrBHL_vernaculars_30277.csv (7723 lines)
[STOP] [2023-04-03 15:29:34] validate_each_file
[START] [2023-04-03 15:29:34] convert_to_csv
[INFO] [2023-04-03 15:29:34] Looping over 4 formats...
[INFO] [2023-04-03 15:29:34] ...agents (/app/public/data/flickrBHL/agent.tab)
[CMD] [2023-04-03 15:29:34] /usr/bin/sort /app/public/data/flickrBHL/converted_csv/flickrBHL_agents_30280.csv > /app/public/data/flickrBHL/converted_csv/flickrBHL_agents_30280.csv_sorted
[INFO] [2023-04-03 15:29:34] Converted: /app/public/data/flickrBHL/converted_csv/flickrBHL_agents_30280.csv (3230 lines)
[INFO] [2023-04-03 15:29:34] ...nodes (/app/public/data/flickrBHL/taxon.tab)
[CMD] [2023-04-03 15:29:34] /usr/bin/sort /app/public/data/flickrBHL/converted_csv/flickrBHL_nodes_30279.csv > /app/public/data/flickrBHL/converted_csv/flickrBHL_nodes_30279.csv_sorted
[INFO] [2023-04-03 15:29:34] Converted: /app/public/data/flickrBHL/converted_csv/flickrBHL_nodes_30279.csv (14324 lines)
[INFO] [2023-04-03 15:29:34] ...media (/app/public/data/flickrBHL/media_resource.tab)
[CMD] [2023-04-03 15:29:34] /usr/bin/sort /app/public/data/flickrBHL/converted_csv/flickrBHL_media_30278.csv > /app/public/data/flickrBHL/converted_csv/flickrBHL_media_30278.csv_sorted
[INFO] [2023-04-03 15:29:35] Converted: /app/public/data/flickrBHL/converted_csv/flickrBHL_media_30278.csv (33118 lines)
[INFO] [2023-04-03 15:29:35] ...vernaculars (/app/public/data/flickrBHL/vernacular_name.tab)
[CMD] [2023-04-03 15:29:35] /usr/bin/sort /app/public/data/flickrBHL/converted_csv/flickrBHL_vernaculars_30277.csv > /app/public/data/flickrBHL/converted_csv/flickrBHL_vernaculars_30277.csv_sorted
[INFO] [2023-04-03 15:29:35] Converted: /app/public/data/flickrBHL/converted_csv/flickrBHL_vernaculars_30277.csv (7723 lines)
[STOP] [2023-04-03 15:29:35] convert_to_csv
[START] [2023-04-03 15:29:35] calculate_delta
[INFO] [2023-04-03 15:29:35] Looping over 4 formats...
[INFO] [2023-04-03 15:29:35] ...agents (/app/public/data/flickrBHL/agent.tab)
[CMD] [2023-04-03 15:29:35] echo "0a" > /app/public/data/flickrBHL/diff/flickrBHL_agents_30280.diff
[CMD] [2023-04-03 15:29:35] tail -n +1 /app/public/data/flickrBHL/converted_csv/flickrBHL_agents_30280.csv >> /app/public/data/flickrBHL/diff/flickrBHL_agents_30280.diff
[CMD] [2023-04-03 15:29:35] echo "." >> /app/public/data/flickrBHL/diff/flickrBHL_agents_30280.diff
[INFO] [2023-04-03 15:29:36] Created diff: /app/public/data/flickrBHL/diff/flickrBHL_agents_30280.diff (3232 lines)
[INFO] [2023-04-03 15:29:36] ...nodes (/app/public/data/flickrBHL/taxon.tab)
[CMD] [2023-04-03 15:29:36] echo "0a" > /app/public/data/flickrBHL/diff/flickrBHL_nodes_30279.diff
[CMD] [2023-04-03 15:29:36] tail -n +1 /app/public/data/flickrBHL/converted_csv/flickrBHL_nodes_30279.csv >> /app/public/data/flickrBHL/diff/flickrBHL_nodes_30279.diff
[CMD] [2023-04-03 15:29:36] echo "." >> /app/public/data/flickrBHL/diff/flickrBHL_nodes_30279.diff
[INFO] [2023-04-03 15:29:36] Created diff: /app/public/data/flickrBHL/diff/flickrBHL_nodes_30279.diff (14326 lines)
[INFO] [2023-04-03 15:29:36] ...media (/app/public/data/flickrBHL/media_resource.tab)
[CMD] [2023-04-03 15:29:36] echo "0a" > /app/public/data/flickrBHL/diff/flickrBHL_media_30278.diff
[CMD] [2023-04-03 15:29:36] tail -n +1 /app/public/data/flickrBHL/converted_csv/flickrBHL_media_30278.csv >> /app/public/data/flickrBHL/diff/flickrBHL_media_30278.diff
[CMD] [2023-04-03 15:29:37] echo "." >> /app/public/data/flickrBHL/diff/flickrBHL_media_30278.diff
[INFO] [2023-04-03 15:29:37] Created diff: /app/public/data/flickrBHL/diff/flickrBHL_media_30278.diff (33120 lines)
[INFO] [2023-04-03 15:29:37] ...vernaculars (/app/public/data/flickrBHL/vernacular_name.tab)
[CMD] [2023-04-03 15:29:37] echo "0a" > /app/public/data/flickrBHL/diff/flickrBHL_vernaculars_30277.diff
[CMD] [2023-04-03 15:29:37] tail -n +1 /app/public/data/flickrBHL/converted_csv/flickrBHL_vernaculars_30277.csv >> /app/public/data/flickrBHL/diff/flickrBHL_vernaculars_30277.diff
[CMD] [2023-04-03 15:29:38] echo "." >> /app/public/data/flickrBHL/diff/flickrBHL_vernaculars_30277.diff
[INFO] [2023-04-03 15:29:38] Created diff: /app/public/data/flickrBHL/diff/flickrBHL_vernaculars_30277.diff (7725 lines)
[STOP] [2023-04-03 15:29:38] calculate_delta
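Note on the calculate_delta step above: each diff is the converted CSV wrapped between a leading `0a` line and a trailing `.` line, which resembles the ed-script form that `diff -e` emits for "append these lines at the top of an empty file". That wrapper explains why every diff is exactly two lines longer than its CSV (3230 vs. 3232 for agents). The Python sketch below only illustrates that shape; the function names are hypothetical and are not taken from the harvester's code, and it assumes a first harvest with no earlier version to compare against.

```python
def build_initial_diff(csv_lines):
    """Wrap every CSV line in one ed-style 'append after line 0' block.

    Hypothetical helper: mirrors the echo "0a" / tail -n +1 / echo "."
    sequence in the log, not the harvester's real implementation.
    """
    return ["0a", *csv_lines, "."]


def appended_lines(diff_lines):
    """Return the payload of a diff produced by build_initial_diff."""
    if len(diff_lines) < 2 or diff_lines[0] != "0a" or diff_lines[-1] != ".":
        raise ValueError("expected a single '0a' ... '.' append block")
    return diff_lines[1:-1]


if __name__ == "__main__":
    rows = ["id,name", "1,Example Agent", "2,Another Agent"]
    diff = build_initial_diff(rows)
    print(len(rows), len(diff))  # 3 5
```

One caveat of the ed format: a data line consisting of a single `.` would end the append block early, so a real implementation would have to guard against it.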
[START] [2023-04-03 15:29:38] parse_diff_and_store
[INFO] [2023-04-03 15:29:38] Handling diff: /app/public/data/flickrBHL/diff/flickrBHL_agents_30280.diff (3232 lines)
[INFO] [2023-04-03 15:29:38] Loading agents diff file into memory (3232 lines)...
[INFO] [2023-04-03 15:29:38] Storing 3230 Attributions (3230/3230/3232)
[INFO] [2023-04-03 15:29:39] Handling diff: /app/public/data/flickrBHL/diff/flickrBHL_nodes_30279.diff (14326 lines)
[INFO] [2023-04-03 15:29:40] Loading nodes diff file into memory (14326 lines)...
[WARN] [2023-04-03 15:29:41] Filtered Scientific Name `Xylophilus corticalis` to `Xylophilus corticalis`
[WARN] [2023-04-03 15:29:41] Filtered Scientific Name `Carcharhinus obscurus\` to `Carcharhinus obscurus`
[WARN] [2023-04-03 15:29:41] Filtered Scientific Name `Lobelia purpurascens` to `Lobelia purpurascens`
[WARN] [2023-04-03 15:29:42] Filtered Scientific Name `Tyrannus cubensis` to `Tyrannus cubensis`
[WARN] [2023-04-03 15:29:42] Filtered Scientific Name `Romalaeosoma adonina` to `Romalaeosoma adonina`
[WARN] [2023-04-03 15:29:43] Filtered Scientific Name `Dichopogon strictus\` to `Dichopogon strictus`
[WARN] [2023-04-03 15:29:43] Filtered Scientific Name `Luzuriaga marginata` to `Luzuriaga marginata`
[WARN] [2023-04-03 15:29:43] Filtered Scientific Name `Acanthophorus confinis\` to `Acanthophorus confinis`
[INFO] [2023-04-03 15:29:44] Storing 13252 ScientificNames (26504/10000/14326)
[INFO] [2023-04-03 15:29:50] Storing 13252 Nodes (26504/10000/14326)
[WARN] [2023-04-03 15:29:55] Filtered Scientific Name `Salix aurita x lapponum` to `Salix aurita x lapponum`
[WARN] [2023-04-03 15:29:56] Filtered Scientific Name `Betula nana x pubescens` to `Betula nana x pubescens`
[WARN] [2023-04-03 15:29:56] Filtered Scientific Name `Rana halecina` to `Rana halecina`
[WARN] [2023-04-03 15:29:56] SKIPPED 743 Scientific names (38536/14324/14326) with resource_pks that are already in the database!
[WARN] [2023-04-03 15:29:56] SKIPPED 743 Nodes (38536/14324/14326) with resource_pks that are already in the database!
[INFO] [2023-04-03 15:29:56] Storing 5273 ScientificNames (38536/14324/14326)
[INFO] [2023-04-03 15:29:59] Storing 5273 Nodes (38536/14324/14326)
[INFO] [2023-04-03 15:30:01] Handling diff: /app/public/data/flickrBHL/diff/flickrBHL_media_30278.diff (33120 lines)
[INFO] [2023-04-03 15:30:01] Loading media diff file into memory (33120 lines)...
[INFO] [2023-04-03 15:30:14] Storing 9999 Media (13877/10000/33120)
[INFO] [2023-04-03 15:30:19] Storing 3878 ContentAttributions (13877/10000/33120)
[INFO] [2023-04-03 15:30:33] Storing 10000 Media (26449/20000/33120)
[INFO] [2023-04-03 15:30:38] Storing 2572 ContentAttributions (26449/20000/33120)
[INFO] [2023-04-03 15:30:51] Storing 10000 Media (40026/30000/33120)
[INFO] [2023-04-03 15:30:56] Storing 3577 ContentAttributions (40026/30000/33120)
[INFO] [2023-04-03 15:31:01] Storing 3119 Media (44242/33118/33120)
[INFO] [2023-04-03 15:31:02] Storing 1097 ContentAttributions (44242/33118/33120)
[INFO] [2023-04-03 15:31:02] Handling diff: /app/public/data/flickrBHL/diff/flickrBHL_vernaculars_30277.diff (7725 lines)
[INFO] [2023-04-03 15:31:03] Loading vernaculars diff file into memory (7725 lines)...
[INFO] [2023-04-03 15:31:04] Storing 7723 Vernaculars (7723/7723/7725)
[STOP] [2023-04-03 15:31:05] parse_diff_and_store
[START] [2023-04-03 15:31:05] resolve_keys
[2023-04-03 15:31:12] Resolving downloaded urls (this is not actually downloading them yet)
[INFO] [2023-04-03 15:31:23] Occurrences to nodes (through scientific_names)...
[INFO] [2023-04-03 15:31:23] traits to occurrences...
[INFO] [2023-04-03 15:31:23] traits to nodes (through occurrences)...
[INFO] [2023-04-03 15:31:23] Traits to sex term...
[INFO] [2023-04-03 15:31:23] Traits to lifestage term...
[INFO] [2023-04-03 15:31:23] MetaTraits to traits...
[INFO] [2023-04-03 15:31:23] MetaTraits (simple, measurement row refers to parent) to traits...
[INFO] [2023-04-03 15:31:23] Assocs to occurrences...
[INFO] [2023-04-03 15:31:23] Assocs to nodes...
[INFO] [2023-04-03 15:31:23] Assoc to sex term...
[INFO] [2023-04-03 15:31:23] Assoc to lifestage term...
[INFO] [2023-04-03 15:31:23] MetaAssoc to assocs...
[STOP] [2023-04-03 15:31:24] resolve_keys
[START] [2023-04-03 15:31:24] hold_for_later_1
[STOP] [2023-04-03 15:31:24] hold_for_later_1
[START] [2023-04-03 15:31:24] hold_for_later_2
[STOP] [2023-04-03 15:31:24] hold_for_later_2
[START] [2023-04-03 15:31:24] resolve_missing_parents
[STOP] [2023-04-03 15:31:24] resolve_missing_parents
[START] [2023-04-03 15:31:24] rebuild_nodes
[START] [2023-04-03 15:31:24] Flattener#flatten
[START] [2023-04-03 15:31:24] Flattener#study_resource
[START] [2023-04-03 15:31:24] Flattener#build_ancestry
[STOP] [2023-04-03 15:31:27] Flattener#build_ancestry
[INFO] [2023-04-03 15:31:27] 18525 ancestry keys
[START] [2023-04-03 15:31:27] build_node_ancestors
[INFO] [2023-04-03 15:31:27] old ancestors deleted.
[STOP] [2023-04-03 15:31:27] build_node_ancestors
[START] [2023-04-03 15:31:28] Flattener#propagate_ancestor_ids
[STOP] [2023-04-03 15:31:28] Flattener#propagate_ancestor_ids
[STOP] [2023-04-03 15:31:28] Flattener#flatten
[STOP] [2023-04-03 15:31:28] rebuild_nodes
[START] [2023-04-03 15:31:28] resolve_missing_media_owners
[STOP] [2023-04-03 15:31:28] resolve_missing_media_owners
[START] [2023-04-03 15:31:28] sanitize_media_verbatims
[STOP] [2023-04-03 15:31:28] sanitize_media_verbatims
[START] [2023-04-03 15:31:28] queue_downloads
[STOP] [2023-04-03 15:31:28] queue_downloads
[START] [2023-04-03 15:31:28] parse_names
[WARN] [2023-04-03 15:31:28] I see 18525 names which still need to be parsed.
[INFO] [2023-04-03 15:31:29] 0% of media downloaded
[WARN] [2023-04-03 15:31:29] Names to parse: 10000 formatted: 10000 learned: 9678 parsed: 10000
[INFO] [2023-04-03 15:31:30] 0% of media downloaded
[WARN] [2023-04-03 15:31:38] Names to parse: 8525 formatted: 8525 learned: 8422 parsed: 8525
[STOP] [2023-04-03 15:31:45] parse_names
[START] [2023-04-03 15:31:45] denormalize_canonical_names_to_nodes
[STOP] [2023-04-03 15:31:46] denormalize_canonical_names_to_nodes
[START] [2023-04-03 15:31:46] match_nodes
[START] [2023-04-03 15:31:46] map_all_nodes_to_pages
[INFO] [2023-04-03 15:31:49] 0% of media downloaded
[STOP] [2023-04-03 15:37:13] map_all_nodes_to_pages
[INFO] [2023-04-03 15:37:13] 4139 Unmatched nodes (of 18525)! That's too many to output. Full list in /app/public/data/flickrBHL/unmatched_nodes.txt ; First 10: Canonical: Gymnokitta; Node#134263515; ResourceID: Gymnokitta; Canonical: Gymnokitta; Node#134263516; ResourceID: 00001b0d516661a7d1f0f2ebbae898ba; Canonical: Tantalus leucocephalus; Node#134263517; ResourceID: 0001052caeebea9ea7b62db6aea1e80b; Canonical: Motacilla rubecula; Node#134265453; ResourceID: 16fc14eb8b790e0018de88843d9e40b3; Canonical: Motacilla acutipennis; Node#134279356; ResourceID: d88d1739dd2da7434bcc0ead34e15b71; Canonical: Motacilla ciliata; Node#134280328; ResourceID: e67a88e33e1f8f1b6d3ba31ccc10c4c3; Canonical: Juliamyia typica; Node#134263528; ResourceID: 0024689b45e647668c64d8efd3e3b76f; Canonical: Nemotelus villosus; Node#134274339; ResourceID: 8e9c6b4e4de3cfc61cb290d5a2396995; Canonical: Nemotelus pantherinus; Node#134275114; ResourceID: 9a9c618644582721d6fdf601f005b9c3; Canonical: Arctous; Node#134271439; ResourceID: Ericaceae/Arctous
[START] [2023-04-03 15:37:13] update_nodes
[STOP] [2023-04-03 15:37:22] update_nodes
[STOP] [2023-04-03 15:37:22] match_nodes
[START] [2023-04-03 15:37:22] reindex_search
[STOP] [2023-04-03 15:37:38] reindex_search
[START] [2023-04-03 15:37:38] normalize_units
[STOP] [2023-04-03 15:37:38] normalize_units
[START] [2023-04-03 15:37:38] calculate_statistics
[INFO] [2023-04-03 15:37:55] Duplicate page_id count: 0
[STOP] [2023-04-03 15:37:55] calculate_statistics
[START] [2023-04-03 15:37:55] complete_harvest_instance
[START] [2023-04-03 15:37:55] overall_tsv_creation
[INFO] [2023-04-03 15:37:55] Exporting 18525 nodes as TSV in batches of 10000...
[INFO] [2023-04-03 15:37:55] Processing group of 18525 in 2 batches of 10000
[INFO] [2023-04-03 15:39:48] Processed 10000/18525 nodes
[INFO] [2023-04-03 15:40:39] 20% of media downloaded
[INFO] [2023-04-03 15:41:43] Processed 18525/18525 nodes
[INFO] [2023-04-03 15:41:43] Average Time: 105.25
[INFO] [2023-04-03 15:41:43] Total Time: 3m49s
[STOP] [2023-04-03 15:41:43] overall_tsv_creation
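Note on the export above: 18525 nodes in batches of 10000 gives 2 batches, the second one partial (8525 nodes), which matches the "Processed 10000/18525" and "Processed 18525/18525" lines. A minimal Python sketch of that batching arithmetic follows; the helper names are hypothetical and the exporter's actual batching code is not shown in this log.

```python
import math


def batch_count(total, batch_size):
    """Number of batches needed to cover `total` items (ceiling division)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(total / batch_size)


def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs, end exclusive, covering range(total)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


if __name__ == "__main__":
    print(batch_count(18525, 10000))         # 2
    print(list(batch_bounds(18525, 10000)))  # [(0, 10000), (10000, 18525)]
```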
[INFO] [2023-04-03 15:41:43] Done. Check your files:
[INFO] [2023-04-03 15:41:43] (18525 lines) /app/public/data/flickrBHL/publish_nodes.tsv
[INFO] [2023-04-03 15:41:43] (8574 lines) /app/public/data/flickrBHL/publish_node_ancestors.tsv
[INFO] [2023-04-03 15:41:44] (18525 lines) /app/public/data/flickrBHL/publish_scientific_names.tsv
[INFO] [2023-04-03 15:41:44] (33118 lines) /app/public/data/flickrBHL/publish_media.tsv
[INFO] [2023-04-03 15:41:44] (5569 lines) /app/public/data/flickrBHL/publish_image_info.tsv
[INFO] [2023-04-03 15:41:44] (7723 lines) /app/public/data/flickrBHL/publish_vernaculars.tsv
[INFO] [2023-04-03 15:41:44] (11124 lines) /app/public/data/flickrBHL/publish_attributions.tsv
[STOP] [2023-04-03 15:41:45] complete_harvest_instance
[START] [2023-04-03 15:41:45] completed
[STOP] [2023-04-03 15:41:45] completed
[STOP] [2023-04-03 15:41:45] logged process, took 735.42
[INFO] [2023-04-03 15:49:39] 40% of media downloaded
[INFO] [2023-04-03 15:53:51] 50% of media downloaded
[INFO] [2023-04-03 16:03:19] 70% of media downloaded
[INFO] [2023-04-03 16:07:28] 80% of media downloaded
[INFO] [2023-04-03 16:11:03] 90% of media downloaded
[INFO] [2023-04-03 16:14:28] 100% of media downloaded
[ERR] [2023-04-03 16:14:28][hdls] NO additional images were found to download
[INFO] [2023-04-03 16:14:28] 100% of media downloaded
[ERR] [2023-04-03 16:14:28][hdls] NO additional images were found to download
[ERR] [2023-04-03 16:14:28][hdls] NO additional images were found to download
[INFO] [2023-04-03 16:14:28] 100% of media downloaded
[ERR] [2023-04-03 16:14:28][hdls] NO additional images were found to download
[INFO] [2023-04-03 16:14:28] 100% of media downloaded
[ERR] [2023-04-03 16:14:29][hdls] NO additional images were found to download
[INFO] [2023-04-03 16:14:29] 100% of media downloaded
[ERR] [2023-04-03 16:14:29][hdls] NO additional images were found to download
[ERR] [2023-04-03 16:14:29][hdls] NO additional images were found to download
[ERR] [2023-04-03 16:14:29][hdls] NO additional images were found to download
[ERR] [2023-04-03 16:14:29][hdls] NO additional images were found to download
[ERR] [2023-04-03 16:14:29][hdls] NO additional images were found to download
[INFO] [2023-04-03 16:14:31] 100% of media downloaded
[ERR] [2023-04-03 16:14:31][hdls] NO additional images were found to download
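Note on reading this log: most stages are bracketed by `[START]` and `[STOP]` lines carrying the same stage name, so per-stage durations can be recovered from the timestamps. The Python sketch below is one way to do that, assuming the line layout seen in this log (`[LEVEL] [YYYY-MM-DD HH:MM:SS] message`, with the level occasionally missing). The function name and regular expression are illustrative, not part of the harvester. `[STOP]` lines without a matching `[START]` (such as `create_harvest_instance`) are ignored.

```python
import re
from datetime import datetime

# Optional [LEVEL], then a bracketed timestamp, then the message.
LINE_RE = re.compile(
    r"^(?:\[(?P<level>[A-Z]+)\] )?"
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s*(?P<msg>.*)$"
)


def stage_durations(lines):
    """Map stage name -> seconds elapsed between its START and STOP lines."""
    started = {}
    durations = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # header lines, blank lines, etc.
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        level, msg = match["level"], match["msg"]
        if level == "START":
            started[msg] = ts
        elif level == "STOP" and msg in started:
            durations[msg] = (ts - started.pop(msg)).total_seconds()
    return durations


if __name__ == "__main__":
    sample = [
        "[START] [2023-04-03 15:31:46] map_all_nodes_to_pages",
        "[INFO] [2023-04-03 15:31:49] 0% of media downloaded",
        "[STOP] [2023-04-03 15:37:13] map_all_nodes_to_pages",
    ]
    print(stage_durations(sample))  # {'map_all_nodes_to_pages': 327.0}
```

Applied to the lines above, this shows `map_all_nodes_to_pages` (327 seconds) and `overall_tsv_creation` (228 seconds) accounting for most of the 735.42 seconds reported for the logged process.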