Images
Providing data
The actual images used for model training can be provided in two ways:
- (Preferred): Images are accessible via HTTP and the URL of each image is provided in the Images file;
- If the former option is impossible (e.g. for data security/copyright reasons), images can be supplied on a disk containing subfolders for each species (or subspecies), where the folder name is the correct accepted scientific name according to a valid reference (such as waarneming.nl).
Further constraints on images include:
- Allowed image formats: .jpg, .png, .bmp;
- Minimum resolution: 500x500 pixels;
- Images must be in colour.
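The constraints above can be pre-checked before submission. The following is a minimal sketch, not the validation Naturalis runs: the function name `image_problems` is hypothetical, the extension match is assumed to be case-insensitive, and "500x500" is read as "both sides at least 500 pixels". Width, height, and colour information are passed in, so any image library can supply them.

```python
from pathlib import Path

# Formats and minimum size as listed above.
ALLOWED_EXTENSIONS = {".jpg", ".png", ".bmp"}
MIN_WIDTH = 500
MIN_HEIGHT = 500


def image_problems(filename, width, height, is_colour):
    """Return a list of human-readable problems; an empty list means no issue found.

    Assumptions (not confirmed by the text above): extensions are compared
    case-insensitively, and both dimensions must reach the minimum.
    """
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"format {extension or '(none)'} is not allowed")
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        problems.append(f"resolution {width}x{height} is below 500x500")
    if not is_colour:
        problems.append("image is not in colour")
    return problems


print(image_problems("beetle.JPG", 800, 600, True))
print(image_problems("beetle.gif", 400, 600, False))
```

Note that under this sketch a `.jpeg` extension would be reported, because only the three extensions listed above are accepted.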
Note
Please make sure to supply any number of images of a taxon that you have available. Even a single image can be helpful because we pool all images per taxon.
Metadata input files
In addition to the actual images, some metadata is required to build the models. This includes, of course, the correct identification of the specimen in each image and other required and optional data, such as the taxonomic classification of the specimen.
The data supplier should give two input files:
- Images file: A .parquet file [preferred] or a .csv file (see example below) listing information for all images;
- Taxa file: A .parquet file [preferred] or a .csv file (see example below) providing taxonomies for all taxa present in the images file.
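Because the taxa file must cover every taxon that appears in the images file, a quick set comparison before submission can catch gaps. This is a sketch under assumptions: the function name is hypothetical, and taxa are represented here as plain identifier strings, whereas the actual columns that link the two files are described on their respective pages.

```python
def taxa_missing_from_taxa_file(image_taxon_ids, taxa_file_ids):
    """Return the taxon identifiers used by images but absent from the taxa file.

    Both arguments are iterables of identifier strings (an assumption made
    for this illustration).
    """
    return sorted(set(image_taxon_ids) - set(taxa_file_ids))


# Illustrative identifiers only.
missing = taxa_missing_from_taxa_file(
    ["NSR:1", "NSR:2", "NSR:1"],
    ["NSR:1"],
)
print(missing)
```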
Get images file example Get taxa file example
Note
For improved readability, these examples are provided as .csv files. Please note that starting from 2025 the preferred file format is .parquet for improved file read speeds and reduced file sizes.
File format - CSV
If you choose to provide .csv files, please consider the following.
Input files must be comma-separated (,) files. Fields with commas in their value should be enclosed with double quotes ("). See also https://tools.ietf.org/html/rfc4180 for details.
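As an illustration of the quoting rule, Python's standard `csv` module encloses a field in double quotes when it contains a comma. The column names below (`example_id`, `example_note`) are hypothetical placeholders, not the required headers, and `rows_to_csv` is a helper invented for this sketch.

```python
import csv
import io

# Hypothetical column names, for illustration only; use the headers
# given on the images- and taxa file pages.
FIELDNAMES = ["example_id", "example_note"]

rows = [
    {"example_id": "WRN:312313", "example_note": "leaf, upper side"},
    {"example_id": "WRN:312314", "example_note": "no comma here"},
]


def rows_to_csv(rows, fieldnames):
    """Serialise rows as comma-separated text, quoting only fields that need it."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, FIELDNAMES)
print(csv_text)
```

Reading the text back with `csv.DictReader` returns the original value, comma included, which is a convenient round-trip check.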
Details and examples for the required columns and allowed values for both the images- and taxa files are provided on their respective pages. While the order of the columns is not important, please make sure that the column headers match the ones given.
Please make sure that the input images- and taxa files are encoded in UTF-8.
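One way to confirm the encoding before sending a file is to try decoding its raw bytes as UTF-8. This is a minimal sketch with a hypothetical helper name, `is_valid_utf8`; it only shows that the bytes are decodable, not that the text is what you intended.

```python
def is_valid_utf8(data):
    """Return True if the given bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same name encoded two ways: only the first is valid UTF-8.
print(is_valid_utf8("Örebro".encode("utf-8")))
print(is_valid_utf8("Örebro".encode("latin-1")))
```

For a file on disk, the bytes can be obtained with `pathlib.Path(path).read_bytes()`. When writing a .csv file from Python, passing `encoding="utf-8"` to `open` avoids depending on the platform default.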
Validation
The images- and taxa input files will be automatically validated at Naturalis before we start the model training. Validation errors will be reported to the data suppliers. We will validate according to all constraints on fields in the images- and taxa files.
Qualified ID
In many source systems, IDs are mere integers. In order to not confuse e.g. observation records, taxa, and morphologies, we work with a combination of source system (source_id) and original ID (uid), separated by a colon (:).
A qualified ID is thus <source_id>:<uid>. For example:
- WRN:312313 for Waarneming.nl
- COL:9849028342 for Catalogue of life.
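Building and splitting qualified IDs can be sketched as follows. The function names are hypothetical, and the split assumes that a source_id never contains a colon itself, so the first colon is taken as the separator.

```python
def make_qualified_id(source_id, uid):
    """Combine a source system code and an original ID into <source_id>:<uid>."""
    return f"{source_id}:{uid}"


def split_qualified_id(qualified_id):
    """Split <source_id>:<uid> at the first colon.

    Assumption: the source_id contains no colon. Raises ValueError if either
    part is missing.
    """
    source_id, separator, uid = qualified_id.partition(":")
    if not separator or not source_id or not uid:
        raise ValueError(f"not a qualified ID: {qualified_id!r}")
    return source_id, uid


print(make_qualified_id("WRN", 312313))
print(split_qualified_id("COL:9849028342"))
```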
Data suppliers can use their own source_id under the condition that it does not conflict with an existing source_id.
Existing source_id values:
source_id | source system |
---|---|
NIA | Naturalis Identification API |
COL | Catalogue of life |
NSR | Nederlands soortenregister |
INAT | iNaturalist |
GBIF | Global Biodiversity Information Facility |
WRN | Waarneming.nl / Waarnemingen.be / Observation.org |
NBIC | Artsdatabanken Norway |
UK_iRecord | UKSI |
DK_ART | Arter.dk |
APSE | SLU Sweden |
FINBIF | Laji.fi Finland |
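A supplier choosing a new source_id can check it against the table above. In this sketch the function name is hypothetical and the comparison is case-insensitive, which is a cautious assumption rather than a documented rule.

```python
# source_id values copied from the table above.
EXISTING_SOURCE_IDS = {
    "NIA", "COL", "NSR", "INAT", "GBIF", "WRN",
    "NBIC", "UK_iRecord", "DK_ART", "APSE", "FINBIF",
}


def conflicts_with_existing(candidate):
    """Return True if the candidate matches an existing source_id.

    Assumption: codes that differ only in letter case are treated as
    conflicting, to stay on the safe side.
    """
    existing = {source_id.casefold() for source_id in EXISTING_SOURCE_IDS}
    return candidate.casefold() in existing


print(conflicts_with_existing("WRN"))
print(conflicts_with_existing("MYORG"))
```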