Images

Providing data

The actual images used for model training can be provided in two ways:

(Preferred): Images are accessible via HTTP and the URL of each image is provided in the Images file;
If the former option is impossible (e.g. for data security or copyright reasons), images can be supplied on a disk containing one subfolder per species (or subspecies), where the folder name is the accepted scientific name according to a valid reference (such as waarneming.nl).
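The folder-per-taxon layout for the disk option can be inspected before shipping. The following is a minimal sketch, not a Naturalis tool: the function name collect_images_by_taxon is hypothetical, and it assumes that image files sit directly inside each taxon folder and that extensions are matched case-insensitively.

```python
from pathlib import Path

# Extensions taken from the allowed image formats listed in this document.
ALLOWED_EXTENSIONS = {".jpg", ".png", ".bmp"}


def collect_images_by_taxon(root):
    """Map each subfolder name (assumed to be a scientific name) to its image files."""
    images = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # loose files at the top level are ignored
        images[folder.name] = sorted(
            p.name
            for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in ALLOWED_EXTENSIONS
        )
    return images
```

Folders that map to an empty list contain no files in an allowed format and are worth checking by hand.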

Further constraints on images include:

Allowed image formats: .jpg, .png, .bmp;
Minimum resolution: 500x500 pixels;
Images must be in colour.
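These constraints can be checked before submission. The sketch below is illustrative only: image_problems is a hypothetical helper, and it assumes that width, height and colour information have already been read with an imaging library of your choice, that extensions are compared case-insensitively, and that "500x500" means both sides must be at least 500 pixels.

```python
# Allowed formats and minimum size as listed in this document.
ALLOWED_EXTENSIONS = (".jpg", ".png", ".bmp")
MIN_WIDTH = 500
MIN_HEIGHT = 500


def image_problems(filename, width, height, is_colour):
    """Return a list of constraint violations for one image (empty if none)."""
    problems = []
    if not filename.lower().endswith(ALLOWED_EXTENSIONS):
        problems.append("unsupported format")
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        problems.append("resolution below 500x500")
    if not is_colour:
        problems.append("not a colour image")
    return problems
```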

Note

Please supply all images of a taxon that you have available, however few. Even a single image can be helpful because we pool all images per taxon.

Metadata input files

In addition to the actual images, some metadata is required to build the models. This includes, of course, the correct identification of the specimen in each image and other required and optional data, such as the taxonomic classification of the specimen.

The data supplier should provide two input files:

Images file: A .parquet file [preferred] or a .csv file (see example below) listing information for all images;
Taxa file: A .parquet file [preferred] or a .csv file (see example below) providing taxonomies for all taxa present in the images file.

Get images file example
Get taxa file example

Note

For improved readability, these examples are provided as .csv files. Please note that starting from 2025 the preferred file format is .parquet, for improved file read speeds and reduced file sizes.

File format - CSV

If you choose to provide .csv files, please consider the following.

Input files must be comma-separated (,) files. Fields with commas in their value should be enclosed with double quotes ("). See also https://tools.ietf.org/html/rfc4180 for details.
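As an illustration of the quoting rule, Python's standard csv module applies it automatically when writing. This is a minimal sketch; the column names image_id and location are hypothetical placeholders, not the required headers.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)  # default dialect: comma delimiter, minimal quoting
writer.writerow(["image_id", "location"])  # hypothetical column names
writer.writerow(["WRN:312313", "Leiden, Netherlands"])  # value contains a comma
csv_text = buffer.getvalue()

# Reading the text back restores the original field, comma included.
rows = list(csv.reader(io.StringIO(csv_text)))
```

Only the field that contains a comma is wrapped in double quotes; the other fields are written as-is.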

Details and examples for the required columns and allowed values for both the images and taxa files are provided on their respective pages. While the order of the columns is not important, please make sure that the column headers match the ones given.

Please make sure that the input images and taxa files are encoded in UTF-8.
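A quick way to check the encoding before submitting is to try decoding the whole file. The helper below is a sketch with a hypothetical name (is_utf8); it is not part of the Naturalis validation.

```python
def is_utf8(path):
    """Return True if the whole file decodes as UTF-8."""
    try:
        with open(path, encoding="utf-8") as handle:
            for _ in handle:  # stream line by line so large files are fine
                pass
    except UnicodeDecodeError:
        return False
    return True
```

When writing files from Python, passing encoding="utf-8" to open() explicitly avoids depending on the platform default.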

Validation

The images and taxa input files will be automatically validated at Naturalis before we start the model training. Validation errors will be reported to the data suppliers. We will validate against all constraints on the fields in the images and taxa files.

Qualified ID

In many source systems, IDs are mere integers. To avoid confusing e.g. observation records, taxa, and morphologies, we work with a combination of the source system (source_id) and the original ID (uid), separated by a colon (:).

A qualified ID thus has the form <source_id>:<uid>. For example:

WRN:312313 for Waarneming.nl
COL:9849028342 for Catalogue of Life.
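Building and splitting qualified IDs can be sketched as follows. The function names are hypothetical, and splitting on the first colon only is an assumption made here so that a uid containing a colon would survive a round trip.

```python
def make_qualified_id(source_id, uid):
    """Join a source_id and an original ID into <source_id>:<uid>."""
    return f"{source_id}:{uid}"


def split_qualified_id(qualified_id):
    """Split <source_id>:<uid> at the first colon (an assumption of this sketch)."""
    source_id, separator, uid = qualified_id.partition(":")
    if not separator or not source_id or not uid:
        raise ValueError(f"not a qualified ID: {qualified_id!r}")
    return source_id, uid
```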

Data suppliers can use their own source_id on the condition that it does not conflict with an existing source_id.

Existing source_id values:

`source_id`	source system
NIA	Naturalis Identification API
COL	Catalogue of Life
NSR	Nederlands soortenregister
INAT	iNaturalist
GBIF	Global Biodiversity Information Facility
WRN	Waarneming.nl / Waarnemingen.be / Observation.org
NBIC	Artsdatabanken Norway
UK_iRecord	UKSI
DK_ART	Arter.dk
APSE	SLU Sweden
FINBIF	Laji.fi Finland
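A proposed source_id can be checked against this list before use. This is a minimal sketch: is_source_id_available is a hypothetical helper, and comparing case-insensitively is a cautious assumption, not a documented rule.

```python
# source_id values copied from the table above.
EXISTING_SOURCE_IDS = {
    "NIA", "COL", "NSR", "INAT", "GBIF", "WRN",
    "NBIC", "UK_iRecord", "DK_ART", "APSE", "FINBIF",
}


def is_source_id_available(candidate):
    """Return True if candidate does not clash with an existing source_id."""
    # Assumption: treat IDs that differ only in letter case as conflicting.
    taken = {source_id.upper() for source_id in EXISTING_SOURCE_IDS}
    return candidate.upper() not in taken
```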