Data checks and other remarks

Updated: 28-11-2023

We encourage you to perform some data checks before delivering the dataset for model training, to ensure that the data is of optimal quality for building the models. The guidelines below are based on our experience over the past years.

Do not put a lower limit on the number of observations per taxon

Do not exclude observations or images for taxa with a low number of examples. By combining your data with those of other partners, we might have sufficient examples to include them in the model.

Do not put an upper limit on the number of observations per taxon

Do not exclude observations or images for taxa with a very large number of examples. Although we limit the number of examples per taxon internally, we aim to do this in a way that keeps as much diversity as possible. If you still feel the need to limit the number of examples for a taxon for practical reasons, make sure you take a random selection of the data. It is advisable to discuss your selection strategy with us before delivering the data.
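If you do need to cap the number of examples, the random selection mentioned above could look roughly like the sketch below. It is only an illustration: the function name `sample_per_taxon`, the `(observation_id, taxon_id)` input layout, and the fixed seed are assumptions, not part of any delivery format.

```python
import random
from collections import defaultdict


def sample_per_taxon(observations, max_per_taxon, seed=0):
    """Return {taxon_id: [observation_id, ...]} with at most max_per_taxon ids per taxon.

    observations: iterable of (observation_id, taxon_id) pairs (hypothetical layout).
    """
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced

    # Group observation ids by taxon.
    by_taxon = defaultdict(list)
    for obs_id, taxon_id in observations:
        by_taxon[taxon_id].append(obs_id)

    selected = {}
    for taxon_id, obs_ids in by_taxon.items():
        if len(obs_ids) <= max_per_taxon:
            # Taxa below the cap are kept whole; no lower limit is applied.
            selected[taxon_id] = list(obs_ids)
        else:
            # Uniform random draw without replacement.
            selected[taxon_id] = rng.sample(obs_ids, max_per_taxon)
    return selected
```

A uniform random draw avoids systematically favouring, for example, the most recent or the best-rated observations, which helps keep the diversity of the original data.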

Make sure your taxonomy file is consistent and as complete as possible

Future releases of the Nature Identification API (NIA) will rely more on taxonomic information. Please make sure your taxonomy is complete and consistent. Consistent means, among other things, that two taxa that share a common ancestor have the same taxonomy above that ancestor. A future release of NIA will also give output for higher taxa. To link these to your own taxonomy, higher taxa need to be included in the provided taxa file wherever possible.
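One way to check the consistency rule is to verify that every taxon name always appears with the same chain of ancestors. The sketch below assumes each taxon is available as a tuple of names ordered from the highest rank down, and that a name identifies a taxon uniquely; both the layout and the function name `find_inconsistent_taxa` are hypothetical and would need adapting to your own taxa file.

```python
def find_inconsistent_taxa(lineages):
    """Return {taxon: set of ancestor paths} for taxa seen with more than one ancestry.

    lineages: iterable of tuples of taxon names ordered from highest rank to lowest,
    e.g. ("Animalia", "Chordata", "Aves", "Parus major") (hypothetical layout).
    """
    seen = {}
    for lineage in lineages:
        for depth, taxon in enumerate(lineage):
            # Record everything above this taxon in the current lineage.
            ancestors = tuple(lineage[:depth])
            seen.setdefault(taxon, set()).add(ancestors)

    # A consistent taxonomy has exactly one ancestor path per taxon.
    return {taxon: paths for taxon, paths in seen.items() if len(paths) > 1}
```

An empty result means no conflicts were found under these assumptions. If the same name is legitimately used for different taxa (homonyms), keying on a rank and name pair or on a taxon id instead would avoid false alarms.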

Only deliver data for the correct geographic range

Do not deliver observations or images from outside the geographic range of the model, even if the taxon itself occurs within that range. For example, when delivering data for the European Multi-source Model, only deliver data observed in Europe. The main reason is that future versions of NIA will introduce geographical modeling to improve identification results. Also check that you do not deliver taxa or observations from outside the native geographic range. For example, do not include data from overseas areas when delivering data for Europe.
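A simple first-pass check is to flag observations whose coordinates fall outside a rough bounding box. The sketch below is an assumption-heavy illustration: the function name, the `(observation_id, latitude, longitude)` layout, and the idea of using a box at all are hypothetical, and the box limits must come from the agreed range of the model rather than from this example. A box is also coarse, so flagged and unflagged records near the edges still deserve a manual look.

```python
def split_by_bounding_box(observations, min_lat, max_lat, min_lon, max_lon):
    """Split observation ids into (inside, outside) a latitude/longitude box.

    observations: iterable of (observation_id, latitude, longitude) in decimal
    degrees (hypothetical layout). The box limits are supplied by the caller.
    """
    inside, outside = [], []
    for obs_id, lat, lon in observations:
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            inside.append(obs_id)
        else:
            outside.append(obs_id)
    return inside, outside


# Usage with placeholder limits (NOT an official definition of the model range).
records = [
    ("obs-1", 52.37, 4.90),   # Amsterdam
    ("obs-2", 4.92, -52.33),  # Cayenne, French Guiana (an overseas area)
]
inside, outside = split_by_bounding_box(records, 35.0, 72.0, -25.0, 45.0)
```

Here `obs-2` ends up in `outside`, which matches the intent of excluding overseas areas from a European delivery.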