Dataset

The Multilingual HTR dataset proposed for this competition consists of 20 000 lines in five languages (English, Spanish, German, French, and Portuguese), balanced at 4 000 lines per language. Of these samples, 90% are assigned to the training set and 10% to the test set. The main statistics of the dataset are presented in Table 1.

Table 1. Multilingual HTR dataset statistics.

As mentioned, the data samples prepared for this competition are partitioned into a training and a test set as follows:

90% (18 000) of samples are used for training (Tr);
10% (2 000) of samples are used for testing (Ts).
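
As a quick check of the figures above, the following minimal sketch recomputes the split sizes from the stated totals. The names `LANGUAGES`, `LINES_PER_LANGUAGE`, `TRAIN_FRACTION`, and `split_sizes` are illustrative only and are not part of any competition tooling. The sketch covers only the overall counts and does not assume how the split is distributed across languages.

```python
# Figures taken from the dataset description; all names are illustrative.
LANGUAGES = ["English", "Spanish", "German", "French", "Portuguese"]
LINES_PER_LANGUAGE = 4_000
TRAIN_FRACTION = 0.9


def split_sizes(total, train_fraction):
    """Return (train, test) sizes for a total count and a training fraction."""
    train = round(total * train_fraction)
    return train, total - train


total_lines = len(LANGUAGES) * LINES_PER_LANGUAGE
train_size, test_size = split_sizes(total_lines, TRAIN_FRACTION)
print(total_lines, train_size, test_size)
```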

The provided training data (Tr), prepared for this competition, consist of:

Images of rendered training text samples.
A file containing the ground-truth transcription phrases.

Table 2 shows sample images from the Tr set.

English

Spanish

German

French

Portuguese

Table 2. Multilingual HTR dataset sample images and ground-truth transcriptions in English, Spanish, German, French, and Portuguese.

The goal of the competition is to obtain the lowest word error rate (WER) on the test set. Both tracks, T1 and T2, will be evaluated on the same test set (Ts). This test set will consist only of rendered images, which will be made available according to the competition schedule. In addition, the test set will be mixed with several thousand additional images, so participants will not be able to tell which images belong to the actual test set. The ground truth associated with the Ts set will be published once the competition officially concludes.
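
For reference, WER is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognized text and the ground truth, divided by the number of words in the ground truth. The sketch below illustrates that common definition. It assumes whitespace tokenization, no text normalization, and a corpus-level score (total errors over total reference words). The organizers' exact evaluation procedure is not specified here and may differ. The function names are hypothetical.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Corpus-level WER: total word errors / total reference words.

    Assumption: lines are tokenized on whitespace, with no normalization.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += word_edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words
```

For example, `corpus_wer(["the cat sat on the mat"], ["the cat sit on mat"])` counts one substitution and one deletion against six reference words.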

AERFAI Contest 2024: Multilingual HTR