Dataset



The proposed Multilingual HTR dataset for this competition consists of 20 000 lines in five different languages (English, Spanish, German, French, and Portuguese), with a balance of 4 000 lines for each language. The split is done by assigning 90% of the samples to the training set and 10% to the test set. The main statistics of the dataset are presented in Table 1.

 
Table 1. Multilingual HTR dataset statistics.

As mentioned, the data samples prepared for this competition are partitioned into a training and a test set as follows:
  1. 90% (18 000) of samples are used for training (Tr);
  2. 10% (2 000) of samples are used for testing (Ts).
The provided training data (Tr), prepared for this competition, consist of:
  1. Images of rendered training text samples.
  2. A file containing the ground-truth transcriptions.
Table 2 shows sample images of the Tr set.

English
Spanish
German
French
Portuguese

Table 2. Multilingual HTR dataset sample images and ground-truth transcriptions in English, Spanish, German, French, and Portuguese.

The goal of the competition is to obtain the lowest word error rate (WER) on the test set. Both tracks, T1 and T2, will be evaluated on the same test set (Ts). This test set will consist only of rendered images, which will be made available according to the competition schedule. In addition, the test images will be mixed with several thousand additional images, so participants will not be able to tell which images belong to the actual test set. The ground truth associated with the Ts set will be published once the competition officially concludes.
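
Since ranking depends on WER, participants may want to estimate it locally on a held-out part of Tr. The sketch below is one common way to compute it, not the organizers' official scoring script. It assumes whitespace tokenization, case-sensitive word matching, and corpus-level aggregation (total word edits divided by total reference words). None of these choices is specified above, and the function names are illustrative.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists
    (substitutions, insertions and deletions all cost 1)."""
    # prev[j] = distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Corpus-level WER: total word edits / total reference words.

    Assumption: lines are tokenized by splitting on whitespace.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        hyp_words = hyp.split()
        total_edits += word_edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words
```

For example, `corpus_wer(["the cat sat on the mat"], ["the cat sit on mat"])` counts one substitution and one deletion against six reference words. Averaging per-line WER instead would weight short lines more heavily, so the two aggregation schemes can rank systems differently.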
