Researchers Amit Moryossef and Zifan Jiang, from Bar-Ilan University and the University of Zürich, conducted a study using the SignBank dataset, a multilingual, multidomain sign language dataset, and created a new version of the dataset they call SignBank+.
The researchers, along with Mathias Müller and Sarah Ebling, had already published a paper on work conducted with the original dataset. In this follow-up study, Moryossef and Jiang describe SignBank+ as a cleaned-up version of the dataset, optimized for translation, and the premise this time is that “a meticulously curated dataset will enhance the accuracy and reliability of translation models.”
In their previous study, the researchers used SignWriting as a notation system. SignWriting is a universal system of visual symbols that represent hand signs, movements, and gestures of signed languages. Moryossef and Jiang stated that their initial findings validated the use of this type of intermediate text representation for signed language machine translation, and used it again with SignBank+.
GPT and Signed-to-Spoken Language MT
The researchers set out to simplify the translation process and improve model training and implementation. Whereas their previous research (using the SignBank dataset) focused on machine translation (MT) between signed and spoken languages in both directions, their subsequent work focused on signed-to-spoken language MT, using SignBank+ and SignWriting as the intermediate step to produce text translation.
The researchers collected and annotated fingerspelling for letters and numbers, removed inconsistencies and errors from the original dataset, and expanded it by adding variations to multiple terms using a sample of 22 sign languages.
The dataset cleaning involved using ChatGPT. For this phase, they defined a “pseudo function” that took the number of signs, a language code, and the existing terms, and returned a cleaned, parallel version of the terms. They verified this method by running the gpt-3.5-turbo-0613 model on manually cleaned samples and comparing its output to the manually cleaned data.
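A pseudo function of this kind is typically realized as a structured chat prompt. The sketch below is illustrative only: the function name, prompt wording, and JSON layout are assumptions, and only the model name (gpt-3.5-turbo-0613) comes from the study; the actual API call is omitted.

```python
import json

def build_cleaning_prompt(num_signs: int, language_code: str, terms: list[str]) -> list[dict]:
    """Build a chat request asking a model to return a cleaned, parallel
    version of a dictionary entry's terms (hypothetical prompt wording)."""
    payload = json.dumps(
        {"num_signs": num_signs, "language": language_code, "terms": terms},
        ensure_ascii=False,
    )
    return [
        {
            "role": "system",
            "content": (
                "You clean sign language dictionary entries. Given an entry's "
                "sign count, language code, and terms, return only the terms "
                "that are consistent, parallel translations, as a JSON list."
            ),
        },
        {"role": "user", "content": payload},
    ]

# These messages would then be sent to gpt-3.5-turbo-0613 via the
# chat completions API (call not shown).
messages = build_cleaning_prompt(1, "de", ["Haus", "house", "Haus (Gebäude)"])
```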
To evaluate the quality of their dataset cleaning and expansion work, they tested its effect on multilingual MT. To that end, they trained MT models, including the Bergamot model (which they praised as “the only one that includes a realistic training pipeline for machine translation deployment”) using the original data, the cleaned data, and the expanded data.
“🌟 Just published ‘SignBank+: Multilingual Sign Language Translation Dataset’ (w/ @Jiang_Zifan). Just $500 in dirty dataset cleanup with ChatGPT leads to a 20-30 BLEU score boost, no changes to the models required. Check out the paper! https://t.co/gzY78fCgXi” (Amit Moryossef, @amitmoryossef, September 22, 2023)
SignBank+ Renders Improved Text Target Results
For their test set, the researchers used manually annotated data, including tags identifying the source and target languages. Testing conditions spanned a variety of test frameworks, pre-trained (non-optimized) models, and multilingual translation scenarios, using the first 3,000 entries.
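In multilingual MT, such language tags are commonly implemented by prefixing each source sequence with source- and target-language tokens. A minimal sketch of that convention follows; the tag format and the sample SignWriting string are illustrative, not the paper's exact scheme.

```python
def tag_entry(source_lang: str, target_lang: str, signwriting: str) -> str:
    """Prefix a source sequence with source/target language tags, a common
    multilingual-MT convention (tag format here is illustrative)."""
    return f"${source_lang} ${target_lang} {signwriting}"

# Example: translating a SignWriting sequence into English.
tagged = tag_entry("sgn", "en", "M518x529S14c20481x471S27106503x489")
```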
The researchers’ machine translation experiments revealed that, under the conditions they tested, performance was consistently better with the cleaned dataset than with the original data. They argue that SignBank+ is thus more useful for signed-to-spoken language MT.
The researchers also noted that the term expansion added noise by introducing multiple targets for the same source. On the other hand, they added that this expansion noise also serves a purpose by “introducing more overlaps between identical translations, thus drowning the noise [of the original dataset].”
For future work, the researchers intend to continue adding several variations for the cleaned-up terms in the dataset. They believe that “variability in language representation can significantly benefit the robustness of machine translation models by providing multiple ways of expressing the same idea.”