Tuesday, February 27, 2024, 01:42 PM
Have you noticed how hype has a habit of hiding important facts? Now, as excitement grows over the sensational potential of new AI to improve lives, Guildhawk warns there is a little-talked-about secret killer that can drown AI: the poor-quality dataset. And it transpires there is a flood of them out there.

New research has revealed that much of the translated content on the internet is poor quality and unsuitable for training GPT-style large language models. That is a big problem for companies that want to scrape the web for data to train their machine learning engines, because those engines must be trained on clean, high-quality data; otherwise they generate results that are wrong. So what is the big problem with poor-quality AI training data, and how do you solve it?
The web is flooded with Machine-Generated Translations, especially in lower-resource languages
Researchers have discovered a surprising prevalence of machine-generated translations across the web, particularly in languages with fewer resources. This "multi-way parallel" content, meaning the same sentences published in translation across many languages at once, constitutes a significant portion of total web content.
This is especially prominent for lower-resource languages and dialects, which have relatively little data available for training conversational AI systems. By contrast, high-resource languages such as English, Spanish, and Chinese have far more training data available.
Large Dataset Reveals Low Quality of Machine Translations
To analyse this phenomenon, a research team created a massive dataset called Multi-Way ccMatrix, containing 6.4 billion unique sentences in 90 languages. By examining patterns of "multi-way parallelism" (sets of sentences with translations in three or more languages), they found that the quality of these translations tends to be low, and that it drops further as the number of languages involved grows.
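To make the idea concrete, here is a minimal sketch of how multi-way parallel sets can be detected in a corpus of aligned sentence pairs. The three-language threshold follows the study's definition quoted above, but the data layout and function names are illustrative assumptions, not the researchers' actual code.

```python
from collections import defaultdict

# Each record is one aligned pair: (pivot_sentence, language_code, translation).
# In practice these come from a web-mined bitext corpus; this flat layout is
# an assumption made for the example.
pairs = [
    ("How are you?", "fr", "Comment allez-vous ?"),
    ("How are you?", "de", "Wie geht es Ihnen?"),
    ("How are you?", "sw", "Habari yako?"),
    ("Rare technical sentence.", "fr", "Phrase technique rare."),
]

def find_multiway_sets(pairs, min_languages=3):
    """Group aligned pairs by their shared pivot sentence and keep only the
    sets with translations in at least `min_languages` languages, i.e. the
    "multi-way parallel" sentences described in the study."""
    groups = defaultdict(dict)
    for pivot, lang, translation in pairs:
        groups[pivot][lang] = translation
    return {s: t for s, t in groups.items() if len(t) >= min_languages}

for sentence, translations in find_multiway_sets(pairs).items():
    print(f"{sentence!r} has translations in {len(translations)} languages")
```

On this toy input, only "How are you?" qualifies as multi-way parallel; the rarer, more specific sentence does not.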
Shorter, predictable sentences dominate, biasing training data
The study also revealed a bias towards shorter, more predictable sentences in multi-way parallel data. These sentences often originated from low-quality articles, suggesting a trend of low-quality English content being translated en masse into many lower-resource languages via machine translation.
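Building on the grouping sketch above, the length bias itself is straightforward to measure: bucket sentences by how many languages they appear in and compare average lengths. The toy data and whitespace tokenisation below are illustrative assumptions, not the study's methodology.

```python
from statistics import mean

def length_by_parallelism(groups):
    """Report the mean (whitespace-tokenised) sentence length per degree of
    parallelism. `groups` maps a pivot sentence to its {language: translation}
    dict, as in the previous sketch but without the three-language filter."""
    buckets = {}
    for sentence, translations in groups.items():
        n = len(translations) + 1  # +1 counts the pivot sentence itself
        buckets.setdefault(n, []).append(len(sentence.split()))
    for n in sorted(buckets):
        print(f"{n} languages: mean length {mean(buckets[n]):.1f} tokens "
              f"({len(buckets[n])} sentences)")

# Toy illustration: the short greeting appears in more languages than the
# longer technical sentence, mirroring the bias described above.
length_by_parallelism({
    "How are you?": {"fr": "...", "de": "...", "sw": "...", "pl": "..."},
    "The valve assembly requires periodic recalibration.": {"fr": "..."},
})
```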
Furthermore, the topic distribution of multi-way parallel data differed significantly from that of content found in only one or two languages, skewing towards lower-quality topics and potentially impacting the training of large language models (LLMs).
Concerns for LLM Training: Low-quality data, hallucinations, and more
These findings raise significant concerns for training LLMs, particularly when using web-scraped data that contains low-quality machine translations. The researchers emphasise the crucial role of data quality in LLM training, warning that models trained on such data are likely to be less fluent and more prone to hallucination.
Train your AI translation engine with clean data: Future-proof your translations
Translation is a big investment for a business, especially when you commission expert human linguists with credentials specific to your industry or vertical. These verified translations are ideal for training machine learning models precisely because of their high quality, and using them to train an AI engine specific to your company is a prudent step that improves results.
Clean data in, clean results out. Moreover, it helps you future-proof your translations so they keep giving you a return on investment. That saves you time and money; after all, why keep re-translating the same content when an AI translation engine can do it for you?
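As a rough illustration of this clean-data-in principle, the sketch below exports only human-verified translation pairs from a translation repository into a JSONL training file. The record fields and the "verified" status flag are assumptions made for the example, not Guildhawk's internal format.

```python
import json

def export_training_set(records, out_path, required_status="verified"):
    """Write linguist-approved translation pairs to a JSONL training file,
    skipping anything that has not passed human review."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            if rec.get("status") != required_status:
                continue  # unverified data has no place in the training set
            f.write(json.dumps({
                "source": rec["source_text"],
                "target": rec["target_text"],
                "source_lang": rec["source_lang"],
                "target_lang": rec["target_lang"],
            }, ensure_ascii=False) + "\n")
            kept += 1
    return kept

# Hypothetical repository records:
records = [
    {"source_text": "Quarterly results", "target_text": "Résultats trimestriels",
     "source_lang": "en", "target_lang": "fr", "status": "verified"},
    {"source_text": "Click here", "target_text": "Cliquez ici",
     "source_lang": "en", "target_lang": "fr", "status": "machine_draft"},
]
print(export_training_set(records, "training_pairs.jsonl"), "pairs exported")
```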
Future-proofing translated content has been Guildhawk's focus since we were established in 2001. Our partners love receiving amazingly accurate translations, but they do not want to pay to re-translate the same content year after year. That's why we started training machine learning models on high-quality translated data that has undergone a rigorous vetting process by professional linguists.
This is what powers GAI, our AI translation platform. Now, our partners ask us to build translation engines specific to their business to guarantee the quality and accuracy of results, and our training strategy for GAI relies exclusively on that linguist-vetted data.
Multilingual data labelling system: Ensuring quality for Machine Translation training
It is imperative to implement a diligent multilingual data-labelling system to confirm that only approved data is added to a training repository. At Guildhawk, we have been labelling and organising data for almost a decade, and we maintain a strict rule: unverified data has no place in our system, because using it could lead to poor machine translation results.
Our labelling system is not an added chore performed only when we need it; it is part of an ongoing process. It allows us to label data on the go and to sort it by domain, accuracy level, and other factors. The result? A sensational data lake of high-quality material that generates accurate, authentic translation results, particularly for specific verticals.
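For illustration, a labelling record of this kind might look like the sketch below. The field names (domain, accuracy level, reviewer) are assumptions made for the example, not Guildhawk's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LabelledSegment:
    """One translated segment with the labels used to sort a data lake."""
    source_text: str
    target_text: str
    language_pair: str        # e.g. "en-de"
    domain: str               # e.g. "legal", "life-sciences"
    accuracy_level: str       # e.g. "linguist-verified", "unreviewed"
    reviewer: Optional[str]   # linguist who signed off, if any

def domain_subset(segments: List[LabelledSegment], domain: str,
                  accuracy_level: str = "linguist-verified"):
    """Select only verified segments from one vertical, e.g. to train a
    domain-specific translation engine."""
    return [s for s in segments
            if s.domain == domain and s.accuracy_level == accuracy_level]
```

Labelling at ingestion time is what makes a query like `domain_subset` cheap later: the training set for a new vertical-specific engine becomes a filter, not a fresh review project.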
The key to mitigating the negative impact of low-quality data on LLM training lies in detecting and filtering out machine-generated content. The strategy applied by Guildhawk is clear: never compromise on quality, and always prioritise accurate, verified data for machine translation training.
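As a hedged illustration, a first-pass filter could combine the two risk signals discussed in this article: a sentence appearing in very many languages (a hint of mass machine translation) and very short, formulaic wording. The thresholds below are placeholders that would need tuning against a human-verified sample, not established cut-offs.

```python
def keep_for_training(sentence, num_languages,
                      max_parallel_langs=8, min_tokens=5):
    """Heuristic pre-filter for web-scraped translation data. Returns False
    for sentences that look like mass machine translation (found in very many
    languages) or that are too short and formulaic to be a reliable training
    signal. Thresholds are illustrative."""
    if num_languages > max_parallel_langs:
        return False  # highly multi-way parallel: likely machine-generated
    if len(sentence.split()) < min_tokens:
        return False  # short, predictable boilerplate
    return True

print(keep_for_training("How are you?", num_languages=12))                      # False
print(keep_for_training("The valve assembly requires periodic recalibration.",
                        num_languages=2))                                       # True
```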
Key Takeaways:
Machine-generated translations are widespread on the web, especially in lower-resource languages.
These translations are often low-quality, potentially impacting LLM training.
The content selection process exhibits bias towards shorter, predictable sentences and low-quality topics.
Never compromise on quality, and always prioritise accurate, verified data for machine translation training.