Microsoft announced that Translator, its AI-powered text translation service, now supports more than 100 languages and dialects. With the addition of 12 new languages, including Georgian, Macedonian, Tibetan, and Uyghur, Microsoft claims that Translator can now make text and information in documents accessible to 5.66 billion people worldwide.
Translator isn't the first service to support more than 100 languages; Google Translate reached that milestone in February 2016. (Amazon Translate supports only 71.) But Microsoft says that the new languages are underpinned by unique advances in AI and will be available in the Translator apps, Office, and Translator for Bing, as well as Azure Cognitive Services Translator and Azure Cognitive Services Speech.
Microsoft says the 12 new languages being added today are natively spoken by 84.6 million people collectively.
Powering Translator's upgrades is Z-code, part of Microsoft's larger XYZ-code initiative to combine AI models for text, vision, audio, and language in order to create AI systems that can speak, see, hear, and understand. The team comprises scientists and engineers from Azure AI and the Project Turing research group, and it focuses on building large-scale multilingual language models that support various production teams.
With Z-code, Microsoft is using transfer learning to move beyond the most common languages and improve translation accuracy for "low-resource" languages, meaning those with fewer than 1 million sentences of training data. (Like all models, Microsoft's learn from examples in large datasets sourced from a mixture of public and private archives.) Approximately 1,500 known languages meet this criterion, which is why Microsoft developed a multilingual translation training process that marries language families and language models.
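Z-code itself is served through Azure rather than released as an open checkpoint, but the core idea, a single multilingual model whose shared parameters let high-resource languages lift low-resource ones, can be sketched with an open multilingual model such as Meta's M2M100, chosen here purely for illustration and not Microsoft's system:

```python
# Illustration only: Z-code is served through Azure, not released publicly,
# so this sketch uses Meta's open M2M100 multilingual model instead.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Georgian source text ("Hello, how are you?").
tokenizer.src_lang = "ka"
encoded = tokenizer("გამარჯობა, როგორ ხარ?", return_tensors="pt")

# Force the decoder to emit Macedonian by seeding it with the target
# language token; the same shared model serves every direction.
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("mk")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Because all of the model's languages share one set of parameters, a direction with little parallel data, such as Georgian to Macedonian here, still benefits from what the model learned on better-resourced pairs.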
Techniques like neural machine translation, rewriting-based paradigms, and on-device processing have led to quantifiable leaps in machine translation accuracy. But until recently, even state-of-the-art algorithms lagged behind human performance. Efforts beyond Microsoft illustrate the magnitude of the problem: the Masakhane project, which aims to make thousands of languages spoken on the African continent automatically translatable, has yet to move beyond the data-gathering and transcription phase. Additionally, Common Voice, Mozilla's effort to build an open source collection of transcribed speech data, has vetted only dozens of languages since its 2017 launch.
Z-code language models are trained multilingually across many languages, and that knowledge is transferred between languages. A further round of training then transfers knowledge between tasks; for example, the models' translation skills ("machine translation") are used to help improve their ability to understand natural language ("natural language understanding").
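Microsoft hasn't published Z-code's training code, so the following PyTorch sketch only illustrates the general pattern this paragraph describes: one shared encoder updated by batches from both a translation task and an understanding task. The dimensions, class name, and random stand-in data are all hypothetical.

```python
# Toy multitask sketch (hypothetical sizes and random stand-in data); this is
# not Z-code's architecture, only the shared-encoder pattern described above.
import torch
import torch.nn as nn

VOCAB, DIM, CLASSES = 1000, 128, 3

class SharedEncoderMultitask(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # shared by both tasks
        self.mt_head = nn.Linear(DIM, VOCAB)     # translation: per-token target logits
        self.nlu_head = nn.Linear(DIM, CLASSES)  # understanding: sentence label logits

    def forward(self, tokens, task):
        hidden = self.encoder(self.embed(tokens))
        if task == "mt":
            return self.mt_head(hidden)
        return self.nlu_head(hidden.mean(dim=1))  # pool tokens for classification

model = SharedEncoderMultitask()
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

# Alternate batches from the two tasks so every gradient step also updates
# the shared encoder; that shared update is where cross-task transfer happens.
for step in range(20):
    source = torch.randint(0, VOCAB, (8, 16))
    if step % 2 == 0:  # translation batch
        target = torch.randint(0, VOCAB, (8, 16))
        loss = loss_fn(model(source, "mt").reshape(-1, VOCAB), target.reshape(-1))
    else:              # natural language understanding batch
        labels = torch.randint(0, CLASSES, (8,))
        loss = loss_fn(model(source, "nlu"), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```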
Chief rival Google is also using emerging AI techniques to improve language-translation quality across its service. Not to be outdone, Facebook recently revealed a model that uses a combination of word-for-word translations and back-translations to outperform systems for more than 100 language pairings. And in academia, MIT CSAIL researchers have presented an unsupervised model (i.e., one that learns from data that hasn't been explicitly labeled or categorized) that can translate between texts in two languages without direct translational data between the two.
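Back-translation, the technique Facebook's model leans on, is straightforward to demonstrate: translate trusted monolingual target-side text into the source language to manufacture synthetic training pairs. Here is a minimal sketch using the open MarianMT checkpoints, an illustrative stand-in rather than Facebook's actual system:

```python
# Sketch of standard back-translation (not Facebook's exact system): run an
# en->fr model over trusted monolingual English to manufacture synthetic
# French sources, yielding extra (fr, en) pairs for training fr->en.
from transformers import MarianMTModel, MarianTokenizer

checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

monolingual_en = [
    "The weather is nice today.",
    "Translation systems keep improving.",
]
batch = tokenizer(monolingual_en, return_tensors="pt", padding=True)
synthetic_fr = tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True)

# Pair the synthetic French with the original English; the human-written
# text sits on the target side, where output quality matters most.
augmented_pairs = list(zip(synthetic_fr, monolingual_en))
print(augmented_pairs)
```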
Of course, no machine translation system is perfect. Some researchers claim that AI-translated text is less "lexically" rich than human translations, and there's ample evidence that language models amplify biases present in the datasets they're trained on. AI researchers from MIT, Intel, and the Canadian initiative CIFAR have found high levels of bias in language models including BERT, XLNet, OpenAI's GPT-2, and RoBERTa. Beyond this, Google identified (and claims to have addressed) gender bias in the translation models underpinning Google Translate, particularly for resource-poor languages like Turkish, Finnish, Persian, and Hungarian.