There are still thousands of languages that automatic translators don't know


Of the more than seven thousand languages spoken around the world, only about a hundred are covered by the machine translation systems of Google or Microsoft. What can be done for the others, to study and preserve them?

Pieter Bruegel the Elder, The Tower of Babel (Wikipedia)

In almost 80 years of history, machine translation has made great strides, and we now use it daily without even realizing it. In reality, however, these systems - which are closely tied to machine learning, deep learning and natural language processing - cover only a small fraction of the languages spoken in the world. Google Translate currently offers the ability to communicate in around 108 languages, while Microsoft's Bing Translator covers 70. A drop in the bucket, considering that more than seven thousand languages are spoken in the world, and at least four thousand of them also have a writing system.

To understand the state of the art today, we need to rewind the tape on the history of machine translation. At first, it relied on specific rules that allowed only very limited translation. Only in the 1990s did statistical systems begin to spread, which started to give some reliability to the output of the translation process.

These systems are based on large amounts of language data, and they match two languages on the strength of the quantity of aligned sentence pairs available. The real turning point has come recently, with the application of neural networks. "These are predictive systems that reconstruct the way the brain works and thinks: they predict the relationships between words rather than limiting themselves to their linear concatenation," Carlo Eugeni, researcher and professor of audiovisual translation at the University of Leeds, explains to Wired.
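A toy sketch (in Python, purely illustrative and not a real translation system) of what "predicting the relationships between words" means at its simplest: counting which word tends to follow which, rather than storing fixed phrases. Neural networks learn far richer, non-linear relationships, but the predictive principle is comparable. The corpus sentences here are invented.

```python
from collections import Counter, defaultdict

# Invented mini-corpus: a real system would train on millions of sentences.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased a mouse",
]

# Count, for each word, which words follow it and how often.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word`, or None if unseen."""
    counts = follows[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat": it follows "the" twice, more than any other word
print(predict_next("sat"))  # "on"
```

Even this crude bigram counter shows why data quantity matters: every prediction is only as reliable as the number of examples behind it.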

Over a billion people now use #GoogleTranslate! Congratulations to the team on their historic technical advances that have dramatically improved translation quality. Breaking down language barriers is a big step as we work to address global challenges. https://t.co/IhztgOBL0q

- Kent Walker (@Kent_Walker) April 28, 2021



A matter of data

Both statistical and neural systems work between two languages, with the difference that neural networks improve quality according to the data and its variability. In short: the more data there is, and the more varied it is, the better the automatic translation will work. The heart of the problem for the many languages not yet covered by these systems lies precisely here: the lack of data and sources. "We have difficulties with languages with less data, but they are not the least spoken ones," continues Eugeni. "Think, for example, of Tamil (spoken in several Asian countries, ed): it has more speakers than Italian, but there is not enough data to guarantee a quality translation."

The lack of data has many causes. "Many languages, tribal languages for example, exist only in oral form and leave no written traces. And for the languages that do have a written form, the so-called parallel text is still needed, that is, a translation into another language such as English," the professor continues. But that's not the only problem. "To build a neural network it is not enough to have a single translation: many parallel texts of different types are needed to feed the translation system," adds Eugeni. The example of Tamil returns here as well: there is a lot of material, but no translations.
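To make the idea of a parallel text concrete, here is a minimal sketch (in Python; the sentences are invented for illustration) of what sentence-aligned training data looks like: each source sentence is paired with its translation, and it is these pairs, in bulk and in variety, that feed a translation system.

```python
# Invented example of a sentence-aligned parallel corpus (Italian-English).
source_sentences = [
    "Il gatto dorme sul divano.",
    "La traduzione automatica è utile.",
    "Domani pioverà.",
]
target_sentences = [
    "The cat sleeps on the sofa.",
    "Machine translation is useful.",
    "It will rain tomorrow.",
]

# Alignment here is positional: sentence i in one list translates sentence i
# in the other. Real corpora need careful (often automatic) alignment.
assert len(source_sentences) == len(target_sentences)
pairs = list(zip(source_sentences, target_sentences))

for src, tgt in pairs:
    print(f"{src!r} -> {tgt!r}")
```

For a language with no written form, or no translations, this list simply cannot be built: that is the data gap the article describes.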

Parallel text, therefore, is fundamental for neural networks, together with the reliability of the sources and the quality of the data. The reverse case is also interesting. Take Irish and Maltese: languages spoken by very few people compared to many others, which nevertheless have a greater amount of good-quality data, because a great deal of European Union documentation is translated into them as well. "Although the results are not comparable to those for languages such as English, Spanish, French or Italian, the neural network for these European languages works much better than for far more widely spoken ones, precisely because the predictive system is fed by good sources," explains Eugeni. Suffice it to say that, in the span of a decade, the European Parliament alone produces a data collection of about 1.37 billion words in 23 languages. The same holds for other institutions such as the Canadian Parliament or the United Nations.

Here's a dash of regional flavor for a happy new beginning with #MicrosoftTranslator. Translate in over 13 Indian languages now. #NewYear #NewBeginnings #Ugadi #Vishu #GudiPadwa #PoilaBaisakh #PanaSankranti #Vaisakhi pic.twitter.com/SBCMq7Fqhf

- Microsoft India R & D (@microsoftidc) April 14, 2021



New horizons

Faced with this situation - many spoken languages with scarce data and no parallel texts - research does not stop and does not give up. The latest frontier is called massive multilingual neural machine translation. "We take the predictive system applied between two languages with plenty of data and transfer it - always predictively - to one or more so-called low-resource languages, that is, languages with little data of poor quality," explains Eugeni. One example of how this multilingual neural network works is Luxembourgish, which is spoken by few and has very strong dialectal variation. "Even though it has very few speakers," explains the professor, "it derives from German, and the translation system works precisely because the neural network applied to translation between English and Luxembourgish comes from a similar language."
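A rough intuition for why transfer works between related languages can be sketched with nothing more than surface similarity, using Python's standard difflib. The German-Luxembourgish word pairs below are illustrative, and real multilingual models compare learned internal representations, not spellings.

```python
from difflib import SequenceMatcher

# Illustrative German / Luxembourgish word pairs: closely related languages
# share much of their surface form, which is part of why knowledge learned
# on a high-resource language transfers to a low-resource relative.
word_pairs = [
    ("Sprache", "Sprooch"),      # "language"
    ("Haus", "Haus"),            # "house"
    ("schreiben", "schreiwen"),  # "to write"
]

for de, lb in word_pairs:
    ratio = SequenceMatcher(None, de, lb).ratio()  # 1.0 = identical strings
    print(f"{de} / {lb}: similarity {ratio:.2f}")
```

Massive multilingual models go far beyond this, sharing subword vocabularies and representations across dozens of languages, but the underlying bet is the same: what is learned for German does not have to be learned from scratch for Luxembourgish.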

Looking beyond the logic of working with language pairs, there are other projects as well. As reported by the BBC, Iarpa, the research arm of the US intelligence services, is funding research to develop a system capable of finding, translating and summarizing information from any low-resource language, be it text or voice. The developers aim to arrive at an "English-in, English-out" system which, given a domain-sensitive English query, will retrieve relevant data from a large multilingual archive and display it translated.

All these projects are useful when you need to quickly translate a text or some information - as long as it is not of vital importance - from a language with little data into one well established in automatic translation systems. Not only that: including minor languages in these processes also means protecting and saving them. In America and India there are projects to develop automatic translation software aimed specifically at languages of this type, with the goal of helping them survive, even simply by creating written materials for languages that are only oral. "At the level of linguistic heritage management, the only way is the digitization of culture," Eugeni underlines, "even if not everyone agrees." The case of the Māori reminds us of this: according to Wired UK, they want to prevent big tech companies from accessing their linguistic data.

Towards a universal language?

If it is true that research teams are using neural network technology to tackle the problem, it is equally true that neural network models have revolutionized language processing in recent years. Instead of simply memorizing words and phrases, they can - to simplify - learn their meaning, helping users in everyday life.

"Conceptually it is a real revolution," concludes Eugeni. "For many years the dream of many linguists has been - and perhaps still is - to find a universal linguistic system that would allow anyone in the world to understand each other, turning the clocks back to before Babel. With neural networks applied to multilingual translation (massive multilingual neural machine translation), one day it could be possible to translate from any language into any other language."

The well-known linguist Noam Chomsky was already addressing the universality of language in 1957. His theory held that, as humans, we have an innate capacity to interact with our fellow human beings. Between the more skeptical and the more optimistic, chasing the chimera of universalism to make translation systems more accessible could be the key to a tomorrow without language barriers.


