Category Archives: Semantics: blog

About the typology of machine translation systems

The distinction between rule-based and statistics-based translation may well be artificial, and may obscure the really interesting distinction between machine translation modules. That distinction may lie in the fact that some methods capture (at least partially) the semantics of a text and are able, for example, to enumerate the lemmas in the text, or to change the person of verbs or the gender of nouns. In contrast, other translation methods do not capture the semantics of the text and only perform the translation. This type of classification, at least, seems relevant to artificial intelligence.

Interjections

What are interjections (Hello! Good evening! Merry Christmas! Happy Birthday!…) in the present framework? They are words preceded by a punctuation mark (period, comma, exclamation mark, question mark, etc.) and followed by a punctuation mark.
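
The positional definition above can be sketched in Python. This is only a toy illustration: the tokenizer, the names `tokenize` and `find_interjections`, and the `max_words` cut-off (added so that whole sentences are not returned) are assumptions, not part of the framework described here.

```python
import re

# Punctuation marks treated as boundaries (illustrative, not exhaustive).
PUNCTUATION = set(".,!?;:…")

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*|[.,!?;:…]", text)

def find_interjections(text, max_words=3):
    """Return short word groups that are closed by a punctuation mark and
    opened by one (the start of the text counts as a boundary: assumption)."""
    found, current = [], []
    for token in tokenize(text):
        if token in PUNCTUATION:
            # A group is kept only if it is non-empty and short enough.
            if 0 < len(current) <= max_words:
                found.append(" ".join(current))
            current = []
        else:
            current.append(token)
    return found

print(find_interjections("Hello! The weather is nice today. Merry Christmas!"))
```

Such a purely positional test will also accept short non-interjections placed between two punctuation marks, so it is best read as a first filter.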

An analysis of French word ‘très’

According to our analysis, the word ‘très’ is likely to occur in the following grammatical types:

  • Adjective modifier: here, ‘très’ modifies the meaning of an adjective: très beau (very beautiful, biddisimu), très content (very happy, cuntentissimu)
  • Adverb modifier: ‘très’ here modifies the meaning of an adverb: ‘très rarement’ = very rarely, raramenti; ‘très souvent’ = very often, mori à spessu
  • Adverb (i.e. in our terminology, a verb modifier): here, ‘très’ modifies the meaning of a verb: ‘j’ai très faim’ = I am very hungry, t’aghju mori fami; ‘il avait très soif’ = he was very thirsty, t’aia mori seti. The verb here is the verbal locution ‘avoir faim’ = to be hungry, avè a fami; ‘avoir soif’ = to be thirsty, avè a seti

Leaving ambiguity unresolved

Disambiguation is an essential process in machine translation. Sometimes, however, it seems more rational to leave an ambiguity in the translation. This is the case when (i) there is an ambiguous word in the sentence to be translated; and (ii) the context does not provide an objective reason to choose one of its two meanings. It seems that in this case, the best translation is the one that leaves the ambiguity intact.

Let’s take an example. Consider the following French sentence: ‘Son palais était en feu.’ The French word ‘palais’ is ambiguous, because it corresponds in English and in Corsican to two different words (palace, palazzu and palate, palatu).

Thus, we have three possible translations:

  • His palate was on fire
  • His palace was on fire
  • His palace/palate was on fire

The third translation, in my opinion, is better, because it points out that the context is insufficient to choose one of the two alternatives.

Consider now, on the one hand, the following sentences: ‘Il avait mangé du piment fort. Son palais était en feu.’ Here the context provides an objective reason to choose one of the two meanings. This yields the following translation: He had eaten some hot pepper. His palate was on fire.

On the other hand, consider the following sentences: ‘Les ennemis du prince avaient lancé des engins incendiaires. Son palais était en feu.’ Here we also have an objective reason, this time to choose the other alternative. The translation is then: The prince’s enemies had thrown incendiary devices. His palace was on fire.
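
The policy described in this post can be sketched as follows. This is a minimal illustration, not the translator's actual code: the `SENSES` table, its cue words and the function name `translate_ambiguous` are all hypothetical.

```python
import re

# Toy lexicon: ambiguous French word -> {English sense: context cue words}.
# The cue words are illustrative assumptions.
SENSES = {
    "palais": {
        "palate": {"mangé", "piment", "goût", "bouche"},
        "palace": {"prince", "roi", "ennemis", "incendiaires"},
    }
}

def translate_ambiguous(word, context):
    """Return a single sense if the context supports exactly one of them;
    otherwise keep the ambiguity and return all senses joined by '/'."""
    senses = SENSES[word]
    context_words = set(re.findall(r"\w+", context.lower()))
    supported = [sense for sense, cues in senses.items() if cues & context_words]
    if len(supported) == 1:
        return supported[0]
    return "/".join(sorted(senses))

print(translate_ambiguous("palais", "Son palais était en feu."))
print(translate_ambiguous("palais", "Il avait mangé du piment fort. Son palais était en feu."))
```

With no cue in the context, the function returns `palace/palate`, which corresponds to the third translation above.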

Dictionary = Corpus?

As far as machine translation is concerned, it seems best to combine the strengths of the two approaches, rule-based and statistics-based. If the two approaches could be made to converge, the benefit could be great. Let us try to define what could allow such a convergence, based on the two-sided grammatical approach, and illustrate it with an example.
Consider u soli sittimbrinu = ‘le soleil de septembre’ (the September sun). In Corsican, sittimbrinu is a masculine singular adjective that means ‘de septembre’ (of September). In French, ‘de septembre’ is, from an analytic perspective, a preposition followed by a masculine singular common noun. But according to the two-sided analysis, ‘de septembre’ is also, from a synthetic perspective, a masculine singular adjective. This double nature of ‘de septembre’ is what allows it to be aligned with sittimbrinu.
More generally, if we define words or groups of words in the dictionary according to the two-sided grammatical analysis, we also obtain an alignment tool, which a statistics-based translation system can use in the same way as a corpus. Thus, if it is sufficiently rich, the dictionary is also a corpus, and indeed an aligned corpus.
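
As a rough sketch of this idea, a dictionary entry can carry both its analytic and its synthetic analysis, and alignment can then be read off the synthetic type. The `Entry` class, the tag names (`PREP`, `NOUN_ms`, `ADJ_ms`) and the `aligned_pairs` helper are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    text: str
    analytic: tuple  # word-by-word grammatical types
    synthetic: str   # grammatical type of the group taken as a whole

# Two-sided entries for the example above (tags are assumptions).
FRENCH = Entry("de septembre", ("PREP", "NOUN_ms"), "ADJ_ms")
CORSICAN = Entry("sittimbrinu", ("ADJ_ms",), "ADJ_ms")

DICTIONARY = [(FRENCH, CORSICAN)]

def aligned_pairs(dictionary):
    """Return (source, target) text pairs whose synthetic types agree."""
    return [(src.text, tgt.text)
            for src, tgt in dictionary
            if src.synthetic == tgt.synthetic]

print(aligned_pairs(DICTIONARY))
```

The analytic sides differ (two words against one), but the shared synthetic type is what makes the pair usable as an aligned unit.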

Grammatical taxonomy again: the case of prepositions

Let’s look at the translation of the French word ‘dont’. Depending on the case, ‘dont’ can be a

  • relative pronoun: ‘la difficulté dont je t’ai parlé’ (the difficulty I told you about), ‘voilà le professeur dont j’apprécie beaucoup les cours’ (this is the teacher whose classes I really enjoy.)
  • or, more rarely, a preposition: ‘il y avait cinq couleurs, dont le rouge et le bleu’. (there were five colours, including red and blue.)

It is the latter case that we will be looking at. In this case, ‘dont’ is translated into English as ‘including’. In Corsican, the translation is: c’eranu cinque culori, frà i quali u rossu è u turchinu. But if we translate ‘il y avait cinq plantes, dont le ciste et la bruyère’ (‘there were five plants, including cistus and heather’), we get: c’eranu cinque piante, frà e quale u muchju è a scopa. Thus the translation of ‘dont’ (including) as a preposition is either frà i quali (masculine plural, culore being masculine in Corsican) or frà e quale (feminine plural), depending on which noun ‘dont’ refers to.

Thus ‘dont’ is translated into the masculine plural or the feminine plural, depending on the noun – either masculine or feminine – to which it refers. This casts doubt on the ‘prepositional’ nature of ‘dont’, and leads to further analysis to determine whether there might not be a more suitable grammatical type.
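
The agreement described above can be sketched as a simple lookup. The two tables and the function name `translate_dont` are hypothetical; only the Corsican forms and the genders of the two antecedents come from the examples above.

```python
# Gender of the (plural) Corsican antecedents used in the examples above.
ANTECEDENT_GENDER = {"culori": "m", "piante": "f"}

# Corsican rendering of prepositional 'dont', by gender of the antecedent.
DONT_BY_GENDER = {"m": "frà i quali", "f": "frà e quale"}

def translate_dont(antecedent):
    """Pick the Corsican form of prepositional 'dont' that agrees with
    the plural noun it refers to."""
    return DONT_BY_GENDER[ANTECEDENT_GENDER[antecedent]]

print(translate_dont("culori"))
print(translate_dont("piante"))
```

The point of the sketch is that the lookup needs the antecedent as input, which a plain preposition would not.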

It is worth noting that ‘dont’ (including) can be replaced by ‘parmi lesquels’ (among which, frà i quali) or ‘parmi lesquelles’ (among which, frà e quale), depending on whether the noun to which ‘dont’ refers is masculine plural or feminine plural. This suggests that ‘dont’ could be conceived of as a preposition followed by a pronoun. In the spirit of this analysis, the BDL site notes: ‘Dont’ is probably the relative pronoun whose use is the most delicate. To use it correctly, one must know that ‘dont’ always ‘hides’ the preposition ‘de’; ‘dont’ is equivalent to ‘de qui’, ‘de quoi’, ‘duquel’, etc. This link between ‘dont’ and ‘de’ goes back to the Latin origin of ‘dont’, which is from ‘unde’ “from where”.

More generally, this suggests that further analysis of some prepositions may be needed.

Creating new grammatical types

Italian has ‘prepositions followed by articles’ (preposizioni articolate). This is a specific grammatical type, which refers to a word (e.g. della) that replaces a preposition (di) followed by an article (la):

	il	lo	l’	la	i	gli	le
di	del	dello	dell’	della	dei	degli	delle
a	al	allo	all’	alla	ai	agli	alle
da	dal	dallo	dall’	dalla	dai	dagli	dalle
in	nel	nello	nell’	nella	nei	negli	nelle
su	sul	sullo	sull’	sulla	sui	sugli	sulle

This specific grammatical type also corresponds to:

  • in French: du = de le, des = de les
  • in Corsican and especially in the Sartenese variant: ‘llu = di lu, ‘lla = di la, etc.

This raises the general problem of the number of grammatical types we should retain. Should we create new grammatical types beyond the classical ones, in order to optimise translators and NLP in general? What is the best grammatical type to retain for ‘prepositions followed by an article’: a new primitive one or a compound one (always keeping Occam’s razor in mind)? A preposition followed by an article behaves like a preposition for words on its left, and like an article for words on its right.
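
One way to picture the ‘compound’ option is to store each articulated form as a (preposition, article) pair, so that an analyser can expose the preposition to the left context and the article to the right context. The sketch below only re-encodes the Italian table above; the names `ARTICULATED`, `contract` and `decompose` are hypothetical.

```python
ARTICLES = ["il", "lo", "l’", "la", "i", "gli", "le"]

# Rows of the table above: preposition -> articulated forms, in article order.
TABLE = {
    "di": ["del", "dello", "dell’", "della", "dei", "degli", "delle"],
    "a": ["al", "allo", "all’", "alla", "ai", "agli", "alle"],
    "da": ["dal", "dallo", "dall’", "dalla", "dai", "dagli", "dalle"],
    "in": ["nel", "nello", "nell’", "nella", "nei", "negli", "nelle"],
    "su": ["sul", "sullo", "sull’", "sulla", "sui", "sugli", "sulle"],
}

# (preposition, article) -> articulated form, and the reverse mapping.
ARTICULATED = {
    (prep, art): form
    for prep, forms in TABLE.items()
    for art, form in zip(ARTICLES, forms)
}
COMPONENTS = {form: pair for pair, form in ARTICULATED.items()}

def contract(preposition, article):
    """Return the single word replacing preposition + article."""
    return ARTICULATED[(preposition, article)]

def decompose(form):
    """Return the (preposition, article) pair behind an articulated form,
    or None if the word is not in the table."""
    return COMPONENTS.get(form)

print(contract("di", "la"))
print(decompose("nello"))
```

Under this encoding no new primitive type is needed: the compound is fully described by its two components.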

Evaluation of the performance after changes

Just performed a series of open tests, using the (pseudo-random) article of the day from the French Wikipedia. The results (in %) for the Taravese version of the Corsican language are the following:
95.76
95.76
94.34
95.76
99.25
95.04
95.48
that is to say an average of about 95.9%, taking into account that the ‘cismuntinca’ version generally obtains a slightly lower result, because its masculine and feminine plurals are different (whereas they are identical in Taravese).
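
The average of the seven scores listed above can be checked with a few lines of Python (decimal commas written as points; the variable names are arbitrary):

```python
from statistics import mean

# The seven open-test scores listed above, in percent.
scores = [95.76, 95.76, 94.34, 95.76, 99.25, 95.04, 95.48]

average = mean(scores)
print(f"{average:.2f}")
```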

Grammatical word-disambiguation again and again

The main difficulty here seems to lie in the adaptation of the grammatical disambiguation module. Indeed, for the French language, such a module performs disambiguation with respect to about 100 categories. The number of pairs (or 3-tuples, 4-tuples, etc.) of disambiguation, for French, is about 250. The question is: when we change languages, how many categories of n-tuples of disambiguation does this result in? In particular, when one switches from French to Italian, does this result in a big change in the categories to be disambiguated?

Let’s take an example, with a particular category of words to disambiguate. One such category is for example AQfs/Vsing3present (feminine singular adjective or verb in the 3rd person singular present tense). A word in Italian that belongs to this type is ‘stanca’. So we have both uses:

  • ‘è stanca’ (she is tired): AQfs
  • ‘stanca il cavallo’ (it tires the horse): Vsing3present

In French, we don’t have this kind of disambiguation category directly, because the category concerned is broader: it also includes at least the 1st person singular of the present. Thus we have the word ‘sèche’, which belongs to this broader disambiguation category:

  • ‘la feuille est sèche’ (the leaf is dry): AQfs
  • ‘je sèche mes cheveux’ (I dry my hair): Vsing1present
  • ‘il sèche sa chemise’ (he dries his shirt): Vsing3present

Of course, the code that allows the disambiguation of AQfs/Vsing1present/Vsing3present should also allow the derivation of the disambiguation of AQfs/Vsing3present. But this gives an idea of the kind of problems that arise and the adaptation needed.
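
A minimal sketch of how one set of rules could serve both categories: each ambiguity category is represented as a set of candidate types, and a rule's answer is accepted only if it belongs to that set. The rules themselves (look at the previous word only) and all names are toy assumptions, far simpler than a real disambiguation module.

```python
# Ambiguity categories as sets of candidate grammatical types.
FR_SECHE = frozenset({"AQfs", "Vsing1present", "Vsing3present"})
IT_STANCA = frozenset({"AQfs", "Vsing3present"})

COPULAS = {"est", "è"}        # French and Italian "is"
FIRST_PERSON = {"je", "io"}   # French and Italian "I"

def rule_adjective(previous):
    return "AQfs" if previous in COPULAS else None

def rule_first_person(previous):
    return "Vsing1present" if previous in FIRST_PERSON else None

def rule_default(previous):
    return "Vsing3present"

RULES = [rule_adjective, rule_first_person, rule_default]

def disambiguate(candidates, previous):
    """Apply the rules in order; keep the first answer that is one of the
    candidate types of the category at hand."""
    for rule in RULES:
        tag = rule(previous)
        if tag in candidates:
            return tag
    return None

print(disambiguate(FR_SECHE, "je"))    # 'je sèche mes cheveux'
print(disambiguate(IT_STANCA, "è"))    # 'è stanca'
print(disambiguate(IT_STANCA, None))   # 'stanca il cavallo'
```

Because answers outside the candidate set are ignored, the rules written for the broader French category also cover the narrower Italian one without modification.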

If the types of disambiguation are very different from one language to another, it will be necessary to have a disambiguation module which is capable of adapting to many new types of disambiguation and which is therefore very flexible. This appears to be a considerable difficulty for the creation of an eco-system. It seems that Apertium, faced with this difficulty, has chosen a statistical module as a solution for its eco-system. However, whether such a flexible module, adaptable without difficulty from one language to another, is feasible in the context of rule-based MT remains an open question.

First feasibility test: dictionary morphing

The first test carried out to transform the dictionary (in the extended sense) based on the French-Corsican pair into a dictionary for the Italian-Gallurese pair shows that this is feasible. The result, of acceptable but improvable quality, is obtained in 21 minutes (with 16 GB of RAM and an Intel Core i7-8550U CPU). We start with a multilingual dictionary based on French entries, and the final result is an Italian-Gallurese dictionary.

Translation from Italian to Gallurese

Our new project will be to try to implement translation from Italian into Gallurese, since this is an essential pair for the Gallurese language and therefore a priority. The major difficulties in doing this are:
– on the one hand, to (automatically) transform the dictionary (in the extended sense) based on the French-Corsican pair into a dictionary for the Italian-Gallurese pair;
– on the other hand, to implement the other modules automatically (without having to rewrite them entirely), in particular the one based on grammatical disambiguation.

The stakes here seem high. It is a question of transforming a system that can translate one pair of languages (i.e. French into Corsican) into an eco-system that can translate several pairs of languages (in which the target language is an endangered language).

Adjective modifiers again

We will consider again a category of words such as ‘very’, when they precede an adjective. Traditionally, this category is termed ‘adverbs’ or ‘adverbs of degree’, but we prefer ‘adjective modifier’, because (i) analytically, they change the meaning of an adjective and (ii) synthetically, an adjective modifier followed by an adjective is still an adjective. A more complete list is: almost, absolutely, badly, barely, completely, decidedly, deeply, enormously, entirely, extremely, fairly, fully, greatly, hardly, highly, how, incredibly, intensely, less, most, much, nearly, perfectly, positively, practically, pretty, purely, quite, rather, really, scarcely, simply, somewhat, strongly, terribly, thoroughly, totally, utterly, very, virtually, well.

If we look at sentences such as: il est bien content (he is very happy, hè beddu cuntenti), ils étaient bien contents (they were very happy, erani beddi cuntenti), elle serait bien contente (she would be very happy, saria bedda cuntenti), elles sont bien contentes (they are very happy, sò beddi cuntenti), we can see that the modifier of the adjective ‘bien’ is rendered as very in English and in Corsican as:

  • bellu/beddu: singular masculine
  • belli/beddi: plural masculine
  • bella/bedda: singular feminine
  • belle/beddi: plural feminine

This shows that the adjective modifier is invariable in French and English, but varies in gender and number in Corsican. Thus, in Corsican grammar, it seems appropriate to distinguish between:

  • singular masculine adjective modifier
  • plural masculine adjective modifier
  • singular feminine adjective modifier
  • plural feminine adjective modifier

On the other hand, such a distinction does not seem useful in English and French, where the category of ‘adjective modifier’ is sufficient and there is no need for further detail.
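
The four Corsican adjective-modifier types can be sketched as a lookup keyed by gender and number. The table uses the ‘beddu’ forms from the examples above; the names `BIEN_IN_CORSICAN` and `modify_adjective` are hypothetical.

```python
# Corsican rendering of the French adjective modifier 'bien', by the
# gender and number of the adjective it modifies ('beddu' variant).
BIEN_IN_CORSICAN = {
    ("m", "sg"): "beddu",
    ("m", "pl"): "beddi",
    ("f", "sg"): "bedda",
    ("f", "pl"): "beddi",
}

def modify_adjective(adjective, gender, number):
    """Prefix a Corsican adjective with the agreeing form of the modifier."""
    return f"{BIEN_IN_CORSICAN[(gender, number)]} {adjective}"

print(modify_adjective("cuntenti", "m", "sg"))
print(modify_adjective("cuntenti", "f", "sg"))
```

For French or English, the same function would reduce to a constant prefix (‘bien’, ‘very’), which is why no finer category is needed there.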

On ‘reflexive pronouns’

Pursuing the reflection on grammatical categories, we will now examine “reflexive pronouns”. These are:

  • me te se nous vous se (French)
  • mi ti si ci vi si (Corsican)
  • myself yourself himself/herself/itself ourselves yourselves themselves (English)

Let us take an example:

  • je me promène, tu te promènes, il se promène, nous nous promenons, vous vous promenez, ils se promènent
  • I walk, you walk, he walks, we walk, you walk, they walk
  • spassieghju, spassieghji, spassieghja, spassiemu, spassieti, spassièghjani

These reflexive pronouns are usually associated with so-called pronominal verbs.
From our point of view, this classification as ‘pronouns’ is unsatisfactory, because they always precede a verb,1 but are placed after a personal subject pronoun, an indefinite pronoun, or a nominal group. In particular, the notion of pronoun following a pronoun is not coherent, from the point of view of our analysis, where the main criterion for typology is the position of a given grammatical type in relation to another.

Let us recall here that the idea behind this reconstruction of grammatical typology is the hypothesis that traditional classification lacks coherence and that this considerably hinders the development of natural language analysis and, at the same time, the development of machine translation modules based on the emulation of human reasoning.

This example suggests that the classic ‘reflexive pronoun’ is a word that introduces into the verb to which it refers a notion of reflexivity of action. In this sense, it is more of a specialized verb modifier. It is thus more akin to the adverb in the sense that we have defined it, i.e. a verb modifier in the broad sense. The adverb in this sense can be placed before or after the verb. On the other hand, the reflexive verb modifier as we have defined it can only be placed in French before the verb.

1 I oversimplify here, since there are also some structures like: tu t’en souviens (you remember it, ti n’inveni).

Grammatical word-disambiguation again

The challenge is especially that of generalizing grammatical word-disambiguation to several languages. Creating a grammatical word-disambiguation module for each language appears to be a long and arduous task. This seems to be the main difficulty. But if a module specific to a given language can be generalized to several other languages, this could be an important advance in the field of rule-based machine translation (for which ‘machine translation that simulates human reasoning’ seems to me a more appropriate term).

We can describe the problem more precisely. We have about 100 grammatical categories for a given language. We also have about 300 ambiguous grammatical types – to fix ideas – which are: e.g., adverb or preposition, singular masculine noun or singular masculine adjective, etc. The problem is to describe an algorithm to remove the ambiguity and determine the corresponding grammatical type according to the context.

Now rewriting the complete module of disambiguation by grammatical type, so that it can be used and adapted to other languages (Italian in the first place). It remains to be seen if this can be done.

First steps in the Gallurese language

The translator takes its first steps in translating from French into the Gallurese language. The first tests show a score of 75-80%, with many errors in grammar, spelling and vocabulary. It will be necessary to reach a score of 90% before the result can be published.

The ideal would have been Italian-Gallurese translation, but this is not yet possible: it will be necessary to translate (i) Italian into French, then (ii) French into Gallurese.

Hinting at the Control problem

The question of choosing the best system to solve the problems posed by word disambiguation in the field of translation seems to be linked to the AGI control problem (how to prevent an AGI from eventually turning out to be harmful to its creators). It seems that when we have the choice between several methods to develop an AI, it is wiser to choose the one that allows better control of the AGI. As far as machine translation is concerned, we should thus prefer in this regard the method that emulates human reasoning, and that produces a response that can be broken down step by step into the reasoning that leads to it. This makes it possible not only to determine accurately the cause of an error, but also to remedy it. This problem does not only concern machine translation; its scope is somewhat wider, since grammatical disambiguation concerns not only machine translation but also the understanding of natural language and disambiguation according to context, even in the absence of any translation.

On the implementation of grammatical disambiguation

Grammatical disambiguation – e.g. deciding whether ‘maintenant’ is an adverb (now) or the present participle (maintaining) of the verb ‘maintenir’ – seems to be the crucial issue in choosing between the rule-based model and the statistical model for machine translation. This problem is widespread and seems to concern all languages. For the French language, grammatical disambiguation concerns about 1 word out of 7. Effective grammatical disambiguation is difficult to implement. The advantage of adopting the statistical method for grammatical disambiguation is that the same method can be generalized and used for several languages. In the case of the rule-based model, the grammatical disambiguation module must be rewritten for each language, which generates considerable complexity and requires a very significant development time. Therefore, a rule-based method for grammatical disambiguation that can be easily applied to several languages would be of great interest. This seems to be the main difficulty that rule-based machine translation has to overcome.

But if we want an artificial intelligence that does not merely provide a (mostly accurate) answer without being able to really explain its reasoning, but is truly able to emulate human reasoning and to justify and describe, step by step, the reasoning that leads to its answer, then it is worth the effort.
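
To make the idea of a step-by-step justification concrete, here is a toy sketch for the ‘maintenant’ example. The single rule used (the verbal reading is chosen after ‘en’, as in ‘en maintenant’) is an illustrative assumption, as are the tag names and the function name; a real module would need many more rules.

```python
def disambiguate_maintenant(tokens, index):
    """Return (tag, trace) for the token 'maintenant' at position index.

    The trace lists, in order, the reasoning steps behind the answer.
    """
    trace = [f"'{tokens[index]}' is ambiguous: ADVERB or PRESENT_PARTICIPLE"]
    previous = tokens[index - 1].lower() if index > 0 else None
    if previous == "en":
        trace.append("previous word is 'en': choose PRESENT_PARTICIPLE")
        return "PRESENT_PARTICIPLE", trace
    trace.append("no cue for the verbal reading: choose ADVERB")
    return "ADVERB", trace

tag, trace = disambiguate_maintenant("il part maintenant".split(), 2)
print(tag)
for step in trace:
    print(" -", step)
```

When the answer is wrong, the trace shows which rule fired, which is what makes the error both locatable and fixable.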

The 90% rule

The French-to-Gallurese translator is currently under development and undergoing testing. An Android application is planned first; it will be called ‘traducidori gaddhuresu’. It will only be published if its performance (evaluated by an open test) is above 90%. This is a rule that we apply to ourselves, and it is specific to endangered languages: we consider that, for them, a poor-quality translation can do more harm than good.

A “traducidori gaddhuresu” in preparation

After the Corsican language, the second endangered language for which we would like to develop a translator is the Gallurese language. For this ‘traducidori gaddhuresu’, we are considering an Android application and a Windows version.

The priority pair for Gallurese is Italian-Gallurese. However, it will not be possible to build an Italian-Gallurese translator straight away: a French-Gallurese translator is being prepared first. Initially, it will therefore be necessary to translate a text from Italian into French (in particular with DeepL, which is of very good quality), and then to use the French-Gallurese translator.