Tag Archives: rule-based machine translation

Dictionary = Corpus?

As far as machine translation is concerned, it seems that the best course is to combine the best of the two approaches, rule-based and statistics-based. If the two approaches could be made to converge, the benefit could be great. Let us try to define what could allow such a convergence, based on the two-sided grammatical approach, and illustrate it with an example.
To begin with, u soli sittimbrinu = ‘le soleil de septembre’ (the sun of September). In Corsican, sittimbrinu is a masculine singular adjective that means ‘de septembre’ (of September). In French, ‘de septembre’ is, from an analytic perspective, a preposition followed by a masculine singular common noun. But according to the two-sided analysis, ‘de septembre’ (of September) is also, from a synthetic perspective, a masculine singular adjective. This double nature of ‘de septembre’ is what allows it to be aligned with sittimbrinu.
More generally, if the dictionary defines words or groups of words according to the two-sided grammatical analysis, it also serves as an alignment tool, which can be used by a statistics-based translation system in the same way as a corpus. Thus, if it is sufficiently rich, the dictionary is also a corpus, and even more, an aligned corpus.
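
The idea can be illustrated with a minimal Python sketch, in which each dictionary entry carries both an analytic and a synthetic tag, and two entries align when their synthetic tags agree. The `Entry` class, the `aligns` function and the tag strings (`AQms`, `PREP+NCms`) are assumptions made for illustration, not the names used in any actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    lang: str        # language code, e.g. "fr" or "co"
    form: str        # word or group of words
    analytic: str    # tag from the analytic perspective
    synthetic: str   # tag from the synthetic perspective

# Hypothetical tags: AQms = masculine singular adjective,
# PREP+NCms = preposition followed by a masculine singular common noun.
FR = Entry("fr", "de septembre", analytic="PREP+NCms", synthetic="AQms")
CO = Entry("co", "sittimbrinu", analytic="AQms", synthetic="AQms")

def aligns(a: Entry, b: Entry) -> bool:
    """Two entries can be aligned when their synthetic tags agree."""
    return a.synthetic == b.synthetic

# A dictionary pairing such entries behaves like a small aligned corpus.
aligned_pairs = [(FR.form, CO.form)] if aligns(FR, CO) else []
print(aligned_pairs)
```

A purely analytic comparison would fail here, since `PREP+NCms` and `AQms` differ; it is the synthetic tag that makes the pair alignable.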

Grammatical word-disambiguation again and again

The main difficulty here seems to lie in the adaptation of the grammatical disambiguation module. Indeed, for the French language, such a module performs disambiguation with respect to about 100 categories. The number of pairs (or 3-tuples, 4-tuples, etc.) of disambiguation, for French, is about 250. The question is: when we change languages, how many categories of n-tuples of disambiguation does this result in? In particular, when one switches from French to Italian, does this result in a big change in the categories to be disambiguated?

Let’s take an example, with a particular category of words to disambiguate. One such category is for example AQfs/Vsing3present (feminine singular adjective or verb in the 3rd person singular present tense). A word in Italian that belongs to this type is ‘stanca’. So we have both uses:

  • ‘è stanca’ (she is tired): AQfs
  • ‘stanca il cavallo’ (it tires the horse): Vsing3present

In French, we don’t have this kind of disambiguation category directly, because the category concerned is broader: it includes at least the 1st person singular of the present. Thus we have the word ‘sèche’, which belongs to this type of disambiguation category:

  • ‘la feuille est sèche’ (the leaf is dry): AQfs
  • ‘je sèche mes cheveux’ (I dry my hair): Vsing1present
  • ‘il sèche sa chemise’ (he dries his shirt): Vsing3present

Of course, the code that allows the disambiguation of AQfs/Vsing1present/Vsing3present should also allow the derivation of the disambiguation of AQfs/Vsing3present. But this gives an idea of the kind of problems that arise and the adaptation needed.
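
To make the derivation concrete, here is a small Python sketch, under strong simplifying assumptions. Each rule proposes one tag when a crude context test succeeds, and a narrower ambiguity class is handled by consulting only the rules whose tag belongs to that class. The cue words and the `RULES` and `disambiguate` names are hypothetical; in practice the cues would be language-specific and far richer.

```python
# Each rule proposes one tag when its context test succeeds.
# The context tests below are deliberately crude, hypothetical cues.
RULES = [
    ("Vsing1present", lambda prev, nxt: prev in {"je", "j'"}),
    ("Vsing3present", lambda prev, nxt: prev in {"il", "elle", "on"}),
    ("AQfs",          lambda prev, nxt: prev in {"est", "è", "très"}),
]

def disambiguate(candidates, prev, nxt=None):
    """Return the first candidate tag whose rule fires, else None."""
    for tag, test in RULES:
        if tag in candidates and test(prev, nxt):
            return tag
    return None

FRENCH_CLASS = {"AQfs", "Vsing1present", "Vsing3present"}   # 'sèche'
ITALIAN_CLASS = {"AQfs", "Vsing3present"}                   # 'stanca'

print(disambiguate(FRENCH_CLASS, prev="je"))    # je sèche
print(disambiguate(FRENCH_CLASS, prev="est"))   # est sèche
print(disambiguate(ITALIAN_CLASS, prev="è"))    # è stanca
print(disambiguate(ITALIAN_CLASS, prev="je"))   # tag not in the class
```

The last call returns `None` because `Vsing1present` is not a member of the narrower class, so its rule is never consulted.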

If the types of disambiguation differ greatly from one language to another, the disambiguation module must be able to adapt to many new types of disambiguation, and must therefore be very flexible. This appears to be a considerable difficulty for the creation of an ecosystem. It seems that Apertium, faced with this difficulty, has chosen a statistical module as the solution for its ecosystem. However, whether such a flexible module, easily adaptable from one language to another, is feasible in the context of rule-based MT remains an open question.

Adjective modifiers again

We will consider again a category of words such as ‘very’, when they precede an adjective. Traditionally, this category is termed ‘adverbs’ or ‘adverbs of degree’, but we prefer ‘adjective modifier’, because (i) analytically, they change the meaning of an adjective and (ii) synthetically, an adjective modifier followed by an adjective is still an adjective. A more complete list is: almost, absolutely, badly, barely, completely, decidedly, deeply, enormously, entirely, extremely, fairly, fully, greatly, hardly, highly, how, incredibly, intensely, less, most, much, nearly, perfectly, positively, practically, pretty, purely, quite, rather, really, scarcely, simply, somewhat, strongly, terribly, thoroughly, totally, utterly, very, virtually, well.

If we look at sentences such as: il est bien content (he is very happy, hè beddu cuntenti), ils étaient bien contents (they were very happy, erani beddi cuntenti), elle serait bien contente (she would be very happy, saria bedda cuntenti), elles sont bien contentes (they are very happy, sò beddi cuntenti), we can see that the adjective modifier ‘bien’ is rendered as ‘very’ in English and, in Corsican, as:

  • bellu/beddu: singular masculine
  • belli/beddi: plural masculine
  • bella/bedda: feminine singular
  • belle/beddi: feminine plural

This shows that the adjective modifier is invariable in French and English, but varies in gender and number in Corsican. Thus, in Corsican grammar, it seems appropriate to distinguish between:

  • singular masculine adjective modifier
  • plural masculine adjective modifier
  • singular feminine adjective modifier
  • plural feminine adjective modifier

On the other hand, such a distinction does not seem useful in English and French, where the category of ‘adjective modifier’ is sufficient and there is no need for further detail.
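
As a rough illustration, the agreement can be written as a lookup table in Python. The forms are those listed above; the `render_modifier` function, its parameters and the `variant` index (0 for the first form of each pair, 1 for the second) are assumptions made for this sketch.

```python
# Forms of the Corsican adjective modifier, as listed above
# (two forms per gender/number combination).
CORSICAN_MODIFIER = {
    ("m", "sg"): ("bellu", "beddu"),
    ("m", "pl"): ("belli", "beddi"),
    ("f", "sg"): ("bella", "bedda"),
    ("f", "pl"): ("belle", "beddi"),
}

def render_modifier(target_lang, gender, number, variant=1):
    """Render French 'bien' used as an adjective modifier."""
    if target_lang == "en":
        return "very"          # invariable in English
    if target_lang == "co":
        return CORSICAN_MODIFIER[(gender, number)][variant]
    raise ValueError(f"unsupported language: {target_lang}")

print(render_modifier("co", "f", "sg"))  # elle serait bien contente
print(render_modifier("co", "m", "pl"))  # ils étaient bien contents
print(render_modifier("en", "f", "pl"))
```

The English branch ignores gender and number altogether, which mirrors the point that a single ‘adjective modifier’ category is enough there.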

Grammatical word-disambiguation again

The main challenge is that of generalizing grammatical word-disambiguation to several languages. Creating a grammatical word-disambiguation module for each language appears to be a long and arduous task. This seems to be the main difficulty. But if a module specific to a given language can be generalized to several other languages, this could be an important advance in the field of rule-based machine translation (‘machine translation that simulates human reasoning’ seems to me a more appropriate term).

We can describe the problem more precisely. We have about 100 grammatical categories for a given language. We also have about 300 ambiguous grammatical types – to fix ideas – which are: e.g., adverb or preposition, singular masculine noun or singular masculine adjective, etc. The problem is to describe an algorithm to remove the ambiguity and determine the corresponding grammatical type according to the context.

I am now rewriting the complete disambiguation module, organized by grammatical type, so that it can be used and adapted to other languages (Italian in the first place). It remains to be seen whether this can be done.

On the implementation of grammatical disambiguation

Grammatical disambiguation, i.e. determining whether ‘maintenant’ is an adverb (now) or the present participle (maintaining) of the verb ‘maintenir’, seems to be the crucial issue in choosing between the rule-based model and the statistical model for machine translation. This problem is widespread and seems to concern all languages. For French, grammatical disambiguation concerns about 1 word out of 7. Effective grammatical disambiguation is difficult to implement. The advantage of the statistical method is that the same method can be generalized and used for several languages. In the rule-based model, the grammatical disambiguation module must be rewritten for each language, which generates considerable complexity and requires very significant development time. Therefore, a rule-based method for grammatical disambiguation that could easily be applied to several languages would be of great interest. This seems to be the main difficulty that rule-based machine translation has to overcome.

But if we want an artificial intelligence that does not merely provide a (mostly accurate) answer without really being able to explain its reasoning, but is truly able to emulate human reasoning and to justify and describe step by step the reasoning that leads to its answer, then it is worth the effort.

The status of adjective modifiers

What is the status of adjective modifiers (tant, tout juste, un rien, un tantinet, très, extrêmement, … = so much, just barely, a touch, a tad, very, extremely, …) in the present grammatical typology? Adjectives are defined as noun modifiers. So adjective modifiers would be modifiers of noun modifiers? This sounds intriguing. In reality, we do not have the concept of ‘modifiers of modifiers’. In fact, we have the following rules:

  • a verb modifier followed by a verb is a verb
  • a determiner modifier followed by a determiner is a determiner
  • and generally speaking, a modifier of an X followed by an X is an X (where X is a given grammatical type)

So a noun modifier followed by a noun is a noun, i.e. an adjective followed by a noun is a noun. For example: ‘un très beau livre’ (a very nice book), where ‘very’ is an adjective modifier, ‘nice’ is an adjective, i.e. a noun modifier, and ‘book’ is a noun.

Hence finally, ‘an adjective modifier is a modifier of a noun modifier’ reads as follows: an adjective modifier is a modifier of [noun modifier].
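
The general rule lends itself to a short Python sketch. A type is represented either as a base symbol or as `("MOD", X)` for ‘modifier of X’, and the rule is applied repeatedly from left to right. This representation, and the names `mod` and `reduce_types`, are assumptions made for illustration.

```python
def mod(x):
    """Type of a modifier of x."""
    return ("MOD", x)

NOUN = "N"
ADJ = mod(NOUN)        # an adjective is a noun modifier
ADJ_MOD = mod(ADJ)     # an adjective modifier modifies a [noun modifier]

def reduce_types(types):
    """Apply 'a modifier of X followed by X is X' until nothing changes."""
    types = list(types)
    changed = True
    while changed:
        changed = False
        for i in range(len(types) - 1):
            if types[i] == mod(types[i + 1]):
                del types[i]          # keep the X, drop its modifier
                changed = True
                break
    return types

# 'très beau livre': adjective modifier, adjective, noun
print(reduce_types([ADJ_MOD, ADJ, NOUN]) == [NOUN])
```

In this sketch ‘très beau’ first reduces to an adjective, and that adjective followed by ‘livre’ then reduces to a noun, so no separate notion of ‘modifier of modifier’ is needed.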

Why it’s worth it to engage in rule-based translation

Rule-based translation is difficult to implement. The main difficulty encountered is taking into account the groups of words, so as to be on a par with statistics-based translation. The main problems in this regard are (i) polymorphic disambiguation; and (ii) building a fair typology of grammatical types. But once these steps begin to be mastered, there are many advantages. What seems essential here is that with the same piece of software, both machine translation and text analysis can be carried out. Among the modules that are easy to implement are the following:

  • lemmatizer
  • part-of-speech tagger
  • singularizer
  • pluralizer
  • grammar checker
  • type extractor: a module that allows you to extract words from a text according to their grammatical category

The reason is that implementing rule-based translation provides the machine with some inherent understanding of the text, much as a human being has. To put it in a nutshell, it is better artificial intelligence.

Finally, other modules, more advanced, seem possible (to be confirmed).
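
To give an idea of why such modules come almost for free once tagging is available, here is a minimal Python sketch of the type extractor mentioned in the list above. The tag names and the `extract_by_type` function are hypothetical; the tagged text stands in for the output of a part-of-speech tagger.

```python
# A tagged text is a list of (word, tag) pairs, as a part-of-speech
# tagger would produce. The tags below are hypothetical.
tagged = [
    ("la", "DETfs"), ("feuille", "NCfs"), ("est", "Vsing3present"),
    ("sèche", "AQfs"), ("et", "CONJ"), ("légère", "AQfs"),
]

def extract_by_type(tagged_text, wanted_tag):
    """Return the words of the text that carry the wanted tag."""
    return [word for word, tag in tagged_text if tag == wanted_tag]

print(extract_by_type(tagged, "AQfs"))
```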

Reflections on grammatical typologies

It is useful to point out the differences that may exist between different grammatical typologies. The classical grammatical taxonomy is essentially aimed at teaching and comprehension. It therefore has a pedagogical purpose. On the other hand, the taxonomy that is useful for rule-based machine translation has a different purpose: it aims essentially at allowing disambiguation, both grammatically and semantically, because ambiguity is a fundamental and very common problem in this particular context. Such a typology essentially focuses on the location of word types, on the structures encountered in the sentence. This explains why typologies can be different, as they have different goals and purposes.

Analyzing relative pronouns

What is the status of the ‘relative pronouns’ of classical grammar within the present conceptual framework? Traditionally, a distinction is made between simple relative pronouns (qui, que, dont, où; who, that, whose, where) and compound relative pronouns (à qui, pour lesquelles, à côté duquel, etc.; to whom, for whom, beside whom, etc.). If we look first at simple relative pronouns, the category does not seem satisfactory, in particular because of the presence of both ‘qui’ (who) and ‘que’ (that), whose grammatical roles appear, in the present context, to be quite different. Consider the two short sentences ‘la maison que j’habite est grande’ and ‘l’homme qui parle est grand’ (the house I live in is big; the man who speaks is tall). As these two examples illustrate, the structures following ‘que’ and ‘qui’ are different. Here, ‘que’ is followed by a personal pronoun and a conjugated verb (‘j’habite’: I live), whereas ‘qui’ is followed directly by a conjugated verb (‘parle’: speaks). From our present perspective, these are inherently different structures. It turns out that ‘dont’ and ‘où’ admit the same type of structure as ‘que’. Thus the homogeneous category, from our point of view, is formed by ‘que’, ‘dont’ and ‘où’, but not by ‘qui’. If we extend this analysis to other words, searching for those that could fit into this category, we also find: ‘duquel’ (= de lequel; from which), ‘de laquelle’, ‘desquels’ (= de lesquels; from which), ‘desquelles’ (= de lesquelles; from which), ‘auquel’ (= à lequel), ‘à laquelle’, ‘auxquels’ (= à lesquels), ‘auxquelles’ (= à lesquelles). We also have all the forms of the same type built from a preposition other than ‘de’ or ‘à’: ‘sur lequel’, ‘sur laquelle’, …, ‘par lequel’, ‘par laquelle’, ‘avec lequel’, etc. The classical compound relative pronouns such as ‘à qui’, ‘pour lesquelles’, ‘à côté duquel’, etc. (to whom, for whom, beside whom, etc.) also fit naturally into this category.
But from the point of view of two-sided grammar, ‘à l’aide duquel’, ‘au moyen de laquelle’, ‘à la suite de quoi’, ‘à l’aide de qui’, etc. (with the help of which, by means of which, as a result of which, with the help of whom, etc.) also belong to this category. (to be continued)
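
The members of this category built on ‘lequel’ can be generated mechanically, as the following Python sketch suggests. It relies only on the standard French contractions of ‘de’ and ‘à’ with ‘lequel’, ‘lesquels’ and ‘lesquelles’; the function name `relative_form` and its parameters are assumptions made for illustration.

```python
# Forms of 'lequel' by (gender, number).
LEQUEL = {("m", "sg"): "lequel", ("f", "sg"): "laquelle",
          ("m", "pl"): "lesquels", ("f", "pl"): "lesquelles"}

# French contracts 'de' and 'à' with le/les, hence duquel, auxquels, etc.
CONTRACTIONS = {
    ("de", "lequel"): "duquel", ("de", "lesquels"): "desquels",
    ("de", "lesquelles"): "desquelles",
    ("à", "lequel"): "auquel", ("à", "lesquels"): "auxquels",
    ("à", "lesquelles"): "auxquelles",
}

def relative_form(preposition, gender, number):
    """Build 'preposition + lequel' with the right contraction."""
    base = LEQUEL[(gender, number)]
    last = preposition.split()[-1]      # 'à l'aide de' ends in 'de'
    contracted = CONTRACTIONS.get((last, base))
    if contracted is None:
        return f"{preposition} {base}"
    head = preposition.split()[:-1]
    return " ".join(head + [contracted])

print(relative_form("de", "m", "sg"))
print(relative_form("à", "f", "sg"))
print(relative_form("sur", "m", "pl"))
print(relative_form("à l'aide de", "m", "sg"))
print(relative_form("au moyen de", "f", "sg"))
```

Because only the last word of the preposition is inspected, prepositional locutions such as ‘à l’aide de’ are handled by the same code path as the bare preposition ‘de’.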

Word sense disambiguation: a hard case

Let us consider a hard case for word sense disambiguation in the context of French to Corsican MT; the same goes for French to English MT. It relates to French words such as ‘accomplit’, ‘affaiblit’, ‘affranchit’, ‘alourdit’, ‘amortit’. The corresponding verbs ‘accomplir’ (to fulfill, to accomplish), ‘affaiblir’ (to weaken), ‘affranchir’ (to free), ‘alourdir’ (to burden), ‘amortir’ (to damp) have the same form for the simple present and the simple past in the third person singular: respectively ‘accomplit’, ‘affaiblit’, ‘affranchit’, ‘alourdit’, ‘amortit’. The upshot is that a single sentence such as ‘Il affaiblit sa position.’ can be translated either as ‘he weakens his position’ or as ‘he weakened his position’. If the context is unambiguous with regard to the sense of the discourse, the correct tense can be chosen. But in the absence of informative context, it would be appropriate to let the ambiguity stand.

It should be pointed out that such verbs are not rare. A more complete list includes: accomplit, affaiblit, affranchit, alourdit, amortit, anéantit, anoblit, aplatit, arrondit, assombrit, bannit, bâtit, blanchit, blondit, démolit, éblouit, emplit, enfouit, enhardit, enlaidit, ennoblit, envahit, épaissit, étourdit, exclut, franchit, glapit, investit, jaunit, jouit, munit, noircit, obéit, obscurcit, occit, périt, réagit, régit, réjouit, remplit, répartit, resplendit, rétrécit, rit, rougit, rouvrit, saisit, sévit, surgit.
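
One way to ‘let the ambiguity stand’ is to return both readings whenever the tense is unknown. The Python sketch below assumes a hypothetical `translate_verb` function and a small table of English renderings; how the tense would be inferred from context is left out entirely.

```python
# Third person singular forms shared by the present and the simple past
# (a few items from the list above), with their English renderings.
AMBIGUOUS = {
    "affaiblit": ("weakens", "weakened"),
    "accomplit": ("accomplishes", "accomplished"),
    "alourdit": ("burdens", "burdened"),
}

def translate_verb(form, tense=None):
    """Return one rendering if the tense is known, else both readings."""
    present, past = AMBIGUOUS[form]
    if tense == "present":
        return [present]
    if tense == "past":
        return [past]
    return [present, past]     # no informative context: keep both

print(translate_verb("affaiblit", tense="past"))
print(translate_verb("affaiblit"))
```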


What is required from Artificial General Intelligence with regard to Machine Translation?

Illustration from www.pixabay.com

In a series of posts, we will try to define what is required of an AGI (Artificial General Intelligence) in order to reach the level of superintelligence in MT (machine translation). (All this is highly speculative, but we shall give it a try.)
One of the difficulties that arise in machine translation relates to the translation of expressions. This leads us to mention one of the required skills of a superintelligence: the ability to identify an expression within a text in a given language and then to translate it into another language. Let us mention that expressions are of different types: verbal, nominal, adjectival, adverbial, … To fix ideas, we can focus here on verbal expressions. Take, for example, the French expression ‘couper les cheveux en quatre’ (literally, cut the hairs in four, i.e. to split hairs), which translates into Corsican as either castrà i falchetti (literally, to chastise the hawks) or castrà i cucchi (literally, to chastise the cuckoos). In order to properly translate such an expression, a superintelligence must be able to:

  • identify ‘couper les cheveux en quatre’ as a verbal expression in a French corpus
  • identify castrà i falchetti as a verbal expression within a Corsican corpus
  • associate the two expressions as the proper translation of each other

It appears here that such an aptitude falls under the scope of AGI (Artificial general intelligence).
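
For contrast, here is what the easy part looks like once the expression pair is already known: a Python sketch that spots a listed verbal expression in a sentence by matching lemma sequences. The toy lemma table, the `EXPRESSIONS` dictionary and the `find_expressions` function are illustrative stand-ins. The sketch only performs lookup; it does not address the harder task described above, namely discovering the expressions in corpora and associating them.

```python
# Toy lemma table and expression dictionary; both are illustrative
# stand-ins for the lemmatizer and lexicon a real system would need.
LEMMAS = {"coupe": "couper", "coupait": "couper", "il": "il",
          "les": "les", "cheveux": "cheveux", "en": "en",
          "quatre": "quatre"}
EXPRESSIONS = {
    ("couper", "les", "cheveux", "en", "quatre"): "castrà i falchetti",
}

def find_expressions(sentence):
    """Return (start index, Corsican expression) for each match."""
    lemmas = [LEMMAS.get(w, w) for w in sentence.lower().split()]
    hits = []
    for start in range(len(lemmas)):
        for key, target in EXPRESSIONS.items():
            if tuple(lemmas[start:start + len(key)]) == key:
                hits.append((start, target))
    return hits

print(find_expressions("Il coupe les cheveux en quatre"))
```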

Superintelligent machine translation (updated)

Illustration from pixabay.com

Let us consider superintelligence with regard to machine translation. To fix ideas, we can propose a rough definition: it consists of a machine with the ability to translate with 99% (or above) accuracy from any one of the 8000 languages to another. It seems relevant here to mention the present 8000 human languages, including some 4000 or 5000 languages which are at risk of extinction before the end of the 21st century. The definition could also usefully include some extinct languages which are reasonably well described and meet the conditions for building rule-based translation. But arguably, this definition needs some additional criteria. The most important appears to be the ability to self-improve its performance. In practice, this could be done by reading or hearing texts. The superintelligent translation machine should be able to acquire new vocabulary from what it reads or hears: not only words, but also locutions (noun locutions, adjective locutions, adverbial locutions, verbal locutions, etc.). It should also be able to acquire new sentence structures from its readings and enrich its database of grammatical sentence structures. It should also be able to grow its database of word meanings for ambiguous words and instantly build the associated disambiguation rules. In addition, it should be capable of detecting and implementing specific grammatical structures.
It seems superintelligence will be reached when the superintelligent translation machine is able to perform all this without any human help.

Also relevant in this discussion is the fact, previously argued, that rule-based translation is better suited to translating endangered languages than statistic-based translation. Why? Because large-scale corpora do not exist for endangered languages. From the above definition of superintelligent machine translation (SMT), it follows that rule-based translation is also best suited to SMT, since SMT massively includes endangered languages (but arguably, statistic-based MT could still be used for translating major languages into one another).

Let us speculate now on how this path to superintelligent translation will be achieved. We can mention here:

  • a quantitative scenario: (i) acquire, first, an ability to translate very accurately, say, 100 languages; (ii) develop, second, the ability to self-improve; (iii) extend, third, the translation ability to the whole set of 8000 human languages.
  • alternatively, there could be a qualitative scenario: (i) acquire, first, an ability to translate somewhat accurately the 8000 languages (the accuracy could vary from language to language, especially with rare endangered languages); (ii) suggest, second, improvements to vocabulary, locutions, sentence structures, disambiguation rules, etc. that are verified and validated by humans; (iii) acquire, third, the ability to self-improve by reading texts or hearing conversations.
  • it is worth mentioning a third alternative that would consist of a hybrid scenario, i.e. a mix of quantitative and qualitative improvements. It will be our preferred scenario.

But we should provide more details on how these steps could be achieved. To fix ideas, let us focus on the word self-improvement module: it allows the superintelligent machine translation to extend its vocabulary in any language. This could be accomplished by reading or hearing new texts in any language. When facing a new word, the superintelligent machine translation (SMT, for short) should be able to translate it instantly into the 8000 other languages and add it to its vocabulary database.

To give another example, another module would be the locution self-improvement module: it allows the superintelligent machine translation to extend its locution knowledge in any language.

Also relevant to this topic is the following question: could SMT be achieved without AGI (general AI)? We shall address this question later.

 

Is rule-based MT more ethical than statistical MT?

In the ongoing debate on safe AI, whether rule-based MT is more ethical than statistical MT is a relevant open question. Here are some arguments in favor of rule-based MT in this context (without blaming statistical MT, which has its own strengths):

  • it emulates human reasoning: it translates a text just as a human would do
  • there is much control over rule-based MT, since the resulting translated text can be traced back: a detailed step-by-step account of the translation process can be provided if required
  • rule-based MT can consistently be integrated into a wider project of brain emulation, which emulates general human reasoning

Semantic disambiguation of French ‘femme’: in the mud, gold is still shining

In Corsican, the French word ‘femme’ can be translated, depending on the context:

  • either into donna (woman)
  • or into moglia (wife)

The above sample still contains a lot of vocabulary and grammatical disambiguation errors (easy/medium difficulty), but it handles successfully the semantic disambiguation (hard) of ‘femme’, two instances of which are properly translated into moglia (wife). As the Corsican proverb says, in a cianga l’oru luci sempri (in the mud, gold is still shining).

French samples are from the French corpora of the University of Leipzig.
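
As an illustration of what a context rule for ‘femme’ might look like, here is a Python sketch. The cue it uses, a possessive determiner immediately before ‘femme’, is an assumption made for this example and not a description of the rules actually used by the software; the `translate_femme` function is likewise hypothetical.

```python
# Assumed cue (not taken from the original system): a possessive
# determiner right before 'femme' suggests the 'wife' reading.
POSSESSIVES = {"ma", "ta", "sa", "notre", "votre", "leur"}

def translate_femme(tokens, index):
    """Translate tokens[index] == 'femme' into Corsican."""
    previous = tokens[index - 1].lower() if index > 0 else ""
    return "moglia" if previous in POSSESSIVES else "donna"

s1 = "il est venu avec sa femme".split()
s2 = "une femme parle".split()
print(translate_femme(s1, s1.index("femme")))
print(translate_femme(s2, s2.index("femme")))
```

A single cue of this kind will misfire on many sentences, which is precisely why this disambiguation counts as hard.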

What are the conditions for a given endangered language to be a candidate for rule-based machine translation?

For a given endangered language to be a candidate for rule-based machine translation, some requirements must be met. Notably, there is a need for:

  • a dictionary: some specialized lexicons are useful too
  • a list of locutions and their translations: to be more accurate, what is needed are noun locutions, adjective locutions, adverbial locutions and verbal locutions, together with their translations into another language.
  • a detailed grammar (in any language): ideally, the grammar should be very detailed, mentioning notably irregular verbs, noun plurals, etc. Subjunctive and conditional tenses must also be accurately described.
  • in addition, elision rules, euphony rules, should also be described.
  • most importantly: a description of the main variants of the language and their differences. This is needed to handle what we can call the ‘variant problem’ (we shall say a bit more about this later): as an effect of diversity, endangered languages are often polynomic and come with variants. But translation must be coherent and a mix of several variants is not acceptable as a translation.
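
The coherence requirement of the ‘variant problem’ can be sketched in Python as follows. The sketch assumes, purely for illustration, that the first and second forms of pairs such as bellu/beddu belong to two different variants, labelled ‘A’ and ‘B’; the lexicon keys and the functions `render` and `is_coherent` are hypothetical.

```python
# Hypothetical lexicon: each entry maps to one form per variant.
# The variant labels 'A' and 'B' are placeholders.
LEXICON = {
    "bien+ms": {"A": "bellu", "B": "beddu"},
    "bien+mp": {"A": "belli", "B": "beddi"},
    "bien+fs": {"A": "bella", "B": "bedda"},
}

def render(keys, variant):
    """Render every entry in the same variant, never a mix."""
    return [LEXICON[key][variant] for key in keys]

def is_coherent(keys, words):
    """True if all words come from one and the same variant."""
    return any(render(keys, v) == list(words) for v in ("A", "B"))

keys = ["bien+ms", "bien+fs"]
print(render(keys, "B"))
print(is_coherent(keys, ["bellu", "bedda"]))
```

Choosing the variant once, at the level of the whole translation, is what rules out mixed output such as ‘bellu’ next to ‘bedda’.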

Let us mention that endangered languages are commonly associated with another language, the two being in a diglossic relationship with one another. To take an example, the Corsican language is associated with French, so the relevant pair is French-Corsican. If we consider the Sardinian Gallurese language (‘gaddhuresu’), the relevant pair is Italian-Gallurese. Other relevant pairs are:

  • Italian-Sassarese
  • Italian-Sicilian
  • Italian-Venetian

How rule-based and statistical machine translation can help each other

Here are a few suggestions on how rule-based and statistical machine translation can help each other.

To begin with, rule-based and statistical machine translation are often contrasted and compared, but it would be oversimplifying to conclude that one is better than the other. From a more objective standpoint, each method has its strengths and weaknesses. Let us investigate how they could be made to collaborate so as to add up their respective strengths.

In the case of an endangered language, the lack of good-quality corpora has been pointed out. One way for rule-based and statistical machine translation to collaborate would be to use rule-based translation to build a better-quality corpus for statistical machine translation.

Suppose we begin with statistical machine translation software that performs at 50% on average for French to Corsican translation.

Let us sketch the process of creating these better corpora, taking the example of the French-Corsican diglossic pair (the Corsican language being considered by Unesco as a definitely endangered language). At present we lack a quality French-Corsican corpus or, to put it more accurately, the corpus at our disposal is a low-quality one. The idea would be to use rule-based machine translation to create a much better corpus for use with statistical machine translation.

Let us now sketch the different steps of this collaborative process: (i) create a French-Corsican corpus with the help of rule-based machine translation: if the software performs at 90% on average, then the corpus would be on average 90% reliable. With appropriate training, statistical MT should now perform at, say, 80% on average (to be compared with the previous 50%).
(ii) From this French-Corsican corpus, other corpus pairs can be created, such as Italian-Corsican, English-Corsican, etc., since French-Italian, English-Italian, etc. corpora of excellent quality already exist. The performance gain should then extend to other language pairs such as Italian-Corsican, English-Corsican, etc.

With the help of this process, we are finally in a position to combine and add up the strengths of the two complementary approaches to MT: on the one hand, rule-based MT is able to translate with good accuracy even in the absence of corpora; on the other hand, statistical machine translation is able to handle a great many language pairs successfully and quickly. To sum up, as the Corsican proverb says: una mani lava l’altra (one hand washes the other).
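
Step (ii) amounts to pivoting through French. A minimal Python sketch, assuming toy corpora stored as lists of sentence pairs (the sentences, the variable names and the `pivot` function are illustrative placeholders), could look like this:

```python
# Toy aligned corpora as lists of sentence pairs; the sentences and
# their translations are illustrative placeholders.
fr_co = [("le soleil de septembre", "u soli sittimbrinu")]
fr_it = [("le soleil de septembre", "il sole di settembre"),
         ("la feuille est sèche", "la foglia è secca")]

def pivot(source_a, source_b):
    """Join two corpora on their shared French side."""
    b_by_french = dict(source_b)
    return [(b_by_french[fr], a_side)
            for fr, a_side in source_a if fr in b_by_french]

it_co = pivot(fr_co, fr_it)
print(it_co)
```

Only French sentences present in both corpora yield a new pair, so the size of the derived Italian-Corsican corpus is bounded by the overlap between the two French sides.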