Tag Archives: Corsican language

Further reflexions on the status of “I love you” in Corsican language

Let us briefly recall the problem: translating ‘I love you’ might sound trivial, but it is not. In fact, ‘ti amu’ is not the best translation. The best translation is ‘ti tengu caru’ when addressed to a male person, or ‘ti tengu cara’ when addressed to a female person. Hence the proposed preliminary translation ‘ti tengu caru/cara’. Such a rough translation requires further disambiguation, but on what precise grounds?

Let us look at the issue from an analytical perspective. It appears that we need to assign a reference to the pronoun ‘te’ (you, ti). The latter could be identified according to the context, depending on whether the person ‘te’ refers to is male or female. At this stage, it appears that it is better to consider that the personal object pronoun has an inherent gender: masculine or feminine. This gender does not affect the pronoun itself which remains ‘te’ (you, ti) independently of the gender, but it does have an effect on the words that depend on it, i.e. the adjective caru/cara in Corsican, in the locution ti tengu caru/cara. The upshot is: in this case, ‘te’ (you, ti) is a personal object pronoun, masculine or feminine, whose inherent ambiguity can be solved according to the context.

A specific kind of superlative

Let us consider a specific kind of superlative. This form, specific to the Corsican language, is notably mentioned by the grammarian and author Santu Casta in his Punteghju, where he recommends the following translation of “C’était le village le plus riche du canton” (It was the richest village of the canton): Era u più paese riccu di stu cantone (pages 26 & 54-55). The structure is original in the sense that the comparative (più) precedes the noun (paese, village), which in turn precedes the adjective (riccu, rich).

Anagrams in Corsican language

Here are some anagrams in Corsican language:

  • Corscia è Corsica
  • Marta è Matra
  • accanitu è uccitana
  • acciliratu, ricciulata è riciculata
  • accirtà è traccià
  • accirtatu, catarticu è tracciatu
  • adriatica è cadariati
  • anacrunisimu è cunsumariani
  • aprarà è pararà
  • arba è bara
  • attaccu è tuccata
  • aumintà è umanità
  • armunizà è rumanizà
  • basatu è sabatu
  • battariu è urbitata
  • calculatu è cullucata
  • cadastrali è riscaldata
  • camaratu è racamatu
  • ciandarmu è ricumanda
  • candidatu è incuddata
  • chjinà è nichjà
  • dicimà è midicà
  • fascià è fiascà
  • qualificativa è qualificavati
  • marchjariani è richjamarani
  • participativa è participavati
  • lavarà è valarà
  • cunsidarà è sicundarà
  • cuntrà è truncà
  • carrià è criarà
  • carità è citarà
  • ilarità è rialità
  • limità è milità
  • neru è renu
  • rinvià è vinarà
  • rinviarà, rivinarà è vinirarà
  • muralità, mutilarà è ultimarà
  • pisà è spià
  • pricisà è ripiscà
  • pristà è stirpà
  • ramintà è tarminà
  • ricciulà è riciculà
  • sacramentu è stancaremu
  • staccariu è sucratica
  • svià è visà
  • pristarà, stirparà è straripà
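
Two words are anagrams when they are made of exactly the same letters. A minimal sketch of how such groups can be detected is given here in Python; the function names are illustrative only, and treating accented letters (à, è) as distinct from their unaccented counterparts is an assumption of the sketch.

```python
from collections import defaultdict


def signature(word: str) -> str:
    """Return the sorted lowercase letters of a word (its anagram key)."""
    return "".join(sorted(word.lower()))


def group_anagrams(words):
    """Group words that share the same letters; keep groups of two or more."""
    groups = defaultdict(list)
    for word in words:
        groups[signature(word)].append(word)
    return [group for group in groups.values() if len(group) > 1]


# A few entries from the list above, plus one word without a partner.
sample = ["arba", "bara", "neru", "renu", "pisà", "spià", "sabatu"]
print(group_anagrams(sample))
```

Run over a full word list, the same grouping would also surface three-word groups such as rinviarà, rivinarà and vinirarà.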

Rough typology of remaining errors (updated March 2018)

French to Corsican: performance on the French Wikipedia sample test currently averages 94%. Below is a rough typology of the remaining errors (an average score of 95% on the open test should presumably be attainable by correcting the errors tagged ‘easy’):

  • unknown vocabulary: 40% (easy)
  • basic disambiguation: 25%  (easy or medium difficulty)
  • false positives: 5% (medium difficulty or hard). This type of error is mostly related to proper nouns, i.e. English terms that should remain untranslated. For example, ‘North American Aviation’ is erroneously translated as ‘North American Aviazione’, whereas ‘Aviation’ should remain untranslated.
  • inadequate locution: 10% (medium difficulty or hard)
  • anaphora resolution related to complex sentence’s structure: 5% (hard)
  • semantic disambiguation: 5% (hard). For example, disambiguating French ‘échecs’ = fiaschi/scacchi (failures/chess)
  • erroneous agreement related to gender mismatch from French to Corsican, i.e. (i) words that are masculine in French and feminine in Corsican; and (ii) words that are feminine in French and masculine in Corsican: 1% (medium difficulty).
  • erroneous agreement related to number mismatch from French to Corsican, i.e. (i) words that are singular in French and plural in Corsican; and (ii) words that are plural in French and singular in Corsican (for example, French ‘la canicule’ translates into ‘i sulleoni’ in Corsican): 1% (medium difficulty).
  • specific grammatical case: 2% (hard)
  • anaphora resolution associated with gender or number mismatch: 1% (hard)
  • unknown, unclassified: 6% (hard)

A Special Case of Anaphora Resolution

After improper anaphora resolution

Anaphora resolution usually refers to pronouns. But we face here a special case of anaphora resolution that relates to an adjective. The following sentence: ‘un vase de Chine authentique’ (an authentic vase of China) is translated erroneously as un vasu di China autentica, due to erroneous anaphora resolution. In this sample, the adjective ‘authentique’ refers to ‘vase’ (English: vase) and not to ‘Chine’ (China).

The same goes for ‘une chanson du Portugal mythique’, where ‘mythique’ refers to ‘chanson’ and not to ‘Portugal’.

After appropriate anaphora resolution

Solving fivefold ambiguity: translation for French ‘poste’

The French word ‘poste’ is (at least) five-way ambiguous. It can designate:

  • ‘poste’ (masculine singular noun): postu, masculine singular noun (set, i.e. television set)
  • ‘poste’ (masculine singular noun): posta, feminine singular noun (position); in the present case it is erroneously translated as postu and should read a so posta
  • ‘poste’ (feminine singular noun): posta, feminine singular noun (post office)
  • ‘poste’: impostu (first person singular of the verb impustà, ‘poster’, to station oneself)
  • ‘poste’: imposta (third person singular of the verb impustà, ‘poster’, to station oneself)

(However, it is more complex than that, since there is another sense of the verb ‘poster’: to post/to mail.)
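
The five readings listed above can be stored as a small table and filtered by whatever grammatical features are already known. The following Python sketch is only an illustration: the tag names ("noun", "verb", "m", "f", "1sg", "3sg") and the function `candidates` are assumptions, not the translator's actual data model.

```python
# Candidate Corsican translations of French 'poste', copied from the list above.
# The feature names and tag values are illustrative labels, not a real tagset.
POSTE_READINGS = [
    {"pos": "noun", "fr_gender": "m", "sense": "set", "corsican": "postu"},
    {"pos": "noun", "fr_gender": "m", "sense": "position", "corsican": "posta"},
    {"pos": "noun", "fr_gender": "f", "sense": "post office", "corsican": "posta"},
    {"pos": "verb", "person": "1sg", "sense": "station", "corsican": "impostu"},
    {"pos": "verb", "person": "3sg", "sense": "station", "corsican": "imposta"},
]


def candidates(**features):
    """Return the Corsican candidates compatible with the known features."""
    return [
        reading["corsican"]
        for reading in POSTE_READINGS
        if all(reading.get(key) == value for key, value in features.items())
    ]


print(candidates(pos="noun", fr_gender="m"))
print(candidates(pos="verb", person="3sg"))
```

Note that knowing ‘poste’ is a masculine noun still leaves two candidates (postu and posta): grammar narrows the choice, but the last step requires the sense.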

Chemistry: translating acid names


Translating this series of acid names is not as easy as it might seem at first glance. In effect, each acid name is composed of three consecutive ambiguous words:

  • ‘l’ is ambiguous between the masculine (u, the) or feminine (a, the) definite article
  • ‘acide’ is ambiguous between acidu (acid, masculine singular noun), acitu (acid, masculine singular adjective) or acita (acid, feminine singular adjective)
  • ‘daturique’, etc. are all ambiguous since they can be either masculine singular (daturicu, daturic) or feminine singular (daturica, daturic) adjectives.
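
The size of the problem can be made concrete by enumerating the combinations. The Python sketch below uses only the forms quoted above; filtering on gender agreement is an assumption about how a first pass might work, and the elision of the article before a vowel is left out.

```python
from itertools import product

# (Corsican form, gender) for each of the three ambiguous French words.
ARTICLE = [("u", "m"), ("a", "f")]
ACIDE = [("acidu", "m"), ("acitu", "m"), ("acita", "f")]
DATURIQUE = [("daturicu", "m"), ("daturica", "f")]


def enumerate_readings():
    """Return all combinations, and those whose three words share one gender."""
    combos = list(product(ARTICLE, ACIDE, DATURIQUE))
    agreeing = [c for c in combos if len({gender for _, gender in c}) == 1]
    return combos, agreeing


combos, agreeing = enumerate_readings()
print(len(combos), "combinations,", len(agreeing), "with consistent gender")
for combo in agreeing:
    print([form for form, _ in combo])
```

Gender agreement alone reduces the twelve combinations to three; choosing among these still requires recognising that ‘acide’ is the head noun of the phrase.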

Interesting case of first name disambiguation

Here is an interesting case of first name disambiguation for machine translation. Consider the following first name ‘Camille’. It can apply to both genders. In Corsican (taravese or sartinese variants) it translates either into Cameddu (masculine) or Camedda (feminine). In some cases, the corresponding disambiguation relies on mere grammatical grounds. For example, ‘Camille était beau’ translates into Cameddu era beddu (Camille was beautiful), on grammatical grounds alone. The same goes for ‘Camille était belle’, that translates straightforwardly into Camedda era bedda (Camille was beautiful), according to the adjective gender.

Now the related disambiguation can turn into a hard case, relying only on semantic context. Hence, ‘Camille était pacifique’ can translate either into Cameddu era pacificu or into Camedda era pacifica, depending on the context (which can be text or even an image…). In effect, it cannot be translated merely on grammatical grounds, since ‘pacifique’ (pacific) is gender-ambiguous: it can translate either into pacificu or pacifica.

The same goes for the French first name ‘Dominique’ (Dominic), which translates either into ‘Dumenicu’ (masculine) or ‘Dumenica’ (feminine). Hence, ‘Dominique était pacifique’ (Dominic was pacific) can translate either into ‘Dumenicu era pacificu’ or into ‘Dumenica era pacifica’, depending on the context.
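
The easy and hard cases can be contrasted in a few lines of Python. This is a sketch under stated assumptions: the two small tables contain only the forms quoted above, and the `context_gender` argument is a hypothetical stand-in for whatever semantic analysis (text or image) supplies the gender.

```python
# Gendered Corsican forms taken from the examples above: (masculine, feminine).
FIRST_NAMES = {
    "Camille": ("Cameddu", "Camedda"),
    "Dominique": ("Dumenicu", "Dumenica"),
}
# French adjective -> (gender shown by the French form, masc. form, fem. form).
# None means the French form does not reveal the gender.
ADJECTIVES = {
    "beau": ("m", "beddu", "bedda"),
    "belle": ("f", "beddu", "bedda"),
    "pacifique": (None, "pacificu", "pacifica"),
}


def translate(name, adjective, context_gender=None):
    """Translate '<name> était <adjective>'; return every reading still possible."""
    masc_name, fem_name = FIRST_NAMES[name]
    gender, masc_adj, fem_adj = ADJECTIVES[adjective]
    gender = gender or context_gender  # fall back on context if grammar is silent
    readings = []
    if gender in ("m", None):
        readings.append(f"{masc_name} era {masc_adj}")
    if gender in ("f", None):
        readings.append(f"{fem_name} era {fem_adj}")
    return readings


print(translate("Camille", "belle"))
print(translate("Dominique", "pacifique"))
print(translate("Dominique", "pacifique", context_gender="f"))
```

With ‘belle’ the adjective settles the matter; with ‘pacifique’ two readings survive until the context provides a gender.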

Writing differences between Corsican and Gallurese

Here are some writing differences between Corsican and Sardinian gallurese, that result from historical writing habits. These writing differences prevail, even when the words are the same:

  • ghj is replaced by gghj: acciaghju (corsu), acciagghju (gallurese), steel
  • chj is replaced by cchj: finochju (corsu), finocchju (gallurese), fennel
  • tonic accent is marked systematically in Gallurese whereas it is not compulsory in Corsican: apostulu (corsu), apòstulu (gallurese), apostle
  • cc is preferred in Gallurese where Corsican has cq: acquistu (corsu), accuistu (gallurese), purchase
  • dd in Corsican taravese or sartinese is replaced with ddh in Gallurese: beddu bedda beddi (corsu), beddhu beddha beddhi (gallurese), beautiful
  • final è in Corsican is replaced with é in Gallurese: sapè (corsu), sapé (gallurese), to know
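
Most of these correspondences are regular enough to be written as rewriting rules. The Python sketch below is an illustration only: restricting the ghj and chj rules to the position after a vowel is an assumption made to avoid touching word-initial clusters, and the tonic accent rule is left out because it requires knowing where the stress falls.

```python
import re

VOWELS = "aeiouàèìòù"

# (pattern, replacement) pairs, applied in order to a Corsican spelling.
RULES = [
    (re.compile(rf"(?<=[{VOWELS}])ghj"), "gghj"),  # acciaghju -> acciagghju
    (re.compile(rf"(?<=[{VOWELS}])chj"), "cchj"),  # finochju -> finocchju
    (re.compile(r"cq"), "cc"),                     # acquistu -> accuistu
    (re.compile(r"dd(?!h)"), "ddh"),               # beddu -> beddhu
    (re.compile(r"è$"), "é"),                      # sapè -> sapé
]


def to_gallurese_spelling(word: str) -> str:
    """Apply the spelling correspondences listed above to a single word."""
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word


for word in ["acciaghju", "finochju", "acquistu", "beddu", "sapè"]:
    print(word, "->", to_gallurese_spelling(word))
```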

How rule-based and statistical machine translation can help each other

Here are a few suggestions on how rule-based and statistical machine translation can help each other:

  • To begin with, rule-based and statistical machine translation are often contrasted and compared, but it would be an oversimplification to conclude that one is better than the other. From a more objective standpoint, each method has its strengths and weaknesses. Let us investigate how they could be made to collaborate so as to add up their respective strengths.
  • In the case of an endangered language, the lack of good-quality corpora has been pointed out. One way for rule-based and statistical machine translation to collaborate would be to use rule-based translation to build a better-quality corpus for statistical machine translation.
  • Suppose we begin with statistical machine translation software that performs at 50% on average for French to Corsican translation.
  • Let us sketch the process of creating these better corpora, taking the example of the French-Corsican diglossic pair (the Corsican language being considered by Unesco as a definitely endangered language). At present we lack a quality French-Corsican corpus or, to put it more accurately, the corpus at our disposal is a low-quality one. The idea would be to use rule-based machine translation to create a much better corpus for use with statistical machine translation.
  • Let us now sketch the different steps of this collaborative process: (i) create a French-Corsican corpus with the help of rule-based machine translation: if the software performs at some 90% on average, then the corpus would be on average 90% reliable. With appropriate training, statistical MT should now perform at, say, 80% on average (to be compared with the previous 50%); (ii) from this French-Corsican corpus, other corpus pairs can be created, such as Italian-Corsican, English-Corsican, etc., since French-Italian, French-English, etc. corpora of excellent quality already exist. The performance gain should then extend to these other language pairs.
  • With the help of this process, we are finally in a position to combine and add up the strengths of the two complementary approaches to MT: on the one hand, rule-based MT is able to translate with good accuracy even when corpora are lacking; on the other hand, statistical machine translation is able to handle a great many language pairs successfully and quickly. To sum up, as the Corsican proverb says: una mani lava l’altra (One hand washes the other).
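
Steps (i) and (ii) can be sketched in Python as follows. Everything here is hypothetical scaffolding: `toy_rbmt` stands in for a real rule-based translator, the function names are invented for the illustration, and the Italian sentence is a sample added only to show the pivot through French.

```python
def build_corpus(french_sentences, rbmt_translate):
    """Step (i): pair each French sentence with its rule-based translation."""
    return [(fr, rbmt_translate(fr)) for fr in french_sentences]


def pivot_corpus(fr_co_pairs, fr_other_pairs):
    """Step (ii): join two corpora on their shared French side."""
    other_by_french = dict(fr_other_pairs)
    return [
        (other_by_french[fr], co)
        for fr, co in fr_co_pairs
        if fr in other_by_french
    ]


# Stand-in for a rule-based translator (a lookup table, for illustration only).
TOY_RULES = {
    "l'Empire allemand": "l'Imperu alimanu",
    "au stade de Wembley": "in u stadiu di Wembley",
}


def toy_rbmt(sentence):
    return TOY_RULES[sentence]


fr_co = build_corpus(list(TOY_RULES), toy_rbmt)
fr_it = [("l'Empire allemand", "l'Impero tedesco")]  # sample French-Italian pair
print(pivot_corpus(fr_co, fr_it))
```

The resulting French-Corsican pairs would serve as training data for the statistical system, and the pivoted pairs do the same for the other language pairs.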

French ‘fin’ followed by a year number: fixed

Tagger improvement: fixed this issue. French ‘l’Empire allemand’ now translates properly into l’Imperu alimanu (the German Empire). French word ‘fin’ is now identified as a preposition when followed by a year number.

The above excerpt is translated into the ‘sartinesu’ variant of Corsican language.

This issue relates to the more general problem of the grammatical status of numbers, a problem to which we shall return later.
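
A rule of this kind might look as follows in Python. This is a sketch, not the tagger's actual code: the tag names "PREP" and "OTHER" are placeholders, and recognising a year as a four-digit token is an assumption.

```python
import re

YEAR = re.compile(r"^\d{4}$")  # assumption: a year is written as four digits


def tag_fin(tokens):
    """Tag each 'fin' as PREP when a year follows, OTHER otherwise.

    Tokens other than 'fin' receive None, since the sketch does not tag them.
    """
    tags = []
    for i, token in enumerate(tokens):
        if token.lower() != "fin":
            tags.append(None)
            continue
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        tags.append("PREP" if YEAR.match(next_token) else "OTHER")
    return tags


print(tag_fin(["fin", "1918"]))
print(tag_fin(["la", "fin", "du", "siècle"]))
```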

Translation of preposition ‘à’ followed by noun phrase denoting a location

‘au stade de Wembley’ (at the Wembley Stadium) should be translated as ‘in u stadiu di Wembley’.

We face here the issue of translating the preposition ‘à’ (‘au’ being short for ‘à le’, to the), in particular when ‘à’ is followed by a noun phrase denoting a location. French ‘à’ must be disambiguated, since it can translate either into à (to) or into in (in).
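
One simple way to approach this choice is a lexicon of location nouns. The Python sketch below is purely illustrative: `LOCATION_NOUNS` is a hypothetical one-entry lexicon, and looking for a location noun anywhere after the preposition is a crude stand-in for identifying the head of the noun phrase.

```python
# Hypothetical lexicon of French nouns that denote a location.
LOCATION_NOUNS = {"stade"}


def translate_preposition(tokens):
    """Choose 'in' or 'à' for a French phrase starting with 'à' or 'au'."""
    first = tokens[0].lower()
    if first not in ("à", "au"):
        raise ValueError("the phrase must start with 'à' or 'au'")
    # Assumption: any location noun in the phrase is treated as its head.
    is_location = any(token.lower() in LOCATION_NOUNS for token in tokens[1:])
    return "in" if is_location else "à"


print(translate_preposition("au stade de Wembley".split()))
print(translate_preposition(["à", "Paul"]))
```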

Agreement of the past participle

Now scoring 1 – 2/129 ≈ 98.45%.

  • The issue of past participle agreement again: ‘une session du parlement tenue à Nuremberg’ (a session of the Parliament held in Nuremberg) should translate into una sessione di u parlamentu tenuta in Nuremberg. The past participle tenuta should agree with sessione (feminine, session) and not with parlamentu (masculine, Parliament). This may call for dependency parsing, but even that could be insufficient. Perhaps (harder) semantic disambiguation is required in this case.
  • One false positive: ‘des’, being a German word here, should remain untranslated.

Past participle or present simple: the disambiguation of French ‘construit’

In the present case, it should read custruitu à u seculu XII (built in the 12th century). The error relates to the disambiguation of French ‘construit’. It can translate into:

  • custruitu (built): past participle, masculine, singular
  • custruisce (builds): present simple, third person

MT should (i) find the proper referent of ‘construit’, i.e. ‘clocher’ (church tower), but above all (ii) determine whether ‘construit’ is a past participle or a present simple. Some kind of dependency parser is in order…

Can translation help teaching an endangered language?

Can translation help in self-teaching an endangered language? It seems so, if the translation is accurate. Let us check with the verb parlà (to speak). In this case, the translation is 100% accurate, so it can help (but we need to check other verb categories and other tenses). Other verbs of the same group are verbs that end with -à: manghjà (to eat), saltà (to jump), cantà (to sing), etc.

To begin with, conjugations in the present simple, imperfect and future:

  • je parle (I speak), tu parles (you speak), il/elle parle (he/she speaks),
    nous parlons (we speak), vous parlez (you speak), ils/elles parlent (they speak)
  • je parlais (I was speaking), tu parlais (you were speaking), il/elle parlait (he/she was speaking),
    nous parlions (we were speaking), vous parliez (you were speaking), ils/elles parlaient (they were speaking)
  • je parlerai (I will speak), tu parleras (you will speak), il/elle parlera (he/she will speak), nous parlerons (we will speak), vous parlerez (you will speak), ils/elles parleront (they will speak).

Of interest:

  • French ‘parle’ is ambiguous since it can translate into parlu (I speak) or parla (he/she speaks).
  • French ‘parlais’ is ambiguous since it can translate into parlavu (I was speaking) or parlavi (you were speaking).
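
These two ambiguities disappear once the subject pronoun is taken into account, as the following Python sketch shows. It covers only the forms quoted above; the table layout and the function name are illustrative assumptions.

```python
# French verb form -> {French subject pronoun: Corsican form}.
# Only the forms mentioned above are listed.
FORMS = {
    "parle": {"je": "parlu", "il": "parla", "elle": "parla"},
    "parlais": {"je": "parlavu", "tu": "parlavi"},
}


def translate_verb(form, subject=None):
    """Return the Corsican candidates for a French form of 'parler'.

    Without a subject, every possible reading is returned (sorted);
    with a subject pronoun, the single matching form is returned.
    """
    readings = FORMS[form]
    if subject is None:
        return sorted(set(readings.values()))
    return [readings[subject]]


print(translate_verb("parle"))
print(translate_verb("parle", subject="je"))
print(translate_verb("parlais", subject="tu"))
```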