
Proper nouns: handling some false positives

We are now handling some false positives related to proper noun translation. As this type of error is somewhat widespread, fixing it could result in a 0.2% increase in overall accuracy.

Of interest in the present case:

  • recall that ‘détroit’ is the French word for strittonu (strait, e.g. the strait of Gibraltar)
  • ‘Tours’ (the French city) is also left untranslated, being ambiguous with torri (towers) or ghjiri (turns)
  • ‘12th Street riot’ and ‘Michigan’ are left untranslated
  • self-evaluation erroneously reports 2 vocabulary errors: ‘riot’ and the ‘th’ in ‘12th’ (illustrated in the sketch below)
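
To make the last point concrete, here is a minimal sketch in Python, with a toy word list and a hypothetical function name, of the kind of naive unknown-word check that produces these two false positives: capitalized tokens are assumed to be proper nouns and skipped, while lowercase tokens absent from the word list are flagged.

```python
# Toy stand-in for the engine's word list (hypothetical): real systems use full lexicons.
KNOWN_WORDS = {"détroit", "strittonu", "torri", "ghjiri"}

def hypothesized_vocabulary_errors(tokens):
    """Return the tokens the engine would count as possible vocabulary errors."""
    flagged = []
    for tok in tokens:
        # Capitalized tokens and numbers are assumed to be proper nouns or figures.
        if tok.istitle() or tok.isdigit():
            continue
        # Anything else absent from the word list is flagged, even fragments
        # ('th') or foreign words ('riot') deliberately left untranslated.
        if tok.lower() not in KNOWN_WORDS:
            flagged.append(tok)
    return flagged

# '12th' is split into '12' + 'th'; 'th' and 'riot' are unknown to the engine,
# hence the two spurious vocabulary errors reported by self-evaluation.
print(hypothesized_vocabulary_errors(["12", "th", "Street", "riot", "Michigan"]))
# -> ['th', 'riot']
```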

Proper nouns: false positives again

We now face false positives again: the French proper noun ‘Détroit’ is erroneously translated into Strittonu when it should have been left untranslated, being a proper noun. The ambiguity of ‘Détroit’ lies in the fact that it can be rendered either as:

  • Détroit, the city
  • Strittonu, the Corsican word strittonu/strittone corresponding to the French common noun ‘détroit’ (strait, e.g. the strait of Messina).

This raises the general issue of the proper disambiguation of proper nouns.
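
As a first approximation, one could sketch a capitalization-based heuristic. The following Python snippet (a toy dictionary built from the examples above, hypothetical function names) leaves mid-sentence capitalized tokens untranslated, and shows where the heuristic stops helping: at the start of a sentence, capitalization no longer signals a proper noun.

```python
# A minimal sketch, assuming a simple capitalization heuristic (toy
# dictionary and hypothetical names): translate common nouns, leave
# suspected proper nouns untouched.
TOY_DICTIONARY = {"détroit": "strittonu", "tours": "torri"}

def translate_token(token, sentence_initial=False):
    """Translate a French token into Corsican, keeping suspected proper nouns."""
    if token[:1].isupper() and not sentence_initial:
        # Capitalized in mid-sentence: most likely a proper noun
        # (Détroit, Tours, ...), so leave it untranslated.
        return token
    # Otherwise fall back to the dictionary; unknown words are kept as-is.
    return TOY_DICTIONARY.get(token.lower(), token)

print(translate_token("Détroit"))                        # Détroit (proper noun kept)
print(translate_token("détroit"))                        # strittonu (common noun)
print(translate_token("Détroit", sentence_initial=True)) # strittonu: the false positive
```

Sentence-initially, the heuristic alone cannot settle the case; contextual clues (a preceding preposition, an article, the surrounding topic) would be needed to decide between the city and the common noun.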

Evaluation of machine translation: why not self-evaluation?

Evaluation of machine translation is usually done with external tools (to cite a few: ARPA, BLEU, METEOR, LEPOR, …). But let us investigate the idea of self-evaluation, for it seems that the software itself is capable of forming an accurate idea of its own possible errors.

In the above example, human evaluation yields a score of 1 – 5/88 = 94.32%. Contrast this with self-evaluation, which sums the engine's possible errors (unknown words and disambiguation errors), yielding a score of 92.05% from 7 hypothesized errors. In this case, self-evaluation computes the maximum error rate. But even here there are some false positives: ‘apellation’ is left untranslated because it is unrecognized; in fact, the correct spelling is ‘appellation’. To sum up: the software identifies an unknown word, leaves it untranslated, and counts it as a possible error.
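
Put in code form, self-evaluation amounts to a worst-case accuracy estimate. A minimal sketch (hypothetical function name, figures taken from the example above):

```python
def self_evaluation_score(total_words, hypothesized_errors):
    """Worst-case accuracy: count every hypothesized error as a real one."""
    return 1 - hypothesized_errors / total_words

# 88 words; 7 hypothesized errors (unknown words + disambiguation errors)
# versus 5 actual errors found by human evaluation.
print(f"self-evaluation : {self_evaluation_score(88, 7):.2%}")  # 92.05%
print(f"human evaluation: {self_evaluation_score(88, 5):.2%}")  # 94.32%
```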

Let us sketch what could be the pros and cons of MT self-evaluation. To begin with, the pros:

  • it could provide a detailed taxonomy of possible errors: unknown words, unresolved grammatical disambiguation, unresolved semantic disambiguation, … (see the sketch after this list)
  • it could identify precisely the suspected errors
  • evaluation would be very fast and inexpensive
  • self-evaluation would work with any text or corpus
  • self-evaluation could pave the way to further self-improvement and self-correction of errors
  • its reliability could be good
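
As an illustration of the first two points, here is a minimal sketch (hypothetical class and field names, illustrative positions) of what a self-evaluation report could look like: each suspected error is recorded with its category and position, and the worst-case score follows directly.

```python
from dataclasses import dataclass, field
from enum import Enum

class ErrorType(Enum):
    UNKNOWN_WORD = "unknown word"
    GRAMMATICAL_DISAMBIGUATION = "unresolved grammatical disambiguation"
    SEMANTIC_DISAMBIGUATION = "unresolved semantic disambiguation"

@dataclass
class SuspectedError:
    token: str       # the word or fragment involved
    position: int    # index of the token in the output (illustrative here)
    kind: ErrorType  # category in the taxonomy

@dataclass
class SelfEvaluationReport:
    total_words: int
    suspected: list = field(default_factory=list)

    def score(self):
        """Worst-case accuracy, counting every suspected error as real."""
        return 1 - len(self.suspected) / self.total_words

report = SelfEvaluationReport(total_words=88)
report.suspected.append(SuspectedError("riot", 41, ErrorType.UNKNOWN_WORD))
report.suspected.append(SuspectedError("th", 39, ErrorType.UNKNOWN_WORD))
print(f"{report.score():.2%}")  # 97.73% with these two suspected errors
```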

And the cons:

  • MT may be unaware of some types of errors, e.g. errors related to expressions and locutions
  • MT self-evaluation could be especially blind to grammatical errors
  • it would sometimes count foreign words that should remain untranslated as unknown words
  • MT would be unaware of erroneous disambiguations