Natural Language Processing for low resource languages: Part 2

Do go through Lilian Weng's blog for all the discussion happening around generative language models, beam search approaches, and the whole thing: https://lilianweng.github.io/lil-log/2021/01/02/controllable-neural-text-generation.html

How do language models power machine translation systems?

Any time a “translate to English” request is made, or Google Translate is used, these systems are powering it.

Most of the time, the translation program does not need to go over an actual dictionary. What you can do instead is give it a huge corpus of parallel text written in both languages.

Apart from translation, if you have a single-language corpus, even that is helpful for learning a language model.

What is a low resource language?

Many of the popular transformer models use data available on the internet, such as Wikipedia or Reddit. “Currently, the English Wikipedia alone has over 6,212,891 articles of any length, and the combined Wikipedias for all other languages greatly exceed the English Wikipedia in size, giving more than 29 billion words in 55 million articles in 309 languages.[2] The English Wikipedia alone has over 3.7 billion words,[3] and has over 90 times as many words as the 120-volume English-language Encyclopædia Britannica (online)”

These translations can be seen as “auto-regressive”, as the model generates each translated target word conditioned on its own previous output.
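To make the auto-regressive idea concrete, here is a rough sketch of greedy decoding, assuming the Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-hi checkpoint (picked purely for illustration; any seq2seq translation model would do):

```python
# A minimal greedy auto-regressive decoding loop (sketch, not production code).
# Assumes: pip install transformers sentencepiece torch.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-hi"  # illustrative checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
model.eval()

src = tokenizer("How are you?", return_tensors="pt")
# The target sequence starts with just the decoder start token.
generated = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    for _ in range(30):  # up to 30 target tokens
        logits = model(**src, decoder_input_ids=generated).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
        generated = torch.cat([generated, next_token], dim=-1)      # condition on own output
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

In practice you would just call model.generate(), which wraps this loop and adds beam search, sampling and so on; the point here is only that each new token is conditioned on the tokens generated so far.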

Getting the data

Purpose of this article:

1) Use this data and parser code.
2) Try it on a low resource language.

I want to highlight an important point at this moment: 4 MB of text data is nothing compared to what transformer-based models actually need.

The Common Crawl dataset is a giant repository of free text data in more than 40 languages. If you actually want to train a transformer model from scratch, you would need data on the order of millions of text files, and for that it would be best to start with one of these big data tools.

With that said, here is the link to my GitHub repository.

Why is this interesting? Urdu is a low resource language in NLP. Compared to English, which has hundreds of thousands of articles floating around on the internet, there is not much Urdu content to train ML language models on.

Ghazal is a form of poetry popular in South Asia. In terms of NLP, it provides interesting possibilities for future testing of language models. Wikipedia:

===============================================

All data credits belong to the wonderful work done by the Rekhta Foundation.

This is, in a way, a mini XLU dataset for me. The text is available in English, Urdu and Hindi, so I can run and test the strength of this multilingual model on these three different ‘token’ approaches.

Pre-training one NLP model on data from multiple languages, and then attempting to fine-tune it. Fine-tuning is another way of saying that if we already have a “trained” model, we train it a bit more, to make sure the model is successful at the task at hand.
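As a very simplified sketch of what fine-tuning looks like in code, assuming the Hugging Face transformers library, the google/mt5-small checkpoint, and a couple of made-up Urdu-English pairs standing in for a real dataset:

```python
# A small sketch of fine-tuning: take an already pre-trained model and train it
# a bit more on task-specific examples. Model name, learning rate and the two
# toy Urdu-English pairs are all placeholders.
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

pairs = [  # stand-in for a real parallel corpus
    ("translate Urdu to English: آپ کیسے ہیں؟", "How are you?"),
    ("translate Urdu to English: شکریہ", "Thank you."),
]

model.train()
for epoch in range(3):
    for src_text, tgt_text in pairs:
        inputs = tokenizer(src_text, return_tensors="pt")
        labels = tokenizer(tgt_text, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```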

The mT5 model from Google: there can be various common properties between different languages, and we can check its performance on these two non-Latin scripts.

Languages can share subwords or roots. The tokenizer “splits the input into the most common sub-words across all languages,” so the better results seem intuitive this way.
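A quick way to see this shared-subword idea in action, assuming the transformers library and the google/mt5-small tokenizer (the words below are just examples, not from my dataset):

```python
# One shared subword vocabulary, three scripts.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/mt5-small")
for word in ["information", "जानकारी", "معلومات"]:  # English, Hindi, Urdu
    print(word, "->", tok.tokenize(word))
# Every word, whatever its script, is split into pieces from the same
# vocabulary, so patterns learned for one language can transfer to another.
```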

Facebook is interesting, because we as users can see how the quality of the translations available on their platform has improved over the years. So what have been the key moments? Some terms on the timeline: LASER from Facebook (about two years old now), which removed the need to have English as a pivot language; the FairSeq production MT system; the CCMatrix dataset; and FairScale.

The goal is to make a single model which can understand all languages, instead of preparing individual models for different languages. Why does this help? The claim is that single-language models (or English-centric models) may not scale well: the improvements possible might be limited, and the effort spent on increasing quality in other languages can be redundant. Think of Facebook, for example, which has to deal with hundreds of languages being used by its billions of users all over the world.

I am leaving aside mBART and CRISS, which are the newest releases at the moment.

For XLM: “Instead of using FastText embeddings, the initial embeddings of the tokens are taken from a pretrained MLM and fed into the translation model.” But then where does BPE fit into the XLM paper?

How does this model work?

There is an encoder and a decoder.

Vocabulary is not shared between languages.

BPE helps reduce the vocabulary size between related languages.

From the paper on phrase-based NMT: “instead of learning an explicit mapping between BPEs in the source and target languages, we define BPE tokens by jointly processing both monolingual corpora. If languages are related, they will naturally share a good fraction of BPE tokens, which eliminates the need to infer a bilingual dictionary. In practice, we i) join the monolingual corpora, ii) apply BPE tokenization on the resulting corpus, and iii) learn token embeddings (Mikolov et al., 2013) on the same corpus, which are then used to initialize the lookup tables in the encoder and decoder.”
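Here is a minimal sketch of that three-step recipe, using SentencePiece BPE as a stand-in for the paper's fastBPE setup; the file names are placeholders:

```python
# Sketch of the i)-iii) recipe above. mono.hi.txt / mono.ur.txt are
# placeholder file names for two monolingual corpora.
import sentencepiece as spm

# i) join the two monolingual corpora into a single file
with open("joint.txt", "w", encoding="utf-8") as out:
    for path in ["mono.hi.txt", "mono.ur.txt"]:
        with open(path, encoding="utf-8") as f:
            out.write(f.read())

# ii) learn one BPE vocabulary on the joined corpus
spm.SentencePieceTrainer.train(
    input="joint.txt", model_prefix="joint_bpe",
    vocab_size=32000, model_type="bpe")

# iii) tokenize both corpora with the shared codes; the token sequences can
#      then be fed to word2vec/fastText to learn embeddings that initialize
#      the encoder and decoder lookup tables.
sp = spm.SentencePieceProcessor(model_file="joint_bpe.model")
print(sp.encode("یہ ایک مثال ہے", out_type=str))
```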

I had a nice time reading these papers. But why does this matter?

Here are some possible use cases:

The conventional assumption in the MT community seems to be to use a large monolingual corpus to prepare for MT when only a small parallel corpus is available for a low resource language. So, as a side thought, how about a language like Hinglish? Or can Hinglish be tackled by BPE tokenization, and then training with English + Hindi + Hinglish?

On the size of my data corpus: a previous NLP system (the phrase-based neural NMT paper) trained on a 5.5 million sentence monolingual dataset and tested on 1,800 sentences.

Moses scripts for tokenization? The NMT model is trained with 60,000 BPE tokens across English, French, German, Romanian, Russian and Urdu.

To confirm: for NMT, fastText is applied on the concatenation of the source and target corpora.

So is fastText the same thing as BPE?

The MUSE library for phrase tables?

Use of this multilingual transformer model.

If you want an explanation of transformers, you can also refer to this previous blog post of mine. (link)

[Draft note: add an image of the XLM performance table here, and use it to talk about why the accuracy could be low.]

Wikipedia is the corpus for the monolingual data; the language model learns the underlying structure of the text/sentences. The parallel corpus data (for machine translation) is mentioned as having been taken from the Tanzil dataset. The expression I had when I checked this data source was: Okkaaaaaaaaaaaay. It is a site with translations of religious text, and I am sure it won’t be able to capture a lot of the meaning of daily life.

Contextual representation of words solves the earlier problem with a word like “bank”: whether it means a river bank or a money bank.
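A small demo of what “contextual” means here, assuming the transformers library and the bert-base-uncased checkpoint (a model I bring in only for this illustration):

```python
# "bank" in two sentences gets two different contextual vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence: str) -> torch.Tensor:
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)
    position = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[position]

v_river = bank_vector("He sat on the bank of the river.")
v_money = bank_vector("She deposited the cash at the bank.")
print(torch.cosine_similarity(v_river, v_money, dim=0))  # noticeably below 1.0
```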

How do these generative models work?

What you would need to prepare for your particular language.

Conditional generation of language based on your language model.
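A bare-bones sketch of conditional generation, assuming the transformers library; gpt2 is only a stand-in, and for Urdu you would point this at your own fine-tuned checkpoint:

```python
# Conditional generation: give the model a prompt and let it continue.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The moon tonight"
inputs = tok(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tok.decode(output[0], skip_special_tokens=True))
```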

Possibly, in theory, a language model should also be able to auto-complete code, since most code follows a syntactic structure, and the model can learn what people usually write after the first three lines of a given piece of code.

Domain specific text generation

Back-translation seems to be the technique for low resource language generation. “If our goal is to train a Chinese-to-French translation model, for instance, we’d first train a model for French to Chinese and translate all of the monolingual French data to create synthetic, back-translated Chinese. We’ve found that this method is particularly effective at large scale, when translating hundreds of millions of monolingual sentences into parallel data sets.”
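A rough sketch of what back-translation looks like in practice, assuming the transformers pipeline API and the Helsinki-NLP/opus-mt-fr-en checkpoint as the reverse model (I swap the language pair from the quote to one where I know a public checkpoint exists; this is not Facebook's actual pipeline):

```python
# Back-translation sketch: to train an English->French model, take monolingual
# French and translate it *backwards* into English, producing synthetic pairs.
from transformers import pipeline

reverse_mt = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

monolingual_fr = [
    "Bonjour tout le monde.",
    "La traduction automatique progresse vite.",
]

synthetic_pairs = []
for fr_sentence in monolingual_fr:
    en_guess = reverse_mt(fr_sentence)[0]["translation_text"]
    synthetic_pairs.append((en_guess, fr_sentence))  # (synthetic source, real target)

print(synthetic_pairs)  # added to the real parallel data when training en->fr
```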

A shared BPE vocabulary of 100k subword units? “We use fastBPE to learn BPE codes and split words into subword units. The BPE codes are learned on the concatenation of sentences sampled from all languages, following the method presented in Section 3.1.”

From the XLM paper: “For low-resource languages, it is often beneficial to leverage data in similar but higher-resource languages, especially when they share a significant fraction of their vocabularies. For instance, there are about 100k sentences written in Nepali on Wikipedia, and about 6 times more in Hindi. These two languages also have more than 80% of their tokens in common in a shared BPE vocabulary of 100k subword units.”

Of all the papers I read (XLM, mT5, phrase-based neural MT, mBART), mBART made the biggest claims: “For example, fine-tuning on bi-text in one language pair (e.g., Korean-English) creates a model that can translate from all other languages in the monolingual pre-training set (e.g., Italian-English), with no further training. We also show that languages not in pre-training corpora can benefit from mBART, strongly suggesting that the initialization is at least partially language universal. Finally, we present a detailed analysis of which factors contribute the most to effective pre-training, including the number of languages and their overall similarity… We tokenize with a sentencepiece model (SPM, Kudo and Richardson, 2018) learned on the full CC data that includes 250,000 subword tokens… low resource (<1M sentence pairs)” And on unseen vocabularies: “Arabic is distantly related to the languages in mBART02 and mBART06, and its use of a disjoint character set means that its word embeddings will be largely untrained. However, we obtain similar improvements on Ar-En pairs to those on Nl-En. This result suggests that the pretrained Transformer layers learn universal properties of language that generalize well even with minimal lexical overlap.”

Balanced corpus sampling formula:
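This presumably refers to the exponentially smoothed (temperature-based) sampling described in the XLM and mT5 papers; a tiny sketch with made-up corpus sizes:

```python
# Exponentially smoothed sampling over languages (XLM / mT5 style):
# q_i = p_i**alpha / sum_j p_j**alpha, where p_i is language i's share of data.
# Corpus sizes are made up; alpha is roughly 0.3-0.7 in the papers.
corpus_sizes = {"en": 3_000_000, "hi": 600_000, "ur": 40_000}
alpha = 0.5

total = sum(corpus_sizes.values())
p = {lang: n / total for lang, n in corpus_sizes.items()}       # raw shares
weights = {lang: share ** alpha for lang, share in p.items()}   # smoothed
z = sum(weights.values())
q = {lang: w / z for lang, w in weights.items()}                # sampling probabilities

print(q)  # the low resource language is sampled more often than its raw share
```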

References

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

The first AI model that translates 100 languages without relying on English data

The mC4 dataset from Google for mT5: mC4 comprises natural text in 101 languages drawn from the public Common Crawl web scrape, while C4 was purely English.

mBART

Moses: Moses offers phrase-based and tree-based translation models. An excerpt from the Moses documentation on the phrase translation table: Let us look at the first line of the phrase translation table (file phrase-table):

der     the     0.3
This entry means that the probability of translating the English word the from the German der is 0.3. Or in mathematical notation: p(the|der) = 0.3. Note that these translation probabilities are in the inverse order due to the noisy channel model.

The translation tables are the main knowledge source for the machine translation decoder. The decoder consults these tables to figure out how to translate input in one language into output in another language.

Being a phrase translation model, the translation tables do not only contain single word entries, but also multi-word entries. These are called phrases, but this concept means nothing more than an arbitrary sequence of words, with no sophisticated linguistic motivation.

Here is an example for a phrase translation entry in phrase-table:

das ist     this is     0.8          

“The first AI model that translates 100 languages without relying on English data”: this blog post from Facebook is very helpful. “Next, we introduced a new bridge mining strategy, in which we group languages into 14 language groups based on linguistic classification, geography, and cultural similarities. We did this because people living in countries with languages of the same family tend to communicate more often and would benefit from high-quality translations. For instance, one group would include languages spoken in India, like Bengali, Hindi, Marathi, Nepali, Tamil, and Urdu. We systematically mined all possible language pairs within each group.

To connect the languages of different groups, we identified a small number of bridge languages, which are usually one to three major languages of each group. In the example above, Hindi, Bengali, and Tamil would be bridge languages for Indo-Aryan languages. We then mined parallel training data for all possible combinations of these bridge languages. Using this technique, our training data set ended up with 7.5 billion parallel sentences of data, corresponding to 2,200 directions. Since the mined data can be used to train two directions of a given language pair (e.g., en->fr and fr->en), our mining strategy helps us effectively sparsely mine to best cover all 100x100 (a total of 9,900) directions in one model.”
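A toy sketch of the bridge-mining bookkeeping described above (the groups, bridge languages and language codes are illustrative only, not Facebook's actual lists):

```python
# Mine all pairs inside each group, plus all pairs between bridge languages.
from itertools import combinations

groups = {
    "india": ["bn", "hi", "mr", "ne", "ta", "ur"],
    "romance": ["es", "fr", "it", "pt"],
}
bridges = {"india": ["hi", "bn", "ta"], "romance": ["es", "fr"]}

pairs = set()
for langs in groups.values():                  # all pairs within a group
    pairs.update(combinations(sorted(langs), 2))
all_bridges = sorted(b for group in bridges.values() for b in group)
pairs.update(combinations(all_bridges, 2))     # pairs across groups via bridges

print(len(pairs), sorted(pairs)[:5])
```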

FairScale looks cool for the future task of scaling training on large NLP datasets.

From the mT5 paper: “Following XLM-R (Conneau et al., 2018), we increase the vocabulary size to 250,000 wordpieces. As in T5, we use SentencePiece (Kudo and Richardson, 2018; Kudo, 2018) wordpiece models that are trained with the same language sampling rates used during training. To accommodate languages with large character sets like Chinese, we use a character coverage of 0.99999, but also enable SentencePiece’s “byte-fallback” feature to ensure that any string can be uniquely encoded.”