Natural Language Processing for low resource languages Part 1

Text must be converted to vectors

1. Code for crawler

2. Code for finding most similar authors in the dataset

By the end of this blog, we will know how to build a simple content based recommendation system. We will work towards a content based approach that develops a topic-level understanding of an author's written work, in this example their ghazals/poems.

Hypothetically, suppose you like ghazals written by “Mirza Ghalib”, maybe because of his choice of topics and content. Based on our model, it seems you would also like the works of Meer Taqi Meer and Firaq Gorakhpuri. *(hehe, I would need to validate this by actually asking some readers)

===============================

Author is ---- mirza-ghalib
Most similar authors are ----

meer-taqi-meer        0.89
firaq-gorakhpuri      0.86
noon-meem-rashid      0.73
bahadur-shah-zafar    0.60

===============================

*similarity values are listed in descending order; a lower value means less similarity to the original author.

What is a low resource language?

Many of the popular transformer based models use data available on the internet, such as Wikipedia or Reddit. While the English Wikipedia alone has around 3.7 billion words, the same cannot be said about most other human languages of the world. As more of the world chooses to publish and blog in English, less and less data is being produced in other languages.

Why is this interesting? Urdu is a low resource language in NLP. Compared to English, which has hundreds of thousands of articles floating around on the internet, there is not much Urdu content available to train ML language models.

Ghazal is a form of poetry popular in South Asia. In terms of NLP, it provides interesting possibilities for future testing of language models.

All data credits belong to the wonderful work done by the Rekhta Foundation.

==============================================

Purpose of this blog:

1) Crawl a dataset for a low resource language
2) Try a toy NLP model for this data.

I want to highlight an important point at this moment: the crawled dataset is about 4 MB of text, 1300 text files in total, which is nothing compared to what transformer based models actually need.
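
For reference, here is a minimal sketch of the kind of crawler that could collect such a dataset. The URL list and the CSS class below are placeholders and not the actual Rekhta page structure, which you would need to inspect yourself.

# minimal crawler sketch -- the URLs and the CSS class are placeholders,
# not the real Rekhta markup
import os
import requests
from bs4 import BeautifulSoup

def crawl_ghazals(page_urls, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(page_urls):
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, 'html.parser')
        # hypothetical container holding the ghazal body
        body = soup.find('div', class_='ghazal-body')
        if body is None:
            continue
        text = body.get_text(separator='\n', strip=True)
        out_path = os.path.join(out_dir, 'ghazal_{}.txt'.format(i))
        with open(out_path, 'w', encoding='utf-8') as f:
            f.write(text)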

The Common Crawl dataset is a giant repository of free text data in more than 40 languages. If you actually want to train a transformer model from scratch, you would need data on the order of millions of text files, and for that it would be best to start with a large-scale corpus like Common Crawl.

===============================================

Here is the crawled dataset being put to use to find the most similar poets based on their written texts.

import os

urdu_corpus = []
# dir is the root directory of the crawled dataset (defined elsewhere);
# folders whose name contains 'ur' hold the Urdu text files
for root, dirn, files in os.walk(dir):
    for name in files:
        base = root.split('/')[-1]
        x = os.path.join(root, name)
        if 'ur' in base:
            with open(x, "r") as f:
                urdu_corpus.append(f.read())

Stop words removal

Stopwords are words which occur with a very high frequency but usually don’t carry a lot of meaning. Think of articles like ‘a’, ‘an’, ‘the’ in English, prepositions, or frequently repeated verbs like ‘is’ and ‘are’.
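
The code below uses a list stopwords_urdu which is not defined in this post; a typical way to get one is to load a publicly available Urdu stopword list from a plain text file, one word per line (the filename here is hypothetical).

# hypothetical: load an Urdu stopword list, one word per line
# (the actual source of stopwords_urdu is not shown in this post)
with open('urdu_stopwords.txt', encoding='utf-8') as f:
    stopwords_urdu = [line.strip() for line in f if line.strip()]
print('number of stopwords ', len(stopwords_urdu))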

def remove_urdu_stopwords(stopwords_urdu, words):
    for stopword in stopwords_urdu:
        if stopword in words:
            words = list(filter(lambda x: x != stopword, words))
    return words

count_words_before = 0
count_words_after = 0
corp = []
for paper in urdu_corpus:
    paperlines = paper.lstrip().splitlines()
    words = [sentence.split(' ') for sentence in paperlines]
    words = [j for sub in words for j in sub]
    count_words_before += len(words)
    words = remove_urdu_stopwords(stopwords_urdu, words)
    count_words_after += len(words)
    corp.append(words)

print('================================')
print('Sample of a tokenized document:')
print(corp[5])
print('length of this document ', len(corp[5]))
print('================================')
print('Total number of documents/ghazals ', len(corp))
print('number of token before stopwords removal ', count_words_before)
print('after removal of stopwords ', count_words_after)

    ================================
    Sample of a tokenized document:
    ['سامنے', 'ستائش', 'چاہا', 'ہونٹوں', 'جنبش', 'اہل', 'محفل', 'کب', 'احوال', 'کھلا', 'اپنا', 'خاموش', 'پرسش', 'قدر', 'تعلق', 'چلا', 'جاتا', 'رنج', 'خواہش', 'کم', 'دونوں', 'بھرم', 'قائم', 'بخشش', 'گزارش', 'ادب', 'آداب', 'پیاسا', 'رکھا', 'محفل', 'صراحی', 'گردش', 'دکھ', 'اوڑھ', 'خلوت', 'پڑے', 'رہتے', 'بازار', 'زخموں', 'نمائش', 'ابر', 'کرم', 'ویرانۂ', 'جاں', 'دشت', 'بارش', 'کٹ', 'قبیلے', 'حفاظت', 'مقتل', 'شہر', 'ٹھہرے', 'جنبش', 'ہمیں', 'بھول', 'عجب', 'فرازؔ', 'میل', 'ملاقات', 'کوشش']
    length of this document  60
    ================================
    Total number of documents/ghazals  1314
    number of token before stopwords removal  194282
    after removal of stopwords  89404

Token frequencies interestingly follow Zipf’s law: a word’s frequency is roughly inversely proportional to its rank, so going from the most common word to the least common one, frequency drops off steeply along an approximately power-law curve. We can check this on our own corpus right after computing the frequency distribution below.

*(Zipf’s law plot; image source: Wikipedia)

# checking the frequency distribution of tokens after removing stopwords
from nltk import FreqDist

tokens = [j for sub in corp for j in sub]
fdist = FreqDist(tokens)
print('========================')
print('Token frequency distribution')
print('========================')
# print the ten most frequent tokens
for token, count in fdist.most_common(10):
    print(token, ' ', count)

    ========================
    Token frequency distribution
    ========================
    عشق   523
    غم   467
    سر   360
    نظر   349
    بات   311
    سب   292
    گل   291
    یاد   269
    گھر   256
    رہ   248



import numpy as np
import gensim
from gensim import corpora

# build the dictionary and bag-of-words corpus, dropping very rare and very common tokens
id2word = corpora.Dictionary(corp)
id2word.filter_extremes(no_below=2, no_above=0.7)
corpus = [id2word.doc2bow(doc) for doc in corp]

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            num_topics=6,
                                            random_state=100,
                                            update_every=0,
                                            passes=15,
                                            alpha='auto',
                                            eta='auto',
                                            per_word_topics=True)
lda_model.save('./ghazal/lda_ghazals_rekhta')
id2word.save_as_text("./ghazal/wordlist_ghazals_rekhta")

topics = lda_model.show_topics(num_topics=-1, num_words=3, formatted=False)
np.save('./ghazal/ghazal_topic_words', topics)
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in topics]

# Below code prints topics and words
for topic, words in topics_words:
    print('topic ' + str(topic) + "::")
    for w in words:
        print(str(id2word[int(w)]))
    print('=======================')

The different topics here can be read as groups of words that often occur together in a document/poem. We have trained our LDA model here for 6 topics.


    topic 0::
    ساتھ
    دیکھتے
    بات
    =======================
    topic 1::
    جاتا
    ملتا
    ہوتے
    =======================
    topic 2::
    ہوگا
    آتے
    عشق
    =======================
    topic 3::
    خدا
    غم
    بات
    =======================
    topic 4::
    غم
    جاتے
    یوں
    =======================
    topic 5::
    عشق
    گل
    نظر
    =======================
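
The choice of 6 topics is somewhat arbitrary. One way to sanity-check it, which was not done in the run above and is only a sketch, is to compare gensim's topic coherence score across a few topic counts, reusing corpus, corp and id2word from above:

# sketch: compare topic coherence for a few topic counts
from gensim.models import CoherenceModel

for k in [4, 6, 8, 10]:
    model_k = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=k,
                                              random_state=100, passes=15)
    cm = CoherenceModel(model=model_k, texts=corp, dictionary=id2word, coherence='c_v')
    print('num_topics', k, 'coherence', round(cm.get_coherence(), 3))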



def make_vector_from_topic_distribution(document_topics, num_topics):
    v = np.zeros(num_topics, dtype=float)
    for dc in document_topics:
        j = dc[0]
        v[j] = dc[1]
    return v

def get_vector_document(x, lda):
    token_and_freq = id2word.doc2bow(x)
    doc_topics = lda.get_document_topics(token_and_freq, per_word_topics=False)
    s = 0
    for dc in doc_topics:
        s += dc[1]
    v = make_vector_from_topic_distribution(doc_topics, num_topics=6)
    # print('sum of prob: ', s)
    return v

Building basic topic vector representations for each of the 30 authors

#Toy example
#Which two authors are most similar in terms of choice of topics or words used?
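
The code that groups document vectors by author is not shown here, so below is a minimal sketch of what it could look like; doc_authors is a hypothetical list holding the author slug of each document in corp (recoverable from the directory names while crawling).

# sketch: collect per-document topic vectors grouped by author
# doc_authors is a hypothetical list with the author slug of each document in corp
import torch

authors = {}
for doc, author in zip(corp, doc_authors):
    v = torch.from_numpy(get_vector_document(doc, lda_model))
    authors.setdefault(author, []).append(v)

sample_author = list(authors)[0]
print('======================')
print('sample document vector for an author ')
print(len(authors[sample_author]))
print(authors[sample_author][0])
print('======================')
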
    ======================
    sample document vector for an author 
    50
    tensor([0.0000, 0.0000, 0.3098, 0.0000, 0.6847, 0.0000], dtype=torch.float64)
    ======================

Mean representation of all works of an author

For now, taking the mean representation for each author across their different works is the most trivial approach to get a single vector summarizing all their work. Of course, this will only be an approximate representation.

author_mean_vectors = {}
for author in authors:
    vectors = authors[author]
    x = torch.stack(vectors, dim=0).mean(dim=0)
    author_mean_vectors[author] = x
author1='mirza-ghalib'
vector1=author_mean_vectors[author1]
max_sim=0
most_sim=''
for author in author_mean_vectors:
    if author!=author1:
        sim=torch.dot(vector1,author_mean_vectors[author])
        if sim>=max_sim:
            max_sim=sim
            most_sim=author
print(most_sim)
            

    firaq-gorakhpuri
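
The loop above only keeps the single closest author. The expanded lists that follow were presumably produced by sorting all the scores; here is a minimal sketch of how that could look, reusing author_mean_vectors and the same dot-product similarity:

# sketch: rank all other authors by similarity to a chosen author
def most_similar_authors(author1, topn=4):
    vector1 = author_mean_vectors[author1]
    scores = {a: torch.dot(vector1, v).item()
              for a, v in author_mean_vectors.items() if a != author1}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

print('Author is ----', 'mirza-ghalib')
for author, score in most_similar_authors('mirza-ghalib'):
    print(author, round(score, 6))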

An expanded list of similar authors for a few of the poets:

===============================
Author is ---- kaifi-azmi

Most similar authors are ----
ahmad-faraz       0.891351
waseem-barelvi    0.698940
javed-akhtar      0.694203
habib-jalib       0.635275

===============================

This ranking of similarity seems to make sense to me, as Kaifi Azmi wrote songs for Bollywood movies, and so did Javed Akhtar and Waseem Barelvi.

Author is ---- mirza-ghalib

Most similar authors are ----
meer-taqi-meer        0.897031
firaq-gorakhpuri      0.867845
noon-meem-rashid      0.732428
bahadur-shah-zafar    0.604054

===============================

Mirza Ghalib was a contemporary of Bahadur Shah Zafar, hence it seems plausible that the nature of their topics was similar. Also, interestingly, do these results suggest that Ghalib was strongly influenced by the style of Meer Taqi Meer, who came before him?

Author is ---- allama-iqbal

Most similar authors are ----
noon-meem-rashid    0.788498
nazm-tabatabai      0.781452
meer-anees          0.715229
meer-taqi-meer      0.623361
===============================================

Allama Iqbal wrote extensively on topics related to religion and self identity.

In my next blog I will try to build upon this crawled dataset. It is, in a sense, a mini fine-tuning dataset for a transformer model.

The text is available in English, Urdu and Hindi, so I could test the strength of a multilingual model across these three different scripts and their tokenizations.
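
As a rough preview of that experiment, here is a sketch, not taken from this post, that compares how a multilingual tokenizer splits the same line across the three scripts; xlm-roberta-base is just one possible model choice, and the sample lines are placeholders to be replaced with a real ghazal line in its three renderings.

# sketch: compare how a multilingual tokenizer splits the same line in three scripts
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')

# placeholder lines -- substitute the same ghazal line in its three renderings
samples = {
    'en': 'an english rendering of a ghazal line goes here',
    'ur': 'یہاں اردو میں غزل کا ایک مصرع',
    'hi': 'यहाँ हिंदी में ग़ज़ल की एक पंक्ति',
}

for lang, line in samples.items():
    tokens = tokenizer.tokenize(line)
    print(lang, len(tokens), tokens[:8])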