Interested in a scientist / researcher / intern position at Wipro AI? Drop me an email with your CV.

TOOL CONTEST ON POS TAGGING FOR CODE-MIXED INDIAN SOCIAL MEDIA (FACEBOOK, TWITTER, AND WHATSAPP) TEXT @ ICON 2016

RATIONALE

The evolution of social media texts such as blogs, micro-blogs (e.g., Twitter), WhatsApp, and chats (e.g., Facebook messages) has created many new opportunities for information access and language technology, but also many new challenges, making it one of the prime present-day research areas. Non-English speakers, especially Indians, do not always use Unicode to write in Indian languages (ILs) on social media. Instead, they use phonetic typing, roman script, or transliteration, frequently insert English words or phrases through code-mixing and anglicisms (see Example 1 below), and often mix multiple languages to express their thoughts. While English is still the principal language for social media communication, there is a growing need to develop technologies for other languages, including Indian languages. India is home to several hundred languages, and this language diversity, along with dialect change, instigates frequent code-mixing. Indians are thus multi-lingual by adaptation and necessity, and frequently change and mix languages in social media contexts, which poses additional difficulties for automatic Indian social media text processing. Part-of-speech (POS) tagging is an essential prerequisite for any kind of NLP application. This year we will continue last year's POS tagging shared task on three widely spoken Indian languages (Hindi, Bengali, and Telugu), mixed with English.

Example 1: ICON 2016 Varanasi me hold hoga! Great chance to see the pracheen nagari! (Gloss: "ICON 2016 will be held in Varanasi! Great chance to see the ancient city!")

THE CONTEST 
Participants will be provided with training, development, and test data on which to report the performance of their POS tagging systems. English-Hindi, English-Bengali, and English-Telugu language mixing will be explored. The datasets may be provided with some additional information, such as the language of each word. Performance will be measured in terms of Precision, Recall, and F-measure. Shortlisted candidates will present their techniques and results in a special session at ICON 2016.
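The official evaluation script is not shown here, but per-tag Precision, Recall, and F-measure over word-level predictions can be sketched as follows (a minimal illustration, not the contest scorer):

```python
from collections import Counter

def per_tag_prf(gold, pred):
    """Per-tag Precision, Recall and F-measure for word-level POS tags.

    gold, pred: equal-length sequences of POS tags, one per word.
    Returns {tag: (precision, recall, f_measure)}.
    """
    assert len(gold) == len(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct tag
        else:
            fp[p] += 1          # predicted tag is a false positive
            fn[g] += 1          # gold tag is a false negative
    scores = {}
    for tag in set(gold) | set(pred):
        prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[tag] = (prec, rec, f)
    return scores
```

Averaging these per-tag scores over the three language pairs would give a system's overall ranking figure.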

The contest will have three prizes: 

FIRST PRIZE: Rs.10,000/- 
SECOND PRIZE: Rs.7,500/- 
THIRD PRIZE: Rs.5,000/-

WHAT'S NEW THIS YEAR 
We are releasing code-mixed WhatsApp data for 3 language pairs: English-Hindi, English-Bengali, and English-Telugu. This is possibly the first time NLP issues in WhatsApp messages are being discussed. WhatsApp messages are typically much shorter than Facebook and Twitter messages, and therefore more challenging. Hopefully it will be exciting!

THE TASK 
The contest task is to predict POS tags at the word level; language tags (en, hi/bn/te, univ {symbols, @-mentions, hashtags}, mixed {word-level mixing, like jugading}, acro {lol, rofl, etc.}, ne, undef) will be given at the word level. There will be two tracks: a fine-grained tagset and a coarse-grained tagset (the Google universal tagset). The fine-grained tagset and its mapping to the coarse-grained tagset are given in Table 1. More details about the tagset can be found in our RANLP paper.
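The exact release format is not specified above; as an illustrative sketch only, assuming a CoNLL-style one-token-per-line file with TAB-separated word, language-tag, and POS-tag columns and blank lines between utterances, the data could be read as:

```python
def read_utterances(path):
    """Read utterances from a (hypothetical) one-token-per-line file:
    word<TAB>language<TAB>POS, with blank lines separating utterances.
    Returns a list of utterances, each a list of (word, lang, pos) triples.
    """
    utterances, current = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:                      # blank line ends an utterance
                if current:
                    utterances.append(current)
                    current = []
            else:
                word, lang, pos = line.split('\t')
                current.append((word, lang, pos))
    if current:                               # file may not end with a blank line
        utterances.append(current)
    return utterances
```

The column layout and tag names here are assumptions for illustration; participants should follow the format described in the released data.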

Table 1: POS Tagset

Each team may submit up to 4 runs: one constrained and one unconstrained run for each of the fine-grained and coarse-grained tracks.

Constrained: The participating team may only use our corpus for training. No external resources are allowed.

Unconstrained: The participating team may use any external resources (an available POS tagger, NER system, parser, or any additional data) to train their system. They must mention those resources explicitly in their task report.

WINNER SELECTION 
The team performing best across all the language pairs using only our data (constrained) will be the winner. All unconstrained submissions will be used for academic discussion during the session.

** Note: teams can use the ICON 2015 data as an additional resource, but such submissions will be considered unconstrained.

DATA 
Training data for Twitter (1K), Facebook (1K), and WhatsApp (1K) will be released for all 3 language pairs: English-Hindi, English-Bengali, and English-Telugu. Although code-mixing is a natural practice for bi- and multi-linguals, the actual distribution of code-mixing in any social-media corpus is an important question. We have observed that monolingual English and romanized Indian language (IL) messages are equally prevalent in social media. For this contest we discarded almost all monolingual English messages, as there are other research efforts and forums where the research problems of English social media text have been discussed extensively. Here we concentrate only on code-mixed En-IL and monolingual IL messages.

When two languages blend, another important question is which language is mixed into which. To keep our data balanced, we maintain an equal distribution of utterances where English is mixed into ILs and where ILs are mixed into English.

Although our corpus is mostly a bi-lingual mix, there are utterances with tri- and quad-lingual mixing. For example, the English-Bengali corpus contains a significant number of Hindi words, whereas the English-Telugu data contains significant Tamil and Hindi mixing.

DATA RELEASE 

Language/Source    Facebook                             Twitter                                WhatsApp
                   Fine-Grained      Coarse-Grained     Fine-Grained       Coarse-Grained      Fine-Grained      Coarse-Grained
Hindi-English      FB_HI_EN_FN.txt   FB_HI_EN_CR.txt    TWT_HI_EN_FN.txt   TWT_HI_EN_CR.txt    WA_HI_EN_FN.txt   WA_HI_EN_CR.txt
Bengali-English    FB_BN_EN_FN.txt   FB_BN_EN_CR.txt    TWT_BN_EN_FN.txt   TWT_BN_EN_CR.txt    WA_BN_EN_FN.txt   WA_BN_EN_CR.txt
Telugu-English     FB_TE_EN_FN.txt   FB_TE_EN_CR.txt    TWT_TE_EN_FN.txt   TWT_TE_EN_CR.txt    WA_TE_EN_FN.txt   WA_TE_EN_CR.txt

A manually corrected version of the data is available here - http://www.amitavadas.com/ICON2016/ICON_POS.zip 

* Special Thanks to Monojit Choudhury, Simran Khanuja, and Sunayana Sitaram from Microsoft Research India.

INVITED SPEAKERS 
Monojit Choudhury

Bio: Monojit Choudhury is a Researcher at Microsoft Research Lab India. Prior to this, he did his PhD (2007) and B.Tech (2002), both in Computer Science and Engineering, from Indian Institute of Technology Kharagpur. His research interests include NLP for low resource languages, technologies for multilingual communities, and computational approaches to linguistics, sociolinguistics, evolutionary linguistics and cognition. Monojit is very actively involved with the organization of the International Linguistics Olympiad http://www.ioling.org and its Indian national counterpart – the Panini Linguistics Olympiad http://plo-in.org – programs that try to attract the brightest high school kids to linguistics and NLP through challenging yet interesting and thought-provoking puzzles.

Kalika Bali

Bio: Kalika Bali is a Researcher at Microsoft Research Lab India. A linguist and an acoustic phonetician by training, she has worked for the last 15 years in the area of Speech and Language Technology, especially for resource poor languages. Her brief stint as a lecturer in the University of the South Pacific, Fiji, has left her with a lasting interest in how technology can be used to enhance and further education and some of her current research lies at the intersection of ICT and Education, for primary school students to Adults learning new skills. The primary focus of her research is on how Natural Language systems can help Human-Computer Interaction, including computer-mediated interaction, in the domain of education and social media.

UNDERSTANDING THE DATA: COMPARING THE UTTERANCE-LEVEL CODE-MIXING IN CORPORA

When comparing different code-mixed corpora to each other, it is desirable to have a measurement of the level of mixing between languages. To this end we introduced the Code-Mixing Index, CMI, in (Gambäck & Das, 2016; Gambäck & Das, 2014b; Das and Gambäck, 2014a). At the utterance level, this amounts to finding the most frequent language in the utterance and then counting the frequency of the words belonging to all other languages present.

If an utterance x only contains language independent tokens, its code-mixing is zero; for other utterances, the level of mixing depends on the fraction of language dependent tokens that belong to the matrix language (the most frequent language in the utterance) and on N, the number of tokens in x except the language independent ones (i.e., all tokens that belong to any language Li):

Cu(x) = 100 × (N(x) − max{t_Li}) / N(x)  if N(x) > 0,  and  Cu(x) = 0  if N(x) = 0        (1)

(L, the set of all languages in the corpus; max{t_Li}, the number of tokens belonging to the matrix language, Li ∈ L). Notably, for mono-lingual utterances Cu = 0 (since then max{t_Li} = N).

This initial measure has several short-comings. In particular, it does not reflect what fraction of a corpus’ utterances contain code-switching, nor take into account the number of code alternation points: arguably, a higher number of language switches in an utterance increases its complexity, while a corpus with a larger fraction of mixed utterances is (on average) more complex.

Two main sources of information will be utilized to fully account for the code alternation at the utterance level: the ratio of tokens belonging to the matrix language (captured through its complement, fm = (N − max{t_Li})/N, as in Equation 1) and the number of code alternation points per token (fp = P/N, where P is the number of code alternation points; 0 ≤ P < N).

There are many ways to combine two (or several) information sources, in particular if they are independent; see, e.g., Genest and McConway (1990) for an overview. However, P partially depends on fm, which, for example, rules out the common logarithmic opinion pool:

f(x) = ∏_i fi(x)^wi

Instead we will use the linear opinion pool:

f(x) = ∑_i wi × fi(x)

Combining fm(x) and fp(x) gives a revised utterance level measure for N(x) > 0:

Cu(x) = 100 × (wm × fm(x) + wp × fp(x))        (2)

where wm and wp are weights (wm + wp = 1). Again, Cu = 0 for mono-lingual utterances (since then max{t_Li} = N, so fm = 0, and P = 0).

USING CMI IN PRACTICE: None of these corpora is code-mixed all the time. There are monolingual utterances, and even purely universal utterances, such as a message containing only a smiley. Therefore we use two CMI measures: the average over all utterances, called CMI-ALL, and the average over the utterances having a non-zero CMI, called CMI-MIXED. CMI-ALL measures how mixed a corpus as a whole is, whereas CMI-MIXED measures how mixed the code-mixed utterances in a corpus are.
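As a minimal sketch, the utterance-level CMI can be computed from per-token language tags (assuming 'univ' marks language-independent tokens, and equal weights wm = wp = 0.5):

```python
from collections import Counter

def cmi(lang_tags, w_m=0.5, w_p=0.5):
    """Utterance-level Code-Mixing Index (revised measure).

    lang_tags: per-token language tags; 'univ' marks
    language-independent tokens.  Weights satisfy w_m + w_p = 1.
    """
    tokens = [t for t in lang_tags if t != 'univ']   # drop language-independent tokens
    n = len(tokens)
    if n == 0:                                       # only universal tokens
        return 0.0
    max_t = Counter(tokens).most_common(1)[0][1]     # matrix-language token count
    f_m = (n - max_t) / n                            # fraction outside the matrix language
    p = sum(1 for a, b in zip(tokens, tokens[1:]) if a != b)  # code alternation points
    f_p = p / n
    return 100 * (w_m * f_m + w_p * f_p)

print(cmi(['hi', 'hi', 'hi']))                  # monolingual: 0.0
print(cmi(['en', 'hi', 'hi', 'univ', 'en']))    # mixed: 50.0
```

CMI-ALL would then average cmi over all utterances of a corpus, and CMI-MIXED over only those utterances with a non-zero value.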

Table 2: Code-Mixing in Various Corpora

Testing the idea that the Code-Mixing Index can describe the complexity of code-switched corpora, we used it to compare the level of language mixing in our English-Hindi corpus (in total, and each of the Facebook and Twitter parts in isolation) to that of the English-Hindi corpus of Vyas et al. (2014), the Dutch-Turkish corpus introduced by Nguyen and Doğruöz (2013), and the corpora used in the 2014 shared tasks at FIRE and EMNLP. Table 2 shows the average CMI values for these corpora, both over all utterances and over only the utterances having a non-zero CMI (i.e., the utterances that contain some code-mixing). The last column of the table gives the fraction of mixed utterances in the respective corpora.

Obviously, code-mixing is more common in geographical regions with a high percentage of bilingual individuals, such as Texas and California in the US, Hong Kong and Macao in China, many European and African countries, and the countries of South-East Asia. Multi-linguality (and hence code-mixing) is very common in India, which has close to 500 spoken languages (or over 1,600, on some accounts), with about 30 languages having more than 1 million speakers. Language diversity and dialect changes trigger Indians to frequently change and mix languages, in particular in speech and in social media contexts. More importantly, Indians and others in the sub-continent mix more vigorously than others, as is clear from the EMNLP corpus section of Table 2. The ICON 2015 tool contest data was in the range of 13.38 (CMI-ALL) to 21.90 (CMI-MIXED); the 2016 data is expected to show the same or a higher level of mixing.

IMPORTANT DATES (all dates are tentative)

Registration for the task begins: 7th Aug 2016

Training/Dev data release: 10th Aug 2016

Test Set release: 28th Sep 2016

Submit Run: Within 24 hours of receiving the test data

Results announced: 3rd Oct 2016

Working Notes submission deadline: 15th Oct 2016

Working Notes reviews: 1st Nov 2016

Working Notes final versions due: 15th Nov 2016

REGISTRATION

Please fill out this form: Link to express your interest in taking part in this contest. Due to the privacy policies of Facebook and WhatsApp, we will not be able to release the data publicly. Once you have submitted the form, please write to me: Request for the Data.

PREVIOUS YEAR'S RESULTS AND PAPERS

In total, 8 teams participated. We calculated the average score of each system over the three languages and found the following rank order.

IIITH: 1st (76.79%)
AMRITA_CEN: 2nd (75.79%)
KS_JU: 3rd (75.6%)

Detailed results from the previous year can be downloaded.

Previous year's data: download 2015 data.

Reports of the top 4 teams:

IIITH: Arnav Sharma, and Raveesh Motlani. POS Tagging For Code-Mixed Indian Social Media Text : Systems from IIIT-H for ICON NLP Tools Contest.

AMRITA: Anand Kumar M, and Soman K P. AMRITA_CEN@ICON-2015: Part-of-Speech Tagging on Indian Language Mixed Scripts in Social Media.

JU: Kamal Sarkar. Part-of-Speech Tagging for Code-mixed Indian Social Media Text at ICON 2015.

CDAC MUMBAI: Prakash B. Pimpale, and Raj Nath Patel. Experiments with POS Tagging Code-mixed Indian Social Media Text.

AFTER ICON 2016

We will be releasing the data for research.

CONTACT

amitava {DOT} das {AT} iiits {DOT} in

OTHER FORUMS ON CODE-MIXING

FIRE Shared Task on Mixed Script Information Retrieval

First Workshop on Computational Approaches to Code Switching with EMNLP 2014

First Workshop on Language Technologies for Indian Social Media Text (सOCIAL-ईNDIA)

REFERENCES