Decoding Texting Language to Standard Language

Overview:

Language use in computer-mediated discourse, such as chats, emails and SMS texts, differs significantly from the standard form of the language. The urge for shorter messages that are faster to type, together with the need for semantic clarity, shapes the structure of this non-standard form, known as the texting language. We formally investigate the nature and types of compression used in SMS texts and, based on the findings, develop a decoder from the texting language to the standard language. Given an input text in the texting language, the decoder converts it to the corresponding standard form. The decoder is built using a Hidden Markov Model based machine learning approach, in which the model parameters are learnt from a parallel corpus of texting-language and standard-language text. The word-level accuracy of the system is 92%.
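To give a feel for how an HMM can act as a decoder, here is a minimal sketch of Viterbi decoding. It is not the architecture of our system, which this page does not describe: for illustration it simply assumes that hidden states are standard-language words, observations are texting-language tokens, and all probabilities are made-up toy values. The names viterbi, STATES, START_P, TRANS_P and EMIT_P are hypothetical.

```python
import math
from typing import Dict, List

def viterbi(obs: List[str], states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]],
            floor: float = 1e-9) -> List[str]:
    """Return the most likely hidden state sequence for the observations."""
    if not obs:
        return []

    def lp(p: float) -> float:
        # Log probability, with a small floor so unseen events stay finite.
        return math.log(max(p, floor))

    # best[t][s] = (log score of the best path ending in s at time t, predecessor)
    best = [{s: (lp(start_p.get(s, 0.0)) + lp(emit_p.get(s, {}).get(obs[0], 0.0)), None)
             for s in states}]
    for o in obs[1:]:
        prev = best[-1]
        col = {}
        for s in states:
            score, arg = max((prev[r][0] + lp(trans_p.get(r, {}).get(s, 0.0)), r)
                             for r in states)
            col[s] = (score + lp(emit_p.get(s, {}).get(o, 0.0)), arg)
        best.append(col)

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for col in reversed(best[1:]):
        path.append(col[path[-1]][1])
    return path[::-1]

# Toy parameters (invented for illustration, NOT learnt from the corpus).
STATES = ["see", "sea", "you"]
START_P = {"see": 0.5, "sea": 0.3, "you": 0.2}
TRANS_P = {
    "see": {"see": 0.1, "sea": 0.1, "you": 0.8},
    "sea": {"see": 0.3, "sea": 0.2, "you": 0.5},
    "you": {"see": 0.5, "sea": 0.3, "you": 0.2},
}
EMIT_P = {
    "see": {"c": 0.6, "see": 0.4},
    "sea": {"c": 0.5, "sea": 0.5},
    "you": {"u": 0.7, "you": 0.3},
}

decoded = viterbi(["c", "u"], STATES, START_P, TRANS_P, EMIT_P)
print(" ".join(decoded))
```

In a trained decoder the transition and emission tables would be estimated from the parallel corpus rather than written by hand.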

Resources:

Around 900 SMS texts were collected from the website http://www.treasuremytext.com, which hosts a large number of SMS texts in various languages, uploaded anonymously by its users. These messages were manually translated to their standard form (thanks to the staff and researchers of the Communication Empowerment Laboratory, Indian Institute of Technology, Kharagpur) and automatically aligned at the word level using a heuristic algorithm. The accuracy of the alignment algorithm is around 80%.

We also extracted a list of English words and their corresponding variations in the SMS texts, along with their frequencies. The list was manually cleaned so that it can be used directly for training.

The aligned corpora can be downloaded from here. For information on the alignment format, click here.
The word-variation file can be downloaded from here. For instructions on the format, click here.

 

Test Sets:
TS2 can be downloaded from here. It consists of 1228 randomly selected unseen tokens. The format of the file is as follows:

<TL_Token freq_of_TL_token>{\n = <SL_Token freq_of_SL_Token>}+

Here the SL_Tokens are the translations of the TL_Token, and freq_of_SL_Token is the number of times that SL_Token occurs as a translation of the TL_Token, NOT the absolute token frequency of the SL_Token. Note also that TS3 is just the subset of TS2 in which the TL_Token is not the same as the SL_Token.
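The format above can be read with a few lines of Python. The sketch below rests on assumptions that should be checked against the actual file: the angle brackets are taken to be notation rather than literal characters, each token is separated from its frequency by whitespace, and every translation line starts with "=". The function names parse_ts2 and changed_pairs and the sample data are hypothetical, and changed_pairs is only one possible reading of how TS3 is derived.

```python
from typing import List, Tuple

# (TL token, TL frequency, [(SL token, translation frequency), ...])
Entry = Tuple[str, int, List[Tuple[str, int]]]

def parse_ts2(text: str) -> List[Entry]:
    """Parse TS2-style text into a list of entries."""
    entries: List[Entry] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("="):
            # Translation line: belongs to the most recent TL token.
            if not entries:
                raise ValueError("translation line before any TL token")
            token, freq = line[1:].strip().rsplit(None, 1)
            entries[-1][2].append((token, int(freq)))
        else:
            # New TL token; the last whitespace-separated field is its frequency.
            token, freq = line.rsplit(None, 1)
            entries.append((token, int(freq), []))
    return entries

def changed_pairs(entries: List[Entry]) -> List[Tuple[str, str, int]]:
    """Keep only (TL, SL, freq) pairs where the two tokens differ (TS3-like)."""
    return [(tl, sl, f) for tl, _, sls in entries for sl, f in sls if tl != sl]

# Invented sample in the assumed layout (not taken from the real test set).
SAMPLE = """2day 5
= today 5
the 9
= the 9
gr8 3
= great 2
= grate 1
"""

for tl, sl, freq in changed_pairs(parse_ts2(SAMPLE)):
    print(tl, "->", sl, freq)
```

Splitting from the right with rsplit keeps a multi-word translation intact if one occurs, since only the final field is treated as the frequency.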

TS4 can be downloaded from here. It consists of 138 randomly selected SMS messages. The format of the file is as follows:

<SMS>sms message</SMS> \n <TRANS>Translated English message</TRANS>
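A file in this format can be loaded as a list of (SMS, translation) pairs. The following is a minimal sketch, assuming that the <SMS> and <TRANS> tags appear literally in the file and that each <SMS> element is immediately followed by its <TRANS> element; the name parse_ts4 and the sample messages are hypothetical.

```python
import re
from typing import List, Tuple

# One <SMS>...</SMS> element followed by its <TRANS>...</TRANS> element.
# DOTALL lets a message span several lines.
PAIR_RE = re.compile(r"<SMS>(.*?)</SMS>\s*<TRANS>(.*?)</TRANS>", re.DOTALL)

def parse_ts4(text: str) -> List[Tuple[str, str]]:
    """Return (sms, translation) pairs with surrounding whitespace removed."""
    return [(sms.strip(), trans.strip()) for sms, trans in PAIR_RE.findall(text)]

# Invented sample in the assumed layout (not taken from the real test set).
SAMPLE_TS4 = """<SMS>c u 2moro</SMS>
<TRANS>See you tomorrow</TRANS>
<SMS>gr8</SMS>
<TRANS>Great</TRANS>
"""

for sms, trans in parse_ts4(SAMPLE_TS4):
    print(sms, "=>", trans)
```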

Contact:

For any queries or clarifications, please write to:

Monojit Choudhury
Post Doctoral Researcher,
Microsoft Research India
"Scientia" 196/36 2nd main
Sadashivnagar, Bangalore
560 080 India

Ph No. +91-80-6658 6000 (6214)

    Email: monojit.choudhury [AT] gmail.com
    Homepage: http://www.cel.iitkgp.ernet.in/~monojit/

OR

Sudeshna Sarkar
Professor,
Department of Computer Science and Engineering,
Indian Institute of Technology, Kharagpur
India -- 721302

    Email: sudeshna [AT] cse.iitkgp.ernet.in
    Homepage: http://www.facweb.iitkgp.ernet.in/~sudeshna/