New: I have a couple of RA/TA positions, which could be turned into PhD positions depending on satisfactory performance. Contact me for details.

CS 491: Natural Language Processing (Odd Semester)

Lecture 1



8th Aug, 2015



Regular Expression for tokenization and sentence boundary identification.


Find a data set for your respective language on Twitter/Facebook

Run tokenization and sentence boundary identification on the data
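
As a starting point for this exercise, here is a minimal Python sketch of regex-based tokenization and sentence splitting. The token pattern, the set of sentence-final punctuation (including the Devanagari danda), and the function names are illustrative assumptions, not a prescribed solution. Abbreviations such as "Dr." will be split incorrectly and need language-specific handling.

```python
import re

# Alternatives are tried left to right, so URLs and hashtags win over
# plain words. The pattern is an assumption made for illustration.
TOKEN_RE = re.compile(
    r"""
      https?://\S+          # URLs
    | [@#]\w+               # @mentions and #hashtags
    | \w+(?:['’-]\w+)*      # words, allowing don't or e-mail
    | \S                    # any other single non-space character
    """,
    re.VERBOSE,
)

# A sentence ends at ., !, ? or the danda when whitespace follows.
SENT_END_RE = re.compile(r"(?<=[.!?।])\s+")


def split_sentences(text):
    """Split text into sentences at sentence-final punctuation."""
    return [s for s in SENT_END_RE.split(text.strip()) if s]


def tokenize(sentence):
    """Return the list of tokens matched by TOKEN_RE."""
    return TOKEN_RE.findall(sentence)


if __name__ == "__main__":
    post = "Loving #NLP class! See http://t.co/abc now. Don't miss it."
    for sentence in split_sentences(post):
        print(tokenize(sentence))
```

The URL alternative comes first so that the dots inside a link are never treated as separate tokens.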



Lecture 2

Minimum Edit Distance - Spell Correction


22nd Aug, 2015



Write code for MED: standard, with alignment, and weighted.


Run on the data I provided

More details in mail
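
The three MED variants can share one dynamic-programming routine. The sketch below is one possible layout, assuming that "weighted" means configurable insertion, deletion, and substitution costs; the function and parameter names are hypothetical. Integer costs are assumed so that the backtrace can compare cell values exactly.

```python
def edit_distance(src, tgt, ins_cost=1, del_cost=1, sub_cost=lambda a, b: 1):
    """Return (distance, alignment) between src and tgt.

    alignment is a list of (src_char, tgt_char) pairs, where "*" marks
    an insertion or deletion gap.
    """
    n, m = len(src), len(tgt)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = src[i - 1], tgt[j - 1]
            diag = dist[i - 1][j - 1] + (0 if a == b else sub_cost(a, b))
            dist[i][j] = min(diag,
                             dist[i - 1][j] + del_cost,
                             dist[i][j - 1] + ins_cost)

    # Backtrace from the bottom-right cell to recover one optimal alignment.
    alignment, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            a, b = src[i - 1], tgt[j - 1]
            diag = dist[i - 1][j - 1] + (0 if a == b else sub_cost(a, b))
            if dist[i][j] == diag:
                alignment.append((a, b))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + del_cost:
            alignment.append((src[i - 1], "*"))
            i -= 1
        else:
            alignment.append(("*", tgt[j - 1]))
            j -= 1
    alignment.reverse()
    return dist[n][m], alignment


if __name__ == "__main__":
    # Standard Levenshtein distance (all costs 1).
    print(edit_distance("intention", "execution")[0])
    # Weighted variant: substitutions cost 2.
    print(edit_distance("intention", "execution", sub_cost=lambda a, b: 2)[0])
```

Passing a different `sub_cost` function, for example one based on keyboard adjacency, gives a spelling-oriented weighting without changing the rest of the routine.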


Lecture 3

Web Page Indexing - Basics


5th Sept, 2015



Create a positional index file for your language

I will give an individual query to each group, and the group has to send me the output.

Download Corpus: Choose your language. Choose any 5K docs.







Steps: tokenize, remove stop words
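
A positional index maps each term to the documents it occurs in and to the token positions inside each document. The sketch below is a minimal in-memory version; the stop-word list is a placeholder, and the function names and the simple `\w+` tokenizer are assumptions made for illustration. Positions are counted before stop-word removal so that phrase queries keep the original word spacing.

```python
import re
from collections import defaultdict

# Placeholder stop-word list; replace it with one for your language.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "on", "for"}


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def build_positional_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenize(text)):
            if token not in STOP_WORDS:
                index[token][doc_id].append(pos)
    return index


def phrase_query(index, phrase):
    """Return {doc_id: [positions of the first indexed phrase term]}."""
    terms = [(off, t) for off, t in enumerate(tokenize(phrase))
             if t not in STOP_WORDS]
    if not terms:
        return {}
    first_off, first_term = terms[0]
    hits = {}
    for doc_id, positions in index.get(first_term, {}).items():
        matches = [
            p for p in positions
            if all((p + off - first_off) in index.get(t, {}).get(doc_id, ())
                   for off, t in terms[1:])
        ]
        if matches:
            hits[doc_id] = matches
    return hits


if __name__ == "__main__":
    docs = {1: "The cat sat on the mat", 2: "A mat for the cat"}
    index = build_positional_index(docs)
    print(dict(index["cat"]))
    print(phrase_query(index, "cat sat"))
```

For the index file itself, the nested dictionary could be written out with the standard `json` module; note that integer document ids become strings in JSON.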



Lecture 4

Language Modelling - Smoothing - Language Identification


12th Sept, 2015



Download Google bigram corpus: link

Assignment 1: Predict the Next Word. Corpus: Twitter corpus (given in the first assignment)


  • 1. Take 5000 random tweets. Divide them into 10 sets, so each set consists of 500 tweets.
  • 2. Automatically delete the i-th word from each tweet in the i-th set. For example, delete the 5th word from each tweet in set 5.
  • 3. Predict all the deleted words using bigrams, trigrams, and quadgrams from the Google n-grams.
  • 4. Report to me the average accuracy of the experiments using bigrams, trigrams, and quadgrams.
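
The four steps above can be sketched as follows. This is only an outline under stated assumptions: the n-gram counts are assumed to be already loaded into a dictionary keyed by word tuples (the actual Google n-gram file format is not covered here), a hypothetical `<s>` symbol pads the start of a tweet, and only the left context is used for prediction. All function names are made up for the example.

```python
def make_cloze_items(tweets_by_set):
    """For set i (1-based), remove the i-th word of every tweet.

    Returns a list of (left_context_words, deleted_word) pairs.
    Tweets shorter than i words are skipped.
    """
    items = []
    for i, tweets in enumerate(tweets_by_set, start=1):
        for tweet in tweets:
            words = tweet.split()
            if len(words) >= i:
                items.append((words[:i - 1], words[i - 1]))
    return items


def build_predictor(ngram_counts, n):
    """Map each (n-1)-word history to its most frequent next word."""
    best = {}
    for gram, count in ngram_counts.items():
        if len(gram) != n:
            continue
        history, word = gram[:-1], gram[-1]
        if history not in best or count > best[history][1]:
            best[history] = (word, count)
    return {history: word for history, (word, _) in best.items()}


def prediction_accuracy(items, predictor, n):
    """Fraction of deleted words recovered from the left context."""
    correct = 0
    for left, gold in items:
        padded = ["<s>"] * (n - 1) + left
        history = tuple(padded[-(n - 1):])
        if predictor.get(history) == gold:
            correct += 1
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    # Toy counts standing in for the real n-gram data.
    counts = {("<s>", "i"): 5, ("i", "love"): 10,
              ("i", "hate"): 3, ("love", "nlp"): 7}
    sets = [["i love nlp"], ["i love nlp"], ["i hate nlp"]]
    items = make_cloze_items(sets)
    print(prediction_accuracy(items, build_predictor(counts, 2), 2))
```

Running the same evaluation with `n` set to 2, 3, and 4 gives the three accuracies to compare.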


Assignment 2: Create your own Language Model: bigram and trigram from the Twitter Corpus


  • Use the same 5K tweets for training
  • Use Laplace Smoothing
  • Calculate Perplexity on a new 5K set
  • Repeat the steps of Assignment 1 and report to me the accuracy of your language model in comparison with the Google n-grams
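
For reference, a bigram model with Laplace (add-one) smoothing estimates P(w | prev) = (count(prev, w) + 1) / (count(prev) + V), where V is the vocabulary size, and perplexity is the exponential of the average negative log-probability per predicted token. The class below is a minimal sketch; the `<s>`, `</s>`, and `<unk>` symbols and the class name are assumptions for the example, and a trigram version would follow the same pattern with two-word histories.

```python
import math
from collections import Counter


class LaplaceBigramLM:
    def __init__(self, sentences):
        """sentences: list of token lists used for training."""
        self.history_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = {"</s>", "<unk>"}
        for sent in sentences:
            tokens = ["<s>"] + list(sent) + ["</s>"]
            self.vocab.update(sent)
            # Every token except the last serves as a bigram history.
            self.history_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(self, prev, word):
        """Add-one smoothed P(word | prev)."""
        v = len(self.vocab)
        return (self.bigram_counts[(prev, word)] + 1) / (self.history_counts[prev] + v)

    def perplexity(self, sentences):
        """Perplexity over held-out sentences; unseen words map to <unk>."""
        log_sum, n = 0.0, 0
        for sent in sentences:
            mapped = [w if w in self.vocab else "<unk>" for w in sent]
            tokens = ["<s>"] + mapped + ["</s>"]
            for prev, word in zip(tokens, tokens[1:]):
                log_sum += math.log(self.prob(prev, word))
                n += 1
        return math.exp(-log_sum / n)


if __name__ == "__main__":
    train = [["i", "love", "nlp"], ["i", "love", "tweets"]]
    lm = LaplaceBigramLM(train)
    print(lm.prob("i", "love"))
    print(lm.perplexity([["i", "love", "data"]]))
```

Lower perplexity on the held-out set means the model assigns higher probability to unseen tweets.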


Assignment 3: Create Language Identifier


  • Download the corpus: link. Divide it into training (60%), development (20%), and test (20%) sets.
  • Create language profiles from the training set
  • Understand the distance measure: determine the threshold from the development set.
  • Apply the same on the test set and report the accuracy to me.
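
The assignment does not name a specific method. One common reading of "language profiles" plus a "distance measure" is the rank-order character n-gram approach of Cavnar and Trenkle, which is what the sketch below assumes. The profile size, the maximum n-gram length, and all function names are illustrative choices, and the toy training strings stand in for the real corpus.

```python
from collections import Counter


def build_profile(text, max_n=3, top_k=300):
    """Rank the most frequent character n-grams: {ngram: rank}."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, penalty):
    """Sum of rank differences; n-grams missing from the language profile
    receive the maximum penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles, top_k=300):
    """Return the language whose profile is closest to the text."""
    doc_profile = build_profile(text, top_k=top_k)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang], top_k),
    )


if __name__ == "__main__":
    # Toy training text; real profiles come from the training set.
    profiles = {
        "en": build_profile("the quick brown fox jumps over the lazy dog"),
        "hi": build_profile("यह एक छोटा सा वाक्य है जो हिंदी में लिखा गया है"),
    }
    print(identify("the dog", profiles))
```

A rejection threshold on the best distance, chosen on the development set, could be layered on top of `identify` to return "unknown" for texts that are far from every profile.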


Assignment 4: Create word-level Language Identifier for Code-Mixed text


  • Collect the corpus from me via email. Divide it into training (60%), development (20%), and test (20%) sets.
  • Create language profiles from the training set using character-level n-grams
  • Understand the distance measure: determine the threshold from the development set.
  • Apply the same on the test set and report the accuracy to me.
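
The 60/20/20 split, the threshold choice on the development set, and the accuracy report can be organized as in the sketch below. It assumes that each item is a (word, language) pair for the word-level task and that the threshold is picked by maximizing development-set accuracy; the function names and the fixed random seed are illustrative only.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle and split items into training, development, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_dev = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold one."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def pick_threshold(candidates, dev_accuracy):
    """Return the candidate threshold with the best development accuracy.

    dev_accuracy is a function threshold -> accuracy on the dev set.
    """
    return max(candidates, key=dev_accuracy)


if __name__ == "__main__":
    # Toy code-mixed words with language labels.
    data = [("main", "hi"), ("love", "en"), ("karta", "hi"), ("you", "en"),
            ("hoon", "hi"), ("very", "en"), ("bahut", "hi"), ("much", "en"),
            ("pyaar", "hi"), ("thanks", "en")]
    train, dev, test = split_60_20_20(data)
    print(len(train), len(dev), len(test))
```

Keeping the test set untouched until the threshold is fixed avoids an optimistic accuracy estimate.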



Course Description

This is an introductory natural language processing (NLP) course. The broader goal is to understand how NLP tasks are carried out in the real world (e.g., Web, social media) and how to build tools for solving practical text mining problems. Throughout the course, emphasis will be placed on understanding NLP concepts and tying NLP techniques to specific real-world applications through hands-on experience. The course covers fundamental topics in statistical machine learning and touches upon social media text processing, sentiment analysis, and a bit of Big Data analysis.

Theory to be covered

  • Introduction: Why NLP
  • Regular Expression
  • Tokenization
  • Stemming
  • Sentence Boundary Detection
  • Spell Correction
  • Minimum Edit Distance
  • Language Modeling
  • N-Gram
  • Smoothing
  • Language Identification
  • POS Tagging
  • Text Classification
  • Sentiment Analysis
  • Dependency Parsing
  • Information Retrieval

Practical Aspects to be covered

  • WEKA: Machine Learning Toolkit
  • Hidden Markov Models
  • Naive Bayes
  • Support Vector Machines
  • Stanford NLP
  • Social Media and NLP

References

    Text Book

  • Jurafsky, Dan and Martin, James, Speech and Language Processing, Second Edition, Prentice Hall, 2008.

    References

  • Manning, Christopher and Schütze, Hinrich, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
  • Charniak, Eugene, Statistical Language Learning, MIT Press, 1993.

    Open access study materials

  • link by Jurafsky and Manning, Stanford
  • link by Pushpak Bhattacharyya, IIT Bombay

    Evaluation and Grading


    Expect multiple mini projects on various aspects of NLP, plus a final project that will be assigned group-wise.

    Scores obtained in all the components of evaluation shall be totaled and the final score will be converted into letter grades (A, B, C, D, E, or NC) as per NIIT University policy.

    Attendance Policy

    Attendance will be taken every day, and missing class can be expected to significantly reduce your chances of success. Classes will not be repeated.

    Missing Exams

  • If you miss an exam due to an unexcused absence, you will receive a grade of 0 for that quiz/exam.
  • If you miss an exam due to an excused absence, you must provide appropriate verification within one week of the quiz/exam. You will then be allowed to take the make-up exam at a date/time to be decided later. The make-up exam may be SIGNIFICANTLY MORE DIFFICULT than the original exam.
  • If you cannot be at the final exam, let me know as soon as you know.
  • No excuses will be entertained for the final project. If you do not work on the project or fail to submit the report, you will receive a grade of 0.

    A Few Obligatory Points

  • You must have an NU email account: sometimes I will communicate via email. I will ask students to create a mail group for easy group communication.
  • By enrolling in this course, you agree to the NIIT University Policies.
  • Electronic Devices: Remember to turn off all electronic communication devices at the beginning of each class. I hope you will be cooperative with me and with your fellow students.