NLP@NU-2015

CS 491: Natural Language Processing (Odd Semester)

Lecture 4			Language Modelling - Smoothing - Language Identification

Date			12th Sept, 2015

Lab/Assignment			Download the Google bigram corpus: link

Assignment 1: Predict the Next Word

Corpus: Twitter corpus (given in the first assignment)

Steps:

1. Take 5,000 random tweets and divide them into 10 sets of 500 tweets each.
2. Automatically delete the i-th word from every tweet in the i-th set. For example, delete the 5th word from each tweet in set 5.
3. Predict all the deleted words using bigrams, trigrams, and 4-grams from the Google n-grams.
4. Report to me the average accuracy of the experiments for bigrams, trigrams, and 4-grams.

Assignment 2: Create Your Own Language Model (bigram and trigram) from the Twitter Corpus

Steps:

1. Use the same 5K tweets for training.
2. Use Laplace smoothing.
3. Calculate perplexity on a new set of 5K tweets.
4. Repeat the steps of Assignment 1 and report to me the accuracy of your language model in comparison with the Google n-grams.

Assignment 3: Create a Language Identifier

Steps:

1. Download the corpus: link
2. Divide it into training (60%), development (20%), and test (20%) sets.
3. Create language profiles from the training set.
4. Understand the distance measure and set its threshold using the development set.
5. Apply the same to the test set and report to me the accuracy.

Assignment 4: Create a Word-Level Language Identifier for Code-Mixed Text

Steps:

1. Collect the corpus from me via email.
2. Divide it into training (60%), development (20%), and test (20%) sets.
3. Create language profiles from the training set using character-level n-grams.
4. Understand the distance measure and set its threshold using the development set.
5. Apply the same to the test set and report to me the accuracy.
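
As a possible starting point for Assignment 2, here is a minimal sketch of a bigram language model with Laplace (add-one) smoothing and a perplexity calculation. The function names (`train_bigram`, `laplace_prob`, `perplexity`, `predict_next`) are illustrative, not part of any library, and the sketch assumes tweets are already tokenized into lists of words. It also simplifies the vocabulary: the size V is taken as the number of distinct training tokens including the padding symbols, with no special handling of out-of-vocabulary words.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def laplace_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one estimate: P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def perplexity(sentences, unigrams, bigrams, vocab_size):
    """Perplexity = exp of the negative average log-probability per predicted token."""
    log_sum, n = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            log_sum += math.log(laplace_prob(prev, word, unigrams, bigrams, vocab_size))
            n += 1
    return math.exp(-log_sum / n)


def predict_next(prev, bigrams):
    """Most frequent word seen after `prev` in training, or None if `prev` was never seen."""
    candidates = {w: c for (p, w), c in bigrams.items() if p == prev}
    return max(candidates, key=candidates.get) if candidates else None


# Toy usage with two hypothetical "tweets".
train = [["i", "like", "nlp"], ["i", "like", "tea"]]
unigrams, bigrams = train_bigram(train)
vocab_size = len(unigrams)  # simplification: includes <s> and </s>, no OOV slot
print(predict_next("i", bigrams))
print(perplexity([["i", "like", "tea"]], unigrams, bigrams, vocab_size))
```

A trigram model follows the same pattern with two words of context. For the comparison with the Google n-grams, the same `predict_next` idea applies, with the counts read from the downloaded corpus instead of your own tweets.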

Course Description

This is an introductory natural language processing (NLP) course. The broader goal is to understand how NLP tasks are carried out in the real world (e.g., Web, social media) and how to build tools for solving practical text mining problems. Throughout the course, emphasis will be placed on understanding NLP concepts and tying NLP techniques to specific real-world applications through hands-on experience. The course covers fundamental topics in statistical machine learning and touches upon social media text processing, sentiment analysis, and a bit of Big Data analysis.

Theory to be covered

Introduction: Why NLP

Regular Expression

Tokenization

Stemming

Sentence Boundary Detection

Spell Correction

Minimum Edit Distance

Language Modeling

N-Gram

Smoothing

Language Identification

POS Tagging

Text Classification

Sentiment Analysis

Dependency Parsing

Information Retrieval

Practical Aspects to be covered

WEKA: Machine Learning Toolkit

Hidden Markov Models

Naive Bayes

Support Vector Machines

Stanford NLP

Social Media and NLP

References

Text Book

Jurafsky, Dan and Martin, James, Speech and Language Processing, Second Edition, Prentice Hall, 2008.

References

Manning, Christopher and Schütze, Hinrich, Foundations of Statistical Natural Language Processing, MIT Press, 1999.

Charniak, Eugene, Statistical Language Learning, MIT Press, 1993.

Open access study materials

link by Jurafsky and Manning, Stanford

link by Pushpak Bhattacharyya, IIT Bombay

Evaluation and Grading

Project

Expect multiple mini projects on various aspects of NLP, plus a final project that will be assigned to groups.

Scores obtained in all the components of evaluation shall be totaled and the final score will be converted into letter grades (A, B, C, D, E, or NC) as per NIIT University policy.

Attendance Policy

Attendance will be taken every day, and missing class can be expected to significantly reduce your chances of success. Lectures will not be repeated.

Missing Exams

If you miss an exam due to an unexcused absence, you will receive a grade of 0 for that quiz/exam.

If you miss an exam due to an excused absence, you must provide appropriate verification within one week of the quiz/exam. You will then be allowed to take the make-up exam at a date/time to be decided later. The make-up exam may be SIGNIFICANTLY MORE DIFFICULT than the original exam.

If you cannot be at the final exam, let me know as soon as you know.

No excuses will be entertained for the final project. If you do not work on the project or fail to submit the report, you will receive a grade of 0.

A Few Obligatory Points

You must have an NU email account: sometimes I will communicate via email. I will ask students to create a mail group for easy group communication.

By enrolling in this course, you agree to the NIIT University Policies.

Electronic Devices: Remember to turn off all electronic communication devices at the beginning of each class. I hope you will be cooperative with me and with your fellow students.

Lecture 2			Minimum Edit Distance - Spell Correction

Date			22nd Aug, 2015

Lab/Assignment			Lab: Write code for minimum edit distance (MED): the basic version, the version with alignment, and the weighted version.

Assignment: Run your code on the data I provided. More details are in the mail.
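
To make the three MED variants concrete, here is a minimal dynamic-programming sketch that returns both the distance and one optimal alignment. The function name `min_edit_distance` and the cost parameters are illustrative assumptions. "Weighted" is read here in its simplest sense, with a fixed cost per operation type; a confusion-matrix weighting per character pair would replace `sub_cost` with a lookup. Costs are assumed to be integers so that the backtrace can compare cell values exactly.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Return (distance, alignment) between two strings.

    alignment is a list of (source_char, target_char) pairs,
    with "*" marking a gap (an insertion or a deletion).
    """
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]

    # Border cells: turning a prefix into the empty string and vice versa.
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost

    # Fill the table: each cell is the cheapest of delete, insert, substitute/match.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,
                dist[i][j - 1] + ins_cost,
                dist[i - 1][j - 1] + (0 if same else sub_cost),
            )

    # Backtrace from the bottom-right cell to recover one optimal alignment.
    alignment = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and source[i - 1] == target[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else sub_cost):
            alignment.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + del_cost:
            alignment.append((source[i - 1], "*"))
            i -= 1
        else:
            alignment.append(("*", target[j - 1]))
            j -= 1
    alignment.reverse()
    return dist[n][m], alignment


distance, alignment = min_edit_distance("intention", "execution")
print(distance)  # 8 with substitution cost 2
print(min_edit_distance("intention", "execution", sub_cost=1)[0])  # 5 (plain Levenshtein)
```

Setting all three costs to 1 gives the plain Levenshtein distance; the default substitution cost of 2 treats a substitution as one deletion plus one insertion.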

Lecture 3			Web Page Indexing - Basics

Date			5th Sept, 2015

Lab/Assignment			Create a positional index file for your language. I will give an individual query to each group, and the group has to send me the output.

Download corpus: choose your language and any 5K documents. Available languages: English, Hindi, Bengali, Tamil, Gujarati, Marathi.

Steps: tokenize, remove stop words.
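
The indexing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the required solution: the names `tokenize`, `build_positional_index`, and `phrase_query` are hypothetical, the stop-word list is a tiny English placeholder, and the `\w+` tokenizer is only adequate for English (it splits Indic-script words at combining vowel signs, so Hindi, Bengali, Tamil, Gujarati, and Marathi need a script-aware tokenizer). Positions are counted before stop words are dropped, so adjacency in the original text is preserved.

```python
import re
from collections import defaultdict

# Placeholder list for illustration; use a proper stop-word list for your language.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase and split on non-word characters (English-oriented)."""
    return re.findall(r"\w+", text.lower())


def build_positional_index(docs, stop_words=STOP_WORDS):
    """docs maps doc_id -> text. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenize(text)):
            if token in stop_words:
                continue  # skipped, but the position counter still advances
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, first, second):
    """Doc ids in which `second` occurs immediately after `first`."""
    postings_a = index.get(first, {})
    postings_b = index.get(second, {})
    hits = []
    for doc_id in postings_a.keys() & postings_b.keys():
        later = set(postings_b[doc_id])
        if any(p + 1 in later for p in postings_a[doc_id]):
            hits.append(doc_id)
    return sorted(hits)


docs = {1: "the cat sat on the mat", 2: "the mat sat on a cat"}
index = build_positional_index(docs)
print(index["cat"])                       # {1: [1], 2: [5]}
print(phrase_query(index, "cat", "sat"))  # [1]
```

Writing the index to a file is then a matter of serializing this dictionary, for example one term per line followed by its document ids and position lists.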

Lecture 1			Introduction

Date			8th Aug, 2015

Lab/Assignment			Lab: Write regular expressions for tokenization and sentence boundary identification.

Assignment: Find a data set for your respective language on Twitter/Facebook, then run tokenization and sentence boundary identification on the data.
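
As one possible way to begin the lab, here is a small sketch of regex-based tokenization and sentence splitting for social media text. The pattern and the function names `tokenize` and `split_sentences` are assumptions made for illustration. The rules are deliberately naive: the sentence splitter breaks after ., ! or ? when followed by whitespace and an uppercase ASCII letter, so it will wrongly split after abbreviations such as "Dr." and will not work for scripts without letter case.

```python
import re

# Alternatives are tried left to right, so URLs and @mentions/#hashtags
# are matched before the generic word and punctuation rules.
TOKEN_RE = re.compile(
    r"""
    https?://\S+        # URLs
  | [@#]\w+             # @mentions and #hashtags
  | \w+(?:'\w+)*        # words, keeping internal apostrophes (doesn't)
  | [^\w\s]             # any single punctuation mark or symbol
    """,
    re.VERBOSE,
)

# Boundary: whitespace that follows . ! or ? and precedes an uppercase letter.
SENTENCE_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Split text into sentences at the naive boundary defined above."""
    return SENTENCE_BOUNDARY_RE.split(text.strip())


print(tokenize("@nlp loves #tweets, doesn't it?"))
# ['@nlp', 'loves', '#tweets', ',', "doesn't", 'it', '?']
print(split_sentences("I like NLP. It is fun! Is it?"))
# ['I like NLP.', 'It is fun!', 'Is it?']
```

For the assignment data, expect to extend both patterns: emoticons, elongated words, and language-specific punctuation (for example the danda in Hindi and Bengali) all need their own rules.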