Echelon

Tabs - jQuery plugin for accessible, unobtrusive tabs
EVENT HEAD
Mallikarjun Karra
mallikarjun@robotix.in
+91-9735311596

Final submission deadline extended till the midnight of 28th January.

"Intelligence is the art of good guesswork."

--H.B. BARLOW, The Oxford Companion to the Mind

Over the past few decades, the concept of Artificial Intelligence has been one of the most celebrated and documented in the fields of science and technology. The famous British scientist Alan Turing revolutionized the concept of machines being able to think, which had earlier strayed into philosophy, by introducing the notion of machines that could act in a manner quite indistinguishable from thinking entities. The search has been frantic to find a way to combine the awesome computing power of machines with the ability to infer and imply autonomously. The real world applications of such developments are enormous, ranging from internet security to database processing, from industrial efficiency to medical accuracy. Security agencies like the NSA constantly monitor all electronic communications in order to prevent terrorist attacks. In fact, data scientists are the most sought after professionals today!

One of the pivotal aspects of Artificial Intelligence, is Natural Language Processing (NLP), which, as the name suggests, involves a range of computational techniques to achieve human-like processing of the languages we use to communicate. In keeping with its tradition of having problem statements that are innovative and practical, the online event of Robotix, for its 2012 edition Echelon, aims at the extensive usage of basic NLP for practical purposes.

Why should I participate?

  • Because it is a unique problem statement that you probably won't find elsewhere.
  • Because if you can implement an approach that yields a good result, you could present it at a decent AI conference.
  • Because you stand a chance of winning exciting cash prizes.
  • As said earlier, this is an NLP based event. Each participant is required to submit a computer program that is capable of processing electronic communication between multiple, anonymous netizens.

    From a random IRC (Internet Relay Chat) log, a small percentage of the user nicknames will be randomly deleted. The participant’s code must be able to fill in the missing user nicks. In other words, this event requires you to come up with an intelligent program that can figure out “who could have said what”.

    It may be assumed that all communication will be in modern English, with or without punctuation.

    For example -

    Original Chat:
    ..
    <user1>: Hey!
    <user2>: Hello! Where are you from, user1?
    <user3>: Can anybody help me out with Gnome installation?
    <user1>: India. user3, do you have the X Windows System installed?
    <user2>: Cool. What is Gnome, user3?
    <user3>: I don’t know. How do I check?
    <user3>: Its a desktop environment, user2.
    <user2>: Oh yeah! Just googled.
    <user1>: Type “startx” on the command line. Login as root and type “apt-get install gnome”.
    <user3>: Thanks!
    <user5>: I’m root, obey me!
    <user2>: Huh?!
    <user3>: user2, you better start using Linux!
    ...

    The following only will be given to the participant.

    Chat log with some nicks deleted:

    ..
    <user1>: Hey!
    <user2>: Hello! Where are you from, user1?
    <user3>: Can anybody help me out with Gnome installation?
    <user1>: India. user3, do you have the X Windows System installed?
    <user2>: Cool. What is Gnome, user3?
    <$$$>: I don’t know. How do I check?
    <$$$>: Its a desktop environment, user2.
    <user2>: Oh yeah! Just googled.
    <user1>: Type “startx” on the command line. Login as root and type “apt-get install gnome”.
    <user3>: Thanks!
    <$$$>: I’m root, obey me!
    <$$$>: Huh?!
    <user3>: user2, you better start using Linux!
    ...

    The participant’s code will have the task of replacing <$$$> with the appropriate user nicks. In ambiguous cases, like the random comment by <user5> in the above example (which could have been said by any other user too!), it may be better to leave a blank in order to avoid a negative mark (see the section on input and output in the submitting code section).

    1. C = Total number of <$$$>s replaced correctly.
      I = Total number of <$$$>s replaced incorrectly.

      Your final score = 3C - I

    2. The team with the highest score shall be declared the winner. In case of a persistent tie, execution time will be used as a tie-breaker (the lesser the better!).
    3. Up to 3 members per team are allowed.
    4. Use of third-party libraries is allowed (and encouraged).
    5. Participants are free to use any programming language (or a combination of two or more).
    6. Use of internet is not allowed. Team Robotix's computer will _not_ be connected to the internet at the time of evaluation.
    7. Restriction on RAM usage is 300 MB. Also, the application must not be larger than 3 MB, nor should it take more than 10 minutes for execution.
    8. If a particular submission generates files or makes changes to existing files after each run, the source will be re-extracted and re-built after each execution in a preliminary round. However, no such thing will be done for the 5 or so runs during the final round. The same source will be used run after run (even if there are changes within the program directory). Changes outside the program directory are strictly prohibited.
    9. The submission should be compatible with at least one of the following - a.) Linux - 32 bit b.) Linux - 64 bit c.) Windows7 - 32 bit d.) Windows7 - 64 bit.
    10. Decision of Team Robotix's computer is final and binding.


    Input and output

    Your project must take a raw text file name (which would be a chat log, obviously!) as a command line argument, and return a text file (output.txt, housed in the same directory as the project) containing a list of nicks, each separated by a comma (in the order of missing <$$$>s). Leave a space (between the preceding and following commas) if you are not sure and don't want to be awarded a negative mark.


    How do I submit my solution?


    Bundle your project as a .rar/.zip/.tar.gz, and mail it to echelon@robotix.inMention your team name in the subject. 

    Apart from project related files, the attachment must contain the following two files:

    1. team_id.txt - This file is supposed to contain the name(s), phone number(s), e-mail id(s) and institute/college affiliation(s) of the participating team member(s).
    2. essentials.txt - This file must contain a list of third-party libraries (with a link to the source) used to build the project, so that we can run your code. Other instructions, if any, may also be included.

    What is the deadline for submission?

    You can send a preliminary (even a partially working version) of your code before the midnight of Jan. 28, 2012. The latest version (mailed on the same e-mail thread as the preliminary submission) of your project will be evaluated on Jan. 29, 2012. And the results will be declared on 30th.

    This is a sample chat log for practice, with 10 deleted nicks. Note that the chat log that will be used to evaluate your submission will also be in the same format. You may find more chat logs to download and use here.

    Sample Chat log

    Solution


    Getting Started Guide


    The author of this article believes that the first step towards understanding any subject is to read Wikipedia articles on the same. Natural Language Processing (NLP), Parsing, Part of Speech Tagging, Named Entity Recognition, Question Answering, Information Extraction, Relationship Extraction and Sentiment Analysis are some of the terms that you should get acquainted with.

    The next step is to pick a programming language you are comfortable with and an NLP toolkit written in the same, and start testing the toolkit’s functionalities to perform various NLP tasks on a sample text. Once you are comfortable playing with them, you can start coding. The coding will then mainly involve knowing when and where to use the various toolkit functions and handling the retrieved information.

    The sample approach described here, is a pretty naive one and will not always produce correct results. However, this is something to help you get started on. To implement it, you will need Python and Python NLTK (Natural Language Toolkit) installed. Installation instructions may be found here. The NLTK getting started guide will give you a quick start.

    Don’t worry if you have no idea about the python language. It is easy to learn (won’t take more than a few days if you already know some other programming language). Also, the functions that we borrow from NLTK will be somewhat similar in other NLP libraries/toolkits.

    This approach covers just two techniques:
    1. Named Entity Recognition (NER) and Relation Extraction.
    2. Checking if the unknown user is directly involved in a question and answer exchange with another user.



    Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There exist 5 different criteria to determine the essence of named entities:

    Orthographic criteria: named entities are usually capitalized.

    Translation criteria: named entities usually do not translate from one language to the next. However, the transcribed names of places, monarchs, popes, and non-contemporary authors often share spelling, and are sometimes universal.

    Generic/specific criteria: named entities usually refer to single individuals. A mention of “John Smith” refers to an individual, but the gene “P53-wt” or the product “Canon EOS Rebel Xti” refer to multiple instances of entities. [source]

    Rigid designation criteria: named entities usually designate rigidly. Proper names and certain natural terms-including biological taxa and types of natural substances (most famously “water” and “H2O”) are rigidly designated. [source]

    Information Extraction (IE) criteria: named entities fill some predefined “Who? What? Where? When?” template. This surely includes money, measure, date, time, proper names, and themes such as accident, murder, election, etc.

    NER focuses on 4 things - time/age/date, names of people, names of organisations and locations/places. Reading chapter 7 of the nltk book should give ample idea about how to execute chunking, NER and relation extraction.

    Consider a chat snippet like -
    ...
    <user5>: Just back from a holiday. Dal lake was amazing.

    <$$$>: I’ve never been to J&K.

    <user4>: Anybody know how to tunnel a proxy here?

    <user8>: Srinagar is frequently hit by terror attacks.

    <user1>: Yes, user4. PM* me.

    * PM = Private Message.  

    Thus, in spite of many people being present in the chatroom, using NER and relation extraction, we can narrow down our search to a few nicks only, who are discussing the particular topic. In the above case, an NER and relation extraction would indicate that Dal lake is in Srinagar, J&K. Thus our code should infer that there is a strong possibility that the discussion is limited to <user5> and <user8>.

    Effectively, we are checking if there is any relationship between the named entities entered by the users and making an educated guess as to who <$$$> could be.

    This approach is not an exact science. “I’ve never been to J&K”, may have been entered by a random user who isn’t a part of the exchange between <user5> and <user8>. Can you think of ways to detect such a possibility? Or of picking the more likely user amongst 2-3 users discussing the same topic?

    The second technique that can be easily implemented is to check if there is a direct exchange of a question and an answer, as in case of <user1> and <user4>. Consider another chat snippet -

    <$$$>: How can I set a key for example ctrl+K to print a message? Please help.

    <user16>: user21: This question is addressed under section 1.10 of the FAQ.
    <user21>: Thanks.
    ...

    The algorithm would go like -
    1. Parse the <$$$> statement.
    2. If it is a question, check if someone has addressed a user directly in the messages immediately following the <$$$> statement. If so, <$$$> = <the addressee in the following statement>.
    3. If it is a statement addressed to someone, check if that someone has directed a question at a particular nick. If so, the nick to whom the question is directed, is <$$$>.


    The above two approaches were pretty naive. When combined together, will they be less naive?

    No matter how naive, if you can successfully implement an approach, it is sure to boost your confidence and spur you towards better implementations.

    Here are a few tips that you could consider incorporating:
    1. Keep track of time-stamps against each message. Why? How do you think they could help?
    2. Check for smileys (character emoticons). Do you think they could be helpful?
    3. Use a spell corrector (probably something like this one). Should you run the Named Entity Recognition before or after the spelling corrections are made?
    4. Keep track of incorrect spellings. Some people spell the same word in a particular way each time they use it.
    5. If you think smileys are helpful, you might find “sentiment analysis” even more helpful.
    6. Regular Expressions. Have a look at this.


    NB: NLP is an evolving field and there is no single solution that may be deemed perfect. Your solution may score 60% on one chat log and a mere 10% on another. The more the number of ideas you incorporate, the better will be your score.

    Links and references:
    1. List of NLP toolkits in various programming languages.
    2. Python NLTK source code.
    3. HOWTO guide for NLP tasks.
    4. The NLTK book.
    5. Named entity chunker (NLTK).
    6. Authorship Attribution in the Wild.
    7. Disentangling Chat.
    8. Named Entity Recognition in tweets.
    9. The Ling-Pipe Blog.

    Untitled Document