NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets

Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu
{Saif.Mohammad,Svetlana.Kiritchenko,Xiaodan.Zhu}@nrc-cnrc.gc.ca

In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), June 2013, Atlanta, Georgia, USA.

Official Rankings: Our team (NRC-Canada) ranked first in detecting sentiment of tweets (task 2B - tweets), first in detecting sentiment of SMS messages (task 2B - SMS), first in detecting sentiment of terms within a tweet (task 2A - tweets), and second in detecting sentiment of terms within an SMS message (task 2A - SMS).
About 44 teams participated.

ABSTRACT: In this paper, we describe how we created two state-of-the-art SVM classifiers, one to detect the sentiment of messages such as tweets and SMS (message-level task) and one to detect the sentiment of a term within a message (term-level task). Among submissions from 44 teams in a competition (SemEval-2013 Task 2), our submissions stood first in both tasks on tweets, obtaining an F-score of 69.02 in the message-level task and 88.93 in the term-level task. We implemented a variety of surface-form, semantic, and sentiment features. We also generated two large word-sentiment association lexicons, one from tweets with sentiment-word hashtags, and one from tweets with emoticons. The automatically generated lexicons were particularly useful. In the message-level task, the lexicon-based features provided a gain of 5 F-score points over and above that obtained using all other features. Both of our systems can be replicated using freely available resources.

FEATURES:

For tweet-level sentiment detection:

- all-caps: the number of words with all characters in upper case;
- clusters: presence/absence of tokens from each of the 1000 clusters (provided by Carnegie Mellon University's Twitter NLP tool);
- elongated words: the number of words with one character repeated more than 2 times, e.g. 'soooo';
- emoticons:
      - presence/absence of positive and negative emoticons at any position in the tweet;
      - whether the last token is a positive or negative emoticon;
- hashtags: the number of hashtags;
- negation: the number of negated contexts. A negated context also affects the ngram and lexicon features: each word in a negated context, and the polarity associated with it, is marked as negated (e.g., 'not perfect' becomes 'not perfect_NEG', 'POLARITY_positive' becomes 'POLARITY_positive_NEG');
- POS: the number of occurrences for each part-of-speech tag;
- punctuation:
      - the number of contiguous sequences of exclamation marks, question marks, and both exclamation and question marks;
      - whether the last token contains an exclamation or question mark;
- sentiment lexicons: automatically created lexicons (NRC Hashtag Sentiment Lexicon, Sentiment140 Lexicon), manually created sentiment lexicons (NRC Emotion Lexicon, MPQA, Bing Liu Lexicon). For each lexicon and each polarity we calculated:
      - total count of tokens in the tweet with score greater than 0;
      - the sum of the scores for all tokens in the tweet;
      - the maximal score;
      - the non-zero score of the last token in the tweet;
      The lexicon features were created for all tokens in the tweet, for each part-of-speech tag, for hashtags, and for all-caps tokens.
- word ngrams, character ngrams.
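
Several of the count-based features above, together with the negation marking, can be sketched in a few lines of Python. This is only an illustration and not the code used for the submissions: the negation cue list, the rule that a negated context ends at clause-level punctuation, and the choice to count punctuation runs of length two or more are all assumptions, since this page does not specify them. The function names are hypothetical.

```python
import re

# Assumed negation cues; the page does not list the actual ones.
NEGATION_CUES = {"not", "no", "never", "cannot"}
# Assumed end of a negated context: a token made only of clause punctuation.
CLAUSE_END = re.compile(r"^[.,:;!?]+$")


def surface_features(tokens):
    """Count-based surface features for a tokenized tweet."""
    excl = quest = mixed = 0
    for tok in tokens:
        # Assumption: a "contiguous sequence" is a run of 2+ marks.
        for run in re.findall(r"[!?]{2,}", tok):
            if set(run) == {"!"}:
                excl += 1
            elif set(run) == {"?"}:
                quest += 1
            else:
                mixed += 1
    last = tokens[-1] if tokens else ""
    return {
        "all_caps": sum(1 for t in tokens if t.isalpha() and t.isupper()),
        # One letter repeated more than 2 times, e.g. 'soooo'.
        "elongated": sum(1 for t in tokens if re.search(r"([A-Za-z])\1{2,}", t)),
        "hashtags": sum(1 for t in tokens if t.startswith("#") and len(t) > 1),
        "excl_runs": excl,
        "quest_runs": quest,
        "mixed_runs": mixed,
        "last_has_excl_or_quest": ("!" in last) or ("?" in last),
    }


def mark_negation(tokens):
    """Append _NEG to tokens inside a negated context; return (tokens, count)."""
    marked, in_scope, contexts = [], False, 0
    for tok in tokens:
        low = tok.lower()
        if CLAUSE_END.match(tok):
            in_scope = False
            marked.append(tok)
        elif low in NEGATION_CUES or low.endswith("n't"):
            if not in_scope:
                contexts += 1
            in_scope = True
            marked.append(tok)  # the cue itself stays unmarked
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
    return marked, contexts


feats = surface_features(["SO", "soooo", "happy", "#win", "!!!", "what", "?!"])
marked, n_contexts = mark_negation("this is not perfect .".split())
```

Here `marked` reproduces the 'not perfect' example from the negation feature: the cue is left as is and the following word becomes 'perfect_NEG'. The same suffix could be applied to lexicon polarity features inside a negated context.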

For term-level sentiment detection:

- character ngrams: two- and three-character prefixes and suffixes of all the words in a target term (note that the target term may be a multi-word sequence);
- elongated words: whether a term contains an elongated word (e.g., 'sooo');
- emoticons: the numbers and types of emoticons that a term contains;
- stopwords: whether a term contains only stop-words. If so, separate features indicate whether there are 1, 2, 3, or more stop-words;
- lengths: the length of a target term (number of words); the average length of words (number of characters) in a term; a binary feature indicating whether a term contains long words;
- negation: whether a term contains a negation word;
- positions: whether a term is located at the beginning, end, or the other portion of the tweet;
- punctuation: whether a term contains punctuation sequences such as '?!' and '!!!';
- sentiment lexicons: automatically created lexicons (NRC Hashtag Sentiment Lexicon, Sentiment140 Lexicon), manually created lexicons (NRC Emotion Lexicon, MPQA, Bing Liu Lexicon).
      - total count of tokens in the tweet with score greater than 0;
      - the sum of the scores for all tokens in the tweet;
      - the maximal score;
      - the non-zero score of the last token in the tweet;
- term splitting: when a term contains a hashtag that is composed of multiple words (e.g., #biggestdaythisyear), we split the hashtag into multiple words;
- upper case:
      - whether all the words in the target start with an upper case letter followed by lower case letters;
      - whether the target words are all capitalized (to capture a potential named entity);
- word ngrams: unigrams, bigrams, and the full word string of a target term and also of the four words on either side of the target. We generated separate features for the leading and ending unigrams and bigrams of the target term;
- others:
      - whether a term contains a Twitter user name;
      - whether a term contains a URL.

DOWNLOADS:

Below are the two automatically created sentiment lexicons we used to generate our submissions to SemEval-2013 Task 2. If you use them, please cite this paper.

a. NRC Hashtag Sentiment Lexicon (version 0.1) is a list of words with associations to positive and negative sentiments. The lexicon is distributed in three files: unigrams-pmilexicon.txt, bigrams-pmilexicon.txt, and pairs-pmilexicon.txt. Each line in the three files has the format:

term<tab>sentimentScore<tab>numPositive<tab>numNegative
where:
term is the target word or phrase.
In unigrams-pmilexicon.txt, term is a unigram (single word).
In bigrams-pmilexicon.txt, term is a bigram (two-word sequence). A bigram has the form: "string string". The bigram was seen at least once in the source tweets from which the lexicon was created.
In pairs-pmilexicon.txt, term is a unigram-unigram pair, unigram-bigram pair, bigram-unigram pair, or a bigram-bigram pair. The pairs were generated from a large set of source tweets. Tweets were examined one at a time, and all possible unigram and bigram combinations within the tweet were chosen. Pairs with certain punctuation marks, @ symbols, and some function words were removed.

sentimentScore is a real number. A positive score indicates positive sentiment. A negative score indicates negative sentiment. The absolute value is the degree of association with the sentiment.
numPositive is the number of times the term co-occurred with a positive marker such as a positive emoticon or a positive hashtag.
numNegative is the number of times the term co-occurred with a negative marker such as a negative emoticon or a negative hashtag.
The hashtag lexicon was created from a collection of tweets that had a positive or a negative word hashtag such as #good, #excellent, #bad, and #terrible. Version 0.1 was created from 775,310 tweets posted between April and December 2012 using a list of 77 positive and negative word hashtags. A list of these hashtags is shown in sentimenthashtags.txt.

The number of entries in:
unigrams-pmilexicon.txt: 54,129 terms
bigrams-pmilexicon.txt: 316,531 terms
pairs-pmilexicon.txt: 480,010 terms

Refer to the publication for more details.
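
As a rough illustration of how the file format above can be consumed, the following Python sketch parses tab-separated lexicon lines and computes the four per-polarity lexicon features listed under FEATURES (count, sum, maximum, and last non-zero score). It is not the original implementation. The function names are hypothetical, the sample entries are invented for the example and are not taken from the lexicon, and treating the negative polarity as the magnitude of negative scores is an assumption about how "each polarity" is handled.

```python
def parse_lexicon(lines):
    """Map each term to its sentimentScore.

    Each line is: term<tab>sentimentScore<tab>numPositive<tab>numNegative
    Splitting on tabs only keeps bigram terms ("string string") intact.
    """
    scores = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        term, score, _num_positive, _num_negative = line.split("\t")
        scores[term] = float(score)
    return scores


def lexicon_features(tokens, scores, polarity):
    """Four lexicon features for one polarity ('positive' or 'negative')."""
    sign = 1.0 if polarity == "positive" else -1.0
    # Assumption: for the negative polarity, use the magnitude of negative scores.
    values = [max(sign * scores.get(tok, 0.0), 0.0) for tok in tokens]
    nonzero = [v for v in values if v > 0]
    return {
        "count": len(nonzero),                    # tokens with score > 0
        "sum": sum(nonzero),                      # sum of the scores
        "max": max(nonzero, default=0.0),         # maximal score
        "last": nonzero[-1] if nonzero else 0.0,  # last non-zero score
    }


# Invented sample lines in the documented format (not real lexicon entries).
sample = ["good\t1.5\t120\t30", "bad\t-2.0\t10\t90", "day\t0.25\t50\t40"]
lexicon = parse_lexicon(sample)
pos = lexicon_features(["good", "day", "bad"], lexicon, "positive")
neg = lexicon_features(["good", "day", "bad"], lexicon, "negative")
```

To load one of the distributed files instead of the sample, the open file object can be passed directly, e.g. `parse_lexicon(open("unigrams-pmilexicon.txt", encoding="utf-8"))`, assuming the file is in the working directory and UTF-8 encoded.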

b. Sentiment140 Lexicon (version 0.1) is also a list of words with associations to positive and negative sentiments. It has the same format as the NRC Hashtag Sentiment Lexicon. However, it was created from the sentiment140 corpus of 1.6 million tweets, and emoticons were used as positive and negative labels (instead of hashtagged words).

NOTES:

1. The three authors contributed equally to this paper. Svetlana Kiritchenko implemented and developed the classifier for tweet-level sentiment classification, Xiaodan Zhu implemented and developed the classifier for term-level sentiment classification, and Saif Mohammad co-ordinated efforts in both tasks. All three contributed to feature development.

2. Errata:

a. The paper mentions using 78 seed words in all, 32 positive seed words, and 36 negative seed words. That should be 77 seed words in all; 30 positive seed words and 47 negative seed words.


 

Updated: June 19, 2013