Best-Worst Scaling vs. Rating Scale Annotation

This page describes work that directly compares the rating scale method of annotation with Best-Worst Scaling (BWS), to determine which method produces more reliable annotations. Links to research papers are at the bottom of this page. Contact:
BACKGROUND: When manually annotating data with quantitative or qualitative information, researchers in many disciplines, including social sciences and computational linguistics, often rely on rating scales (RS). A rating scale provides the annotator with a choice of categorical or numerical values that represent the measurable characteristic of the rated data. For example, when annotating a word for sentiment, the annotator can be asked to choose among integer values from 1 to 9, with 1 representing the strongest negative sentiment, and 9 representing the strongest positive sentiment (Bradley and Lang, 1999; Warriner et al., 2013). Another example is the Likert scale, which measures responses on a symmetric agree–disagree scale, from ‘strongly disagree’ to ‘strongly agree’ (Likert, 1932). The annotations for an item from multiple respondents are usually averaged to obtain a real-valued score for that item. While frequently used in many disciplines, the rating scale method has a number of limitations (Presser and Schuman, 1996; Baumgartner and Steenkamp, 2001). These include:
Paired Comparisons (Thurstone, 1927; David, 1963) is a comparative annotation method in which respondents are presented with pairs of items and asked which item has more of the property of interest (for example, which is more positive). The annotations can then be converted into a ranking of items by the property of interest, and one can even obtain real-valued scores indicating the degree to which an item is associated with the property. The paired comparison method does not suffer from the problems discussed above for rating scales, but it requires a large number of annotations: on the order of N², where N is the number of items to be annotated.

Best–Worst Scaling (BWS) is a lesser-known, and more recently introduced, variant of comparative annotation. It was developed by Louviere (1991), building on groundbreaking research in the 1960s in mathematical psychology and psychophysics by Anthony A. J. Marley and Duncan Luce. Annotators are presented with n items at a time (an n-tuple, where n > 1, and typically n = 4). They are asked which item is the best (highest in terms of the property of interest) and which is the worst (lowest in terms of the property of interest). When working on 4-tuples, best–worst annotations are particularly efficient because answering these two questions determines the outcome of five of the six possible item–item pairwise comparisons in the tuple. All items to be rated are organized into a set of m 4-tuples (m ≥ N, where N is the number of items) so that each item is evaluated several times, in diverse 4-tuples. Once the m 4-tuples are annotated, one can compute real-valued scores for each of the items using a simple counting procedure (Orme, 2009). The scores can be used to rank items by the property of interest.
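To make the counting procedure concrete, here is a minimal sketch in Python (not the scripts used in this work; the function names and the annotation format are assumptions made for the example). It organizes items into 4-tuples and then scores each item as the proportion of annotated tuples in which it was chosen as best minus the proportion in which it was chosen as worst, which is the simple counting procedure referred to above.

```python
import random
from collections import defaultdict

def make_4tuples(items, k=8, seed=0):
    """Organize items into 4-tuples so that each item appears in k tuples
    (k * N / 4 tuples in total).  Each pass shuffles the item list and chunks
    it into groups of 4, so items within a tuple are distinct.  Assumes
    len(items) is divisible by 4; real tuple generation typically adds further
    constraints to keep the tuples diverse."""
    rng = random.Random(seed)
    tuples = []
    for _ in range(k):
        pool = list(items)
        rng.shuffle(pool)
        tuples.extend(tuple(pool[i:i + 4]) for i in range(0, len(pool), 4))
    return tuples

def bws_scores(annotations):
    """Counting procedure: score(item) = %best - %worst.
    `annotations` is a list of (four_items, best_item, worst_item) records,
    one per annotated tuple (a tuple annotated by several people contributes
    several records)."""
    appeared = defaultdict(int)
    best = defaultdict(int)
    worst = defaultdict(int)
    for four_items, chosen_best, chosen_worst in annotations:
        for item in four_items:
            appeared[item] += 1
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {item: (best[item] - worst[item]) / appeared[item] for item in appeared}
```

With k = 6 this sketch produces 1.5N tuples, and with k = 8 it produces 2N tuples. The resulting scores lie in [-1, 1] and can be sorted to rank the items by the property of interest.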
EXPERIMENT SUMMARY: BWS is claimed to produce high-quality annotations while keeping the number of annotations small (only 1.5N–2N tuples need to be annotated) (Louviere et al., 2015; Kiritchenko and Mohammad, 2016a). However, the veracity of this claim has never been systematically established. In this paper, we pit the widely used rating scale squarely against BWS in a quantitative experiment to determine which method provides more reliable results. We produce real-valued sentiment intensity ratings for 3,207 English terms (words and phrases) using both methods, by aggregating responses from several independent annotators.

Reliability of the Annotations: One cannot use standard inter-annotator agreement to assess the quality of BWS annotations, because the disagreement that arises when a tuple has two items that are close in score is a useful signal for BWS. For a given 4-tuple, if respondents are not able to consistently identify the item that is highest (or lowest) in the property of interest, then the disagreement will lead to those two items obtaining scores that are close to each other, which is the desired outcome. Thus, a different measure of annotation quality must be used. A useful measure is the reproducibility of the end result: if repeated independent manual annotations from multiple respondents result in similar scores, then one can be confident that the scores capture the true measure of the property of interest. To assess this reproducibility, we calculate the average split-half reliability (SHR) over 100 trials. SHR is a commonly used approach for determining consistency in psychological studies, which we employ as follows. All annotations for an item (in our case, for a tuple) are randomly split into two halves. Two sets of scores are produced independently from the two halves, and the correlation between the two sets of scores is calculated. If the annotations are of good quality, then the correlation between the two halves will be high. The same procedure is used for the rating scale annotations, where the ratings for each item are split into two halves.

RESULTS SUMMARY: We show that BWS ranks terms more reliably: when comparing the term rankings obtained from two groups of annotators for the same set of terms, the correlation between the two sets of ranks produced by BWS is significantly higher than the correlation for the two sets obtained with the rating scale. The difference in reliability is more marked when about 5N (or fewer) total annotations are obtained, which is the case in many NLP annotation projects (Strapparava and Mihalcea, 2007; Socher et al., 2013; Mohammad and Turney, 2013). Furthermore, the reliability obtained with the rating scale using ten annotations per term (10N annotations in total) is matched by BWS with only 3N total annotations (two annotations for each of 1.5N 4-tuples).
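As an illustration of the split-half reliability procedure described above, here is a minimal sketch (again with assumed names and data formats; it is not the evaluation code used in this work). It groups annotation records by the unit whose annotations are split (the 4-tuple for BWS, the item for the rating scale), splits each group into two random halves, scores the halves independently with a supplied scoring function such as bws_scores above, and averages the rank correlation between the two resulting sets of scores over many trials.

```python
import random
from collections import defaultdict
from scipy.stats import spearmanr

def split_half_reliability(annotations, key_fn, score_fn, trials=100, seed=0):
    """Average split-half reliability (SHR).
    annotations: list of records, e.g. (four_items, best, worst) for BWS
                 or (item, rating) for the rating scale.
    key_fn:      maps a record to the unit whose annotations are split
                 (the 4-tuple for BWS, the item for the rating scale).
    score_fn:    maps a list of records to a dict {item: score}."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in annotations:
        groups[key_fn(rec)].append(rec)
    correlations = []
    for _ in range(trials):
        half_a, half_b = [], []
        for recs in groups.values():
            recs = list(recs)          # copy so the shuffle is local to this trial
            rng.shuffle(recs)
            mid = len(recs) // 2
            half_a.extend(recs[:mid])
            half_b.extend(recs[mid:])
        scores_a, scores_b = score_fn(half_a), score_fn(half_b)
        common = sorted(set(scores_a) & set(scores_b))
        rho, _pvalue = spearmanr([scores_a[i] for i in common],
                                 [scores_b[i] for i in common])
        correlations.append(rho)
    return sum(correlations) / len(correlations)

# Example with hypothetical BWS data, scored by the counting procedure:
# shr = split_half_reliability(bws_annotations,
#                              key_fn=lambda rec: rec[0],   # group by 4-tuple
#                              score_fn=bws_scores)
```

Spearman rank correlation is used here to compare rankings; a Pearson correlation over the scores can be substituted when the scores themselves, rather than the ranks, are of interest.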
Further details in these papers:
Last updated: May 2017.