This page demonstrates how to use the R code provided here to analyze agreement in a gesture elicitation study. The statistical methods that we discuss are described in more depth in the following papers:

Theophanis Tsandilas. Fallacies of Agreement: A Critical Review of Consensus Assessment Methods for Gesture Elicitation. ACM Transactions on Computer-Human Interaction (TOCHI), 25, 3, Article 18, June 2018, 49 pages [doi.org/10.1145/3182168] [bibtex] [project page]
Theophanis Tsandilas and Pierre Dragicevic. Accounting for Chance Agreement in Gesture Elicitation Studies. Research Report 1584, LRI - CNRS, University Paris-Sud, Feb 2016, 5 pages [pdf] [bibtex]

Bailly et al. (2013) investigated gestural shortcuts for their Métamorphe keyboard. Métamorphe is a keyboard with actuated keys that can sense user gestures, such as pull, twist, and push sideways. In this study, 20 participants suggested a keyboard shortcut for each of 42 referents on a Métamorphe mockup. Proposing a shortcut required choosing (i) a key and (ii) the gesture applied to that key. Bailly et al. (2013) analyzed shortcuts both as a whole and with keys and gestures treated separately. Here, we follow the separate analysis of keys and gestures. Participants produced a total of 71 unique signs for keys and 27 unique signs for gestures.

This is the original dataset as provided by the authors.

Reading the Dataset

As a first step, we read the dataset and extract the columns that contain the participants' proposals:

source("coefficients/agreement.CI.R")
source("coefficients/agreement.coefficients.R")

data <- read.csv("data/bailly et al 2013 - dataset.csv", stringsAsFactors=F)

# The first column of the data frame contains the referents. Then, for each participant,
# there are five columns, where the first captures the key and the second the gesture applied to it.
keys <- data[, seq(2, ncol(data), by=5)] # Participants' proposals of keys
gestures <- data[, seq(3, ncol(data), by=5)] # Participants' proposals of key gestures

# Replace the column names by the participant IDs
names(keys) <- paste0("P", 1:ncol(keys))
names(gestures) <- paste0("P", 1:ncol(gestures))

The resulting gestures data frame is as follows:

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20
pull top towards top top top top top towards top pull left top towards top top top LR top LR
towards towards towards towards towards towards towards top+towards towards towards towards towards towards towards towards towards towards towards top towards
pull LR pull LR away top FB top left+right(wiggle) right pull push top pull pull pull LR LR right pull
left left left left left left left top+left left left left left left left left left left left left left
FB FB top FB top towards LR top pull FB FB FB FB top top top LR top top FB
right right right right right right right top+right right right right right right right right right right right right right
away away away away away top away top+away right away away away away away away away away away pull away
top pull LR away CCW top top top away top LR top towards top top top CCW CCW top top
top top towards FB away towards top top towards LR top FB top top top LR top-double FB top top
away top top LR away away LR top away pull top LR top LR top pull left pull away top
CCW towards CCW CCW away towards towards towards CCW CCW towards CCW top CCW top CCW CCW left towards CCW
pull top top away top top top top pull top pull top top top LR top right top top CCW
directional directional CW LR away away CW top+directional pull top(double) top pull top FB top LR top-double right pull top
CW away pull towards pull top right away CCW pull pull left CW pull CW right CW pull CW CW
top top top towards away top top top FB CCW left top top top top top FB top top top
right right right right CW away right right right towards towards CW left CW right right away towards right right
left left left CCW CCW right left left left away away CCW right CCW left left left away left left
away top pull top away top top top pull pull top left top FB top FB pull pull pull top
CW away CW CW away top away away CW CW away CW top CW top CW CW right away CW
top towards top top away top LR away left+right(wiggle) towards top left top FB top towards top-double CW top FB
away away away towards CW top right away pull pull left pull CW CW pull top CW pull pull pull
top towards top top away top LR top top LR top top top pull top top FB away top CW
towards towards towards away CCW towards left towards LR top left CCW CCW CCW top top CCW top top FB
left directional directional LR left/right towards directional directional directional directional top left/right directional directional+LR directional right right pull left(pulse)/right(pulse) directional
away directional directional FB pull away directional directional directional directional pull left/right directional directional+LR directional LR right FB left(long)/right(long) directional
top right right right top top right right right right right right top right CW right right right right left/right
top top top pull away top top top towards pull top top away pull top pull CW CW pull top
left/right directional left/right FB left/right left/right directional directional directional directional CW left/right directional directional+LR directional directional FB left left/right CW/CCW
towards towards top top away top LR top towards top top left top top top top top-double LR top top
top top top top towards towards top top FB top pull LR top pull top away top push towards FB
top top top right away top top top LR/FB LR top top top top top towards top CW away top
top left left left top top left left left left left left top left CCW left left left left left
pull top away away top away top top towards FB pull top top away top top CCW CCW top top
CW/CCW CW/CCW CW/CCW FB CW/CCW top CW/CCW CW/CCW CW/CCW CW/CCW CW CW/CCW CW/CCW CW/CCW CW/CCW CW/CCW FB+CW CW/CCW CW/CCW CW/CCW
top pull top top away top top top FB top away right top top top top top left top top
FB FB pull pull CW top LR top+away pull top left top pull LR pull left CW top LR CW
pull top top pull right left top towards towards pull right left right FB directional right top-double right top pull
CCW towards LR away top top left towards CCW FB top top CCW top CCW left CCW CCW CCW CCW
left/right CW/CCW towards pull CW/CCW top left/right left/right CW/CCW left top left left/right LR top right right LR CW/CCW towards
top top CCW CCW left top top top pull CCW top left top away top left CCW left pull CCW
away away top towards CW top right top right pull CW top CW pull top CW CW CW CW FB
towards towards pull away CCW away left pull left top CCW pull CCW top pull CCW CCW CCW CCW pull

Overall Agreement

We estimate the overall agreement observed in this study. We construct interval estimates for percent agreement, Fleiss' Kappa, and Krippendorff's alpha, where percent agreement is equivalent to the AR index (Vatavu and Wobbrock 2015). We use the jackknife technique to construct 95% confidence intervals.
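For readers unfamiliar with the technique, the following is a minimal sketch of how a leave-one-participant-out (jackknife) confidence interval can be computed for an agreement index. It only illustrates the general idea; the functions used on this page (jack.CI.random.raters and friends) are provided by agreement.CI.R and may differ in their details. The names jackknife.ci and index.fun below are hypothetical.

# A minimal sketch of a leave-one-participant-out (jackknife) CI for an agreement index.
# Illustration only: jackknife.ci and index.fun are hypothetical names, not part of agreement.CI.R.
jackknife.ci <- function(ratings, index.fun, conf = 0.95) {
  n <- ncol(ratings)                      # number of participants (columns)
  theta.hat <- index.fun(ratings)         # estimate computed on the full data
  # Recompute the index n times, each time leaving one participant out
  theta.i <- sapply(1:n, function(i) index.fun(ratings[, -i, drop = FALSE]))
  theta.bar <- mean(theta.i)
  se <- sqrt((n - 1) / n * sum((theta.i - theta.bar)^2)) # jackknife standard error
  z <- qnorm(1 - (1 - conf) / 2)          # e.g., 1.96 for a 95% CI
  c(estimate = theta.hat, lower = theta.hat - z * se, upper = theta.hat + z * se)
}

# Example (hypothetical usage): jackknife.ci(keys, percent.agreement)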

Agreement on Keys

percent <- jack.CI.random.raters(keys, percent.agreement)
kappa <- jack.CI.random.raters(keys, fleiss.kappa)
alpha <- jack.CI.random.raters(keys, krippen.alpha)
  
printCI("  Percent agreement", percent)
printCI("      Fleiss' Kappa", kappa)
printCI("Krippendorf's alpha", alpha) 
##   Percent agreement = 0.284, 95% CI [0.172, 0.397]
##       Fleiss' Kappa = 0.260, 95% CI [0.148, 0.371]
## Krippendorf's alpha = 0.261, 95% CI [0.149, 0.372]

We observe that Krippendorff's alpha is almost identical to Fleiss' Kappa. According to Gwet (2014), when there are no missing data and the number of participants is greater than five, the two indices generally yield very close results.

Agreement on Key Gestures

percent <- jack.CI.random.raters(gestures, percent.agreement)
kappa <- jack.CI.random.raters(gestures, fleiss.kappa)
alpha <- jack.CI.random.raters(gestures, krippen.alpha)
  
printCI("  Percent agreement", percent)
printCI("      Fleiss' Kappa", kappa)
printCI("Krippendorf's alpha", alpha) 
##   Percent agreement = 0.336, 95% CI [0.287, 0.386]
##       Fleiss' Kappa = 0.240, 95% CI [0.192, 0.289]
## Krippendorf's alpha = 0.241, 95% CI [0.193, 0.289]

We now observe a larger discrepancy between percent agreement and the chance-corrected indices, which is clearly due to a higher chance agreement.

Chance Agreement and Bias

Let’s estimate the chance agreement for both keys and their gestures:

chance.keys <- jack.CI.random.raters(keys, chance.agreement)
chance.gestures <- jack.CI.random.raters(gestures, chance.agreement)
  
printCI("    Chance agreement (Keys)", chance.keys)
printCI("Chance agreement (Gestures)", chance.gestures)
##     Chance agreement (Keys) = 0.033, 95% CI [0.025, 0.041]
## Chance agreement (Gestures) = 0.126, 95% CI [0.098, 0.155]
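
As a sanity check, the chance-corrected estimates reported above follow directly from these values. Fleiss' Kappa is defined as \(\kappa = \frac{p_a - p_e}{1 - p_e}\), where \(p_a\) is the observed percent agreement and \(p_e\) is the chance agreement, which gives \((0.284 - 0.033)/(1 - 0.033) \approx 0.260\) for keys and \((0.336 - 0.126)/(1 - 0.126) \approx 0.240\) for gestures.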

We observe that chance agreement becomes substantial for gestures. This can be partly explained by the smaller number of unique signs observed for gestures: 27 signs for gestures vs. 71 signs for keys. However, this is not the only explanation. Let's calculate the frequency of signs across all referents:

signs <- unlist(gestures)
counts <- sort(table(signs), decreasing = T)
freq <- counts/sum(counts)

print("Frequency (%) of the six most frequent signs:")
print(head(freq)*100)
## [1] "Frequency (%) of the six most frequent signs:"
## signs
##       top      away      left      pull     right   towards 
## 28.095238  8.809524  8.809524  8.690476  8.214286  7.857143

We observe that the top sign (a regular key press) alone accounted for 28% of all gesture proposals. It is likely that this sign served as a “default” sign when participants could not come up with a meaningful gesture. This bias towards a single sign increases the likelihood of agreement by chance. Also notice that the six most frequent signs accounted for 70.5% of all gesture proposals (see the quick check below).
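
As a quick check of this last figure, we can sum the six largest frequencies:

# Cumulative frequency (%) of the six most frequent signs
round(sum(head(freq)) * 100, 1)
## [1] 70.5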

Agreement Specific to Signs

For the rest of our analysis, we focus on the key gestures. In addition to overall agreement, we calculate agreement specific to signs (or specific agreement):

source("coefficients/specific.agreement.CI.R")
specific <- specific.agreement.CI(gestures) # A data frame with interval estimates and sign frequencies
specific <- specific[with(specific, order(-Freq)),] # Sort the signs by their frequencies

We then plot specific agreements with their 95% jackknife CIs. We also display the sign frequencies (percentages in red):

library(ggplot2)

signs <- specific$Category # Get the names of the signs

plot <- ggplot(specific, aes(x = Category, y = Specific)) + 
  geom_point(size = 1.8, color = "#0000aa", shape = 1) +
  scale_x_discrete(name ="Sign", limits = signs) +
  scale_y_continuous(breaks=c(0, 0.2, 0.4, 0.6, 0.8, 1)) + theme_bw() +
  geom_errorbar(aes(ymax = Upper, ymin = Lower), width=0.01, color = "#0000aa") +
  geom_text(aes(label = sprintf("%2.1f%%", Freq), y = -.15), color="red", size=3.1, vjust=0) +
  theme(axis.title.x = element_blank(), 
        axis.text.x = element_text(angle = 40, vjust = 1, hjust = 1, size = 10), 
        axis.text.y = element_text(size = 12)) +
  theme(panel.border = element_blank(), panel.grid.major.x = element_blank(), 
        panel.grid.minor = element_blank(), 
        panel.grid.major = element_line(colour = "#000000", size=0.03)) + 
  ylab("Specific Agreement")

print(plot)

We observe agreement for only 13 signs; the remaining signs appeared without any consensus among participants. The above calculation does not account for chance agreement. Spitzer and Fleiss (1974) argue that agreement specific to categories should also be corrected for chance agreement. To this end, we can use a simple formulation described by Uebersax (1982), which assumes that the bias distribution is common across all raters (participants in our case).

specific.corrected <- chance.corrected.specific.agreement.CI(gestures) 
specific.corrected <- specific.corrected[with(specific.corrected, order(-Freq)),]

We can plot the chance-corrected specific agreement in the same way as before.
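Assuming that the data frame returned by chance.corrected.specific.agreement.CI has the same columns as the earlier one (Category, Specific, Lower, Upper, and Freq), the earlier plotting code can be reused with only minor changes, for example:

signs.corrected <- specific.corrected$Category

plot <- ggplot(specific.corrected, aes(x = Category, y = Specific)) + 
  geom_point(size = 1.8, color = "#0000aa", shape = 1) +
  scale_x_discrete(name ="Sign", limits = signs.corrected) +
  scale_y_continuous(breaks=c(0, 0.2, 0.4, 0.6, 0.8, 1)) + theme_bw() +
  geom_errorbar(aes(ymax = Upper, ymin = Lower), width=0.01, color = "#0000aa") +
  geom_text(aes(label = sprintf("%2.1f%%", Freq), y = -.15), color="red", size=3.1, vjust=0) +
  theme(axis.title.x = element_blank(), 
        axis.text.x = element_text(angle = 40, vjust = 1, hjust = 1, size = 10), 
        axis.text.y = element_text(size = 12)) +
  theme(panel.border = element_blank(), panel.grid.major.x = element_blank(), 
        panel.grid.minor = element_blank(), 
        panel.grid.major = element_line(colour = "#000000", size=0.03)) + 
  ylab("Chance-Corrected Specific Agreement")

print(plot)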

Notice that chance correction penalizes the very frequent signs, such as the top sign, but has little effect on less frequent signs. We observe a particularly high agreement on the use of the CW/CCW sign (turn the key clockwise or counter-clockwise). Such agreement did not emerge by chance, as its use was not arbitrary but rather selective and consistent across participants.

For the analyses reported in the TOCHI article, specific agreement has not been corrected for chance agreement. However, agreement values are interpreted by taking into account their observed sign frequencies. Both approaches are valid as long as the authors are clear about their analyses and also report the bias distributions that they observe.

Agreement over Individual Referents

We can also use Fleiss’ Kappa for individual (or groups of) referents by assuming a common chance agreement across all referents:

refs <- list() # referents list
k <- list() # Kappa estimates
l <- list() # Lower bounds of 95% CI
u <- list() # Upper bounds of 95% CI

for(index in 1:nrow(gestures)){ 
    # Construct the jackknife CI for Fleiss' Kappa of each referent
    ci <- jack.CI.random.raters.fleiss.kappa.for.item(gestures, index)
    
    refs[index] <- data[index, 1]
    k[index] <- ci[1] 
    l[index] <- ci[2] 
    u[index] <- ci[3] 
}

# Create a data frame with all the estimates
fleiss.df <- data.frame(Referent = as.character(refs), Kappa = as.double(k), Low = as.double(l), Upper = as.double(u))

# Sort the referents by their agreement values
fleiss.df <- fleiss.df[with(fleiss.df, order(-Kappa)),]

We can then plot our results as follows:

referents <- fleiss.df$Referent

plot <- ggplot(fleiss.df, aes(x = Referent, y = Kappa)) +
  geom_hline(yintercept = 0, color = "red", size = .2) +
  geom_point(size = 1.8, color = "#0000aa", shape = 1) +
  scale_x_discrete(name ="Referent", limits = referents) +
  scale_y_continuous(breaks=c(-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1)) + 
  theme_bw() +
  geom_errorbar(aes(ymax = Upper, ymin = Low), width=0.01, color = "#0000aa") +
  theme(axis.title.x = element_blank(), axis.text.x = element_text(angle=48, vjust=1, hjust=1, size=10), 
        axis.text.y = element_text(size = 10)) +
  theme(panel.border = element_blank(), panel.grid.major.x = element_blank(), 
        panel.grid.minor = element_blank(), panel.grid.major = element_line(colour="#000000", size=0.03))

print(plot)

Notice that Kappa can take negative values, where a negative value implies “disagreement.”
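
This follows from the definition of the index, \(\kappa = \frac{p_a - p_e}{1 - p_e}\): whenever the observed agreement \(p_a\) for a referent falls below the agreement \(p_e\) expected by chance, the numerator, and therefore \(\kappa\), becomes negative.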

Warning: Jackknife confidence intervals for individual referents may not be precise, especially when the number of participants is not sufficiently large.

Within-Participant Comparisons

We can also use confidence intervals to back up the authors' claim about the effect of directional referents on agreement: “highly directional commands […] tended to have a high gesture agreement” (Bailly et al. 2013). The difference in Kappa between the eight directional referents, whose names contain the terms top, bottom, left, right, previous, or next, and all other referents can be estimated as follows:

directional <- c("top", "bottom", "left", "right", "previous", "next", "Previous", "Next")
referents.all <- data[,1]
matched.rows <- grep(paste(directional,collapse="|"), referents.all)

directional <- gestures[matched.rows,] # gesture proposals for directional referents
other <- gestures[-matched.rows,] # other proposals

# This assumes that chance agreement is the same for the two referent groups 
# For this, we use a pooled pe -- calculated for the full dataset 
fleiss.pe <- fleiss.chance.agreement.raw.noinference(gestures)
fleiss.pooled <- function(ratings) {generic.kappa(ratings, fleiss.pe)}

diff.Kappa <- jack.CI.diff.random.raters(directional, other, fleiss.pooled)
printCI("Difference in Fleiss' Kappa (pooled pe)", diff.Kappa)
## Difference in Fleiss' Kappa (pooled pe) = 0.408, 95% CI [0.237, 0.579]

The difference in Fleiss' Kappa between the two groups of referents is \(\Delta \kappa =.41\), \(95\%\) CI \(= [.24, .58]\), so there is strong evidence to support the authors' claim.

Comparisons between Independent Groups

Bailly et al. (2013) collected proposals from 11 women and 9 men. Vatavu and Wobbrock (2016) reanalyzed their dataset to test differences in agreement between genders but observed similar overall agreement rates between women and men (.353 vs. .322). We can use Fleiss’ Kappa to estimate this difference and construct its 95% confidence interval with the percentile bootstrap method (this can take several minutes):

# These are the column indices (participant IDs) of the male participants
men_mask <- c(1, 5, 6, 8, 9, 13, 15, 18, 20)
gestures_men <- gestures[men_mask]
gestures_women <- gestures[-men_mask]

kappa.men <- jack.CI.random.raters(gestures_men, fleiss.kappa)
printCI("Fleiss' Kappa for Men", kappa.men)

kappa.women <- jack.CI.random.raters(gestures_women, fleiss.kappa)
printCI("Fleiss' Kappa for Women", kappa.women)

# The first argument specifies the referents of interest 
# Here, we want to compare agreement over all the referents
kappa.delta <- fleiss.kappa.bootstrap.diff.ci(1:nrow(gestures), 
                                              gestures_women, gestures_men, 
                                              R = 3000) # Num of bootstrap samples
printCI("Difference in Fleiss' Kappa between Women and Men", kappa.delta)
## Fleiss' Kappa for Men = 0.207, 95% CI [0.087, 0.327]
## Fleiss' Kappa for Women = 0.268, 95% CI [0.207, 0.330]
## Difference in Fleiss' Kappa between Women and Men = 0.061, 95% CI [-0.115, 0.165]

Clearly, there is no evidence of a difference in agreement between women and men.
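
For readers curious about what the bootstrap does under the hood, the following is a minimal sketch of a percentile-bootstrap CI for the difference in an agreement index between two independent groups of participants. It only illustrates the general idea; the actual implementation, fleiss.kappa.bootstrap.diff.ci, additionally allows restricting the analysis to specific referents. The names bootstrap.diff.ci and index.fun are hypothetical.

# A minimal sketch of a percentile-bootstrap CI for a difference in agreement
# between two independent groups of participants (illustration only).
bootstrap.diff.ci <- function(group1, group2, index.fun, R = 3000, conf = 0.95) {
  diffs <- replicate(R, {
    # Resample participants (columns) with replacement within each group
    s1 <- group1[, sample(ncol(group1), replace = TRUE), drop = FALSE]
    s2 <- group2[, sample(ncol(group2), replace = TRUE), drop = FALSE]
    index.fun(s1) - index.fun(s2)
  })
  estimate <- index.fun(group1) - index.fun(group2)
  ci <- quantile(diffs, probs = c((1 - conf) / 2, 1 - (1 - conf) / 2))
  c(estimate = estimate, lower = ci[[1]], upper = ci[[2]])
}

# Example (hypothetical usage): bootstrap.diff.ci(gestures_women, gestures_men, fleiss.kappa)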

However, Vatavu and Wobbrock (2016) continued their analysis and used the Vb statistic to compare agreement differences between genders for individual referents. They found “significant differences (p < .05) for 7 referents.” Based on this finding, they concluded: “these results show that women and men reach consensus over gestures in different ways that depend on the nature of the referent […]”. The TOCHI article shows that such differences can be attributed to chance, given the high Type I error rate of the Vb statistic.

Can we then use the bootstrap method to test gender differences for individual referents? We discourage such practices for several reasons. Comparisons between women and men were outside the scope of the original study by Bailly et al. (2013). The two groups were small, and the study did not control for confounding variables that might correlate with gender. We argue against making unplanned post-hoc comparisons over uncontrolled samples of such small sizes, as they can lead to misleading conclusions. Furthermore, jackknife and bootstrap confidence intervals for individual referents may not be precise, especially when sample sizes are small. Therefore, using them to test such hypotheses may not be appropriate.

Other Remarks

The same agreement indices are used for the analysis of inter-rater reliability studies, for example, to assess how independent raters agree in their classification of design outcomes (see Bousseau et al. 2016). However, the underlying assumptions for statistical inference may not be the same. In gesture elicitation studies, referents are fixed: any conclusion typically applies only to these referents. Participants, in contrast, are chosen randomly, and investigators may need to generalize their conclusions to the entire population of potential users.

Therefore, the above methods of inference will not apply if items classified by raters are not fixed but are rather sampled from a larger population. Gwet (2014) discusses solutions for a range of scenarios. His AgreeStat software deals with many of them.

Note that chance-corrected coefficients have received a lot of criticism from many authors, including Gwet (2014), who has introduced his own measure. Despite such criticisms, Fleiss' Kappa and Krippendorff's alpha are still widely used for very good reasons. For a strong argument in favor of these measures, I refer readers to Krippendorff (2016).

Acknowledgments

Pierre Dragicevic has contributed insights and code for the above analyses. I am grateful to Gilles Bailly and his co-authors for giving access to their anonymized dataset.

References

Bailly, Gilles, Thomas Pietrzak, Jonathan Deber, and Daniel J. Wigdor. 2013. “Métamorphe: Augmenting Hotkey Usage with Actuated Keys.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 563–72. CHI ’13. New York, NY, USA: ACM. doi:10.1145/2470654.2470734.

Bousseau, Adrien, Theophanis Tsandilas, Lora Oehlberg, and Wendy E. Mackay. 2016. “How Novices Sketch and Prototype Hand-Fabricated Objects.” In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 397–408. CHI ’16. New York, NY, USA: ACM. doi:10.1145/2858036.2858159.

Gwet, Kilem Li. 2014. Handbook of Inter-Rater Reliability, 4th Edition: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Advanced Analytics, LLC. https://books.google.fr/books?id=fac9BQAAQBAJ.

Krippendorff, Klaus. 2016. “Misunderstanding Reliability.” Methodology: European Journal of Research Methods for the Behavioral and Social Sciences 12 (4): 139–44. doi:10.1027/1614-2241/a000119.

Spitzer, Robert L., and Joseph L. Fleiss. 1974. “A Re-Analysis of the Reliability of Psychiatric Diagnosis.” The British Journal of Psychiatry 125 (587). The Royal College of Psychiatrists: 341–47. doi:10.1192/bjp.125.4.341.

Uebersax, John S. 1982. “A Design-Independent Method for Measuring the Reliability of Psychiatric Diagnosis.” Journal of Psychiatric Research 17 (4). Elsevier: 335–42. doi:10.1016/0022-3956(82)90039-5.

Vatavu, Radu-Daniel, and Jacob Wobbrock. 2016. “Between-Subjects Elicitation Studies: Formalization and Tool Support.” In Proceedings of the 34th Annual ACM Conference on Human Factors in Computing Systems, 3390–3402. CHI ’16. New York, NY, USA: ACM. doi:10.1145/2858036.2858228.

Vatavu, Radu-Daniel, and Jacob O. Wobbrock. 2015. “Formalizing Agreement Analysis for Elicitation Studies: New Measures, Significance Test, and Toolkit.” In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1325–34. CHI ’15. New York, NY, USA: ACM. doi:10.1145/2702123.2702223.