This page demonstrates how to use the R code provided here to analyze agreement in a gesture elicitation study. The statistical methods that we discuss are described in more depth in the following papers:
Theophanis Tsandilas. Fallacies of Agreement: A Critical Review of Consensus Assessment Methods for Gesture Elicitation. ACM Transactions on Computer-Human Interaction (TOCHI), 25, 3, Article 18, June 2018, 49 pages [doi.org/10.1145/3182168] [bibtex] [project page]
Theophanis Tsandilas and Pierre Dragicevic. Accounting for Chance Agreement in Gesture Elicitation Studies. Research Report 1584, LRI - CNRS, University Paris-Sud, Feb 2016, 5 pages [pdf] [bibtex]
Bailly et al. (2013) investigated gestural shortcuts for their Métamorphe keyboard. Métamorphe is a keyboard with actuated keys that can sense user gestures, such as pull, twist, and push sideways. In this study, 20 participants suggested a keyboard shortcut for 42 referents on a Métamorphe mockup. Proposing a shortcut required choosing (i) a key and (ii) the gesture applied to the key. Bailly et al. (2013) treated shortcuts as a whole but also analyzed keys and gestures separately. Here, we analyze them separately. Participants produced a total of 71 unique signs for keys and 27 unique signs for gestures.
This is the original dataset as provided by the authors.
As a first step, we read the data and bring them into the appropriate format:
source("coefficients/agreement.CI.R")
source("coefficients/agreement.coefficients.R")
data <- read.csv("data/bailly et al 2013 - dataset.csv", stringsAsFactors=F)
# For each participant, there are five columns, where the first captures the key and the second captures the gesture
keys <- data[, seq(2, ncol(data), by=5)] # These are participants' proposals of keys
gestures <- data[, seq(3, ncol(data), by=5)] # These are participants' proposals of key gestures
# Replace the column names by the participant IDs
names(keys) <- paste0("P", 1:ncol(keys))
names(gestures) <- paste0("P", 1:ncol(gestures))
The resulting gestures data frame is as follows:
P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | P15 | P16 | P17 | P18 | P19 | P20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
pull | top | towards | top | top | top | top | top | towards | top | pull | left | top | towards | top | top | top | LR | top | LR |
towards | towards | towards | towards | towards | towards | towards | top+towards | towards | towards | towards | towards | towards | towards | towards | towards | towards | towards | top | towards |
pull | LR | pull | LR | away | top | FB | top | left+right(wiggle) | right | pull | push | top | pull | pull | pull | LR | LR | right | pull |
left | left | left | left | left | left | left | top+left | left | left | left | left | left | left | left | left | left | left | left | left |
FB | FB | top | FB | top | towards | LR | top | pull | FB | FB | FB | FB | top | top | top | LR | top | top | FB |
right | right | right | right | right | right | right | top+right | right | right | right | right | right | right | right | right | right | right | right | right |
away | away | away | away | away | top | away | top+away | right | away | away | away | away | away | away | away | away | away | pull | away |
top | pull | LR | away | CCW | top | top | top | away | top | LR | top | towards | top | top | top | CCW | CCW | top | top |
top | top | towards | FB | away | towards | top | top | towards | LR | top | FB | top | top | top | LR | top-double | FB | top | top |
away | top | top | LR | away | away | LR | top | away | pull | top | LR | top | LR | top | pull | left | pull | away | top |
CCW | towards | CCW | CCW | away | towards | towards | towards | CCW | CCW | towards | CCW | top | CCW | top | CCW | CCW | left | towards | CCW |
pull | top | top | away | top | top | top | top | pull | top | pull | top | top | top | LR | top | right | top | top | CCW |
directional | directional | CW | LR | away | away | CW | top+directional | pull | top(double) | top | pull | top | FB | top | LR | top-double | right | pull | top |
CW | away | pull | towards | pull | top | right | away | CCW | pull | pull | left | CW | pull | CW | right | CW | pull | CW | CW |
top | top | top | towards | away | top | top | top | FB | CCW | left | top | top | top | top | top | FB | top | top | top |
right | right | right | right | CW | away | right | right | right | towards | towards | CW | left | CW | right | right | away | towards | right | right |
left | left | left | CCW | CCW | right | left | left | left | away | away | CCW | right | CCW | left | left | left | away | left | left |
away | top | pull | top | away | top | top | top | pull | pull | top | left | top | FB | top | FB | pull | pull | pull | top |
CW | away | CW | CW | away | top | away | away | CW | CW | away | CW | top | CW | top | CW | CW | right | away | CW |
top | towards | top | top | away | top | LR | away | left+right(wiggle) | towards | top | left | top | FB | top | towards | top-double | CW | top | FB |
away | away | away | towards | CW | top | right | away | pull | pull | left | pull | CW | CW | pull | top | CW | pull | pull | pull |
top | towards | top | top | away | top | LR | top | top | LR | top | top | top | pull | top | top | FB | away | top | CW |
towards | towards | towards | away | CCW | towards | left | towards | LR | top | left | CCW | CCW | CCW | top | top | CCW | top | top | FB |
left | directional | directional | LR | left/right | towards | directional | directional | directional | directional | top | left/right | directional | directional+LR | directional | right | right | pull | left(pulse)/right(pulse) | directional |
away | directional | directional | FB | pull | away | directional | directional | directional | directional | pull | left/right | directional | directional+LR | directional | LR | right | FB | left(long)/right(long) | directional |
top | right | right | right | top | top | right | right | right | right | right | right | top | right | CW | right | right | right | right | left/right |
top | top | top | pull | away | top | top | top | towards | pull | top | top | away | pull | top | pull | CW | CW | pull | top |
left/right | directional | left/right | FB | left/right | left/right | directional | directional | directional | directional | CW | left/right | directional | directional+LR | directional | directional | FB | left | left/right | CW/CCW |
towards | towards | top | top | away | top | LR | top | towards | top | top | left | top | top | top | top | top-double | LR | top | top |
top | top | top | top | towards | towards | top | top | FB | top | pull | LR | top | pull | top | away | top | push | towards | FB |
top | top | top | right | away | top | top | top | LR/FB | LR | top | top | top | top | top | towards | top | CW | away | top |
top | left | left | left | top | top | left | left | left | left | left | left | top | left | CCW | left | left | left | left | left |
pull | top | away | away | top | away | top | top | towards | FB | pull | top | top | away | top | top | CCW | CCW | top | top |
CW/CCW | CW/CCW | CW/CCW | FB | CW/CCW | top | CW/CCW | CW/CCW | CW/CCW | CW/CCW | CW | CW/CCW | CW/CCW | CW/CCW | CW/CCW | CW/CCW | FB+CW | CW/CCW | CW/CCW | CW/CCW |
top | pull | top | top | away | top | top | top | FB | top | away | right | top | top | top | top | top | left | top | top |
FB | FB | pull | pull | CW | top | LR | top+away | pull | top | left | top | pull | LR | pull | left | CW | top | LR | CW |
pull | top | top | pull | right | left | top | towards | towards | pull | right | left | right | FB | directional | right | top-double | right | top | pull |
CCW | towards | LR | away | top | top | left | towards | CCW | FB | top | top | CCW | top | CCW | left | CCW | CCW | CCW | CCW |
left/right | CW/CCW | towards | pull | CW/CCW | top | left/right | left/right | CW/CCW | left | top | left | left/right | LR | top | right | right | LR | CW/CCW | towards |
top | top | CCW | CCW | left | top | top | top | pull | CCW | top | left | top | away | top | left | CCW | left | pull | CCW |
away | away | top | towards | CW | top | right | top | right | pull | CW | top | CW | pull | top | CW | CW | CW | CW | FB |
towards | towards | pull | away | CCW | away | left | pull | left | top | CCW | pull | CCW | top | pull | CCW | CCW | CCW | CCW | pull |
We first estimate the overall agreement observed in this study. We construct interval estimates for percent agreement, Fleiss’ Kappa, and Krippendorff’s alpha, starting with the keys:
percent <- jack.CI.random.raters(keys, percent.agreement)
kappa <- jack.CI.random.raters(keys, fleiss.kappa)
alpha <- jack.CI.random.raters(keys, krippen.alpha)
printCI(" Percent agreement", percent)
printCI(" Fleiss' Kappa", kappa)
printCI("Krippendorff's alpha", alpha)
## Percent agreement = 0.284, 95% CI [0.172, 0.397]
## Fleiss' Kappa = 0.260, 95% CI [0.148, 0.371]
## Krippendorff's alpha = 0.261, 95% CI [0.149, 0.372]
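For intuition, the leave-one-participant-out jackknife behind these intervals can be sketched in a few lines of base R. This is an illustrative sketch with a hypothetical `jack.ci` helper, not the repository's implementation:

```r
# Leave-one-participant-out jackknife CI (illustrative sketch, base R).
# `stat` maps a ratings table (referents x participants) to a number.
jack.ci <- function(ratings, stat, conf = 0.95) {
  m <- ncol(ratings)
  theta <- stat(ratings)
  # Recompute the statistic once per left-out participant
  theta.i <- sapply(1:m, function(j) stat(ratings[, -j, drop = FALSE]))
  # Jackknife standard error from the spread of the leave-one-out estimates
  se <- sqrt((m - 1) / m * sum((theta.i - mean(theta.i))^2))
  z <- qnorm(1 - (1 - conf) / 2)  # normal critical value
  c(estimate = theta, lower = theta - z * se, upper = theta + z * se)
}
```

Because participants are treated as a random sample, resampling happens over columns (raters), not over referents.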
We observe that Krippendorff’s alpha is almost identical to Fleiss’ Kappa. According to Gwet (2014), when there are no missing data and the number of participants is greater than five, the results of the two indices are generally very close.
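To make the two chance corrections concrete, here is a toy base-R computation of both indices on a small matrix (illustrative only, not the repository's `fleiss.kappa` and `krippen.alpha` functions):

```r
# Rows are referents, columns are participants
ratings <- matrix(c("A","A","B",
                    "A","A","A",
                    "B","C","B"), nrow = 3, byrow = TRUE)
cats <- sort(unique(as.vector(ratings)))
n <- nrow(ratings); m <- ncol(ratings)
# Per-referent counts of how many participants chose each category
counts <- t(apply(ratings, 1, function(r) table(factor(r, levels = cats))))

# Fleiss' Kappa: observed vs. chance agreement over participant pairs
p.a <- mean(rowSums(counts * (counts - 1)) / (m * (m - 1)))
p.e <- sum((colSums(counts) / (n * m))^2)
kappa <- (p.a - p.e) / (1 - p.e)

# Krippendorff's alpha (nominal): 1 - observed/expected disagreement,
# where expected disagreement pairs values without replacement
D.o <- mean((m^2 - rowSums(counts^2)) / (m * (m - 1)))
D.e <- ((n * m)^2 - sum(colSums(counts)^2)) / ((n * m) * ((n * m) - 1))
alpha <- 1 - D.o / D.e
```

The two coefficients differ only in alpha's small-sample correction of the expected disagreement, which is why they converge as the number of ratings grows.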
percent <- jack.CI.random.raters(gestures, percent.agreement)
kappa <- jack.CI.random.raters(gestures, fleiss.kappa)
alpha <- jack.CI.random.raters(gestures, krippen.alpha)
printCI(" Percent agreement", percent)
printCI(" Fleiss' Kappa", kappa)
printCI("Krippendorff's alpha", alpha)
## Percent agreement = 0.336, 95% CI [0.287, 0.386]
## Fleiss' Kappa = 0.240, 95% CI [0.192, 0.289]
## Krippendorff's alpha = 0.241, 95% CI [0.193, 0.289]
We now observe a larger discrepancy between percent agreement and the chance-corrected indices, which is clearly due to higher chance agreement.
Let’s estimate the chance agreement for both keys and their gestures:
chance.keys <- jack.CI.random.raters(keys, chance.agreement)
chance.gestures <- jack.CI.random.raters(gestures, chance.agreement)
printCI(" Chance agreement (Keys)", chance.keys)
printCI("Chance agreement (Gestures)", chance.gestures)
## Chance agreement (Keys) = 0.033, 95% CI [0.025, 0.041]
## Chance agreement (Gestures) = 0.126, 95% CI [0.098, 0.155]
We observe that chance agreement is substantially higher for gestures. This can be partly explained by the smaller number of unique signs observed for gestures: 27 signs for gestures vs. 71 signs for keys. However, this is not the only explanation. We calculate the frequency of signs across all referents.
signs <- unlist(gestures)
counts <- sort(table(signs), decreasing = T)
freq <- counts/sum(counts)
print("Frequency (%) of the six most frequent signs:")
print(head(freq)*100)
## [1] "Frequency (%) of the six most frequent signs:"
## signs
## top away left pull right towards
## 28.095238 8.809524 8.809524 8.690476 8.214286 7.857143
We observe that the top sign (a regular key press) alone accounted for 28% of all gesture proposals. It is likely that this sign served as a “default” sign when participants could not come up with a meaningful gesture. This strong bias towards a single sign further inflates chance agreement.
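Under Fleiss' model, chance agreement is the sum of squared sign proportions, so one dominant sign contributes quadratically. A quick check with the frequencies reported above (base R, illustrative):

```r
# Squared sign proportions approximate Fleiss' chance agreement.
# Proportions of the six most frequent signs, from the output above:
p <- c(top = 0.28095, away = 0.08810, left = 0.08810,
       pull = 0.08690, right = 0.08214, towards = 0.07857)
top.share <- p[["top"]]^2  # the "top" sign alone contributes ~0.079
total6 <- sum(p^2)         # ~0.115 of the ~0.126 total chance agreement
```

The 21 rarer signs contribute only the small remainder, which is why the skewed bias distribution, and not just the number of unique signs, drives chance agreement for gestures.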
For the rest of our analysis, we focus on the key gestures. In addition to overall agreement, we calculate agreement specific to each sign:
source("coefficients/specific.agreement.CI.R")
specific <- specific.agreement.CI(gestures) # A data frame with interval estimates and sign frequencies
specific <- specific[with(specific, order(-Freq)),] # Sort the signs by their frequencies
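The quantity estimated here, the agreement specific to one sign, can be sketched in base R as the proportion of agreeing participant pairs among all pairs involving that sign (an illustrative sketch in the spirit of Spitzer and Fleiss (1974), not the repository's `specific.agreement.CI`):

```r
# Specific agreement for one sign (illustrative, base R):
# agreeing pairs on the sign / all pairs in which the sign appears
specific.agreement <- function(ratings, sign) {
  m <- ncol(ratings)
  # Number of participants who proposed the sign, per referent
  n.k <- apply(ratings, 1, function(row) sum(row == sign))
  sum(n.k * (n.k - 1)) / sum(n.k * (m - 1))
}

ratings <- matrix(c("A","A","B",
                    "A","A","A",
                    "B","C","B"), nrow = 3, byrow = TRUE)
specific.agreement(ratings, "A")
```

Unlike overall agreement, this measure tells us how consistently a particular sign was reused once someone proposed it.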
We then plot specific agreements with their 95% jackknife CIs. We also display the sign frequencies (percentages in red):
library(ggplot2)
signs <- specific$Category # Get the names of the signs
plot <- ggplot(specific, aes(x = Category, y = Specific)) +
geom_point(size = 1.8, color = "#0000aa", shape = 1) +
scale_x_discrete(name ="Sign", limits = signs) +
scale_y_continuous(breaks=c(0, 0.2, 0.4, 0.6, 0.8, 1)) + theme_bw() +
geom_errorbar(aes(ymax = Upper, ymin = Lower), width=0.01, color = "#0000aa") +
geom_text(aes(label = sprintf("%2.1f%%", Freq), y = -.15), color="red", size=3.1, vjust=0) +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 40, vjust = 1, hjust = 1, size = 10),
axis.text.y = element_text(size = 12)) +
theme(panel.border = element_blank(), panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(colour = "#000000", size=0.03)) +
ylab("Specific Agreement")
print(plot)
We observe agreement for only 13 signs; the other signs appeared with no consensus among participants. The above calculation does not account for chance agreement. Spitzer and Fleiss (1974) argue that agreement specific to categories should also be corrected for chance agreement. To this end, we can use a simple formulation described by Uebersax (1982), which assumes that the bias distribution is common across all raters (participants in our case).
specific.corrected <- chance.corrected.specific.agreement.CI(gestures)
specific.corrected <- specific.corrected[with(specific.corrected, order(-Freq)),]
We can again plot the chance-corrected specific agreement as before to produce the following graph:
Notice that chance correction penalizes the very frequent signs, such as the top sign, but has little effect on less frequent signs. We observe a particularly high agreement on the use of the CW/CCW sign (turn the key clockwise or counter-clockwise). Such agreement did not emerge by chance, as its use was not arbitrary but rather selective and consistent across participants.
For the analyses reported in the TOCHI article, specific agreement has not been corrected for chance agreement. However, agreement values are interpreted by taking into account their observed sign frequencies. Both approaches are valid as long as the authors are clear about their analyses and also report the bias distributions that they observe.
We can also use Fleiss’ Kappa for individual (or groups of) referents by assuming a common chance agreement across all referents:
refs <- list() # referents list
k <- list() # Kappa estimates
l <- list() # Lower bounds of 95% CI
u <- list() # Upper bounds of 95% CI
for(index in 1:nrow(gestures)){
# Construct the jackknife CI for Fleiss' Kappa of each referent
ci <- jack.CI.random.raters.fleiss.kappa.for.item(gestures, index)
refs[index] <- data[index, 1]
k[index] <- ci[1]
l[index] <- ci[2]
u[index] <- ci[3]
}
# Create a data frame with all the estimates
fleiss.df <- data.frame(Referent = as.character(refs), Kappa = as.double(k), Low = as.double(l), Upper = as.double(u))
# Sort the referents by their agreement values
fleiss.df <- fleiss.df[with(fleiss.df, order(-Kappa)),]
We can then plot our results as follows:
referents <- fleiss.df$Referent
plot <- ggplot(fleiss.df, aes(x = Referent, y = Kappa)) +
geom_hline(yintercept = 0, color = "red", size = .2) +
geom_point(size = 1.8, color = "#0000aa", shape = 1) +
scale_x_discrete(name ="Referent", limits = referents) +
scale_y_continuous(breaks=c(-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1)) +
theme_bw() +
geom_errorbar(aes(ymax = Upper, ymin = Low), width=0.01, color = "#0000aa") +
theme(axis.title.x = element_blank(), axis.text.x = element_text(angle=48, vjust=1, hjust=1, size=10),
axis.text.y = element_text(size = 10)) +
theme(panel.border = element_blank(), panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(), panel.grid.major = element_line(colour="#000000", size=0.03))
print(plot)
Notice that Kappa can take negative values, where a negative value implies “disagreement.”
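To see why, consider two participants who systematically pick different signs (an illustrative base-R computation of Fleiss' Kappa, not the repository's function):

```r
# Two raters who always disagree drive Kappa below zero
ratings <- matrix(c("A","B",
                    "B","A",
                    "A","B",
                    "B","A"), ncol = 2, byrow = TRUE)
cats <- sort(unique(as.vector(ratings)))
n <- nrow(ratings); m <- ncol(ratings)
counts <- t(apply(ratings, 1, function(r) table(factor(r, levels = cats))))
p.a <- mean(rowSums(counts * (counts - 1)) / (m * (m - 1)))  # 0: no agreeing pairs
p.e <- sum((colSums(counts) / (n * m))^2)                    # 0.5
kappa <- (p.a - p.e) / (1 - p.e)                             # -1: perfect disagreement
```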
Warning: Jackknife confidence intervals for individual referents may not be precise, especially when the number of participants is not sufficiently large.
We can also use confidence intervals to back up the authors’ claim on the effect of directional referents on agreement: “highly directional commands […] tended to have a high gesture agreement” (Bailly et al. 2013). The difference in Kappa between the eight directional referents containing the terms top, bottom, left, right, previous, or next and all other referents can be estimated as follows:
directional <- c("top", "bottom", "left", "right", "previous", "next", "Previous", "Next")
referents.all <- data[,1]
matched.rows <- grep(paste(directional,collapse="|"), referents.all)
directional <- gestures[matched.rows,] # gesture proposals for directional referents
other <- gestures[-matched.rows,] # other proposals
# This assumes that chance agreement is the same for the two referent groups
# For this, we use a pooled pe -- calculated for the full dataset
fleiss.pe <- fleiss.chance.agreement.raw.noinference(gestures)
fleiss.pooled <- function(ratings) {generic.kappa(ratings, fleiss.pe)}
diff.Kappa <- jack.CI.diff.random.raters(directional, other, fleiss.pooled)
printCI("Difference in Fleiss' Kappa (pooled pe)", diff.Kappa)
## Difference in Fleiss' Kappa (pooled pe) = 0.408, 95% CI [0.237, 0.579]
The difference in Fleiss’ Kappa between the two groups of referents is \(\Delta \kappa =.41\), \(95\%\) CI \(= [.24, .58]\), so there is strong evidence to support the authors’ claim.
Bailly et al. (2013) collected proposals from 11 women and 9 men. Vatavu and Wobbrock (2016) reanalyzed their dataset to test differences in agreement between genders but observed similar overall agreement rates between women and men (.353 vs. .322). We can use Fleiss’ Kappa to estimate this difference and construct its 95% confidence interval with the percentile bootstrap method (this can take several minutes):
# This mask identifies the ID of male participants
men_mask <- c(1, 5, 6, 8, 9, 13, 15, 18, 20)
gestures_men <- gestures[men_mask]
gestures_women <- gestures[-men_mask]
kappa.men <- jack.CI.random.raters(gestures_men, fleiss.kappa)
printCI("Fleiss' Kappa for Men", kappa.men)
kappa.women <- jack.CI.random.raters(gestures_women, fleiss.kappa)
printCI("Fleiss' Kappa for Women", kappa.women)
# The first argument specifies the referents of interest
# Here, we want to compare agreement over all the referents
kappa.delta <- fleiss.kappa.bootstrap.diff.ci(1:nrow(gestures),
gestures_women, gestures_men,
R = 3000) # Num of bootstrap samples
printCI("Difference in Fleiss' Kappa between Women and Men", kappa.delta)
## Fleiss' Kappa for Men = 0.207, 95% CI [0.087, 0.327]
## Fleiss' Kappa for Women = 0.268, 95% CI [0.207, 0.330]
## Difference in Fleiss' Kappa between Women and Men = 0.061, 95% CI [-0.115, 0.165]
Clearly, there is no evidence of a difference in agreement between women and men.
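The percentile bootstrap behind such an interval can be sketched in base R. This is an illustrative sketch that resamples participants within each group with replacement, which is one plausible scheme; the repository's `fleiss.kappa.bootstrap.diff.ci` may implement the resampling differently:

```r
# Percentile-bootstrap CI for a difference in an agreement statistic
# between two participant groups (illustrative sketch, base R).
boot.diff.ci <- function(g1, g2, stat, R = 1000, conf = 0.95) {
  diffs <- replicate(R, {
    # Resample participants (columns) with replacement within each group
    b1 <- g1[, sample(ncol(g1), replace = TRUE), drop = FALSE]
    b2 <- g2[, sample(ncol(g2), replace = TRUE), drop = FALSE]
    stat(b1) - stat(b2)
  })
  a <- 1 - conf
  c(estimate = stat(g1) - stat(g2),
    quantile(diffs, probs = c(a / 2, 1 - a / 2)))
}
```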
However, Vatavu and Wobbrock (2016) continued their analysis and used the Vb statistic to compare agreement differences between genders for individual referents. They found “significant differences (p < .05) for 7 referents.” Based on this finding, they concluded: “these results show that women and men reach consensus over gestures in different ways that depend on the nature of the referent […]”. The TOCHI article shows that such differences are random, due to the high Type I error rate of the Vb statistic.
Can we then use the bootstrap method to test gender differences for individual referents? We discourage such practices for several reasons. Making comparisons between women and men was outside the scope of the original study of Bailly et al. (2013). The two groups were particularly small, and the study did not control for confounding variables that might correlate with gender. We argue against making unplanned post-hoc comparisons over uncontrolled samples of such small sizes, as they can lead to misleading conclusions. Furthermore, jackknife and bootstrap confidence intervals for individual referents may not be precise, especially when sample sizes are small. Therefore, using them to test such hypotheses may not be appropriate.
The same agreement indices are used for the analysis of inter-rater reliability studies, such as to assess how independent raters agree on their classification of design outcomes, e.g., see Bousseau et al. (2016). However, the underlying assumptions for statistical inference may not be the same. In gesture elicitation studies, referents are fixed: any conclusion typically only applies to these referents. Participants, in contrast, are chosen randomly, and investigators may need to generalize their conclusions to the entire population of potential users.
Therefore, the above methods of inference will not apply if items classified by raters are not fixed but are rather sampled from a larger population. Gwet (2014) discusses solutions for a range of scenarios. His AgreeStat software deals with many of them.
Note that chance-corrected coefficients have received a lot of criticism from many authors, including Gwet (2014), who has introduced his own measure. Despite such criticisms, Fleiss’ Kappa and Krippendorff’s alpha are still widely used for good reasons. For a strong argument in favor of these measures, I refer readers to Krippendorff (2016).
Pierre Dragicevic has contributed insights and code for the above analyses. I am grateful to Gilles Bailly and his co-authors for giving access to their anonymized dataset.
Bailly, Gilles, Thomas Pietrzak, Jonathan Deber, and Daniel J. Wigdor. 2013. “Métamorphe: Augmenting Hotkey Usage with Actuated Keys.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 563–72. CHI ’13. New York, NY, USA: ACM. doi:10.1145/2470654.2470734.
Bousseau, Adrien, Theophanis Tsandilas, Lora Oehlberg, and Wendy E. Mackay. 2016. “How Novices Sketch and Prototype Hand-Fabricated Objects.” In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 397–408. CHI ’16. New York, NY, USA: ACM. doi:10.1145/2858036.2858159.
Gwet, Kilem Li. 2014. Handbook of Inter-Rater Reliability, 4th Edition: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Advanced Analytics, LLC. https://books.google.fr/books?id=fac9BQAAQBAJ.
Krippendorff, Klaus. 2016. “Misunderstanding Reliability.” Methodology: European Journal of Research Methods for the Behavioral and Social Sciences 12 (4): 139–44. doi:10.1027/1614-2241/a000119.
Spitzer, Robert L., and Joseph L. Fleiss. 1974. “A Re-Analysis of the Reliability of Psychiatric Diagnosis.” The British Journal of Psychiatry 125 (587). The Royal College of Psychiatrists: 341–47. doi:10.1192/bjp.125.4.341.
Uebersax, John S. 1982. “A Design-Independent Method for Measuring the Reliability of Psychiatric Diagnosis.” Journal of Psychiatric Research 17 (4). Elsevier: 335–42. doi:10.1016/0022-3956(82)90039-5.
Vatavu, Radu-Daniel, and Jacob Wobbrock. 2016. “Between-Subjects Elicitation Studies: Formalization and Tool Support.” In Proceedings of the 34th Annual ACM Conference on Human Factors in Computing Systems, 3390–3402. CHI ’16. New York, NY, USA: ACM. doi:10.1145/2858036.2858228.
Vatavu, Radu-Daniel, and Jacob O. Wobbrock. 2015. “Formalizing Agreement Analysis for Elicitation Studies: New Measures, Significance Test, and Toolkit.” In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1325–34. CHI ’15. New York, NY, USA: ACM. doi:10.1145/2702123.2702223.