FAQ about Kappa Results


Calculating Kappa often produces more questions than answers, especially if your coding system is not mutually exclusive, your data contains time gaps, or you have just 1 or 2 Codes within a Class.

General Kappa Weaknesses

The points listed here are general issues with Kappa itself, not with INTERACT!

oThere is no weight for “how difficult a behavior is to code”. Some behaviors are trivial to detect, which should result in a time-accurate match/mismatch calculation; other behaviors are extremely difficult to detect, which means the calculation should be less strict when identifying matches/mismatches.

oInterval-based scale ratings are not suitable either, because the difference between a rating of 1 and 2 is weighted the same as the difference between 1 and 6.

oThe semantics of a Code are not included in the calculation. Example: a dog barks five times in succession. Observer 1 logs five events, observer 2 only one long event. In this case you get 4 mismatches (the breaks between the events of observer 1). Depending on how often this happens during your study, such mismatches can add up quickly, even though they may be irrelevant with respect to the “meaning” of the Code. This is a very simple example; there are many situations where the real-world meaning (semantics) of a Code should indeed influence how matches/mismatches are calculated.

oOverlapping Events need to have their Codes in different Classes, otherwise the routine does not know what to match.

oThe duration variance of behaviors is not taken into account. We think that the length of a recorded behavior should add a weight to each match/mismatch found.

oUsing an event-based algorithm (INTERACT) or a time-sequence algorithm (GSEQ) can quickly lead to totally different results, mainly because it is undefined how long each interval should be when you break event-based data down into time-sequenced intervals. Suddenly a single bark of a dog results in N matches/mismatches, although it was only one event in the real world. On top of that, the interval width would need to vary for different behaviors, given the points above (see the sketch below).
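
The following is a minimal sketch, not the INTERACT or GSEQ algorithm, showing how the same observation yields different match counts under an event-based view versus a fixed time-interval view. The events and the 1-second interval width are hypothetical illustration values.

```python
def to_intervals(events, width, total):
    """Mark each fixed-width interval as 1 if any event covers it, else 0."""
    return [
        1 if any(start < (i + 1) * width and end > i * width for start, end in events) else 0
        for i in range(int(total / width))
    ]

# Observer 1 logs five short barks, observer 2 one long event (seconds).
obs1 = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
obs2 = [(0, 9)]

# Event-based view: 5 events vs. 1 event.
print("event counts:", len(obs1), "vs", len(obs2))

# Interval-based view: every 1-second slice becomes its own (dis)agreement.
i1 = to_intervals(obs1, width=1, total=10)
i2 = to_intervals(obs2, width=1, total=10)
agreements = sum(a == b for a, b in zip(i1, i2))
print("interval agreements:", agreements, "of", len(i1))   # 6 of 10
```

With this (arbitrary) 1-second grid, one real-world barking episode turns into ten separate agreement decisions, four of which are disagreements caused purely by the segmentation.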

Kappa Result Depends on Selection Order

If your Kappa result differs whenever you switch the order of your data files:

This is a direct result of your data!

Suppose, for example, that Code "A" in the 'master' file has a duration of 4 seconds, and the second file contains the same Code "A" around the same time, but with a duration of 8 seconds and a slight time offset.

If you run the Kappa routine with 80% overlap required, 80% of the 4-second event (3.2 seconds) is easily covered by the 8-second event.

The other way around, though, this combination might not be considered a match!
Simply because 80% of 8 seconds = 6.4 seconds, which the 4-second event can never cover.
But if there is no other matching or mismatching event, and the 4-second event starts within the time frame you defined as the second parameter (relative to the start of the 8-second event), it will be considered a match again...
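
Below is a minimal sketch, not INTERACT's actual matching routine, illustrating why the result depends on which file is treated as the master. The 80% overlap requirement and the 1-second start tolerance are hypothetical parameter values used only for this example.

```python
def overlap(a, b):
    """Overlapping duration of two (start, end) events in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def is_match(master, other, min_overlap=0.8, start_tolerance=1.0):
    """Match if the other event covers min_overlap of the MASTER event's
    duration, or failing that, if both starts lie within start_tolerance."""
    if overlap(master, other) >= min_overlap * (master[1] - master[0]):
        return True
    return abs(master[0] - other[0]) <= start_tolerance

short = (10.0, 14.0)   # 4-second event in one file
long_ = (10.5, 18.5)   # 8-second event, slightly offset, in the other file

# 80% of 4 s = 3.2 s -> the 8-second event easily covers it.
print(is_match(master=short, other=long_))   # True

# 80% of 8 s = 6.4 s -> the 4-second event cannot cover it; only the
# start-tolerance fallback can still turn this into a match.
print(is_match(master=long_, other=short))   # True, but only via the tolerance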

Negative Kappa Value

Negative Kappa values occur if the analyzed data is not suitable for the Kappa formula.

The Kappa formula is largely based on probability, and for that you need a pool of different Codes within a single Class. Classes with just 1 or 2 different Codes simply won't work for Kappa, nor will DataSets with only around 10 Events. This is not an INTERACT issue, but a consequence of the nature of the Kappa formula as developed by Mr. Cohen.

Because many of our customers work with coding systems like yours, we implemented the overlapping percentage underneath the Kappa results. This percentage is the simple ratio of the number of pairs found to the number of Codes coded, without taking probability into account (see the sketch below).
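
Here is a minimal sketch using standard Cohen's Kappa arithmetic (not necessarily INTERACT's implementation) of how a tiny Class with only 2 Codes and a handful of Events can produce a negative Kappa. The confusion matrix and the counts for the overlap percentage are invented purely for illustration.

```python
# Rows = observer 1, columns = observer 2, for two hypothetical Codes "A" and "B".
matrix = [[1, 4],
          [4, 1]]
n = sum(sum(row) for row in matrix)

# Observed agreement = diagonal; expected agreement = product of marginals.
p_obs = sum(matrix[i][i] for i in range(2)) / n
p_exp = sum(
    (sum(matrix[i]) / n) * (sum(row[i] for row in matrix) / n)
    for i in range(2)
)
kappa = (p_obs - p_exp) / (1 - p_exp)
print(p_obs, p_exp, kappa)   # 0.2, 0.5, -0.6 -> negative Kappa

# The plain overlap percentage ignores chance: pairs found / Codes coded.
pairs_found = 2          # hypothetical matched pairs
codes_coded = 10         # hypothetical total coded Events
print(pairs_found / codes_coded)   # 0.2
```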

Kappa Value = 0

If your data results in Pexpected = Pobserved, the final Kappa value will be zero!
Simply because in K = (Pobs – Pexp) / (1 – Pexp), the numerator Pobs – Pexp will in that case be 0.

This makes sense beyond the pure mathematics as well: Pexp is the proportion of agreement expected by chance, so if the observed value Pobs exactly matches it, the coders agreed no better than they would have by chance... (a quick numeric check follows below).
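
A quick numeric check of K = (Pobs – Pexp) / (1 – Pexp): whenever the observed agreement equals the agreement expected by chance, Kappa is 0. The 0.6 value is an arbitrary example.

```python
p_obs = 0.6
p_exp = 0.6
kappa = (p_obs - p_exp) / (1 - p_exp)
print(kappa)   # 0.0 -> agreement no better than chance
```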

Overall Kappa

Note: There is no such thing as an overall Kappa for multiple Classes. Mr. Cohen designed the Kappa formula for sequential, exhaustive codings. INTERACT data, when split over multiple Classes, is usually neither sequential nor exhaustive.

Kappa per Code

There is no such thing as a Code based Kappa.

That is because of the nature of the original Kappa routine as it was developed by Mr. Cohen. He developed it for exhaustive, continuous codings that do not overlap in time, at least within a Class.

It is not suitable for single Codes.

The Kappa routine largely depends on probabilities, and probabilities can only be trusted in large pools of data.

We do provide the Pobserved and Pexpected per Code, but those values are not really related to Kappa, so it is not possible to use them for a manual Kappa calculation.

The Code-based values for Pobs and Pexp are based on our own implementation of a marginal cumulative probability calculation.

Our %-based agreement is listed per Code. This is an unweighted percentage calculation in which the number of matches is divided by the total number of occurrences for that one Code, as explained in Kappa Results. A small sketch follows below.
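
The following is a minimal sketch of that unweighted per-Code percentage: matches for one Code divided by its total number of occurrences. The Code names and counts are hypothetical.

```python
matches = {"bark": 7, "growl": 3}        # matched pairs per Code (hypothetical)
occurrences = {"bark": 10, "growl": 6}   # total occurrences per Code (hypothetical)

for code in occurrences:
    pct = 100.0 * matches.get(code, 0) / occurrences[code]
    print(f"{code}: {pct:.1f}% agreement")   # bark: 70.0%, growl: 50.0%
```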