Using Fleiss’ Kappa Coefficient to Measure the Intra and Inter-Rater Reliability of Three AI Software Programs in the Assessment of EFL

Date

2024-1

Type

Article

Journal title

International Journal of Educational Sciences and Arts (IJESA)

Issue

Vol. 3 No. 1

Author(s)

Iman Muftah Albakkosh

Pages

69 - 96

Abstract

This study compared the intra- and inter-rater reliability of three AI tools for assessing EFL learners' story writing: Poe.com, Bing, and Google Bard. The study used quantitative methods to answer the research questions, namely calculating the Fleiss' Kappa coefficient with the Datatab software program (available at datatab.com). The sample comprised 14 pieces written by adult Libyan EFL learners; each piece was a story built around a prompt provided by the teacher. The assessment applied two sets of criteria: one that included a measure of the students' creativity, and one that focused only on the linguistic aspects of the students' writing. The three applications performed reliably to a certain extent even when the creativity criterion was included, which runs counter to the common belief that AI software cannot assess creativity. Still, the reliability measurements that included the creativity criterion show that the assessment scores are not statistically significant, and there is a high probability that the observed agreement is due to random chance. Limitations of this study include the small sample size, the limited number of criteria, and the absence of human raters for comparison. Future research could involve more participants, criteria, AI tools, and human raters to provide a more comprehensive and reliable evaluation of AI tools for assessing EFL story writing.
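For readers unfamiliar with the statistic, the following is a minimal sketch of how Fleiss' Kappa is computed from a table of rating counts. The study itself used Datatab for this calculation; the rating data below is hypothetical and only illustrates the formula, not the study's actual scores.

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for a counts table.

    ratings[i][j] = number of raters who assigned subject i to category j.
    Assumes every subject is rated by the same number of raters.
    """
    n_subjects = len(ratings)
    n_categories = len(ratings[0])
    n_raters = sum(ratings[0])
    total = n_subjects * n_raters

    # Proportion of all assignments that fall into each category.
    p = [sum(row[j] for row in ratings) / total for j in range(n_categories)]

    # Per-subject agreement P_i, then mean observed agreement P_bar.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    P_bar = sum(P_i) / n_subjects

    # Agreement expected by chance alone.
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 3 AI raters placing 4 stories into 3 score bands.
counts = [
    [3, 0, 0],
    [0, 2, 1],
    [1, 1, 1],
    [0, 0, 3],
]
print(round(fleiss_kappa(counts), 3))  # prints 0.362
```

A Kappa near 1 indicates strong agreement beyond chance, while a value near 0 means the observed agreement is about what random rating would produce, which is the concern the abstract raises for the creativity criterion.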
