A recent survey of UK teachers found that 73% think the curriculum does not place enough focus on teaching soft skills for employment. With employers increasingly saying they value creativity, problem-solving skills and unconventional thinking, education systems’ ability to foster these attributes is critical.
The growing prevalence of generative AI means that using written work as a test of students’ learning and creativity is under threat. Given this problem, how can assessment methods support the cultivation of the skills valued in the workplace?
Establishing whether we can deliver on the stated intention of rewarding submissions that display those skills is difficult, because it requires comparing written work in bulk. The challenge only grows when we add the need to judge originality and detect AI use across thousands of scripts.
RM and the Independent Schools Examinations Board (ISEB) conducted an exercise to test two hypotheses: first, that human assessors would rate content showing originality more highly; and second, that the assessors could spot AI-generated entries among the submitted scripts. The comparison exercise used 3,017 entries to the ISEB’s annual ‘Time to Write’ competition, a creative writing exercise giving students of mixed ages complete freedom to produce original stories.
Human assessors judged the competition entries using RM’s Adaptive Comparative Judgement tool, RM Compare, which asks judges to compare entries in pairs and repeatedly choose the stronger of the two. Aggregating many of these pairwise judgements produced a reliable and stable ‘rank order’ for the entries. The entries were then further evaluated using RM Echo, a content analysis tool that can identify areas of similarity across large volumes of written content.
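To illustrate the general idea behind comparative judgement (this is a minimal sketch, not RM Compare’s actual algorithm), pairwise “this entry beats that entry” decisions can be turned into a rank order by fitting a simple Bradley-Terry model. The entry names and sample judgements below are invented for the example.

```python
# Minimal sketch: turning pairwise judgements into a rank order with a
# Bradley-Terry model. Not RM Compare's implementation; for illustration only.
from collections import defaultdict

def rank_from_judgements(judgements, iterations=100):
    """judgements: list of (winner, loser) pairs from comparative judgement."""
    entries = {e for pair in judgements for e in pair}
    strength = {e: 1.0 for e in entries}
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for winner, loser in judgements:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    for _ in range(iterations):
        new_strength = {}
        for i in entries:
            # Expected share of wins given current strengths of opponents.
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in entries if j != i and pair_counts[frozenset((i, j))]
            )
            new_strength[i] = wins[i] / denom if denom else strength[i]
        # Rescale so strengths stay comparable between iterations.
        total = sum(new_strength.values())
        strength = {e: s * len(entries) / total for e, s in new_strength.items()}

    return sorted(entries, key=lambda e: strength[e], reverse=True)

# Toy usage: a handful of pairwise decisions over three scripts.
sample = [("script_A", "script_B"), ("script_A", "script_C"),
          ("script_B", "script_C"), ("script_A", "script_B")]
print(rank_from_judgements(sample))  # ['script_A', 'script_B', 'script_C']
```

In practice the adaptive part of the process also chooses which pairs to show judges next, so a stable rank order emerges from far fewer comparisons than an exhaustive pairing would need.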
We added 18 AI-generated stories to the human-created competition entries. The judges did not know that the pool contained AI-generated stories, but they were invited to ‘flag’ any entry they suspected might have been created using AI. The assessors spotted only a small percentage of the AI-generated “cuckoos”; however, the vast majority of those AI-generated entries were judged less favourably than the human-written work.
Both RM Compare and RM Echo are well suited to bulk text analysis, as each can handle large numbers of documents. RM Compare also reduces the time needed to assess large numbers of scripts, while RM Echo performs bulk analysis of written content for originality, integrity and references.
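As a rough illustration of what bulk similarity analysis involves (this is not RM Echo’s actual method), one common approach is to represent every document as a TF-IDF vector and flag pairs whose cosine similarity is unusually high. The document ids and threshold below are assumptions made up for the example.

```python
# Rough sketch of bulk similarity screening with TF-IDF and cosine similarity.
# Not RM Echo's method; the ids and threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_pairs(texts, ids, threshold=0.8):
    """Return pairs of document ids whose cosine similarity exceeds threshold."""
    matrix = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sims = cosine_similarity(matrix)
    pairs = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if sims[i, j] >= threshold:
                pairs.append((ids[i], ids[j], round(float(sims[i, j]), 3)))
    return pairs

# Toy usage with three short "scripts"; the first two are near-duplicates.
docs = ["Once upon a time a dragon guarded the old library.",
        "Once upon a time a dragon guarded the ancient library.",
        "The spaceship drifted silently past the rings of Saturn."]
print(find_similar_pairs(docs, ["entry_1", "entry_2", "entry_3"]))
```

Because the whole corpus is vectorised once, this kind of screening scales to thousands of scripts, which is exactly the volume at which manual side-by-side comparison breaks down.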
Judging work for originality can be a subjective process, even with strong rubrics to guide assessors, and maintaining the desired level of consistency across a cohort of examiners is challenging. If we are to value and encourage originality, text comparison tools that analyse work in bulk will help make that possible, as will the ability to reliably spot AI-derived content. That ability becomes even more valuable because AI-derived content is likely to be a distillation of work already created by humans and used to train large language models (LLMs).
You can read the full story of the ISEB ‘Time to Write’ competition judgements by downloading the case study.