Large language models (LLMs) such as ChatGPT have grown so advanced that they can even pass the United States Medical Licensing Examination. But how good are peer reviewers at detecting AI-generated writing, and how does the use of AI affect their perceptions of the work?
A team led by Lee Schwamm, MD, associate dean for digital strategy and transformation at Yale School of Medicine, attempted to answer these questions by hosting an essay contest for the journal Stroke that included both AI and human submissions. The researchers found that reviewers struggled to distinguish human from AI essays when authorship was blinded. However, when reviewers believed an essay was written by AI, they were significantly less likely to rate it as the best on a given topic.
Schwamm hopes the findings highlight the need for developing policies on the appropriate use of AI in scientific manuscripts. His team published its findings in Stroke on September 3.
“This study is a wakeup call to editorial boards, and educators as well, that we can’t sit around waiting for someone else to figure this out,” Schwamm says. “We need to start thinking about what the right guardrails are within these spheres for where we should encourage the use, where should we be neutral, and where we should ban it.”
Reviewers struggle with AI detection
Schwamm’s team invited readers of Stroke to submit persuasive essays on one of three controversial topics in the stroke field—e.g., do statins increase the risk of hemorrhagic stroke? Essays were to be up to 1,000 words and contain no more than six references. In total, the researchers received 22 human submissions.
Then, the researchers used four different LLMs—ChatGPT 3.5, ChatGPT 4, Bard, and LLaMA-2—to each write one essay per topic. While they didn't edit the text of the AI essays, they did review and correct the literature citations. “References are one of those places where AI is known to make a lot of errors,” Schwamm explains, “and we didn’t want that to give the AI away—we wanted the reviewers to really focus on the quality of the writing.”
The reviewers, all members of the Stroke editorial board, were asked to attribute human or AI authorship to each essay, rate the essays for quality and persuasiveness, and select the best essay for each of the three prompts. Strikingly, the reviewers correctly identified authorship only 50% of the time. “It was like a flip of a coin,” Schwamm says.
Reviewers also rated the AI essays higher in quality than the human submissions. Interestingly, in a multivariable analysis, the team found that persuasiveness was the only factor independently associated with greater odds of a reviewer correctly attributing an essay to AI. “The more persuasive the reviewer perceived the article to be, the more that it was associated with AI authorship,” says Schwamm.
The team also found that when reviewers believed an essay was written by AI, they rated it the best on its topic only 4% of the time. “The reviewers weren’t able to tell human- and AI-generated essays apart, but when they decided that an essay was written by AI, they almost never chose it as best in class,” says Schwamm.
LLMs could be a game-changing tool in scientific writing
The study suggests that as LLMs advance, peer reviewers will be less and less able to detect AI-written content. It also reveals a negative bias among reviewers toward machine-generated writing. As more scientific writing becomes AI-generated, or a hybrid of human and AI work, the findings raise important questions about the role of AI in scientific publishing.
When LLMs first emerged, some scientific journals, such as Science, banned their use altogether. Science has since adjusted its stance, allowing researchers to declare how they used AI. “We have to fight the natural tendency to view the use of LLMs as unfair—that you somehow didn’t put in the hard work that you needed to,” Schwamm says. “We now use AI to actually do the science. So, it would be ironic to say you can’t have it be involved in the write-up of results.”
While there will be a greater onus on writers to fact-check any AI output, the growing use of AI does not have to be a negative. “We need to start thinking about AI as more of a tool that can be harnessed,” Schwamm says. “We have all sorts of ways that we have technology help us write, like spell checkers and word processors. This is a new iteration of that.”
For example, the technology will be a game changer for researchers in the U.S. who are not native English speakers. “I think it’s going to level the playing field in a good way,” Schwamm says.