Science funding agencies say no to using AI for peer review

Neuroscientist Greg Siegle was at a conference in early April when he heard something he found “very scary.” Another scientist was gushing that ChatGPT, the artificial intelligence (AI) tool released in November 2022, had quickly become indispensable for drafting critiques of the thick research proposals he had to wade through as a peer reviewer for the National Institutes of Health (NIH). Other listeners nodded, saying they saw ChatGPT as a major time saver: Drafting a review might entail just pasting parts of a proposal, such as the abstract, aims, and research strategy, into the AI and asking it to evaluate the information.

NIH and other funding agencies, however, are putting the kibosh on the approach. On 23 June, NIH banned the use of online generative AI tools like ChatGPT “for analyzing and formulating peer-review critiques”—likely spurred in part by a letter from Siegle, who is at the University of Pittsburgh, and colleagues. After the conference they warned the agency that allowing ChatGPT to write grant reviews is “a dangerous precedent.” In a similar move, the Australian Research Council (ARC) on 7 July banned generative AI for peer review after learning of reviews apparently written by ChatGPT.

Other agencies are also developing a response. The U.S. National Science Foundation has formed an internal working group to look at whether there may be appropriate uses of AI as part of the merit review process, and if so what “guardrails” may be needed, a spokesperson says. And the European Research Council expects to discuss AI for both writing and evaluating proposals.

ChatGPT and other large language models train on vast databases of information to generate text that appears to be written by humans. The bots have already prompted scientific publishers concerned about ethics and factual accuracy to restrict their use for writing papers. Some publishers and journals, including Science, are also banning their use by reviewers.

For the funding agencies, confidentiality tops the list of concerns. When parts of a proposal are fed into an online AI tool, the information can become part of its training data. NIH worries about “where data are being sent, saved, viewed, or used in the future,” its notice states.

Critics also worry that AI-written reviews will be error-prone (the bots are known to fabricate), biased against nonmainstream views because they draw from existing information, and lack the creativity that powers scientific innovation. “The originality of thought that NIH values is lost and homogenized with this process and may even constitute plagiarism,” NIH officials wrote on a blog. For journals, reviewer accountability is also a concern. “There’s no guarantee the [reviewer] understands or agrees with the content” they’re providing, says Kim Eggleton, who heads peer review at IOP Publishing.

In Australia, ARC banned grant reviewers from using generative AI tools 1 week after an anonymous Twitter account, ARC_Tracker, run by a researcher there, reported that some scientists had received reviews that appeared to be written by ChatGPT. Some got similar appraisals when they pasted parts of their proposals into ChatGPT, ARC_Tracker says. One review even included a giveaway, the words “regenerate response” that appear as a prompt at the end of a ChatGPT response. (ScienceInsider confirmed ARC_Tracker’s identity but agreed to anonymity so this scientist and others can use the account to freely critique ARC and government policies without fear of repercussions.)

Scientists may think ChatGPT produces meaningful feedback, but it essentially regurgitates the proposal, ARC_Tracker’s owner says. Admittedly, some human reviewers do that, too. But, “There is a very big difference between a proper review–which should provide insight, critique, informed opinion and expert assessment–and a mere summary of what’s already in a proposal,” the scientist wrote in an email to ScienceInsider.

Some researchers, however, say AI offers a chance to improve the peer-review process. The NIH ban is a “technophobic retreat from the opportunity for positive change,” says psychiatric geneticist Jake Michaelson of the University of Iowa. Reviewers could use the tools to check their critique to see whether they’ve overlooked anything in the proposal, help them assess work from outside their own field, and smooth language they did not realize sounds “petty or even mean,” Michaelson says. “Eventually I see AI becoming the first line of the peer-review process, with human experts supplementing first-line AI reviews. … I would rather have my own proposals reviewed by ChatGPT-4 than a lazy human reviewer,” he adds.

The landscape is likely to change over time. Several scientists noted on NIH’s blog that some generative AI models work offline and don’t violate confidentiality—eliminating at least that concern. NIH responded that it expects to “provide additional guidance” for a “rapidly evolving area.”

Mohammad Hosseini, an ethics postdoctoral researcher at Northwestern University who has written about AI in manuscript review, agrees the NIH ban is reasonable, for now: “Given the sensitivity of issues and projects the NIH deals with, and the novelty of AI tools, adopting a cautious and measured approach is absolutely necessary.”