Authorized reproduction from Academic Headlines, courtesy of Big Data Digest.
Compiled by: Xiao Xiao
Large Language Models (LLMs) commonly suffer from the hallucination problem, where they generate content that does not align with objective facts. In response, various techniques for suppressing hallucinations have been proposed.
In scientific research, creativity is a key element in formulating hypotheses and constructing new scientific concepts. As LLMs become more deeply integrated into research assistance, a tension emerges: creative thinking that relies on unconventional associations often resembles model hallucination in its surface form.
However, there is still no systematic understanding of whether hallucination-suppression techniques undermine model creativity.
To fill this gap, a research team from Nanyang Technological University used two creativity evaluation benchmarks to study the relationship between hallucination-suppression methods and model creativity, providing a practical reference for choosing suppression techniques in scientific application scenarios.
Paper Link: https://arxiv.org/pdf/2512.11509
They selected two creativity evaluation benchmarks:
NeoCoder evaluates model creativity through programming tasks with progressively tightening constraints; its highly regulated task environment is analogous to scientific experiments conducted under fixed natural laws.
CS4 focuses on open-ended story generation, emphasizing divergent association and imagination, and closely mirrors the creative thinking required for scientific hypothesis generation.
On both benchmarks, they systematically evaluated three hallucination-suppression techniques: Chain-of-Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG).
Figure: Schematic of the Experimental Framework
CoVe Enhances Divergent Creativity
Experimental results show that CoVe significantly enhances the divergent creativity of large models: it lets the model produce more novel and diverse answers when reasoning about a problem.
Experiments across different models and benchmarks showed that CoVe performs stably overall, with particularly notable improvements on some smaller models.
This indicates that by introducing a verify-and-question loop, CoVe guides the model to explore more potential reasoning paths rather than converging quickly along a single direction.
The effect of CoVe shows up not only in the metrics but also echoes the value of human divergent-thinking training: continual questioning and multi-directional thinking help break fixed mindsets and stimulate brainstorming-style creativity.
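The verify-and-question loop can be sketched as a small prompting pipeline. The prompts and the `generate` stub below are illustrative assumptions, not the paper's actual templates:

```python
# Minimal sketch of a Chain-of-Verification (CoVe) loop. `generate`
# stands in for any LLM call; the stub at the bottom is only there so
# the control flow runs end to end.

def cove(question, generate):
    # 1. Produce a baseline draft answer.
    draft = generate(f"Answer the question: {question}")
    # 2. Plan verification questions that probe the draft's claims.
    plan = generate(f"List verification questions for: {draft}")
    checks = [q for q in plan.split("\n") if q.strip()]
    # 3. Answer each verification question independently of the draft --
    #    this is the step that pushes the model down alternative paths.
    evidence = [generate(f"Answer briefly: {q}") for q in checks]
    # 4. Revise the draft in light of the verification answers.
    final = generate(
        f"Question: {question}\nDraft: {draft}\n"
        f"Verified facts: {evidence}\nWrite a corrected final answer."
    )
    return final

# Stub LLM: returns canned text so the loop is demonstrably runnable.
stub = lambda p: "q1\nq2" if "verification questions" in p else "ok"
print(cove("What causes tides?", stub))  # -> ok
```

In a real setting, `generate` would wrap an actual model call; the point of the sketch is that each verification question is answered in a fresh context, which is what broadens the search over reasoning paths.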
Figure: Impact of Decoding Methods on Divergent Creativity (NeoCoder)
RAG Has Limited Impact on Divergent Creativity
On the other hand, RAG has virtually no substantial effect on divergent creativity. Regardless of model size or benchmark, RAG results show only slight, random fluctuations around the baseline.
However, the research team notes a caveat: if the retrieval system can supply strategies or knowledge beyond the model's training data, RAG might still benefit creativity. High-quality retrieved content can sharpen the model's factual judgment, which complements creativity, and may spark new ideas by introducing fresh perspectives, thereby improving divergent creativity.
Figure: Impact of Decoding Methods on Divergent Creativity (CS4)
DoLa Suppresses Divergent Creativity
In contrast to the two techniques above, DoLa systematically reduces the model's divergent creativity. On both benchmarks, most models decoded with DoLa scored slightly below their baselines, suggesting the decline stems from DoLa itself rather than from differences in model architecture.
The research team speculates that DoLa weakens the layers responsible for creativity. DoLa enhances factuality by subtracting early-layer predictions from later-layer predictions; if early layers carry more exploratory, divergent features, this contrastive operation may erase exactly the information creative generation needs.
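The contrastive step can be sketched in a few lines. This toy version omits DoLa's adaptive premature-layer selection and plausibility filtering, and the logits are made-up numbers for illustration:

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over a logit vector.
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def dola_scores(final_logits, early_logits):
    """Contrast the mature (final-layer) distribution against a
    premature (early-layer) one: tokens whose probability grows
    with depth are boosted, while features already present in the
    early layer are damped."""
    return log_softmax(final_logits) - log_softmax(early_logits)

# Toy vocabulary of 4 tokens.
early = np.array([2.0, 1.0, 0.5, 0.5])  # early layer favours token 0
final = np.array([2.0, 2.5, 0.5, 0.5])  # final layer shifts toward token 1
print(int(np.argmax(dola_scores(final, early))))  # -> 1
```

The sketch makes the article's concern concrete: whatever the early layer contributes, exploratory or not, is subtracted out of the final distribution.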
Further experiments showed that by enhancing layers positively correlated with creativity and suppressing negatively correlated layers during decoding, divergent creativity can be improved without compromising convergent creativity. This suggests that the two types of creativity might be separable, and divergent creative potential could be targeted for enhancement in the future.
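The layer-reweighting idea can be illustrated as follows. The weights and logits here are invented for the example and are not values from the paper:

```python
import numpy as np

def steered_logits(layer_logits, weights):
    """Combine the final-layer logits with signed per-layer
    contributions: positive weights amplify layers correlated with
    creative output, negative weights suppress anti-creative ones.
    `weights` maps layer index -> signed strength (a hypothetical
    tuning knob for illustration)."""
    out = layer_logits[-1].copy()  # start from the standard prediction
    for idx, w in weights.items():
        out += w * layer_logits[idx]
    return out

# Toy stack of 3 layers over a 4-token vocabulary.
stack = np.array([
    [1.0, 0.0, 0.0, 0.0],  # layer 0: exploratory, favours token 0
    [0.0, 1.0, 0.0, 0.0],  # layer 1: favours token 1
    [0.0, 0.0, 2.0, 0.0],  # final layer: favours token 2
])
boosted = steered_logits(stack, {0: +3.0, 1: -1.0})
print(int(np.argmax(boosted)))  # boosting layer 0 flips the choice -> 0
```

The point of the sketch is the sign structure: unlike DoLa's uniform subtraction, layers can be individually amplified or suppressed, which is what would allow divergent creativity to be targeted without touching convergent creativity.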
Figure: Improving Divergent Creativity by Enhancing Layers Positively Correlated with Creativity and Suppressing Negatively Correlated Layers
Limitations
Of course, this research also has certain limitations.
First, creativity evaluation has limitations. Experiments can only indirectly measure scientific hypothesis generation ability and are not equivalent to creativity performance in real-world scientific research, so the extrapolation of results is limited.
Second, the CoVe mechanism is insufficiently explained. Although CoVe can enhance divergent creativity, no ablation experiments were conducted, and the specific mechanism was not revealed, so the causal path of its effect cannot be determined.
Third, the RAG conclusion is not robust enough. The explanation for RAG's weak impact on creativity lacks measurement of retrieval quality and different retrieval strategies, so the conclusion requires more systematic verification.
As LLMs become smarter, releasing their potential in scientific discovery is becoming increasingly important. Looking ahead, researchers hope that LLMs will not just be passive tools, but become active collaborators in scientific work.