Paper Brief | Automated Knowledge Graph Enrichment Using Multi-Agent Large Language Models (NeurIPS 2025)

Knowledge graphs are crucial for structured knowledge and reasoning across numerous domains, but their manual construction and updating have become difficult to scale given the explosive growth of scientific literature. Traditional natural language processing methods are limited in understanding domain terminology and complex semantic relationships. While existing large language models possess powerful text comprehension abilities, they often face issues like hallucination, schema inconsistencies, and high computational costs when building knowledge graphs. To address this, this paper proposes KARMA (Knowledge-graph Augmentation with Reasoning Multi-Agent systems), an automated knowledge graph enrichment framework based on multi-agent LLMs, designed to achieve efficient, accurate, and scalable knowledge extraction and integration through a collaborative, modular agent system.

This paper's main contributions include:

Proposing the KARMA multi-agent framework: A multi-agent LLM system is systematically applied to knowledge graph enrichment for the first time. Nine specialized agents collaborate on document parsing, entity discovery, relationship extraction, schema alignment, and conflict resolution, which significantly improves the accuracy and consistency of knowledge extraction.
Designing a cross-agent verification and iterative optimization mechanism: Through cross-verification among agents (like mutual checks between relationship extraction and schema alignment) and a debate-based conflict resolution strategy, the hallucination problem of LLMs is effectively mitigated, enhancing the credibility of extraction results.
Achieving domain adaptation and modular expansion: It supports dynamically adjusting prompt strategies to adapt to different scientific fields. The modular design facilitates the integration of new entity types, relationships, or updated LLM models, offering good scalability and adaptability.
Experimental validation and open-source implementation: Systematic experiments on 1,200 PubMed articles across three biomedical domains (genomics, proteomics, metabolomics) show that KARMA can identify up to 38,230 new entities, with an LLM-verified accuracy rate of 83.1% and an 18.6% reduction in conflicting edges, significantly outperforming single-agent baseline models.

3.1 Overall Problem Definition

Given an existing knowledge graph G and a set of unstructured scientific literature D, the goal is to automatically extract a set of new knowledge triples T_new from each document d_i ∈ D and merge them into G, generating an enriched knowledge graph G'.
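
The problem statement can be read as a set union over per-document extractions. The following is a minimal sketch under that reading, assuming triples are plain (head, relation, tail) tuples; the names Triple and enrich, and the extract callable standing in for the whole agent pipeline, are illustrative and not taken from the paper.

```python
from typing import Callable, Iterable, Set, Tuple

# Assumed representation: a triple is (head, relation, tail).
Triple = Tuple[str, str, str]


def enrich(graph: Set[Triple],
           documents: Iterable[str],
           extract: Callable[[str], Set[Triple]]) -> Set[Triple]:
    """Return G' = G united with the triples extracted from every document.

    `extract` is a placeholder for the full extraction pipeline.
    """
    enriched = set(graph)  # copy, so the original graph G is left untouched
    for doc in documents:
        enriched |= extract(doc)
    return enriched
```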

3.2 Agent Core Method and Workflow

KARMA accomplishes this task through a pipeline collaboration of nine agents, each optimized for a specific subtask, as shown in Figure 1.

Figure 1 KARMA Overall Workflow Diagram

1. Document Pre-processing Agents

Ingestion Agent (IA)

Input: Raw documents (PDF/HTML).
Core Operation: Uses an LLM (like GLM-4) to parse document structure, performs normalize(p_i) to handle OCR errors and format inconsistencies, and extracts metadata(p_i) (title, authors, journal, date).
Output: Standardized text and metadata for downstream agents.

Reading Agent (RA)

Core Operation: Splits the document into logical paragraphs P = {p_1, p_2, ...}. Calculates a relevance score for each paragraph p_j: S_rel(p_j) = LLM(p_j, G), where the LLM scores based on the paragraph's content relevance to existing entities in the knowledge graph G.
Filtering: Discards paragraphs with S_rel(p_j) < θ_τ (where θ_τ is a domain-specific threshold) to reduce noise.
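
The filtering step can be sketched as follows. This is only an illustration: in KARMA the relevance score comes from an LLM, whereas toy_scorer below is a hypothetical keyword-overlap stand-in, and the function names are assumptions.

```python
from typing import Callable, List


def filter_paragraphs(paragraphs: List[str],
                      score_relevance: Callable[[str], float],
                      threshold: float) -> List[str]:
    """Discard paragraphs whose relevance score is below the threshold."""
    return [p for p in paragraphs if score_relevance(p) >= threshold]


def toy_scorer(known_entities: List[str]) -> Callable[[str], float]:
    """Hypothetical stand-in for the LLM scorer.

    Scores a paragraph by the fraction of known graph entities it mentions.
    """
    def score(paragraph: str) -> float:
        text = paragraph.lower()
        hits = sum(1 for e in known_entities if e.lower() in text)
        return hits / len(known_entities) if known_entities else 0.0
    return score


paragraphs = ["BRCA1 interacts with TP53.", "We thank the reviewers."]
kept = filter_paragraphs(paragraphs, toy_scorer(["BRCA1", "TP53"]), threshold=0.5)
```

Passing the scorer in as a callable keeps the threshold logic independent of whichever LLM backend produces S_rel.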

Summarization Agent (SA)

Purpose: Reduces computational overhead and provides high signal-to-noise ratio input for downstream extraction tasks.
Core Operation: For each retained paragraph p_j, generates a condensed summary s_j = LLM_summarize(p_j). The prompt P_SA requires the LLM to retain key entities, relationships, and domain terminology.

2. Knowledge Extraction Agents

Entity Extraction Agent (EEA)

Core Operation: Performs LLM-driven Named Entity Recognition (NER) on summaries S = {s_1, s_2, ...} to generate a candidate entity set E_cand = Φ_NER(S). Φ_NER represents filtering and normalization using domain ontologies (like UMLS, MeSH), mapping surface forms (e.g., "acetylsalicylic acid") to canonical entities (e.g., "Aspirin").
Entity Linking: For each new entity e ∈ E_cand, links it to existing nodes v ∈ G by minimizing the embedding space distance: v* = argmin_{v∈G} dist(emb(e), emb(v)). Unmatched entities become candidate new nodes E_new.
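
A minimal sketch of the linking rule, assuming Euclidean distance over pre-computed embeddings. The brief does not say how an entity is judged "unmatched", so the max_dist cutoff here is an assumption, as are the function names.

```python
import math
from typing import Dict, Optional, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def link_entity(entity_vec: Sequence[float],
                node_vecs: Dict[str, Sequence[float]],
                max_dist: float) -> Optional[str]:
    """Return the nearest graph node, or None if the entity looks new.

    v* = argmin over nodes of dist(emb(e), emb(v)); the max_dist cutoff
    is an assumed criterion for declaring the entity unmatched.
    """
    if not node_vecs:
        return None
    best = min(node_vecs, key=lambda v: euclidean(entity_vec, node_vecs[v]))
    if euclidean(entity_vec, node_vecs[best]) <= max_dist:
        return best
    return None  # caller adds the entity to E_new
```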

Relation Extraction Agent (REA)

Core Operation: For each entity pair (e_h, e_t) within a summary s_j, uses an LLM classifier to predict the probability distribution of their relationship r: P(r | e_h, e_t).
Triple Generation: Selects relationships with a probability exceeding a threshold δ, forming a candidate triple set T_cand. Supports multi-label prediction, meaning an entity pair may have multiple relationships.
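
The thresholding step might look like the sketch below. It assumes the classifier returns one independent probability per relation label (so several labels can pass at once); the function name and the example labels are illustrative.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def select_triples(head: str,
                   tail: str,
                   relation_probs: Dict[str, float],
                   delta: float) -> List[Triple]:
    """Keep every relation whose probability exceeds delta (multi-label)."""
    return [(head, rel, tail)
            for rel, prob in relation_probs.items()
            if prob > delta]


# Illustrative probabilities, not values from the paper.
probs = {"inhibits": 0.80, "binds": 0.65, "causes": 0.10}
candidates = select_triples("drug_x", "protein_y", probs, delta=0.6)
```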

3. Knowledge Fusion Agents

Schema Alignment Agent (SAA)

Task: Aligns new entity/relationship types identified by the EEA and REA with the existing schema of the knowledge graph.
Core Operation: For a new entity e_i, SAA classifies it into a predefined type C_{SAA} ∈ T (like Drug, Disease): C_{SAA} = argmax_{t∈T} P(t | e_i). Similarly, it finds the closest existing relationship for new relationship types.
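
The type assignment reduces to an argmax over the schema's type set. A small sketch, assuming the type probabilities P(t | e_i) have already been obtained from the LLM as a dictionary; the name align_type and the example values are hypothetical.

```python
from typing import Dict


def align_type(type_probs: Dict[str, float]) -> str:
    """Return the schema type with the highest probability P(t | e_i)."""
    if not type_probs:
        raise ValueError("type_probs must contain at least one schema type")
    return max(type_probs, key=lambda t: type_probs[t])


# Illustrative probabilities for a single entity.
assigned = align_type({"Drug": 0.7, "Disease": 0.2, "Gene": 0.1})
```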

Conflict Resolution Agent (CRA)

Task: Detects logical conflicts between candidate triples T_cand and existing triples T_kg in the knowledge graph.
Core Operation: Defines a conflict function Φ_conflict. When a contradiction is detected, initiates an LLM-based debate mechanism: D = LLM_Debate(T_cand, T_kg). If the result is Contradict, the triple is discarded or submitted for expert review.
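
One way to picture this step is sketched below. The brief does not define the conflict function, so the table of mutually exclusive relations is a hypothetical stand-in, and the debate is passed in as a callable that returns a verdict string; only the "Contradict" label comes from the text above.

```python
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical conflict rule: relation pairs that cannot both hold
# between the same head and tail.
EXCLUSIVE = {frozenset({"activates", "inhibits"})}


def conflicts(candidate: Triple, existing: Triple) -> bool:
    """True if both triples share head and tail but use exclusive relations."""
    same_pair = candidate[0] == existing[0] and candidate[2] == existing[2]
    return same_pair and frozenset({candidate[1], existing[1]}) in EXCLUSIVE


def resolve(candidates: Iterable[Triple],
            graph: Set[Triple],
            debate: Callable[[Triple, List[Triple]], str]
            ) -> Tuple[List[Triple], List[Triple]]:
    """Split candidates into (accepted, flagged for discard or expert review)."""
    accepted: List[Triple] = []
    flagged: List[Triple] = []
    for cand in candidates:
        clashing = [t for t in graph if conflicts(cand, t)]
        # The debate is only triggered when a contradiction is detected.
        if clashing and debate(cand, clashing) == "Contradict":
            flagged.append(cand)
        else:
            accepted.append(cand)
    return accepted, flagged
```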

Evaluation Agent (EA)

Task: Calculates a global quality score for each candidate triple t_i ∈ T_cand that has passed conflict detection, deciding whether to finally integrate it.
Core Operation: Aggregates validation signals from multiple agents and calculates scores in three dimensions: confidence, clarity, and relevance. A weighted sum using a sigmoid function determines the final score: Score(t_i) = σ(α * Confidence + β * Clarity + γ * Relevance).
Integration Decision: A triple is integrated if its final score exceeds a threshold θ_q: T_integrated = {t_i | Score(t_i) > θ_q}.
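
The scoring formula can be written out directly. In this sketch the default weights of 1.0 and the function names are assumptions; the brief does not report the values of α, β, γ, or θ_q.

```python
import math
from typing import Iterable, List, Tuple


def sigmoid(x: float) -> float:
    """Logistic function σ(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def quality_score(confidence: float, clarity: float, relevance: float,
                  alpha: float = 1.0, beta: float = 1.0,
                  gamma: float = 1.0) -> float:
    """Score = σ(α·Confidence + β·Clarity + γ·Relevance); weights are assumed."""
    return sigmoid(alpha * confidence + beta * clarity + gamma * relevance)


def integrate(scored: Iterable[Tuple[str, Tuple[float, float, float]]],
              theta_q: float) -> List[str]:
    """Keep the triples whose score exceeds theta_q."""
    return [triple for triple, signals in scored
            if quality_score(*signals) > theta_q]
```

Note that if the three signals lie in [0, 1] and the weights are non-negative, the sigmoid output never falls below 0.5, so θ_q has to sit above 0.5 for the threshold to filter anything.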

4.1 Experimental Setup

1. Dataset: Collected 1,200 scientific articles from PubMed across three domains: genomics (720 articles, focusing on gene variants and regulatory elements), proteomics (360 articles, focusing on protein structure and interaction networks), and metabolomics (120 articles, focusing on metabolic pathways and metabolite analysis).

2. Baseline Models: A single-agent baseline that uses one LLM to extract all triples directly, and a multi-model comparison that runs KARMA on each of GLM-4, GPT-4o, and DeepSeek-v3.

3. Evaluation Metrics

Metric Category	Specific Metric	Description
Core Metrics	Average Confidence	Average confidence score of newly added triples
Core Metrics	Average Clarity	Clarity of relationship expression
Core Metrics	Average Relevance	Relevance to the domain topic
Graph Statistics	Coverage Gain	Number of new entities added
Graph Statistics	Connectivity Gain	Net growth in node degrees
Quality Metrics	Conflict Ratio	Proportion of edges removed due to contradiction
Quality Metrics	LLM-based Correctness	Correctness rate verified by an independent LLM
Quality Metrics	QA Coherence	Accuracy of knowledge graph-based question answering
Quality Metrics	Human Evaluation Score	Quality score from expert assessment

4.2 Main Results

1. Overall Performance Comparison: In the genomics domain, KARMA (DeepSeek-v3) extracted 58,412 candidate triples from 720 articles. After conflict resolution and quality assessment, it finally integrated 42,187 high-quality triples. Among these, 38,230 triples contained at least one new entity, significantly expanding the graph's coverage. The performance of various models across different domains is shown in Table 1.
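
The genomics figures quoted above imply a simple extraction funnel. The short calculation below only derives ratios from those three reported counts; the ratios themselves are not stated in the brief, and the variable names are illustrative.

```python
# Counts reported for KARMA (DeepSeek-v3) on the genomics subset.
candidates = 58_412       # candidate triples extracted
integrated = 42_187       # triples kept after conflict resolution and scoring
with_new_entity = 38_230  # integrated triples containing at least one new entity

filtered_out = candidates - integrated
retention_pct = round(100 * integrated / candidates, 1)
new_entity_pct = round(100 * with_new_entity / integrated, 1)

print(f"filtered out: {filtered_out}")                         # 16225
print(f"retained: {retention_pct}% of candidates")             # 72.2%
print(f"with a new entity: {new_entity_pct}% of integrated")   # 90.6%
```

Roughly 28% of candidate triples are removed before integration, and about nine in ten of the integrated triples introduce at least one entity that was not already in the graph.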

Table 1 KARMA Cross-Domain Performance Comparison

Key findings are as follows:

KARMA significantly outperforms the single-agent baseline on all metrics.
DeepSeek-v3 performed best in coverage gain (38,230 new entities in genomics).
GPT-4o led in LLM-based correctness (88.0% in genomics).
Multi-agent collaboration reduced conflicting edges by 18.6%.

2. Domain-Specific Analysis

Genomics: The largest dataset, where DeepSeek-v3 showed the best balance of recall and precision.
Proteomics: Performance across models was relatively balanced, with GLM-4 leading in QA coherence.
Metabolomics: Data is sparse, but KARMA could still effectively mine metabolic pathway relationships.

4.3 Ablation Study

To validate the contribution of each agent, a systematic ablation study was conducted:

Table 2 Ablation Study Results

Key conclusions:

Removing the Summarization Agent: noise increased, and accuracy dropped by 22.9%.
Removing Conflict Resolution: logical consistency decreased, and QA coherence dropped by 4.9%.
Removing the Evaluation Agent: low-quality edges were integrated, and overall quality decreased significantly.

KARMA is an innovative multi-agent LLM framework that achieves efficient, accurate, and interpretable knowledge graph enrichment for scientific literature by decomposing the knowledge extraction task into multiple specialized agents. The framework not only improves extraction quality and consistency but also features good domain adaptability and scalability. Experiments demonstrate its superiority over existing methods in various biomedical domains, providing a powerful tool for automated knowledge graph construction and updating. Future work can explore cross-domain generalization, real-time updating mechanisms, and more diverse forms of knowledge representation.

Notes compiled by: Wang Yan, Master's student at Southeast University, research direction: Natural Language Processing.
Paper link: KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment
Published conference: NeurIPS 2025