KGGen: Extracting Knowledge Graphs from Plain Text with Language Models

Belinda Mo, Kyssen Yu, Joshua Kazdan, Proud Mpala, Lisa Yu, Charilaos I. Kanatsoulis, Sanmi Koyejo

Keywords: knowledge graph, KG, synthetic data, data generation

Abstract

Recent interest in building foundation models for knowledge graphs has highlighted a fundamental challenge: knowledge graph data is scarce. The best-known knowledge graphs are primarily human-labeled, created by pattern-matching, or extracted using early NLP techniques. While human-generated knowledge graphs are in short supply, automatically extracted ones are of questionable quality.

We present KGGen, a novel text-to-knowledge-graph generator that uses language models to extract high-quality graphs from plain text. Unlike other KG generators, KGGen applies an entity resolution step that clusters and de-duplicates related entities, significantly reducing the sparsity problem that plagues existing extractors.
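To make the sparsity point concrete, the following is a minimal sketch of how clustering related entities can shrink an extracted graph. It is not KGGen's implementation: the abstract says KGGen uses language models for extraction, whereas this toy swaps in a crude string normalization (lowercasing plus naive plural stripping) as a stand-in for entity resolution. The function names `normalize`, `cluster_entities`, and `merge_triples`, and the sample triples, are hypothetical.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Crude canonical key (an assumption, not KGGen's method):
    lowercase, collapse whitespace, drop a trailing plural 's'."""
    key = " ".join(name.lower().split())
    if len(key) > 3 and key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


def cluster_entities(triples):
    """Group surface forms that share a key and map each to one representative."""
    clusters = defaultdict(set)
    for subj, _, obj in triples:
        clusters[normalize(subj)].add(subj)
        clusters[normalize(obj)].add(obj)

    canonical = {}
    for members in clusters.values():
        # Representative: shortest surface form, ties broken alphabetically.
        rep = min(members, key=lambda m: (len(m), m))
        for member in members:
            canonical[member] = rep
    return canonical


def merge_triples(triples):
    """Rewrite triples onto representatives and drop resulting duplicates."""
    canonical = cluster_entities(triples)
    return sorted({(canonical[s], p, canonical[o]) for s, p, o in triples})


# Hypothetical extractor output with near-duplicate entities.
triples = [
    ("Language models", "extract", "knowledge graphs"),
    ("language model", "extract", "Knowledge Graph"),
    ("knowledge graph", "contains", "entities"),
]

merged = merge_triples(triples)
nodes_before = {e for s, _, o in triples for e in (s, o)}
nodes_after = {e for s, _, o in merged for e in (s, o)}
```

In this toy input, six distinct surface forms collapse into three nodes and the two "extract" triples merge into one, so the remaining nodes are better connected. A real resolver would also need to handle synonyms and paraphrases, which string normalization cannot catch.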

Along with KGGen, we release Measure of Information in Nodes and Edges (MINE), the first benchmark to test an extractor’s ability to produce a useful KG from plain text. We benchmark our new tool against leading existing generators such as Microsoft’s GraphRAG; we achieve comparable retrieval accuracy on the generated graphs and better information retention. Moreover, our graphs exhibit more concise and generalizable entities and relations.

Our code is open-sourced at https://github.com/stair-lab/kg-gen/.