Building GraphRAG with Neo4j + LangChain

Automatically convert natural language questions to Cypher queries and generate accurate answers using relationship data from your graph database.

TL;DR

  • Neo4j: Relationship-centric graph database
  • LangChain Neo4jGraph: Connect to Neo4j and auto-extract schema in Python
  • GraphCypherQAChain: Automatic natural language → Cypher query conversion
  • Hybrid Search: Combine Vector Index + Graph Traversal

1. Why Neo4j + LangChain?

Limitations of Traditional RAG

Typical Vector RAG:

text
Question → Embedding → Similar chunk retrieval → LLM answer

Problems:

  • Can't handle multi-hop questions like "What projects does A's manager lead?"
  • Loses entity relationship information
  • Context fragmentation during chunk splitting

Neo4j + LangChain Solution

text
Question → LLM (Cypher generation) → Neo4j query → Precise results → LLM answer

Benefits:

  • Accurate relationship-based traversal
  • Natural multi-hop query handling (see the query sketch below)
  • Schema-based structured answers
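
For instance, "What projects does Mike's manager lead?" becomes a single two-hop pattern match instead of several retrieval rounds. A minimal sketch, assuming the sample organization data and the Neo4jGraph connection created in Section 3:

python
# Two-hop question: "What projects does Mike's manager lead?"
# Illustrative only - assumes the sample data and graph connection from Section 3
result = graph.query("""
MATCH (p:Person {name: 'Mike Chen'})-[:REPORTS_TO]->(m:Person)-[:LEADS]->(proj:Project)
RETURN m.name AS manager, proj.name AS project
""")
print(result)
# → [{'manager': 'David Kim', 'project': 'Data Pipeline'}]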

2. Environment Setup

Install Neo4j

bash
# Run Neo4j with Docker
docker run -d \
  --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password123 \
  -e NEO4J_PLUGINS='["apoc", "graph-data-science"]' \
  neo4j:5.15.0

Install Python Packages

bash
pip install langchain langchain-openai langchain-community neo4j

3. Connecting to Neo4j and Building Data

Basic Connection

python
from langchain_community.graphs import Neo4jGraph

# Connect to Neo4j
graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="password123"
)

# Check schema
print(graph.schema)

Create Sample Data

python
# Create company organization data
setup_query = """
// Create teams
CREATE (ai:Team {name: 'AI Team', budget: 500000})
CREATE (data:Team {name: 'Data Team', budget: 300000})
CREATE (backend:Team {name: 'Backend Team', budget: 400000})

// Create employees
CREATE (john:Person {name: 'John Smith', role: 'Senior Developer', salary: 120000})
CREATE (sarah:Person {name: 'Sarah Johnson', role: 'Team Lead', salary: 150000})
CREATE (mike:Person {name: 'Mike Chen', role: 'Data Scientist', salary: 130000})
CREATE (david:Person {name: 'David Kim', role: 'Team Lead', salary: 145000})
CREATE (emily:Person {name: 'Emily Brown', role: 'Developer', salary: 95000})

// Create projects
CREATE (rec:Project {name: 'Recommendation System', status: 'active', deadline: '2024-06-01'})
CREATE (pipe:Project {name: 'Data Pipeline', status: 'active', deadline: '2024-04-15'})
CREATE (web:Project {name: 'Web Platform', status: 'completed', deadline: '2024-01-30'})

// Technologies
CREATE (python:Technology {name: 'Python'})
CREATE (pytorch:Technology {name: 'PyTorch'})
CREATE (fastapi:Technology {name: 'FastAPI'})
CREATE (kafka:Technology {name: 'Kafka'})
CREATE (react:Technology {name: 'React'})

// Create relationships
CREATE (john)-[:BELONGS_TO]->(ai)
CREATE (sarah)-[:BELONGS_TO]->(ai)
CREATE (sarah)-[:MANAGES]->(ai)
CREATE (mike)-[:BELONGS_TO]->(data)
CREATE (david)-[:BELONGS_TO]->(data)
CREATE (david)-[:MANAGES]->(data)
CREATE (emily)-[:BELONGS_TO]->(backend)

CREATE (john)-[:REPORTS_TO]->(sarah)
CREATE (mike)-[:REPORTS_TO]->(david)

CREATE (john)-[:WORKS_ON]->(rec)
CREATE (mike)-[:WORKS_ON]->(rec)
CREATE (mike)-[:WORKS_ON]->(pipe)
CREATE (david)-[:WORKS_ON]->(pipe)
CREATE (emily)-[:WORKS_ON]->(web)

CREATE (john)-[:LEADS]->(rec)
CREATE (david)-[:LEADS]->(pipe)

CREATE (rec)-[:USES]->(python)
CREATE (rec)-[:USES]->(pytorch)
CREATE (rec)-[:USES]->(fastapi)
CREATE (pipe)-[:USES]->(python)
CREATE (pipe)-[:USES]->(kafka)
CREATE (web)-[:USES]->(react)
CREATE (web)-[:USES]->(fastapi)
"""

graph.query(setup_query)
print("Data created successfully!")

# Refresh schema
graph.refresh_schema()
print(graph.schema)

4. Building GraphCypherQAChain

Basic Chain Setup

python
from langchain_openai import ChatOpenAI
from langchain.chains import GraphCypherQAChain

# Configure LLM
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Create GraphCypherQAChain
chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,  # See generated Cypher queries
    return_intermediate_steps=True
)
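
# Note: recent LangChain versions may also require allow_dangerous_requests=True
# here, as an explicit opt-in to executing LLM-generated Cypher against the database.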

Test Natural Language Questions

python
# Question 1: Simple query
response = chain.invoke({"query": "Who works on the Recommendation System project?"})
print(response["result"])
# → John Smith and Mike Chen work on the Recommendation System project.

# Question 2: Multi-hop query
response = chain.invoke({"query": "What technologies are used in projects that John works on?"})
print(response["result"])
# → Python, PyTorch, and FastAPI

# Question 3: Aggregation query
response = chain.invoke({"query": "How many people are in each team?"})
print(response["result"])
# → AI Team: 2, Data Team: 2, Backend Team: 1

# Check generated Cypher query
print(response["intermediate_steps"][0]["query"])
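
# For the team-count question above, the generated Cypher might look roughly like
# (LLM output varies):
# MATCH (p:Person)-[:BELONGS_TO]->(t:Team) RETURN t.name, count(p) AS members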

5. Improving Accuracy with Custom Prompts

Customize Cypher Generation Prompt

python
from langchain.prompts import PromptTemplate

CYPHER_GENERATION_TEMPLATE = """Task: Generate a Cypher query to answer the question.

Schema:
{schema}

Instructions:
- Use only node labels and relationship types from the schema
- For names, use case-insensitive matching with toLower()
- Return meaningful property values, not just node references
- Use OPTIONAL MATCH for relationships that might not exist

Examples:
Question: Who is John's manager?
Cypher: MATCH (p:Person)-[:REPORTS_TO]->(manager:Person) WHERE toLower(p.name) CONTAINS 'john' RETURN manager.name

Question: What projects use Python?
Cypher: MATCH (p:Project)-[:USES]->(t:Technology {{name: 'Python'}}) RETURN p.name

Question: {question}
Cypher:"""

cypher_prompt = PromptTemplate(
    template=CYPHER_GENERATION_TEMPLATE,
    input_variables=["schema", "question"]
)

chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    cypher_prompt=cypher_prompt,
    verbose=True
)

Customize Answer Generation Prompt

python
ANSWER_TEMPLATE = """Based on the query results, provide a natural and complete answer.

Question: {question}
Query Results: {context}

Instructions:
- Answer in a conversational tone
- If results are empty, say "I couldn't find that information"
- Include relevant details from the results
- Be concise but complete

Answer:"""

answer_prompt = PromptTemplate(
    template=ANSWER_TEMPLATE,
    input_variables=["question", "context"]
)

chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    cypher_prompt=cypher_prompt,
    qa_prompt=answer_prompt,
    verbose=True
)
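
A quick check of the fully customized chain (illustrative; the exact Cypher and wording depend on the LLM):

python
# Lowercase name exercises the case-insensitive matching instruction
response = chain.invoke({"query": "who is john's manager?"})
print(response["result"])
# → Expect something like: "John Smith reports to Sarah Johnson."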

6. Vector + Graph Hybrid Search

Set Up Neo4j Vector Index

python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Neo4jVector

# Add document data (project descriptions, etc.)
documents = [
    "The Recommendation System project uses collaborative filtering and deep learning to suggest products.",
    "Data Pipeline handles real-time data ingestion from multiple sources using Kafka.",
    "The Web Platform provides a React-based dashboard for analytics and reporting.",
]

# Create Vector Index
vector_store = Neo4jVector.from_texts(
    texts=documents,
    embedding=OpenAIEmbeddings(),
    url="bolt://localhost:7687",
    username="neo4j",
    password="password123",
    index_name="project_docs",
    node_label="Document"
)
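
A quick retrieval check against the new index (a minimal sketch; ranking depends on the embedding model):

python
# Fetch the most similar document for a test query
docs = vector_store.similarity_search("How are products recommended?", k=1)
print(docs[0].page_content)
# → Expect the Recommendation System description to rank first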

Implement Hybrid Search

python
class HybridNeo4jRAG:
    def __init__(self, graph, vector_store, llm):
        self.graph = graph
        self.vector_store = vector_store
        self.llm = llm
        self.cypher_chain = GraphCypherQAChain.from_llm(
            llm=llm, graph=graph, verbose=False
        )

    def search(self, question: str) -> dict:
        # 1. Structured info: Graph query
        try:
            graph_result = self.cypher_chain.invoke({"query": question})
            graph_context = graph_result.get("result", "")
        except Exception as e:
            graph_context = ""

        # 2. Unstructured info: Vector search
        vector_results = self.vector_store.similarity_search(question, k=3)
        vector_context = "\n".join([doc.page_content for doc in vector_results])

        # 3. Combine contexts
        combined_context = f"""
## Structured Data (from Knowledge Graph)
{graph_context}

## Related Documents
{vector_context}
"""

        # 4. Generate final answer
        final_prompt = f"""Answer the question based on the following context.

Context:
{combined_context}

Question: {question}

Provide a comprehensive answer combining both structured and unstructured information."""

        response = self.llm.invoke(final_prompt)

        return {
            "answer": response.content,
            "graph_context": graph_context,
            "vector_context": vector_context
        }

# Usage
hybrid_rag = HybridNeo4jRAG(graph, vector_store, llm)
result = hybrid_rag.search("Tell me about the Recommendation System project and who works on it")
print(result["answer"])

7. Production Tips

Error Handling

python
from langchain.chains import GraphCypherQAChain

def safe_query(chain, question: str) -> str:
    try:
        result = chain.invoke({"query": question})
        return result["result"]
    except Exception as e:
        if "syntax error" in str(e).lower():
            return "I couldn't understand that query. Could you rephrase?"
        elif "connection" in str(e).lower():
            return "Database connection issue. Please try again."
        else:
            return f"An error occurred: {str(e)}"

Query Validation

python
def validate_cypher(graph, cypher: str) -> bool:
    """Validate query syntax with EXPLAIN (doesn't execute)"""
    try:
        graph.query(f"EXPLAIN {cypher}")
        return True
    except Exception:
        return False
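
For example, a well-formed query passes while a malformed one is rejected (illustrative queries only):

python
print(validate_cypher(graph, "MATCH (p:Person) RETURN p.name"))  # True
print(validate_cypher(graph, "MATCH p:Person RETURN"))           # False - syntax error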

Caching Strategy

python
import hashlib

class CachedGraphRAG:
    def __init__(self, chain):
        self.chain = chain
        self.cache = {}

    def query(self, question: str) -> str:
        # Normalize and hash question
        normalized = question.lower().strip()
        cache_key = hashlib.md5(normalized.encode()).hexdigest()

        if cache_key in self.cache:
            return self.cache[cache_key]

        result = self.chain.invoke({"query": question})
        self.cache[cache_key] = result["result"]

        return result["result"]
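
Usage is the same as calling the chain directly; a repeated question (after normalization) skips both the LLM call and the database round trip. A minimal sketch using the chain from Section 4:

python
cached_rag = CachedGraphRAG(chain)
print(cached_rag.query("How many people are in each team?"))     # runs the chain
print(cached_rag.query("  how many people are in each team? "))  # served from the cache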

8. Performance Optimization

Create Indexes

python
# Add indexes for frequently searched properties
graph.query("CREATE INDEX person_name IF NOT EXISTS FOR (p:Person) ON (p.name)")
graph.query("CREATE INDEX project_name IF NOT EXISTS FOR (p:Project) ON (p.name)")
graph.query("CREATE INDEX team_name IF NOT EXISTS FOR (t:Team) ON (t.name)")

Limit Query Results

python
# Limit results when creating chain
chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    top_k=10,  # Return max 10 results
    verbose=True
)

Conclusion

The Neo4j + LangChain combination is a powerful solution for overcoming traditional Vector RAG limitations.

Situation                    | Recommended Approach
Relationship-based questions | GraphCypherQAChain
Semantic search              | Vector Search
Complex questions            | Hybrid (use both)

Getting started:

  1. Run Neo4j with Docker
  2. Model your domain data (nodes, relationships)
  3. Implement natural language queries with GraphCypherQAChain
  4. Add Vector Index as needed
