How We Built a GraphRAG Document Assistant for an Enterprise HR Firm — and Run It for Pennies

June 19, 2026

Summary

  • Plain RAG fails on HR data because entities collide — same first names, near-identical policies, pronouns. We built an entity-aware retrieval layer that routes to the right documents before searching.
  • Cost is engineered at every layer: right-sized model routing, near-free embeddings with hash-based dedup, batch processing, ~90% savings from context caching, and tight capped context.
  • Single-tenant by design for sensitive HR data, with an automated LLM-judge regression suite that proves accuracy holds before any change ships.

When a UK-based enterprise HR advisory firm came to us, their consultants were drowning in documents. Handbooks, contracts, disciplinary records, grievance files, NDAs, and correspondence — spread across a large book of corporate clients. Answering a simple question like "What's the notice period in this employee's contract?" meant manually hunting through thousands of files.

They wanted a private, secure AI assistant that could answer those questions instantly and accurately — without sending sensitive HR data to a public chatbot, and without an unpredictable five-figure monthly AI bill.

We deployed a dedicated, single-tenant GraphRAG system on their own isolated infrastructure. Here's how it works, and how we engineered it to run for a fraction of what you'd expect.

Why "GraphRAG," not just RAG

Naive retrieval-augmented generation — embed every document, find the closest text chunks, stuff them into a prompt — falls apart on HR data for one specific reason: entities collide. Two employees can share a first name. Two companies can have nearly identical policies. A question about "his salary" depends entirely on who was being discussed two messages ago.

Plain vector similarity has no concept of who or which company — it just matches words, and it will confidently return the wrong person's contract.

So we built an entity-aware retrieval layer: a graph of the relationships between documents, the people and companies they reference, and the live conversation context.

  1. Ingestion & entity extraction. Every document is chunked into overlapping windows (overlap preserves context across boundaries) and passed through a lightweight extraction pass that tags each chunk with structured metadata — employee name, company, document type, date range. This metadata is the graph: it lets us reason about documents by entity, not just by text.

  2. Metadata-routed retrieval. When a question comes in, we first resolve which entities it's about — including pronouns ("his", "her", "that company") by tracking an explicit conversation entity-state across turns. We narrow to the relevant documents before running the vector search, which eliminates the cross-company and same-name confusion that breaks plain RAG.

  3. Hybrid search + reranking. Within the selected documents we run semantic (vector) search and keyword search, then rerank with a blended score. Short, factual data — a pay rate, a date — no longer gets buried under verbose handbook prose.

  4. Bounded, grounded generation. Only the top-ranked chunks reach the language model, with a hard cap on context size. The answer is generated strictly from the retrieved evidence, with the source documents cited.
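
Steps 1 and 2 above can be illustrated with a minimal Python sketch. It is not the production code: the names Chunk, EntityState, overlapping_windows, resolve_entities, and route are hypothetical, the name matching is a naive substring check, and the chunk metadata is assumed to have been extracted already.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """A text chunk tagged with (assumed, pre-extracted) entity metadata."""
    text: str
    employee: str
    company: str
    doc_type: str


@dataclass
class EntityState:
    """Entities most recently named in the conversation (hypothetical shape)."""
    employee: Optional[str] = None
    company: Optional[str] = None


def overlapping_windows(words: list, size: int, overlap: int) -> list:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return windows


def resolve_entities(question: str, employees: set, companies: set,
                     state: EntityState) -> EntityState:
    """Update the state when exactly one known name appears in the question.

    A question with only a pronoun ("his", "that company") names nobody,
    so the state from earlier turns is carried over unchanged.
    Matching here is a naive substring test, used only for illustration.
    """
    q = question.lower()
    people = [e for e in employees if e.lower() in q]
    firms = [c for c in companies if c.lower() in q]
    if len(people) == 1:
        state.employee = people[0]
    if len(firms) == 1:
        state.company = firms[0]
    return state


def route(chunks: list, state: EntityState) -> list:
    """Keep only chunks whose metadata matches the resolved entities."""
    return [
        c for c in chunks
        if (state.employee is None or c.employee == state.employee)
        and (state.company is None or c.company == state.company)
    ]
```

With two employees who share a first name, a follow-up such as "And what is his salary?" leaves the entity state untouched, so route() still narrows to the person named in the previous turn before any vector search would run.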

The result is a system that answers "What is this employee's notice period?" with the right clause from the right contract — and says "I don't have that" when the answer genuinely isn't in the corpus, instead of hallucinating.
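
The hybrid scoring, the hard cap, and the "I don't have that" behaviour can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the blend weight alpha, the min_score floor, the max_chunks cap, and the simple term-overlap keyword score stand in for whatever the real system uses.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of distinct query terms that also appear in the chunk."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def rerank(query: str, query_vec: list, candidates: list,
           alpha: float = 0.6, min_score: float = 0.3,
           max_chunks: int = 3) -> list:
    """Blend semantic and keyword scores, drop weak hits, cap the result.

    `candidates` is a list of (text, vector) pairs already narrowed
    to the right documents. Returns (score, text) pairs, best first.
    """
    scored = []
    for text, vec in candidates:
        score = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text)
        if score >= min_score:
            scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:max_chunks]  # hard cap on chunks per query


def build_context(ranked: list):
    """Return the evidence to send to the model, or None to abstain."""
    if not ranked:
        return None  # caller answers "I don't have that" without calling the model
    return "\n".join(text for _, text in ranked)
```

The keyword term is what lets a short factual chunk (a pay rate, a date) outrank a long handbook passage that is only loosely similar in embedding space, and an empty ranking is what triggers the abstention.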

The harder problem: keeping it cheap

Anyone can wire up an LLM and a vector database. The engineering that actually matters for a client is making the system economical to run indefinitely — because an AI assistant that costs more than the analyst it replaces is a science project, not a product.

Here's how we kept per-query cost low without sacrificing answer quality:

  • Right-sized model routing. The expensive, frontier-class model is reserved for the one thing it's uniquely good at: writing the final grounded answer. Every supporting step — classifying the question, extracting entities, routing to the right documents — runs on a small, fast model that costs a fraction as much. Most of the "thinking" in a RAG system is plumbing, and plumbing doesn't need a frontier model.

  • Embeddings are nearly free — and we don't pay twice. We use a compact embedding model (roughly $0.02 per million tokens, ~50x cheaper than chat models). Every document is content-hashed on ingestion; if a file resurfaces unchanged from the client's document store, we clone its existing vectors instead of re-embedding — so routine syncs cost nothing for unchanged content.

  • Batch, don't stream, the bulk work. Initial document embedding runs through the provider's asynchronous batch API at a significant discount over real-time calls, with database-level locking so that a crashed and restarted job cannot resubmit the same batch and bill it twice.

  • Context caching cuts repeat cost ~90%. For long-running conversations, the stable parts of the prompt are cached at the provider, so on each follow-up question those cached tokens are billed at a steep discount instead of full price.

  • Tight, capped context. Adaptive retrieval limits and a hard ceiling on chunks-per-query mean we send the model the least context that fully answers the question — which is both cheaper and, counterintuitively, more accurate (oversized prompts cause models to lose the "needle in the haystack").

  • Bounded conversation memory. A sliding-window context manager keeps only the most recent and most relevant turns, so token usage per question stays flat instead of growing without bound over a long session.

  • Self-hosted, consolidated infrastructure. Vectors live in PostgreSQL (pgvector) alongside the application data — no separate, metered vector-database bill — on lean, isolated single-tenant infrastructure dedicated to the client.
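
The hash-based dedup from the embeddings bullet is simple enough to sketch. This is an illustrative Python version under stated assumptions: EmbeddingStore and its embed callable are hypothetical names, SHA-256 is one reasonable hash choice rather than a confirmed one, and an in-memory dict stands in for the real vector table.

```python
import hashlib
from typing import Callable


class EmbeddingStore:
    """Embeds each distinct document body once, keyed by its content hash."""

    def __init__(self, embed: Callable[[str], list]):
        self._embed = embed      # paid call to the embedding model (assumed)
        self._by_hash = {}       # content digest -> stored vector
        self.embed_calls = 0     # counts how many paid calls were made

    def ingest(self, content: str) -> list:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        cached = self._by_hash.get(digest)
        if cached is not None:
            # Unchanged content: copy the existing vector, no API call.
            return list(cached)
        self.embed_calls += 1
        vector = self._embed(content)
        self._by_hash[digest] = vector
        return list(vector)
```

Because the key is the content rather than the file name or path, a file that is renamed or re-synced without changes still hits the cache, while any edit to the text produces a new digest and a fresh embedding.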

Layered together, these decisions take the recurring cost of a heavily used internal assistant from "nervously watching the dashboard" to a rounding error on the value it delivers.
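
One of those layers, bounded conversation memory, can be sketched as follows. This Python version is a simplification that keeps turns by recency only, within a token budget; the relevance-based selection described above is left out, and SlidingWindowMemory, max_turns, max_tokens, and the whitespace token count are all illustrative assumptions.

```python
from collections import deque


def approx_tokens(text: str) -> int:
    """Crude whitespace count standing in for a real tokenizer."""
    return len(text.split())


class SlidingWindowMemory:
    """Keeps the newest turns, bounded by both turn count and token budget."""

    def __init__(self, max_turns: int = 6, max_tokens: int = 200):
        self.max_tokens = max_tokens
        self._turns = deque(maxlen=max_turns)  # oldest turns fall off automatically

    def add(self, role: str, text: str) -> None:
        self._turns.append((role, text))

    def window(self) -> list:
        """Return the newest turns that fit the budget, oldest first."""
        kept = []
        used = 0
        for role, text in reversed(self._turns):
            cost = approx_tokens(text)
            if used + cost > self.max_tokens:
                break  # budget exhausted; everything older is dropped
            kept.append((role, text))
            used += cost
        kept.reverse()
        return kept
```

However long the session runs, the history sent with each question never exceeds max_tokens, which is what keeps per-question token usage flat.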

Security & isolation

Because this is sensitive HR and employee data, the deployment is single-tenant by design: the client's documents, vectors, and conversations live on infrastructure isolated from every other customer, with per-tenant data partitioning enforced at the database layer and access gated to their own domain. Nothing is commingled, and nothing trains a shared model.

How we keep it correct over time

AI systems regress silently. To catch that, the system ships with an automated response regression suite: a library of representative questions (named-person lookups, cross-document counts, analytical "how many..." queries), each checked against an expected answer by an LLM judge. Before any change to prompts, models, or retrieval logic goes live, we rerun the suite to verify that the system still answers the hard questions correctly.
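
The shape of such a suite can be sketched in Python. This is a skeleton under stated assumptions, not the shipped harness: Case, run_suite, and min_pass_rate are hypothetical names, and the judge is passed in as a plain function so that a real LLM call can replace the trivial stand-in shown here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    question: str
    expected: str


def run_suite(cases: list,
              answer_fn: Callable[[str], str],
              judge: Callable[[str, str, str], bool],
              min_pass_rate: float = 1.0):
    """Run every case through the assistant and the judge.

    `judge(question, expected, actual)` returns True when the actual
    answer is acceptable. Returns (ok_to_ship, failing_cases).
    """
    failures = [
        case for case in cases
        if not judge(case.question, case.expected, answer_fn(case.question))
    ]
    pass_rate = 1 - len(failures) / len(cases) if cases else 1.0
    return pass_rate >= min_pass_rate, failures


def contains_judge(question: str, expected: str, actual: str) -> bool:
    """Stand-in judge: passes if the expected text appears in the answer.

    A real judge would prompt an LLM with all three strings and parse
    a pass/fail verdict; that call is deliberately omitted here.
    """
    return expected.lower() in actual.lower()
```

Wired into a deployment pipeline, a False result from run_suite would block the change and the list of failing cases would show which questions regressed.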

The outcome

The client's consultants now get instant, sourced answers to document questions that previously required manual searching — on a private system, over their own data, that runs at a cost low enough to leave on for the whole team, every day.


Akyla builds private, cost-engineered AI systems on top of enterprise data. If your team is sitting on documents or data that should be answering questions for you, book a discovery call.
