Deep Dive: LightRAG Operations, Concurrency & Evaluation

A detailed analysis drawn from the LightRAG Overview report, covering the parts that determine whether the system runs reliably under real ingest and query workloads.

Parent report: LightRAG Overview
Topic: Operations
Concurrency knobs: MAX_PARALLEL_INSERT, MAX_ASYNC
Date: 2026-04-22

Overview

LightRAG is harder to operate than a vector-only RAG pipeline because indexing involves LLM extraction, graph merge, vector upsert, status tracking, and delete/rebuild. Without concurrency limits and without visibility into cache and traces, cost and latency climb quickly.

Figure: LightRAG WebUI screenshot showing server status or configuration. Operations is more than backend code: the WebUI, status, and config views help you observe the pipeline and workspace during ingest and query. Source: LightRAG API Server docs

Operating principle: raise concurrency according to the measured throughput of the LLM backend, not the number of CPU cores. Graph extraction uses long contexts and heavy prompts, so a local LLM or an API provider's rate limits will be the main bottleneck.

Concurrency map

The four concurrency layers in LightRAG, based on docs/LightRAG_concurrent_explain.md.

flowchart TD
    A["Documents queue"] --> B["Document-level semaphore<br/>max_parallel_insert"]
    B --> C1["Doc A chunks"]
    B --> C2["Doc B chunks"]
    C1 --> D["Chunk extraction semaphore<br/>llm_model_max_async per doc"]
    C2 --> D
    D --> E["Global LLM priority queue<br/>query > merge > extraction"]
    D --> F["Graph merge/rebuild<br/>llm_model_max_async * 2"]
    F --> E
    E --> G["LLM backend/API"]
    F --> H["Graph + vector upsert"]

Figure: LightRAG WebUI screenshot illustrating the graph and query interface in the server. While the queue and merge are active, the WebUI is where you check graph output and query behavior after each batch. Source: HKUDS/LightRAG README

Figure: LightRAG architecture diagram linking the indexing graph and retrieval. Concurrency has to protect both sides of the architecture: graph construction and graph-aware retrieval. Source: HKUDS/LightRAG README

Operational internals

1. Document-level concurrency keeps inserts from corrupting the graph

O.1

docs/LightRAG_concurrent_explain.md; lightrag/lightrag.py:1760-1900

Processing many documents at once can increase entity naming conflicts. LightRAG caps this with max_parallel_insert; the docs recommend 2-10, typically around llm_model_max_async / 3.

Semaphore limiting file processing

semaphore = asyncio.Semaphore(self.max_parallel_insert)

async def process_document(..., semaphore: asyncio.Semaphore) -> None:
    async with semaphore:
        # resolve file path, check cancellation,
        # chunk, extract, merge, update status
        ...
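
The excerpt above elides the pipeline body. As a minimal, self-contained sketch of the same pattern (not LightRAG's actual code), the following bounds how many documents are in flight at once. The `ingest` function, the `stats` dictionary, and the `asyncio.sleep` placeholder for extraction work are all hypothetical names introduced for illustration.

```python
import asyncio


async def process_document(doc_id: str, semaphore: asyncio.Semaphore, stats: dict) -> None:
    """Hypothetical stand-in for the per-document pipeline (chunk, extract, merge)."""
    async with semaphore:
        # Only max_parallel_insert coroutines can be inside this block at once.
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # placeholder for LLM extraction work
        stats["running"] -= 1
        stats["done"].append(doc_id)


async def ingest(doc_ids: list[str], max_parallel_insert: int) -> dict:
    """Run all documents, never more than max_parallel_insert concurrently."""
    semaphore = asyncio.Semaphore(max_parallel_insert)
    stats = {"running": 0, "peak": 0, "done": []}
    await asyncio.gather(*(process_document(d, semaphore, stats) for d in doc_ids))
    return stats


if __name__ == "__main__":
    result = asyncio.run(ingest([f"doc-{i}" for i in range(8)], max_parallel_insert=2))
    print("peak concurrency:", result["peak"], "documents done:", len(result["done"]))
```

Tracking the peak makes the limit observable, which is useful when checking that a configured value is actually being enforced.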

Knob	Default / recommendation	Effect
`MAX_PARALLEL_INSERT`	Docs recommend 2-10	Number of documents processed concurrently
`MAX_ASYNC`	Default in source/docs is around 4	LLM request concurrency and the chunk extraction semaphore
Graph merge concurrency	`llm_model_max_async * 2`	Merge/rebuild of entities and relations; not every step calls the LLM
LLM backend concurrency	Depends on the provider or local server	The real limit; exceeding it causes retries or higher latency
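
The relationships in the table can be written down as a small helper. This is a sketch under stated assumptions: the function names are hypothetical, and using floor division for "around llm_model_max_async / 3" plus clamping to the 2-10 range is one reading of the recommendation, not a rule taken from the LightRAG source.

```python
def suggest_max_parallel_insert(llm_model_max_async: int, low: int = 2, high: int = 10) -> int:
    """Rule of thumb: roughly llm_model_max_async / 3, kept inside the 2-10 range.

    Floor division and clamping are assumptions made for this sketch.
    """
    if llm_model_max_async < 1:
        raise ValueError("llm_model_max_async must be >= 1")
    return max(low, min(high, llm_model_max_async // 3))


def graph_merge_concurrency(llm_model_max_async: int) -> int:
    """Merge/rebuild concurrency as listed in the table: llm_model_max_async * 2."""
    return llm_model_max_async * 2


if __name__ == "__main__":
    for max_async in (4, 12, 60):
        print(max_async, suggest_max_parallel_insert(max_async), graph_merge_concurrency(max_async))
```

Treat the result as a starting point and adjust it against the measured throughput of the LLM backend.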

2. Global LLM priority queue prioritizes queries over extraction

O.2

lightrag/lightrag.py:742-753; docs/LightRAG_concurrent_explain.md

The server must not let background indexing starve user queries. LightRAG wraps the LLM function in a priority/concurrency limiter; the docs describe queries as taking priority over merge and extraction.

LLM function wrapped by the priority limiter

self.llm_model_func = priority_limit_async_func_call(
    self.llm_model_max_async,
    llm_timeout=self.default_llm_timeout,
    queue_name="LLM func",
)(
    partial(
        self.llm_model_func,
        hashing_kv=hashing_kv,
        **self.llm_model_kwargs,
    )
)
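
To make the idea of a priority limiter concrete, here is a minimal sketch built on `asyncio.PriorityQueue`. It is not the implementation of `priority_limit_async_func_call`; the `PriorityLimiter` class, the priority constants, and `fake_llm` are hypothetical, and timeouts are left out for brevity.

```python
import asyncio
import itertools

QUERY, MERGE, EXTRACTION = 0, 1, 2  # lower number is served first


class PriorityLimiter:
    """Runs at most max_async calls at once, picking the highest priority first."""

    def __init__(self, max_async: int) -> None:
        self._max_async = max_async
        self._queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
        self._seq = itertools.count()  # tie-breaker: FIFO within one priority
        self._workers: list[asyncio.Task] = []

    async def _worker(self) -> None:
        while True:
            _, _, func, args, future = await self._queue.get()
            try:
                result = await func(*args)
            except Exception as exc:
                if not future.done():
                    future.set_exception(exc)
            else:
                if not future.done():
                    future.set_result(result)
            finally:
                self._queue.task_done()

    async def call(self, priority: int, func, *args):
        if not self._workers:  # start workers lazily inside the running loop
            self._workers = [asyncio.create_task(self._worker()) for _ in range(self._max_async)]
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((priority, next(self._seq), func, args, future))
        return await future

    async def close(self) -> None:
        for worker in self._workers:
            worker.cancel()
        await asyncio.gather(*self._workers, return_exceptions=True)


async def demo() -> list[str]:
    order: list[str] = []

    async def fake_llm(label: str) -> str:
        order.append(label)  # record the order in which calls actually start
        await asyncio.sleep(0.01)
        return label

    limiter = PriorityLimiter(max_async=1)
    tasks = [asyncio.create_task(limiter.call(EXTRACTION, fake_llm, f"extract-{i}")) for i in range(3)]
    tasks.append(asyncio.create_task(limiter.call(QUERY, fake_llm, "query")))
    await asyncio.gather(*tasks)
    await limiter.close()
    return order


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo the query is submitted last but starts first, because all four requests are queued before the single worker begins pulling work. A call that is already running is never preempted; priority only affects what is picked next.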

Advantages

Keeps query latency more stable while a large batch is being ingested.
A single place to enforce timeout and concurrency.
Easier to reason about than having each stage call the LLM without control.

Disadvantages

If MAX_ASYNC is too low, indexing slows down noticeably.
If MAX_ASYNC is too high, provider rate limits or local GPU memory become the bottleneck.
A priority queue cannot fix a model context length that is too short.

3. Delete + rebuild is harder than insert

O.3

lightrag/lightrag.py:3223+; lightrag/operate.py rebuild helpers

Deleting a document in a graph RAG is a consistency problem. If you only delete the chunks, the graph still holds the old entities and relations. LightRAG has to rebuild the affected entities and relations from the remaining source chunks.

delete doc_id -> acquire delete/pipeline state -> remove full_doc + text_chunks + chunk vectors -> find affected entity/relation source_ids -> rebuild entity descriptions from remaining chunks/cache -> rebuild relationship descriptions and VDB records -> update doc_status/cache if requested
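
The core of that flow (remove chunks, find affected entities, rebuild from what remains) can be illustrated with plain dictionaries. This is a toy sketch, not LightRAG's storage API: `delete_document`, the dictionary layouts, and the `SEP` value are assumptions for illustration, and the real rebuild re-summarizes descriptions from cached extraction results rather than joining chunk text.

```python
SEP = "<SEP>"  # illustrative separator for this sketch


def delete_document(doc_id: str, chunks: dict, entities: dict) -> tuple[list[str], list[str]]:
    """Delete one document and repair the entities that referenced its chunks.

    chunks:   chunk_id -> {"doc_id": str, "text": str}
    entities: name -> {"source_ids": list[str], "description": str}
    Returns (rebuilt entity names, dropped entity names).
    """
    removed = {cid for cid, chunk in chunks.items() if chunk["doc_id"] == doc_id}
    for cid in removed:
        del chunks[cid]

    rebuilt, dropped = [], []
    for name in list(entities):
        entity = entities[name]
        if not removed.intersection(entity["source_ids"]):
            continue  # entity never referenced the deleted document
        remaining = [cid for cid in entity["source_ids"] if cid not in removed]
        if not remaining:
            del entities[name]  # no supporting chunks left, so drop the entity
            dropped.append(name)
        else:
            entity["source_ids"] = remaining
            # Placeholder rebuild: join the remaining chunk texts.
            entity["description"] = SEP.join(chunks[cid]["text"] for cid in remaining)
            rebuilt.append(name)
    return rebuilt, dropped


if __name__ == "__main__":
    chunks = {
        "c1": {"doc_id": "A", "text": "Alice founded Acme."},
        "c2": {"doc_id": "B", "text": "Alice lives in Paris."},
    }
    entities = {
        "Alice": {"source_ids": ["c1", "c2"], "description": "stale"},
        "Acme": {"source_ids": ["c1"], "description": "stale"},
    }
    print(delete_document("A", chunks, entities), entities)
```

Even in this toy version, deleting only the chunks would leave "Acme" in the graph with no supporting source, which is the inconsistency the rebuild step exists to prevent.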

Relationship rebuild updates the graph and the relationship VDB

updated_relationship_data = {
    **current_relationship,
    "description": final_description,
    "keywords": combined_keywords,
    "weight": weight,
    "source_id": GRAPH_FIELD_SEP.join(limited_chunk_ids),
    "file_path": GRAPH_FIELD_SEP.join(file_paths_list),
    "truncate": truncation_info,
}
await knowledge_graph_inst.upsert_edge(src, tgt, updated_relationship_data)
await relationships_vdb.upsert(vdb_data)

4. Cache + token tracking keep cost under control

O.4

docs/AdvancedFeatures.md; lightrag/utils.py

Graph indexing calls the LLM many times, so caching and token accounting are operational requirements. LightRAG provides an LLM response cache, an extraction cache, query cache tools, and TokenTracker.

TokenTracker usage from the advanced docs

from lightrag.utils import TokenTracker

token_tracker = TokenTracker()

with token_tracker:
    result1 = await llm_model_func("your question 1")
    result2 = await llm_model_func("your question 2")

print("Token usage:", token_tracker.get_usage())

Cache/tool	Purpose	Notes
LLM response cache	Avoids re-calling the LLM for identical prompts	Can mislead debugging if you do not know a cache hit occurred
Entity extraction cache	Useful for rebuild/delete or retry	Do not clear it if you still need to reconstruct the graph
Query cache cleanup tool	Selective cleanup by mode/cache type	Different from the global `aclear_cache()`
TokenTracker	Counts usage within a batch or section	Enable it for benchmarks and cost estimation
Export data	CSV/Excel/MD/TXT graph backup/audit	Does not replace an official backend backup
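
Two notes in the table, cache hits that mislead debugging and selective cleanup by mode, are easier to see in code. The sketch below is a hypothetical in-memory cache, not LightRAG's cache storage: `VisibleCache`, its key scheme, and `clear_mode` are invented for illustration.

```python
import hashlib
from typing import Callable


class VisibleCache:
    """Response cache that reports whether each lookup was a hit."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[str, str]] = {}  # key -> (mode, response)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(mode: str, prompt: str) -> str:
        # Hash mode and prompt together so each query mode has separate entries.
        return hashlib.sha256(f"{mode}\x00{prompt}".encode("utf-8")).hexdigest()

    def get_or_compute(self, mode: str, prompt: str, compute: Callable[[str], str]) -> tuple[str, bool]:
        """Return (response, was_cache_hit); compute is only called on a miss."""
        key = self.make_key(mode, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key][1], True
        self.misses += 1
        response = compute(prompt)
        self._store[key] = (mode, response)
        return response, False

    def clear_mode(self, mode: str) -> int:
        """Selective cleanup: drop only the entries stored under one mode."""
        stale = [key for key, (stored_mode, _) in self._store.items() if stored_mode == mode]
        for key in stale:
            del self._store[key]
        return len(stale)


if __name__ == "__main__":
    cache = VisibleCache()
    print(cache.get_or_compute("local", "hello", str.upper))
    print(cache.get_or_compute("local", "hello", str.upper))
    print("hits:", cache.hits, "misses:", cache.misses)
```

Returning the hit flag alongside the response lets logs and benchmarks distinguish a fresh LLM call from a cached one.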

5. Tracing + RAGAS evaluation close the feedback loop

O.5

docs/AdvancedFeatures.md; lightrag/evaluation

RAG quality cannot be measured by whether an answer "sounds right". LightRAG integrates Langfuse tracing for OpenAI-compatible calls and ships a RAGAS evaluation script to assess retrieval and generation.

Langfuse env config

LANGFUSE_SECRET_KEY=""
LANGFUSE_PUBLIC_KEY=""
LANGFUSE_HOST="https://cloud.langfuse.com"
LANGFUSE_ENABLE_TRACE=true

Current limitation: the advanced docs state that the Langfuse integration currently focuses on OpenAI-compatible API calls. If you use Ollama, Azure, or AWS Bedrock, re-check trace coverage or add instrumentation at the app layer.
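
Because the keys in the env config are empty by default, a startup check can catch a tracing setup that silently does nothing. This is a sketch only: `langfuse_trace_problems` is a hypothetical helper, and treating any case variant of "true" as enabled is an assumption, not a statement about how LightRAG parses LANGFUSE_ENABLE_TRACE.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("LANGFUSE_SECRET_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_HOST")


def langfuse_trace_problems(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems with the Langfuse env config (empty list means OK)."""
    # Assumption: "true" in any letter case means tracing is enabled.
    if env.get("LANGFUSE_ENABLE_TRACE", "").strip().lower() != "true":
        return ["LANGFUSE_ENABLE_TRACE is not 'true'; tracing is disabled"]
    return [f"{name} is empty" for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    for problem in langfuse_trace_problems(os.environ):
        print("Langfuse config:", problem)
```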

Operations runbook

Symptom	Likely cause	Action
Insert batch is very slow	MAX_ASYNC too low, slow LLM backend, chunks too large	Measure token usage, raise concurrency cautiously, reduce chunk size or use a faster indexing model
Query latency rises during ingest	Background extraction/merge occupies the LLM queue	Lower MAX_PARALLEL_INSERT, watch the priority queue and provider rate limits
Graph has many duplicate nodes	Unstable entity naming	Tighten entity types/prompt, review in the WebUI, add domain-specific normalization
Answer lacks citations/references	file_paths/source_ids missing or chunks truncated	Pass file_paths on insert, enable include_references, inspect the raw context
Old facts still appear after delete	Rebuild of affected entities/relations not finished, or stale cache	Check pipeline status, clean the query cache if needed, export the graph for audit
Embedding query returns empty	Embedding model/dimension changed after indexing	Recreate the vector storage or migrate the vectors consistently
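
For the last row, a cheap guard is to compare the dimension recorded at index time with the dimension of a fresh query embedding before searching. The helper below is hypothetical and independent of any specific vector backend; how the indexed dimension is stored and retrieved is left to the deployment.

```python
def assert_embedding_dim(index_dim: int, query_vector: list[float]) -> None:
    """Fail fast when the query embedding does not match the indexed dimension."""
    if len(query_vector) != index_dim:
        raise ValueError(
            f"embedding dimension mismatch: index has {index_dim}, "
            f"query vector has {len(query_vector)}; recreate or migrate the vector storage"
        )


if __name__ == "__main__":
    assert_embedding_dim(3, [0.1, 0.2, 0.3])  # matching dimensions pass silently
    try:
        assert_embedding_dim(3, [0.1, 0.2, 0.3, 0.4])
    except ValueError as exc:
        print(exc)
```

A loud failure at this point is easier to diagnose than an empty result set.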

Summary

Operations takeaways

The LightRAG production bottleneck is usually LLM concurrency/context, not CPU.
Delete/rebuild is a consistency workflow; do not treat it like deleting a single row.
Caching reduces cost but makes debugging harder without cache visibility.
Tracing and evaluation need to be set up from the start if you want to compare query modes or rerank configs rigorously.

References