Deep Dive: LightRAG Indexing Pipeline

A detailed analysis drawn from the LightRAG Overview report, covering the path from document ingestion to the graph and vector indexes.

Parent report: LightRAG Overview | Topic: Indexing | Source snapshot: v1.4.15 | Date: 2026-04-22

Overview

Indexing in LightRAG is the step that turns raw documents into a queryable graph artifact. Where NaiveRAG treats a document as a set of chunks, LightRAG treats it as a source from which to generate entity nodes, relation edges, entity/relation vector records, text chunks, and document processing state.

Figure: LightRAG architecture diagram with graph-based text indexing and dual-level retrieval. Indexing builds a graph from raw text, and that graph then serves as the foundation for low-level and high-level retrieval. Source: HKUDS/LightRAG README

Key observation: LightRAG does not rebuild the whole graph when a new document arrives. It extracts the new document's graph fragment and merges it into the existing graph via source IDs, descriptions, and vector records. This is what makes it a better fit than GraphRAG for dynamic knowledge bases.

Overall flow

Figure: LightRAG indexing flowchart, from input documents to the vector database, JSON KV store, and graph. The flowchart shows LightRAG writing several artifacts in parallel: full docs, chunks, vector records, and graph nodes/edges. Source: LearnOpenCV

insert / ainsert
-> apipeline_enqueue_documents()
-> doc_status: PENDING
-> apipeline_process_enqueue_documents()
-> chunking_by_token_size()
-> chunks_vdb.upsert()
-> extract_entities()
-> merge_nodes_and_edges()
-> graph.upsert_node / graph.upsert_edge
-> entities_vdb / relationships_vdb upsert
-> doc_status: PROCESSED

Figure: LightRAG WebUI screenshot showing graph exploration after indexing. Once indexing finishes, the WebUI can visualize the graph artifact. This is an important feedback loop for spotting entity aliases and noisy relations. Source: HKUDS/LightRAG README

Pipeline internals

1. Enqueue + document status as the checkpoint layer

I.1

lightrag/lightrag.py:1201-1270, 1760-1872

Indexing should not be a fire-and-forget function. LightRAG separates enqueue from process and maintains a track ID, document status, a pipeline busy flag, and a cancellation flag. This matters a great deal when RAG runs on a server that accepts many file uploads.

insert is a thin wrapper around ainsert, which returns the track_id

async def ainsert(
    self,
    input: str | list[str],
    split_by_character: str | None = None,
    split_by_character_only: bool = False,
    ids: str | list[str] | None = None,
    file_paths: str | list[str] | None = None,
    track_id: str | None = None,
) -> str:
    if track_id is None:
        track_id = generate_track_id("insert")

    await self.apipeline_enqueue_documents(input, ids, file_paths, track_id)
    await self.apipeline_process_enqueue_documents(
        split_by_character, split_by_character_only
    )
    return track_id

Advantages

Insert progress can be monitored through the track ID and document status.
Failed, pending, and processing documents can be retried by the pipeline.
The pipeline lock prevents multiple workers from mutating the graph concurrently in unsafe ways.

Drawbacks

More complex than a simple ingestion script.
If doc_status becomes corrupt, consistency-check and recovery logic is needed.
Status must be exposed clearly in the API/UI so users understand which stage a file is in.
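
The enqueue/process split can be illustrated with a small in-memory state machine. This is a minimal sketch and not LightRAG's implementation: `DocStatus`, `MiniPipeline`, and the `handler` callback are hypothetical names, and a real pipeline would add locking, cancellation, and persistent storage.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable
import uuid


class DocStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    PROCESSED = "processed"
    FAILED = "failed"


@dataclass
class MiniPipeline:
    """Hypothetical in-memory stand-in for a doc-status checkpoint layer."""

    handler: Callable[[str], None]
    docs: dict[str, str] = field(default_factory=dict)
    status: dict[str, DocStatus] = field(default_factory=dict)
    tracks: dict[str, list[str]] = field(default_factory=dict)

    def enqueue(self, texts: list[str]) -> str:
        """Register documents as PENDING and return a track id; nothing is processed yet."""
        track_id = f"insert-{uuid.uuid4().hex[:8]}"
        for text in texts:
            doc_id = f"doc-{uuid.uuid4().hex[:8]}"
            self.docs[doc_id] = text
            self.status[doc_id] = DocStatus.PENDING
            self.tracks.setdefault(track_id, []).append(doc_id)
        return track_id

    def process(self) -> None:
        """Process every document that is not PROCESSED, so failed ones are retried."""
        for doc_id, state in list(self.status.items()):
            if state is DocStatus.PROCESSED:
                continue
            self.status[doc_id] = DocStatus.PROCESSING
            try:
                self.handler(self.docs[doc_id])
                self.status[doc_id] = DocStatus.PROCESSED
            except Exception:
                self.status[doc_id] = DocStatus.FAILED

    def track_status(self, track_id: str) -> dict[str, str]:
        """Report the current status of every document under one track id."""
        return {doc_id: self.status[doc_id].value for doc_id in self.tracks[track_id]}
```

Because status is recorded per document, calling `process()` again after a failure only touches the documents that did not finish, which is the property that makes the checkpoint layer worth its extra complexity.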

2. Chunking preserves order and source path

I.2

lightrag/operate.py:88-126

Chunking determines the granularity of the graph. Chunks that are too large make extraction harder; chunks that are too small strip relations of their context. LightRAG defaults to 1200 tokens per chunk with a 100-token overlap.

Token chunking

for index, start in enumerate(
    range(0, len(tokens), chunk_token_size - chunk_overlap_token_size)
):
    chunk_content = tokenizer.decode(tokens[start : start + chunk_token_size])
    results.append(
        {
            "tokens": min(chunk_token_size, len(tokens) - start),
            "content": chunk_content.strip(),
            "chunk_order_index": index,
        }
    )

Parameter	Default / meaning	Risk if misconfigured
`chunk_token_size`	1200 tokens in the docs and paper experiments	Too large and LLM extraction misses items; too small and relation context is broken
`chunk_overlap_token_size`	100 tokens	Low overlap loses continuity; high overlap raises indexing cost
`file_paths`	Metadata for citation and source tracking	Without a source path, references and deletion/rebuild are hard to audit
`split_by_character_only`	For custom documents	If a chunk still exceeds the limit, the pipeline must fail loudly rather than truncate silently
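
The stride of the chunking loop is `chunk_token_size - chunk_overlap_token_size`, so with the defaults a new chunk starts every 1100 tokens. The sketch below reproduces only that window arithmetic, without a tokenizer; `chunk_windows` is a hypothetical helper written for this illustration.

```python
def chunk_windows(
    n_tokens: int,
    chunk_token_size: int = 1200,
    chunk_overlap_token_size: int = 100,
) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for each chunk, with end exclusive."""
    stride = chunk_token_size - chunk_overlap_token_size
    if stride <= 0:
        raise ValueError("overlap must be smaller than the chunk size")
    return [
        (start, min(start + chunk_token_size, n_tokens))
        for start in range(0, n_tokens, stride)
    ]


# A 3000-token document yields three chunks of 1200, 1200, and 800 tokens.
for index, (start, end) in enumerate(chunk_windows(3000)):
    print(index, start, end, end - start)
```

One consequence of this loop shape is worth knowing: a document of exactly 1200 tokens produces a second window `(1100, 1200)` that contains only overlap, so very short trailing chunks are normal and not a sign of a bug.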

3. The entity/relation extraction parser is hardened against malformed output

I.3

lightrag/operate.py:937-1062

LLM extraction output is never perfectly clean. The LightRAG source devotes considerable logic to repairing delimiter corruption, normalizing relation/entity markers, and warning when the completion delimiter is missing. This suggests the system has been tested against real-world output.

The parser repairs delimiters and splits entity/relation records

record = fix_tuple_delimiter_corruption(record, delimiter_core, tuple_delimiter)
record_attributes = split_string_by_multi_markers(record, [tuple_delimiter])

entity_data = _handle_single_entity_extraction(
    record_attributes, chunk_key, timestamp, file_path
)
if entity_data is not None:
    truncated_name = _truncate_entity_identifier(
        entity_data["entity_name"],
        DEFAULT_ENTITY_NAME_MAX_LENGTH,
        chunk_key,
        "Entity name",
    )
    maybe_nodes[truncated_name].append(entity_data)
    continue

relationship_data = _handle_single_relationship_extraction(
    record_attributes, chunk_key, timestamp, file_path
)

Prompt engineering alone is not enough. Production graph extraction also needs parser hardening, length limits, source tracking, and observable warnings. If you simply trust the LLM to emit the correct format, the graph will degrade silently.
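
To make the idea of parser hardening concrete, here is a minimal sketch of a tolerant record parser. It is not LightRAG's parser: the delimiter strings, the field layout, the name-length cap, and the function `parse_records` are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

TUPLE_DELIMITER = "<|#|>"              # assumed field delimiter, illustration only
COMPLETION_DELIMITER = "<|COMPLETE|>"  # assumed end-of-output marker
MAX_NAME_LENGTH = 256                  # assumed cap on entity name length


def parse_records(llm_output: str, chunk_key: str) -> tuple[dict, list, list]:
    """Split raw LLM output into entities and relations, collecting warnings."""
    warnings = []
    if COMPLETION_DELIMITER not in llm_output:
        warnings.append(f"{chunk_key}: missing completion delimiter")
    body = llm_output.replace(COMPLETION_DELIMITER, "")

    # Repair delimiters whose brackets or pipes were dropped, e.g. "<|#>" or "|#|".
    body = re.sub(r"<?\|#\|?>?", TUPLE_DELIMITER, body)

    entities = defaultdict(list)
    relations = []
    for line in body.splitlines():
        fields = [part.strip() for part in line.split(TUPLE_DELIMITER)]
        kind = fields[0].lower()
        if kind == "entity" and len(fields) == 4 and fields[1]:
            name = fields[1][:MAX_NAME_LENGTH]  # enforce the length limit
            entities[name].append(
                {"type": fields[2], "description": fields[3], "source_id": chunk_key}
            )
        elif kind == "relation" and len(fields) == 5:
            relations.append(
                {
                    "src": fields[1],
                    "tgt": fields[2],
                    "keywords": fields[3],
                    "description": fields[4],
                    "source_id": chunk_key,
                }
            )
        elif line.strip():
            # Surface the problem instead of dropping the record silently.
            warnings.append(f"{chunk_key}: skipped malformed record {line!r}")
    return dict(entities), relations, warnings
```

The design choice to return warnings alongside the parsed data, instead of raising or logging inside the parser, keeps the decision about what counts as a fatal extraction failure with the caller.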

4. Merging nodes/edges updates both the graph and the vector records

I.4

lightrag/operate.py:1623-1770, 1430-1596

The graph and the vector index must stay consistent. Nodes and edges are merged in the graph, but each also needs a corresponding vector record so that retrieval can find the entity or relation through a semantic query.

Relationship merge builds the content for the relationship VDB

rel_vdb_id = compute_mdhash_id(src + tgt, prefix="rel-")
rel_content = f"{combined_keywords}\t{src}\n{tgt}\n{final_description}"
vdb_data = {
    rel_vdb_id: {
        "src_id": src,
        "tgt_id": tgt,
        "source_id": updated_relationship_data["source_id"],
        "content": rel_content,
        "keywords": combined_keywords,
        "description": final_description,
        "weight": weight,
        "file_path": updated_relationship_data["file_path"],
    }
}
await relationships_vdb.upsert(vdb_data)

Advantages

Graph traversal and vector retrieval both operate on the same semantic artifact.
Descriptions are deduplicated and summarized, which makes relation content more concise than raw chunks.
Source IDs and file paths preserve traceability back to the original document and chunk.

Drawbacks

If the graph write succeeds but the vector write fails, retrieval can lose the entity or relation.
Merging multiple aliases of the same entity still depends on the naming quality of the LLM.
Description summaries are LLM-generated, so they can drift if the model is unstable.
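
The first drawback suggests a periodic audit that looks for graph edges with no matching vector record. The sketch below assumes relationship IDs are a prefix plus the MD5 hex digest of `src + tgt`, mirroring the `compute_mdhash_id(src + tgt, prefix="rel-")` call in the excerpt; `mdhash_id` is a stand-in helper, and the graph and VDB are plain Python sets, not LightRAG storage objects.

```python
from hashlib import md5


def mdhash_id(content: str, prefix: str = "") -> str:
    """Stand-in for a content-hash ID helper (assumed: prefix + MD5 hex digest)."""
    return prefix + md5(content.encode("utf-8")).hexdigest()


def find_missing_relation_vectors(
    graph_edges: set[tuple[str, str]],
    vdb_ids: set[str],
) -> list[tuple[str, str]]:
    """Return graph edges that have no vector record under either orientation."""
    missing = []
    for src, tgt in sorted(graph_edges):
        # The hash of src + tgt is order-sensitive, so check both orientations.
        candidates = {mdhash_id(src + tgt, "rel-"), mdhash_id(tgt + src, "rel-")}
        if not candidates & vdb_ids:
            missing.append((src, tgt))
    return missing


edges = {("Alice", "Acme"), ("Acme", "Berlin")}
vdb = {mdhash_id("Alice" + "Acme", prefix="rel-")}
print(find_missing_relation_vectors(edges, vdb))  # [('Acme', 'Berlin')]
```

An audit like this only detects the inconsistency. Repairing it means re-running the vector upsert for the reported edges.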

5. Incremental update: union the graph fragment instead of rebuilding globally

I.5

Paper §3.1; lightrag/lightrag.py delete/rebuild paths

This is the main advantage over GraphRAG for dynamic data. GraphRAG relies on community reports, so a new update can trigger regeneration; LightRAG merges new entities and relations into the existing graph.

Existing graph:  G = (V, E)
New document:    D'
Extracted graph: G' = (V', E')
LightRAG update:
    V := merge(V, V')
    E := merge(E, E')
    source_ids := source_ids + new_chunk_ids
    descriptions := summarize_or_append()

Deletion is the inverse operation and is harder than insertion. Deleting a document requires rebuilding the affected entities and relations from the remaining chunks, not just deleting chunk vectors. That is why LightRAG stores `entity_chunks` and `relation_chunks`: to know which sources support each entity and relation.
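
The asymmetry between insertion and deletion can be sketched with plain dictionaries. This is an illustration under simplifying assumptions, not LightRAG's code: `Node`, `merge_fragment`, and `delete_chunks` are hypothetical names, only nodes are modeled (edges would follow the same pattern), and description summarization is reduced to deduplicated appending.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    descriptions: list[str] = field(default_factory=list)
    source_ids: list[str] = field(default_factory=list)


def merge_fragment(graph: dict[str, Node], fragment: dict[str, Node]) -> None:
    """Union a per-document fragment into the existing graph, in place."""
    for name, new in fragment.items():
        node = graph.setdefault(name, Node())
        for description in new.descriptions:
            if description not in node.descriptions:  # dedupe descriptions
                node.descriptions.append(description)
        for source_id in new.source_ids:
            if source_id not in node.source_ids:
                node.source_ids.append(source_id)


def delete_chunks(graph: dict[str, Node], removed: set[str]) -> list[str]:
    """Drop chunk ids from the graph and return the nodes that need a rebuild."""
    to_rebuild = []
    for name in list(graph):
        node = graph[name]
        remaining = [s for s in node.source_ids if s not in removed]
        if len(remaining) == len(node.source_ids):
            continue  # node is not supported by any removed chunk
        if not remaining:
            del graph[name]  # no supporting source is left
        else:
            node.source_ids = remaining
            # Descriptions may still cite removed text, so flag for regeneration.
            to_rebuild.append(name)
    return to_rebuild
```

Insertion only ever appends, so it needs no knowledge of other documents. Deletion has to consult the per-node source list to decide between removing a node and rebuilding it, which is why that mapping must be stored at indexing time.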

Implementation recipe

1. Prototype with default storage
Use JSON KV, NanoVectorDB, and NetworkX to check graph quality first.
```
uv pip install lightrag-hku
export OPENAI_API_KEY=...
python examples/lightrag_openai_demo.py
```
2. Inspect the graph after the first few documents
Run the server/WebUI and check entity aliases, relation noise, and source paths.
```
uv tool install "lightrag-hku[api]"
lightrag-server
```
3. Lock the schema before scaling
Choose the embedding model, dimension, entity types, chunk size, and storage backend before ingesting a large corpus.
```
EMBEDDING_MODEL=BAAI/bge-m3
EMBEDDING_DIM=1024
ENTITY_TYPES=organization,person,location,event,concept
```
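
One way to enforce a locked schema is to record the chosen values on first ingest and refuse to continue if they later change. The sketch below is a hypothetical guard and not part of LightRAG: `check_schema_lock`, the lock file, and the choice of which keys to lock are all assumptions for illustration.

```python
import json
from pathlib import Path

# Settings assumed to be unsafe to change once a corpus has been indexed.
LOCKED_KEYS = ("EMBEDDING_MODEL", "EMBEDDING_DIM", "ENTITY_TYPES")


def check_schema_lock(settings: dict[str, str], lock_path: Path) -> None:
    """Write the lock file on first use; afterwards reject any changed value."""
    current = {key: settings[key] for key in LOCKED_KEYS}
    if not lock_path.exists():
        lock_path.write_text(json.dumps(current, indent=2), encoding="utf-8")
        return
    locked = json.loads(lock_path.read_text(encoding="utf-8"))
    changed = {
        key: (locked.get(key), value)
        for key, value in current.items()
        if locked.get(key) != value
    }
    if changed:
        raise RuntimeError(f"schema changed since first ingest: {changed}")
```

Calling this guard with values read from the environment before each ingestion run would turn a silent embedding-dimension mismatch into an immediate, explicit failure.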

Summary

Indexing takeaways

LightRAG indexing is a graph construction pipeline, not just chunk embedding.
Parser hardening and source tracking are mandatory because LLM extraction output is never fully stable.
Incremental update is an architectural advantage, but entity aliases and relation noise still need to be controlled.
Before going to production, use the WebUI to inspect graph quality on a sample corpus.

References