A LanceDB-based RAG implementation meant to integrate with Obsidian and other memory-service and note-taking plugins.

ldb-rag

ldb-rag is a local-first RAG framework built around LanceDB and exposed through a Python SDK, a CLI, and an MCP server. It is intended for developer-facing knowledge bases: source trees, Markdown docs, text notes, git repositories, and web pages that you want to retrieve from or answer questions about with citations.

What This Project Is

This framework gives you four practical surfaces:

  • SDK: use the application services directly from Python
  • CLI: create collections, ingest data, search, answer, and manage sessions
  • MCP server: expose the same capabilities to coding agents and tool-driven clients
  • provider adapters: connect local or cloud embedding/generation backends

The current implementation supports:

  • LanceDB as the vector store
  • filesystem ingestion
  • git repository ingestion
  • URL/web page ingestion
  • Markdown and code-aware parsing/chunking
  • hybrid retrieval with optional reranking
  • persisted conversation sessions
  • live provider test scaffolding for Ollama, OpenAI, Anthropic, Gemini, and Qwen-compatible APIs

What Problem It Solves

ldb-rag is useful when you want a local knowledge base that can:

  • ingest a codebase and answer architectural questions with source citations
  • ingest Markdown docs and retrieve relevant sections quickly
  • ingest a git repo or URL set and unify them into one collection
  • expose the same retrieval and answer capabilities to CLI scripts and MCP clients
  • preserve session transcripts for iterative question-answer workflows

Realistic Use Cases

1. Local code assistant for one repository

Use this when you want to ask questions like:

  • "Where is greet_user implemented?"
  • "Which files define shell bootstrap behavior?"
  • "Summarize the main Python entrypoints and cite them."

Example:

ldb-rag collection create my-repo
ldb-rag ingest path my-repo ./src
ldb-rag answer my-repo "How does the configuration loader work?"

2. Unified docs + code knowledge base

Use this when your project has README.md, docs, scripts, and source files and you want one RAG collection over all of them.

Example:

ldb-rag collection create project-docs
ldb-rag ingest path project-docs ./docs
ldb-rag ingest path project-docs ./src
ldb-rag search project-docs "Where are citations assembled?"

3. Git repo ingestion for remote or cached repos

Use this when the knowledge source is a repository rather than a local folder.

Example:

ldb-rag collection create upstream
ldb-rag ingest git upstream https://github.com/example/project.git --ref main
ldb-rag answer upstream "Summarize the entrypoints."

4. URL/web ingestion for lightweight external docs

Use this when you need a few pages from a documentation site or internal HTTP source.

Example:

ldb-rag collection create web-notes
ldb-rag ingest url web-notes https://example.com/docs/page-a https://example.com/docs/page-b
ldb-rag search web-notes "Which page explains authentication?"

5. Session-backed research or debugging workflow

Use this when you want a persistent conversation session with timestamps and stored messages.

Example:

ldb-rag answer my-repo "Explain the retrieval pipeline." --session-title "retrieval-review"
ldb-rag session list
ldb-rag session show <session-id>

6. MCP integration for agent clients

Use this when you want Codex-style agent harnesses or other MCP clients to query the same collections through structured tool calls rather than by shelling out manually.

Example:

ldb-rag mcp serve

Integration Model

The simplest way to integrate the framework is to choose one of these paths:

CLI-first integration

Best for shell scripts, dev workflows, and manual use.

Basic lifecycle:

ldb-rag init
ldb-rag collection create sample
ldb-rag ingest path sample ./docs
ldb-rag search sample "What does this project do?"
ldb-rag answer sample "Summarize the architecture and cite sources."

SDK-first integration

Best for Python applications or custom orchestration.

Conceptually:

from ldb_rag.application import build_app
from ldb_rag.domain.models import RetrievalMode, SearchQuery

app = build_app("config/local.toml")
app.collections.create("sample")
app.ingestion.ingest_path("sample", "./docs")
result = app.answers.answer(
    SearchQuery(collection="sample", text="Summarize the docs", mode=RetrievalMode.HYBRID)
)
print(result.answer)

MCP-first integration

Best for agent systems that already speak MCP.

Run:

ldb-rag mcp serve

Then connect your MCP-capable client to the ldb-rag server and use tools such as:

  • collections_list
  • collection_create
  • ingest_path
  • ingest_git
  • ingest_urls
  • search
  • answer
  • session_create
  • session_show
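
An MCP client invokes these through standard `tools/call` requests. A sketch of what a search call might look like on the wire (the argument names here are assumptions inferred from the CLI, not confirmed tool schemas):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search",
    "arguments": {
      "collection": "my-repo",
      "query": "How does the configuration loader work?"
    }
  }
}
```

The exact parameter names and response shapes should be taken from the generated docs/reference/mcp-tools.md once it exists.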

ZeroClaw + Obsidian integration

Best for using an Obsidian vault as long-term knowledge inside ZeroClaw.

Recommended flow:

ldb-rag collection create obsidian-memory --config config/obsidian.toml
ldb-rag ingest sync obsidian-memory /path/to/ObsidianVault --config config/obsidian.toml
ldb-rag mcp serve --config config/obsidian.toml

Then configure ZeroClaw to connect to the ldb-rag MCP server over stdio. Keep ZeroClaw's own sqlite memory backend enabled for conversational autosave, and use ldb-rag for retrieval over your notes and documentation corpus.

See:

  • docs/zeroclaw-integration.md
  • examples/zeroclaw/ldb-rag.obsidian.toml
  • examples/zeroclaw/zeroclaw-mcp.toml
  • scripts/sync_obsidian_vault.sh

Command Reference

Core setup

ldb-rag init
ldb-rag config validate --json

Collection management

ldb-rag collection create <name>
ldb-rag collection list
ldb-rag collection stats <name>
ldb-rag collection delete <name>

Ingestion

ldb-rag ingest path <collection> <path>
ldb-rag ingest sync <collection> <path>
ldb-rag ingest git <collection> <repo> --ref <ref> --subdir <subdir>
ldb-rag ingest url <collection> <url1> [url2 ...]
ldb-rag sync <collection> <path>

Retrieval and answer generation

ldb-rag search <collection> "<query>"
ldb-rag answer <collection> "<query>"
ldb-rag answer <collection> "<query>" --session-title "<title>"
ldb-rag answer <collection> "<query>" --session-id "<session-id>"

Useful flags:

  • --json
  • --config <path>
  • --mode dense|hybrid|docs_only|code_only|mixed
  • --top-k <n>
  • --language <language>
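
These flags combine on a single invocation. A hypothetical example, assuming an existing `sample` collection (the guard makes the snippet a no-op when the CLI is not installed):

```shell
# Hybrid retrieval, eight results, machine-readable output.
# Skipped cleanly when ldb-rag is absent from PATH.
if command -v ldb-rag >/dev/null 2>&1; then
  ldb-rag search sample "Where are citations assembled?" --mode hybrid --top-k 8 --json
fi
```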

Sources and sessions

ldb-rag source show <collection> --json
ldb-rag session create <collection> "<title>"
ldb-rag session list
ldb-rag session show <session-id>
ldb-rag session delete <session-id>

MCP

ldb-rag mcp serve

Providers

Embeddings

  • dummy
  • ollama
  • openai
  • openai_compatible

Generation

  • dummy
  • ollama
  • openai
  • openai_compatible
  • anthropic
  • gemini

Reranking

  • simple
  • noop

Live-network test targets

  • Ollama
  • OpenAI
  • Anthropic
  • Gemini
  • Qwen through an OpenAI-compatible endpoint

Run the opt-in live tests with:

scripts/run_live_provider_tests.sh
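
Live tests need provider credentials in the environment. The variable names below are assumptions based on each provider's common SDK conventions; consult docs/providers.md for the names the scaffolding actually reads:

```shell
# Hypothetical variable names -- docs/providers.md lists the real ones.
export OLLAMA_HOST="http://localhost:11434"   # assumption: common Ollama convention
export OPENAI_API_KEY="sk-example"            # assumption: standard OpenAI SDK variable
if [ -x scripts/run_live_provider_tests.sh ]; then
  scripts/run_live_provider_tests.sh
fi
```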

Connectors

Implemented

  • filesystem
  • git
  • web

Connector usage examples

Filesystem:

ldb-rag ingest path my-kb ./project

Git:

ldb-rag ingest git my-kb https://github.com/example/project.git --ref main

Web:

ldb-rag ingest url my-kb https://example.com/docs/intro

Configuration

Configuration is layered in this order:

  1. config/default.toml
  2. ~/.config/ldb-rag/config.toml
  3. config/local.toml
  4. --config <path>
  5. selected environment variables

Important sections:

  • storage
  • embeddings
  • generation
  • reranker
  • retrieval
  • chunking
  • connectors
  • git
  • web
  • sessions

Start with a local config like:

[storage]
db_path = "data/lancedb"

[sessions]
db_path = "data/sessions.sqlite3"

[embeddings]
provider_type = "dummy"
model = "hash-32"

[generation]
provider_type = "dummy"
model = "grounded-summary"

[reranker]
enabled = true
provider_type = "simple"
model = "keyword-boost"

Documentation Architecture

This repo should not rely on a single giant README forever. The documentation should be generated and maintained as a tiered Markdown set.

Tier 0: entrypoint documents

These are the documents most users should read first.

  • README.md: project overview, installation, common workflows, documentation map
  • docs/cli.md: command reference and examples
  • docs/configuration.md: config model, provider setup, environment variables

Tier 1: operator and integration guides

These are the next documents a serious user should read.

  • docs/connectors.md: connector behavior, supported sources, caveats, sync expectations
  • docs/providers.md: provider support matrix, auth expectations, live-test environment variables
  • docs/mcp.md: MCP tool inventory, integration patterns, client expectations
  • docs/storage-abstraction.md: backend contract boundaries and swap strategy

Tier 2: internal engineering references

These are meant for maintainers and contributors.

  • docs/architecture.md: high-level system layout and data flow
  • docs/roadmap.md: future work and implementation direction
  • docs/testing.md: local test matrix, live-provider tests, fixtures, CI expectations
  • docs/sessions.md: persisted session schema, timestamps, retention, transcript behavior
  • docs/retrieval.md: ranking, hybrid search, reranking, context assembly

Tier 3: generated reference documents

These should eventually be produced from source-of-truth structures rather than maintained by hand.

  • docs/reference/cli-commands.md: generated from the Typer command tree; covers commands, options, arguments, examples
  • docs/reference/mcp-tools.md: generated from the MCP tool definitions; covers tool names, parameters, response behavior
  • docs/reference/config-schema.md: generated from the Pydantic config models; covers fields, defaults, types, environment overrides
  • docs/reference/provider-matrix.md: generated from the provider adapter registry; covers provider names, endpoints, auth model, capabilities

How Documentation Should Be Generated

Documentation generation should follow these rules:

Source of truth

  • command docs come from the CLI command tree
  • MCP docs come from MCP tool definitions
  • config docs come from Pydantic config models and default TOML
  • provider docs come from provider factory mappings and adapter implementations
  • connector docs come from connector classes and tests

Output structure

Generated documents should live under:

  • docs/reference/

Handwritten narrative documents should stay under:

  • docs/

The top-level README.md should remain curated and human-written.

Generation workflow

  1. inspect source code for CLI commands, MCP tools, model schemas, and provider registries
  2. transform those into structured intermediate data
  3. render Markdown templates into docs/reference/*.md
  4. link those generated docs from README.md and narrative docs
  5. validate links and examples as part of CI
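
Steps 2 and 3 of this workflow can be sketched in plain Python: structured intermediate data as a list of records, and a small renderer that turns them into a Markdown page. The command entries below are illustrative stand-ins, not data extracted from the real CLI tree:

```python
# Minimal sketch of steps 2-3: structured command records rendered to Markdown.
# The records here are hypothetical examples, not the actual ldb-rag commands.

def render_cli_reference(commands: list[dict]) -> str:
    """Render command records into a Markdown reference page."""
    lines = ["# CLI Command Reference", ""]
    for cmd in commands:
        lines.append(f"## {cmd['name']}")
        lines.append("")
        lines.append(cmd["help"])
        lines.append("")
        if cmd.get("options"):
            lines.append("Options:")
            for opt, desc in cmd["options"].items():
                lines.append(f"- `{opt}`: {desc}")
            lines.append("")
    return "\n".join(lines)

commands = [
    {
        "name": "search",
        "help": "Run retrieval against a collection.",
        "options": {"--top-k <n>": "number of results", "--json": "machine-readable output"},
    },
]

markdown = render_cli_reference(commands)
print(markdown)
```

In the real generator, step 1 would populate `commands` by walking the Typer command tree instead of hard-coding records, and the result would be written to docs/reference/cli-commands.md.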

What each generated document should contain

docs/reference/cli-commands.md

  • command hierarchy
  • purpose of each command
  • arguments and options
  • JSON-capable flags
  • short examples

docs/reference/mcp-tools.md

  • tool name
  • tool description
  • accepted parameters
  • output shape
  • example agent use

docs/reference/config-schema.md

  • config section name
  • field name
  • type
  • default
  • environment override if any
  • operational notes

docs/reference/provider-matrix.md

  • provider type
  • embedding/generation/reranker capability
  • endpoint shape
  • auth mechanism
  • live test coverage

docs/reference/connectors.md

  • connector name
  • accepted inputs
  • metadata produced
  • sync behavior
  • known limits

Documentation Generation TODO

This is the execution plan for producing the fuller doc set.

  1. Create docs/reference/ and reserve it for generated Markdown only.
  2. Add docs/testing.md, docs/sessions.md, and docs/retrieval.md as curated narrative docs.
  3. Implement a small documentation generator under scripts/ or tools/.
  4. Extract Typer command metadata into a CLI reference renderer.
  5. Extract MCP tool metadata into an MCP reference renderer.
  6. Extract Pydantic config schema into a config reference renderer.
  7. Extract provider/connector registry data into support matrix docs.
  8. Add a link checker and example-command validator to the test workflow.
  9. Add a docs build helper script to regenerate reference docs reproducibly.
  10. Add release-time checks that generated docs are up to date.

Man Page

This repo now includes an installable man page at:

  • man/man1/ldb-rag.1

Install it locally for one user with:

scripts/install_manpage.sh

Or manually:

install -Dm644 man/man1/ldb-rag.1 ~/.local/share/man/man1/ldb-rag.1
mandb ~/.local/share/man
man ldb-rag

If your system uses gzip-compressed man pages, compress after installation:

gzip -f ~/.local/share/man/man1/ldb-rag.1
mandb ~/.local/share/man

Current Document Map