A LanceDB-based RAG implementation meant to integrate with Obsidian and other memory-service and note-taking plugins.

ldb-rag

ldb-rag is a local-first RAG framework built around LanceDB and exposed through a Python SDK, a CLI, and an MCP server. It is intended for developer-facing knowledge bases: source trees, Markdown docs, text notes, git repositories, and web pages that you want to retrieve from or answer questions about with citations.

What This Project Is

This framework gives you four practical surfaces:

  • SDK: use the application services directly from Python
  • CLI: create collections, ingest data, search, answer, and manage sessions
  • MCP server: expose the same capabilities to coding agents and tool-driven clients
  • provider adapters: connect local or cloud embedding/generation backends

The current implementation supports:

  • LanceDB as the vector store
  • filesystem ingestion
  • git repository ingestion
  • URL/web page ingestion
  • Markdown and code-aware parsing/chunking
  • hybrid retrieval with optional reranking
  • persisted conversation sessions
  • live provider test scaffolding for Ollama, OpenAI, Anthropic, Gemini, and Qwen-compatible APIs

What Problem It Solves

ldb-rag is useful when you want a local knowledge base that can:

  • ingest a codebase and answer architectural questions with source citations
  • ingest Markdown docs and retrieve relevant sections quickly
  • ingest a git repo or URL set and unify them into one collection
  • expose the same retrieval and answer capabilities to CLI scripts and MCP clients
  • preserve session transcripts for iterative question-answer workflows

Realistic Use Cases

1. Local code assistant for one repository

Use this when you want to ask questions like:

  • "Where is greet_user implemented?"
  • "Which files define shell bootstrap behavior?"
  • "Summarize the main Python entrypoints and cite them."

Example:

ldb-rag collection create my-repo
ldb-rag ingest path my-repo ./src
ldb-rag answer my-repo "How does the configuration loader work?"

2. Unified docs + code knowledge base

Use this when your project has README.md, docs, scripts, and source files and you want one RAG collection over all of them.

Example:

ldb-rag collection create project-docs
ldb-rag ingest path project-docs ./docs
ldb-rag ingest path project-docs ./src
ldb-rag search project-docs "Where are citations assembled?"

3. Git repo ingestion for remote or cached repos

Use this when the knowledge source is a repository rather than a local folder.

Example:

ldb-rag collection create upstream
ldb-rag ingest git upstream https://github.com/example/project.git --ref main
ldb-rag answer upstream "Summarize the entrypoints."

4. URL/web ingestion for lightweight external docs

Use this when you need a few pages from a documentation site or internal HTTP source.

Example:

ldb-rag collection create web-notes
ldb-rag ingest url web-notes https://example.com/docs/page-a https://example.com/docs/page-b
ldb-rag search web-notes "Which page explains authentication?"

5. Session-backed research or debugging workflow

Use this when you want a persistent conversation session with timestamps and stored messages.

Example:

ldb-rag answer my-repo "Explain the retrieval pipeline." --session-title "retrieval-review"
ldb-rag session list
ldb-rag session show <session-id>

6. MCP integration for agent clients

Use this when you want Codex-style agent harnesses or other MCP clients to query the same collections through structured tool calls rather than by shelling out manually.

Example:

ldb-rag mcp serve

Integration Model

The simplest way to integrate the framework is to choose one of these paths:

CLI-first integration

Best for shell scripts, dev workflows, and manual use.

Basic lifecycle:

ldb-rag init
ldb-rag collection create sample
ldb-rag ingest path sample ./docs
ldb-rag search sample "What does this project do?"
ldb-rag answer sample "Summarize the architecture and cite sources."

SDK-first integration

Best for Python applications or custom orchestration.

Conceptually:

from ldb_rag.application import build_app
from ldb_rag.domain.models import RetrievalMode, SearchQuery

app = build_app("config/local.toml")
app.collections.create("sample")
app.ingestion.ingest_path("sample", "./docs")
result = app.answers.answer(
    SearchQuery(collection="sample", text="Summarize the docs", mode=RetrievalMode.HYBRID)
)
print(result.answer)

MCP-first integration

Best for agent systems that already speak MCP.

Run:

ldb-rag mcp serve

Then connect your MCP-capable client to the ldb-rag server and use tools such as:

  • collections_list
  • collection_create
  • ingest_path
  • ingest_git
  • ingest_urls
  • search
  • answer
  • session_create
  • session_show
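
An MCP client invokes these through standard `tools/call` requests. A sketch of what a search call might look like on the wire (the argument names here are assumptions inferred from the CLI, not confirmed tool schemas):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search",
    "arguments": {
      "collection": "my-repo",
      "query": "How does the configuration loader work?"
    }
  }
}
```

The exact parameter names and response shapes should be taken from the generated docs/reference/mcp-tools.md once it exists.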

ZeroClaw + Obsidian integration

Best for using an Obsidian vault as long-term knowledge inside ZeroClaw.

Recommended flow:

ldb-rag collection create obsidian-memory --config config/obsidian.toml
ldb-rag ingest sync obsidian-memory /path/to/ObsidianVault --config config/obsidian.toml
ldb-rag mcp serve --config config/obsidian.toml

Then configure ZeroClaw to connect to the ldb-rag MCP server over stdio. Keep ZeroClaw's own sqlite memory backend enabled for conversational autosave, and use ldb-rag for retrieval over your notes and documentation corpus.

See:

  • docs/zeroclaw-integration.md
  • examples/zeroclaw/ldb-rag.obsidian.toml
  • examples/zeroclaw/zeroclaw-mcp.toml
  • scripts/sync_obsidian_vault.sh

Command Reference

Core setup

ldb-rag init
ldb-rag config validate --json

Collection management

ldb-rag collection create <name>
ldb-rag collection list
ldb-rag collection stats <name>
ldb-rag collection delete <name>

Ingestion

ldb-rag ingest path <collection> <path>
ldb-rag ingest sync <collection> <path>
ldb-rag ingest git <collection> <repo> --ref <ref> --subdir <subdir>
ldb-rag ingest url <collection> <url1> [url2 ...]
ldb-rag sync <collection> <path>

Retrieval and answer generation

ldb-rag search <collection> "<query>"
ldb-rag answer <collection> "<query>"
ldb-rag answer <collection> "<query>" --session-title "<title>"
ldb-rag answer <collection> "<query>" --session-id "<session-id>"

Useful flags:

  • --json
  • --config <path>
  • --mode dense|hybrid|docs_only|code_only|mixed
  • --top-k <n>
  • --language <language>
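
These flags combine on a single invocation. A hypothetical example, assuming an existing `sample` collection (the guard makes the snippet a no-op when the CLI is not installed):

```shell
# Hybrid retrieval, eight results, machine-readable output.
# Skipped cleanly when ldb-rag is absent from PATH.
if command -v ldb-rag >/dev/null 2>&1; then
  ldb-rag search sample "Where are citations assembled?" --mode hybrid --top-k 8 --json
fi
```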

Sources and sessions

ldb-rag source show <collection> --json
ldb-rag session create <collection> "<title>"
ldb-rag session list
ldb-rag session show <session-id>
ldb-rag session delete <session-id>

MCP

ldb-rag mcp serve

Providers

Embeddings

  • dummy
  • ollama
  • openai
  • openai_compatible

Generation

  • dummy
  • ollama
  • openai
  • openai_compatible
  • anthropic
  • gemini

Reranking

  • simple
  • noop

Live-network test targets

  • Ollama
  • OpenAI
  • Anthropic
  • Gemini
  • Qwen through an OpenAI-compatible endpoint

Run the opt-in live tests with:

scripts/run_live_provider_tests.sh
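
Live tests need provider credentials in the environment. The variable names below are assumptions based on each provider's common SDK conventions; consult docs/providers.md for the names the scaffolding actually reads:

```shell
# Hypothetical variable names -- docs/providers.md lists the real ones.
export OLLAMA_HOST="http://localhost:11434"   # assumption: common Ollama convention
export OPENAI_API_KEY="sk-example"            # assumption: standard OpenAI SDK variable
if [ -x scripts/run_live_provider_tests.sh ]; then
  scripts/run_live_provider_tests.sh
fi
```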

Connectors

Implemented

  • filesystem
  • git
  • web

Connector usage examples

Filesystem:

ldb-rag ingest path my-kb ./project

Git:

ldb-rag ingest git my-kb https://github.com/example/project.git --ref main

Web:

ldb-rag ingest url my-kb https://example.com/docs/intro

Configuration

Configuration is layered in this order:

  1. config/default.toml
  2. ~/.config/ldb-rag/config.toml
  3. config/local.toml
  4. --config <path>
  5. selected environment variables

Important sections:

  • storage
  • embeddings
  • generation
  • reranker
  • retrieval
  • chunking
  • connectors
  • git
  • web
  • sessions

Start with a local config like:

[storage]
db_path = "data/lancedb"

[sessions]
db_path = "data/sessions.sqlite3"

[embeddings]
provider_type = "dummy"
model = "hash-32"

[generation]
provider_type = "dummy"
model = "grounded-summary"

[reranker]
enabled = true
provider_type = "simple"
model = "keyword-boost"

Documentation Architecture

This repo should not rely on a single giant README forever. The documentation should be generated and maintained as a tiered Markdown set.

Tier 0: entrypoint documents

These are the documents most users should read first.

  • README.md: project overview, installation, common workflows, documentation map
  • docs/cli.md: command reference and examples
  • docs/configuration.md: config model, provider setup, environment variables

Tier 1: operator and integration guides

These are the next documents a serious user should read.

  • docs/connectors.md: connector behavior, supported sources, caveats, sync expectations
  • docs/providers.md: provider support matrix, auth expectations, live-test environment variables
  • docs/mcp.md: MCP tool inventory, integration patterns, client expectations
  • docs/storage-abstraction.md: backend contract boundaries and swap strategy

Tier 2: internal engineering references

These are meant for maintainers and contributors.

  • docs/architecture.md: high-level system layout and data flow
  • docs/roadmap.md: future work and implementation direction
  • docs/testing.md: local test matrix, live-provider tests, fixtures, CI expectations
  • docs/sessions.md: persisted session schema, timestamps, retention, transcript behavior
  • docs/retrieval.md: ranking, hybrid search, reranking, context assembly

Tier 3: generated reference documents

These should eventually be produced from source-of-truth structures rather than maintained by hand.

  • docs/reference/cli-commands.md: generated from the Typer command tree; covers commands, options, arguments, examples
  • docs/reference/mcp-tools.md: generated from the MCP tool definitions; covers tool names, parameters, response behavior
  • docs/reference/config-schema.md: generated from the Pydantic config models; covers fields, defaults, types, environment overrides
  • docs/reference/provider-matrix.md: generated from the provider adapter registry; covers provider names, endpoints, auth model, capabilities

How Documentation Should Be Generated

Documentation generation should follow these rules:

Source of truth

  • command docs come from the CLI command tree
  • MCP docs come from MCP tool definitions
  • config docs come from Pydantic config models and default TOML
  • provider docs come from provider factory mappings and adapter implementations
  • connector docs come from connector classes and tests

Output structure

Generated documents should live under:

  • docs/reference/

Handwritten narrative documents should stay under:

  • docs/

The top-level README.md should remain curated and human-written.

Generation workflow

  1. inspect source code for CLI commands, MCP tools, model schemas, and provider registries
  2. transform those into structured intermediate data
  3. render Markdown templates into docs/reference/*.md
  4. link those generated docs from README.md and narrative docs
  5. validate links and examples as part of CI
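
Steps 2 and 3 of this workflow can be sketched in plain Python: structured intermediate data as a list of records, and a small renderer that turns them into a Markdown page. The command entries below are illustrative stand-ins, not data extracted from the real CLI tree:

```python
# Minimal sketch of steps 2-3: structured command records rendered to Markdown.
# The records here are hypothetical examples, not the actual ldb-rag commands.

def render_cli_reference(commands: list[dict]) -> str:
    """Render command records into a Markdown reference page."""
    lines = ["# CLI Command Reference", ""]
    for cmd in commands:
        lines.append(f"## {cmd['name']}")
        lines.append("")
        lines.append(cmd["help"])
        lines.append("")
        if cmd.get("options"):
            lines.append("Options:")
            for opt, desc in cmd["options"].items():
                lines.append(f"- `{opt}`: {desc}")
            lines.append("")
    return "\n".join(lines)

commands = [
    {
        "name": "search",
        "help": "Run retrieval against a collection.",
        "options": {"--top-k <n>": "number of results", "--json": "machine-readable output"},
    },
]

markdown = render_cli_reference(commands)
print(markdown)
```

In the real generator, step 1 would populate `commands` by walking the Typer command tree instead of hard-coding records, and the result would be written to docs/reference/cli-commands.md.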

What each generated document should contain

docs/reference/cli-commands.md

  • command hierarchy
  • purpose of each command
  • arguments and options
  • JSON-capable flags
  • short examples

docs/reference/mcp-tools.md

  • tool name
  • tool description
  • accepted parameters
  • output shape
  • example agent use

docs/reference/config-schema.md

  • config section name
  • field name
  • type
  • default
  • environment override if any
  • operational notes

docs/reference/provider-matrix.md

  • provider type
  • embedding/generation/reranker capability
  • endpoint shape
  • auth mechanism
  • live test coverage

docs/reference/connectors.md

  • connector name
  • accepted inputs
  • metadata produced
  • sync behavior
  • known limits

Documentation Generation TODO

This is the execution plan for producing the fuller doc set.

  1. Create docs/reference/ and reserve it for generated Markdown only.
  2. Add docs/testing.md, docs/sessions.md, and docs/retrieval.md as curated narrative docs.
  3. Implement a small documentation generator under scripts/ or tools/.
  4. Extract Typer command metadata into a CLI reference renderer.
  5. Extract MCP tool metadata into an MCP reference renderer.
  6. Extract Pydantic config schema into a config reference renderer.
  7. Extract provider/connector registry data into support matrix docs.
  8. Add a link checker and example-command validator to the test workflow.
  9. Add a docs build helper script to regenerate reference docs reproducibly.
  10. Add release-time checks that generated docs are up to date.

Man Page

This repo now includes an installable man page at:

  • man/man1/ldb-rag.1

Install it locally for one user with:

scripts/install_manpage.sh

Or manually:

install -Dm644 man/man1/ldb-rag.1 ~/.local/share/man/man1/ldb-rag.1
mandb ~/.local/share/man
man ldb-rag

If your system uses gzip-compressed man pages, compress after installation:

gzip -f ~/.local/share/man/man1/ldb-rag.1
mandb ~/.local/share/man

Current Document Map