# ldb-rag

ldb-rag is a local-first RAG framework built around LanceDB, with a Python SDK, a CLI, and an MCP
server. It is intended for developer-facing knowledge bases: source trees, Markdown docs, text
notes, git repositories, and web pages that you want to retrieve from or answer questions about
with citations.
## What This Project Is

This framework gives you four practical surfaces:

- **SDK**: use the application services directly from Python
- **CLI**: create collections, ingest data, search, answer, and manage sessions
- **MCP server**: expose the same capabilities to coding agents and tool-driven clients
- **Provider adapters**: connect local or cloud embedding/generation backends
The current implementation supports:
- LanceDB as the vector store
- filesystem ingestion
- git repository ingestion
- URL/web page ingestion
- Markdown and code-aware parsing/chunking
- hybrid retrieval with optional reranking
- persisted conversation sessions
- live provider test scaffolding for Ollama, OpenAI, Anthropic, Gemini, and Qwen-compatible APIs
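Hybrid retrieval combines a dense-vector ranking with a keyword ranking before optional reranking. As an illustration only (this is not the project's actual ranking code), one common fusion strategy is reciprocal-rank fusion:

```python
# Illustrative reciprocal-rank fusion (RRF). This is a generic technique
# sketch, not ldb-rag's implementation, shown to explain how a hybrid
# retriever can merge dense and keyword result lists into one ranking.

def rrf_fuse(dense_ids, keyword_ids, k=60):
    """Fuse two ranked lists of document ids into one hybrid ranking."""
    scores = {}
    for ranking in (dense_ids, keyword_ids):
        for rank, doc_id in enumerate(ranking):
            # 1 / (k + rank) dampens the influence of lower-ranked hits
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

hybrid = rrf_fuse(["a", "b", "c"], ["b", "d", "a"])
print(hybrid)  # documents appearing in both input lists rank first
```

Documents that appear in both rankings accumulate score from each list, which is why hybrid search tends to surface results that are both semantically and lexically relevant.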
## What Problem It Solves
ldb-rag is useful when you want a local knowledge base that can:
- ingest a codebase and answer architectural questions with source citations
- ingest Markdown docs and retrieve relevant sections quickly
- ingest a git repo or URL set and unify them into one collection
- expose the same retrieval and answer capabilities to CLI scripts and MCP clients
- preserve session transcripts for iterative question-answer workflows
## Realistic Use Cases

### 1. Local code assistant for one repository

Use this when you want to ask questions like:

- "Where is `greet_user` implemented?"
- "Which files define shell bootstrap behavior?"
- "Summarize the main Python entrypoints and cite them."

Example:

```sh
ldb-rag collection create my-repo
ldb-rag ingest path my-repo ./src
ldb-rag answer my-repo "How does the configuration loader work?"
```
### 2. Unified docs + code knowledge base

Use this when your project has README.md, docs, scripts, and source files and you want one RAG
collection over all of them.

Example:

```sh
ldb-rag collection create project-docs
ldb-rag ingest path project-docs ./docs
ldb-rag ingest path project-docs ./src
ldb-rag search project-docs "Where are citations assembled?"
```
### 3. Git repo ingestion for remote or cached repos

Use this when the knowledge source is a repository rather than a local folder.

Example:

```sh
ldb-rag collection create upstream
ldb-rag ingest git upstream https://github.com/example/project.git --ref main
ldb-rag answer upstream "Summarize the entrypoints."
```
### 4. URL/web ingestion for lightweight external docs

Use this when you need a few pages from a documentation site or internal HTTP source.

Example:

```sh
ldb-rag collection create web-notes
ldb-rag ingest url web-notes https://example.com/docs/page-a https://example.com/docs/page-b
ldb-rag search web-notes "Which page explains authentication?"
```
### 5. Session-backed research or debugging workflow

Use this when you want a persistent conversation session with timestamps and stored messages.

Example:

```sh
ldb-rag answer my-repo "Explain the retrieval pipeline." --session-title "retrieval-review"
ldb-rag session list
ldb-rag session show <session-id>
```
### 6. MCP integration for agent clients

Use this when you want tools such as Codex-style harnesses or other MCP clients to query the same
collections through structured tool calls rather than shelling out manually.

Example:

```sh
ldb-rag mcp serve
```
## Integration Model

The simplest way to integrate the framework is to choose one of these paths:
### CLI-first integration

Best for shell scripts, dev workflows, and manual use.

Basic lifecycle:

```sh
ldb-rag init
ldb-rag collection create sample
ldb-rag ingest path sample ./docs
ldb-rag search sample "What does this project do?"
ldb-rag answer sample "Summarize the architecture and cite sources."
```
### SDK-first integration

Best for Python applications or custom orchestration.

Conceptually:

```python
from ldb_rag.application import build_app
from ldb_rag.domain.models import RetrievalMode, SearchQuery

app = build_app("config/local.toml")
app.collections.create("sample")
app.ingestion.ingest_path("sample", "./docs")
result = app.answers.answer(
    SearchQuery(collection="sample", text="Summarize the docs", mode=RetrievalMode.HYBRID)
)
print(result.answer)
```
### MCP-first integration

Best for agent systems that already speak MCP.

Run:

```sh
ldb-rag mcp serve
```

Then connect your MCP-capable client to the ldb-rag server and use tools such as:

- `collections_list`
- `collection_create`
- `ingest_path`
- `ingest_git`
- `ingest_urls`
- `search`
- `answer`
- `session_create`
- `session_show`
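Under the hood, MCP tool invocations are JSON-RPC 2.0 `tools/call` requests. As a rough sketch of what an agent client sends for the `search` tool (the argument names `collection`, `query`, and `top_k` are assumptions inferred from the CLI surface, not a confirmed tool schema):

```python
import json

# Hypothetical MCP "tools/call" request for the ldb-rag `search` tool.
# The argument names below are assumptions based on the CLI, not a
# documented schema; consult the server's tool definitions for the
# authoritative parameter list.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search",
        "arguments": {
            "collection": "my-repo",
            "query": "Where is the config loader?",
            "top_k": 5,
        },
    },
}
print(json.dumps(request, indent=2))
```

An MCP client library normally builds this envelope for you; the point is that each tool listed above maps to one structured call rather than a shelled-out CLI invocation.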
### ZeroClaw + Obsidian integration

Best for using an Obsidian vault as long-term knowledge inside ZeroClaw.

Recommended flow:

```sh
ldb-rag collection create obsidian-memory --config config/obsidian.toml
ldb-rag ingest sync obsidian-memory /path/to/ObsidianVault --config config/obsidian.toml
ldb-rag mcp serve --config config/obsidian.toml
```

Then configure ZeroClaw to connect to the ldb-rag MCP server over stdio. Keep ZeroClaw's own
sqlite memory backend enabled for conversational autosave, and use ldb-rag for retrieval over
your notes and documentation corpus.

See:

- `docs/zeroclaw-integration.md`
- `examples/zeroclaw/ldb-rag.obsidian.toml`
- `examples/zeroclaw/zeroclaw-mcp.toml`
- `scripts/sync_obsidian_vault.sh`
## Command Reference

### Core setup

```sh
ldb-rag init
ldb-rag config validate --json
```
### Collection management

```sh
ldb-rag collection create <name>
ldb-rag collection list
ldb-rag collection stats <name>
ldb-rag collection delete <name>
```
### Ingestion

```sh
ldb-rag ingest path <collection> <path>
ldb-rag ingest sync <collection> <path>
ldb-rag ingest git <collection> <repo> --ref <ref> --subdir <subdir>
ldb-rag ingest url <collection> <url1> [url2 ...]
ldb-rag sync <collection> <path>
```
### Retrieval and answer generation

```sh
ldb-rag search <collection> "<query>"
ldb-rag answer <collection> "<query>"
ldb-rag answer <collection> "<query>" --session-title "<title>"
ldb-rag answer <collection> "<query>" --session-id "<session-id>"
```

Useful flags:

- `--json`
- `--config <path>`
- `--mode dense|hybrid|docs_only|code_only|mixed`
- `--top-k <n>`
- `--language <language>`
### Sources and sessions

```sh
ldb-rag source show <collection> --json
ldb-rag session create <collection> "<title>"
ldb-rag session list
ldb-rag session show <session-id>
ldb-rag session delete <session-id>
```
### MCP

```sh
ldb-rag mcp serve
```
## Providers

### Embeddings

- `dummy`
- `ollama`
- `openai`
- `openai_compatible`

### Generation

- `dummy`
- `ollama`
- `openai`
- `openai_compatible`
- `anthropic`
- `gemini`

### Reranking

- `simple`
- `noop`
### Live-network test targets

- Ollama
- OpenAI
- Anthropic
- Gemini
- Qwen through an OpenAI-compatible endpoint

Run the opt-in live tests with:

```sh
scripts/run_live_provider_tests.sh
```
## Connectors

Implemented:

- `filesystem`
- `git`
- `web`

### Connector usage examples

Filesystem:

```sh
ldb-rag ingest path my-kb ./project
```

Git:

```sh
ldb-rag ingest git my-kb https://github.com/example/project.git --ref main
```

Web:

```sh
ldb-rag ingest url my-kb https://example.com/docs/intro
```
## Configuration

Configuration is layered in this order:

1. `config/default.toml`
2. `~/.config/ldb-rag/config.toml`
3. `config/local.toml`
4. `--config <path>`
5. selected environment variables

Important sections: `storage`, `embeddings`, `generation`, `reranker`, `retrieval`, `chunking`,
`connectors`, `git`, `web`, `sessions`.
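Layered configuration usually means that later layers override earlier ones key by key while leaving unrelated keys intact. A minimal sketch of that merge behavior (a hypothetical helper assuming deep per-key overrides, not the project's actual loader):

```python
# Sketch of layered-config merging: later layers win per key, and nested
# tables merge recursively instead of replacing wholesale. This helper is
# hypothetical and only illustrates the layering concept.

def merge_layers(*layers):
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_layers(merged[key], value)
            else:
                merged[key] = value
    return merged

default = {"embeddings": {"provider_type": "dummy", "model": "hash-32"}}
local = {"embeddings": {"provider_type": "ollama"}}
print(merge_layers(default, local))
# the local layer overrides provider_type while the default model survives
```

Whether ldb-rag merges nested tables or replaces whole sections is a detail to verify against its docs; the sketch shows the common deep-merge convention.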
Start with a local config like:

```toml
[storage]
db_path = "data/lancedb"

[sessions]
db_path = "data/sessions.sqlite3"

[embeddings]
provider_type = "dummy"
model = "hash-32"

[generation]
provider_type = "dummy"
model = "grounded-summary"

[reranker]
enabled = true
provider_type = "simple"
model = "keyword-boost"
```
## Documentation Architecture
This repo should not rely on a single giant README forever. The documentation should be generated and maintained as a tiered Markdown set.
### Tier 0: entrypoint documents

These are the documents most users should read first.

- `README.md`: project overview, installation, common workflows, documentation map
- `docs/cli.md`: command reference and examples
- `docs/configuration.md`: config model, provider setup, environment variables
### Tier 1: operator and integration guides

These are the next documents a serious user should read.

- `docs/connectors.md`: connector behavior, supported sources, caveats, sync expectations
- `docs/providers.md`: provider support matrix, auth expectations, live-test environment variables
- `docs/mcp.md`: MCP tool inventory, integration patterns, client expectations
- `docs/storage-abstraction.md`: backend contract boundaries and swap strategy
### Tier 2: internal engineering references

These are meant for maintainers and contributors.

- `docs/architecture.md`: high-level system layout and data flow
- `docs/roadmap.md`: future work and implementation direction
- `docs/testing.md`: local test matrix, live-provider tests, fixtures, CI expectations
- `docs/sessions.md`: persisted session schema, timestamps, retention, transcript behavior
- `docs/retrieval.md`: ranking, hybrid search, reranking, context assembly
### Tier 3: generated reference documents

These should eventually be produced from source-of-truth structures rather than maintained by hand.

- `docs/reference/cli-commands.md`: generated from the Typer command tree; contains commands, options, arguments, examples
- `docs/reference/mcp-tools.md`: generated from MCP tool definitions; contains tool names, parameters, response behavior
- `docs/reference/config-schema.md`: generated from Pydantic config models; contains fields, defaults, types, environment overrides
- `docs/reference/provider-matrix.md`: generated from the provider adapter registry; contains provider names, endpoints, auth model, capabilities
## How Documentation Should Be Generated

Documentation generation should follow these rules:
### Source of truth
- command docs come from the CLI command tree
- MCP docs come from MCP tool definitions
- config docs come from Pydantic config models and default TOML
- provider docs come from provider factory mappings and adapter implementations
- connector docs come from connector classes and tests
### Output structure

Generated documents should live under `docs/reference/`. Handwritten narrative documents should
stay under `docs/`. The top-level `README.md` should remain curated and human-written.
### Generation workflow

- inspect source code for CLI commands, MCP tools, model schemas, and provider registries
- transform those into structured intermediate data
- render Markdown templates into `docs/reference/*.md`
- link those generated docs from `README.md` and narrative docs
- validate links and examples as part of CI
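The "structured intermediate data, then render" step can be sketched with a tiny renderer. The metadata schema below is hypothetical; a real generator would populate it by introspecting the Typer command tree:

```python
# Minimal Markdown renderer over structured command metadata -- a sketch
# of the extract/structure/render workflow. The dict schema is an
# assumption for illustration, not the project's actual intermediate format.

def render_cli_reference(commands):
    lines = ["# CLI Commands", ""]
    for cmd in commands:
        lines.append(f"## `{cmd['name']}`")
        lines.append("")
        lines.append(cmd["help"])
        lines.append("")
        for opt in cmd.get("options", []):
            lines.append(f"- `{opt['flag']}`: {opt['help']}")
        lines.append("")
    return "\n".join(lines)

commands = [
    {
        "name": "ldb-rag search",
        "help": "Run retrieval against a collection.",
        "options": [{"flag": "--top-k <n>", "help": "Number of results to return."}],
    }
]
print(render_cli_reference(commands))
```

Keeping the intermediate data as plain dicts (or dataclasses) makes the same extraction reusable for the MCP, config, and provider reference documents.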
### What each generated document should contain
#### `docs/reference/cli-commands.md`
- command hierarchy
- purpose of each command
- arguments and options
- JSON-capable flags
- short examples
#### `docs/reference/mcp-tools.md`
- tool name
- tool description
- accepted parameters
- output shape
- example agent use
#### `docs/reference/config-schema.md`
- config section name
- field name
- type
- default
- environment override if any
- operational notes
#### `docs/reference/provider-matrix.md`
- provider type
- embedding/generation/reranker capability
- endpoint shape
- auth mechanism
- live test coverage
#### `docs/reference/connectors.md`
- connector name
- accepted inputs
- metadata produced
- sync behavior
- known limits
## Documentation Generation TODO

This is the execution plan for producing the fuller doc set.

- Create `docs/reference/` and reserve it for generated Markdown only.
- Add `docs/testing.md`, `docs/sessions.md`, and `docs/retrieval.md` as curated narrative docs.
- Implement a small documentation generator under `scripts/` or `tools/`.
- Extract Typer command metadata into a CLI reference renderer.
- Extract MCP tool metadata into an MCP reference renderer.
- Extract Pydantic config schema into a config reference renderer.
- Extract provider/connector registry data into support matrix docs.
- Add a link checker and example-command validator to the test workflow.
- Add a `docs build` helper script to regenerate reference docs reproducibly.
- Add release-time checks that generated docs are up to date.
## Man Page

This repo now includes an installable man page at `man/man1/ldb-rag.1`.

Install it locally for one user with:

```sh
scripts/install_manpage.sh
```

Or manually:

```sh
install -Dm644 man/man1/ldb-rag.1 ~/.local/share/man/man1/ldb-rag.1
mandb ~/.local/share/man
man ldb-rag
```

If your system uses gzip-compressed man pages, compress after installation:

```sh
gzip -f ~/.local/share/man/man1/ldb-rag.1
mandb ~/.local/share/man
```