# Ollama & RAG

MDU uses Ollama for local AI inference, powering embeddings for the RAG (Retrieval-Augmented Generation) pipeline and chat capabilities.
## Ollama Setup

- Binary: `/usr/local/bin/ollama`
- Service: `ollama.service` (systemd)
- Port: `127.0.0.1:11434`
- Memory limit: 6GB (`MemoryMax=6G`)
- Max loaded models: 1
## Models

| Model | Size | Purpose |
|---|---|---|
| `nomic-embed-text` | 274MB | 768-dim embeddings for RAG |
| `mistral:7b-instruct-q4_K_M` | 4.4GB | Chat/reasoning (RAG answers) |
| `phi3:mini` | 2.2GB | Lightweight alternative |
## Client Module

`/opt/mdu-api/ollama.cjs` is a CommonJS module with graceful degradation:

```js
const { embed, chat, healthCheck } = require('./ollama.cjs');

// Generate embeddings
const vector = await embed("some text to embed");
// Returns: Float64Array(768)

// Chat completion
const answer = await chat("What is an FDM printer?");
// Returns: string

// Health check
const ok = await healthCheck();
// Returns: boolean
```

All functions return `null`/`false` on error; they never throw.
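As a sketch of that graceful-degradation pattern, assuming Node 18+ (built-in `fetch`) and Ollama's standard `/api/embeddings` endpoint — the real `ollama.cjs` may be implemented differently:

```javascript
// Minimal graceful-degradation wrapper around Ollama's embeddings endpoint.
// Illustrative sketch only; the production module is /opt/mdu-api/ollama.cjs.
const OLLAMA = "http://127.0.0.1:11434";

async function embed(text) {
  try {
    const res = await fetch(`${OLLAMA}/api/embeddings`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
    });
    if (!res.ok) return null;              // degrade, don't throw
    const { embedding } = await res.json();
    return Float64Array.from(embedding);   // 768-dim vector
  } catch {
    return null;                           // Ollama unreachable → null
  }
}
```

Callers can then treat a `null` result as "embeddings unavailable" and fall back (e.g. keyword search) instead of wrapping every call in try/catch.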
## RAG Pipeline

### Architecture

```
Document → Chunker → Embeddings (nomic-embed-text)
                            |
                            v
            pgvector (rag_documents table)
                            |
                    Query embedding
                            |
                            v
               Cosine similarity search
                            |
                            v
          Top-K chunks → Mistral 7B → Answer
```
### Components

- Chunker (`/opt/mdu-api/chunker.cjs`): splits text into chunks with token estimation
- Embeddings: 768-dimensional vectors via `nomic-embed-text`
- Storage: `rag_documents` table with an HNSW index on `vector(768)`
- Search: cosine similarity via the pgvector `<=>` operator
- Generation: Mistral 7B generates answers from retrieved context
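The chunker's token estimation might follow the common heuristic of roughly 4 characters per token; a naive sketch (the actual `chunker.cjs` implementation may differ):

```javascript
// Naive fixed-size chunker with overlap, using the rough ~4 chars/token
// heuristic. Illustrative sketch only; chunker.cjs may differ.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

function chunkText(text, maxTokens = 256, overlapTokens = 32) {
  const maxChars = maxTokens * 4;
  const step = (maxTokens - overlapTokens) * 4; // stride between chunk starts
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
  }
  return chunks;
}
```

Overlapping chunks keep sentences that straddle a boundary retrievable from at least one chunk, at the cost of some duplicate storage.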
### Database Schema

```sql
CREATE TABLE rag_documents (
  id UUID PRIMARY KEY,
  content TEXT NOT NULL,
  metadata JSONB,
  embedding vector(768),
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON rag_documents
  USING hnsw (embedding vector_cosine_ops);
```
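The cosine search against this table can be issued as parameterized SQL; here is a sketch of building the top-K query for a node-postgres client (the helper name and defaults are illustrative, not the real module API):

```javascript
// Build a parameterized top-K similarity query for rag_documents.
// `<=>` is pgvector's cosine-distance operator, so similarity = 1 - distance.
function buildTopKQuery(embedding, k = 5) {
  // pgvector accepts a vector literal of the form '[0.1,0.2,...]'
  const vec = `[${Array.from(embedding).join(",")}]`;
  return {
    text:
      "SELECT id, content, metadata, " +
      "1 - (embedding <=> $1::vector) AS similarity " +
      "FROM rag_documents " +
      "ORDER BY embedding <=> $1::vector " +
      "LIMIT $2",
    values: [vec, k],
  };
}
```

Ordering by `embedding <=> $1::vector` lets the HNSW index drive the scan, so the query stays fast even as the table grows.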
## Admin API

All RAG endpoints require admin authentication:

| Method | Path | Description |
|---|---|---|
| POST | `/api/admin/rag/ingest` | Chunk + embed + store a document |
| POST | `/api/admin/rag/query` | Semantic search + LLM answer |
| GET | `/api/admin/rag/documents` | List stored documents |
| GET | `/api/admin/rag/metrics` | RAG usage metrics |
| DELETE | `/api/admin/rag/documents/:id` | Delete a document |
### Ingest

```http
POST /api/admin/rag/ingest
Authorization: Bearer <admin-token>

{
  "content": "Long document text...",
  "metadata": { "source": "runbook", "topic": "deployment" }
}
```

The document is split into chunks; each chunk is embedded via Ollama and stored in pgvector.
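That chunk → embed → store flow can be sketched with injected helpers (`chunk`, `embed`, and `db` here are assumptions for illustration, not the real module API):

```javascript
// Sketch of the ingest flow: chunk the document, embed each chunk,
// insert into rag_documents. Helpers are injected for illustration.
async function ingest(content, metadata, { chunk, embed, db }) {
  const stored = [];
  for (const piece of chunk(content)) {
    const vector = await embed(piece);   // 768-dim via nomic-embed-text
    if (!vector) continue;               // graceful degradation: skip on failure
    const vec = `[${Array.from(vector).join(",")}]`;
    const { rows } = await db.query(
      "INSERT INTO rag_documents (id, content, metadata, embedding) " +
      "VALUES (gen_random_uuid(), $1, $2, $3::vector) RETURNING id",
      [piece, metadata, vec]
    );
    stored.push(rows[0].id);
  }
  return stored;
}
```

Skipping chunks whose embedding fails (rather than aborting the whole ingest) matches the client module's return-`null`-on-error contract.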
### Query

Request:

```http
POST /api/admin/rag/query
Authorization: Bearer <admin-token>

{ "query": "How do I restart the STL pipeline?" }
```

Response:

```json
{
  "answer": "To restart the STL pipeline, run: docker restart mdu-stl-pipeline",
  "sources": [
    { "content": "...", "metadata": { "source": "runbook" }, "similarity": 0.89 }
  ]
}
```
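Calling the query endpoint from Node 18+ might look like the following (the base URL and token are placeholders for the real deployment values):

```javascript
// Sketch: call the RAG query endpoint with built-in fetch (Node 18+).
// Base URL and admin token are placeholders, not documented values.
async function ragQuery(query, adminToken, base) {
  const res = await fetch(`${base}/api/admin/rag/query`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${adminToken}`,
    },
    body: JSON.stringify({ query }),
  });
  if (!res.ok) throw new Error(`RAG query failed: HTTP ${res.status}`);
  return res.json(); // { answer, sources }
}
```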
## pgvector

PostgreSQL image: `pgvector/pgvector:pg16` (replaces the standard `postgres:16-alpine`). Extension version: 0.8.2. pgvector enables vector similarity search with HNSW indexing for fast approximate nearest-neighbor queries.