Ollama & RAG

MDU uses Ollama for local AI inference, powering embeddings for the RAG (Retrieval-Augmented Generation) pipeline and chat capabilities.

Ollama Setup

  • Binary: /usr/local/bin/ollama
  • Service: ollama.service (systemd)
  • Port: 127.0.0.1:11434
  • Memory limit: 6GB (MemoryMax=6G)
  • Max loaded models: 1

Models

| Model | Size | Purpose |
|---|---|---|
| nomic-embed-text | 274MB | 768-dim embeddings for RAG |
| mistral:7b-instruct-q4_K_M | 4.4GB | Chat/reasoning (RAG answers) |
| phi3:mini | 2.2GB | Lightweight alternative |

Client Module

/opt/mdu-api/ollama.cjs — CommonJS module with graceful degradation:

```javascript
const { embed, chat, healthCheck } = require('./ollama.cjs');

// Generate embeddings
const vector = await embed("some text to embed");
// Returns: Float64Array(768)

// Chat completion
const answer = await chat("What is an FDM printer?");
// Returns: string

// Health check
const ok = await healthCheck();
// Returns: boolean
```

All functions return null or false on error; they never throw.
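The graceful-degradation pattern might look like the following sketch. This is not the real contents of ollama.cjs (which this doc does not show); it assumes Node 18+ global fetch and uses Ollama's standard /api/embeddings, /api/chat, and /api/tags endpoints:

```javascript
// Hypothetical sketch of an ollama.cjs-style wrapper with graceful
// degradation: every failure path returns null/false instead of throwing.
const BASE = 'http://127.0.0.1:11434';

async function embed(text) {
  try {
    const res = await fetch(`${BASE}/api/embeddings`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: 'nomic-embed-text', prompt: text }),
    });
    if (!res.ok) return null;
    const { embedding } = await res.json();
    return Float64Array.from(embedding); // 768-dim vector
  } catch {
    return null; // server down, timeout, etc.
  }
}

async function chat(prompt) {
  try {
    const res = await fetch(`${BASE}/api/chat`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'mistral:7b-instruct-q4_K_M',
        messages: [{ role: 'user', content: prompt }],
        stream: false, // Ollama streams by default
      }),
    });
    if (!res.ok) return null;
    const data = await res.json();
    return data.message.content;
  } catch {
    return null;
  }
}

async function healthCheck() {
  try {
    const res = await fetch(`${BASE}/api/tags`); // lists installed models
    return res.ok;
  } catch {
    return false;
  }
}

if (typeof module !== 'undefined') {
  module.exports = { embed, chat, healthCheck };
}
```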

RAG Pipeline

Architecture

```
Document → Chunker → Embeddings (nomic-embed-text)
                         |
                         v
                   pgvector (rag_documents table)
                         |
                  Query embedding
                         |
                         v
                   Cosine similarity search
                         |
                         v
                   Top-K chunks → Mistral 7B → Answer
```

Components

  1. Chunker (/opt/mdu-api/chunker.cjs): Splits text into chunks with token estimation
  2. Embeddings: 768-dimensional vectors via nomic-embed-text
  3. Storage: rag_documents table with HNSW index on vector(768)
  4. Search: Cosine similarity via pgvector <=> operator
  5. Generation: Mistral 7B generates answers from retrieved context
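Two of the computational steps above, token-estimated chunking and cosine similarity, can be sketched in plain JavaScript. This is a simplified illustration; the real chunker.cjs logic and pgvector's HNSW search are not reproduced here:

```javascript
// Rough token estimate (~4 characters per token) -- a common heuristic;
// the actual estimator in chunker.cjs may differ.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Greedy sentence-based chunking up to a token budget.
function chunk(text, maxTokens = 256) {
  const chunks = [];
  let current = '';
  for (const sentence of text.split(/(?<=[.!?])\s+/)) {
    if (current && estimateTokens(current + ' ' + sentence) > maxTokens) {
      chunks.push(current);
      current = sentence;
    } else {
      current = current ? current + ' ' + sentence : sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Cosine similarity -- the quantity behind pgvector's <=> operator,
// which returns cosine *distance* (distance = 1 - similarity).
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

Identical vectors score ~1.0, orthogonal vectors 0.0; the query endpoint's `similarity` field (e.g. 0.89 in the example below) is this value for the query embedding against each stored chunk.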

Database Schema

```sql
CREATE TABLE rag_documents (
  id UUID PRIMARY KEY,
  content TEXT NOT NULL,
  metadata JSONB,
  embedding vector(768),
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON rag_documents
  USING hnsw (embedding vector_cosine_ops);
```
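A top-K search against this table might look like the following (illustrative: `$1` stands for the query embedding parameter, and the LIMIT value is an assumption):

```sql
-- <=> is pgvector's cosine distance operator; it matches the
-- vector_cosine_ops HNSW index above. similarity = 1 - distance.
SELECT content,
       metadata,
       1 - (embedding <=> $1) AS similarity
FROM rag_documents
ORDER BY embedding <=> $1
LIMIT 5;
```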

Admin API

All RAG endpoints require admin authentication:

| Method | Path | Description |
|---|---|---|
| POST | /api/admin/rag/ingest | Chunk + embed + store document |
| POST | /api/admin/rag/query | Semantic search + LLM answer |
| GET | /api/admin/rag/documents | List stored documents |
| GET | /api/admin/rag/metrics | RAG usage metrics |
| DELETE | /api/admin/rag/documents/:id | Delete document |

Ingest

```http
POST /api/admin/rag/ingest
Authorization: Bearer <admin-token>

{
  "content": "Long document text...",
  "metadata": { "source": "runbook", "topic": "deployment" }
}
```

The document is split into chunks; each chunk is embedded via Ollama and stored in pgvector.

Query

```http
POST /api/admin/rag/query
Authorization: Bearer <admin-token>

{ "query": "How do I restart the STL pipeline?" }
```

Response:

```json
{
  "answer": "To restart the STL pipeline, run: docker restart mdu-stl-pipeline",
  "sources": [
    { "content": "...", "metadata": { "source": "runbook" }, "similarity": 0.89 }
  ]
}
```

pgvector

PostgreSQL image: pgvector/pgvector:pg16 (replaces standard postgres:16-alpine).

Extension version: 0.8.2. Enables vector similarity search with HNSW indexing for fast approximate nearest neighbor queries.
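The extension must be enabled in the database before the vector(768) column or HNSW index can be created (standard pgvector setup):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Verify the installed version
SELECT extversion FROM pg_extension WHERE extname = 'vector';
```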