Ollama & RAG

MDU uses Ollama for local AI inference, powering embeddings for the RAG (Retrieval-Augmented Generation) pipeline and chat capabilities.

Ollama Setup

  • Binary: /usr/local/bin/ollama
  • Service: ollama.service (systemd)
  • Port: 127.0.0.1:11434
  • Memory limit: 6GB (MemoryMax=6G)
  • Max loaded models: 1

Models

| Model | Size | Purpose |
|---|---|---|
| nomic-embed-text | 274MB | 768-dim embeddings for RAG |
| mistral:7b-instruct-q4_K_M | 4.4GB | Chat/reasoning (RAG answers) |
| phi3:mini | 2.2GB | Lightweight alternative |

Client Module

/opt/mdu-api/ollama.cjs — CommonJS module with graceful degradation:

```javascript
const { embed, chat, healthCheck } = require('./ollama.cjs');

// Generate embeddings
const vector = await embed("some text to embed");
// Returns: Float64Array(768)

// Chat completion
const answer = await chat("What is an FDM printer?");
// Returns: string

// Health check
const ok = await healthCheck();
// Returns: boolean
```

All functions return null or false on error; they never throw.
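The graceful-degradation pattern might look like the following sketch. This is not the real contents of ollama.cjs (which this doc does not show); it assumes Node 18+ global fetch and uses Ollama's standard /api/embeddings, /api/chat, and /api/tags endpoints:

```javascript
// Hypothetical sketch of an ollama.cjs-style wrapper with graceful
// degradation: every failure path returns null/false instead of throwing.
const BASE = 'http://127.0.0.1:11434';

async function embed(text) {
  try {
    const res = await fetch(`${BASE}/api/embeddings`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: 'nomic-embed-text', prompt: text }),
    });
    if (!res.ok) return null;
    const { embedding } = await res.json();
    return Float64Array.from(embedding); // 768-dim vector
  } catch {
    return null; // server down, timeout, etc.
  }
}

async function chat(prompt) {
  try {
    const res = await fetch(`${BASE}/api/chat`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'mistral:7b-instruct-q4_K_M',
        messages: [{ role: 'user', content: prompt }],
        stream: false, // Ollama streams by default
      }),
    });
    if (!res.ok) return null;
    const data = await res.json();
    return data.message.content;
  } catch {
    return null;
  }
}

async function healthCheck() {
  try {
    const res = await fetch(`${BASE}/api/tags`); // lists installed models
    return res.ok;
  } catch {
    return false;
  }
}

if (typeof module !== 'undefined') {
  module.exports = { embed, chat, healthCheck };
}
```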

RAG Pipeline

Architecture

```
Document → Chunker → Embeddings (nomic-embed-text)
                         |
                         v
                   pgvector (rag_documents table)
                         |
                  Query embedding
                         |
                         v
                   Cosine similarity search
                         |
                         v
                   Top-K chunks → Mistral 7B → Answer
```

Components

  1. Chunker (/opt/mdu-api/chunker.cjs): Splits text into chunks with token estimation
  2. Embeddings: 768-dimensional vectors via nomic-embed-text
  3. Storage: rag_documents table with HNSW index on vector(768)
  4. Search: Cosine similarity via pgvector <=> operator
  5. Generation: Mistral 7B generates answers from retrieved context
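Two of the computational steps above, token-estimated chunking and cosine similarity, can be sketched in plain JavaScript. This is a simplified illustration; the real chunker.cjs logic and pgvector's HNSW search are not reproduced here:

```javascript
// Rough token estimate (~4 characters per token) -- a common heuristic;
// the actual estimator in chunker.cjs may differ.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Greedy sentence-based chunking up to a token budget.
function chunk(text, maxTokens = 256) {
  const chunks = [];
  let current = '';
  for (const sentence of text.split(/(?<=[.!?])\s+/)) {
    if (current && estimateTokens(current + ' ' + sentence) > maxTokens) {
      chunks.push(current);
      current = sentence;
    } else {
      current = current ? current + ' ' + sentence : sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Cosine similarity -- the quantity behind pgvector's <=> operator,
// which returns cosine *distance* (distance = 1 - similarity).
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

Identical vectors score ~1.0, orthogonal vectors 0.0; the query endpoint's `similarity` field (e.g. 0.89 in the example below) is this value for the query embedding against each stored chunk.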

Database Schema

```sql
CREATE TABLE rag_documents (
  id UUID PRIMARY KEY,
  content TEXT NOT NULL,
  metadata JSONB,
  embedding vector(768),
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON rag_documents
  USING hnsw (embedding vector_cosine_ops);
```
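A top-K search against this table might look like the following (illustrative: `$1` stands for the query embedding parameter, and the LIMIT value is an assumption):

```sql
-- <=> is pgvector's cosine distance operator; it matches the
-- vector_cosine_ops HNSW index above. similarity = 1 - distance.
SELECT content,
       metadata,
       1 - (embedding <=> $1) AS similarity
FROM rag_documents
ORDER BY embedding <=> $1
LIMIT 5;
```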

Admin API

All RAG endpoints require admin authentication:

| Method | Path | Description |
|---|---|---|
| POST | /api/admin/rag/ingest | Chunk + embed + store document |
| POST | /api/admin/rag/query | Semantic search + LLM answer |
| GET | /api/admin/rag/documents | List stored documents |
| GET | /api/admin/rag/metrics | RAG usage metrics |
| DELETE | /api/admin/rag/documents/:id | Delete document |

Ingest

```http
POST /api/admin/rag/ingest
Authorization: Bearer <admin-token>

{
  "content": "Long document text...",
  "metadata": { "source": "runbook", "topic": "deployment" }
}
```

The document is split into chunks; each chunk is embedded via Ollama and stored in pgvector.

Query

```http
POST /api/admin/rag/query
Authorization: Bearer <admin-token>

{ "query": "How do I restart the STL pipeline?" }
```

Response:

```json
{
  "answer": "To restart the STL pipeline, run: docker restart mdu-stl-pipeline",
  "sources": [
    { "content": "...", "metadata": { "source": "runbook" }, "similarity": 0.89 }
  ]
}
```

pgvector

PostgreSQL image: pgvector/pgvector:pg16 (replaces standard postgres:16-alpine).

Extension version: 0.8.2. Enables vector similarity search with HNSW indexing for fast approximate nearest neighbor queries.
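The extension must be enabled in the database before the vector(768) column or HNSW index can be created (standard pgvector setup):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Verify the installed version
SELECT extversion FROM pg_extension WHERE extname = 'vector';
```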