Ollama¶
Ollama is a local LLM runtime that lets you run a wide range of large language models on your own hardware. This guide covers the setup, configuration, and usage of Ollama in the Local AI Cyber Lab environment.
Overview¶
Ollama provides:
- Local model execution
- Model management
- API access
- GPU acceleration
- Custom model support
Architecture¶
graph TB
    subgraph Client_Layer["Client Layer"]
        webui["Open WebUI"]
        api["API Clients"]
        cli["CLI Tools"]
    end
    subgraph Ollama["Ollama Service"]
        api_server["API Server"]
        model_manager["Model Manager"]
        inference["Inference Engine"]
        subgraph Models["Model Storage"]
            quantized["Quantized Models"]
            full["Full Models"]
        end
    end
    subgraph Hardware["Hardware"]
        gpu["GPU"]
        cpu["CPU"]
        memory["Memory"]
    end
    Client_Layer --> Ollama
    api_server --> model_manager
    model_manager --> Models
    inference --> Models
    inference --> Hardware
Installation¶
Ollama is automatically installed as part of the Local AI Cyber Lab environment. However, you can customize its configuration:
# Pull the latest Ollama image
docker-compose pull ollama
# Start Ollama service
docker-compose up -d ollama
Configuration¶
Environment Variables¶
# .env file
OLLAMA_HOST=0.0.0.0
OLLAMA_PORT=11434
OLLAMA_MODELS=llama2,codellama,mistral
OLLAMA_MEMORY_LIMIT=8g
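Once the service is running, you can confirm that the API answers on the configured port. A minimal sketch, assuming the default port above and Ollama's standard /api/version endpoint:
import requests
# Assumes the service is published on the port configured via OLLAMA_PORT (11434 by default)
response = requests.get("http://localhost:11434/api/version", timeout=5)
response.raise_for_status()
print("Ollama is reachable, version:", response.json()["version"])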
Hardware Requirements¶
- Minimum:
  - 8GB RAM
  - 4 CPU cores
  - 20GB storage
- Recommended:
  - 16GB+ RAM
  - 8+ CPU cores
  - NVIDIA GPU (8GB+ VRAM)
  - 100GB+ SSD storage
Model Management¶
Available Models¶
- Default Models:
  - llama2
  - codellama
  - mistral
  - neural-chat
- Specialized Models:
  - llama2-uncensored
  - codellama-python
  - mistral-openorca
  - stable-beluga
Model Commands¶
# List available models
docker-compose exec ollama ollama list
# Pull a model
docker-compose exec ollama ollama pull llama2
# Remove a model
docker-compose exec ollama ollama rm llama2
# Get model information
docker-compose exec ollama ollama show llama2
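The same operations are also exposed over Ollama's REST API. A minimal sketch, assuming the standard /api/tags, /api/pull, /api/delete, and /api/show endpoints on the default port:
import requests
BASE = "http://localhost:11434"
# List installed models (equivalent to `ollama list`)
models = requests.get(f"{BASE}/api/tags").json()["models"]
print([m["name"] for m in models])
# Pull a model (equivalent to `ollama pull llama2`); blocks until the download completes
requests.post(f"{BASE}/api/pull", json={"name": "llama2", "stream": False})
# Remove a model (equivalent to `ollama rm llama2`)
requests.delete(f"{BASE}/api/delete", json={"name": "llama2"})
# Show model details (equivalent to `ollama show llama2`)
print(requests.post(f"{BASE}/api/show", json={"name": "llama2"}).json())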
Usage¶
API Endpoints¶
- Chat Completion: POST /api/chat (or POST /api/generate for single-prompt completions); see the Python example below.
- Model Management: GET /api/tags, POST /api/pull, POST /api/show, and DELETE /api/delete; see the REST sketch under Model Commands above.
Python Integration¶
import requests

def chat_with_model(prompt, model="llama2"):
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "user", "content": prompt}
            ],
            "stream": False  # return a single JSON object instead of a token stream
        }
    )
    response.raise_for_status()
    return response.json()

# Example usage
result = chat_with_model("Explain quantum computing")
print(result['message']['content'])
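For long generations you may prefer the API's default streaming mode, which returns one JSON object per line. A minimal sketch that prints tokens as they arrive, assuming the same /api/chat endpoint:
import json
import requests

def stream_chat(prompt, model="llama2"):
    # With "stream": True (the API default), Ollama sends newline-delimited JSON chunks
    with requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": [{"role": "user", "content": prompt}], "stream": True},
        stream=True,
    ) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("message", {}).get("content", ""), end="", flush=True)
            if chunk.get("done"):
                break

stream_chat("Explain quantum computing in one paragraph")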
Performance Optimization¶
Model Quantization¶
# Pull quantized model
docker-compose exec ollama ollama pull llama2:7b-q4
# Compare performance
docker-compose exec ollama ollama run llama2:7b-q4 "Your test prompt"
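To compare a quantized tag against the full model programmatically, you can read the timing fields that non-streaming /api/generate responses include (eval_count and eval_duration, the latter in nanoseconds). A rough sketch, assuming both tags are already pulled:
import requests

def tokens_per_second(model, prompt="Your test prompt"):
    # Non-streaming /api/generate responses include eval_count and eval_duration (ns)
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    ).json()
    return r["eval_count"] / (r["eval_duration"] / 1e9)

for tag in ["llama2", "llama2:7b-q4"]:
    print(tag, round(tokens_per_second(tag), 1), "tokens/sec")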
Resource Management¶
- GPU Settings: make the NVIDIA GPU visible to the Ollama container and keep an eye on VRAM headroom (see the sketch below).
- Memory Management: cap container memory (OLLAMA_MEMORY_LIMIT above) and avoid loading models larger than the available RAM.
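A quick way to watch GPU headroom while models are loaded is to query nvidia-smi from the host. A minimal sketch, assuming the NVIDIA driver and nvidia-smi are installed:
import subprocess

# Query utilization and memory in an easy-to-parse CSV format
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
for i, line in enumerate(out.strip().splitlines()):
    util, used, total = [v.strip() for v in line.split(",")]
    print(f"GPU {i}: {util}% utilization, {used}/{total} MiB VRAM")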
Security Considerations¶
Access Control¶
- Network Security: keep port 11434 on the internal Docker network (or bound to localhost) rather than exposing it publicly.
- API Authentication: Ollama ships without built-in authentication, so place a reverse proxy or gateway that enforces credentials in front of the API (see the sketch below).
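One common pattern is to route requests through an authenticating reverse proxy and attach a token to every call. A minimal client-side sketch; the proxy URL and token variable are placeholders, not part of the lab's stack:
import os
import requests

# Hypothetical reverse-proxy endpoint that forwards to Ollama after validating the token
PROXY_URL = "https://ollama-gateway.internal.example/api/chat"
TOKEN = os.environ["OLLAMA_GATEWAY_TOKEN"]  # placeholder credential

response = requests.post(
    PROXY_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"model": "llama2",
          "messages": [{"role": "user", "content": "ping"}],
          "stream": False},
    timeout=30,
)
print(response.status_code)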
Model Security¶
- Model Verification: pull models only from trusted sources and review their Modelfiles before running them.
- Safe Usage Guidelines (a rate-limiting sketch follows this list):
  - Implement rate limiting
  - Monitor model outputs
  - Use content filtering
  - Run regular security audits
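As an illustration of the first guideline, a simple client-side rate limiter around the chat call might look like this (a sketch only; the limit values are arbitrary):
import time
import requests

class RateLimitedClient:
    def __init__(self, max_requests=10, per_seconds=60):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.timestamps = []

    def chat(self, prompt, model="llama2"):
        now = time.monotonic()
        # Drop timestamps that have fallen outside the sliding window
        self.timestamps = [t for t in self.timestamps if now - t < self.per_seconds]
        if len(self.timestamps) >= self.max_requests:
            raise RuntimeError("Rate limit exceeded; try again later")
        self.timestamps.append(now)
        return requests.post(
            "http://localhost:11434/api/chat",
            json={"model": model,
                  "messages": [{"role": "user", "content": prompt}],
                  "stream": False},
        ).json()

client = RateLimitedClient()
print(client.chat("Hello")["message"]["content"])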
Monitoring¶
Health Checks¶
# docker-compose.yml
services:
  ollama:
    healthcheck:
      # `ollama list` only succeeds when the API server is responding
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
Metrics Collection¶
- Prometheus Integration: Ollama does not currently expose a native /metrics endpoint, so export metrics with a small sidecar or the monitoring stack's container-level exporters (see the sketch below).
- Key Metrics:
  - Model inference time
  - Memory usage
  - GPU utilization
  - Request latency
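A minimal exporter sketch using the prometheus_client library, polling /api/ps for loaded-model memory; the metric names and port are illustrative, not an official Ollama exporter:
import time
import requests
from prometheus_client import Gauge, start_http_server

# Illustrative metric names
loaded_models = Gauge("ollama_loaded_models", "Number of models currently loaded")
model_vram = Gauge("ollama_model_size_vram_bytes", "VRAM used per loaded model", ["model"])

start_http_server(9877)  # Prometheus scrapes this port

while True:
    ps = requests.get("http://localhost:11434/api/ps").json().get("models", [])
    loaded_models.set(len(ps))
    for m in ps:
        model_vram.labels(model=m["name"]).set(m.get("size_vram", 0))
    time.sleep(15)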
Troubleshooting¶
Common Issues¶
- Memory Issues: if the container is OOM-killed or responses stall, switch to a smaller or quantized model, raise OLLAMA_MEMORY_LIMIT, and check what is currently loaded (see the sketch below).
- GPU Problems: if inference falls back to CPU, confirm the NVIDIA Container Toolkit is installed on the host and that nvidia-smi works inside the container.
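A quick diagnostic, assuming the /api/ps endpoint, is to check which models are loaded and how much of each sits in VRAM versus system RAM:
import requests

# /api/ps lists currently loaded models with their memory footprint
for m in requests.get("http://localhost:11434/api/ps").json().get("models", []):
    size, vram = m.get("size", 0), m.get("size_vram", 0)
    placement = "GPU" if vram >= size else "partially or fully on CPU"
    print(f"{m['name']}: {size / 1e9:.1f} GB total, {vram / 1e9:.1f} GB in VRAM ({placement})")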
Logs¶
# View logs
docker-compose logs -f ollama
# Enable debug logging by setting OLLAMA_DEBUG=1 in the ollama service environment, then restart
docker-compose up -d --force-recreate ollama