
Ollama

Ollama is a local LLM runtime that lets you run a wide range of large language models on your own hardware. This guide covers the setup, configuration, and usage of Ollama in the Local AI Cyber Lab environment.

Overview

Ollama provides:

  • Local model execution
  • Model management
  • API access
  • GPU acceleration
  • Custom model support

Architecture

graph TB
    subgraph Client_Layer["Client Layer"]
        webui["Open WebUI"]
        api["API Clients"]
        cli["CLI Tools"]
    end

    subgraph Ollama["Ollama Service"]
        api_server["API Server"]
        model_manager["Model Manager"]
        inference["Inference Engine"]

        subgraph Models["Model Storage"]
            quantized["Quantized Models"]
            full["Full Models"]
        end
    end

    subgraph Hardware["Hardware"]
        gpu["GPU"]
        cpu["CPU"]
        memory["Memory"]
    end

    Client_Layer --> Ollama
    api_server --> model_manager
    model_manager --> Models
    inference --> Models
    inference --> Hardware

Installation

Ollama is installed automatically as part of the Local AI Cyber Lab environment. You can update the image and restart the service manually, and customize its configuration as described in the next section:

# Pull the latest Ollama image
docker-compose pull ollama

# Start Ollama service
docker-compose up -d ollama
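
After the service starts, a quick way to confirm the API is reachable is to query the version endpoint. A minimal check, assuming the default port 11434 is published on localhost:

import requests

# GET /api/version returns the running Ollama version as JSON
response = requests.get("http://localhost:11434/api/version", timeout=5)
response.raise_for_status()
print(response.json())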

Configuration

Environment Variables

# .env file
OLLAMA_HOST=0.0.0.0
OLLAMA_PORT=11434
OLLAMA_MODELS=llama2,codellama,mistral
OLLAMA_MEMORY_LIMIT=8g
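
How these values reach client code depends on your compose setup; as an illustration, a client could assemble the API base URL from OLLAMA_HOST and OLLAMA_PORT. This is a hypothetical helper, assuming the variables are exported in the client's environment:

import os

# Fall back to the defaults shown in the .env example above
host = os.environ.get("OLLAMA_HOST", "0.0.0.0")
port = os.environ.get("OLLAMA_PORT", "11434")

# 0.0.0.0 is a bind address, not a destination; connect to localhost from the same machine
if host == "0.0.0.0":
    host = "localhost"

BASE_URL = f"http://{host}:{port}"
print(BASE_URL)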

Hardware Requirements

  • Minimum:
      • 8GB RAM
      • 4 CPU cores
      • 20GB storage
  • Recommended:
      • 16GB+ RAM
      • 8+ CPU cores
      • NVIDIA GPU (8GB+ VRAM)
      • 100GB+ SSD storage

Model Management

Available Models

  1. Default Models:
      • llama2
      • codellama
      • mistral
      • neural-chat

  2. Specialized Models:
      • llama2-uncensored
      • codellama-python
      • mistral-openorca
      • stable-beluga

Model Commands

# List available models
docker-compose exec ollama ollama list

# Pull a model
docker-compose exec ollama ollama pull llama2

# Remove a model
docker-compose exec ollama ollama rm llama2

# Get model information
docker-compose exec ollama ollama show llama2
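
The same operations are available over the REST API. The sketch below mirrors the CLI commands above using the /api/tags, /api/pull, /api/show, and /api/delete endpoints, assuming the service is reachable on localhost:11434:

import requests

BASE = "http://localhost:11434"

# List local models (equivalent of `ollama list`)
print(requests.get(f"{BASE}/api/tags").json())

# Pull a model; "stream": False waits for the pull to finish (equivalent of `ollama pull llama2`)
requests.post(f"{BASE}/api/pull", json={"name": "llama2", "stream": False})

# Show model details (equivalent of `ollama show llama2`)
print(requests.post(f"{BASE}/api/show", json={"name": "llama2"}).json())

# Remove a model (equivalent of `ollama rm llama2`)
requests.delete(f"{BASE}/api/delete", json={"name": "llama2"})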

Usage

API Endpoints

  1. Chat Completion:

    curl -X POST http://localhost:11434/api/chat \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama2",
        "messages": [
          {"role": "user", "content": "Hello, how are you?"}
        ],
        "stream": false
      }'
    

  2. Model Management:

    # List models
    curl http://localhost:11434/api/tags
    
    # Pull model
    curl -X POST http://localhost:11434/api/pull \
      -d '{"name": "llama2"}'
    

Python Integration

import requests

def chat_with_model(prompt, model="llama2"):
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "user", "content": prompt}
            ],
            "stream": False  # return a single JSON object instead of streamed chunks
        }
    )
    response.raise_for_status()
    return response.json()

# Example usage
result = chat_with_model("Explain quantum computing")
print(result['message']['content'])
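
For interactive use you may prefer streamed output. The /api/chat endpoint streams newline-delimited JSON by default; a minimal sketch that prints tokens as they arrive, using the same endpoint and model as above:

import json
import requests

def stream_chat(prompt, model="llama2"):
    with requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        stream=True,
    ) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            # Each chunk carries a partial assistant message until "done" is true
            print(chunk.get("message", {}).get("content", ""), end="", flush=True)
            if chunk.get("done"):
                break
        print()

stream_chat("Explain quantum computing in one paragraph")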

Performance Optimization

Model Quantization

# Pull a quantized variant (check the model library for the tags a model offers)
docker-compose exec ollama ollama pull llama2:7b-chat-q4_0

# Compare performance against the full-precision tag
docker-compose exec ollama ollama run llama2:7b-chat-q4_0 "Your test prompt"
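
To put numbers on the comparison, the non-streaming response from /api/generate includes eval_count and eval_duration (in nanoseconds), from which tokens per second can be derived. A minimal sketch, assuming both tags have already been pulled (adjust the tags to whatever variants you use):

import requests

def tokens_per_second(model, prompt="Your test prompt"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    response.raise_for_status()
    stats = response.json()
    # eval_duration is reported in nanoseconds
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

for tag in ("llama2:7b-chat", "llama2:7b-chat-q4_0"):
    print(tag, round(tokens_per_second(tag), 1), "tokens/s")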

Resource Management

  1. GPU Settings:

    # docker-compose.yml
    services:
      ollama:
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [gpu]
    

  2. Memory Management:

    services:
      ollama:
        mem_limit: ${OLLAMA_MEMORY_LIMIT:-8g}
        environment:
          - CUDA_VISIBLE_DEVICES=0,1  # Specify GPUs
    

Security Considerations

Access Control

  1. Network Security:

    # docker-compose.yml
    services:
      ollama:
        networks:
          - ai_network
        expose:
          - 11434
    

  2. API Authentication:

    # Ollama has no built-in authentication; this example assumes an
    # authenticating reverse proxy sits in front of the API
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    response = requests.post(url, headers=headers, json=data)
    

Model Security

  1. Model Verification:

    # Check the local model digest (compare against the model library listing)
    docker-compose exec ollama ollama list

    # Inspect the model's origin and build parameters
    docker-compose exec ollama ollama show llama2 --modelfile
    

  2. Safe Usage Guidelines (see the sketch below for rate limiting and output filtering):

      • Implement rate limiting
      • Monitor model outputs
      • Use content filtering
      • Regular security audits
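
Ollama itself enforces none of these, so rate limiting and output filtering live in your client or gateway code. A minimal illustrative sketch (the interval and blocklist are placeholders, not recommendations):

import time
import requests

BLOCKLIST = {"password", "private key"}   # placeholder terms to flag in outputs
MIN_INTERVAL = 2.0                        # seconds between requests (crude rate limit)
_last_call = 0.0

def guarded_chat(prompt, model="llama2"):
    global _last_call
    # Client-side rate limit: wait until the minimum interval has passed
    wait = MIN_INTERVAL - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()

    response = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": [{"role": "user", "content": prompt}], "stream": False},
    )
    response.raise_for_status()
    content = response.json()["message"]["content"]

    # Naive output filter: flag responses containing blocklisted terms for review
    if any(term in content.lower() for term in BLOCKLIST):
        raise ValueError("Model output flagged by content filter")
    return content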

Monitoring

Health Checks

# docker-compose.yml
services:
  ollama:
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3

Metrics Collection

  1. Prometheus Integration:

    # prometheus/config/ollama.yml
    scrape_configs:
      - job_name: 'ollama'
        static_configs:
          - targets: ['ollama:11434']
        metrics_path: '/metrics'
    

  2. Key Metrics (see the GPU sampling sketch below):

      • Model inference time
      • Memory usage
      • GPU utilization
      • Request latency
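
To spot-check GPU utilization and memory directly (for example, to cross-check dashboard values), here is a minimal sketch using the NVIDIA management library via the pynvml package, assuming it is installed on the monitoring host:

import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: {util.gpu}% util, {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB used")
finally:
    pynvml.nvmlShutdown()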

Troubleshooting

Common Issues

  1. Memory Issues:

    # Check memory usage
    docker stats ollama
    
    # Adjust the memory limit and restart the service
    docker-compose stop ollama
    export OLLAMA_MEMORY_LIMIT=12g
    docker-compose up -d ollama
    

  2. GPU Problems:

    # Verify GPU access
    docker-compose exec ollama nvidia-smi
    
    # Check whether Ollama detected the GPU at startup
    docker-compose logs ollama | grep -iE "cuda|gpu"
    

Logs

# View logs
docker-compose logs -f ollama

# Enable debug logging: set OLLAMA_DEBUG=1 in the ollama service environment, then restart
docker-compose up -d --force-recreate ollama
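
When diagnosing GPU or memory problems, it can also help to see what is actually loaded. Recent Ollama releases report loaded models over the API (the same data as `ollama ps`), including how much of each model resides in VRAM; a minimal sketch, assuming your version exposes /api/ps:

import requests

# /api/ps lists currently loaded models with their VRAM footprint
for m in requests.get("http://localhost:11434/api/ps").json().get("models", []):
    print(f"{m['name']}: {m.get('size_vram', 0)}/{m.get('size', 0)} bytes in VRAM")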

Additional Resources

  1. Ollama Documentation
  2. Model Library
  3. API Reference
  4. GitHub Repository

Integration Examples

  1. Web UI Integration
  2. Workflow Automation
  3. Security Integration
  4. Monitoring Setup