Ollama¶
Ollama is a local LLM runtime that lets you run a wide range of large language models on your own hardware. This guide covers the setup, configuration, and usage of Ollama in the Local AI Cyber Lab environment.
Overview¶
Ollama provides:
- Local model execution
- Model management
- API access
- GPU acceleration
- Custom model support
Architecture¶
graph TB
    subgraph Client_Layer["Client Layer"]
        webui["Open WebUI"]
        api["API Clients"]
        cli["CLI Tools"]
    end
    subgraph Ollama["Ollama Service"]
        api_server["API Server"]
        model_manager["Model Manager"]
        inference["Inference Engine"]
        subgraph Models["Model Storage"]
            quantized["Quantized Models"]
            full["Full Models"]
        end
    end
    subgraph Hardware["Hardware"]
        gpu["GPU"]
        cpu["CPU"]
        memory["Memory"]
    end
    Client_Layer --> Ollama
    api_server --> model_manager
    model_manager --> Models
    inference --> Models
    inference --> Hardware
Installation¶
Ollama is automatically installed as part of the Local AI Cyber Lab environment. However, you can customize its configuration:
# Pull the latest Ollama image
docker-compose pull ollama
# Start Ollama service
docker-compose up -d ollama
Configuration¶
Environment Variables¶
# .env file
OLLAMA_HOST=0.0.0.0
OLLAMA_PORT=11434
OLLAMA_MODELS=llama2,codellama,mistral
OLLAMA_MEMORY_LIMIT=8g
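Once the service is running, you can confirm that the API answers on the configured port. A minimal sketch, assuming the default port above and Ollama's standard /api/version endpoint:
import requests
# Assumes the service is published on the port configured via OLLAMA_PORT (11434 by default)
response = requests.get("http://localhost:11434/api/version", timeout=5)
response.raise_for_status()
print("Ollama is reachable, version:", response.json()["version"])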
Hardware Requirements¶
- Minimum:
  - 8GB RAM
  - 4 CPU cores
  - 20GB storage
- Recommended:
  - 16GB+ RAM
  - 8+ CPU cores
  - NVIDIA GPU (8GB+ VRAM)
  - 100GB+ SSD storage
Model Management¶
Available Models¶
- Default Models:
  - llama2
  - codellama
  - mistral
  - neural-chat
- Specialized Models:
  - llama2-uncensored
  - codellama-python
  - mistral-openorca
  - stable-beluga
Model Commands¶
# List available models
docker-compose exec ollama ollama list
# Pull a model
docker-compose exec ollama ollama pull llama2
# Remove a model
docker-compose exec ollama ollama rm llama2
# Get model information
docker-compose exec ollama ollama show llama2
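The same operations are also exposed over Ollama's REST API. A minimal sketch, assuming the standard /api/tags, /api/pull, /api/delete, and /api/show endpoints on the default port:
import requests
BASE = "http://localhost:11434"
# List installed models (equivalent to `ollama list`)
models = requests.get(f"{BASE}/api/tags").json()["models"]
print([m["name"] for m in models])
# Pull a model (equivalent to `ollama pull llama2`); blocks until the download completes
requests.post(f"{BASE}/api/pull", json={"name": "llama2", "stream": False})
# Remove a model (equivalent to `ollama rm llama2`)
requests.delete(f"{BASE}/api/delete", json={"name": "llama2"})
# Show model details (equivalent to `ollama show llama2`)
print(requests.post(f"{BASE}/api/show", json={"name": "llama2"}).json())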
Usage¶
API Endpoints¶
- Chat Completion: POST /api/chat (or POST /api/generate for single-prompt completions); see the Python example below.
- Model Management: GET /api/tags, POST /api/pull, POST /api/show, and DELETE /api/delete; see the REST sketch under Model Commands above.
Python Integration¶
import requests

def chat_with_model(prompt, model="llama2"):
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "user", "content": prompt}
            ],
            "stream": False  # return a single JSON object instead of a token stream
        }
    )
    response.raise_for_status()
    return response.json()

# Example usage
result = chat_with_model("Explain quantum computing")
print(result['message']['content'])
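For long generations you may prefer the API's default streaming mode, which returns one JSON object per line. A minimal sketch that prints tokens as they arrive, assuming the same /api/chat endpoint:
import json
import requests

def stream_chat(prompt, model="llama2"):
    # With "stream": True (the API default), Ollama sends newline-delimited JSON chunks
    with requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": [{"role": "user", "content": prompt}], "stream": True},
        stream=True,
    ) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("message", {}).get("content", ""), end="", flush=True)
            if chunk.get("done"):
                break

stream_chat("Explain quantum computing in one paragraph")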
Performance Optimization¶
Model Quantization¶
# Pull quantized model
docker-compose exec ollama ollama pull llama2:7b-q4
# Compare performance
docker-compose exec ollama ollama run llama2:7b-q4 "Your test prompt"
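To compare a quantized tag against the full model programmatically, you can read the timing fields that non-streaming /api/generate responses include (eval_count and eval_duration, the latter in nanoseconds). A rough sketch, assuming both tags are already pulled:
import requests

def tokens_per_second(model, prompt="Your test prompt"):
    # Non-streaming /api/generate responses include eval_count and eval_duration (ns)
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    ).json()
    return r["eval_count"] / (r["eval_duration"] / 1e9)

for tag in ["llama2", "llama2:7b-q4"]:
    print(tag, round(tokens_per_second(tag), 1), "tokens/sec")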
Resource Management¶
- GPU Settings: make the NVIDIA GPU visible to the Ollama container and keep an eye on VRAM headroom (see the sketch below).
- Memory Management: cap container memory (OLLAMA_MEMORY_LIMIT above) and avoid loading models larger than the available RAM.
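A quick way to watch GPU headroom while models are loaded is to query nvidia-smi from the host. A minimal sketch, assuming the NVIDIA driver and nvidia-smi are installed:
import subprocess

# Query utilization and memory in an easy-to-parse CSV format
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
for i, line in enumerate(out.strip().splitlines()):
    util, used, total = [v.strip() for v in line.split(",")]
    print(f"GPU {i}: {util}% utilization, {used}/{total} MiB VRAM")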
Security Considerations¶
Access Control¶
- Network Security: keep port 11434 on the internal Docker network (or bound to localhost) rather than exposing it publicly.
- API Authentication: Ollama ships without built-in authentication, so place a reverse proxy or gateway that enforces credentials in front of the API (see the sketch below).
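One common pattern is to route requests through an authenticating reverse proxy and attach a token to every call. A minimal client-side sketch; the proxy URL and token variable are placeholders, not part of the lab's stack:
import os
import requests

# Hypothetical reverse-proxy endpoint that forwards to Ollama after validating the token
PROXY_URL = "https://ollama-gateway.internal.example/api/chat"
TOKEN = os.environ["OLLAMA_GATEWAY_TOKEN"]  # placeholder credential

response = requests.post(
    PROXY_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"model": "llama2",
          "messages": [{"role": "user", "content": "ping"}],
          "stream": False},
    timeout=30,
)
print(response.status_code)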
Model Security¶
- Model Verification: pull models only from trusted sources and review their Modelfiles before running them.
- Safe Usage Guidelines (a rate-limiting sketch follows this list):
  - Implement rate limiting
  - Monitor model outputs
  - Use content filtering
  - Run regular security audits
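As an illustration of the first guideline, a simple client-side rate limiter around the chat call might look like this (a sketch only; the limit values are arbitrary):
import time
import requests

class RateLimitedClient:
    def __init__(self, max_requests=10, per_seconds=60):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.timestamps = []

    def chat(self, prompt, model="llama2"):
        now = time.monotonic()
        # Drop timestamps that have fallen outside the sliding window
        self.timestamps = [t for t in self.timestamps if now - t < self.per_seconds]
        if len(self.timestamps) >= self.max_requests:
            raise RuntimeError("Rate limit exceeded; try again later")
        self.timestamps.append(now)
        return requests.post(
            "http://localhost:11434/api/chat",
            json={"model": model,
                  "messages": [{"role": "user", "content": prompt}],
                  "stream": False},
        ).json()

client = RateLimitedClient()
print(client.chat("Hello")["message"]["content"])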
Monitoring¶
Health Checks¶
# docker-compose.yml
services:
  ollama:
    healthcheck:
      # `ollama list` only succeeds when the API server is responding
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
Metrics Collection¶
- Prometheus Integration: Ollama does not currently expose a native /metrics endpoint, so export metrics with a small sidecar or the monitoring stack's container-level exporters (see the sketch below).
- Key Metrics:
  - Model inference time
  - Memory usage
  - GPU utilization
  - Request latency
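A minimal exporter sketch using the prometheus_client library, polling /api/ps for loaded-model memory; the metric names and port are illustrative, not an official Ollama exporter:
import time
import requests
from prometheus_client import Gauge, start_http_server

# Illustrative metric names
loaded_models = Gauge("ollama_loaded_models", "Number of models currently loaded")
model_vram = Gauge("ollama_model_size_vram_bytes", "VRAM used per loaded model", ["model"])

start_http_server(9877)  # Prometheus scrapes this port

while True:
    ps = requests.get("http://localhost:11434/api/ps").json().get("models", [])
    loaded_models.set(len(ps))
    for m in ps:
        model_vram.labels(model=m["name"]).set(m.get("size_vram", 0))
    time.sleep(15)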
Troubleshooting¶
Common Issues¶
- Memory Issues: if the container is OOM-killed or responses stall, switch to a smaller or quantized model, raise OLLAMA_MEMORY_LIMIT, and check what is currently loaded (see the sketch below).
- GPU Problems: if inference falls back to CPU, confirm the NVIDIA Container Toolkit is installed on the host and that nvidia-smi works inside the container.
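A quick diagnostic, assuming the /api/ps endpoint, is to check which models are loaded and how much of each sits in VRAM versus system RAM:
import requests

# /api/ps lists currently loaded models with their memory footprint
for m in requests.get("http://localhost:11434/api/ps").json().get("models", []):
    size, vram = m.get("size", 0), m.get("size_vram", 0)
    placement = "GPU" if vram >= size else "partially or fully on CPU"
    print(f"{m['name']}: {size / 1e9:.1f} GB total, {vram / 1e9:.1f} GB in VRAM ({placement})")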
Logs¶
# View logs
docker-compose logs -f ollama
# Enable debug logging by setting OLLAMA_DEBUG=1 in the ollama service environment, then restart
docker-compose up -d --force-recreate ollama