#ollama

4 posts

Tesla P40 in a Homelab: 24GB of Inference on a Budget

Running a Tesla P40 for LLM inference. Why I ditched GPU passthrough for host-level drivers to stop the constant Proxmox crashes.

How to build a routing layer for AI agents that keeps sensitive data on local hardware and sends only non-private tasks to cloud LLMs.

Moving beyond prompt engineering to implement token-level schema enforcement, pre-execution gates, and shell-safe execution pipelines for AI agents.

Deploying Ollama on Kubernetes can lead to GPU deadlocks. Here's how to avoid them.