A Weekend AI Experiment That Actually Stuck

What started as a Friday afternoon experiment turned into something we actually use. Over a weekend, I built a multi-agent AI system for Fortytwo, a cloud expert firm of 40+ people. It handles time tracking, revenue reporting, and infrastructure monitoring: a real internal proof of concept on a real business. This is what worked, what did not, and what genuinely caught me off guard.

It started with a simple idea: what if I had an AI assistant that understood our business? Two days later, I was staring at a system that manages our time tracking, calculates revenue across currencies, monitors infrastructure, writes code, reviews security, and delegates work between specialized AI agents, all while enforcing a daily budget.

The name? Deep Thought. Because when you build something that tries to answer everything, the reference writes itself.

The Architecture: One Brain, Many Specialists

Deep Thought follows a coordinator pattern. There is one lead agent, Marvin, who receives all input and decides what to do with it. General questions, he answers directly. Anything requiring specialized tools or API access, he delegates.

Each specialist agent is self-contained with its own tools, logic, and personality. Drop a new agent folder into the system and it is discovered automatically: no registration, no config changes. That design decision turned out to be one of the most valuable. I can prototype a new agent in 20 minutes and have it running the same day.
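
To make the auto-discovery idea concrete, here is a minimal Python sketch. It is not the actual Deep Thought loader: the folder layout (one directory per agent, each with an `agent.json` manifest) and the names `AgentSpec`, `discover_agents`, and `make_demo_root` are assumptions made up for this illustration.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AgentSpec:
    """What the coordinator needs to know about one specialist (hypothetical)."""
    name: str
    description: str
    path: Path


def discover_agents(root: Path) -> dict[str, AgentSpec]:
    """Register every sub-folder of `root` that contains an agent.json manifest."""
    agents: dict[str, AgentSpec] = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        manifest = folder / "agent.json"
        if not manifest.is_file():
            continue  # not an agent folder, so skip it
        meta = json.loads(manifest.read_text(encoding="utf-8"))
        name = meta.get("name", folder.name)
        agents[name] = AgentSpec(name, meta.get("description", ""), folder)
    return agents


def make_demo_root() -> Path:
    """Create a throwaway agents directory so the sketch runs on its own."""
    root = Path(tempfile.mkdtemp())
    demo = [("finance", "Timesheets and revenue"), ("devops", "Status checks")]
    for name, description in demo:
        (root / name).mkdir()
        manifest = {"name": name, "description": description}
        (root / name / "agent.json").write_text(json.dumps(manifest), encoding="utf-8")
    (root / "notes").mkdir()  # has no manifest, so discovery ignores it
    return root


print(sorted(discover_agents(make_demo_root())))
```

Because discovery is just a directory scan at startup, adding an agent in this sketch means adding a folder and nothing else.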

[Image: Deep Thought dashboard activity feed]

Different Models for Different Tasks

Not every task needs the same brain, or the same budget.

The Developer agent runs on GPT-4.1 Mini via Microsoft Foundry, keeping all code and data within our own tenant. Security analysis runs on Claude Sonnet, which handles deeper reasoning more effectively. DevOps status checks run on a self-hosted open-source model, because a check that runs 20 times a day costs nothing per call on our own hardware but adds up quickly if every call goes to a premium API.

We run a deliberate mix of providers depending on the task and the data involved.

Claude via the Anthropic API handles most reasoning-heavy work: code review, security audits, anything requiring nuance. Azure OpenAI covers operations involving internal company data, where we need the compliance guarantees that come with a proper cloud agreement. For sensitive data that should not leave the building, we use local models: either on dedicated hardware or through an EXO cluster our colleagues have assembled from spare machines. It is scrappy, it is fast enough, and it costs nothing per inference.

Matching the model to the task is what makes the system both trustworthy and affordable to run day to day. The cost difference compounds quickly.
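
The routing rules described above could be expressed roughly like this. Treat it as a sketch under assumptions: the model identifiers are placeholders rather than our real deployment names, and the precedence (data sensitivity first, then call frequency, then reasoning depth) is one plausible ordering, not a description of the production logic. `Sensitivity`, `Route`, and `pick_route` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = 1      # nothing confidential
    INTERNAL = 2    # internal company data
    RESTRICTED = 3  # must not leave the building


@dataclass(frozen=True)
class Route:
    provider: str
    model: str


# Placeholder identifiers, assumed for illustration only.
LOCAL = Route("local", "self-hosted-open-model")
AZURE = Route("azure-openai", "gpt-4.1-mini")
CLAUDE = Route("anthropic", "claude-sonnet")


def pick_route(
    sensitivity: Sensitivity,
    needs_deep_reasoning: bool,
    high_frequency: bool = False,
) -> Route:
    """Pick a provider: data sensitivity wins, then cost, then reasoning depth."""
    if sensitivity is Sensitivity.RESTRICTED:
        return LOCAL  # sensitive data stays on our own hardware
    if sensitivity is Sensitivity.INTERNAL:
        return AZURE  # covered by the cloud agreement
    if high_frequency:
        return LOCAL  # frequent checks are free per call when self-hosted
    if needs_deep_reasoning:
        return CLAUDE  # nuanced work such as code review or security audits
    return LOCAL  # assumed default: the cheapest option


print(pick_route(Sensitivity.PUBLIC, needs_deep_reasoning=True))
```

Checking sensitivity before anything else means a cost or quality preference can never route restricted data to an external API.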

The Three-Tier Security Model

Every tool in the system carries a security classification:

GREEN: Runs without approval. Read-only, low-risk operations such as lookups and status checks.

YELLOW: Requires explicit approval. File writes, shell commands, deployments. A prompt appears before anything happens.

RED: Blocked. No agent can delete a deployment or perform other destructive operations without a manual override.

This is the part I am most glad I designed early. It means agents can have real capability while a human stays in the loop for anything that matters. Start with the security model. Everything else follows.
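
A gate like this can be quite small. The sketch below is an illustration, not the real implementation: the tool names, `guarded_call`, and `ToolBlocked` are hypothetical, the GREEN tier (auto-approved tools) is inferred from the "three-tier" name, and treating unknown tools as RED is my own assumption.

```python
from enum import Enum
from typing import Callable


class Tier(Enum):
    GREEN = "green"    # assumed tier: runs without approval
    YELLOW = "yellow"  # requires explicit approval
    RED = "red"        # blocked unless manually overridden


# Hypothetical tool names mapped to their classification.
TOOL_TIERS = {
    "read_status": Tier.GREEN,
    "write_file": Tier.YELLOW,
    "run_shell": Tier.YELLOW,
    "delete_deployment": Tier.RED,
}


class ToolBlocked(Exception):
    """Raised when a tool call is denied by the security gate."""


def guarded_call(
    tool: str,
    action: Callable[[], object],
    approve: Callable[[str], bool],
    manual_override: bool = False,
) -> object:
    """Run `action` only if the tool's tier allows it."""
    tier = TOOL_TIERS.get(tool, Tier.RED)  # unknown tools are treated as RED
    if tier is Tier.RED and not manual_override:
        raise ToolBlocked(f"{tool} is RED and needs a manual override")
    if tier is Tier.YELLOW and not approve(tool):
        raise ToolBlocked(f"{tool} was not approved")
    return action()


# A GREEN tool runs even when the approver would say no.
print(guarded_call("read_status", lambda: "all healthy", approve=lambda tool: False))
```

In a chat interface, `approve` would be the function that shows the prompt and waits for a human answer.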

[Image: Chat log with developer approval]

The Finance Agent: Where It Got Real

This is where AI assistance moved from interesting demo to something that genuinely changes how we operate.

The problem: 40+ people logging hours across dozens of projects in our Norwegian accounting system. Different billing models, different rates, and three billing currencies: NOK, EUR, and GBP. Management needs weekly revenue reports. Previously that meant exporting a CSV, opening Excel, matching rates, converting currencies, and building pivot tables. Every single week.

The solution: an agent that syncs timesheets, matches each entry to the correct billing rate, respects which hours are billable, converts currencies using live exchange rates, and generates a report automatically.

The first version got the revenue calculation wrong by 40%. It turned out our accounting system has three different pricing models, and some projects store rates per employee rather than per activity. After an hour of debugging and comparing output against a 255-line CSV export line by line, we got the numbers to line up. The only remaining difference was a 383 NOK variance, which came from using today’s exchange rate rather than the rate from that specific week.

That accuracy did not come easy. But now it runs automatically, every week, at essentially no cost.
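
The core of the calculation can be sketched as follows. This is a simplified illustration with made-up numbers, not the production code: `Entry`, `Project`, `hourly_rate`, and `revenue_nok` are hypothetical names, only the per-employee and per-activity rate lookups mentioned above are modelled, and the exchange rates are invented. The sketch looks up the exchange rate by week, which is one way to avoid the kind of variance described above.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Entry:
    employee: str
    project: str
    activity: str
    hours: Decimal
    billable: bool
    iso_week: str  # for example "2025-W14"


@dataclass
class Project:
    currency: str  # "NOK", "EUR" or "GBP"
    rate_by: str   # "employee" or "activity"
    rates: dict[str, Decimal]


def hourly_rate(entry: Entry, project: Project) -> Decimal:
    """Look up the rate by employee or by activity, depending on the project."""
    key = entry.employee if project.rate_by == "employee" else entry.activity
    return project.rates[key]


def revenue_nok(entries, projects, fx_to_nok) -> Decimal:
    """Sum billable revenue in NOK using the exchange rate of each entry's week."""
    total = Decimal("0")
    for entry in entries:
        if not entry.billable:
            continue  # non-billable hours never count as revenue
        project = projects[entry.project]
        amount = entry.hours * hourly_rate(entry, project)
        if project.currency != "NOK":
            amount *= fx_to_nok[(project.currency, entry.iso_week)]
        total += amount
    return total.quantize(Decimal("0.01"))


# Illustrative data only; none of these figures come from the real system.
projects = {
    "alpha": Project("NOK", "activity", {"consulting": Decimal("1500")}),
    "beta": Project("EUR", "employee", {"kari": Decimal("120")}),
}
fx_to_nok = {("EUR", "2025-W14"): Decimal("11.50")}
entries = [
    Entry("ola", "alpha", "consulting", Decimal("10"), True, "2025-W14"),
    Entry("kari", "beta", "consulting", Decimal("5"), True, "2025-W14"),
    Entry("ola", "alpha", "consulting", Decimal("3"), False, "2025-W14"),
]
print(revenue_nok(entries, projects, fx_to_nok))  # prints 21900.00
```

Using `Decimal` rather than floats keeps the totals comparable to an accounting export down to the last øre.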

Teaching Agents How to Think

One of the more useful patterns in the system is skills: markdown documents injected into an agent’s context based on what it needs to know. Not code. Just knowledge.

We have skills for coding standards, security review checklists, data visualisation formats, and Norwegian compliance conventions. There is also one called token-efficiency that instructs the agent to be concise: get to the point, no repetition, fewest tool calls possible. It has a measurable impact on cost. An agent without it might use 500 tokens to say what a well-prompted agent says in 150.

Each agent gets exactly the knowledge it needs. Nothing more.
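
Mechanically, skill injection can be as simple as concatenating files into the system prompt. The sketch below assumes one markdown file per skill in a `skills` directory; that layout, the section header format, the sample skill text, and the function name `build_system_prompt` are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def build_system_prompt(base_prompt: str, skill_names: list[str], skills_dir: Path) -> str:
    """Append only the requested skill documents to an agent's base prompt."""
    sections = [base_prompt.strip()]
    for name in skill_names:
        text = (skills_dir / f"{name}.md").read_text(encoding="utf-8").strip()
        sections.append(f"## Skill: {name}\n{text}")
    return "\n\n".join(sections)


def make_demo_skills() -> Path:
    """Write two tiny skill files to a temp directory so the sketch is runnable."""
    skills_dir = Path(tempfile.mkdtemp())
    (skills_dir / "token-efficiency.md").write_text(
        "Be concise. No repetition. Use the fewest tool calls possible.\n",
        encoding="utf-8",
    )
    (skills_dir / "coding-standards.md").write_text(
        "Prefer small functions. Add type hints.\n", encoding="utf-8"
    )
    return skills_dir


prompt = build_system_prompt(
    "You are the DevOps agent.", ["token-efficiency"], make_demo_skills()
)
print(prompt)
```

The DevOps agent in this example receives the token-efficiency skill but never sees the coding standards, which keeps its context small.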

The Numbers

DevOps status checks: $0.00 (self-hosted, so the only cost is electricity)

Average daily cost: ~$3–5 on active development days

Cost per finance sync: ~$0.02

Cost per revenue report: ~$0.01

A budget cap enforces a daily spend limit in real time. If we hit it, all external API calls stop. No surprise bills, no runaway loops. That one line in every server log, Budget today: $3.64 / $10, is the canary in the coal mine.
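
A budget guard of this kind might look like the sketch below. It is an assumption-laden illustration rather than the real code: `DailyBudget` and `BudgetExceeded` are hypothetical names, costs are tracked in memory only, and the caller is expected to run `check()` before every external API call and `record()` after it.

```python
from datetime import date
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when the daily spend limit has been reached."""


class DailyBudget:
    def __init__(self, limit_usd: float, today: Callable[[], date] = date.today):
        self.limit = limit_usd
        self._today = today  # injectable clock, which makes testing easier
        self._day = today()
        self.spent = 0.0

    def _roll_over(self) -> None:
        """Reset the counter when a new day starts."""
        if self._today() != self._day:
            self._day = self._today()
            self.spent = 0.0

    def check(self) -> None:
        """Call before an external API request; raises once the cap is hit."""
        self._roll_over()
        if self.spent >= self.limit:
            raise BudgetExceeded(self.status_line())

    def record(self, cost_usd: float) -> None:
        """Call after a request with its actual cost."""
        self._roll_over()
        self.spent += cost_usd

    def status_line(self) -> str:
        return f"Budget today: ${self.spent:.2f} / ${self.limit:g}"


budget = DailyBudget(10)
budget.check()
budget.record(3.64)
print(budget.status_line())  # prints: Budget today: $3.64 / $10
```

Because the check happens before the call rather than after, a runaway loop stops at the first request past the limit instead of being noticed on the next invoice.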

[Image: AI chat interface, developer view]
