Shipping a Public AI Service Without Going Broke
Javier Machin

Jun 14, 2026

A case study on deploying a public, no-sign-up AI service and keeping it secure, well-behaved, and running at a fraction of a cent per story.

Tags: engineering, case-study, AI, architecture

Most of my posts here explain a concept. This one's different. It's a case study of something I built: Magic AI Story, a web app that turns a free-text description into an illustrated story. The fun part wasn't getting the AI to spit out a story. The fun part was the constraint nobody warns you about. How do you put a generative-AI service on the public internet, with no sign-up wall, and not wake up to a drained API budget and a pile of content you'd be ashamed to show your mum?

So let's talk about the real engineering: keeping a public AI service secure, well-behaved, and cheap enough that a side project can survive on a side-project budget.

The Trilemma

Here's the tension that shaped every decision. A public AI service pulls in three directions at once, and you can't fully satisfy all three:

  • Open access. No sign-up, no friction. Land on the page, type an idea, get a story. That's the whole magic. The moment you bolt on an account wall, half your visitors bounce before they ever see what it does.
  • Cost control. Every story still costs something, namely three image generations a pop. Multiply that by 'anyone on the internet, as many times as they like' and even cheap calls add up to a cliff edge you can see from here.
  • Abuse prevention. An open box that takes free text and returns AI-generated images is, let's be honest, an invitation. Some people will try to make it produce things you really don't want your name attached to. Others will just hammer it for fun.

Pick any two and the third bites you. Go open and cheap with no abuse controls and you're hosting a content-moderation incident. Keep it open and safe but skip the cost ceiling and you're funding strangers' fan fiction until your card declines. Make it safe and cheap behind a login, and congratulations, nobody uses it. The whole build was really an exercise in refusing to fully give up any of the three.

Keeping It Affordable

I'll lead with the number that makes this fun. A fully illustrated story (parsed input, a written narrative, and three generated images) costs me roughly half a cent to produce. Fractions of a penny per story, no matter how many people show up. The whole secret is which models you call and how you call them.

All the language work, meaning parsing the user's input, writing the story, and crafting the image prompts, runs on Nemotron, a model that happens to be free, served via OpenRouter. Free. For 'write a charming tale about a dragon who's afraid of heights,' you genuinely do not need a frontier model, and a capable open model that costs nothing does the job beautifully. That single choice zeroes out what would normally be the bulk of the per-story bill.

The only thing I actually pay for is the pictures. Image generation runs on Flux Dev through Runware, and each story produces three images. Those three image calls are the entire variable cost, and it's a rounding error. I'll admit the generated images are sometimes not the best, but for the budget it's a solid little model. Everything else (the database, the rate-limiting, the hosting) lives comfortably inside free tiers too.

I think about it as cost per story rather than a monthly bill because cost per story is the number that actually scales with you. A flat monthly figure is meaningless the moment traffic moves. Cost per unit of value is the one that tells you whether the thing stays sustainable at ten stories a day or ten thousand. Get that number low and growth becomes a happy problem instead of a budget emergency.
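To make the per-story framing concrete, here's a back-of-the-envelope sketch. The only figures taken from this post are the three images per story, the free language model, and the roughly half-a-cent total. The per-image price is simply back-derived from those (it is not a quoted Runware rate), and the 30-day month is an assumption.

```python
# Figures from the post.
IMAGES_PER_STORY = 3
COST_PER_STORY_USD = 0.005        # "roughly half a cent"
LLM_COST_PER_STORY_USD = 0.0      # the language model is free

# Back-derived from the numbers above, NOT a published price.
IMPLIED_COST_PER_IMAGE_USD = (
    COST_PER_STORY_USD - LLM_COST_PER_STORY_USD
) / IMAGES_PER_STORY


def monthly_cost(stories_per_day: int, days: int = 30) -> float:
    """Variable cost for a month, assuming every story costs the same."""
    return stories_per_day * days * COST_PER_STORY_USD


if __name__ == "__main__":
    for stories_per_day in (10, 10_000):
        cost = monthly_cost(stories_per_day)
        print(f"{stories_per_day:>6} stories/day -> ${cost:,.2f} per month")
```

Under those assumptions, ten stories a day is about $1.50 a month and ten thousand a day is about $1,500. Same unit cost, very different bill, which is why the per-story number is the one worth watching.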

But cheap models are only half of it. The other half is not being wasteful with the calls themselves, because the image calls are the one thing that isn't free:

  • One LLM does triple duty (input parsing, narrative generation, and image-prompt crafting) rather than reaching for separate specialised services for each.
  • Prompts and generation parameters are tuned to get a usable result on the first try, because every retry is wasted budget, and on the image side, real money.
  • The whole pipeline is structured to fail fast. If the input is junk, it gets caught before it ever reaches the image-generation step that actually costs something.
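The fail-fast ordering can be sketched roughly like this. It's a minimal illustration, not the app's actual code: `validate_idea`, `build_story`, the `Story` shape, and the 500-character limit are all hypothetical, and the two model calls are passed in as plain callables so the ordering is the only thing on show.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_IDEA_CHARS = 500  # hypothetical limit, chosen for illustration


class RejectedInput(Exception):
    """Raised before any paid call has been made."""


@dataclass
class Story:
    text: str
    image_prompts: list[str]
    image_urls: list[str] = field(default_factory=list)


def validate_idea(idea: str) -> str:
    """Cheap local checks: costs nothing, runs first."""
    cleaned = " ".join(idea.split())  # collapse whitespace
    if not cleaned:
        raise RejectedInput("empty idea")
    if len(cleaned) > MAX_IDEA_CHARS:
        raise RejectedInput("idea too long")
    if not any(ch.isalpha() for ch in cleaned):
        raise RejectedInput("idea has no words in it")
    return cleaned


def build_story(
    idea: str,
    write_story: Callable[[str], Story],    # free LLM call (assumed interface)
    generate_image: Callable[[str], str],   # paid image call (assumed interface)
) -> Story:
    cleaned = validate_idea(idea)           # step 1: free and local
    story = write_story(cleaned)            # step 2: free model
    if len(story.image_prompts) != 3:
        raise RejectedInput("model did not return three image prompts")
    # Step 3: only now do we touch the step that costs real money.
    story.image_urls = [generate_image(p) for p in story.image_prompts]
    return story
```

The design choice is just ordering: every check that can reject a request sits ahead of the image calls, so a bad request never reaches the only line that spends money.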

The lesson here has nothing to do with chasing free tiers. The real point is that choosing the right model for the job is a cost-engineering decision as much as a quality one. Reaching for the most powerful model out of habit is how a fraction-of-a-cent story quietly becomes a service you can't afford to let succeed.

Keeping the Budget From Walking Off a Cliff

Cheap-per-story is lovely right up until someone decides to generate ten thousand of them. With no sign-up, I had no user accounts to attach limits to, so the obvious tool (per-account quotas) was off the table.

The answer was a rate-limiting layer keyed on the anonymous visitor, backed by Redis. Redis is the right tool here because rate-limiting is all short-lived counters with expiry, exactly what an in-memory store with TTLs is built for, and far cheaper and faster than hammering a real database for every request. A visitor gets a sensible allowance, and once they hit it they're politely asked to come back later. No account required, but also no way to single-handedly empty the budget. It protects the wallet and keeps the front door open. The trilemma, partially squared.
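Here's a minimal sketch of the kind of counter-with-expiry limiter described above. It's a fixed-window limiter, which is one common pattern and not necessarily the exact algorithm the app uses. The class names, the key prefix, and the limits are hypothetical, and so is how `visitor_id` gets derived (the post only says the limit is keyed on the anonymous visitor). The limiter only needs a client with `incr` and `expire`, which redis-py's `Redis` client provides; the in-memory store is a stand-in so the example runs without a server.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per visitor per window."""

    def __init__(self, client, limit: int, window_seconds: int):
        self.client = client
        self.limit = limit
        self.window_seconds = window_seconds

    def allow(self, visitor_id: str) -> bool:
        key = f"ratelimit:{visitor_id}"  # hypothetical key scheme
        count = self.client.incr(key)
        if count == 1:
            # First hit in this window: start the expiry clock.
            self.client.expire(key, self.window_seconds)
        return count <= self.limit


class InMemoryCounterStore:
    """Tiny stand-in for Redis: counters that vanish after a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> [count, expires_at or None]

    def incr(self, key: str) -> int:
        entry = self._data.get(key)
        expired = (
            entry is not None
            and entry[1] is not None
            and entry[1] <= self._clock()
        )
        if entry is None or expired:
            entry = [0, None]
            self._data[key] = entry
        entry[0] += 1
        return entry[0]

    def expire(self, key: str, seconds: int) -> bool:
        if key not in self._data:
            return False
        self._data[key][1] = self._clock() + seconds
        return True


if __name__ == "__main__":
    limiter = FixedWindowLimiter(InMemoryCounterStore(), limit=3, window_seconds=3600)
    for attempt in range(1, 6):
        print(attempt, limiter.allow("visitor-abc"))
```

One caveat with this sketch: `incr` and `expire` are two separate calls, so against a real Redis a crash between them could leave a counter without a TTL. Wrapping both in a transaction or a small script is the usual way to close that gap.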

Keeping It Secure (and Not Embarrassing)

'Secure' for a public AI service has very little to do with firewalls. What matters are the specific ways an open generative endpoint can be turned against you. I thought about it as a list of concrete vectors, each with its own answer:

  • Garbage or malicious input. Free text is the wild west. Everything the user types gets parsed and validated before it's allowed to drive a generation, so nonsense gets rejected at the door rather than turned into images.
  • Inappropriate content. This was the one that kept me honest. The same LLM that powers the experience also does a moderation pass, and the prompt design deliberately steers generation toward the wholesome and away from the unpleasant. The goal was a genuine balance: enough creative freedom that the stories are fun and surprising, enough guardrail that the service won't produce something it shouldn't.
  • Repeat abusers. Rate-limiting slows people down, but the determined ones need a harder stop. So there's a blacklist, a persistent ban capability backed by Supabase (Postgres), for cutting off bad actors entirely. Here Postgres is the right call precisely because blacklist entries are the opposite of rate-limit counters: they need to be durable and to stick around, not expire in an hour.
  • Flying blind. You can't protect what you can't see, so analytics are wired in to watch usage patterns, both to spot abuse and to learn what people actually do with the thing.

Notice the split: Redis for the ephemeral stuff (rate limits), Postgres for the durable stuff (bans). Same goal of protecting the service, but matching the storage to the lifetime of the data keeps each part simple and cheap.
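The durable half of that split might look something like the sketch below. To keep it runnable anywhere it uses Python's built-in `sqlite3` as a stand-in for Supabase/Postgres. The table name, columns, and the `admit` helper are all hypothetical, and the upsert syntax is SQLite's (Postgres would use `INSERT ... ON CONFLICT`).

```python
import sqlite3
from datetime import datetime, timezone
from typing import Callable


def open_ban_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the ban table if needed. sqlite3 stands in for Postgres here."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS banned_visitors (
            visitor_id TEXT PRIMARY KEY,
            reason     TEXT NOT NULL,
            banned_at  TEXT NOT NULL
        )
        """
    )
    return conn


def ban(conn: sqlite3.Connection, visitor_id: str, reason: str) -> None:
    """Record a ban. No TTL: it stays until someone deletes the row."""
    conn.execute(
        # SQLite upsert; in Postgres this would be INSERT ... ON CONFLICT.
        "INSERT OR REPLACE INTO banned_visitors VALUES (?, ?, ?)",
        (visitor_id, reason, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def is_banned(conn: sqlite3.Connection, visitor_id: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM banned_visitors WHERE visitor_id = ?", (visitor_id,)
    ).fetchone()
    return row is not None


def admit(
    conn: sqlite3.Connection,
    visitor_id: str,
    within_rate_limit: Callable[[str], bool],
) -> str:
    """Durable check first, ephemeral check second."""
    if is_banned(conn, visitor_id):
        return "banned"
    if not within_rate_limit(visitor_id):
        return "rate_limited"
    return "ok"
```

Checking the ban list before the rate limiter is a choice made for this sketch, not something the architecture requires. It means a banned visitor never touches the counters at all.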

The Genuinely Hard Bit: Taming Non-Determinism

If I had to point at the single trickiest engineering problem, it had nothing to do with cost or abuse. It came down to this: AI output is unpredictable by nature, and the rest of my app is code that expects structure. I'm asking a creative, free-wheeling model to produce something I then have to parse, validate, slot into a UI, and hand to an image generator. Creativity and reliable structure are not natural friends.

Squaring that came down to three things working together: careful prompt engineering to coax the model toward a consistent shape, tuned generation parameters to keep it from wandering too far, and robust error handling for the times it wanders anyway. The trick is to treat the model as a component that will occasionally hand you something weird, and to build the surrounding system so that 'weird' degrades gracefully instead of crashing the experience. You're not eliminating the chaos, you're containing it.
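As a rough illustration of the 'contain it' part, here's a sketch of parsing and validating a model response before anything downstream trusts it. The JSON shape (`title`, `pages`, `text`, `image_prompt`), the function names, and the retry count are all hypothetical and not the app's real schema. `call_model` is any callable that takes the idea and returns raw text.

```python
import json
from typing import Callable, Optional

REQUIRED_PAGES = 3  # one page per image; the page structure itself is assumed


class MalformedStory(ValueError):
    """The model's reply did not match the shape the app expects."""


def parse_story(raw: str) -> dict:
    # Models often wrap JSON in chatter, so keep only the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise MalformedStory("no JSON object found")
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError as exc:
        raise MalformedStory(f"invalid JSON: {exc}") from exc

    title = data.get("title")
    pages = data.get("pages")
    if not isinstance(title, str) or not title.strip():
        raise MalformedStory("missing title")
    if not isinstance(pages, list) or len(pages) != REQUIRED_PAGES:
        raise MalformedStory(f"expected exactly {REQUIRED_PAGES} pages")

    clean_pages = []
    for page in pages:
        if not isinstance(page, dict):
            raise MalformedStory("page is not an object")
        for key in ("text", "image_prompt"):
            value = page.get(key)
            if not isinstance(value, str) or not value.strip():
                raise MalformedStory(f"page is missing {key!r}")
        clean_pages.append(
            {"text": page["text"].strip(), "image_prompt": page["image_prompt"].strip()}
        )
    return {"title": title.strip(), "pages": clean_pages}


def generate_story(
    idea: str, call_model: Callable[[str], str], max_attempts: int = 2
) -> Optional[dict]:
    """Retry a malformed reply, then give up gracefully instead of crashing."""
    for _ in range(max_attempts):
        try:
            return parse_story(call_model(idea))
        except MalformedStory:
            continue
    return None  # caller shows a friendly "try again" message
```

Returning `None` rather than raising is one way to make 'weird' degrade gracefully: the UI gets a clear signal to show a fallback, and nothing malformed ever reaches the image generator.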

What I'd Tell Past Me

The thing I underestimated going in was how much of 'building an AI product' is actually not AI work. The model call is maybe a tenth of the effort. The other nine-tenths is everything around it: a rate limiter so you stay solvent, moderation so the output stays respectable, validation so malformed responses don't break the UI, and a ban list for the inevitable bad actor. The model call gets all the attention, but it's the surrounding engineering that decides whether the thing survives contact with real users.

If you're building something similar, my one piece of advice is to design for the adversarial, broke, worst-case version of your service from day one. Assume someone will abuse it, assume the calls that do cost money will get hammered, assume the model will misbehave, and let those assumptions shape the architecture. It's a lot cheaper than retrofitting them after the first nasty surprise.

Want to see the result? Give it a go. Alternatively, if you'd rather not deal with all that yourself, contact me and I might be available to build it for you ;)
