Hamster AI generating a local AI summary

Artificial Intelligence is everywhere – except your app, right?

The good news is that it’s never been easier to integrate AI features. The bad news is that, depending on which features you add and how many of your users use them, it can get very expensive if you rely on models from vendors like Anthropic, Google, or OpenAI, or on model providers like AWS Bedrock or Azure.

Depending on how personalized your AI-generated content needs to be, you may be able to use techniques like prompt optimization and caching to reduce costs. If the same content works for many users, there is no sense in generating it multiple times.

But let’s say you’ve got thousands of users, your AI features need to be personalized, and you’ve determined it would be prohibitively expensive to use cloud model providers. Fortunately, there are a number of open-source models that can run locally on a device for free, and both Apple and Google now include models directly in their operating systems. As long as your prompts and content fit in the local model’s context window, you can have all the AI features you want without paying a dime.

Now your focus shifts from managing costs to optimizing for device constraints and performance. You’ll want to find the sweet spot: cool AI features that your users will love, delivered with a good user experience that doesn’t kill their battery.

Computing at the edge

Modern computing happens either in the cloud or at the edge on a device.

Normally, apps with AI features use models from cloud providers like OpenAI or Google. Cloud resources are effectively infinite, so the sky is the limit for what you can do, and the constraints are cost and connectivity.

At the edge, however, internet connectivity and cost don’t matter; the main constraint is physics. You only have so much memory and processing power. You can still build cool AI features, but you’ll need to be mindful of battery and heat when running on a device like a phone or tablet.

One straightforward approach is to do as little concurrent processing as possible, but coordinate it in a way that feels like the app is doing a lot. A smart queueing system with intelligent prioritization can do this. If all your processing runs at background priority by default and any user-initiated action jumps to the front of the queue, the user will perceive a highly responsive, AI-heavy app.

This is how the newest version of my app works.

Optimizing for device constraints

The latest version of my Hamster Soup app leans heavily on local AI. When I added AI summarization features a couple of years ago, I used serverless cloud functions and OpenAI cloud models, with some smart caching in the cloud so I wasn’t paying to generate the same content over and over for every user. The cache would generate up to five variations of a summary and then send a random variation to each subsequent user. Last year, I added local Apple Foundation Model support for devices that could support it, but the user could still choose to use the cloud model. Now, the app only uses on-device AI, and will continue to do so for the foreseeable future.
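The variation cache described above can be sketched roughly as follows. This is an illustration only, not the original cloud function: `VariationCache`, `summary(for:generate:)`, and the in-memory dictionary are hypothetical names, and the sketch assumes the pool fills one request at a time until it holds five entries.

```swift
/// Hypothetical sketch: keeps up to `maxVariations` generated summaries per article
/// and serves a random cached one once the pool is full.
struct VariationCache {
    let maxVariations: Int
    private var pools: [String: [String]] = [:]

    init(maxVariations: Int = 5) {
        self.maxVariations = maxVariations
    }

    /// Returns a summary for `articleID`, calling `generate` only while the pool is not yet full.
    mutating func summary(for articleID: String, generate: () -> String) -> String {
        var pool = pools[articleID, default: []]
        if pool.count >= maxVariations, let cached = pool.randomElement() {
            return cached            // pool is full: no model call, no extra cost
        }
        let fresh = generate()       // assumption: the expensive model call happens here
        pool.append(fresh)
        pools[articleID] = pool
        return fresh
    }
}
```

With the default limit, the first five requests for an article pay for generation and every later request is served from the pool.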

I also significantly increased the number of AI features in the app. To the user, many of them appear to be working at the same time, even though on today’s hardware the on-device model can process only one request at a time across the entire device. (If another app or the OS is using the model, your app will have to wait.)

Hamster Soup app summarizes a news article

And yet by using in-progress states, the app can appear to be generating as many as 20 AI summaries at once when you first sync all the news feeds, while being responsive and switching priorities when the user wants to generate a new summary intentionally. Fortunately, Apple gave developers a lot of tools to optimize for device constraints and smooth performance, while providing a good user experience.
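One way to picture the in-progress states mentioned above is a small per-article state enum. This is a minimal sketch under assumptions: `SummaryState`, `showsProgress`, and `statesAfterSync` are hypothetical names, not taken from the app, and article IDs are assumed to be unique.

```swift
/// Hypothetical per-article summary state that drives the UI.
enum SummaryState: Equatable {
    case queued          // waiting for its turn; the UI can already show a spinner
    case generating      // the single job the on-device model is actually running
    case ready(String)   // finished summary text
    case failed
}

/// True when the UI should show an in-progress indicator for this state.
func showsProgress(_ state: SummaryState) -> Bool {
    switch state {
    case .queued, .generating: return true
    case .ready, .failed: return false
    }
}

/// Marks every article as queued, then marks the first one as the job currently running.
/// Assumes `articleIDs` contains no duplicates.
func statesAfterSync(articleIDs: [String]) -> [String: SummaryState] {
    var states = Dictionary(uniqueKeysWithValues: articleIDs.map { ($0, SummaryState.queued) })
    if let first = articleIDs.first {
        states[first] = .generating
    }
    return states
}
```

With 20 article IDs, all 20 entries report `showsProgress` as true while only one is in the `generating` state, which is the effect described above.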

With Swift 6, Apple added a stricter concurrency model that catches data races at compile time. That lets us safely serialize and reschedule heavy on-device tasks like AI model inference without legacy-style locking for thread safety or manual GCD dispatching. You can also observe the device’s battery and thermal states, so your app logic can act responsibly instead of burning down your user’s battery or burning up their device.
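As a rough illustration of observing those states, the sketch below listens for Foundation’s thermal and power-state notifications and recomputes a pause decision. `shouldPauseBackgroundWork` and `DeviceStateObserver` are hypothetical names, and the rule itself (pause in Low Power Mode or at `.serious` and above) is an assumption, not the app’s actual policy.

```swift
import Foundation

/// Hypothetical rule: pause background AI work in Low Power Mode or under heavy thermal pressure.
func shouldPauseBackgroundWork(
    thermalState: ProcessInfo.ThermalState,
    isLowPowerModeEnabled: Bool
) -> Bool {
    isLowPowerModeEnabled || thermalState == .serious || thermalState == .critical
}

/// Hypothetical observer that reports a fresh decision whenever the thermal or power state changes.
final class DeviceStateObserver {
    private var tokens: [NSObjectProtocol] = []

    init(onChange: @escaping @Sendable (Bool) -> Void) {
        let names: [Notification.Name] = [
            ProcessInfo.thermalStateDidChangeNotification,
            .NSProcessInfoPowerStateDidChange,
        ]
        for name in names {
            let token = NotificationCenter.default.addObserver(
                forName: name, object: nil, queue: .main
            ) { _ in
                // Re-read the current state each time instead of caching it.
                let info = ProcessInfo.processInfo
                onChange(shouldPauseBackgroundWork(
                    thermalState: info.thermalState,
                    isLowPowerModeEnabled: info.isLowPowerModeEnabled
                ))
            }
            tokens.append(token)
        }
    }

    deinit {
        tokens.forEach { NotificationCenter.default.removeObserver($0) }
    }
}
```

Keeping the decision in a pure function makes it easy to unit test without heating up a real device.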

First, you need to prioritize workloads. Everything runs at background priority, except work the user explicitly asks for.

/// Priority levels for heavy tasks, determining queue precedence and workload gating.
public enum WorkPriority: Int, Comparable, Sendable {
    /// Low priority tasks run in the background (e.g. pre-generating suggestions, cache pre-warming, indexing).
    case background = 0
    
    /// High priority tasks initiated directly by user interaction (e.g. user taps "Export Now").
    case userInitiated = 1

    public static func < (lhs: WorkPriority, rhs: WorkPriority) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

Next, you’ll want a policy for your workloads. This policy will not only decide whether a job can be scheduled in the queue, but also how long to wait before a job can run, so the logic can adapt to the device’s battery and thermal state.

import Foundation

/// Thermal-, power-, and backlog-aware pacing for resource-intensive operations (see Apple `ProcessInfo.thermalState`).
public final class WorkloadPolicy: Sendable {
    public struct Configuration: Sendable {
        public let maxPendingBackgroundJobs: Int
        public let maxBackgroundJobsPerForegroundSession: Int
        public let baseNominalDelaySeconds: Double
        public let baseFairDelaySeconds: Double
        public let baseSeriousDelaySeconds: Double
        public let baseCriticalDelaySeconds: Double
    }

    public let config: Configuration

    /// Gates whether a new background task can be enqueued based on the current workload backlog, session budget, and system thermal/power state.
    /// - Parameters:
    ///   - pendingBackgroundJobs: The number of background jobs currently waiting in the queue.
    ///   - backgroundJobsScheduledThisSession: The total number of background jobs scheduled in the current foreground session.
    ///   - thermalState: The active device thermal state. Defaults to `ProcessInfo.processInfo.thermalState`.
    ///   - isLowPowerModeEnabled: A boolean flag indicating whether the user's device is in Low Power Mode. Defaults to `ProcessInfo.processInfo.isLowPowerModeEnabled`.
    /// - Returns: A boolean indicating whether the task can be safely enqueued.
    public func canEnqueueBackgroundWork(
        pendingBackgroundJobs: Int,
        backgroundJobsScheduledThisSession: Int,
        thermalState: ProcessInfo.ThermalState = ProcessInfo.processInfo.thermalState,
        isLowPowerModeEnabled: Bool = ProcessInfo.processInfo.isLowPowerModeEnabled
    ) -> Bool {

        // ...

    }

    /// Computes the safety cooldown delay required before starting the next job, giving the SoC time to cool down between heavy runs.
    /// - Parameters:
    ///   - priority: The priority of the next job to execute. User-initiated jobs have shorter cooldowns.
    ///   - pendingBackgroundJobs: The backlog count of background tasks waiting in the queue.
    ///   - thermalState: The active device thermal state. Defaults to `ProcessInfo.processInfo.thermalState`.
    ///   - isLowPowerModeEnabled: A boolean flag indicating whether the device is in Low Power Mode. Defaults to `ProcessInfo.processInfo.isLowPowerModeEnabled`.
    /// - Returns: A `Duration` indicating how long the worker should sleep.
    public func interJobDelay(
        for priority: WorkPriority,
        pendingBackgroundJobs: Int,
        thermalState: ProcessInfo.ThermalState = ProcessInfo.processInfo.thermalState,
        isLowPowerModeEnabled: Bool = ProcessInfo.processInfo.isLowPowerModeEnabled
    ) -> Duration {

        // ...

    }
}
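The method bodies above are elided, so the following is only a guess at the kind of logic such a policy could contain, written as free functions to keep it separate from the real `WorkloadPolicy`. Every name (`SketchDelays`, `sketchCanEnqueue`, `sketchInterJobDelay`), every numeric value, and the Low Power Mode and user-initiated multipliers are assumptions for illustration. `Duration` requires iOS 16 or macOS 13 and later.

```swift
import Foundation

/// Hypothetical base delays in seconds per thermal state (illustrative values only).
struct SketchDelays {
    var nominal = 0.5
    var fair = 2.0
    var serious = 8.0
    var critical = 30.0
}

/// Assumed gating rule: refuse background work under thermal pressure,
/// or when the backlog or the per-session budget is exhausted.
func sketchCanEnqueue(
    pending: Int,
    scheduledThisSession: Int,
    maxPending: Int,
    maxPerSession: Int,
    thermalState: ProcessInfo.ThermalState,
    isLowPowerModeEnabled: Bool
) -> Bool {
    if thermalState == .serious || thermalState == .critical { return false }
    // Assumption: Low Power Mode halves the allowed backlog.
    let pendingLimit = isLowPowerModeEnabled ? maxPending / 2 : maxPending
    return pending < pendingLimit && scheduledThisSession < maxPerSession
}

/// Assumed cooldown rule: hotter device means longer delay, Low Power Mode doubles it,
/// and user-initiated work waits a quarter as long.
func sketchInterJobDelay(
    isUserInitiated: Bool,
    thermalState: ProcessInfo.ThermalState,
    isLowPowerModeEnabled: Bool,
    delays: SketchDelays = SketchDelays()
) -> Duration {
    var seconds: Double
    switch thermalState {
    case .nominal: seconds = delays.nominal
    case .fair: seconds = delays.fair
    case .serious: seconds = delays.serious
    case .critical: seconds = delays.critical
    @unknown default: seconds = delays.critical   // treat unknown states conservatively
    }
    if isLowPowerModeEnabled { seconds *= 2 }
    if isUserInitiated { seconds /= 4 }
    return .seconds(seconds)
}
```

Passing the thermal and power state in as parameters, as the original signatures also do, keeps the policy deterministic and testable.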

And finally, you’ll want a job queue that lets the app manage jobs, prioritize user-initiated tasks, and tie it all together.

Hamster Soup uses this exact code to manage its AI summarization tasks. They run entirely on the device using Apple’s Foundation Models framework, backed by a roughly 3-billion-parameter LLM built into every device that supports Apple Intelligence. As of June 2026, this model can run one workload at a time on mobile devices with a 4K-token context window, perfect for summarizing all the news articles and 4-hour blocks of live updates once the Big Brother season gets going. Because the queue respects device battery and thermals and adjusts itself accordingly, a typical iPhone powers through a bunch of Big Brother updates without breaking a sweat.
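To make the queue idea concrete, here is a minimal sketch of its core ordering rule. It is not the code from the app or the gist: `SketchPriority`, `PendingJobs`, and `SketchJobQueue` are hypothetical names, the priority enum is redefined only so the block stands alone, and the cooldown between jobs is left as a comment.

```swift
/// Hypothetical stand-in for the article's `WorkPriority`, redefined so the sketch is self-contained.
enum SketchPriority: Int, Comparable, Sendable {
    case background = 0
    case userInitiated = 1
    static func < (lhs: Self, rhs: Self) -> Bool { lhs.rawValue < rhs.rawValue }
}

/// Pure ordering logic: highest priority first, oldest first within the same priority.
struct PendingJobs<Payload> {
    private var items: [(seq: Int, priority: SketchPriority, payload: Payload)] = []
    private var nextSeq = 0

    var count: Int { items.count }

    mutating func push(_ payload: Payload, priority: SketchPriority) {
        items.append((seq: nextSeq, priority: priority, payload: payload))
        nextSeq += 1
    }

    mutating func popNext() -> Payload? {
        let snapshot = items
        guard let index = snapshot.indices.min(by: { lhs, rhs in
            let (a, b) = (snapshot[lhs], snapshot[rhs])
            if a.priority != b.priority { return a.priority > b.priority }
            return a.seq < b.seq
        }) else { return nil }
        return items.remove(at: index).payload
    }
}

/// Actor wrapper: runs one job at a time, so user-initiated work enqueued
/// while a job is running is picked up next, ahead of waiting background jobs.
actor SketchJobQueue {
    typealias Work = @Sendable () async -> Void
    private var pending = PendingJobs<Work>()
    private var isDraining = false

    func enqueue(priority: SketchPriority, _ work: @escaping Work) {
        pending.push(work, priority: priority)
        guard !isDraining else { return }
        isDraining = true
        Task { await self.drain() }
    }

    private func drain() async {
        while let work = pending.popNext() {
            await work()
            // A real queue would sleep here for a policy-computed cooldown.
        }
        isDraining = false
    }
}
```

Splitting the ordering into a plain struct keeps it testable without any concurrency, while the actor provides the serialization that Swift’s concurrency model checks at compile time.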

Hey – does that mean I have room to add even more AI-based features?!

For the complete implementation, see this gist on GitHub.