Making of Símarómur for iOS – Part 2/3

Part 2: The Rules Apple Doesn’t Document

Audio Unit Extensions (AUE) come with serious restrictions. But unlike classic embedded systems, these aren’t about what the iPhone can handle – they’re about what Apple allows.

The frustrating part? Only a fraction of these rules is documented.

No Network Access

The most obvious restriction – and the only one Apple actually documents – is that TTS Audio Unit Extensions cannot access the internet.

For us, this was never a problem. Símarómur has always been fully offline, on Android too. That’s the whole point. But for the big cloud-based TTS providers like ElevenLabs, this slams the door shut. If your business model depends on server-side synthesis, you’re not offering an iOS system voice.

Storage Is a One-Way Street

Communication between the Containing App and the embedded AUE flows in only one direction. The Containing App can push data to the AUE through the App Group Directory – say, to install a new voice.
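For illustration, the push direction might look roughly like the Swift sketch below. This is not Símarómur's actual code: the `Voices` subfolder, the function names, and whatever group identifier you pass in are assumptions made up for this example. The only platform call it relies on is `FileManager.containerURL(forSecurityApplicationGroupIdentifier:)`.

```swift
import Foundation

enum VoiceInstallError: Error {
    case containerUnavailable
}

/// Copies a voice file into a "Voices" subfolder of the given shared directory.
/// The subfolder name is a hypothetical choice for this sketch.
func installVoice(from source: URL, into sharedDirectory: URL) throws -> URL {
    let fm = FileManager.default
    let voicesDir = sharedDirectory.appendingPathComponent("Voices", isDirectory: true)
    try fm.createDirectory(at: voicesDir, withIntermediateDirectories: true)

    let destination = voicesDir.appendingPathComponent(source.lastPathComponent)
    // Replace an older copy of the same voice, if one exists.
    if fm.fileExists(atPath: destination.path) {
        try fm.removeItem(at: destination)
    }
    try fm.copyItem(at: source, to: destination)
    return destination
}

#if canImport(Darwin)
/// Containing App side: resolve the App Group container, then install.
/// `groupID` must match an App Group enabled for both the app and the extension.
func installVoiceInAppGroup(from source: URL, groupID: String) throws -> URL {
    guard let container = FileManager.default.containerURL(
        forSecurityApplicationGroupIdentifier: groupID) else {
        throw VoiceInstallError.containerUnavailable
    }
    return try installVoice(from: source, into: container)
}
#endif
```

The Containing App would call something like `installVoiceInAppGroup(from: downloadedFile, groupID: "group.com.example.tts")`, where the identifier is a placeholder. The extension would then only ever read from that folder.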

Going the other way? Blocked. The AUE can’t write to that shared directory. Same story with Secure Storage and the Keychain.

What this means in practice:

No log files, no diagnostics
No usage statistics – not even “most common word missing from the dictionary”
No way to generate cryptographic keys and hand them off

For improving the product based on real-world data, this is a serious handicap.

Death Without Warning

Normal iOS apps have a well-defined lifecycle. They get notified when they move to the background, come back to the foreground, or are about to be terminated. Time to clean up, free memory, save state.

AUEs get none of that – at least nothing related to termination. When memory pressure spikes, the system kills your extension. Hard. No warning, no chance to react.

60 Megabytes

The biggest constraint – and the one that shapes everything – is RAM. The memory available to an AUE at runtime is drastically restricted. How much exactly? Not documented. Is there an API to check how much you have left? No.

We asked Apple directly. They declined to answer.

From experience, the limits are:

80 MB in the foreground
60 MB in the background

These numbers are stunning, for several reasons.

Modern TTS models are often ten times that size. Even Apple’s own downloadable voices run five to eight times larger. Apple clearly knows 60 MB is unrealistically tight – they don’t hold themselves to it.

And it gets worse: a model doesn’t just need to fit in RAM – it needs extra headroom to actually run. The VITS model we use for Android Símarómur needs over 500 MB just for inference.
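To make the arithmetic concrete, here is a small Swift sketch of the kind of budget check this implies. The 80 MB and 60 MB figures are the observed limits quoted above, not documented values. The function names, the use of binary megabytes, and the 20 MB + 30 MB example model are assumptions for illustration only.

```swift
/// Converts megabytes to bytes. Assumes binary megabytes (1 MB = 1,048,576 bytes).
func mb(_ n: Int) -> Int {
    return n * 1_048_576
}

/// Observed (undocumented) memory limits, taken from the numbers in this article.
enum ExtensionState {
    case foreground
    case background

    var limitBytes: Int {
        switch self {
        case .foreground: return mb(80)
        case .background: return mb(60)
        }
    }
}

/// Peak memory is the model weights plus the scratch memory inference needs on top.
func peakBytes(weights: Int, scratch: Int) -> Int {
    return weights + scratch
}

/// True if the peak footprint stays within the limit for the given state.
func fits(peak: Int, state: ExtensionState) -> Bool {
    return peak <= state.limitBytes
}

// The VITS model mentioned above needs over 500 MB for inference.
let vitsFits = fits(peak: mb(500), state: .background)

// A hypothetical small model: 20 MB of weights plus 30 MB of scratch memory.
let smallPeak = peakBytes(weights: mb(20), scratch: mb(30))
let smallFits = fits(peak: smallPeak, state: .background)

print(vitsFits, smallFits) // prints "false true"
```

The useful habit is to budget against the background limit, since that is the lower of the two and the extension cannot count on staying in the foreground.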

60 MB. In over 25 years of embedded work, I can’t remember ever shipping something with constraints this tight, other than software for simple microcontrollers.

Next: Welcome to the Desert →