Some Ring doorbells can use AI features to interact with visitors when you’re not home. I ditched my Ring doorbell for a Reolink doorbell that runs fully locally, but I wondered if I could recreate a similar feature using a local LLM. I was partially successful.
What I wanted my doorbell to do
An AI-powered concierge
The idea seemed fairly plausible. When someone rings the doorbell and Home Assistant detects that no one is home, the doorbell should speak to the caller explaining that everyone is out and asking for their name and reason for calling. It should then listen for the response, process what they say, and respond accordingly.
With a cloud-based LLM, this would be a realistic goal. Converting text to speech and speech to text is simple enough using cloud-based services. An LLM would sit in the middle, taking what the caller said as the input and generating responses to be spoken by the doorbell.
I knew that doing this with a local LLM would be more challenging. My relatively weak hardware can only run smaller models, and these might not be up to the job. I figured it was worth a try to see whether I could get it all running locally.
How I set it up
TTS out, Whisper in, Ollama in the middle
There were three main components that I needed to make this work. I needed a way to transform text to speech (TTS) so that my doorbell could speak aloud to the caller. I needed a way to transform speech to text (STT) so that whatever the caller said could be converted into written text to pass to the LLM. And I needed a way to run a local LLM that would be the brains of the whole operation.
Thankfully, Home Assistant has some great options for each of these components. Piper is a local TTS engine that can turn written text into spoken audio that I can play through my doorbell. It runs entirely locally and is lightweight enough that you can run it on a Raspberry Pi 4.
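To give a flavour of what that piece looks like, here's a minimal Python sketch that asks Piper to speak a line through a media player using Home Assistant's REST API and the tts.speak service. The host, access token, and entity IDs are placeholders standing in for whatever your own setup uses.

```python
# Minimal sketch: ask Piper (via Home Assistant's tts.speak service) to play
# a spoken message through the doorbell's speaker. The host, token, and
# entity IDs below are placeholders, not the ones from my actual setup.
import requests

HA_URL = "http://homeassistant.local:8123"      # assumed Home Assistant address
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"          # created under your HA user profile

def speak(message: str) -> None:
    """Call the tts.speak service so Piper renders the text and plays it."""
    requests.post(
        f"{HA_URL}/api/services/tts/speak",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "entity_id": "tts.piper",                           # Piper TTS entity
            "media_player_entity_id": "media_player.doorbell",  # doorbell speaker
            "message": message,
        },
        timeout=10,
    )

speak("Hello! Nobody is home right now. Can I take a message?")
```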
Whisper provides the equivalent local STT component. It can take the audio recorded by my doorbell when the caller is speaking and convert it into text that I can pass to the local LLM. Once again, it runs entirely locally, which was my aim for this project.
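As a rough standalone illustration of that step, the snippet below transcribes a recorded clip with the faster-whisper library, the same engine the Home Assistant Whisper add-on wraps. The model size and file name are just examples; a tiny model keeps things responsive on weak hardware.

```python
# Rough sketch of the speech-to-text step: transcribe the clip the doorbell
# recorded and join the segments into one string to hand to the LLM.
from faster_whisper import WhisperModel

# "tiny" with int8 quantisation is about as light as it gets; accuracy suffers.
model = WhisperModel("tiny", device="cpu", compute_type="int8")

segments, info = model.transcribe("caller_response.wav")  # clip from the doorbell
caller_text = " ".join(segment.text.strip() for segment in segments)
print(caller_text)  # this is what gets passed to the LLM
```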
The final piece of the puzzle is Ollama. This is a tool that allows you to run local large language models on your own hardware. There’s a Home Assistant integration that you can use to connect Ollama to Home Assistant.
The bottleneck is the capability of the LLM you run. Weaker hardware can only run smaller, less capable models, and the larger the model you try to run, the slower the responses are likely to be. I had to use a fairly small model to ensure that it didn't take too long to generate responses.
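Ollama exposes a simple HTTP API on the local machine, which is essentially what the Home Assistant integration talks to. Here's a hedged sketch of that step in isolation; the model name and prompt wording are examples rather than anything from my actual configuration.

```python
# Sketch of the LLM step: send the transcribed caller message to a small
# model running in Ollama and get a one-line reply back. The model name and
# prompt are assumptions; pick whatever your hardware can handle.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def ask_doorbell_llm(caller_text: str) -> str:
    prompt = (
        "You are a polite doorbell concierge. Nobody is home. "
        f"The visitor said: '{caller_text}'. Reply in one short sentence."
    )
    response = requests.post(
        OLLAMA_URL,
        json={"model": "llama3.2:3b", "prompt": prompt, "stream": False},
        timeout=120,  # small models on weak hardware can still be slow
    )
    return response.json()["response"]

print(ask_doorbell_llm("Hi, it's the courier. I have a parcel that needs a signature."))
```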
Reality didn’t match my hopes
The concept is fine, the execution isn’t
It took me some time to get everything set up. As always with Home Assistant, other people had done most of the hard work; there was a useful GitHub Gist explaining how to play audio and TTS through my Reolink doorbell, which came in very handy.
I had some issues with the audio capture starting while the spoken greeting from the doorbell was still playing, which messed things up, but I eventually figured out how to work around it.
The first parts of my idea worked well. When the doorbell was pressed, the LLM would generate a spoken greeting which would play through the doorbell speaker. It would explain that everyone was out and ask the caller for their name and the purpose of their call.
The doorbell would then record their spoken response and STT would turn it into text. So far, so good.
The problem was that trying to have a two-way conversation with the AI-powered doorbell just didn’t work. The small LLM would get confused and start talking nonsense, and the responses would take too long to come through.
It seems likely that the concept would work much better with a powerful enough LLM running the show. Until I win the lottery, however, I’m stuck with what I’ve got.
I built a workable alternative
It’s actually a pretty solid setup
Since the main sticking point was trying to have a conversation with the caller, I simply cut out that part of the process. Instead, when the caller gives their name and reason for calling, the STT turns this into text, and that text is then sent as a notification to my phone. The doorbell then says that it will pass on the message and ends the conversation.
It means that whenever someone rings the doorbell when we’re out, I get a notification telling me who it was and why they were calling. It works reasonably well most of the time, with the occasional slightly hilarious notification appearing when things go wrong. For the most part, however, it’s a genuinely useful feature.
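For the curious, the forwarding step itself is nothing fancy. Something along the lines of the sketch below sends the transcript to the Home Assistant companion app on my phone; the notify service name depends on your device, so mobile_app_my_phone here is just a placeholder.

```python
# Rough sketch of the final step: push the transcribed doorbell message to a
# phone via Home Assistant's companion-app notify service. The host, token,
# and service name are placeholders.
import requests

HA_URL = "http://homeassistant.local:8123"   # assumed Home Assistant address
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

def notify_phone(transcript: str) -> None:
    requests.post(
        f"{HA_URL}/api/services/notify/mobile_app_my_phone",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "title": "Doorbell message",
            "message": transcript,  # the caller's name and reason for calling
        },
        timeout=10,
    )

notify_phone("It's Dave from next door. Your parcel is at his place.")
```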
This is the direction the world is going in
The trend now is for AI in all the things, and it shows no sign of slowing down any time soon. While Ring's AI-powered concierge is useful, the company doesn't have the best reputation for privacy. The good news is that it's possible to recreate at least parts of these features completely locally with a little effort.


