Note: This is generated from a transcript from one of my YouTube videos


Elevate Your Smart Doorbell: AI Vision Announcements with Home Assistant

I'm thrilled to share a fantastic upgrade that will revolutionize your smart home doorbell experience. We're diving into LLM Vision, a brilliant integration that empowers your Home Assistant to "see" and describe who's at your door using advanced AI vision. Imagine hearing, "A devilishly handsome, definitely not middle-aged man is at the door." You might initially think this sounds like a novelty, but trust me, once you start using it, you'll discover just how surprisingly useful this feature can be for your smart home setup. It can even handle scenarios like, "Ding Dong, they have already gone," or "Ding Dong, check the camera, their face is hidden." Let's get started!

Getting Started: Google Cloud and Gemini API Setup

First things first, we need to configure Gemini in Google Cloud. Don't worry if that sounds daunting; it's quite straightforward. You'll want to navigate to the Google Cloud Console and create a new project specifically for this purpose. Once your project is set up, the next crucial step is to enable the Gemini API. This API is the powerhouse behind your new AI vision capabilities.

Next, you’ll need to obtain an API key from AI Studio. Simply head over to aistudio.google.com/app/apikey. I’ll make sure to include that link in the description below for easy access. Click “Create API Key” and ensure you select the Google Cloud project you just configured. Keep this API key readily available, as we’ll need it very soon.

Installing the LLM Vision Integration

Now for the exciting part: installing the LLM Vision integration itself. You'll acquire this from HACS, the Home Assistant Community Store. I must commend the documentation available at [llmvision.gitbook.io](https://llmvision.gitbook.io) – it's truly excellent, so feel free to explore it for deeper insights. While there's a blueprint available, I opted not to use it because I wanted the system to process a single image rather than a video, which is significantly quicker. The entire process is quite snappy.

First, install it from HACS just like any other custom integration. After installation, give Home Assistant a quick restart. Then, you can install it as a standard integration directly through the Home Assistant interface. When prompted, paste the API key we obtained earlier and select “Google” as your provider. You’ve got this!

Addressing Privacy Concerns

Before we proceed, let's address the elephant in the room: privacy. I know some of you might be thinking, "Hold on, this is sending images to the cloud. What about my family's privacy?" That's a completely valid concern. Since this is a remote LLM and not hosted locally, there's indeed a consideration about images being uploaded to Google's servers.

Here’s how I’ve approached it in my setup: my kids won’t be ringing the doorbell, so it should never capture images of them. Crucially, my automation is configured to trigger only when someone actually rings the doorbell, not merely when there’s motion in front of the camera. This significantly limits what gets sent to the cloud.
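For reference, here's a minimal sketch of what that kind of trigger looks like. It assumes your doorbell exposes the button press as a binary_sensor; the entity name below is hypothetical, and my actual automation (full YAML at the end of this post) uses a device trigger that does the same job:

triggers:
  # Fires only on an actual button press - motion alone never runs the
  # automation, so nothing gets uploaded for people just walking past
  - trigger: state
    entity_id: binary_sensor.front_doorbell_visitor
    to: "on"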

Building the Automation

Alright, let's get into the fun part: integrating this into an automation. I already have a pretty neat automation that captures a picture from my video doorbell and sends a notification to my phone with the image every time someone presses the button. It also announces that the doorbell was rung throughout the house on various Google Home devices, which is super handy if you're upstairs and can't hear the physical doorbell. My goal is to enhance this existing automation so it can describe what it sees.

The magic happens when we introduce a new action called llmvision.image_analyzer, which is provided by the integration we just added. Let me walk you through the configuration that makes this work. We’re utilizing the llmvision.image_analyzer action with some specific parameters that I’ve found to be very effective.

I’ve set max_tokens to 50 to keep the responses concise. A “token” is roughly a word (strictly, a fragment of one), so this caps the description at around 50 words. I’ve also set the temperature to 0.2 for more consistent results. The temperature parameter ranges from 0 to 1; values near 1 encourage more creative, varied responses, while values near 0 yield more precise, consistent output. I did experiment with 1 to try to get more interesting results, but your mileage may vary! For the model, I’m using Gemini 2.0 Flash, primarily because it’s quite capable and, importantly, it’s free.
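Putting those settings together, the analyzer call looks roughly like this (trimmed from the complete automation YAML at the end of this post; the provider value is the config entry ID that LLM Vision created when you pasted in your API key):

- action: llmvision.image_analyzer
  data:
    image_file: /config/www/reolink_snapshot/last_snapshot_doorbell.jpg
    model: gemini-2.0-flash
    max_tokens: 50      # keeps the announcement short
    temperature: 0.2    # low = consistent, predictable wording
    provider: 01JW3EN2J5JB2HVCN7DH6H8269
    message: <your prompt goes here - see the next section>
  response_variable: vision_result  # reused by the notify and TTS steps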

Crafting the AI Prompt: Prompt Engineering

Here's where it gets truly fascinating: prompt engineering. This is where you instruct the AI exactly what to look for and how to respond. I've carefully crafted a prompt that handles various common scenarios quite well. This is the exact prompt I use, and it has given me excellent results:
This is a picture from a video doorbell.
If there is nobody there say "they have already gone" and nothing else.
Otherwise, if they are walking away say "they have already gone" and nothing else.
Otherwise, if their face is hidden say "Check the camera, their face is hidden." and nothing else.
Otherwise, if none of the previous are true, say "A {adjective} {man/woman} who looks like {famous actor} holding {what they are holding} is at the door"

That last branch, comparing the visitor to a famous actor, I thought was quite fun.

The reason for being so explicit with the prompt, including phrases like “and nothing else,” is to prevent the AI from generating multiple, lengthy responses. I particularly love this approach because it effectively covers the most common situations you’ll encounter. Sometimes people ring the bell and immediately walk away, perhaps dropping off a package or simply in a hurry. Other times, you might have someone whose face is obscured by a hood or mask, which is definitely valuable information to have before you open your front door.

The Results and Customization

The results have been surprisingly accurate and, I must admit, quite entertaining! I tweaked the prompt a little after it consistently told me there was a "middle-aged man" at the door during my testing, and now it frequently tells me I look like Matt Damon. Brilliant! It has quickly become one of my favorite subtle enhancements in my smart home setup. Instead of a generic "doorbell pressed" announcement, you receive this wonderfully descriptive message that provides useful context about who's there. For example, "A man who wants you to subscribe is waving a sign around."

If you’re considering implementing this yourself, I highly recommend starting with a basic setup and then fine-tuning the prompts to align with your specific needs. Perhaps you desire more detailed descriptions, or maybe you prefer to focus on particular elements like packages or uniforms. The possibilities for customization are vast!
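To illustrate, here's an untested package-focused variant you could start from, using the same structure as my prompt with different branches:

This is a picture from a video doorbell.
If there is a package visible and nobody there, say "A package has been delivered" and nothing else.
Otherwise, if the person is wearing a courier uniform, say "A delivery driver is at the door" and nothing else.
Otherwise, say "Someone is at the door".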

What do you think? Are you going to give this a try in your own smart home setup? Let me know in the comments below what creative prompts you come up with! If you found this helpful, as always, don’t forget to give it a thumbs up and subscribe for more smart home content. Thanks for watching, and I’ll see you in the next one!

Links:

I’m using a Reolink PoE video doorbell, which I highly recommend: https://amzn.to/4knsj8D

Here’s a link to Google AI Studio: https://aistudio.google.com/app/apikey

This is the LLM Vision integration documentation: https://llmvision.gitbook.io/getting-started

This is the prompt that I used to get good results:

This is a picture from a video doorbell.

If there is nobody there say "they have already gone" and nothing else.
Otherwise, if they are walking away say "they have already gone" and nothing else.
Otherwise, if their face is hidden say "Check the camera, their face is hidden." and nothing else.
Otherwise, if none of the previous are true, say "A {adjective} {man/woman} who looks like {famous actor} holding {what they are holding} is at the door"

And here’s the complete YAML of the automation:

alias: Notifications - Doorbell
description: ""
triggers:
  - type: turned_on
    device_id: 60f0f7dba756a82ed054ec8200829078
    entity_id: bd8e8bfd2b39fb7832bbc6cdd5df0fb3
    domain: binary_sensor
    trigger: device
conditions: []
actions:
  # Wait half a second before taking the snapshot
  - delay:
      hours: 0
      minutes: 0
      seconds: 0
      milliseconds: 500
  # Save a still from the doorbell camera (files in /config/www are served at /local/)
  - action: camera.snapshot
    metadata: {}
    data:
      filename: /config/www/reolink_snapshot/last_snapshot_doorbell.jpg
    target:
      entity_id: camera.reolink_video_doorbell_poe_fluent
    enabled: true
  # Send the snapshot to Gemini; the description comes back in vision_result
  - action: llmvision.image_analyzer
    data:
      include_filename: false
      max_tokens: 50
      provider: 01JW3EN2J5JB2HVCN7DH6H8269
      image_file: /config/www/reolink_snapshot/last_snapshot_doorbell.jpg
      model: gemini-2.0-flash
      message: This is a picture from a video doorbell.


        If there is nobody there say "they have already gone" and nothing else.

        Otherwise, if they are walking away say "they have already gone" and
        nothing else.

        Otherwise, if their face is hidden say "Check the camera, their face is
        hidden." and nothing else.

        Otherwise, if none of the previous are true, say "A {adjective}
        {man/woman} who looks like {famous actor} holding {what they are
        holding} is at the door"
      temperature: 0.2
    response_variable: vision_result
  - action: notify.mobile_app_pixel_9_pro
    metadata: {}
    data:
      title: Doorbell
      message: "{{ vision_result.response_text }}"
      data:
        image: /local/reolink_snapshot/last_snapshot_doorbell.jpg
        clickAction: app://com.mcu.reolink
    enabled: true
    alias: Notify Ben's phone
  - action: tts.speak
    target:
      entity_id: tts.home_assistant_cloud
    data:
      cache: true
      media_player_entity_id: media_player.kitchen
      message: Ding dong. {{ vision_result.response_text }}
    enabled: true
    alias: Notify kitchen speaker
mode: single

Video

You can watch the full video on YouTube here:

Support me to keep making videos

Ko-Fi

If you like the work I’m doing, please drop a like on the video, or consider subscribing to the channel.

In case you’re in a particularly generous mood, you can fund my next cup of coffee over on Ko-Fi

The links from some of my videos are affiliate links, which means I get a small kickback at no extra cost to you. It just means that the affiliate knows the traffic came from me.
