Building a Real-Time Video Chat with Gemini 2.0, Gradio, and WebRTC 👀👂

Community Article Published January 13, 2025

In December 2024, Google released Gemini 2.0, a complete overhaul of its flagship AI model. One of the coolest new features is the ability to have natural, human-like video chat conversations with Gemini via the Multimodal Live API. In this tutorial, we'll build a web application in Python that lets you have real-time video chats with Gemini.

Our application will enable:

  • Real-time video chat with Gemini using your webcam
  • Real-time audio streaming for natural conversations
  • Optional image upload capabilities
  • A clean, user-friendly interface

Prerequisites

  • Basic Python knowledge
  • A Google Cloud account with a Gemini API key (instructions: https://support.google.com/googleapi/answer/6158862?hl=en)
  • The following Python packages:
pip install gradio-webrtc==0.0.28 google-genai==0.3.0

Our application will be built with Gradio, a framework for building AI-powered web applications entirely in Python. Gradio will handle all the UI elements for us so we can just focus on the core logic of the application. The gradio-webrtc package is an extension of Gradio that enables low-latency audio/video streams via WebRTC, a real-time communication protocol.

The google-genai package is Google's official Gen AI SDK for interacting with Gemini. It provides the genai.Client class used in the code below.

Implementing the Audio Video Handler

The core of our application is the GeminiHandler class which will set up the audio/video stream to the Gemini server. Let's break down the implementation into parts.

Class Constructor and Copy Method

The class constructor creates the variables we need to handle the video session: queues for storing the audio and video output, an asyncio.Event that signals when the chat should end, and a session variable that stores the connection to the Gemini server.

The copy method ensures that each user has their own unique stream handler.

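import asyncio
import time

import numpy as np
from google import genai  # provided by the google-genai package
from gradio_webrtc import (
    AsyncAudioVideoStreamHandler,
    AudioEmitType,
    VideoEmitType,
    async_aggregate_bytes_to_16bit,
)

class GeminiHandler(AsyncAudioVideoStreamHandler):
    def __init__(
        self, expected_layout="mono", output_sample_rate=24000, output_frame_size=480
    ) -> None:
        super().__init__(
            expected_layout,
            output_sample_rate,
            output_frame_size,
            input_sample_rate=16000,
        )
        # Audio frames from Gemini waiting to be played back to the user
        self.audio_queue = asyncio.Queue()
        # Webcam frames waiting to be shown back to the user
        self.video_queue = asyncio.Queue()
        # Set when the chat ends so the Gemini connection can close
        self.quit = asyncio.Event()
        # Live connection to Gemini; None until connect() runs
        self.session = None
        # Timestamp of the last webcam frame sent to Gemini
        self.last_frame_time = 0

    def copy(self) -> "GeminiHandler":
        """Copy gets called whenever a new user connects to the server.
        This ensures that each user has an independent handler.
        """
        return GeminiHandler(
            expected_layout=self.expected_layout,
            output_sample_rate=self.output_sample_rate,
            output_frame_size=self.output_frame_size,
        )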
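
To see why copy matters, here is a minimal standard-library sketch. SketchHandler is a hypothetical stand-in for GeminiHandler, not part of gradio-webrtc. It only illustrates that each copy carries the same configuration but gets its own queue, so one user's audio never lands in another user's stream.

```python
import asyncio


class SketchHandler:
    """Hypothetical stand-in for GeminiHandler, used only for illustration."""

    def __init__(self, output_sample_rate: int = 24000) -> None:
        self.output_sample_rate = output_sample_rate
        # Every instance creates its own queue in the constructor
        self.audio_queue: asyncio.Queue = asyncio.Queue()

    def copy(self) -> "SketchHandler":
        # Reuse the configuration, but build fresh per-user state
        return SketchHandler(output_sample_rate=self.output_sample_rate)


prototype = SketchHandler()
alice = prototype.copy()
bob = prototype.copy()

# Audio queued for one user is invisible to the other
alice.audio_queue.put_nowait(b"audio for alice")
print(alice.audio_queue.qsize(), bob.audio_queue.qsize())
```

The handler instance passed to the UI later in this tutorial plays the role of prototype here.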

Audio Processing

The audio processing is handled by the emit and receive methods. The receive method is called whenever a new audio frame is received from the user and emit returns Gemini's next audio frame.

In the emit method, we connect to the Gemini API by calling the connect method in the background: asyncio.create_task schedules the coroutine to run without blocking emit. The Gemini Python library uses a context manager to open and close a connection. We use an asyncio.Event to keep this context manager open until the user clicks the stop button or closes the page. At that point, the shutdown method is called and the asyncio.Event is set.

# The methods below belong to the GeminiHandler class defined above.

async def connect(self, api_key: str):
    """Connect to the Gemini API"""
    if self.session is None:
        client = genai.Client(api_key=api_key, http_options={"api_version": "v1alpha"})
        config = {"response_modalities": ["AUDIO"]}
        async with client.aio.live.connect(
            model="gemini-2.0-flash-exp", config=config
        ) as session:
            self.session = session
            # Start forwarding Gemini's audio replies in the background
            asyncio.create_task(self.receive_audio())
            # Wait for the quit event to keep the connection open
            await self.quit.wait()

async def generator(self):
    """Yield the raw audio bytes of each response from Gemini"""
    while not self.quit.is_set():
        turn = self.session.receive()
        async for response in turn:
            if data := response.data:
                yield data

async def receive_audio(self):
    """Convert Gemini's byte chunks to 16-bit audio and queue them for emit"""
    async for audio_response in async_aggregate_bytes_to_16bit(
        self.generator()
    ):
        self.audio_queue.put_nowait(audio_response)

async def receive(self, frame: tuple[int, np.ndarray]) -> None:
    """Send one frame of the user's microphone audio to Gemini"""
    _, array = frame
    array = array.squeeze()
    audio_message = encode_audio(array)
    if self.session:
        await self.session.send(audio_message)

async def emit(self) -> AudioEmitType:
    """Return Gemini's next audio frame to the user"""
    if not self.args_set.is_set():
        await self.wait_for_args()
    if self.session is None:
        # latest_args[1] is the API key entered in the UI
        asyncio.create_task(self.connect(self.latest_args[1]))
    array = await self.audio_queue.get()
    return (self.output_sample_rate, array)

def shutdown(self) -> None:
    """Called when the user stops the stream or closes the page"""
    # Wake up connect() so the context manager closes the connection
    self.quit.set()
    self.session = None
    self.args_set.clear()
    self.quit.clear()
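
The keep-alive pattern used by connect and shutdown can be tried in isolation. The sketch below uses only the standard library. fake_connection and KeepAlive are hypothetical names standing in for the Gemini live connection and the handler, so this illustrates the idea rather than the SDK's actual behavior.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def fake_connection(log: list):
    """Hypothetical stand-in for the Gemini live connection."""
    log.append("opened")
    try:
        yield "session"
    finally:
        log.append("closed")


class KeepAlive:
    def __init__(self) -> None:
        self.quit = asyncio.Event()
        self.session = None
        self.log: list = []

    async def connect(self) -> None:
        async with fake_connection(self.log) as session:
            self.session = session
            # Block here so the context manager stays open
            await self.quit.wait()
        self.session = None

    def shutdown(self) -> None:
        # Setting the event lets connect() leave the `async with` block
        self.quit.set()


async def main() -> list:
    handler = KeepAlive()
    task = asyncio.create_task(handler.connect())  # runs in the background
    await asyncio.sleep(0)  # give connect() a chance to open the connection
    handler.log.append(f"session during chat: {handler.session}")
    handler.shutdown()
    await task  # connect() finishes once the event is set
    return handler.log


events = asyncio.run(main())
print(events)
```

The connection is opened once, stays usable while the background task waits on the event, and is closed only after shutdown sets the event.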

Video Processing

The video processing is handled by the video_receive and video_emit methods. For our application, we simply show the webcam stream back to the user, but once per second we also send the latest webcam frame, as well as the optional uploaded image, to the Gemini server.

async def video_receive(self, frame: np.ndarray):
    """Send video frames to the server"""
    if self.session:
        # send image every 1 second
        # otherwise we flood the API
        if time.time() - self.last_frame_time > 1:
            self.last_frame_time = time.time()
            await self.session.send(encode_image(frame))
            if self.latest_args[2] is not None:
                await self.session.send(encode_image(self.latest_args[2]))
    self.video_queue.put_nowait(frame)
    
async def video_emit(self) -> VideoEmitType:
    """Return video frames to the client"""
    return await self.video_queue.get()
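
The handler methods call two helpers, encode_audio and encode_image, that this article does not define. Below is one possible sketch of them. The payload shape (a dict with mime_type and base64-encoded data) is an assumption about what session.send accepts, and encode_image assumes Pillow is installed and that the frame is a height x width x 3 uint8 array in RGB order.

```python
import base64
from array import array
from io import BytesIO


def encode_audio(data) -> dict:
    """Wrap raw 16-bit PCM samples in a base64 payload.

    `data` can be any object with a `tobytes()` method,
    such as a NumPy int16 array.
    """
    return {
        "mime_type": "audio/pcm",
        "data": base64.b64encode(data.tobytes()).decode("utf-8"),
    }


def encode_image(data) -> dict:
    """Encode an image array as a base64 JPEG payload (requires Pillow)."""
    from PIL import Image  # imported lazily; only needed for images

    with BytesIO() as buffer:
        Image.fromarray(data).save(buffer, "JPEG")
        jpeg_bytes = buffer.getvalue()
    return {
        "mime_type": "image/jpeg",
        "data": base64.b64encode(jpeg_bytes).decode("utf-8"),
    }


# Usage: three 16-bit samples from the standard library's array type
samples = array("h", [0, 1000, -1000])
message = encode_audio(samples)
print(message["mime_type"], len(message["data"]))
```

Sending JPEG rather than raw pixels keeps each once-per-second frame small, which matters when the goal is to avoid flooding the API.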


The UI

Finally, let's create the Gradio interface with proper styling and components.

Below the HTML header, we place two rows: one for entering the Gemini API key and the other for starting the video chat. When the page is first opened, only the first row is visible. Once an API key is submitted, the second row becomes visible and the first row is hidden.

The webrtc.stream method sets up our video chat. As inputs to this event, we pass the API key and the optional image_input component. The handler reads these values from self.latest_args, where index 1 is the API key and index 2 is the image, matching their positions in inputs. We set time_limit=90 so that each video chat is limited to 90 seconds. The free tier of the Gemini API only allows two concurrent connections, so we set concurrency_limit=2 to ensure that at most two users are connected at a time.

import os

import gradio as gr
from gradio_webrtc import WebRTC

css = """
#video-source {max-width: 600px !important; max-height: 600px !important;}
"""

with gr.Blocks(css=css) as demo:
    gr.HTML(
        """
<div style='display: flex; align-items: center; justify-content: center; gap: 20px'>
  <div style="background-color: var(--block-background-fill); border-radius: 8px">
      <img src="https://www.gstatic.com/lamda/images/gemini_favicon_f069958c85030456e93de685481c559f160ea06b.png" style="width: 100px; height: 100px;">
  </div>
  <div>
      <h1>Gen AI SDK Voice Chat</h1>
      <p>Speak with Gemini using real-time audio streaming</p>
      <p>Powered by <a href="https://gradio.app/">Gradio</a> and <a href="https://freddyaboulton.github.io/gradio-webrtc/">WebRTC</a>⚡️</p>
      <p>Get an API Key <a href="https://support.google.com/googleapi/answer/6158862?hl=en">here</a></p>
  </div>
</div>
"""
    )
    # Row 1: shown first, asks for the API key
    with gr.Row() as api_key_row:
        api_key = gr.Textbox(
            label="API Key",
            type="password",
            placeholder="Enter your API Key",
            value=os.getenv("GOOGLE_API_KEY"),
        )
    # Row 2: hidden until an API key is submitted
    with gr.Row(visible=False) as row:
        with gr.Column():
            webrtc = WebRTC(
                label="Video Chat",
                modality="audio-video",
                mode="send-receive",
                elem_id="video-source",
                # See the link below for the changes needed to deploy behind a firewall
                # https://freddyaboulton.github.io/gradio-webrtc/deployment/
                rtc_configuration=None,
                icon="https://www.gstatic.com/lamda/images/gemini_favicon_f069958c85030456e93de685481c559f160ea06b.png",
                pulse_color="rgb(35, 157, 225)",
                icon_button_color="rgb(35, 157, 225)",
            )
        with gr.Column():
            image_input = gr.Image(label="Image", type="numpy", sources=["upload", "clipboard"])

    webrtc.stream(
        GeminiHandler(),
        inputs=[webrtc, api_key, image_input],
        outputs=[webrtc],
        time_limit=90,
        concurrency_limit=2,
    )
    # Hide the API key row and reveal the video chat row
    api_key.submit(
        lambda: (gr.update(visible=False), gr.update(visible=True)),
        None,
        [api_key_row, row],
    )

if __name__ == "__main__":
    demo.launch()

Conclusion

This implementation creates a full-featured voice and video chat interface for Gemini, supporting audio, webcam, and image inputs. WebRTC enables real-time, low-latency communication, while the async design keeps the audio and video streams from blocking each other.

Our application is hosted on Hugging Face. To learn more about WebRTC streaming with Python, visit the gradio-webrtc docs at https://freddyaboulton.github.io/gradio-webrtc/. Gradio is a great tool for building custom UIs in Python, and it works for any kind of AI application. Check out the Gradio docs at https://gradio.app/.
