Google's Gemini Pro 1.5 can now not only see, but also hear.

Google has updated its powerful artificial intelligence model Gemini Pro 1.5 to give it, for the first time, the ability to listen to the content of audio and video files.

The update was announced at Google Next, where the search giant confirmed that the model can listen to uploaded clips and provide information about them without requiring a transcription.

What this means is that you can give it a documentary or a recorded presentation and ask questions about both the audio and the video in the clip.

This is part of Google's broader effort to create more multimodal models that can understand a variety of input types, not just text. The move is made possible by the Gemini family of models being trained on speech, video, text, and code simultaneously.

Google introduced Gemini Pro 1.5 in February with a 1 million token context window, which, combined with its multimodal training, is large enough for the model to process lengthy videos in a single prompt.

The technology giant is now adding audio to its input options. That means you can hand the model a podcast and ask it to pick out key moments or specific mentions, and do the same with the audio track attached to a video file while it analyzes the visual content.
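For developers with access, prompting the model with audio is a matter of passing the file alongside a text instruction. Here is a minimal sketch using the Vertex AI Python SDK; the project ID, Cloud Storage path, and preview model name are illustrative placeholders and may differ in your setup.

```python
# Minimal sketch: asking Gemini Pro 1.5 about a podcast episode via Vertex AI.
# The project ID, bucket path, and model name below are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-preview-0409")  # preview name may vary

# Reference an audio file stored in Cloud Storage.
podcast = Part.from_uri("gs://your-bucket/podcast-episode.mp3", mime_type="audio/mpeg")

response = model.generate_content([
    podcast,
    "List the key moments in this episode with rough timestamps, "
    "and note every mention of a specific product.",
])
print(response.text)
```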

The new update applies to the middle tier of the Gemini family, which comes in three sizes: the smaller Nano for on-device use, Pro, which powers the free version of the Gemini chatbot, and Ultra, which powers Gemini Advanced.

For some reason Google released the 1.5 update only for Gemini Pro, not Ultra. It is not clear if Gemini Ultra 1.5 will be available, and if so, when it will be accessible.

The large context window, which starts at 128,000 tokens (similar to Claude 3 Opus) and exceeds 1 million for certain authorized users, means there is no need to fine-tune the model on specific data. Simply load that data at the start of the chat and ask questions about it.
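In practice, that looks like dropping the reference material straight into the conversation. The sketch below is illustrative (the file name, project ID, and model name are assumptions): it starts a chat with a long document in context and then takes follow-up questions against it.

```python
# Minimal sketch: long-context prompting instead of fine-tuning.
# The document, project ID, and model name are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-preview-0409")

# Load a long document that fits comfortably inside the context window.
with open("quarterly-report.txt", encoding="utf-8") as f:
    report = f.read()

chat = model.start_chat()

# First message: the raw material plus the question.
answer = chat.send_message(
    report + "\n\nSummarize the three biggest risks flagged in this report."
)
print(answer.text)

# Follow-ups reuse the document already sitting in the chat history.
follow_up = chat.send_message("Which of those risks comes up most often?")
print(follow_up.text)
```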

This update also means that Gemini can now generate transcripts of video clips.

So when will everyone else get access? Probably after the Google I/O developer conference next month. For now, it is only available through Vertex AI, the Google Cloud developer dashboard.

Vertex AI is a powerful tool for interacting with various models, building AI applications, and testing what is possible, but it is not widely accessible and is primarily aimed at developers, businesses, and researchers rather than consumers.

Vertex AI allows users to upload visual or audio media, such as a short film or a recording of someone giving a speech, and add text prompts. These could be "Give me 5 bullet points that summarize this speech" or "How many times is Gemini mentioned?"
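As a rough illustration of that workflow, the sketch below sends a speech video together with those two prompts in a single request; again, the bucket path, project ID, and model name are placeholder assumptions.

```python
# Minimal sketch: a video plus the article's example prompts in one request.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-preview-0409")

# Reference a video file stored in Cloud Storage.
speech = Part.from_uri("gs://your-bucket/keynote-speech.mp4", mime_type="video/mp4")

response = model.generate_content([
    speech,
    "Give me 5 bullet points that summarize this speech, "
    "then count how many times Gemini is mentioned.",
])
print(response.text)
```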

The primary users of Gemini Pro 1.5 are businesses, and Google already has partnerships with TBS, Replit, and others, which use the model for tasks such as tagging metadata and generating code.

Google has also begun using Gemini Pro 1.5 in its own products, including Code Assist, a generative AI coding assistant for tracking changes across large code bases.

The change to Gemini Pro 1.5 was announced at Google Next along with a major update to the DeepMind AI image model Imagen 2 that enhances Gemini's image generation capabilities.

It gains inpainting and outpainting, which allow users to remove elements from, or add elements to, a generated image. This is similar to recent updates OpenAI made to its DALL-E model.

Google is also trying to link AI responses on Gemini and other platforms with Google search to ensure that they always contain up-to-date information.
