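Gemini 2.0 Flash: Mixed Text-and-Image Generation

1- Exploring Gemini 2.0 Flash Mixed Text-and-Image Generation

For the past couple of days my feeds have been flooded again with Gemini 2.0's mixed text-and-image generation. I tried it briefly myself, and the results are genuinely impressive, especially its consistency across generations: a run of consecutively generated frames is well suited to being turned into a GIF. To automate that process I even wrote a program ([lencx/ai-explore]^[1]).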

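2- Model Overview

Last December, Google first rolled out native image output in Gemini 2.0 Flash to trusted testers. More recently it opened the feature for experimentation to developers in all regions supported by Google AI Studio. You can now try it through the experimental version of Gemini 2.0 Flash (gemini-2.0-flash-exp) in Google AI Studio or through the Gemini API (learn more: [Experiment with Gemini 2.0 Flash native image generation]^[2]).

Gemini 2.0 Flash combines multimodal input, enhanced reasoning, and natural language understanding to generate more refined images. Here are several standout use cases for its multimodal output: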

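Text and images together: Tell Gemini 2.0 Flash a story and it will illustrate it, keeping the characters and settings consistent throughout. If you give feedback on the images, it can retell the story or change the drawing style accordingly.
Conversational image editing: Gemini 2.0 Flash supports editing images through multi-turn natural-language conversation, which suits iterating toward the ideal result or exploring different ideas together with the model.
World knowledge: Unlike many other image generation models, Gemini 2.0 Flash draws on broad world knowledge and enhanced reasoning to generate images more accurately. This makes it especially suitable for realistic, detailed imagery such as illustrations for a recipe. Note that although it strives for accuracy, its knowledge, like that of all language models, is broad rather than exhaustive.
Text rendering: Most image generation models struggle to render longer passages of text accurately, often producing garbled formatting, illegible characters, or misspellings. According to Google's internal benchmarks, Gemini 2.0 Flash renders text better than leading competing models, making it suitable for advertisements, social media posts, or even invitations.

The models in the Gemini 2.0 Flash family all perform well, but each has a different focus. To make this easier to follow, I put together a comparison table.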

Gemini Model Comparison Table

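2.1- Limitations

For the best image generation performance, the following languages are recommended: English, Spanish (Mexico), Japanese, Simplified Chinese, and Hindi. Image generation does not support audio or video input, and it may not always be triggered:

The model may output text only. Try asking for image output explicitly.
- Chinese prompts: "生成图片" (generate an image), "随时提供图片" (provide images as you go along), "更新图片" (update the image), etc.
- English prompts: "generate an image", "provide images as you go along", "update the image", etc.
The model may stop generating partway through. Retry, or try a different prompt.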
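
The two tips above (ask for an image explicitly, and retry when needed) can be wrapped in a small helper. This is only a minimal sketch, not part of any official SDK: `has_image`, `generate_with_retry`, and the `call_model` callback are hypothetical names, and the response is assumed to expose `candidates[0].content.parts` with `inline_data` set on image parts, as in the script in section 3.2.

```python
from typing import Any, Callable

# Explicit nudge appended on retries, following the prompt tips above.
IMAGE_HINT = " Generate an image."


def has_image(response: Any) -> bool:
    """Return True if the first candidate contains at least one image part."""
    candidates = getattr(response, "candidates", None)
    if not candidates:
        return False
    content = getattr(candidates[0], "content", None)
    parts = getattr(content, "parts", None) or []
    return any(getattr(p, "inline_data", None) is not None for p in parts)


def generate_with_retry(call_model: Callable[[str], Any], prompt: str,
                        max_attempts: int = 3) -> Any:
    """Call the model; if no image comes back, retry with an explicit request."""
    response = None
    current = prompt
    for _ in range(max_attempts):
        response = call_model(current)
        if has_image(response):
            return response
        # Text-only (or empty) reply: ask for an image explicitly next time.
        current = prompt + IMAGE_HINT
    return response  # may still be text-only after all attempts
```

With the google-genai client, `call_model` could be a small function that forwards the prompt to `client.models.generate_content(...)` with the same model and config used in section 3.2.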

When generating text for an image, Gemini works best if you first generate the text and then ask for an image that contains it.

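2.2- How to Use

I have not yet found a model with mixed text-and-image output at https://gemini.google.com, so to try this experimental feature you need to go to https://aistudio.google.com/prompts/new_chat. Once in AI Studio, complete the following settings in the right-hand panel first:

Model: select the "Gemini 2.0 Flash Experimental" preview
Output format: select "Image and text"
Adjust the other parameters as needed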

AI Studio Settings

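With these settings in place, you are ready to play. I ran a simple test that generates consecutive animation frames, and the consistency really is striking.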

[!NOTE] Prompt
Create an animation by generating multiple frames, showing a seed growing into a plant and then blooming into a flower, in a pixel art style.

Gemini Generated Frames Preview

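Gemini cannot combine consecutive images into a GIF animation directly, so after downloading the images one by one and naming them in sequence, you can merge them locally with the ffmpeg command line.

# Combine the images in the current directory named flower01.jpeg, flower02.jpeg, ...
# in order into an infinitely looping animated GIF at 10 frames per second.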
ffmpeg -framerate 10 -i 'flower%2d.jpeg' -loop 0 flower.gif
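
The sequential naming that the command above relies on can also be scripted. The following is a minimal sketch under stated assumptions: `rename_sequential`, `prefix`, and `ext` are hypothetical names, and the downloaded files are assumed to sort alphabetically in frame order (check this before running it on real downloads).

```python
from pathlib import Path


def rename_sequential(folder: str, prefix: str = "flower",
                      ext: str = ".jpeg") -> list[str]:
    """Rename every `ext` file in `folder` to prefix01ext, prefix02ext, ..."""
    files = sorted(p for p in Path(folder).iterdir()
                   if p.is_file() and p.suffix.lower() == ext)

    # First pass: move everything to temporary names so that a new name
    # can never clobber a file that has not been renamed yet.
    temp_paths = []
    for i, path in enumerate(files, start=1):
        temp = path.with_name(f"__tmp_{i:02d}{ext}")
        path.rename(temp)
        temp_paths.append(temp)

    # Second pass: assign the final zero-padded sequential names.
    new_names = []
    for i, temp in enumerate(temp_paths, start=1):
        final = temp.with_name(f"{prefix}{i:02d}{ext}")
        temp.rename(final)
        new_names.append(final.name)
    return new_names
```

Calling `rename_sequential(".")` in the download folder would produce the `flower01.jpeg`, `flower02.jpeg`, ... names that the ffmpeg pattern expects.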

FFmpeg command in Warp terminal

The merged result looks like this:

Generated Flower Animation GIF

[!INFO] Warp & FFmpeg
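The terminal in the screenshot above is [Warp]^[3], a more modern cross-platform terminal application (macOS, Linux, and Windows). With a customizable interface, smart completions, command correction, AI-driven command suggestions, and workflow management, it noticeably improves the terminal experience and developer productivity.

In AI mode you can ask it for command suggestions and run them right away. It currently supports claude 3.7 sonnet, gpt-4o, o3-mini, gemini 2.0 flash, deepseek-r1 (v3), and others. Warp has many more advanced uses that I won't cover here; if you are interested, see its documentation.

[FFmpeg]^[4] is an open-source Swiss Army knife for multimedia processing. It supports format conversion (such as MOV to MP4 or MP4 to GIF), video editing (cutting, merging, compressing), filters (adjusting brightness, contrast, or color), metadata management (adding, modifying, or deleting file metadata such as titles and descriptions), and more, covering nearly every audio and video processing need. Besides its highly flexible command line, many multimedia applications are built on top of it. In short, with FFmpeg almost any audio or video processing problem becomes tractable.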

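That could have been the end of it, but you may have noticed that the generated images above carry a watermark in the lower-left corner, and that AI Studio does not save chat history by default (the conversation is cleared every time the page refreshes). We will come back to the watermark; first, here is how to save the conversation: enable Save Settings in Settings, then return to the Create Prompt chat, and conversations will be saved to Library in the sidebar.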

AI Studio Save Settings

3- Gemini API

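The watermark: in Gemini and AI Studio, Google generates watermarked images by default, but generating through the API avoids this. Beyond watermark-free output, the API opens up many other possibilities, so let's move on to the programming part.

A thought: to make a GIF, you first generate consecutive animation frames in AI Studio, then download the images one by one and rename them (with sequential numbers), and only then call ffmpeg locally to merge them into a viewable GIF. Such a tedious procedure raises the question of how to automate it, so that sending a prompt directly returns a viewable GIF.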

Automation Idea Flowchart

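3.1- Development Setup

For AI programming, [Python]^[5] and [Node.js]^[6] are both mainstream languages with large ecosystems; pick whichever you prefer. This tutorial uses Python for its demo code, so you need to install a Python development environment first. If your system has multiple Python versions or projects, I strongly recommend installing [uv]^[7] to isolate environments and avoid dependency conflicts (I won't go into detail here; follow the official documentation, or ask an AI, and you should be fine).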

[!INFO] uv
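uv is an extremely fast Python package and project manager written in Rust. It can replace pip, pip-tools, pipx, poetry, and several other tools, and is 10-100x faster than pip. It supports Python version management, application installation, and script execution, and provides efficient project management, including a universal lockfile and Cargo-style workspaces. uv offers a pip-compatible interface, saves disk space, can be installed directly via curl or pip without Rust or Python preinstalled, and runs on macOS, Linux, and Windows.

Note: many mainstream open-source projects now use uv for project management, such as [Open WebUI]^[8] and [OpenAI Agents SDK]^[9].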

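3.2- Writing the Code

Full code: https://github.com/lencx/ai-explore (a collection of code for exploring and learning AI).

In the chat interface, click "Get code" in the upper-right corner to get quick-start code that you can build your own logic on.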

AI Studio Get Code button

That code is quite rough, though, and a long way from merging consecutive frames into a GIF animation, so I wrote a fuller version:

# ref: https://github.com/lencx/ai-explore/blob/main/gemini/img2gif.py

import os
import datetime
from dotenv import load_dotenv
from io import BytesIO
from PIL import Image
from google import genai
from google.genai import types


def init_session():
    """
    Initialize the session: load environment variables, create folders, and generate the Markdown file path.
    """
    load_dotenv()
    api_key = os.environ.get("GEMINI_API_KEY")
    if not api_key:
        # Raise an error if the API key is missing (set it in the .env file)
        raise ValueError("GEMINI_API_KEY is missing. Please set it in the .env file.")

    # Initialize client
    client = genai.Client(api_key=api_key)

    # Generate folder name by timestamp (no ':' so the name is also valid on Windows)
    timestamp = datetime.datetime.now().strftime("%Y.%m.%d_%H.%M.%S")
    session_title = f"chat_{timestamp}"
    main_folder = os.path.join("output", session_title)
    os.makedirs(main_folder, exist_ok=True)

    # Create image folder
    image_folder = os.path.join(main_folder, "images")
    os.makedirs(image_folder, exist_ok=True)

    # Markdown file path
    md_file_path = os.path.join(main_folder, "index.md")

    return {
        "client": client,
        "main_folder": main_folder,
        "image_folder": image_folder,
        "md_file_path": md_file_path,
        "session_title": session_title
    }


def append_to_markdown(md_file_path, content):
    """
    Append content to the Markdown file.
    """
    with open(md_file_path, "a", encoding="utf-8") as f:
        f.write(content)


def process_api_response(response, message_count, image_folder):
    """
    Process the API response: print text to console, save images locally, and generate Markdown-format text.
    """
    md_snippet = ""

    # Boundary check: if no candidates, return empty string immediately
    if not response or not response.candidates:
        return md_snippet

    # Only handle the first candidate for simplicity
    parts = response.candidates[0].content.parts
    if not parts:
        return md_snippet

    # Traverse each part in the candidate
    for i, part in enumerate(parts):
        if part.text is not None:
            print("GeminiBot:", part.text)
            md_snippet += part.text + "\n\n"
        elif part.inline_data is not None:
            image_filename = f"message{message_count}_image_{i + 1}.png"
            image_path = os.path.join(image_folder, image_filename)
            image_rel_path = os.path.join("images", image_filename)
            try:
                # Save image
                image = Image.open(BytesIO(part.inline_data.data))
                image.save(image_path)
                print(f"Image saved to: {image_path}")
                # Insert an image link into Markdown
                md_snippet += f"![Generated Image {i + 1}]({image_rel_path})\n\n"
            except Exception as e:
                error_msg = f"Error saving image: {e}"
                print(error_msg)
                md_snippet += error_msg + "\n\n"

    return md_snippet


def main():
    """
    Main function: initialize the session, loop for user input, and generate responses.
    """
    # Initialize session
    session = init_session()
    client = session["client"]
    md_file_path = session["md_file_path"]
    image_folder = session["image_folder"]

    # Write initial information to Markdown
    init_md = f"# Chat Session: {session['session_title']}\n\n"
    init_md += f"**Start Time:** {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n"
    with open(md_file_path, "w", encoding="utf-8") as f:
        f.write(init_md)

    # Start the conversation; type 'exit' or 'quit' to end it
    print("Welcome to the conversation tool. Type 'exit' or 'quit' to end the conversation.")
    message_count = 1

    while True:
        try:
            user_input = input("User: ").strip()
        except (KeyboardInterrupt, EOFError):
            # Gracefully handle Ctrl + C and end-of-input (Ctrl + D)
            print("\nConversation ended by user.")
            break

        if user_input.lower() in ["exit", "quit"]:
            print("Conversation ended.")
            break

        if not user_input:
            # Skip empty input
            continue

        # Get timestamp for each message
        message_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        # Append user input to Markdown
        user_md = f"**User [{message_timestamp}]:** {user_input}\n\n"
        append_to_markdown(md_file_path, user_md)

        # Call the API to generate a response
        try:
            response = client.models.generate_content(
                model="models/gemini-2.0-flash-exp",
                contents=user_input,
                config=types.GenerateContentConfig(response_modalities=['Text', 'Image'])
            )
        except Exception as e:
            error_msg = f"Error calling API: {e}"
            print(error_msg)
            append_to_markdown(md_file_path, error_msg + "\n\n")
            continue

        # Process the response and generate Markdown content
        bot_md_header = f"**GeminiBot [{message_timestamp}]:**\n\n"
        bot_md_body = process_api_response(response, message_count, image_folder)

        # Append bot response to Markdown
        append_to_markdown(md_file_path, bot_md_header + bot_md_body)

        message_count += 1

    # Print the folder where the conversation was saved
    print(f"\nAll conversation content has been saved in: {session['main_folder']}")


if __name__ == "__main__":
    main()
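
The listing above stops at saving each returned image; the step that turns a set of frames into a GIF is not part of this excerpt. As a rough sketch of how such a step could look (not the repository's actual implementation), the helpers below write an ffmpeg concat-demuxer list and build the merge command. The names `write_concat_list`, `build_gif_command`, and `maybe_make_gif` are hypothetical, the three-frame threshold mirrors the "more than two images" behavior described in section 3.4, and frame paths are assumed to contain no single quotes. A list file is used instead of a `%d` pattern because the saved filenames are numbered by part index and may have gaps.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional, Sequence


def write_concat_list(frames: Sequence[str], list_path: str, fps: int = 10) -> None:
    """Write an ffmpeg concat list: one `file` + `duration` pair per frame."""
    lines = []
    for frame in frames:
        # Absolute POSIX-style paths so the list works from any directory.
        lines.append(f"file '{Path(frame).resolve().as_posix()}'")
        lines.append(f"duration {1 / fps:.3f}")
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def build_gif_command(list_path: str, gif_path: str) -> list[str]:
    """Build the ffmpeg command; `-loop 0` makes the GIF repeat forever."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_path, "-loop", "0", gif_path]


def maybe_make_gif(frames: Sequence[str], gif_path: str, min_frames: int = 3,
                   run: Callable = subprocess.run) -> Optional[str]:
    """Merge frames into a GIF only when at least `min_frames` were returned."""
    if len(frames) < min_frames:
        return None
    list_path = str(Path(gif_path).with_suffix(".txt"))
    write_concat_list(frames, list_path)
    # `run` is injectable so the logic can be exercised without ffmpeg installed.
    run(build_gif_command(list_path, gif_path), check=True)
    return gif_path
```

To wire this into the script, `process_api_response` would also need to return the list of saved image paths, which `main()` could then pass to `maybe_make_gif`. That change is left out here to keep the sketch separate from the original listing.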

3.3- Running the Code

To run img2gif.py, you need Python, uv, and ffmpeg installed on your system, plus a Gemini API key.

3.3.1- Step 1: Download the project

git clone https://github.com/lencx/ai-explore.git
cd ai-explore

3.3.2- Step 2: Set the API key

Copy the contents of .env.example into .env (create it if it does not exist), then add your Gemini API key to the .env file.

# https://aistudio.google.com/apikey
GEMINI_API_KEY=

3.3.3- Step 3: Sync and activate the environment

uv sync

# On macOS/Linux
source .venv/bin/activate

# On Windows (PowerShell)
.venv\Scripts\activate

3.3.4- Step 4: Run the program

python gemini/img2gif.py

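3.4- Code Demo

After you run python gemini/img2gif.py, you can type a prompt in the terminal and see the response in real time (in the terminal or in the Markdown file). When a single Gemini API call returns more than two images, the program automatically calls ffmpeg to merge that set of images into a GIF. The output has no watermark and the whole process is automated; if you like to tinker, you can extend it from here.

Click the image to watch the video demo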