Integrating Gemini API: From First Request to Streaming Conversations

Full-stack developer documenting what I’m learning as I go. This space is all about tech, understanding how things work, and writing things down as they start to make sense.

If you want to get hands-on with Gemini's API quickly — without spinning up a local environment — Google Colab is the fastest path. This post walks through the full integration: from understanding what's actually happening when you hit the API, to building a stateful, streaming conversation system using the google-genai SDK.


Core Concepts (LLM Inference Pipeline)

Before jumping into code, a quick grounding in what the API is doing under the hood. Not theory for theory's sake — these concepts directly shape how you'll architect your requests.

What Happens When You Send a Prompt

Every API request follows the same pipeline:

Tokenization → Embedding → Contextualization → Generation

The model doesn't read text — it converts your input into tokens, maps each token to a high-dimensional vector (embedding), runs those embeddings through a Transformer's self-attention mechanism to build context-aware representations, and then generates output one token at a time via autoregression.

That last part is important: the model predicts one token, appends it, and repeats. This is why generation has latency, and it's exactly what streaming solves.
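The predict, append, repeat loop can be sketched with a toy example. The bigram table, token names, and `generate` function below are invented for illustration; a real model scores every token in its vocabulary with a neural network instead of looking up a table.

```python
# Toy autoregressive generation: a hypothetical lookup table stands in for the model.
NEXT_TOKEN = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<end>",
}

def generate(max_tokens=10):
    """Predict one token, append it, and repeat until <end> or the cap is hit."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        # The "model" only sees what has been generated so far.
        next_token = NEXT_TOKEN.get(tokens[-1], "<end>")
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the <start> marker

print(" ".join(generate()))
```

Each pass through the loop depends on the previous one, so the steps cannot be done in parallel. The `max_tokens` cap plays the same role as `max_output_tokens` in the API.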

Key API Parameters You Control

Parameter             What It Does
model                 Which model engine to invoke
contents              The full conversation history you send with every request
system_instruction    Persistent behavior/persona instructions, separate from the conversation
temperature           Scales the probability distribution over the next token
max_output_tokens     Hard ceiling on output length

Temperature is worth understanding precisely. Rather than a vague "creativity dial," it is a scaling factor applied to the model's token scores before they are turned into probabilities. Low temperature (0.0–0.2) sharpens the distribution, making the top token near-certain. Higher temperature (0.8 and above) flattens it, giving lower-probability tokens a real chance. Use 0.0 for code generation and data extraction; use 0.8+ for brainstorming. High temperature also increases hallucination risk, because less probable tokens get sampled more often.
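A small numeric sketch shows the effect. The scores below are made-up values for three candidate tokens, and treating temperature 0 as "always pick the top token" is a common convention assumed here, not a statement about Gemini's internals.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding (all mass on the top token).
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as the temperature rises, while the ranking of the tokens never changes.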


Statelessness: The Architecture You Must Account For

Standard generate_content calls are completely stateless. Every request is independent. If you ask a follow-up question in a new API call without including prior context, the model has no idea what you're referencing. You own the conversation state — you build it, maintain it, and send it in full with every request.


Setting Up Gemini API in Google Colab

1. Install the SDK

!pip install google-genai

2. Secure API Key Handling

Never hardcode API keys. Colab's Secrets manager (userdata) is the right approach:

from google.colab import userdata
import os

os.environ["GEMINI_API_KEY"] = userdata.get("GEMINI_API_KEY")

This keeps the key out of your notebook cells and prevents accidental exposure when sharing.
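Outside Colab there is no `userdata`, so the key usually comes straight from the environment. The helper below is a hypothetical sketch (the name `require_api_key` is not part of any SDK) that fails early with a readable message instead of letting a missing key surface later as an authentication error.

```python
import os

def require_api_key(name="GEMINI_API_KEY", env=None):
    """Return the key from the environment, or raise with a clear message."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()  # tolerate stray whitespace from copy/paste
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to your environment or to Colab Secrets."
        )
    return key
```

Calling `require_api_key()` once at the top of a notebook or script makes a missing secret obvious before any request is sent.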

3. Initialize the Client

from google import genai
from google.genai import types

# Client() picks up the API key from the GEMINI_API_KEY environment variable set above
client = genai.Client()
model = "gemini-2.0-flash-lite"

Implementation

1. Single-Turn Request

The simplest call — pass a string directly to contents:

response = client.models.generate_content(
    model=model,
    contents="What is quantum computing? Answer in one sentence"
)

print(response.text)

For single-turn requests, Gemini's SDK lets you skip the role/parts structure entirely. The .text property gives you clean output directly.
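For comparison, here is what the explicit form of that same prompt looks like in the role/parts format used for multi-turn calls. This assumes a bare string is treated as a single user message; `as_contents` is a hypothetical helper written for this post, not an SDK function.

```python
def as_contents(prompt):
    """Build the explicit role/parts structure for one user message."""
    return [{"role": "user", "parts": [{"text": prompt}]}]

contents = as_contents("What is quantum computing? Answer in one sentence")
print(contents)
```

The string shorthand is convenient for one-off calls; the explicit list is what you need as soon as a second turn is involved.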

2. Multi-Turn Conversations

Since the API is stateless, you maintain a messages list and send it in full with each call. Gemini's message format uses role ("user" or "model") and parts:

def add_user_message(messages, text):
    messages.append({"role": "user", "parts": [{"text": text}]})

def add_model_message(messages, text):
    messages.append({"role": "model", "parts": [{"text": text}]})

def chat(messages, system_instruction=None):
    config = None
    if system_instruction:
        config = types.GenerateContentConfig(
            system_instruction=system_instruction
        )

    response = client.models.generate_content(
        model=model,
        contents=messages,
        config=config
    )

    print(response.text)
    return response.text

Usage:

messages = []

add_user_message(messages, "Suggest 3 classic sci-fi books for a beginner.")
answer = chat(messages)
add_model_message(messages, answer)

add_user_message(messages, "Which of those has the most hopeful ending?")
final_answer = chat(messages)

The second call works correctly because the full conversation history — including Gemini's first response — is included in contents. Without add_model_message, the follow-up would have no context.
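The same bookkeeping can be wrapped in a small class so that forgetting to append the model's reply becomes impossible. This is a sketch under assumptions: `Conversation` and `fake_generate` are hypothetical names, and the fake function stands in for the real API call so the example runs without a key.

```python
class Conversation:
    """Keeps the history and resends all of it on every turn."""

    def __init__(self, generate):
        # `generate` is any callable that takes the message list and returns text.
        self.generate = generate
        self.messages = []

    def send(self, text):
        self.messages.append({"role": "user", "parts": [{"text": text}]})
        answer = self.generate(self.messages)
        # Store the reply so the next turn includes it as context.
        self.messages.append({"role": "model", "parts": [{"text": answer}]})
        return answer

def fake_generate(messages):
    # Stand-in for the API call: reports how many messages it received.
    return f"saw {len(messages)} message(s)"

convo = Conversation(fake_generate)
print(convo.send("Suggest 3 classic sci-fi books for a beginner."))
print(convo.send("Which of those has the most hopeful ending?"))
```

Swapping `fake_generate` for a function that calls `client.models.generate_content` and returns `response.text` would give the same behavior as the helpers above.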

3. System Instructions

System instructions give the model persistent behavioral directives that apply across the entire request. They're passed via GenerateContentConfig, separate from the contents array:

system_prompt = """
You are a sci-fi expert who is enthusiastic but brief.
You always explain WHY a book is beginner-friendly.
Use a friendly, robotic tone (e.g., "Greetings, human learner").
"""

messages = []
add_user_message(messages, "Suggest 3 classic sci-fi books for a beginner.")

answer = chat(messages, system_instruction=system_prompt)
add_model_message(messages, answer)

add_user_message(messages, "Which of those has the most hopeful ending?")
final_answer = chat(messages, system_instruction=system_prompt)

The system_instruction must be passed with every chat() call. It's not stored server-side between requests — same statelessness rule applies.

4. Response Streaming

The blocking generate_content call waits for the entire response before returning — sometimes 10–30 seconds on longer outputs. Streaming fixes this by returning a generator that yields chunks as the model produces them.

The tricky part: you need to serve two consumers simultaneously — the user (who wants to see text as it arrives) and your message history (which needs the complete assembled string).

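def chat(messages, system_instruction=None):
    # Same signature as the blocking version; config stays optional
    config = None
    if system_instruction:
        config = types.GenerateContentConfig(
            system_instruction=system_instruction
        )

    response = client.models.generate_content_stream(
        model=model,
        contents=messages,
        config=config
    )

    complete_message = ""

    for chunk in response:
        text = chunk.text or ""  # a chunk may carry no text
        print(text, end="", flush=True)
        complete_message += text

    print()  # end the line once the stream finishes
    return complete_message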

Three things make this work correctly:

  • end="" prevents print() from inserting a newline between chunks, so text flows continuously

  • flush=True bypasses Python's output buffer — without it, chunks accumulate and print in batches, defeating the purpose

  • complete_message accumulates the full response so it can be stored in conversation history via add_model_message
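The accumulator pattern can be tried without an API key by simulating the stream. In this sketch `FakeChunk` and `fake_stream` are stand-ins invented for the example; one chunk deliberately has no text, which is the case a plain `+=` on `chunk.text` would trip over.

```python
class FakeChunk:
    """Minimal stand-in for a streamed chunk: just a .text attribute."""
    def __init__(self, text):
        self.text = text

def fake_stream():
    # One chunk carries no text, to exercise the guard in consume().
    for piece in ["Greetings, ", None, "human ", "learner."]:
        yield FakeChunk(piece)

def consume(stream):
    """Print each chunk as it arrives and return the assembled message."""
    pieces = []
    for chunk in stream:
        text = chunk.text or ""  # treat a missing text as an empty string
        print(text, end="", flush=True)
        pieces.append(text)
    print()  # finish the line once the stream ends
    return "".join(pieces)

complete_message = consume(fake_stream())
```

Collecting pieces in a list and joining once at the end is a minor stylistic choice; string concatenation in the loop behaves the same for responses of this size.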

Full streaming conversation:

messages = []

add_user_message(messages, "Suggest 3 classic sci-fi books for a beginner.")
answer = chat(messages)
add_model_message(messages, answer)

add_user_message(messages, "Which of those has the most hopeful ending?")
final_answer = chat(messages)

The interface is identical to the non-streaming version — only the chat() internals change.


Key Observations

Statelessness is a design constraint, not a limitation. Once you internalize that the API is memoryless, the pattern (maintain history → send in full → append response) becomes second nature. It also gives you complete control over what the model "remembers."
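One way to exercise that control is to cap how much history gets resent. The sketch below is an illustration only: `trim_history` is a hypothetical helper, and whether dropping old turns is acceptable depends on your application.

```python
def trim_history(messages, max_messages):
    """Return the most recent messages, starting from a user turn."""
    # Guard the zero case: messages[-0:] would return the whole list.
    recent = messages[-max_messages:] if max_messages > 0 else []
    # Drop leading model messages so the history still opens with a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent

def msg(role, text):
    return {"role": role, "parts": [{"text": text}]}

history = [
    msg("user", "q1"), msg("model", "a1"),
    msg("user", "q2"), msg("model", "a2"),
    msg("user", "q3"),
]
print(len(trim_history(history, 3)))
```

Passing `trim_history(messages, n)` as `contents` instead of the full list means the model simply never sees the older turns.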

System instructions are not magic. They shape tone and behavior well, but they don't override the model's training. Think of them as strong defaults, not hard constraints.

Temperature = 0.0 produces highly deterministic outputs and is preferred for debugging and evaluation tasks. Increasing the temperature introduces randomness into token selection, making responses more diverse but less predictable. While higher temperatures generally reduce repeatability, they do not guarantee different outputs on every run.

Streaming is almost always worth it. The code change is minimal and the UX improvement, especially for longer responses, is significant. The main thing to get right is the accumulator pattern, so that the complete response still ends up in your message history.

The config object is the extension point. Temperature, system instructions, safety settings, and max tokens all go through GenerateContentConfig. Get comfortable with it early.
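A sketch of how those settings might be grouped per task follows. The task labels, preset values, and the `config_kwargs` helper are assumptions made for illustration; the function only builds a plain dictionary, so it runs without the SDK installed.

```python
def config_kwargs(task, system_instruction=None):
    """Pick generation settings for a hypothetical task label."""
    presets = {
        # Illustrative values only; tune them for your own workload.
        "extract": {"temperature": 0.0, "max_output_tokens": 256},
        "brainstorm": {"temperature": 0.9, "max_output_tokens": 1024},
    }
    kwargs = dict(presets[task])  # copy so the preset is not mutated
    if system_instruction:
        kwargs["system_instruction"] = system_instruction
    return kwargs

settings = config_kwargs("extract", system_instruction="Reply with JSON only.")
print(settings)
# With the SDK installed, these would be passed as:
#   config = types.GenerateContentConfig(**settings)
```

Keeping the presets in one place makes it easy to see which temperature each kind of request runs with.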