Calling Your First LLM API

Learn to make your first programmatic call to an LLM provider's API. You'll set up authentication securely, send a chat request, parse the response, stream tokens as they arrive, and handle rate limits and errors gracefully.


What you'll be able to do

Configure API keys securely using environment variables and avoid hard-coding secrets
Make a basic chat completion request to an LLM API and parse the response
Implement token streaming to display output incrementally
Handle common errors and rate limits with retries and backoff
Understand request parameters like model, temperature, and max tokens

Overview

Calling an LLM API is the foundational skill for building AI applications. At its core, you send a structured request (your prompt plus parameters) over HTTPS and receive a generated response. This lesson uses the OpenAI Python SDK as the example, but the same concepts apply to Anthropic, Google, and most providers.
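To make the "structured request over HTTPS" idea concrete, the sketch below builds the kind of POST request an SDK call produces, using only the standard library. It does not send anything. The helper name build_chat_request and the placeholder key are illustrative assumptions, and other providers use different URLs and body shapes.

```python
import json
import urllib.request

# OpenAI's Chat Completions endpoint; other providers differ.
API_URL = "https://api.openai.com/v1/chat/completions"


def build_chat_request(api_key: str, prompt: str, model: str = "gpt-4o-mini") -> urllib.request.Request:
    """Build (but do not send) an HTTPS POST carrying a prompt plus parameters."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # the key travels in a header
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "sk-placeholder" is a dummy value for illustration, not a real key.
request = build_chat_request("sk-placeholder", "Explain what an API is in one sentence.")
print(request.get_method(), request.full_url)
print(json.loads(request.data)["messages"][0]["content"])
```

In practice the SDK handles this plumbing for you, along with parsing the JSON response.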

Setting Up Your Environment

Install the SDK and a dotenv helper:

pip install openai python-dotenv

Never hard-code API keys. Store them in environment variables or a .env file that is git-ignored.

# .env
OPENAI_API_KEY=sk-...

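import os

from dotenv import load_dotenv

load_dotenv()  # loads variables from a local .env file into os.environ
api_key = os.environ["OPENAI_API_KEY"]  # raises KeyError if the variable is missing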

Add .env to .gitignore so secrets never reach version control. In production, use a secrets manager (AWS Secrets Manager, Vault, or platform env vars).
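A small helper can make a missing key fail early with a readable message, and a masking function helps keep full keys out of logs. This is a minimal sketch; require_env and mask are hypothetical helper names, not part of any SDK.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Name the variable, but never echo a secret value in errors or logs.
        raise RuntimeError(f"{name} is not set; add it to your environment or .env file")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so a key can be logged safely."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Dummy value for illustration only.
print(mask("sk-example-1234"))
```

Calling require_env("OPENAI_API_KEY") at startup surfaces configuration problems before the first request is made.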

Your First Call

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env automatically

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what an API is in one sentence."}
    ],
    temperature=0.7,
    max_tokens=100,
)

print(response.choices[0].message.content)

Key Request Parameters

model: which model to use (affects cost, speed, quality)
messages: the conversation history; roles are system, user, assistant
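temperature: sampling randomness from 0.0 to 2.0; values near 0 give more repeatable output (not strictly deterministic), higher values give more varied output
max_tokens: cap on the number of tokens generated in the response; OpenAI now documents max_completion_tokens as its replacement, and its reasoning models require it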
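Because each request is independent, a multi-turn conversation works by resending the whole messages list, including earlier assistant replies. The sketch below shows one way to grow that list; add_message is a hypothetical helper, and the assistant text is hard-coded where a real app would use the model's reply.

```python
VALID_ROLES = {"system", "user", "assistant"}


def add_message(history, role, content):
    """Return a new history list with one more message appended."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return history + [{"role": role, "content": content}]


history = [{"role": "system", "content": "You are a concise assistant."}]
history = add_message(history, "user", "Explain what an API is in one sentence.")
# In a real app this text would come from response.choices[0].message.content.
history = add_message(history, "assistant", "An API is a contract that lets programs talk to each other.")
history = add_message(history, "user", "Now give an example.")

print([m["role"] for m in history])  # ['system', 'user', 'assistant', 'user']
```

The resulting list is what you would pass as messages on the next call. Longer histories mean more prompt tokens, so cost grows with each turn.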

Reading the Response

The response object contains choices (usually one; more if you request several with n), each with a message. It also includes a usage field reporting prompt_tokens, completion_tokens, and total_tokens, which is essential for tracking cost.

print(response.usage.total_tokens)
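Token counts become a cost estimate once you multiply them by per-token prices. The sketch below uses a stand-in Usage class with the same field names as response.usage, and the prices are hypothetical placeholders; check your provider's pricing page for real numbers.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in with the same field names as response.usage."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


# Hypothetical prices in USD per one million tokens (not real pricing).
INPUT_PRICE_PER_M = 0.50
OUTPUT_PRICE_PER_M = 1.50


def estimate_cost(usage) -> float:
    """Input and output tokens are usually priced separately."""
    input_cost = usage.prompt_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = usage.completion_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


usage = Usage(prompt_tokens=2_000, completion_tokens=1_000, total_tokens=3_000)
print(f"${estimate_cost(usage):.6f}")
```

Since estimate_cost only reads attributes, it should also accept a real response.usage object.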

Streaming Responses

For a responsive UX, stream tokens as they are generated instead of waiting for the full response.

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about code."}],
    stream=True,
)

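for chunk in stream:
    if not chunk.choices:  # some chunks (e.g. a final usage-only chunk) have no choices
        continue
    delta = chunk.choices[0].delta.content
    if delta:  # content is None on chunks that carry no text
        print(delta, end="", flush=True)
print()  # finish with a newline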

Each chunk carries a small delta. You concatenate deltas to build the full message.
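Concatenating deltas can be sketched without a network call by feeding chunk-shaped objects through a collector. The fake_chunk helper below is purely illustrative; it mimics only the attributes the loop reads (choices[0].delta.content).

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas from an iterable of chunk-like objects."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip None or empty deltas
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streaming chunk (for illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [
    fake_chunk(None),
    fake_chunk("Hello"),
    fake_chunk(", "),
    fake_chunk("world"),
    SimpleNamespace(choices=[]),
]
print(collect_stream(demo))  # Hello, world
```

Collecting parts in a list and joining once avoids repeated string copying on long responses.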

Handling Errors and Rate Limits

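Network calls fail. Providers enforce rate limits (requests and tokens per minute) and return 429 Too Many Requests when you exceed them. Retry transient failures with exponential backoff, and let errors that a retry cannot fix (such as an invalid key or a malformed request) surface immediately.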

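import time

from openai import APIConnectionError, InternalServerError, RateLimitError

MAX_ATTEMPTS = 5

def call_with_retry(client, **kwargs):
    """Call the chat endpoint, retrying transient failures with exponential backoff."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return client.chat.completions.create(**kwargs)
        except (RateLimitError, APIConnectionError, InternalServerError):
            # 429s, network failures, and 5xx responses are usually transient.
            # Other errors (bad request, invalid key) propagate immediately.
            if attempt == MAX_ATTEMPTS - 1:
                raise  # out of attempts: surface the original error
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, 8s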

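The SDK also retries some failures on its own; you can tune this with max_retries when constructing the client, for example OpenAI(max_retries=3, timeout=30.0). Set a timeout as well so a stalled request fails instead of hanging.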
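Plain exponential backoff can cause many clients to retry at the same moment. A common refinement, sketched below, caps the delay and adds random jitter. This is a generic wrapper under stated assumptions, not the SDK's own mechanism: retry, backoff_delays, and the flaky demo function are hypothetical names, and sleep is injectable so the example runs instantly.

```python
import random
import time


def backoff_delays(max_attempts, base=1.0, cap=30.0):
    """Exponential delays base * 2**n, each capped at `cap` seconds."""
    return [min(cap, base * 2 ** n) for n in range(max_attempts)]


def retry(fn, is_retryable, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call fn() until it succeeds, sleeping with jittered backoff between tries."""
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # "Full jitter": wait a random fraction of the exponential delay.
            sleep(delays[attempt] * rng())


# Demo: a function that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = retry(flaky, lambda e: isinstance(e, TimeoutError), sleep=waits.append)
print(result, len(waits))  # ok 2
```

With a real client, fn could be a lambda wrapping client.chat.completions.create(...), and is_retryable could check for the SDK's rate-limit and connection error types.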

Cost and Safety Tips

Set max_tokens to control runaway costs.
Log usage to monitor spend.
Validate and sanitize user input before sending.
Use the cheapest model that meets quality needs.
Rotate keys if they are ever exposed.
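As one possible reading of "validate and sanitize", the sketch below strips control characters and rejects empty or oversized input before it is sent. The function name and the character limit are hypothetical, and this kind of check does not by itself defend against prompt injection.

```python
MAX_INPUT_CHARS = 4_000  # hypothetical limit; tune to your model and budget


def clean_user_input(text: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Remove control characters and enforce basic size limits."""
    # Keep printable characters plus newlines and tabs; drop other control codes.
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t").strip()
    if not cleaned:
        raise ValueError("input is empty")
    if len(cleaned) > max_chars:
        raise ValueError(f"input is {len(cleaned)} characters; limit is {max_chars}")
    return cleaned


print(clean_user_input("  hi\x00 there\n"))  # hi there
```

Rejecting oversized input up front also bounds the prompt-token portion of each request's cost.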

Summary

You now know how to authenticate securely, send a chat request, parse responses, stream output, and handle rate limits. These primitives underpin every LLM-powered feature you’ll build.

Check your understanding


Q1. What is the safest way to provide your API key to your application?

Keys should never be in source code or public files. Environment variables or a git-ignored .env file (and a secrets manager in production) keep them out of version control.

Q2. What does setting stream=True accomplish?

Streaming returns the response in small delta chunks as they're generated, improving perceived responsiveness, but does not change accuracy, cost, or rate limits.

Q3. Which HTTP status code typically indicates you've hit a rate limit?

A 429 Too Many Requests response signals you've exceeded the provider's rate limit; the recommended response is to retry with exponential backoff.

Q4. Which parameter most directly controls the maximum length and cost of a response?

max_tokens caps how many tokens the model can generate, directly limiting response length and the associated cost.

Q5. In one short phrase, name the retry strategy recommended for handling 429 rate-limit errors.

Answer: exponential backoff
Exponential backoff increases the wait time after each failed attempt (e.g., 1s, 2s, 4s), reducing pressure on the API and improving the chance of success.

Q6. Which field in the response object should you log to monitor token consumption and cost?

Answer: usage
The usage field reports prompt_tokens, completion_tokens, and total_tokens, which you log to track spending.
