Calling Your First LLM API
Learn to make your first programmatic call to an LLM provider's API. You'll set up authentication securely, send a chat request, parse the response, stream tokens as they arrive, and handle rate limits and errors gracefully.
What you'll be able to do
- Configure API keys securely using environment variables and avoid hard-coding secrets
- Make a basic chat completion request to an LLM API and parse the response
- Implement token streaming to display output incrementally
- Handle common errors and rate limits with retries and backoff
- Understand request parameters like model, temperature, and max tokens
Overview
Calling an LLM API is the foundational skill for building AI applications. At its core, you send a structured request (your prompt plus parameters) over HTTPS and receive a generated response. This lesson uses the OpenAI Python SDK as the example, but the same concepts apply to Anthropic, Google, and most providers.
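To make "a structured request over HTTPS" concrete, here is a minimal sketch of the raw request an SDK assembles for you, using only the Python standard library. It assumes OpenAI's chat completions endpoint and a placeholder key. `build_chat_request` is a hypothetical helper name, and the request is built but never sent.

```python
import json
import urllib.request

# Assumption: OpenAI's chat completions endpoint. Other providers use
# different URLs, headers, and payload shapes.
API_URL = "https://api.openai.com/v1/chat/completions"


def build_chat_request(api_key, model, messages, **params):
    """Build (but do not send) the HTTPS POST request behind a chat call."""
    payload = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    api_key="sk-placeholder",  # hypothetical value, not a real key
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain what an API is in one sentence."}],
    temperature=0.7,
)
print(request.get_method(), request.full_url)
print(json.loads(request.data)["model"])
```

The prompt and parameters travel together in one JSON body, and the key travels in a header rather than in the URL or the payload.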
Setting Up Your Environment
Install the SDK and a dotenv helper:
pip install openai python-dotenv
Never hard-code API keys. Store them in environment variables or a .env file that is git-ignored.
# .env
OPENAI_API_KEY=sk-...
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.environ["OPENAI_API_KEY"]
Add .env to .gitignore so secrets never reach version control. In production, use a secrets manager (AWS Secrets Manager, Vault, or platform env vars).
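A missing key otherwise surfaces as a bare KeyError, and a logged key is an exposed key. The sketch below shows one way to fail with a clear message and to mask the key before printing it. `require_api_key` and `mask` are hypothetical helper names, and the demo uses a fake environment dictionary instead of a real secret.

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key stored under `name`, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your environment or .env file")
    return key


def mask(key, visible=4):
    """Replace all but the last `visible` characters with asterisks for safe logging."""
    shown = min(visible, len(key))
    return "*" * (len(key) - shown) + key[len(key) - shown:]


# Demo with a fake environment so no real secret is involved.
fake_env = {"OPENAI_API_KEY": "sk-example-1234"}
print(mask(require_api_key(env=fake_env)))
```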
Your First Call
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env automatically

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what an API is in one sentence."},
    ],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
Key Request Parameters
- model: which model to use (affects cost, speed, quality)
- messages: the conversation history; roles are system, user, and assistant
- temperature: 0 = most predictable (near-deterministic), higher = more varied and creative (0.0–2.0)
- max_tokens: cap on response length
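Checking these parameters before sending can catch mistakes locally instead of as an API error. The following is a rough sketch, not part of any SDK. `build_chat_kwargs` is a hypothetical helper, and it only knows the three roles and the temperature range listed above.

```python
def build_chat_kwargs(model, messages, temperature=0.7, max_tokens=100):
    """Validate common parameters and return kwargs for a chat completion call."""
    allowed_roles = {"system", "user", "assistant"}  # only the roles this lesson covers
    if not messages:
        raise ValueError("messages must contain at least one entry")
    for message in messages:
        if message.get("role") not in allowed_roles:
            raise ValueError(f"unknown role: {message.get('role')!r}")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


kwargs = build_chat_kwargs(
    "gpt-4o-mini",
    [{"role": "user", "content": "Explain what an API is in one sentence."}],
    temperature=0.2,
)
print(sorted(kwargs))
```

The returned dictionary can be unpacked into the call, as in client.chat.completions.create(**kwargs).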
Reading the Response
The response object contains choices (usually one), each with a message. It also includes a usage field reporting prompt_tokens, completion_tokens, and total_tokens — essential for tracking cost.
print(response.usage.total_tokens)
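Token counts turn into a cost estimate once you multiply them by per-token prices. The sketch below uses a stand-in `Usage` class with the same three field names, so it runs without an API call. The prices are hypothetical placeholders, not real rates, so substitute the figures from your provider's pricing page.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in for the SDK's usage object, with the same three field names."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


# Hypothetical prices in USD per 1 million tokens (placeholders, not real rates).
INPUT_PRICE_PER_M = 0.50
OUTPUT_PRICE_PER_M = 1.50


def estimate_cost(usage, input_price=INPUT_PRICE_PER_M, output_price=OUTPUT_PRICE_PER_M):
    """Estimate the cost of one call; input and output tokens are priced separately."""
    total = usage.prompt_tokens * input_price + usage.completion_tokens * output_price
    return total / 1_000_000


usage = Usage(prompt_tokens=1000, completion_tokens=2000, total_tokens=3000)
print(f"estimated cost: ${estimate_cost(usage):.6f}")
```

Because the function only reads `prompt_tokens` and `completion_tokens`, passing `response.usage` from a real call should work the same way.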
Streaming Responses
For a responsive UX, stream tokens as they are generated instead of waiting for the full response.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about code."}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip those.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
Each chunk carries a small delta. You concatenate deltas to build the full message.
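The concatenation step can be sketched without a network call by simulating chunks that have the same shape as the streamed ones (`chunk.choices[0].delta.content`). `fake_chunk` and `collect_stream` are hypothetical names introduced for this illustration.

```python
from types import SimpleNamespace


def fake_chunk(text):
    """Mimic the shape of a streamed chunk: chunk.choices[0].delta.content."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(stream, on_delta=None):
    """Join the text deltas of a stream, optionally reporting each one as it arrives."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip None or empty deltas
            parts.append(delta)
            if on_delta:
                on_delta(delta)
    return "".join(parts)


simulated = [
    fake_chunk("Code "),
    fake_chunk(None),
    fake_chunk("flows"),
    SimpleNamespace(choices=[]),
]
full_text = collect_stream(simulated, on_delta=lambda d: print(d, end="", flush=True))
print()
```

Collecting the parts in a list and joining once at the end avoids repeated string concatenation, and the returned text is what you would store in the conversation history.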
Handling Errors and Rate Limits
Network calls fail. Providers enforce rate limits (requests or tokens per minute) and return HTTP 429 Too Many Requests when you exceed them. Retry such failures with exponential backoff.
import time
from openai import RateLimitError, APIError

def call_with_retry(client, **kwargs):
    for attempt in range(5):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            # 429: wait 1s, 2s, 4s, ... before the next attempt
            wait = 2 ** attempt
            time.sleep(wait)
        except APIError:
            # Other API failures: same backoff, but re-raise on the last attempt
            if attempt == 4:
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError("Max retries exceeded")
The SDK also has built-in retries (max_retries on the client), so be careful not to stack both retry layers unintentionally. Set a timeout (timeout on the client) to avoid hanging requests.
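A common refinement that this lesson's example does not include is jitter: randomizing each wait so that many clients do not all retry at the same instant. The sketch below is a generic, library-independent version under that assumption. `backoff_delay`, `retry`, and `flaky` are hypothetical names, and the sleep function is injectable so the demo runs instantly.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Return a random wait between 0 and min(cap, base * 2**attempt) seconds."""
    upper = min(cap, base * 2 ** attempt)
    return random.uniform(0, upper)


def retry(operation, retry_on=(Exception,), max_attempts=5, sleep=time.sleep):
    """Call operation(); on a retryable error, wait with jittered backoff and try again."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))


# Demo: a simulated operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = retry(flaky, retry_on=(ConnectionError,), sleep=waits.append)
print(result, len(waits))
```

Passing only transient error types in `retry_on` keeps non-retryable failures, such as an invalid request, from being retried pointlessly.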
Cost and Safety Tips
- Set max_tokens to control runaway costs.
- Log usage to monitor spend.
- Validate and sanitize user input before sending.
- Use the cheapest model that meets quality needs.
- Rotate keys if they are ever exposed.
Summary
You now know how to authenticate securely, send a chat request, parse responses, stream output, and handle rate limits. These primitives underpin every LLM-powered feature you’ll build.
Check your understanding
Exponential backoff increases the wait time after each failed attempt (e.g., 1s, 2s, 4s), reducing pressure on the API and improving the chance of success.
The usage field reports prompt_tokens, completion_tokens, and total_tokens, which you log to track spending.