LangSmith: The Complete Guide to Building Production-Ready LLM Applications

August 16, 2025
8 min read
Python TypeScript LangSmith LangChain LangGraph OpenAI AI Observability AI Development

Large Language Model (LLM) applications are transforming industries, but building reliable, production-ready AI systems remains challenging. Enter LangSmith: a platform designed specifically for developing, monitoring, and optimizing LLM applications at scale.

In this comprehensive guide, we’ll explore how LangSmith addresses the critical challenges of LLM development and helps you ship AI applications with confidence.

What is LangSmith?

LangSmith is a platform for building production-grade LLM applications that provides three core capabilities:

  • Observability: Monitor and analyze your LLM applications in real-time
  • Evaluation: Systematically test and measure application performance
  • Prompt Engineering: Iterate on prompts with version control and collaboration

The platform is framework-agnostic, meaning you can use it with or without LangChain’s open-source frameworks like langchain and langgraph.

Why LangSmith Matters for LLM Development

The Challenge of LLM Unpredictability

LLMs don’t always behave predictably. Small changes in prompts, models, or inputs can significantly impact results. This unpredictability makes it difficult to:

  • Debug issues in production
  • Measure application performance consistently
  • Iterate on prompts effectively
  • Ensure reliability across different scenarios

LangSmith’s Solution

LangSmith provides structured approaches to these challenges through:

  1. Comprehensive Tracing: Track every component of your LLM application
  2. Quantitative Evaluation: Measure performance with structured metrics
  3. Collaborative Development: Enable teams to work together on prompt engineering
  4. Production Monitoring: Get insights into real-world application behavior

Core Features Deep Dive

1. Observability: See Inside Your LLM Applications

Observability in LangSmith allows you to trace and analyze every aspect of your LLM application’s behavior.

Key Capabilities:

  • Trace Analysis: Visualize the complete flow of your application
  • Metrics & Dashboards: Configure custom metrics and monitoring dashboards
  • Alerts: Set up notifications for performance issues or anomalies
  • Real-time Monitoring: Track application behavior as it happens

Getting Started with Observability

Here’s how to set up basic tracing for a RAG (Retrieval-Augmented Generation) application:

Installation:

pip install -U langsmith openai

Environment Setup:

export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"

Basic RAG Application with Tracing:

from openai import OpenAI
from langsmith import traceable

openai_client = OpenAI()

@traceable
def retriever(query: str):
    # Mock retriever - replace with your actual retrieval logic
    results = ["Harrison worked at Kensho", "He was a software engineer"]
    return results

@traceable
def rag(question: str):
    docs = retriever(question)
    system_message = f"""Answer the user's question using only the provided information below:
{chr(10).join(docs)}"""
    return openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-4o-mini",
    )

# Usage
response = rag("Where did Harrison work?")
print(response.choices[0].message.content)

The @traceable decorator automatically captures:

  • Function inputs and outputs
  • Execution time and performance metrics
  • Error handling and debugging information
  • Nested function calls and their relationships
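
The decorator covers your own functions; if you also want each OpenAI call to appear as its own child run (with model, parameters, and token usage), LangSmith ships a wrapper for the OpenAI client. Below is a minimal sketch, assuming the same environment variables as above; the answer helper is just an illustrative placeholder:

from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai

# Wrapping the client records every chat.completions call as a child run
openai_client = wrap_openai(OpenAI())

@traceable
def answer(question: str) -> str:
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return completion.choices[0].message.content

print(answer("What does LangSmith trace?"))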

2. Evaluation: Measure What Matters

Evaluation provides quantitative ways to measure your LLM application’s performance, crucial for maintaining quality as you iterate and scale.

Components of Evaluation:

  1. Dataset: Test inputs and optionally expected outputs
  2. Target Function: What you’re evaluating (could be a single LLM call or entire application)
  3. Evaluators: Functions that score your target function’s outputs

Setting Up Evaluations

Creating a Dataset:

from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="rag_qa_dataset",
    description="Questions and answers for RAG evaluation"
)

# Add examples to the dataset
examples = [
    {
        "inputs": {"question": "Where did Harrison work?"},
        "outputs": {"answer": "Harrison worked at Kensho"}
    },
    {
        "inputs": {"question": "What was Harrison's role?"},
        "outputs": {"answer": "Harrison was a software engineer"}
    }
]

for example in examples:
    client.create_example(
        inputs=example["inputs"],
        outputs=example["outputs"],
        dataset_id=dataset.id
    )

Running Evaluations:

from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

def accuracy_evaluator(run: Run, example: Example) -> dict:
    """Custom evaluator to check answer accuracy"""
    predicted = (run.outputs or {}).get("answer", "").lower()
    expected = (example.outputs or {}).get("answer", "").lower()
    return {
        "key": "accuracy",
        "score": 1.0 if expected in predicted else 0.0
    }

# The target returns a dict so evaluators can read run.outputs["answer"]
def target(inputs: dict) -> dict:
    response = rag(inputs["question"])
    return {"answer": response.choices[0].message.content}

# Run evaluation
results = evaluate(
    target,
    data="rag_qa_dataset",
    evaluators=[accuracy_evaluator],
    experiment_prefix="rag_experiment"
)

# Per-example scores and the aggregate accuracy appear in the experiment
# that evaluate() links to in LangSmith.

Using Pre-built Evaluators

LangSmith also integrates with the open-source openevals package (install it with pip install openevals), which provides pre-built evaluators such as an LLM-as-judge correctness check:

from langsmith.evaluation import evaluate
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

# Pre-built LLM-as-judge evaluator that grades answers for correctness
correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    feedback_key="correctness",
    model="openai:gpt-4o-mini",
)

# Reuses the target function defined in the previous example
results = evaluate(
    target,
    data="rag_qa_dataset",
    evaluators=[correctness_evaluator],
    experiment_prefix="correctness_test"
)

3. Prompt Engineering: Iterate with Confidence

LangSmith’s prompt engineering capabilities provide version control, collaboration features, and systematic testing for prompt development.

Key Features:

  • Version Control: Automatic tracking of prompt changes
  • Collaboration: Team-based prompt development
  • A/B Testing: Compare different prompt versions
  • Integration: Seamless connection with your applications

Prompt Engineering Workflow

  1. Create Prompts in the UI: Use LangSmith’s web interface to design prompts
  2. Version Management: Automatically track changes and iterations
  3. Testing: Evaluate prompts against your datasets
  4. Deployment: Push successful prompts to production and pull them into your application code (see the sketch below)
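
In code, the prompt hub is exposed through the langsmith Client. The following is a minimal sketch of step 4, assuming a recent langsmith SDK with the langchain packages installed (pull_prompt returns a LangChain ChatPromptTemplate); the prompt name "rag-answer-prompt" is only a placeholder:

from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate

client = Client()

# Version a prompt in LangSmith (the name is a placeholder)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context:\n{context}"),
    ("user", "{question}"),
])
client.push_prompt("rag-answer-prompt", object=prompt)

# At runtime, pull the latest committed version and format it
pulled = client.pull_prompt("rag-answer-prompt")
messages = pulled.invoke({
    "context": "Harrison worked at Kensho",
    "question": "Where did Harrison work?",
})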

Advanced Use Cases

Multi-Agent Systems

LangSmith excels at tracing complex multi-agent workflows:

@traceable
def research_agent(topic: str):
    """Agent that researches a topic"""
    # Research logic here
    return f"Research findings on {topic}"

@traceable
def writing_agent(research: str, style: str):
    """Agent that writes content based on research"""
    # Writing logic here
    return f"Article written in {style} style based on: {research}"

@traceable
def multi_agent_workflow(topic: str, style: str):
    research = research_agent(topic)
    article = writing_agent(research, style)
    return article

Production Monitoring

Set up comprehensive monitoring for production applications:

from langsmith import traceable

@traceable(name="production_query", tags=["production", "rag"])
def production_rag(question: str, user_id: str):
    # Inputs (including user_id), outputs, latency, and any raised
    # exception are captured on the trace automatically.
    return rag(question)

# Per-request metadata and extra tags can be attached at call time:
production_rag(
    "Where did Harrison work?",
    user_id="user-123",
    langsmith_extra={"metadata": {"user_id": "user-123"}, "tags": ["beta-user"]},
)

Integration with LangChain and LangGraph

If you’re using LangChain or LangGraph, LangSmith integration is even simpler:

LangChain Integration

import os

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Set environment variables
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-api-key"

# Create a simple chain
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])
chain = prompt | llm | StrOutputParser()

# Automatic tracing is enabled!
result = chain.invoke({"input": "What is LangSmith?"})

LangGraph Integration

from typing import Annotated, TypedDict

from langchain_core.messages import HumanMessage
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages

# State schema: a list of messages with an append reducer
class State(TypedDict):
    messages: Annotated[list, add_messages]

def chatbot(state: State):
    # Reuses the llm (ChatOpenAI) defined in the LangChain example above
    return {"messages": [llm.invoke(state["messages"])]}

# Build graph
workflow = StateGraph(State)
workflow.add_node("chatbot", chatbot)
workflow.set_entry_point("chatbot")
workflow.add_edge("chatbot", END)
app = workflow.compile()

# Automatic tracing for the entire graph
result = app.invoke({"messages": [HumanMessage(content="Hello!")]})

Best Practices for Production

1. Structured Logging

@traceable(tags=["production", "user-facing"])
def production_function(input_data):
    # Add context to traces
    return process_data(input_data)

2. Error Handling

import logging

logger = logging.getLogger(__name__)

@traceable
def robust_llm_call(prompt: str):
    try:
        return llm.invoke(prompt)
    except Exception as e:
        # LangSmith will automatically capture the error on the trace
        logger.error(f"LLM call failed: {e}")
        return "I apologize, but I'm having trouble processing your request."

3. Performance Monitoring

import time

from langsmith import traceable

@traceable
def monitored_function(input_data):
    # @traceable already records start/end times and latency for every run,
    # so no separate timing run is needed.
    start_time = time.time()
    result = expensive_operation(input_data)  # placeholder for your real workload
    duration = time.time() - start_time

    # Including the custom timing in the outputs makes it visible and
    # filterable on the trace.
    return {"result": result, "duration_seconds": round(duration, 3)}

Getting Started: Step-by-Step

1. Sign Up and Get API Key

  1. Visit LangSmith at smith.langchain.com
  2. Create an account
  3. Navigate to Settings → API Keys
  4. Create a new API key

2. Install Dependencies

pip install langsmith openai

3. Set Environment Variables

export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="your-api-key"

4. Start Tracing

Add the @traceable decorator to your functions and start seeing insights immediately.
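
A first trace can be as small as the sketch below (the function is just a placeholder):

from langsmith import traceable

@traceable
def hello_langsmith(name: str) -> str:
    # With LANGSMITH_TRACING=true, this call shows up as a run in your project
    return f"Hello, {name}!"

hello_langsmith("world")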

5. Create Your First Evaluation

  1. Create a dataset with test cases
  2. Define evaluation metrics
  3. Run evaluations to measure performance

Conclusion

LangSmith transforms LLM application development from an art into a science. By providing comprehensive observability, systematic evaluation, and collaborative prompt engineering, it enables teams to build reliable, production-ready AI applications.

Whether you’re building a simple chatbot or a complex multi-agent system, LangSmith provides the tools you need to:

  • Debug effectively with detailed tracing
  • Measure performance with quantitative evaluations
  • Iterate confidently with version-controlled prompt engineering
  • Scale reliably with production monitoring

The platform’s framework-agnostic approach means you can integrate it into existing workflows, while its deep integration with LangChain and LangGraph provides seamless experiences for those ecosystems.

Start your LangSmith journey today and experience the difference that proper tooling makes in LLM application development. Your future self (and your users) will thank you for building more reliable, observable, and maintainable AI applications.


Ready to get started with LangSmith? Check out the official documentation at docs.smith.langchain.com and begin building production-ready LLM applications today.
