Introduction to Web Scraping and FireCrawl Setup

Web scraping has evolved significantly over the past decade. Tasks that once required complex browser automation and constant maintenance now have elegant solutions that handle the heavy lifting for you. In this first part, we’ll explore modern web scraping challenges and set up FireCrawl as our primary tool.

The Evolution of Web Scraping

Traditional Challenges

Modern websites present challenges that traditional scraping approaches struggle to handle:

  • JavaScript-heavy content: Single Page Applications (SPAs) render content dynamically
  • Anti-bot measures: CAPTCHA, rate limiting, and sophisticated detection systems
  • Complex authentication: OAuth, multi-factor authentication, and session management
  • Dynamic content: Infinite scroll, lazy loading, and real-time updates
  • Inconsistent structures: Frequent layout changes and A/B testing

The FireCrawl Solution

FireCrawl addresses these challenges by providing:

  • Headless browser rendering: Full JavaScript execution and dynamic content handling
  • Anti-detection technology: Sophisticated techniques to avoid bot detection
  • Structured data extraction: AI-powered content parsing and formatting
  • Scalable infrastructure: Handle high-volume scraping without managing servers
  • Simple API interface: Clean, RESTful API that abstracts complexity

Understanding FireCrawl

What is FireCrawl?

FireCrawl is a web scraping service that provides a simple API for extracting data from modern websites. It handles the complexity of browser automation, JavaScript rendering, and anti-bot evasion while providing clean, structured data.
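
To make that concrete, here is a minimal sketch of a single call with the Node SDK we set up later in this part; the API key and URL are placeholders:

import FirecrawlApp from '@mendable/firecrawl-js';

// Placeholder key - later in this part we load it from an environment variable
const app = new FirecrawlApp({ apiKey: 'fc-your-api-key' });

async function preview() {
  // One call handles rendering, extraction, and formatting
  const result = await app.scrapeUrl('https://example.com', { formats: ['markdown'] });
  console.log(result);
}

preview().catch(console.error);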

Key Features

  1. Smart Crawling: Automatically discovers and follows links
  2. Content Extraction: Converts web pages to clean markdown or structured data
  3. JavaScript Rendering: Full support for dynamic content
  4. Rate Limiting: Built-in respect for robots.txt and rate limits
  5. Data Formats: Multiple output formats (JSON, Markdown, HTML)

Pricing and Plans

FireCrawl offers several pricing tiers:

  • Free Tier: 500 credits per month
  • Starter: $20/month for 10,000 credits
  • Professional: $100/month for 100,000 credits
  • Enterprise: Custom pricing for high-volume usage

Note: Credits are consumed based on page complexity and processing requirements.

Setting Up Your Development Environment

Prerequisites

Before we begin, ensure you have:

Terminal window
# Node.js (version 16 or higher)
node --version
# npm or yarn package manager
npm --version

Project Initialization

Let’s create a new project for our web scraping experiments:

Terminal window
# Create a new directory
mkdir firecrawl-scraping-project
cd firecrawl-scraping-project
# Initialize npm project
npm init -y
# Install required dependencies
npm install @mendable/firecrawl-js dotenv
npm install -D typescript @types/node ts-node nodemon
# Create TypeScript configuration
npx tsc --init

Project Structure

Create the following directory structure:

firecrawl-scraping-project/
├── src/
│   ├── config/
│   │   └── firecrawl.ts
│   ├── scrapers/
│   │   └── basic-scraper.ts
│   ├── utils/
│   │   └── helpers.ts
│   └── index.ts
├── data/
│   └── output/
├── .env
├── .gitignore
├── package.json
└── tsconfig.json

Let’s create these files:

Terminal window
# Create directory structure
mkdir -p src/{config,scrapers,utils} data/output
# Create basic files
touch src/config/firecrawl.ts
touch src/scrapers/basic-scraper.ts
touch src/utils/helpers.ts
touch src/index.ts
touch .env
touch .gitignore

FireCrawl Account Setup

1. Create Your Account

  1. Visit FireCrawl.dev
  2. Sign up for a free account
  3. Verify your email address
  4. Access your dashboard

2. Get Your API Key

  1. Navigate to the API Keys section in your dashboard
  2. Generate a new API key
  3. Copy the key securely

3. Environment Configuration

Add your API key to the .env file:

.env
FIRECRAWL_API_KEY=your_api_key_here
NODE_ENV=development

Update your .gitignore file:

.gitignore
node_modules/
.env
.env.local
.env.production
dist/
build/
*.log
data/output/*.json
data/output/*.csv
.DS_Store

FireCrawl Configuration

Create the FireCrawl configuration file:

src/config/firecrawl.ts
import FirecrawlApp from '@mendable/firecrawl-js';
import dotenv from 'dotenv';

dotenv.config();

if (!process.env.FIRECRAWL_API_KEY) {
  throw new Error('FIRECRAWL_API_KEY environment variable is required');
}

export const firecrawlApp = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY,
});

export const defaultScrapeOptions = {
  formats: ['markdown', 'html'] as const,
  includeTags: ['title', 'meta', 'h1', 'h2', 'h3', 'p', 'a', 'img'],
  excludeTags: ['script', 'style', 'nav', 'footer'],
  onlyMainContent: true,
  timeout: 30000,
};

export const defaultCrawlOptions = {
  limit: 10,
  scrapeOptions: defaultScrapeOptions,
  allowBackwardCrawling: false,
  allowExternalContentLinks: false,
};

Your First FireCrawl Script

Let’s create a simple scraper to test our setup:

src/scrapers/basic-scraper.ts
import { firecrawlApp, defaultScrapeOptions } from '../config/firecrawl';

export interface ScrapedData {
  url: string;
  title: string;
  content: string;
  metadata: {
    scrapedAt: string;
    statusCode: number;
    contentLength: number;
  };
}

export class BasicScraper {
  async scrapeUrl(url: string): Promise<ScrapedData | null> {
    try {
      console.log(`🔍 Scraping: ${url}`);

      const scrapeResult = await firecrawlApp.scrapeUrl(url, {
        ...defaultScrapeOptions,
        formats: ['markdown'],
      });

      if (!scrapeResult.success) {
        console.error('❌ Scraping failed:', scrapeResult.error);
        return null;
      }

      const data: ScrapedData = {
        url,
        title: scrapeResult.data.metadata?.title || 'No title',
        content: scrapeResult.data.markdown || '',
        metadata: {
          scrapedAt: new Date().toISOString(),
          statusCode: scrapeResult.data.metadata?.statusCode || 0,
          contentLength: scrapeResult.data.markdown?.length || 0,
        },
      };

      console.log(`✅ Successfully scraped: ${data.title}`);
      return data;
    } catch (error) {
      console.error('❌ Error scraping URL:', error);
      return null;
    }
  }

  async scrapeMultipleUrls(urls: string[]): Promise<ScrapedData[]> {
    const results: ScrapedData[] = [];

    for (const url of urls) {
      const data = await this.scrapeUrl(url);
      if (data) {
        results.push(data);
      }
      // Add a delay between requests to respect rate limits
      await new Promise(resolve => setTimeout(resolve, 1000));
    }

    return results;
  }
}

Create the helper utilities:

src/utils/helpers.ts
import fs from 'fs/promises';
import path from 'path';

export async function saveToFile(
  data: any,
  filename: string,
  format: 'json' | 'csv' = 'json'
): Promise<void> {
  const outputDir = path.join(process.cwd(), 'data', 'output');

  // Ensure output directory exists
  await fs.mkdir(outputDir, { recursive: true });

  const filePath = path.join(outputDir, `${filename}.${format}`);

  try {
    if (format === 'json') {
      await fs.writeFile(filePath, JSON.stringify(data, null, 2));
    } else if (format === 'csv') {
      // Simple CSV conversion (you might want to use a library like csv-writer)
      const csv = convertToCSV(data);
      await fs.writeFile(filePath, csv);
    }
    console.log(`💾 Data saved to: ${filePath}`);
  } catch (error) {
    console.error('❌ Error saving file:', error);
  }
}

function convertToCSV(data: any[]): string {
  if (!data.length) return '';

  const headers = Object.keys(data[0]);
  const csvHeaders = headers.join(',');

  const csvRows = data.map(row =>
    headers.map(header => {
      const value = row[header];
      // Handle nested objects and escape commas
      if (typeof value === 'object') {
        return `"${JSON.stringify(value).replace(/"/g, '""')}"`;
      }
      return `"${String(value).replace(/"/g, '""')}"`;
    }).join(',')
  );

  return [csvHeaders, ...csvRows].join('\n');
}

export function delay(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

export function isValidUrl(url: string): boolean {
  try {
    new URL(url);
    return true;
  } catch {
    return false;
  }
}
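
The delay and isValidUrl helpers are not wired into the scraper yet, but they become useful once you feed it arbitrary URL lists. A quick sketch of using isValidUrl (the candidate list is made up for illustration, and the import path assumes the snippet lives in src/index.ts):

import { isValidUrl } from './utils/helpers';

// Hypothetical input that may contain malformed entries
const candidates = ['https://example.com', 'not-a-url', 'https://quotes.toscrape.com/'];

// Keep only well-formed URLs before handing them to the scraper
const urls = candidates.filter(isValidUrl);
console.log(urls); // ['https://example.com', 'https://quotes.toscrape.com/']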

Testing Your Setup

Create the main entry point:

src/index.ts
import { BasicScraper } from './scrapers/basic-scraper';
import { saveToFile } from './utils/helpers';

async function main() {
  const scraper = new BasicScraper();

  // Test URLs - replace with your targets
  const testUrls = [
    'https://example.com',
    'https://httpbin.org/html',
    'https://quotes.toscrape.com/',
  ];

  console.log('🚀 Starting web scraping test...');

  const results = await scraper.scrapeMultipleUrls(testUrls);

  if (results.length > 0) {
    await saveToFile(results, `scraping-test-${Date.now()}`, 'json');
    console.log(`✅ Scraped ${results.length} pages successfully!`);
  } else {
    console.log('❌ No data was scraped');
  }
}

// Run the script
main().catch(console.error);

Update your package.json scripts:

{
  "scripts": {
    "start": "ts-node src/index.ts",
    "dev": "nodemon --exec ts-node src/index.ts",
    "build": "tsc",
    "test": "ts-node src/index.ts"
  }
}

Running Your First Scraper

Execute your scraper:

Terminal window
# Run the scraper
npm start
# Or for development with auto-reload
npm run dev

You should see output similar to:

🚀 Starting web scraping test...
🔍 Scraping: https://example.com
✅ Successfully scraped: Example Domain
🔍 Scraping: https://httpbin.org/html
✅ Successfully scraped: Herman Melville - Moby-Dick
🔍 Scraping: https://quotes.toscrape.com/
✅ Successfully scraped: Quotes to Scrape
💾 Data saved to: /path/to/data/output/scraping-test-1702234567890.json
✅ Scraped 3 pages successfully!

Understanding the Output

FireCrawl returns structured data including:

  • Content: Clean markdown or HTML
  • Metadata: Title, description, keywords, status codes
  • Links: Extracted links and their relationships
  • Images: Image URLs and alt text
  • Structure: Heading hierarchy and content organization
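
Our BasicScraper keeps only a subset of this in the ScrapedData interface. A single record in the saved JSON file looks roughly like this (the values are illustrative, not real output):

const example: ScrapedData = {
  url: 'https://example.com',
  title: 'Example Domain',
  content: '# Example Domain\n\nThis domain is for use in illustrative examples...',
  metadata: {
    scrapedAt: '2024-01-15T10:30:00.000Z',
    statusCode: 200,
    contentLength: 230,
  },
};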

Common Issues and Solutions

API Key Issues

// Verify your API key is working
import { firecrawlApp } from './config/firecrawl';

async function testApiKey() {
  try {
    await firecrawlApp.scrapeUrl('https://example.com');
    console.log('✅ API key is valid');
  } catch (error) {
    console.error('❌ API key issue:', error);
  }
}

testApiKey();

Rate Limiting

// Add delays between requests
await new Promise(resolve => setTimeout(resolve, 1000));
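
If you scrape many pages in a row, a fixed delay may not be enough on its own. A simple retry-with-backoff wrapper around BasicScraper (a sketch, not part of the FireCrawl SDK) can smooth over transient failures such as rate-limit responses:

import { BasicScraper, ScrapedData } from './scrapers/basic-scraper';

// Retry a scrape with exponential backoff: wait 1s, 2s, 4s, ... between attempts
async function scrapeWithBackoff(url: string, attempts = 3): Promise<ScrapedData | null> {
  const scraper = new BasicScraper();
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const data = await scraper.scrapeUrl(url);
    if (data) return data;
    if (attempt < attempts) {
      await new Promise(resolve => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
    }
  }
  return null;
}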

Timeout Issues

// Increase timeout for slow websites
const scrapeOptions = {
  ...defaultScrapeOptions,
  timeout: 60000, // 60 seconds
};

Next Steps

In the next part, we’ll dive deeper into FireCrawl’s API capabilities and learn how to:

  • Extract specific data using CSS selectors
  • Handle different content formats
  • Implement robust error handling
  • Work with pagination and dynamic content

You now have a solid foundation for web scraping with FireCrawl. The setup we’ve created will serve as the base for all future examples in this series.

Key Takeaways

  • FireCrawl simplifies modern web scraping challenges
  • Proper project structure and configuration are essential
  • Always respect rate limits and website policies
  • Start with simple examples before tackling complex scenarios
  • Environment variables keep your API keys secure

Ready to move on to basic scraping techniques? Let’s continue with Part 2!
