Introduction to Web Scraping and FireCrawl Setup
Web scraping has evolved significantly over the past decade. What once required complex browser automation and constant maintenance now has elegant solutions that handle the heavy lifting for you. In this first part, we’ll explore modern web scraping challenges and set up FireCrawl as our primary tool.
The Evolution of Web Scraping
Traditional Challenges
Modern websites present unique challenges for scrapers:
- JavaScript-heavy content: Single Page Applications (SPAs) render content dynamically, so a plain HTTP request sees only an empty shell (see the sketch after this list)
- Anti-bot measures: CAPTCHA, rate limiting, and sophisticated detection systems
- Complex authentication: OAuth, multi-factor authentication, and session management
- Dynamic content: Infinite scroll, lazy loading, and real-time updates
- Inconsistent structures: Frequent layout changes and A/B testing
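To make the first point concrete, here is a minimal sketch of what naive scraping runs into. It assumes Node 18+ (built-in fetch) and uses a placeholder URL; for a client-rendered SPA, the raw response is just the initial HTML shell because no JavaScript ever executes.

```typescript
// A plain HTTP request executes no JavaScript, so for a client-rendered SPA the
// response body is typically a near-empty shell (often just <div id="root"></div>).
// The URL below is a placeholder - point it at any SPA you want to inspect.
async function fetchRawHtml(url: string): Promise<void> {
  const response = await fetch(url); // assumes Node 18+ with built-in fetch
  const html = await response.text();
  console.log(`Status: ${response.status}, HTML length: ${html.length}`);
  console.log(html.slice(0, 300)); // usually boilerplate, not the content you want
}

fetchRawHtml('https://spa.example.com').catch(console.error);
```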
The FireCrawl Solution
FireCrawl addresses these challenges by providing:
- Headless browser rendering: Full JavaScript execution and dynamic content handling
- Anti-detection technology: Sophisticated techniques to avoid bot detection
- Structured data extraction: AI-powered content parsing and formatting
- Scalable infrastructure: Handle high-volume scraping without managing servers
- Simple API interface: Clean, RESTful API that abstracts complexity
Understanding FireCrawl
What is FireCrawl?
FireCrawl is a web scraping service that provides a simple API for extracting data from modern websites. It handles the complexity of browser automation, JavaScript rendering, and anti-bot evasion while providing clean, structured data.
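Under the hood it is a plain HTTP API. As a rough illustration of how thin the interface is, a raw request looks something like the sketch below; the endpoint path and payload shape follow FireCrawl's v1 API at the time of writing, so double-check the current docs before relying on them. In this series we'll use the Node SDK (@mendable/firecrawl-js) instead, which wraps this call.

```typescript
// Sketch of the raw REST call the SDK wraps (endpoint and payload shape are
// based on FireCrawl's v1 API docs - verify against the current documentation).
async function scrapeViaRest(url: string): Promise<void> {
  const response = await fetch('https://api.firecrawl.dev/v1/scrape', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url, formats: ['markdown'] }),
  });
  console.log(await response.json()); // clean markdown plus metadata on success
}
```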
Key Features
- Smart Crawling: Automatically discovers and follows links
- Content Extraction: Converts web pages to clean markdown or structured data
- JavaScript Rendering: Full support for dynamic content
- Rate Limiting: Built-in respect for robots.txt and rate limits
- Data Formats: Multiple output formats (JSON, Markdown, HTML)
Pricing and Plans
FireCrawl offers several pricing tiers:
- Free Tier: 500 credits per month
- Starter: $20/month for 10,000 credits
- Professional: $100/month for 100,000 credits
- Enterprise: Custom pricing for high-volume usage
Note: Credits are consumed based on page complexity and processing requirements.
Setting Up Your Development Environment
Prerequisites
Before we begin, ensure you have:
```bash
# Node.js (version 16 or higher)
node --version

# npm or yarn package manager
npm --version
```
Project Initialization
Let’s create a new project for our web scraping experiments:
```bash
# Create a new directory
mkdir firecrawl-scraping-project
cd firecrawl-scraping-project

# Initialize npm project
npm init -y

# Install required dependencies
npm install @mendable/firecrawl-js dotenv
npm install -D typescript @types/node ts-node nodemon

# Create TypeScript configuration
npx tsc --init
```
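The npx tsc --init defaults are fine for this project. If you prefer a trimmed-down starting point tuned to the src/ and dist/ layout we use here, something along these lines works; treat the specific options as suggestions rather than requirements.

```json
{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "rootDir": "src",
    "outDir": "dist",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "resolveJsonModule": true
  },
  "include": ["src/**/*"]
}
```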
Project Structure
Create the following directory structure:
```
firecrawl-scraping-project/
├── src/
│   ├── config/
│   │   └── firecrawl.ts
│   ├── scrapers/
│   │   └── basic-scraper.ts
│   ├── utils/
│   │   └── helpers.ts
│   └── index.ts
├── data/
│   └── output/
├── .env
├── .gitignore
├── package.json
└── tsconfig.json
```
Let’s create these files:
```bash
# Create directory structure
mkdir -p src/{config,scrapers,utils} data/output

# Create basic files
touch src/config/firecrawl.ts
touch src/scrapers/basic-scraper.ts
touch src/utils/helpers.ts
touch src/index.ts
touch .env
touch .gitignore
```
FireCrawl Account Setup
1. Create Your Account
- Visit FireCrawl.dev
- Sign up for a free account
- Verify your email address
- Access your dashboard
2. Get Your API Key
- Navigate to the API Keys section in your dashboard
- Generate a new API key
- Copy the key securely
3. Environment Configuration
Add your API key to the .env file:
```
FIRECRAWL_API_KEY=your_api_key_here
NODE_ENV=development
```
Update your .gitignore file:
```
node_modules/
.env
.env.local
.env.production
dist/
build/
*.log
data/output/*.json
data/output/*.csv
.DS_Store
```
FireCrawl Configuration
Create the FireCrawl configuration file (src/config/firecrawl.ts):
```typescript
import FirecrawlApp from '@mendable/firecrawl-js';
import dotenv from 'dotenv';

dotenv.config();

if (!process.env.FIRECRAWL_API_KEY) {
  throw new Error('FIRECRAWL_API_KEY environment variable is required');
}

export const firecrawlApp = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY,
});

export const defaultScrapeOptions = {
  formats: ['markdown', 'html'] as const,
  includeTags: ['title', 'meta', 'h1', 'h2', 'h3', 'p', 'a', 'img'],
  excludeTags: ['script', 'style', 'nav', 'footer'],
  onlyMainContent: true,
  timeout: 30000,
};

export const defaultCrawlOptions = {
  limit: 10,
  scrapeOptions: defaultScrapeOptions,
  allowBackwardCrawling: false,
  allowExternalContentLinks: false,
};
```
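We won't touch defaultCrawlOptions until a later part, but as a rough preview, a whole-site crawl with the SDK looks something like the sketch below. crawlUrl is part of @mendable/firecrawl-js, though its exact signature and accepted option names vary between SDK versions, so verify against your installed version before using it.

```typescript
import { firecrawlApp, defaultCrawlOptions } from './config/firecrawl';

// Sketch only: start at one URL, follow links up to the configured limit,
// and log whatever the SDK returns. Option names may differ by SDK version.
async function previewCrawl(startUrl: string): Promise<void> {
  const crawlResult = await firecrawlApp.crawlUrl(startUrl, defaultCrawlOptions);
  console.log(JSON.stringify(crawlResult, null, 2));
}

previewCrawl('https://example.com').catch(console.error);
```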
Your First FireCrawl Script
Let’s create a simple scraper to test our setup in src/scrapers/basic-scraper.ts:
```typescript
import { firecrawlApp, defaultScrapeOptions } from '../config/firecrawl';

export interface ScrapedData {
  url: string;
  title: string;
  content: string;
  metadata: {
    scrapedAt: string;
    statusCode: number;
    contentLength: number;
  };
}

export class BasicScraper {
  async scrapeUrl(url: string): Promise<ScrapedData | null> {
    try {
      console.log(`🔍 Scraping: ${url}`);

      const scrapeResult = await firecrawlApp.scrapeUrl(url, {
        ...defaultScrapeOptions,
        formats: ['markdown'],
      });

      if (!scrapeResult.success) {
        console.error('❌ Scraping failed:', scrapeResult.error);
        return null;
      }

      const data: ScrapedData = {
        url,
        title: scrapeResult.data.metadata?.title || 'No title',
        content: scrapeResult.data.markdown || '',
        metadata: {
          scrapedAt: new Date().toISOString(),
          statusCode: scrapeResult.data.metadata?.statusCode || 0,
          contentLength: scrapeResult.data.markdown?.length || 0,
        },
      };

      console.log(`✅ Successfully scraped: ${data.title}`);
      return data;
    } catch (error) {
      console.error('❌ Error scraping URL:', error);
      return null;
    }
  }

  async scrapeMultipleUrls(urls: string[]): Promise<ScrapedData[]> {
    const results: ScrapedData[] = [];

    for (const url of urls) {
      const data = await this.scrapeUrl(url);
      if (data) {
        results.push(data);
        // Add delay to respect rate limits
        await new Promise(resolve => setTimeout(resolve, 1000));
      }
    }

    return results;
  }
}
```
Create the helper utilities (src/utils/helpers.ts):
```typescript
import fs from 'fs/promises';
import path from 'path';

export async function saveToFile(
  data: any,
  filename: string,
  format: 'json' | 'csv' = 'json'
): Promise<void> {
  const outputDir = path.join(process.cwd(), 'data', 'output');

  // Ensure output directory exists
  await fs.mkdir(outputDir, { recursive: true });

  const filePath = path.join(outputDir, `${filename}.${format}`);

  try {
    if (format === 'json') {
      await fs.writeFile(filePath, JSON.stringify(data, null, 2));
    } else if (format === 'csv') {
      // Simple CSV conversion (you might want to use a library like csv-writer)
      const csv = convertToCSV(data);
      await fs.writeFile(filePath, csv);
    }

    console.log(`💾 Data saved to: ${filePath}`);
  } catch (error) {
    console.error('❌ Error saving file:', error);
  }
}

function convertToCSV(data: any[]): string {
  if (!data.length) return '';

  const headers = Object.keys(data[0]);
  const csvHeaders = headers.join(',');

  const csvRows = data.map(row =>
    headers.map(header => {
      const value = row[header];
      // Quote values and escape embedded quotes (handles commas and nested objects)
      if (typeof value === 'object') {
        return `"${JSON.stringify(value).replace(/"/g, '""')}"`;
      }
      return `"${String(value).replace(/"/g, '""')}"`;
    }).join(',')
  );

  return [csvHeaders, ...csvRows].join('\n');
}

export function delay(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

export function isValidUrl(url: string): boolean {
  try {
    new URL(url);
    return true;
  } catch {
    return false;
  }
}
```
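A quick usage note on the helpers: isValidUrl is handy for sanity-checking a URL list before you spend credits on it. For example:

```typescript
import { isValidUrl } from './utils/helpers';

// Hypothetical input mixing valid and malformed entries.
const candidates = ['https://example.com', 'not-a-url', 'https://quotes.toscrape.com/'];

// Keep only entries that parse as URLs; 'not-a-url' is dropped.
const targets = candidates.filter(isValidUrl);
console.log(targets); // ['https://example.com', 'https://quotes.toscrape.com/']
```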
Testing Your Setup
Create the main entry point (src/index.ts):
```typescript
import { BasicScraper } from './scrapers/basic-scraper';
import { saveToFile } from './utils/helpers';

async function main() {
  const scraper = new BasicScraper();

  // Test URLs - replace with your targets
  const testUrls = [
    'https://example.com',
    'https://httpbin.org/html',
    'https://quotes.toscrape.com/',
  ];

  console.log('🚀 Starting web scraping test...');

  const results = await scraper.scrapeMultipleUrls(testUrls);

  if (results.length > 0) {
    await saveToFile(results, `scraping-test-${Date.now()}`, 'json');
    console.log(`✅ Scraped ${results.length} pages successfully!`);
  } else {
    console.log('❌ No data was scraped');
  }
}

// Run the script
main().catch(console.error);
```
Update your package.json scripts:
{ "scripts": { "start": "ts-node src/index.ts", "dev": "nodemon --exec ts-node src/index.ts", "build": "tsc", "test": "ts-node src/index.ts" }}Running Your First Scraper
Execute your scraper:
```bash
# Run the scraper
npm start

# Or for development with auto-reload
npm run dev
```
You should see output similar to:
```
🚀 Starting web scraping test...
🔍 Scraping: https://example.com
✅ Successfully scraped: Example Domain
🔍 Scraping: https://httpbin.org/html
✅ Successfully scraped: Herman Melville - Moby-Dick
💾 Data saved to: /path/to/data/output/scraping-test-1702234567890.json
✅ Scraped 2 pages successfully!
```
Understanding the Output
FireCrawl returns structured data including the following (a sample of the record our scraper saves appears after this list):
- Content: Clean markdown or HTML
- Metadata: Title, description, keywords, status codes
- Links: Extracted links and their relationships
- Images: Image URLs and alt text
- Structure: Heading hierarchy and content organization
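To make that concrete in terms of our own code, here is roughly what one record produced by BasicScraper and written by saveToFile looks like. The values are illustrative, not captured output.

```typescript
import { ScrapedData } from './scrapers/basic-scraper';

// Illustrative only: field values are made up to show the shape of a saved record.
const sampleRecord: ScrapedData = {
  url: 'https://example.com',
  title: 'Example Domain',
  content: '# Example Domain\n\nThis domain is for use in illustrative examples...',
  metadata: {
    scrapedAt: '2024-01-15T10:30:00.000Z',
    statusCode: 200,
    contentLength: 230,
  },
};
```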
Common Issues and Solutions
API Key Issues
```typescript
// Verify your API key is working
import { firecrawlApp } from './config/firecrawl';

async function testApiKey() {
  try {
    const result = await firecrawlApp.scrapeUrl('https://example.com');
    console.log('✅ API key is valid');
  } catch (error) {
    console.error('❌ API key issue:', error);
  }
}
```
Rate Limiting
```typescript
// Add delays between requests
await new Promise(resolve => setTimeout(resolve, 1000));
```
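If you still hit rate-limit errors, a retry-with-backoff wrapper around our scraper is a reasonable pattern. This is a sketch that treats any failed attempt as potentially transient; in practice you would inspect the error before retrying.

```typescript
import { BasicScraper, ScrapedData } from './scrapers/basic-scraper';
import { delay } from './utils/helpers';

// Retry a scrape a few times, doubling the wait after each failure.
async function scrapeWithBackoff(
  scraper: BasicScraper,
  url: string,
  maxAttempts = 3
): Promise<ScrapedData | null> {
  let wait = 1000;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const data = await scraper.scrapeUrl(url); // returns null on failure
    if (data) return data;
    if (attempt < maxAttempts) {
      console.log(`Attempt ${attempt} failed, retrying ${url} in ${wait}ms`);
      await delay(wait);
      wait *= 2; // exponential backoff
    }
  }
  return null;
}
```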
Timeout Issues
```typescript
// Increase timeout for slow websites
const scrapeOptions = {
  ...defaultScrapeOptions,
  timeout: 60000, // 60 seconds
};
```
Next Steps
In the next part, we’ll dive deeper into FireCrawl’s API capabilities and learn how to:
- Extract specific data using CSS selectors
- Handle different content formats
- Implement robust error handling
- Work with pagination and dynamic content
You now have a solid foundation for web scraping with FireCrawl. The setup we’ve created will serve as the base for all future examples in this series.
Key Takeaways
- FireCrawl simplifies modern web scraping challenges
- Proper project structure and configuration are essential
- Always respect rate limits and website policies
- Start with simple examples before tackling complex scenarios
- Environment variables keep your API keys secure
Ready to move on to basic scraping techniques? Let’s continue with Part 2!