Introduction to Web Scraping and FireCrawl Setup
Web scraping has evolved significantly over the past decade. What once required complex browser automation and constant maintenance now has elegant solutions that handle the heavy lifting for you. In this first part, we’ll explore modern web scraping challenges and set up FireCrawl as our primary tool.
The Evolution of Web Scraping
Traditional Challenges
Modern websites present unique challenges for scrapers:
- JavaScript-heavy content: Single Page Applications (SPAs) render content dynamically, so a plain HTTP request sees only an empty shell (see the sketch after this list)
- Anti-bot measures: CAPTCHA, rate limiting, and sophisticated detection systems
- Complex authentication: OAuth, multi-factor authentication, and session management
- Dynamic content: Infinite scroll, lazy loading, and real-time updates
- Inconsistent structures: Frequent layout changes and A/B testing
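To make the first point concrete, here is a minimal sketch of what naive scraping runs into. It assumes Node 18+ (built-in fetch) and uses a placeholder URL; for a client-rendered SPA, the raw response is just the initial HTML shell because no JavaScript ever executes.

```typescript
// A plain HTTP request executes no JavaScript, so for a client-rendered SPA the
// response body is typically a near-empty shell (often just <div id="root"></div>).
// The URL below is a placeholder - point it at any SPA you want to inspect.
async function fetchRawHtml(url: string): Promise<void> {
  const response = await fetch(url); // assumes Node 18+ with built-in fetch
  const html = await response.text();
  console.log(`Status: ${response.status}, HTML length: ${html.length}`);
  console.log(html.slice(0, 300)); // usually boilerplate, not the content you want
}

fetchRawHtml('https://spa.example.com').catch(console.error);
```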
The FireCrawl Solution
FireCrawl addresses these challenges by providing:
- Headless browser rendering: Full JavaScript execution and dynamic content handling
- Anti-detection technology: Sophisticated techniques to avoid bot detection
- Structured data extraction: AI-powered content parsing and formatting
- Scalable infrastructure: Handle high-volume scraping without managing servers
- Simple API interface: Clean, RESTful API that abstracts complexity
Understanding FireCrawl
What is FireCrawl?
FireCrawl is a web scraping service that provides a simple API for extracting data from modern websites. It handles the complexity of browser automation, JavaScript rendering, and anti-bot evasion while providing clean, structured data.
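Under the hood it is a plain HTTP API. As a rough illustration of how thin the interface is, a raw request looks something like the sketch below; the endpoint path and payload shape follow FireCrawl's v1 API at the time of writing, so double-check the current docs before relying on them. In this series we'll use the Node SDK (@mendable/firecrawl-js) instead, which wraps this call.

```typescript
// Sketch of the raw REST call the SDK wraps (endpoint and payload shape are
// based on FireCrawl's v1 API docs - verify against the current documentation).
async function scrapeViaRest(url: string): Promise<void> {
  const response = await fetch('https://api.firecrawl.dev/v1/scrape', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url, formats: ['markdown'] }),
  });
  console.log(await response.json()); // clean markdown plus metadata on success
}
```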
Key Features
- Smart Crawling: Automatically discovers and follows links
- Content Extraction: Converts web pages to clean markdown or structured data
- JavaScript Rendering: Full support for dynamic content
- Rate Limiting: Built-in respect for robots.txt and rate limits
- Data Formats: Multiple output formats (JSON, Markdown, HTML)
Pricing and Plans
FireCrawl offers several pricing tiers:
- Free Tier: 500 credits per month
- Starter: $20/month for 10,000 credits
- Professional: $100/month for 100,000 credits
- Enterprise: Custom pricing for high-volume usage
Note: Credits are consumed based on page complexity and processing requirements.
Setting Up Your Development Environment
Prerequisites
Before we begin, ensure you have:
```bash
# Node.js (version 16 or higher)
node --version

# npm or yarn package manager
npm --version
```
Project Initialization
Let’s create a new project for our web scraping experiments:
```bash
# Create a new directory
mkdir firecrawl-scraping-project
cd firecrawl-scraping-project

# Initialize npm project
npm init -y

# Install required dependencies
npm install @mendable/firecrawl-js dotenv
npm install -D typescript @types/node ts-node nodemon

# Create TypeScript configuration
npx tsc --init
```
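The npx tsc --init defaults are fine for this project. If you prefer a trimmed-down starting point tuned to the src/ and dist/ layout we use here, something along these lines works; treat the specific options as suggestions rather than requirements.

```json
{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "rootDir": "src",
    "outDir": "dist",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "resolveJsonModule": true
  },
  "include": ["src/**/*"]
}
```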
Project Structure
Create the following directory structure:
```
firecrawl-scraping-project/
├── src/
│   ├── config/
│   │   └── firecrawl.ts
│   ├── scrapers/
│   │   └── basic-scraper.ts
│   ├── utils/
│   │   └── helpers.ts
│   └── index.ts
├── data/
│   └── output/
├── .env
├── .gitignore
├── package.json
└── tsconfig.json
```
Let’s create these files:
```bash
# Create directory structure
mkdir -p src/{config,scrapers,utils} data/output

# Create basic files
touch src/config/firecrawl.ts
touch src/scrapers/basic-scraper.ts
touch src/utils/helpers.ts
touch src/index.ts
touch .env
touch .gitignore
```
FireCrawl Account Setup
1. Create Your Account
- Visit FireCrawl.dev
- Sign up for a free account
- Verify your email address
- Access your dashboard
2. Get Your API Key
- Navigate to the API Keys section in your dashboard
- Generate a new API key
- Copy the key securely
3. Environment Configuration
Add your API key to the .env file:
```
FIRECRAWL_API_KEY=your_api_key_here
NODE_ENV=development
```
Update your .gitignore file:
```
node_modules/
.env
.env.local
.env.production
dist/
build/
*.log
data/output/*.json
data/output/*.csv
.DS_Store
```
FireCrawl Configuration
Create the FireCrawl configuration file (src/config/firecrawl.ts):
```typescript
import FirecrawlApp from '@mendable/firecrawl-js';
import dotenv from 'dotenv';

dotenv.config();

if (!process.env.FIRECRAWL_API_KEY) {
  throw new Error('FIRECRAWL_API_KEY environment variable is required');
}

export const firecrawlApp = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY,
});

export const defaultScrapeOptions = {
  formats: ['markdown', 'html'] as const,
  includeTags: ['title', 'meta', 'h1', 'h2', 'h3', 'p', 'a', 'img'],
  excludeTags: ['script', 'style', 'nav', 'footer'],
  onlyMainContent: true,
  timeout: 30000,
};

export const defaultCrawlOptions = {
  limit: 10,
  scrapeOptions: defaultScrapeOptions,
  allowBackwardCrawling: false,
  allowExternalContentLinks: false,
};
```
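We won't touch defaultCrawlOptions until a later part, but as a rough preview, a whole-site crawl with the SDK looks something like the sketch below. crawlUrl is part of @mendable/firecrawl-js, though its exact signature and accepted option names vary between SDK versions, so verify against your installed version before using it.

```typescript
import { firecrawlApp, defaultCrawlOptions } from './config/firecrawl';

// Sketch only: start at one URL, follow links up to the configured limit,
// and log whatever the SDK returns. Option names may differ by SDK version.
async function previewCrawl(startUrl: string): Promise<void> {
  const crawlResult = await firecrawlApp.crawlUrl(startUrl, defaultCrawlOptions);
  console.log(JSON.stringify(crawlResult, null, 2));
}

previewCrawl('https://example.com').catch(console.error);
```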
Your First FireCrawl Script
Let’s create a simple scraper to test our setup in src/scrapers/basic-scraper.ts:
```typescript
import { firecrawlApp, defaultScrapeOptions } from '../config/firecrawl';

export interface ScrapedData {
  url: string;
  title: string;
  content: string;
  metadata: {
    scrapedAt: string;
    statusCode: number;
    contentLength: number;
  };
}

export class BasicScraper {
  async scrapeUrl(url: string): Promise<ScrapedData | null> {
    try {
      console.log(`🔍 Scraping: ${url}`);

      const scrapeResult = await firecrawlApp.scrapeUrl(url, {
        ...defaultScrapeOptions,
        formats: ['markdown'],
      });

      if (!scrapeResult.success) {
        console.error('❌ Scraping failed:', scrapeResult.error);
        return null;
      }

      const data: ScrapedData = {
        url,
        title: scrapeResult.data.metadata?.title || 'No title',
        content: scrapeResult.data.markdown || '',
        metadata: {
          scrapedAt: new Date().toISOString(),
          statusCode: scrapeResult.data.metadata?.statusCode || 0,
          contentLength: scrapeResult.data.markdown?.length || 0,
        },
      };

      console.log(`✅ Successfully scraped: ${data.title}`);
      return data;
    } catch (error) {
      console.error('❌ Error scraping URL:', error);
      return null;
    }
  }

  async scrapeMultipleUrls(urls: string[]): Promise<ScrapedData[]> {
    const results: ScrapedData[] = [];

    for (const url of urls) {
      const data = await this.scrapeUrl(url);
      if (data) {
        results.push(data);
        // Add delay to respect rate limits
        await new Promise(resolve => setTimeout(resolve, 1000));
      }
    }

    return results;
  }
}
```
Create the helper utilities (src/utils/helpers.ts):
```typescript
import fs from 'fs/promises';
import path from 'path';

export async function saveToFile(
  data: any,
  filename: string,
  format: 'json' | 'csv' = 'json'
): Promise<void> {
  const outputDir = path.join(process.cwd(), 'data', 'output');

  // Ensure output directory exists
  await fs.mkdir(outputDir, { recursive: true });

  const filePath = path.join(outputDir, `${filename}.${format}`);

  try {
    if (format === 'json') {
      await fs.writeFile(filePath, JSON.stringify(data, null, 2));
    } else if (format === 'csv') {
      // Simple CSV conversion (you might want to use a library like csv-writer)
      const csv = convertToCSV(data);
      await fs.writeFile(filePath, csv);
    }

    console.log(`💾 Data saved to: ${filePath}`);
  } catch (error) {
    console.error('❌ Error saving file:', error);
  }
}

function convertToCSV(data: any[]): string {
  if (!data.length) return '';

  const headers = Object.keys(data[0]);
  const csvHeaders = headers.join(',');

  const csvRows = data.map(row =>
    headers.map(header => {
      const value = row[header];
      // Quote values and escape embedded quotes (handles commas and nested objects)
      if (typeof value === 'object') {
        return `"${JSON.stringify(value).replace(/"/g, '""')}"`;
      }
      return `"${String(value).replace(/"/g, '""')}"`;
    }).join(',')
  );

  return [csvHeaders, ...csvRows].join('\n');
}

export function delay(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

export function isValidUrl(url: string): boolean {
  try {
    new URL(url);
    return true;
  } catch {
    return false;
  }
}
```
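A quick usage note on the helpers: isValidUrl is handy for sanity-checking a URL list before you spend credits on it. For example:

```typescript
import { isValidUrl } from './utils/helpers';

// Hypothetical input mixing valid and malformed entries.
const candidates = ['https://example.com', 'not-a-url', 'https://quotes.toscrape.com/'];

// Keep only entries that parse as URLs; 'not-a-url' is dropped.
const targets = candidates.filter(isValidUrl);
console.log(targets); // ['https://example.com', 'https://quotes.toscrape.com/']
```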
Testing Your Setup
Create the main entry point (src/index.ts):
```typescript
import { BasicScraper } from './scrapers/basic-scraper';
import { saveToFile } from './utils/helpers';

async function main() {
  const scraper = new BasicScraper();

  // Test URLs - replace with your targets
  const testUrls = [
    'https://example.com',
    'https://httpbin.org/html',
    'https://quotes.toscrape.com/',
  ];

  console.log('🚀 Starting web scraping test...');

  const results = await scraper.scrapeMultipleUrls(testUrls);

  if (results.length > 0) {
    await saveToFile(results, `scraping-test-${Date.now()}`, 'json');
    console.log(`✅ Scraped ${results.length} pages successfully!`);
  } else {
    console.log('❌ No data was scraped');
  }
}

// Run the script
main().catch(console.error);
```
Update your package.json scripts:
{ "scripts": { "start": "ts-node src/index.ts", "dev": "nodemon --exec ts-node src/index.ts", "build": "tsc", "test": "ts-node src/index.ts" }}Running Your First Scraper
Execute your scraper:
```bash
# Run the scraper
npm start

# Or for development with auto-reload
npm run dev
```
You should see output similar to:
```
🚀 Starting web scraping test...
🔍 Scraping: https://example.com
✅ Successfully scraped: Example Domain
🔍 Scraping: https://httpbin.org/html
✅ Successfully scraped: Herman Melville - Moby-Dick
💾 Data saved to: /path/to/data/output/scraping-test-1702234567890.json
✅ Scraped 2 pages successfully!
```
Understanding the Output
FireCrawl returns structured data including the following (a sample of the record our scraper saves appears after this list):
- Content: Clean markdown or HTML
- Metadata: Title, description, keywords, status codes
- Links: Extracted links and their relationships
- Images: Image URLs and alt text
- Structure: Heading hierarchy and content organization
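To make that concrete in terms of our own code, here is roughly what one record produced by BasicScraper and written by saveToFile looks like. The values are illustrative, not captured output.

```typescript
import { ScrapedData } from './scrapers/basic-scraper';

// Illustrative only: field values are made up to show the shape of a saved record.
const sampleRecord: ScrapedData = {
  url: 'https://example.com',
  title: 'Example Domain',
  content: '# Example Domain\n\nThis domain is for use in illustrative examples...',
  metadata: {
    scrapedAt: '2024-01-15T10:30:00.000Z',
    statusCode: 200,
    contentLength: 230,
  },
};
```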
Common Issues and Solutions
API Key Issues
```typescript
// Verify your API key is working
import { firecrawlApp } from './config/firecrawl';

async function testApiKey() {
  try {
    const result = await firecrawlApp.scrapeUrl('https://example.com');
    console.log('✅ API key is valid');
  } catch (error) {
    console.error('❌ API key issue:', error);
  }
}
```
Rate Limiting
```typescript
// Add delays between requests
await new Promise(resolve => setTimeout(resolve, 1000));
```
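If you still hit rate-limit errors, a retry-with-backoff wrapper around our scraper is a reasonable pattern. This is a sketch that treats any failed attempt as potentially transient; in practice you would inspect the error before retrying.

```typescript
import { BasicScraper, ScrapedData } from './scrapers/basic-scraper';
import { delay } from './utils/helpers';

// Retry a scrape a few times, doubling the wait after each failure.
async function scrapeWithBackoff(
  scraper: BasicScraper,
  url: string,
  maxAttempts = 3
): Promise<ScrapedData | null> {
  let wait = 1000;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const data = await scraper.scrapeUrl(url); // returns null on failure
    if (data) return data;
    if (attempt < maxAttempts) {
      console.log(`Attempt ${attempt} failed, retrying ${url} in ${wait}ms`);
      await delay(wait);
      wait *= 2; // exponential backoff
    }
  }
  return null;
}
```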
Timeout Issues
```typescript
// Increase timeout for slow websites
const scrapeOptions = {
  ...defaultScrapeOptions,
  timeout: 60000, // 60 seconds
};
```
Next Steps
In the next part, we’ll dive deeper into FireCrawl’s API capabilities and learn how to:
- Extract specific data using CSS selectors
- Handle different content formats
- Implement robust error handling
- Work with pagination and dynamic content
You now have a solid foundation for web scraping with FireCrawl. The setup we’ve created will serve as the base for all future examples in this series.
Key Takeaways
- FireCrawl simplifies modern web scraping challenges
- Proper project structure and configuration are essential
- Always respect rate limits and website policies
- Start with simple examples before tackling complex scenarios
- Environment variables keep your API keys secure
Ready to move on to basic scraping techniques? Let’s continue with Part 2!