How to Build a Job Scraping Tool with Puppeteer in JavaScript
Web scraping is a powerful technique for extracting data from websites when APIs aren't available. For home gym enthusiasts looking to develop technical skills during rest days, building a simple web scraping tool can be an engaging project. This guide walks through creating a job scraping application using JavaScript and Puppeteer.
Setting Up Your Project
To begin, create a new folder for your project and open it in VS Code or your preferred editor. Initialize a new Node.js project with these steps:
- Open a terminal and run npm init to create a package.json file
- Add the type field and a start script to your package.json:
{
  "type": "module",
  "scripts": {
    "start": "node index.js"
  }
}
Installing Puppeteer
Puppeteer is a Node.js library that provides a high-level API to control a headless Chrome browser. This project also uses json2csv to convert the scraped results to CSV, so install both packages:
npm install puppeteer json2csv
Creating Your Scraper
Create an index.js file and import the necessary dependencies:
import puppeteer from 'puppeteer';
import { Parser } from 'json2csv';
import fs from 'fs';
Next, set up the browser instance and navigation:
// URL to scrape
const URL = 'https://www.naukri.com/software-development-jobs';
// Launch browser
const browser = await puppeteer.launch();
const page = await browser.newPage();
// Set user agent to avoid being blocked
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36');
// Navigate to the target URL
await page.goto(URL);
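Job listings on sites like this are often rendered dynamically, so the page can report itself loaded before the job cards exist. If you run into that, page.goto accepts options that hold navigation until network activity settles; a minimal variant of the line above:
// Optional: wait until network traffic has quieted before proceeding
await page.goto(URL, { waitUntil: 'networkidle2', timeout: 60000 });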
Extracting Job Information
To extract job information, you need to identify the CSS selectors for the elements containing the data you want. In this example, we're extracting job titles, company names, experience requirements, and locations:
// Wait for job cards to load
await page.waitForSelector('.jobTuple');
// Extract job information
const jobs = await page.$$eval('.jobTuple', cards => {
  return cards.map(card => {
    // Get the job title and link (the title element is an anchor)
    const titleEl = card.querySelector('.title');
    const title = titleEl?.innerText;
    const url = titleEl?.href;
    // Get the company name
    const companyEl = card.querySelector('.company');
    const companyName = companyEl?.innerText;
    // Get the experience requirement
    const experienceEl = card.querySelector('.expwdth');
    const experience = experienceEl?.innerText;
    // Get the location
    const locationEl = card.querySelector('.locwdth');
    const location = locationEl?.innerText;
    return {
      title,
      url,
      companyName,
      experience,
      location
    };
  });
});
Exporting to CSV
After extracting the data, you can export it to a CSV file for further analysis:
// Convert JSON to CSV
const parser = new Parser();
const csv = parser.parse(jobs);
// Write to file
fs.writeFileSync('jobs.csv', csv);
// Close the browser
await browser.close();
console.log('Job data saved to jobs.csv');
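CSV isn't the only option. If you'd rather keep the raw structure for another script to consume, Node's built-ins cover it; this one-liner is a minimal alternative, not part of the original flow:
// Alternative: save the same data as pretty-printed JSON
fs.writeFileSync('jobs.json', JSON.stringify(jobs, null, 2));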
Running Your Scraper
Run your scraper with the command:
npm start
The script will launch a headless Chrome browser, navigate to the job site, extract the job information, and save it to a CSV file. You can then open this file in Excel or any spreadsheet program to view and analyze the data.
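One caveat the walkthrough glosses over: if any step throws, say a selector no longer matches or navigation times out, the headless browser is left running. A minimal sketch of a safer shape for the same script, with the extraction and export steps from the earlier sections elided:
const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  await page.goto(URL);
  // ... extraction and CSV export steps from the sections above ...
} finally {
  // Runs whether scraping succeeded or threw, so Chrome never lingers
  await browser.close();
}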
Important Considerations
When scraping websites, remember:
- Always respect the website's terms of service and robots.txt file
- Add delays between requests to avoid overloading the server (see the sketch after this list)
- CSS selectors may change if the website updates its design
- Some websites actively block scraping attempts
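On the delay point, nothing Puppeteer-specific is required: a small promise-based pause between navigations is enough. A minimal sketch, where the sleep helper and the two-second figure are illustrative choices rather than anything the site mandates:
// Resolve after the given number of milliseconds
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Example: pause roughly two seconds between consecutive page loads
await page.goto(URL);
await sleep(2000);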
This simple project demonstrates the power of web scraping for data collection. With these fundamentals, you can adapt the technique to gather information on fitness equipment prices, exercise techniques, or other home-gym-related data from various sources.