Scraping Shopee/Lazada Data Automatically (marketplace-report-crawler)
In the hyper-competitive world of Southeast Asian e-commerce, information isn’t just power—it’s the difference between a thriving storefront and a digital ghost town. If you’ve ever spent a Sunday afternoon manually copying prices from Shopee or checking competitor stock levels on Lazada, you know the “Midnight Price War” pain. By the time you’ve updated your spreadsheet, the market has already moved.
This is where the traditional approach to development fails. In the old world, you would hire a developer to write a rigid Python script that breaks the moment Shopee changes a CSS class name from `product-price` to `p-price-v2`. In the world of Vibe Coding, we don’t just build scrapers; we build intelligent, resilient “marketplace-report-crawlers” that understand the intent of the data they are seeking.
This guide will show you how to automate marketplace data extraction using AI-driven agents, ensuring your business stays ahead of the curve without the technical debt of legacy scrapers.
The Core Problem: Why Marketplaces are Hard to Scrape
Marketplaces like Shopee and Lazada are not static websites; they are complex, highly defensive web applications. They employ several layers of protection that make traditional scraping a nightmare:
- Dynamic Content Loading: Most data (prices, reviews, stock) is loaded via JavaScript after the initial page load. A simple `requests` call in Python will often return an empty shell.
- Anti-Bot Shields: They use sophisticated fingerprinting to detect non-human behavior. If your “vibe” is too robotic, you’ll be met with a captcha or a permanent IP ban.
- Variable Selectors: To thwart scrapers, these platforms frequently update their HTML structure. A scraper built on Monday might be useless by Wednesday.
- Infinite Scrolling: Product lists don’t have traditional pagination; they load more as you scroll, requiring active browser interaction.
How the “Vibe Coding” Solution Works
In Vibe Coding, we move away from “Hardcoded Selectors” and move toward “Semantic Extraction.” Instead of telling the computer, “Find the text inside the second div of the third span,” we tell our AI agent, “Look at this page and find the price of the item, regardless of its HTML tag.”
Our architecture for a modern marketplace-report-crawler follows a three-stage pipeline:
- The Orchestrator: Uses Playwright or Puppeteer to navigate the site, handle cookies, and mimic human scrolling.
- The Semantic Parser: Instead of Regex, we feed the raw HTML (or a cleaned Markdown version) into an LLM (like Gemini or GPT-4o) to extract structured JSON.
- The Reporter: Automatically pushes this data into a dashboard, a Google Sheet, or a Slack alert.
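The three-stage pipeline can be sketched as a set of type contracts. This is an illustrative sketch, not a fixed API: all names here (`PageSnapshot`, `ProductRecord`, `runPipeline`, and so on) are assumptions introduced for this example.

```typescript
// Illustrative contracts for the Orchestrator -> Semantic Parser -> Reporter pipeline.
interface PageSnapshot {
  url: string;
  rawText: string;   // cleaned text handed to the semantic parser
  fetchedAt: Date;
}

interface ProductRecord {
  name: string;
  price: number;     // in the marketplace's local currency
  rating: number;    // out of 5
  sold: number;
}

interface Reporter {
  publish(records: ProductRecord[]): Promise<void>;
}

type Orchestrator = (keyword: string) => Promise<PageSnapshot>;
type SemanticParser = (snapshot: PageSnapshot) => Promise<ProductRecord[]>;

// The orchestrator produces a snapshot, the parser turns it into records,
// and the reporter pushes them downstream. Returns the record count.
async function runPipeline(
  keyword: string,
  fetch: Orchestrator,
  parse: SemanticParser,
  reporter: Reporter
): Promise<number> {
  const snapshot = await fetch(keyword);
  const records = await parse(snapshot);
  await reporter.publish(records);
  return records.length;
}
```

Keeping the stages behind interfaces like this means you can swap Playwright for Puppeteer, or Gemini for GPT-4o, without touching the rest of the pipeline.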
How it Works: The Technical Architecture
To build a scraper that actually survives more than a week, we need to implement a “Headless Browser with an AI Brain.”
1. Browser Context & Stealth
We use Playwright because it allows us to control Chromium, Firefox, and WebKit with a single API. More importantly, it supports “Stealth” plugins that modify the browser’s fingerprint (like the `navigator.webdriver` flag) to make it indistinguishable from a real user on a MacBook Pro or a Windows PC.
2. The Interaction Loop
The crawler doesn’t just “get” a URL. It:
- Sets a realistic User-Agent.
- Waits for “Network Idle” to ensure all JavaScript has fired.
- Performs “Micro-Scrolls”—scrolling down 300 pixels, waiting 2 seconds, then scrolling again—to trigger lazy-loading images and prices.
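The micro-scroll step can be sketched as a small helper. To keep the sketch testable without a real browser, `ScrollTarget` is a narrowed, assumed view of Playwright's `Page` (only the two methods the loop actually uses).

```typescript
// A narrowed view of Playwright's Page, so the loop can run against a stub.
interface ScrollTarget {
  mouse: { wheel(dx: number, dy: number): Promise<void> };
  waitForTimeout(ms: number): Promise<void>;
}

// Scroll down in small steps with a pause between each, so lazy-loaded
// prices and images have time to render before we read the page.
async function microScroll(
  page: ScrollTarget,
  steps = 5,
  pixels = 300,
  pauseMs = 2000
): Promise<void> {
  for (let i = 0; i < steps; i++) {
    await page.mouse.wheel(0, pixels);   // scroll down by `pixels`
    await page.waitForTimeout(pauseMs);  // wait for lazy content to load
  }
}
```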
3. AI-Powered Data Sanitization
One of the biggest breakthroughs in Vibe Coding is the ability to ignore the “junk.” A marketplace page has thousands of lines of HTML code dedicated to tracking, ads, and navigation. We use a “Text-to-Markdown” converter to strip the HTML down to its bare essentials (links, text, and prices) before sending it to the AI for extraction. This saves tokens and increases accuracy.
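A minimal version of that sanitization step might look like the sketch below. A production crawler would typically use a proper HTML-to-Markdown library; this regex-based stripper is only meant to show the idea of discarding scripts, styles, and tags before the text reaches the LLM.

```typescript
// Minimal HTML sanitizer: keep visible text, drop scripts, styles, and tags.
// A real pipeline would likely use a dedicated HTML-to-Markdown converter.
function stripToEssentials(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, '') // drop tracking scripts
    .replace(/<style[\s\S]*?<\/style>/gi, '')   // drop inline CSS
    .replace(/<[^>]+>/g, ' ')                   // strip remaining tags
    .replace(/\s+/g, ' ')                       // collapse whitespace
    .trim();
}
```

Every character stripped here is a token you don't pay for, and less noise for the model to get confused by.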
Practical Example: Building the Crawler
Let’s look at a practical implementation using a Vibe Coding approach. We will use TypeScript with Playwright and an AI extraction layer.
The Setup
First, we define our goal: “Extract product name, price, and rating for ‘Ergonomic Chairs’ from Shopee.”
```typescript
import { chromium } from 'playwright-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// Use stealth to bypass basic bot detection
chromium.use(StealthPlugin());

async function runMarketplaceCrawler(keyword: string) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
  });
  const page = await context.newPage();

  try {
    // Navigate to the marketplace search results
    console.log(`Searching for ${keyword}...`);
    await page.goto(`https://shopee.sg/search?keyword=${encodeURIComponent(keyword)}`);

    // Wait for the product grid to appear
    await page.waitForSelector('.shopee-search-item-result__items');

    // Perform a human-like scroll to trigger lazy-loaded data
    for (let i = 0; i < 5; i++) {
      await page.mouse.wheel(0, 500);
      await page.waitForTimeout(1000);
    }

    // Instead of complex CSS selectors, grab the innerText of the results container
    const rawContent = await page.innerText('.shopee-search-item-result__items');

    // THE VIBE CODING MAGIC: send the raw text to an AI agent
    const structuredData = await extractWithAI(rawContent);
    console.log('Report Generated:', structuredData);
  } finally {
    // Always close the browser, even if a selector times out
    await browser.close();
  }
}
```
The AI Extraction Layer
Now, instead of writing a parser for every single marketplace, we use a single function that asks an LLM to turn that messy `rawContent` into a clean JSON array.
```typescript
async function extractWithAI(text: string) {
  // Truncate to save tokens before building the prompt
  const snippet = text.substring(0, 5000);

  // The prompt defines the 'Job to be Done'
  const prompt = `
You are an expert data analyst. Below is raw text from an e-commerce marketplace search page.
Your task is to extract a list of products.

For each product, identify:
- Product Name
- Current Price (in SGD)
- Rating (out of 5)
- Number of Items Sold

Return the data ONLY as a valid JSON array.

Raw Text: ${snippet}
`;

  // Call your LLM of choice (Gemini, OpenAI, etc.)
  const response = await aiProvider.generate(prompt);
  return JSON.parse(response);
}
```
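One practical caveat: a bare `JSON.parse(response)` will throw whenever the model wraps its answer in markdown fences or adds a sentence of commentary, which happens often in practice. A defensive parsing helper along these lines (the function name is our own) makes the extraction layer far more robust:

```typescript
// LLMs often wrap JSON in markdown fences or add commentary, so a bare
// JSON.parse can throw. This helper extracts the first JSON array it finds.
function parseAIResponse(response: string): unknown[] {
  const fenced = response.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = fenced ? fenced[1] : response;
  const start = candidate.indexOf('[');
  const end = candidate.lastIndexOf(']');
  if (start === -1 || end === -1) {
    throw new Error('No JSON array found in AI response');
  }
  return JSON.parse(candidate.slice(start, end + 1));
}
```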
Why this is better:
- Resilience: If Shopee changes the class name from `_32_asd` to `_99_xyz`, the `innerText` of the parent container usually still contains the price and name. The AI “understands” that $49.90 next to “Chair” is a price, regardless of the HTML tag.
- Speed of Development: You don’t spend hours in the Chrome DevTools Inspector. You focus on the “vibe” of the data you want.
Best Practices & Tips for Marketplace Crawling
Even with AI, you need to be a “good citizen” of the web and ensure your crawler is efficient.
1. Use Residential Proxies
Datacenter IPs (like those from AWS or Google Cloud) are often blocked instantly. If you are running a serious report-crawler, use a residential proxy service. This routes your traffic through real home internet connections, making your “vibe” appear as a legitimate local shopper.
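Playwright accepts proxy settings at launch time. The sketch below shows the shape of that configuration; the server URL and credentials are placeholders for whatever your proxy provider issues, and the local `ProxyLaunchOptions` type simply mirrors the relevant slice of Playwright's launch options so the example stands alone.

```typescript
// Shape mirrors Playwright's proxy launch option; defined locally so this
// sketch compiles without the playwright package installed.
interface ProxyLaunchOptions {
  headless: boolean;
  proxy: { server: string; username?: string; password?: string };
}

function proxyLaunchOptions(
  server: string,
  username: string,
  password: string
): ProxyLaunchOptions {
  return { headless: true, proxy: { server, username, password } };
}

// Usage (values are placeholders):
// const browser = await chromium.launch(
//   proxyLaunchOptions('http://proxy.example.com:8000', 'user', 'pass')
// );
```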
2. Implement Rate Limiting
Do not hit the server 100 times a second. In Vibe Coding, we prioritize quality over raw volume. Add random “jitter” to your requests. Instead of scraping every 60 seconds, scrape at 58 seconds, then 72 seconds, then 64 seconds.
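That jitter is one line of code. A small helper like this (the name and 20% default ratio are our own choices) keeps your request timing from ever forming a machine-perfect pattern:

```typescript
// Return a delay around baseMs, randomly offset by up to +/- jitterRatio.
// e.g. jitteredDelayMs(60000, 0.2) yields something between 48s and 72s.
function jitteredDelayMs(baseMs: number, jitterRatio = 0.2): number {
  const jitter = (Math.random() * 2 - 1) * jitterRatio * baseMs;
  return Math.round(baseMs + jitter);
}
```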
3. Cache the “Big” Data
Marketplace pages are heavy. If you are scraping the same category multiple times a day, consider if you really need to reload images or CSS. You can use Playwright’s route feature to block requests for .jpg, .png, and .css files. This speeds up your crawler by 3x and reduces your bandwidth costs.
```typescript
await page.route('**/*.{png,jpg,jpeg,css}', route => route.abort());
```
4. Monitor “Marketplace Health”
Websites change. Even the best AI-driven scraper needs a “Dead Man’s Switch.” Set up a simple check: if the AI returns 0 products for a keyword that usually returns 50, send yourself a Slack notification. This usually means the marketplace has implemented a new type of pop-up or captcha that your interaction loop needs to account for.
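A minimal version of that switch is just a comparison against a historical baseline. In this sketch the threshold, function name, and the `alert` callback (which would wrap your Slack webhook) are all illustrative:

```typescript
// Dead man's switch: flag likely breakage when today's result count drops
// far below the historical baseline. `alert` is a placeholder notifier.
function checkCrawlHealth(
  keyword: string,
  resultCount: number,
  baselineCount: number,
  alert: (message: string) => void,
  threshold = 0.2 // alert if we get less than 20% of the usual results
): boolean {
  const healthy = resultCount >= baselineCount * threshold;
  if (!healthy) {
    alert(
      `Crawler health check failed for "${keyword}": ` +
      `got ${resultCount} products, expected around ${baselineCount}`
    );
  }
  return healthy;
}
```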
Solving Real-World Business Problems
The marketplace-report-crawler isn’t just a technical exercise; it solves concrete problems that manual workers face every day:
- Dynamic Pricing: A retailer can set their prices to be 1% lower than the top competitor automatically. If the crawler detects a competitor’s flash sale, the business can react in real-time.
- Inventory Intelligence: By tracking the “Number of Items Sold” over several days, you can calculate the “Run Rate” of a competitor’s product. This tells you exactly which products are trending before you place your own wholesale orders.
- Review Sentiment Analysis: Beyond just prices, you can crawl reviews to see what customers hate about a competitor’s product. This allows you to adjust your marketing to highlight how your product solves those specific complaints.
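The run-rate calculation from the inventory-intelligence idea above is straightforward once you have daily snapshots. This sketch assumes one crawl per day and that the marketplace shows a cumulative "items sold" counter; the type and function names are our own:

```typescript
// One "items sold" reading per daily crawl of a competitor's listing.
interface SoldSnapshot {
  date: string; // ISO date of the crawl
  sold: number; // cumulative "items sold" shown on the listing
}

// Average units sold per day between the first and last snapshot.
function dailyRunRate(snapshots: SoldSnapshot[]): number {
  if (snapshots.length < 2) return 0;
  const first = snapshots[0];
  const last = snapshots[snapshots.length - 1];
  const days = snapshots.length - 1; // assumes one snapshot per day
  return (last.sold - first.sold) / days;
}
```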
Conclusion: From Scraper to Intelligence Agent
The era of the “dumb scraper” is over. Building a marketplace-report-crawler in the Vibe Coding era is about creating an Intelligence Agent that navigates the web with the intuition of a human and the speed of a machine.
By combining browser automation (Playwright), stealth techniques, and semantic AI extraction, you move away from fixing broken code and move toward generating business value. You stop being a developer who maintains scripts and start being an architect who designs data flows.
Whether you are monitoring your own brand’s presence or scouting for the next big product, the ability to automatically crawl and report on marketplace data is the ultimate competitive advantage. Start small, focus on the “vibe” of your interaction loop, and let the AI handle the messy details of the ever-changing web.