Lightfeed Extractor: TypeScript Library for Robust LLM-Based Web Scraping
Lightfeed Extractor: Production-Ready LLM Web Scraping in TypeScript
Lightfeed has open-sourced Lightfeed Extractor, a TypeScript library that handles the full pipeline from raw HTML to validated, structured data using LLMs.
The Problem It Solves
Traditional web scraping is brittle: you write CSS selectors, the site changes its layout, and everything breaks at 2am. LLMs seemed like the fix, but raw HTML is full of nav bars, footers, and tracking junk that eats token budgets; a typical product page is 80% noise. LLMs also return malformed JSON more often than expected, especially with nested arrays and complex schemas.
Key Features
- HTML to markdown conversion with main content extraction (strips nav, headers, footers)
- Any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama, etc.)
- Zod schemas for type-safe extraction with real validation
- Partial data recovery from malformed LLM output (if 19 of 20 products parsed correctly, you get those 19)
- Built-in browser automation via Playwright (local, serverless, or remote) with anti-bot patches
- URL cleaning for relative URLs, markdown-escaped links, and tracking parameters
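To make the URL cleaning item concrete, here is a minimal sketch of what such a step might look like. It is not the library's implementation: the `cleanUrl` helper, the list of tracking parameters, and the set of markdown escapes it undoes are all assumptions chosen for illustration. It relies only on the standard `URL` class.

```typescript
// Hypothetical helper for illustration; not part of the @lightfeed/extractor API.
// Assumed list of tracking parameters (any "utm_*" key is also dropped).
const TRACKING_PARAMS: string[] = ["fbclid", "gclid", "mc_eid"];

function cleanUrl(raw: string, baseUrl: string): string {
  // Undo markdown escaping such as "\_" or "\(" that can leak into extracted links.
  const unescaped = raw.replace(/\\([_*~()\[\]])/g, "$1");

  // Resolve relative paths against the URL of the page being scraped.
  const url = new URL(unescaped, baseUrl);

  // Collect keys first so nothing is deleted while iterating.
  const keys: string[] = [];
  url.searchParams.forEach((_value, key) => keys.push(key));

  for (const key of keys) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.indexOf(key) !== -1) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}

// Usage: a relative, markdown-escaped link with a tracking parameter.
const cleaned = cleanUrl(
  "/products/blue\\_shirt?utm_source=x&color=blue",
  "https://example.com/shop/"
);
console.log(cleaned);
```

Resolving against the page URL before stripping parameters keeps the logic in one place: once the link is absolute, the query string can be edited through `searchParams` rather than with string manipulation.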
The Pipeline
The pipeline runs in six stages: HTML cleanup, markdown conversion, LLM call, JSON parsing, error recovery, and schema validation. The library handles the entire stack, eliminating the boilerplate that teams otherwise rebuild for every scraping project.
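The JSON parsing, error recovery, and schema validation stages can be illustrated with a small dependency-free sketch. The library itself validates against Zod schemas; here a hand-written type guard stands in for the schema, and the `Product` type, `isProduct`, and `recoverProducts` are hypothetical names, not the library's API. The sketch only covers the case where the output is parseable JSON with some invalid items. It does not attempt to repair syntactically broken JSON.

```typescript
// Hypothetical types and helpers for illustration; not the library's API.
interface Product {
  name: string;
  price: number;
}

// Stand-in for a schema check: accepts only objects with a string name
// and a finite numeric price.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.name === "string" &&
    typeof v.price === "number" &&
    isFinite(v.price)
  );
}

interface RecoveryResult {
  valid: Product[];
  rejected: number;
}

function recoverProducts(llmOutput: string): RecoveryResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(llmOutput);
  } catch (_err) {
    // Unparseable output: nothing to recover in this simplified sketch.
    return { valid: [], rejected: 0 };
  }
  // Validate each item on its own so one bad entry does not discard the rest.
  const items: unknown[] = Array.isArray(parsed) ? parsed : [];
  const valid = items.filter(isProduct);
  return { valid, rejected: items.length - valid.length };
}

// Usage: the second item has a string price and is dropped; the others survive.
const result = recoverProducts(
  '[{"name":"A","price":1},{"name":"B","price":"two"},{"name":"C","price":3}]'
);
console.log(result.valid.length, result.rejected);
```

Validating per item rather than per response is the design choice that makes partial recovery possible: a failure is scoped to the entry that caused it.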
Companion Tools
Lightfeed Extractor pairs with Lightfeed Browser Agent (@lightfeed/browser-agent), which handles AI-driven page navigation before extraction.
Technical Details
- Apache 2.0 licensed
- npm: @lightfeed/extractor
- Used in production at Lightfeed
- GitHub: github.com/lightfeed/extractor
The library addresses a real pain point for teams building data pipelines, combining content extraction, LLM integration, and error resilience in a single package.