Lightfeed Extractor: TypeScript Library for Robust LLM-Based Web Scraping


2026-03-26T06:58:57.234Z

Lightfeed open-sourced a TypeScript library for LLM-based web scraping that handles the full pipeline: HTML cleanup, markdown conversion, LLM extraction, JSON parsing, and error recovery with Zod schema validation.

Lightfeed Extractor: Production-Ready LLM Web Scraping in TypeScript

Lightfeed has open-sourced Lightfeed Extractor, a TypeScript library that handles the full pipeline from raw HTML to validated, structured data using LLMs.

The Problem It Solves

Traditional web scraping is brittle: you write CSS selectors, the site changes its layout, and everything breaks at 2am. LLMs seemed like the fix, but raw HTML is full of nav bars, footers, and tracking junk that eats into token budgets. A typical product page is 80% noise. LLMs also return malformed JSON more often than expected, especially with nested arrays and complex schemas.
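To make the noise problem concrete, here is a minimal sketch of stripping boilerplate elements from a page before it reaches an LLM. It is not the library's cleanup code: the stripNoise helper, the tag list, and the sample page are all assumptions made for illustration, and the regex approach is deliberately naive.

```typescript
// Illustrative only: a naive regex pass, NOT Lightfeed Extractor's implementation.
// Regexes cannot handle nested same-name tags; a real pipeline would use an HTML parser.
const NOISE_TAGS = ["script", "style", "nav", "header", "footer"] as const;

// Hypothetical helper: drop whole noise elements, then collapse whitespace.
function stripNoise(html: string): string {
  let out = html;
  for (const tag of NOISE_TAGS) {
    // Non-greedy match from an opening tag to its nearest closing tag.
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi");
    out = out.replace(pattern, "");
  }
  return out.replace(/\s+/g, " ").trim();
}

// Made-up sample page: most of the markup is not product content.
const page = `
<header><a href="/">Shop</a></header>
<nav><a href="/a">A</a><a href="/b">B</a></nav>
<main><h1>Blue Kettle</h1><p>Price: $39</p></main>
<script>track("view");</script>
<footer>Copyright</footer>`;

const cleaned = stripNoise(page);
console.log(`kept ${cleaned.length} of ${page.length} characters`);
```

Only the main element survives in this toy page, which is the kind of reduction that keeps token usage down before any LLM call is made.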

Key Features

HTML to markdown conversion with main content extraction (strips nav, headers, footers)
Any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama, etc.)
Zod schemas for type-safe extraction with real validation
Partial data recovery from malformed LLM output (if 19 of 20 products parsed correctly, you get those 19)
Built-in browser automation via Playwright (local, serverless, or remote) with anti-bot patches
URL cleaning for relative URLs, markdown-escaped links, and tracking parameters
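The URL cleaning feature can be pictured with a short sketch built on the standard URL class. This is an assumption about what such a step might look like, not the library's API: the cleanUrl helper, the set of tracking parameters, and the example URLs are hypothetical.

```typescript
// Hypothetical helper, NOT part of @lightfeed/extractor.
// Assumed tracking parameters: anything starting with "utm_" plus a few well-known click IDs.
const TRACKING_PARAMS = ["fbclid", "gclid"];

function cleanUrl(raw: string, baseUrl: string): string {
  // Undo markdown escaping such as "\_" or "\(" that LLMs may copy from markdown input.
  const unescaped = raw.replace(/\\([\\_*()\[\]~#&-])/g, "$1");

  // Resolve relative URLs against the URL of the page they came from.
  const url = new URL(unescaped, baseUrl);

  // Collect keys first, then delete, so the params are not mutated while iterating.
  const toDelete = Array.from(url.searchParams.keys()).filter(
    (key) => key.startsWith("utm_") || TRACKING_PARAMS.includes(key),
  );
  for (const key of toDelete) {
    url.searchParams.delete(key);
  }
  return url.toString();
}

// Relative, markdown-escaped link with one tracking parameter.
const cleanedLink = cleanUrl(
  "/products/blue\\_kettle?utm_source=news&color=blue",
  "https://example.com/shop/",
);
console.log(cleanedLink);
```

Resolving against the page URL first means the tracking filter only ever sees absolute URLs, so one code path covers both relative and absolute links.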

The Pipeline

The pipeline runs in six stages: HTML cleanup, markdown conversion, LLM call, JSON parsing, error recovery, and schema validation. The library handles this entire stack, eliminating the boilerplate that teams rebuild for every scraping project.
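The last stages (JSON parsing, error recovery, schema validation) can be sketched as item-by-item validation, so that one bad record does not discard the whole batch. The library validates with Zod; to keep this sketch dependency-free, a hand-written type guard stands in for the schema. The Product shape, the recoverProducts helper, and the sample output are assumptions, and the sketch only recovers from items that fail validation, not from structurally broken JSON.

```typescript
// Assumed target shape for illustration; a real project would define this as a Zod schema.
interface Product {
  name: string;
  price: number;
}

// Hand-written type guard standing in for Zod validation (an assumption of this sketch).
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record.name === "string" && typeof record.price === "number";
}

interface RecoveryResult {
  valid: Product[];
  rejected: unknown[];
}

// Hypothetical helper: parse raw LLM output and keep every item that validates.
function recoverProducts(llmOutput: string): RecoveryResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(llmOutput);
  } catch {
    // Unparseable text: this simplified sketch gives up instead of repairing the JSON.
    return { valid: [], rejected: [llmOutput] };
  }
  const items: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
  const result: RecoveryResult = { valid: [], rejected: [] };
  for (const item of items) {
    if (isProduct(item)) result.valid.push(item);
    else result.rejected.push(item);
  }
  return result;
}

// Made-up LLM output: the second item has a non-numeric price.
const llmOutput =
  '[{"name":"Kettle","price":39},{"name":"Mug","price":"n/a"},{"name":"Teapot","price":54}]';
const { valid, rejected } = recoverProducts(llmOutput);
console.log(`recovered ${valid.length} items, rejected ${rejected.length}`);
```

Returning the rejected items alongside the valid ones lets the caller log or retry them instead of silently losing data.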

Companion Tools

The library pairs with Lightfeed Browser Agent (@lightfeed/browser-agent) for AI-driven page navigation before extraction.

Technical Details

Apache 2.0 licensed
npm: @lightfeed/extractor
Used in production at Lightfeed
GitHub: github.com/lightfeed/extractor

The library addresses a real pain point for teams building data pipelines, combining content extraction, LLM integration, and error resilience in a single package.

↗ Original source · 2026-03-26T00:00:00.000Z
