Lightfeed Extractor: TypeScript Library for Robust LLM-Based Web Scraping

2026-03-26T06:58:57.234Z
Lightfeed open-sourced a TypeScript library for LLM-based web scraping that handles the full pipeline: HTML cleanup, markdown conversion, LLM extraction, JSON parsing, and error recovery with Zod schema validation.

Lightfeed Extractor: Production-Ready LLM Web Scraping in TypeScript

Lightfeed has open-sourced Lightfeed Extractor, a TypeScript library that handles the full pipeline from raw HTML to validated, structured data using LLMs.

The Problem It Solves

Traditional web scraping is brittle: you write CSS selectors, the site changes its layout, and the scraper breaks at 2am. LLMs seemed like the fix, but raw HTML is full of nav bars, footers, and tracking junk that eat into the token budget. A typical product page is 80% noise. LLMs also return malformed JSON more often than expected, especially with nested arrays and complex schemas.
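To make the noise problem concrete, here is a minimal sketch of the kind of pre-cleaning a team might do before sending a page to an LLM. It is not Lightfeed Extractor's implementation or API: the `stripNoise` helper and the `NOISE_TAGS` list are hypothetical names chosen for illustration, and the regex approach is a deliberate simplification (a real cleaner would use an HTML parser and would handle nested tags).

```typescript
// Hypothetical helper for illustration; not part of Lightfeed Extractor.
// Tags whose contents rarely carry the data a scraper wants (an assumption).
const NOISE_TAGS = ["script", "style", "noscript", "nav", "footer", "header", "aside"];

function stripNoise(html: string): string {
  // Remove HTML comments first.
  let out = html.replace(/<!--[\s\S]*?-->/g, "");

  // Remove each noise element together with its contents.
  // Non-greedy matching means nested elements of the same tag are not handled.
  for (const tag of NOISE_TAGS) {
    const block = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}>`, "gi");
    out = out.replace(block, "");
  }

  // Replace the remaining tags with spaces, then collapse whitespace.
  return out.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
}

const samplePage =
  "<html><head><style>p{color:red}</style></head><body>" +
  '<nav><a href="/">Home</a></nav>' +
  "<main><h1>Blue Mug</h1><p>Price: $12</p></main>" +
  "<footer>Shop footer</footer><script>track()</script>" +
  "</body></html>";

const cleaned = stripNoise(samplePage);
console.log(cleaned);
console.log(`kept ${cleaned.length} of ${samplePage.length} characters`);
```

Comparing the cleaned length with the original length gives a rough sense of how much of a page is overhead, although characters are only a proxy for tokens.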

Key Features

The Pipeline

The pipeline runs in six stages: HTML cleanup, markdown conversion, the LLM call, JSON parsing, error recovery, and schema validation. The library handles this entire stack, eliminating the boilerplate that teams otherwise rebuild for every scraping project.
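The last three stages (JSON parsing, error recovery, schema validation) can be sketched in a few lines. This is an illustrative approximation, not the library's code: `recoverJson`, `parseProduct`, and the `Product` shape are hypothetical. The hand-written validator stands in for the Zod schema the library uses, so that the example needs no dependencies.

```typescript
// Hypothetical sketch; names are illustrative, not Lightfeed Extractor's API.
interface Product {
  name: string;
  price: number;
  tags: string[];
}

// Markdown code-fence marker, built at runtime so this example stays readable.
const FENCE = "`".repeat(3);

// Pull a JSON object out of a chatty LLM reply and repair trailing commas.
function recoverJson(raw: string): unknown {
  // Prefer the contents of a fenced block if the model produced one.
  const fenced = raw.match(new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE));
  let text = fenced ? fenced[1] : raw;

  // Keep only the span from the first "{" to the last "}".
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("no JSON object found");
  text = text.slice(start, end + 1);

  // Drop trailing commas before "}" or "]".
  // Caveat: this would also alter such a sequence inside a string value.
  text = text.replace(/,\s*([}\]])/g, "$1");
  return JSON.parse(text);
}

// Minimal validator standing in for a Zod schema (an assumption of this sketch).
function parseProduct(value: unknown): Product {
  if (typeof value !== "object" || value === null) throw new Error("expected an object");
  const { name, price, tags } = value as Record<string, unknown>;
  if (typeof name !== "string") throw new Error("name must be a string");
  if (typeof price !== "number" || Number.isNaN(price)) throw new Error("price must be a number");
  if (!Array.isArray(tags) || !tags.every((t) => typeof t === "string")) {
    throw new Error("tags must be an array of strings");
  }
  return { name, price, tags: tags as string[] };
}

// A reply with surrounding chatter, a code fence, and two trailing commas.
const reply =
  "Sure, here is the data:\n" + FENCE + "json\n" +
  '{"name": "Blue Mug", "price": 12, "tags": ["kitchen", "gift",],}' +
  "\n" + FENCE + "\nAnything else?";

const product = parseProduct(recoverJson(reply));
console.log(product);
```

Keeping recovery and validation as separate steps means a reply that parses but has the wrong shape still fails loudly, which is the point of validating against a schema after parsing.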

Companion Tools

The library pairs with Lightfeed Browser Agent (@lightfeed/browser-agent), which provides AI-driven page navigation before extraction.

Technical Details

The library addresses a real pain point for teams building data pipelines, combining content extraction, LLM integration, and error resilience in a single package.

↗ Original source · 2026-03-26T00:00:00.000Z