You've got a URL. You want to feed it to an LLM. Simple, right?

Not quite. Paste raw HTML into a prompt and you'll burn through tokens on nav bars, cookie banners, and inline SVGs that mean nothing to the model. Paste only the visible text and you lose the structure: headings, lists, and tables. You need something in between: clean, structured Markdown that preserves the content's meaning without the browser-rendering baggage.

That's the problem I run into every time I build a RAG pipeline or an AI agent that needs to read the web. The input is a URL. The output needs to be something an LLM can parse without hallucinating about a stray <div> tag.

This post walks through why HTML-to-Markdown conversion matters, where naive approaches break down, and how to get it done in a single API call.

Why Markdown, Not Plain Text or Raw HTML

Markdown hits a sweet spot for LLM consumption:

- Structure is preserved. Headings, lists, tables, and code blocks survive the conversion. The model knows what's a section title and what's a list item.
- Noise is stripped. Navigation menus, footers, ad slots, and tracking pixels disappear.
- Token efficiency. A typical web page in raw HTML runs 3-10x as many tokens as its Markdown equivalent. That's real money at scale.
- Compatibility. Major LLM frameworks such as LangChain and LlamaIndex handle Markdown as a first-class input format.

Plain text throws away structure. Raw HTML drags in noise. Markdown is the working format.

Where Naive HTML-to-Markdown Conversion Fails

You might think you can just grab the HTML, run it through a library like html-to-markdown or markdownify, and call it a day. I've tried. Here's what goes wrong:

Boilerplate Bleeds Through

Most websites wrap their actual content in a sea of chrome: headers, sidebars, cookie consent modals, "related articles" widgets, newsletter signup forms. A naive converter treats all of it as content. You end up with Markdown that starts with "Skip to main content" and ends with "© 2024 Acme Corp.
All rights reserved."

JavaScript-Rendered Pages Come Back Empty

If the page relies on client-side rendering (a React or Vue single-page app, for example), a simple HTTP GET returns a shell with a <div id="root"> and no actual content. You need a headless browser to execute the JavaScript first.

Tables and Nested Structures Break

HTML tables with merged cells, nested lists with custom styling, definition lists, and figures with captions all lose their semantics in a dumb regex-based conversion. The Markdown ends up garbled or, worse, silently wrong.

Repeated Elements Create Noise

Paginated comment sections, infinite-scroll product grids, and duplicate navigation elements get converted multiple times. Your 500-word article now has 2,000 words of repetitive junk.

The Right Pipeline: Scrape, Clean, Convert

Here's the pipeline I use. It handles all the edge cases above without requiring me to maintain a headless browser fleet or write custom extraction logic for every site layout.

Step one: use a scraping API that renders JavaScript and returns clean content. Step two: feed that into your LLM pipeline.

With NeuroAPI, step one is a single request:

```bash
curl -X POST https://neuroapi.me/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/some-article",
    "format": "markdown"
  }'
```

The response comes back as clean Markdown. No nav bars. No cookie banners. Just the content, structured with proper headings, lists, links, and code blocks.

What the Response Looks Like

For a typical blog post, the API returns something like this (simplified):

````markdown
# Building a RAG Pipeline With Real-Time Data

Retrieval-augmented generation works best when your documents are fresh. Here's how to set up a pipeline that pulls live web content into your vector store.

## Prerequisites

- Python 3.11+
- A vector database (Pinecone, Weaviate, or Chroma)
- An embedding model (OpenAI, Cohere, or open-source)

## Step 1: Crawl Your Sources

Start by mapping the sites you want to index:

```python
import requests

response = requests.post(
    "https://neuroapi.me/v1/map",
    json={"url": "https://docs.example.com"}
)
urls = response.json()["urls"]
```

## Step 2: Scrape and Convert

Batch-scrape all discovered URLs...
````

Compare that to the raw HTML of the same page, which includes 40+ lines of <script> tags, a JSON-LD block, OpenGraph metadata, multiple nav elements, and a footer with 15 links. The Markdown version is roughly 20% of the token count.

Batch Processing for RAG Pipelines

If you're building a knowledge base, you're not scraping one page. You're scraping hundreds or thousands. NeuroAPI's batch endpoint handles this:

```bash
curl -X POST https://neuroapi.me/v1/batch-scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/docs/getting-started",
      "https://example.com/docs/authentication",
      "https://example.com/docs/api-reference"
    ],
    "format": "markdown"
  }'
```

You get back an array of Markdown documents, one per URL, all in a single response. No need to manage concurrency yourself. No rate-limit headaches. The API handles parallelism and retries internally.

From here, you chunk the Markdown, embed it, and push it into your vector store. Because headings and lists survive the conversion, you can split on them and keep each chunk semantically coherent, which directly improves retrieval quality.

Tips for Better Results

- Use the /v1/map endpoint first. Before scraping, map the site to discover all relevant URLs. This avoids scraping the same page twice and helps you estimate costs.
- Filter by content type. Not every page is worth indexing. Use the /v1/extract endpoint with a schema to pull structured metadata (title, author, date, category) and decide programmatically which pages to keep.
- Watch your chunk size.
Clean Markdown pages often run 500-3,000 tokens. Chunk accordingly to match your embedding model's context window and retrieval strategy.
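
The chunking advice can be sketched as a small heading-aware splitter. This is a minimal illustration, not part of NeuroAPI: the names `chunk_markdown` and `estimate_tokens`, the `max_tokens` budget, and the four-characters-per-token heuristic are all assumptions made for the example. Swap in your embedding model's real tokenizer before relying on the numbers.

```python
import re

# Matches ATX headings such as "# Title" or "## Section" at the start of a line.
HEADING = re.compile(r"^#{1,6}\s+\S")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def estimate_tokens(text: str) -> int:
    """Rough heuristic (an assumption): about four characters per token."""
    return max(1, len(text) // 4)


def chunk_markdown(markdown: str, max_tokens: int = 500) -> list[str]:
    """Split Markdown at headings, then cap oversized sections by paragraph."""
    # Pass 1: start a new section at every heading that is outside a code fence.
    sections: list[list[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if not in_fence and HEADING.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    # Pass 2: pack each section's paragraphs into chunks under the budget.
    # Caveat: a fenced block containing blank lines can still be split here.
    chunks: list[str] = []
    for lines in sections:
        text = "\n".join(lines).strip()
        if not text:
            continue
        current = ""
        for para in re.split(r"\n\s*\n", text):
            candidate = f"{current}\n\n{para}" if current else para
            if current and estimate_tokens(candidate) > max_tokens:
                chunks.append(current)
                current = para
            else:
                current = candidate
        chunks.append(current)
    return chunks
```

Calling `chunk_markdown(page_markdown, max_tokens=500)` on a scraped page yields one chunk per section, and long sections are broken at paragraph boundaries. Splitting at headings first keeps a section title attached to the text it introduces, which is the property that makes Markdown chunks retrieve well.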