A few months ago I was debugging a crawl job that kept returning empty results from a site I knew had thousands of pages of product data. Turns out, the site's robots.txt blocked our user agent from the /products/ path. The crawler was doing exactly what it should: honoring the site owner's wishes and moving on. That moment reminded me how many engineers treat robots.txt as an afterthought. It's not. It's a signal from site owners to automated visitors, and ignoring it is a fast track to blocked IPs, legal headaches, and a bad reputation.

At NeuroAPI, we take robots.txt compliance seriously. This post explains what the protocol actually is, why it matters in 2025, and how our platform handles it under the hood.

What robots.txt Actually Is

The Robots Exclusion Protocol centers on a simple text file placed at the root of a website (e.g., https://example.com/robots.txt). It tells crawlers which paths they can and can't access. It's been around since 1994, when Martijn Koster proposed it to manage the explosion of early web robots.

In 2022, the protocol was published as an IETF standard: RFC 9309. That was a big deal. It formalized parsing rules that had been mostly convention for nearly three decades.

A basic robots.txt looks like this:

    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    Allow: /public/
    Crawl-delay: 10

This tells all crawlers (User-agent: *) to stay out of /admin/ and /private/, permits access to /public/, and requests a 10-second delay between requests. Note that Crawl-delay is a widely used extension rather than part of RFC 9309, so not every crawler honors it. The rules are path-based pattern matching. They are not complicated, but they are full of edge cases: the longest match wins under the RFC, order matters for some parsers, and wildcard support varies.

Why It Matters More Than Ever

For years, robots.txt lived in a quiet corner of web infrastructure. Search engines obeyed it. Most other bots didn't bother checking. That changed when AI companies started crawling the web at massive scale to build training datasets.
Cloudflare launched AI Audit specifically to help site owners see which AI crawlers honor their robots.txt policies and which don't. News publishers have updated their files to explicitly block AI training crawlers. The file went from a polite suggestion to a battleground for data consent.

There's also a legal dimension. In the landmark case eBay v. Bidder's Edge (N.D. Cal. 2000), robots.txt compliance featured as evidence in a trespass-to-chattels claim. While there's no law that says robots.txt must be obeyed, courts have looked at whether a crawler respected it when determining intent and good faith. Ignoring it won't automatically get you sued, but it makes your case weaker if a dispute arises.

How NeuroAPI Handles robots.txt

Every time you hit our crawl or scrape endpoints, we check the target site's robots.txt before fetching any page. Here's the flow:

1. Fetch and parse. We retrieve the robots.txt file and parse it according to RFC 9309 rules. The result is cached, so we don't re-fetch it for every single request to the same domain.
2. Match the path. Before requesting a URL, we check whether the path is allowed for our crawler's user agent. We follow the longest-match rule from the RFC: the most specific matching rule wins.
3. Respect crawl-delay. If the file specifies a Crawl-delay, we honor it by spacing out requests to that domain.
4. Skip disallowed paths. If a path is blocked, we don't fetch it. Period. The URL gets marked as skipped in your job results so you know what happened.

This happens automatically. You don't need to configure anything. If you're running a large crawl with /v1/crawl, the compliance logic runs for every URL in the queue.

What About Sites Without robots.txt?

If a site has no robots.txt file (the request returns a 404), we treat all paths as allowed. That's the standard behavior per RFC 9309: no file means no restrictions. We still apply rate limiting and polite crawling defaults to avoid hammering servers.
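
To make the flow above concrete, here is a minimal Python sketch of the parse, match, and no-file steps. It is not NeuroAPI's actual implementation. The `Group` class and the `parse_robots`, `rules_for`, `is_allowed`, and `load_policy` helpers are hypothetical names invented for illustration. Matching is plain prefix comparison with no `*` or `$` wildcard support, and the HTTP fetch and caching are left out, so the caller passes in a status code and a response body.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Group:
    """Rules for one set of user agents (simplified, hypothetical model)."""
    agents: List[str] = field(default_factory=list)
    rules: List[Tuple[bool, str]] = field(default_factory=list)  # (is_allow, prefix)
    crawl_delay: Optional[float] = None


def parse_robots(text: str) -> List[Group]:
    """Split a robots.txt body into per-agent groups."""
    groups: List[Group] = []
    current: Optional[Group] = None
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            # Consecutive User-agent lines share one group.
            if current is None or not last_was_agent:
                current = Group()
                groups.append(current)
            current.agents.append(value.lower())
            last_was_agent = True
            continue
        last_was_agent = False
        if current is None:
            continue  # rule before any User-agent line: ignore it
        if key in ("allow", "disallow") and value:
            current.rules.append((key == "allow", value))
        elif key == "crawl-delay":
            try:
                current.crawl_delay = float(value)
            except ValueError:
                pass  # ignore malformed delays
    return groups


def rules_for(groups: List[Group], agent: str) -> Group:
    """Merge the groups naming this agent, falling back to the '*' groups."""
    agent = agent.lower()
    matched = [g for g in groups
               if any(a and a != "*" and a in agent for a in g.agents)]
    if not matched:
        matched = [g for g in groups if "*" in g.agents]
    merged = Group()
    for g in matched:
        merged.rules.extend(g.rules)
        if merged.crawl_delay is None:
            merged.crawl_delay = g.crawl_delay
    return merged


def is_allowed(policy: Group, path: str) -> bool:
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    best_len, allowed = -1, True
    for allow, prefix in policy.rules:
        if path.startswith(prefix):
            if len(prefix) > best_len or (len(prefix) == best_len and allow):
                best_len, allowed = len(prefix), allow
    return allowed


def load_policy(status: int, body: str, agent: str) -> Group:
    """A 404 means no file, which means no restrictions."""
    if status == 404:
        return Group()
    return rules_for(parse_robots(body), agent)


ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 10
"""

policy = load_policy(200, ROBOTS, "ExampleBot/1.0")  # hypothetical user agent
print(is_allowed(policy, "/admin/users"))   # False
print(is_allowed(policy, "/public/page"))   # True
print(policy.crawl_delay)                   # 10.0
```

The comparison is hand-rolled because, as far as I know, Python's standard-library `urllib.robotparser` evaluates rules in file order rather than by longest match, so it would not illustrate the RFC rule described here. A real crawler would also sleep for `crawl_delay` seconds between requests to the same domain and record each skipped URL.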

A Practical Example

Let's say you want to crawl a documentation site to build a RAG knowledge base. Here's how you'd do it with NeuroAPI:

    curl -X POST https://api.neuroapi.dev/v1/crawl \
      -H "Authorization: Bearer YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://docs.example.com",
        "limit": 200,
        "maxDepth": 3,
        "includePaths": ["/docs/*"]
      }'

Behind the scenes, NeuroAPI will:

1. Fetch https://docs.example.com/robots.txt
2. Parse the rules for all user agents (and our specific one, if listed)
3. Only crawl paths under /docs/ that aren't disallowed
4. Return clean Markdown for each page, with skipped URLs noted in the job status

You can poll the job with /v1/crawl/status to see exactly which pages were crawled and which were skipped due to robots.txt restrictions.

Common Questions

Can I override robots.txt?

No, and that's by design. We don't offer an "ignore robots.txt" flag. If you have a legitimate need to access a path that's blocked, the right approach is to contact the site owner and ask for access, or use their official API if one exists.

What about the robots meta tag?

The robots.txt file controls crawler access to paths. The <meta name="robots"> tag in HTML controls indexing behavior. They're complementary. We respect robots.txt for access control. The meta tag is more relevant to search engines deciding what to index.

Does NeuroAPI's user agent identify itself?

Yes. Our crawler sends a clear user agent string that identifies requests as coming from NeuroAPI. We believe in transparency: site owners should be able to see who's accessing their content and make informed decisions about it.

The Bigger Picture

Respecting robots.txt isn't just about avoiding legal risk. It's about being a good citizen of the web. The protocol exists so site owners can communicate their preferences to automated visitors without having to block IPs or deploy CAPTCHAs against every crawler. When you use NeuroAPI, you get that peace of mind built in.