What robots.txt Actually Is

robots.txt is a plain text file placed at the root of a website (e.g., https://example.com/robots.txt). It implements the Robots Exclusion Protocol and tells crawlers which paths they can and can't access. It's been around since 1994, when Martijn Koster proposed it to manage the explosion of early web robots. In 2022, the protocol became an official IETF standard: RFC 9309. That was a big deal. It formalized parsing rules that had been mostly convention for nearly three decades.

A basic robots.txt looks like this:

    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    Allow: /public/
    Crawl-delay: 10

This tells all crawlers (User-agent: *) to stay out of /admin/ and /private/, permits access to /public/, and requests a 10-second delay between requests. Crawl-delay is a widely used extension rather than part of RFC 9309, so not every crawler honors it. The rules are path-based pattern matching: not complicated, but full of edge cases (the RFC says the longest match wins, some parsers apply rules in file order instead, and wildcard support varies).
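To see how a parser reads the sample file above, here is a minimal sketch using Python's standard-library urllib.robotparser. The user agent name "ExampleBot" is a placeholder chosen for illustration, and this is not NeuroAPI's own parser. Note that the standard-library parser applies rules in file order rather than longest-match, which is one of the edge cases mentioned above; the sample file has no overlapping rules, so both approaches agree here.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from the text above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 10
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical user agent; it falls under the "*" group.
admin_allowed = parser.can_fetch("ExampleBot", "https://example.com/admin/users")
public_allowed = parser.can_fetch("ExampleBot", "https://example.com/public/index.html")
delay = parser.crawl_delay("ExampleBot")

print(admin_allowed, public_allowed, delay)
```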

Why It Matters More Than Ever

For years, robots.txt lived in a quiet corner of web infrastructure. Search engines obeyed it. Most other bots didn't bother checking. That changed when AI companies started crawling the web at massive scale to build training datasets. Cloudflare launched AI Audit specifically to help site owners see which AI crawlers honor their robots.txt policies and which don't. News publishers have updated their files to explicitly block AI training crawlers. The file went from a polite suggestion to a battleground for data consent.

There's also a legal dimension. The landmark case eBay v. Bidder's Edge (2000) used robots.txt compliance as evidence in a trespass-to-chattels claim. While there's no law that says robots.txt must be obeyed, courts have looked at whether a crawler respected it when determining intent and good faith. Ignoring it won't automatically get you sued, but it weakens your position if a dispute arises.

How NeuroAPI Handles robots.txt

Every time you hit our crawl or scrape endpoints, we check the target site's robots.txt before fetching any page. Here's the flow:

1. Fetch and parse. We retrieve the robots.txt file and parse it according to RFC 9309 rules. The result is cached, so we don't re-fetch it for every single request to the same domain.
2. Match the path. Before requesting a URL, we check whether the path is allowed for our crawler's user agent. We follow the longest-match-first rule from the RFC.
3. Respect crawl-delay. If the file specifies a Crawl-delay, we honor it by spacing out requests to that domain.
4. Skip disallowed paths. If a path is blocked, we don't fetch it. Period. The URL gets marked as skipped in your job results so you know what happened.

This happens automatically. You don't need to configure anything. If you're running a large crawl with /v1/crawl, the compliance logic runs for every URL in the queue.
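The path-matching step is where most parser differences show up, so it may help to see the longest-match rule written out. The following is a rough sketch of how RFC 9309 matching could be implemented, not NeuroAPI's actual code. The names Rule, pattern_to_regex, and is_allowed are made up for this example, and the rules are assumed to have already been selected for the crawler's user-agent group. It supports the two special characters the RFC defines: * (any sequence of characters) and a trailing $ (end of path).

```python
import re
from typing import List, Tuple

# Hypothetical representation: ("allow" | "disallow", path pattern).
Rule = Tuple[str, str]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Apply the longest-match rule: the most specific matching pattern wins."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            # On a tie between allow and disallow, allow takes precedence.
            if length > best_len or (length == best_len and kind == "allow"):
                best_len = length
                allowed = kind == "allow"
    return allowed


rules = [("disallow", "/docs/"), ("allow", "/docs/public/"), ("disallow", "/*.pdf$")]
print(is_allowed("/docs/public/a.html", rules))  # longer allow rule wins
print(is_allowed("/docs/private", rules))        # only the disallow rule matches
print(is_allowed("/files/report.pdf", rules))    # wildcard rule with end anchor
```

Because the longest pattern decides the outcome, the order of lines in the file does not matter under this scheme, which is the main difference from parsers that stop at the first matching rule.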

What About Sites Without robots.txt?

If a site has no robots.txt file (returns a 404), we treat all paths as allowed. That's the standard behavior per RFC 9309, which treats a 4xx response for robots.txt as "unavailable": no file means no restrictions. We still apply rate limiting and polite crawling defaults to avoid hammering servers.
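As an illustration of how the response status can be turned into a crawl policy, here is a small sketch built on Python's urllib.robotparser. The function name policy_from_response and the user agent "ExampleBot" are invented for this example. The 4xx branch follows the behavior described above; the 5xx branch follows RFC 9309, which says an unreachable robots.txt should be treated as a complete disallow. Treating every other status the same way is a conservative assumption made here, and none of this is a statement about NeuroAPI's internals.

```python
from urllib.robotparser import RobotFileParser


def policy_from_response(status: int, body: str) -> RobotFileParser:
    """Build a parser from the final HTTP status and body of /robots.txt."""
    parser = RobotFileParser()
    if 200 <= status < 300:
        # File exists: parse its rules.
        parser.parse(body.splitlines())
    elif 400 <= status < 500:
        # "Unavailable" (e.g., 404): no rules, so everything is allowed.
        parser.parse([])
    else:
        # 5xx "unreachable" per the RFC; other statuses are an assumption here.
        parser.parse(["User-agent: *", "Disallow: /"])
    return parser


missing = policy_from_response(404, "")
broken = policy_from_response(503, "")
print(missing.can_fetch("ExampleBot", "https://example.com/anything"))
print(broken.can_fetch("ExampleBot", "https://example.com/anything"))
```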

Can I override robots.txt?

No, and that's by design. We don't offer an "ignore robots.txt" flag. If you have a legitimate need to access a path that's blocked, the right approach is to contact the site owner and ask for access — or use their official API if one exists.

What about the robots meta tag?

The robots.txt file controls crawler access to paths. The <meta name="robots"> tag in HTML controls indexing behavior. They're complementary. We respect robots.txt for access control. The meta tag is more relevant to search engines deciding what to index.
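For readers who want to see what the meta tag carries, here is a minimal sketch that pulls the directives out of a page with Python's standard-library html.parser. The class and function names are invented for this example, and it only shows how an indexer might read the tag; it does not describe anything NeuroAPI does with it.

```python
from html.parser import HTMLParser
from typing import List


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            for item in attr.get("content", "").split(","):
                item = item.strip().lower()
                if item:
                    self.directives.append(item)


def robots_meta_directives(html: str) -> List[str]:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(robots_meta_directives(page))
```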

Does NeuroAPI's user agent identify itself?

Yes. Our crawler sends a clear user agent string that identifies requests as coming from NeuroAPI. We believe in transparency — site owners should be able to see who's accessing their content and make informed decisions about it.

How NeuroAPI Respects robots.txt — Web Scraping Compliance