How the crawler works

How Thorgate fetches vendor documents, what it does when a fetch fails, and the rate limits in place.

Thorgate crawls each tracked document on a defined schedule and processes the result through several stages. This article describes the mechanics so you can interpret what happens (and what doesn't).

Crawl frequency

By default, every tracked document is crawled once per day. Crawls are spread across the day, not concentrated at midnight, to avoid hammering vendor servers.
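
As an illustration, a once-a-day schedule can be spread across the day by deterministically mapping each document to a time slot. The slot-hashing approach below is an assumption made for the sketch, not necessarily Thorgate's actual scheduler:

```python
import hashlib

SECONDS_PER_DAY = 24 * 60 * 60

def daily_crawl_offset(document_id: str) -> int:
    """Map a document to a fixed second-of-day slot so crawls are
    spread evenly across the day instead of piling up at midnight."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % SECONDS_PER_DAY
```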

A manual "Crawl now" is available on every document, with a 60-minute cooldown per document. After a manual crawl, the next manual crawl on that document is locked for 60 minutes. The daily automatic crawl is unaffected.

Fetch strategy

Each crawl tries a direct HTTP fetch first, with a Thorgate-identifying user agent. The user agent is identifiable so vendors who want to exclude Thorgate can do so via robots.txt; most don't.

If the direct fetch fails — typically because of a Cloudflare challenge, a 403 response, or a JavaScript-rendered single-page application — Thorgate falls back to Jina Reader, a service that uses a headless browser to render the page and extract the text. This handles most non-trivial bot mitigations.
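
A sketch of that two-step fetch. The fallback uses Jina Reader's public prefix endpoint (https://r.jina.ai/ followed by the target URL); the user-agent string and error handling are illustrative rather than Thorgate's exact values:

```python
import requests

USER_AGENT = "ThorgateBot/1.0"  # illustrative; the real UA identifies Thorgate similarly

def fetch_document(url: str, timeout: int = 30) -> str:
    """Try a direct fetch first, then fall back to Jina Reader when the
    direct request is blocked or errors out."""
    try:
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
        if resp.ok:
            return resp.text
    except requests.RequestException:
        pass  # fall through to the rendered fallback

    # Jina Reader renders the page in a headless browser and returns extracted text.
    rendered = requests.get(f"https://r.jina.ai/{url}", timeout=timeout)
    rendered.raise_for_status()
    return rendered.text
```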

For PDF documents, Thorgate uses Jina Reader to extract text from the PDF. The largest PDF we've successfully tracked is roughly 100 pages.

Robots.txt

If a vendor's robots.txt explicitly disallows crawling, Thorgate respects it and does not fetch the document. The document on your vendor page will show as "blocked by robots.txt" rather than appearing as an error. Most vendors do not block the privacy-policy URL specifically, so this is uncommon in practice.
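
The check itself can be expressed with Python's standard library; this is a simplified sketch with an illustrative user-agent token:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "ThorgateBot") -> bool:
    """Return False when the site's robots.txt disallows this URL for our
    user agent, so the document can be marked "blocked by robots.txt"
    instead of being fetched."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)
```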

Rate limiting

Thorgate rate-limits crawls per domain to avoid overwhelming any single vendor:

  • At most 1 fetch per minute per vendor domain.
  • At most 30 fetches per hour per vendor domain.

For a single account this is invisible; the limit is shared across all customers tracking the same domain (since the catalog crawler does the work once and fans out the result).
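
One way to express these limits is a sliding-window limiter keyed by domain. The mechanism below is an assumption for illustration; only the two numbers come from the list above:

```python
import time
from collections import defaultdict, deque

class DomainRateLimiter:
    """Allow at most 1 fetch per minute and 30 fetches per hour per domain."""

    def __init__(self, per_minute: int = 1, per_hour: int = 30):
        self.per_minute = per_minute
        self.per_hour = per_hour
        self.history: dict[str, deque] = defaultdict(deque)  # fetch timestamps per domain

    def try_acquire(self, domain: str) -> bool:
        now = time.monotonic()
        window = self.history[domain]
        # Discard timestamps older than an hour, then check both windows.
        while window and now - window[0] > 3600:
            window.popleft()
        within_minute = sum(1 for t in window if now - t <= 60)
        if within_minute >= self.per_minute or len(window) >= self.per_hour:
            return False
        window.append(now)
        return True
```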

Content normalisation

After fetching, the response is normalised before being compared to the previous version:

  • HTML is converted to text using a markdown-aware extraction.
  • Whitespace is collapsed (trailing whitespace, runs of blank lines, etc.).
  • Boilerplate (cookie banners, navigation, footers) is removed where Thorgate can identify it.
  • The result is hashed (SHA-256) for deduplication.

If the new hash matches the previous hash, no change event is created — even if the underlying HTML changed (because the difference was in noise, not content).
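
A condensed sketch of the normalise-then-hash step. The real extraction is markdown-aware and strips boilerplate; here it is reduced to the whitespace handling so the hashing logic stays visible:

```python
import hashlib
import re

def normalise(text: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines so
    cosmetic markup changes don't register as content changes."""
    lines = [line.rstrip() for line in text.splitlines()]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()

def content_hash(text: str) -> str:
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()

def has_changed(new_text: str, previous_hash: str) -> bool:
    """Only a differing hash produces a change event."""
    return content_hash(new_text) != previous_hash
```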

When a crawl fails

Fetch failures happen. Common causes:

  • Vendor outage. Thorgate retries with exponential backoff (see the sketch after this list).
  • Vendor moved the URL. Thorgate cannot follow redirects beyond five hops; if the canonical URL changed, you'll need to update the vendor record.
  • Permanent 404. Thorgate marks the document as unreachable and notifies you in the next digest.
  • Robots.txt change. Handled as described above: the document shows as blocked by robots.txt rather than as an error.
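
A sketch of the retry behaviour for transient failures such as a vendor outage. The delays and attempt count are illustrative; Thorgate's actual values aren't documented here:

```python
import time
import requests

def fetch_with_backoff(url: str, attempts: int = 5, base_delay: float = 60.0) -> str:
    """Retry transient failures with exponentially growing delays
    (60s, 120s, 240s, ...) up to the attempt limit."""
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # surface the error so the crawl is recorded as failed
            time.sleep(base_delay * (2 ** attempt))
```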

The vendor detail page shows the most recent crawl status for each document. A persistent error stays visible until the underlying issue is resolved or the document is removed from tracking.

What happens when a change is detected

When the new content hash differs from the previous one:

  1. A new document version is created and stored in the archive (content-addressed by hash).
  2. A diff is computed between the new and previous versions.
  3. The diff is sent to Anthropic's Claude for severity classification and summary generation.
  4. A change event is created and visible in your Changes feed.
  5. If digests are enabled, the event is queued for the next digest.
  6. For "content_changed" events on a vendor, Thorgate also re-runs fact extraction to refresh the structured properties (jurisdictions, subprocessors, retention periods) shown on the vendor page.

The whole pipeline typically completes within a few minutes of the initial crawl.
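
The sketch below shows the skeleton of that pipeline: compare hashes, diff the versions, then hand off. The ChangeEvent shape is hypothetical, and the archive, Claude classification, digest queuing, and fact-extraction steps are summarised in a comment rather than implemented:

```python
import difflib
import hashlib
from dataclasses import dataclass

@dataclass
class ChangeEvent:  # hypothetical shape, for illustration only
    document_id: str
    version_hash: str
    diff: str

def process_crawl(document_id: str, new_text: str,
                  previous_text: str, previous_hash: str) -> ChangeEvent | None:
    """Assumes new_text and previous_text are already normalised."""
    new_hash = hashlib.sha256(new_text.encode("utf-8")).hexdigest()
    if new_hash == previous_hash:
        return None  # identical normalised content: no change event

    # Step 2: diff the previous and new versions.
    diff = "\n".join(difflib.unified_diff(
        previous_text.splitlines(), new_text.splitlines(),
        fromfile="previous", tofile="new", lineterm=""))

    # Steps 1 and 3 to 6 (archive the version, classify with Claude, create the
    # event, queue the digest, re-run fact extraction) happen downstream.
    return ChangeEvent(document_id, new_hash, diff)
```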
