ML Data

Reliable Web Data
for AI & ML Pipelines

The public web is already scraped. The data that matters now sits behind logins, paywalls, JS rendering, and anti-bot protection with no API to access it.

98%

success rate

Lower cost per request

10+

CAPTCHA solvers built in

100M+

Residential and mobile IPs

The shift

Why browsers are
the new data pipeline

The licensing well is running dry. The data that powered the last generation of frontier models — Reddit, Stack Overflow, Shutterstock, the AP — is now locked up in exclusive contracts.

01

No accessible public API

Most major platforms (LinkedIn, Reddit, X, Instagram) have either closed their APIs or priced them out of reach, leaving the browser as the only practical way to reach their data.

02

Auth-gated content

Forums, legal databases, financial platforms, gated research portals. The data exists but sits behind a login, and these platforms actively block automated access.

03

JS-rendered content

SPAs, infinite scroll, lazy-loaded data, client-side pagination. The data isn't in the HTML — it loads after JS execution. A simple HTTP request gets an empty shell.
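
A quick way to see the "empty shell" problem in a pipeline is to measure how much visible text a raw HTTP response actually contains. The sketch below is an illustrative heuristic only: the names `visibleText` and `looksLikeEmptyShell` and the 200-character threshold are assumptions, not part of any Surfsky API.

```typescript
// Heuristic sketch: decide whether raw HTML is a JS "shell" with little visible text.
// Function names and the default threshold are illustrative assumptions.
function visibleText(html: string): string {
  return html
    .replace(/<script\b[\s\S]*?<\/script>/gi, " ") // drop script elements
    .replace(/<style\b[\s\S]*?<\/style>/gi, " ")   // drop stylesheets
    .replace(/<[^>]+>/g, " ")                      // drop remaining tags
    .replace(/\s+/g, " ")                          // collapse whitespace
    .trim();
}

function looksLikeEmptyShell(html: string, minChars: number = 200): boolean {
  return visibleText(html).length < minChars;
}

// A typical SPA response before JS runs, and a server-rendered page for contrast.
const shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>';
const rendered = "<html><body><main>" + "<p>Pro plan: $49/mo</p>".repeat(20) + "</main></body></html>";

console.log(looksLikeEmptyShell(shell));    // true: no visible text at all
console.log(looksLikeEmptyShell(rendered)); // false: the content is in the HTML
```

A check like this can route a URL to a full browser render only when the plain response turns out to be a shell.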

04

Geo-restricted content

Results vary by country. Training a model on US-only data gives you a US-only model.

Case study

Agentic deep research on the live web

01

Task

Your agent gets a research question — or runs a daily competitor-pricing sweep.

02

Search

Surfsky runs the search and opens the top results on fresh sessions.

03

Read at scale

The agent reads as many sources as the answer needs — hundreds in parallel, none blocked or rate-limited.

04

Answer

The model reads the full context and answers with a citation for every source.

USER'S AGENT · research.ts

    const web = surfsky()
    const hits = web.search("acme pricing")
    const ctx = web.read(hits) // all 240, no cap
    agent.run(ctx, { cite: true })

Answer from 240 sources

    acme: pro now $49/mo ↑ [1]
    g2: 4.6 vs globex 4.4 [2]
    news: raised +25% q1 [3]
    answered, every claim cited (38s)

SURFSKY CLOUD · live retrieval

    google · 240 hits
    acme.com · us residential
    g2.com · us residential
    techcrunch.com · us mobile
    ! acme: cloudflare - solving
    ✓ acme passed (2.4s)
    fetches as many as it needs · +237 more

TOP RESULTS

    [1] acme.com/pricing · pro plan $49/mo · business $99/mo · changed $39 → $49 · 1,240 words (2.4s)
        "verify you are human" (cloudflare): surfsky is passing it
    [2] g2.com/compare/acme · acme 4.6 / 5 · globex 4.4 / 5 · reviews 2,140 · 2,080 words (1.8s)
    [3] techcrunch.com/acme-raises · headline: acme raises pro · when: mar 2026 · delta: +25% yoy · 760 words (1.2s)
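
The "citation for every source" step in the flow above comes down to numbering sources before the model sees them and checking the markers in the answer afterwards. Here is a minimal sketch of that bookkeeping; the names `Source`, `buildContext`, and `invalidCitations` are hypothetical, and the sample data is taken from the illustration above.

```typescript
// Sketch: number each source, build the model context, and check an answer's citations.
// All names here are illustrative assumptions, not a Surfsky API.
interface Source { url: string; text: string }

// Prefix every source with its marker so the model can cite it as [n].
function buildContext(sources: Source[]): string {
  return sources
    .map((s, i) => `[${i + 1}] ${s.url}\n${s.text}`)
    .join("\n\n");
}

// Return citation numbers used in the answer that do not map to a known source.
function invalidCitations(answer: string, sourceCount: number): number[] {
  const pattern = /\[(\d+)\]/g;
  const bad: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const n = Number(match[1]);
    if ((n < 1 || n > sourceCount) && bad.indexOf(n) === -1) bad.push(n);
  }
  return bad.sort((a, b) => a - b);
}

const sources: Source[] = [
  { url: "acme.com/pricing", text: "pro plan $49/mo, business $99/mo" },
  { url: "g2.com/compare/acme", text: "acme 4.6 / 5, globex 4.4 / 5" },
  { url: "techcrunch.com/acme-raises", text: "acme raises pro, +25% yoy" },
];

console.log(buildContext(sources));
console.log(invalidCitations("Pro is now $49/mo [1]; G2 rates it 4.6 [2].", sources.length)); // []
console.log(invalidCitations("Pro is now $49/mo [7].", sources.length));                      // [ 7 ]
```

Rejecting or re-asking when `invalidCitations` is non-empty keeps every claim traceable to a page that was actually read.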
Features

Full browser or plain HTTP. Same stealth.

Both modes run the same Chrome stack and the same residential network. The difference is how much control you need.

wss://surfsky.io/cdp · DUPLEX · 1 WS

COMMAND             EVENT
Page.navigate       frameStartedLoading
                    Page.loadEventFired
                    turnstile.detected
Runtime.evaluate    challenge.solved

Playwright · Puppeteer · Selenium
multi-step flows · auth · dynamic UI · infinite scroll · anti-bot challenges · persistent profiles · cookies · storage

CDP

For complex collection flows

Login, navigate multi-step UIs, handle pagination, interact with dynamic content. A live browser session your pipeline controls.

  • Playwright · Puppeteer · Selenium compatible
  • Persistent profiles, cookies, local storage
  • One websocket connection, full control
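
To make "one websocket connection, full control" concrete, here is a sketch of how commands and events share a single CDP connection: each command carries an `id`, the reply echoes that `id`, and events arrive with a `method` and no `id`. `CdpRouter` and its `transport` callback are hypothetical names used for illustration, the frames are simulated rather than read from a real socket, and error replies are not handled.

```typescript
// Sketch of CDP message routing over one duplex connection.
// `CdpRouter` and `transport` are illustrative names, not a Surfsky API.
type Json = Record<string, unknown>;

class CdpRouter {
  private nextId = 1;
  private pending = new Map<number, (result: Json) => void>();
  private listeners = new Map<string, Array<(params: Json) => void>>();

  constructor(private transport: (raw: string) => void) {}

  // Send a command and register a callback for its reply.
  send(method: string, params: Json, onResult: (result: Json) => void): number {
    const id = this.nextId++;
    this.pending.set(id, onResult);
    this.transport(JSON.stringify({ id, method, params }));
    return id;
  }

  // Subscribe to an event such as "Page.loadEventFired".
  on(event: string, handler: (params: Json) => void): void {
    const list = this.listeners.get(event) ?? [];
    list.push(handler);
    this.listeners.set(event, list);
  }

  // Feed every incoming websocket frame through here.
  handleMessage(raw: string): void {
    const msg = JSON.parse(raw) as { id?: number; method?: string; result?: Json; params?: Json };
    if (typeof msg.id === "number") {
      const cb = this.pending.get(msg.id);
      this.pending.delete(msg.id);
      if (cb) cb(msg.result ?? {}); // error replies are ignored in this sketch
    } else if (msg.method) {
      for (const h of this.listeners.get(msg.method) ?? []) h(msg.params ?? {});
    }
  }
}

const sent: string[] = [];
const cdp = new CdpRouter((raw) => sent.push(raw));
const log: string[] = [];
cdp.on("Page.loadEventFired", () => log.push("loaded"));
cdp.send("Page.navigate", { url: "https://example.com" }, (r) => log.push(`frame ${String(r.frameId)}`));

// Simulated frames from the browser side:
cdp.handleMessage(JSON.stringify({ id: 1, result: { frameId: "F1" } }));
cdp.handleMessage(JSON.stringify({ method: "Page.loadEventFired", params: { timestamp: 1 } }));
console.log(log);
```

In practice a client library such as Playwright or Puppeteer does this routing for you once it is attached to the endpoint; the sketch only shows what travels over the wire.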
POST /render REQUEST
{
  "url": "example.com/products",
  "waitFor": ".price-loaded",
  "returns": ["html", "png", "data"]
}
RESPONSE · 200 OK · application/json
HTML 128 KB · PNG 1920×1080 · JSON 48 items

AUTO-RETRY · ANTI-BOT FLAGS

try 1 · Cloudflare interstitial · RETRY
try 2 · rotated fingerprint + proxy · OK 200

HTTP API

For bulk collection

One request returns one rendered page along with structured data. Run thousands of pages through the pipeline without managing browser sessions.

  • POST /render with URL + optional waitFor
  • Returns HTML, screenshots, structured data
  • Built-in retries on transient anti-bot flags
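
The page describes retries as built in on the service side. Purely to illustrate the pattern, here is the same idea as a client-side loop around `POST /render`. The request fields (`url`, `waitFor`, `returns`) come from the example request on this page; the response shape, the list of retryable status codes, and the `Sender` type are assumptions, and the sender is faked so the sketch runs without a network.

```typescript
// Sketch of a bounded-retry wrapper around POST /render.
// Response shape, RETRYABLE codes, and `Sender` are assumptions for illustration.
interface RenderRequest { url: string; waitFor?: string; returns: string[] }
interface RenderResponse { status: number; body: unknown }
type Sender = (req: RenderRequest) => Promise<RenderResponse>;

const RETRYABLE = [403, 429, 503]; // assumed signals of a transient anti-bot block

function buildRenderRequest(url: string, waitFor?: string): RenderRequest {
  const req: RenderRequest = { url, returns: ["html", "png", "data"] };
  if (waitFor !== undefined) req.waitFor = waitFor;
  return req;
}

async function renderWithRetry(send: Sender, req: RenderRequest, maxAttempts = 3): Promise<RenderResponse> {
  let last: RenderResponse = { status: 0, body: null };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await send(req);
    if (RETRYABLE.indexOf(last.status) === -1) return last; // success or non-retryable error
  }
  return last; // still blocked after the final attempt
}

// Fake sender: the first attempt hits an interstitial, the second succeeds.
let calls = 0;
const fakeSend: Sender = async () => {
  calls += 1;
  return calls === 1
    ? { status: 403, body: "interstitial" }
    : { status: 200, body: { items: 48 } };
};

const done = renderWithRetry(fakeSend, buildRenderRequest("example.com/products", ".price-loaded"));
done.then((res) => console.log(res.status, "after", calls, "attempts"));
```

A real `Sender` would wrap an HTTP client call to the service's `/render` endpoint; the host and authentication details are not given on this page, so they are left out here.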

Try it on your
hardest target.

Tell us what you're automating. We'll get you set up.