diff --git a/docs/phase4-standalone.md b/docs/phase4-standalone.md
index a7bbfdf..1cc01b5 100755
--- a/docs/phase4-standalone.md
+++ b/docs/phase4-standalone.md
@@ -31,105 +31,115 @@ Reference:

 ## HTTP Client Architecture

-### Cookie Jar
+The HTTP client uses a **hybrid** approach: [httpcloak](https://github.com/sardanioss/httpcloak)
+for Chrome TLS fingerprint emulation, with FlareSolverr as a fallback for challenge solving.

-The `httpclient.Client` creates a `cookiejar.Jar` for every instance. All cookies from
-direct HTTP responses and from FlareSolverr are stored in this shared jar. Sources can
-read cookies from the jar using the `Cookie(name, host)` method:
+### TLS Fingerprint Problem

-```go
-// After a request, check the jar for a specific cookie (like Kotlin's
-// client.cookieJar.loadForRequest()):
-value := client.Cookie("mhub_access", "mangahub.io")
-```
+Go's `net/http` has a different TLS fingerprint (JA3/JA4 ciphers, HTTP/2 settings) than Chrome.
+When Go sends a `cf_clearance` cookie obtained from FlareSolverr's Chrome, Cloudflare **rejects**
+it because the TLS fingerprint doesn't match Chrome's. Every direct request gets re-challenged.

-### Response Headers from FlareSolverr
+The fix: use **httpcloak** as the transport, which mimics Chrome's TLS fingerprint perfectly.
+Now when FlareSolverr solves a challenge and returns `cf_clearance`, the cookie is fed into
+httpcloak's session. Subsequent requests via httpcloak (with matching TLS fingerprint + cookie)
+pass Cloudflare without re-challenge.

-When `Do()` falls back to FlareSolverr, the `doFS()` method now properly propagates
-the actual response headers from FlareSolverr (including `Set-Cookie`) instead of
-copying request headers into the response. Cookies from FS are also explicitly added
-as `Set-Cookie` headers and fed into the shared cookie jar.
-
-### Single Unified Client
-
-All sources share one unified HTTP client (`internal/httpclient.Client`) that handles both
-direct requests and Cloudflare/DDoS bypass transparently:
+### Architecture

 ```
 httpclient.DefaultClient()
-  ├── Direct HTTP (net/http + cookie jar + rate limiter)
-  └── FlareSolverr fallback (auto-detected from FLARESOLVERR_URL env var)
+  ├── httpcloak.Session (Chrome TLS fingerprint + cookie jar)
+  │     ├── 200 → return response (fast path, no re-challenge)
+  │     └── 403/503 → FlareSolverr fallback
+  └── FlareSolverr raw mode (auto-detected from FLARESOLVERR_URL env var)
+        └── Cookies fed back into httpcloak session for next request
 ```

-The `Do(req)` method implements **adaptive logic** matching the Kotlin `cloudflareClient`:
+### Flows

-1. **Try direct** — normal HTTP request with shared cookie jar + rate limiting
-2. **If 403/503** (Cloudflare/DDoS challenge) — falls back to **FlareSolverr raw mode**
-3. **FlareSolverr solves the challenge** — returns actual server response + cookies
-4. **Cookies fed into shared jar** — subsequent requests to the same host skip FS
-5. **Chrome HTML wrapper stripped** — FlareSolverr wraps responses in
-   `...body` — the client
-   strips this wrapper so callers receive the actual server body (JSON or HTML).
+- **Non-Cloudflare site**: httpcloak → 200. Fast, ~0.5s.
+- **Cloudflare site, first request**: httpcloak → 403 → FS solves challenge (~12-60s) →
+  `cf_clearance` cookie stored in httpcloak session.
+- **Cloudflare site, subsequent requests**: httpcloak (with Chrome TLS + clearance cookie) →
+  200. Fast, ~1-2s.
+
+### FlareSolverr Integration
+
+FlareSolverr is auto-configured from `FLARESOLVERR_URL` env var (e.g.,
+`http://localhost:8191`). If unset, the client works in direct-only mode.
+
+| Env Var | Default | Description |
+|---|---|---|
+| `FLARESOLVERR_URL` | — | FlareSolverr endpoint |
+| `FLARESOLVERR_SESSION` | `goyomi` | FS browser session ID; reuses one Chrome instance. Set empty for per-request sessions (more parallel, more memory) |
+| `FLARESOLVERR_LOG_LEVEL` | `info` | FS log verbosity |
+
+The `flare` package (`internal/httpclient/flare/`) is a backward-compatible alias for
+`httpclient.Client`. Sources that already import `flare` continue to compile.

 ### Kotlin vs Go Mapping

 | Kotlin (`cloudflareClient`) | Go (`httpclient.Client`) |
 |---|---|
-| `OkHttpClient` with Cloudflare interceptor | `net/http.Client` with FlareSolverr fallback |
-| Intercepts 503 → solves via WebView → retries with cookies | Tries direct first → on 403/503 → FS raw → strips wrapper |
-| Returns raw server response body | Returns stripped FS body or direct body |
-| Shared across all sources | `DefaultClient()` singleton shared across all sources |
-
-### How Sources Use the Client
-
-```go
-import "goyomi/internal/httpclient"
-
-// Default: use shared singleton (preferred for most sources)
-func (s *Source) fetch(ctx context.Context) error {
-	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
-	req.Header.Set("Accept", "application/json")
-	resp, err := httpclient.DefaultClient().Do(req)
-	// ...
-}
-
-// Custom rate limit: create a dedicated client
-func New() *Source {
-	c := httpclient.NewClient(httpclient.WithRateLimit(5, 10))
-	return &Source{client: c}
-}
-
-// Custom headers: set on the request (no need for a custom client)
-func (s *Source) fetch(ctx context.Context) error {
-	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
-	req.Header.Set("Accept", "text/css") // DDOS-Guard bypass
-	resp, err := httpclient.DefaultClient().Do(req)
-}
-```
-
-### FlareSolverr Integration
-
-FlareSolverr is auto-configured from the `FLARESOLVERR_URL` environment variable.
-If unset, the client works in direct-only mode (no Cloudflare bypass).
-
-The `flare` package (`internal/httpclient/flare/`) is a backward-compatible shim —
-all it does is alias `httpclient.Client` and `httpclient.NewClient`. Sources that
-imported `flare` before continue to compile without changes.
+| Android WebView (Chrome) → solves 503 → retries with cookies | httpcloak (Chrome TLS fingerprint) → on 403 → FS fallback → cookies fed to httpcloak |
+| WebView + OkHttp share Android's TLS stack | httpcloak mimics Chrome's JA3 fingerprint |
+| Challenge solved once, cookies reused | Challenge solved once, cookies reused via httpcloak session |
+| Returns raw server response | Returns FS raw body or httpcloak body |
+| Shared across all extensions | `DefaultClient()` singleton shared across all sources |

 ### Design Decisions

-1. **Why a singleton?** — Shared cookie jar means cookies from FlareSolverr (solved
-   challenges) benefit all sources. Shared rate limiter prevents hammering the same host.
-2. **Why strip FS wrapper?** — FlareSolverr routes requests through Chrome, which wraps
-   JSON responses in `<html>...<pre>JSON</pre>...</html>`. Without stripping, JSON parsers
-   fail. For HTML responses, Chrome adds wrapper tags but the page content
-   remains parseable by goquery.
-3. **Why direct first?** — Most sites don't have Cloudflare. Direct requests are faster
-   (no Chrome overhead) and return clean response bodies. FS is only invoked on actual
-   challenge responses (403/503).
-4. **Why not Chrome rendering?** — The Kotlin `cloudflareClient` does NOT render through
-   Chrome. It intercepts 503, solves the challenge, and returns the raw server response.
-   Our approach matches this behavior.
+1. **Why httpcloak?** — Go's net/http TLS fingerprint doesn't match Chrome's, so
+   Cloudflare clearance cookies from FS are rejected on direct requests. httpcloak
+   emulates Chrome's JA3/JA4 fingerprint, making subsequent requests pass Cloudflare.
+2. **Why not go-cfscraper?** — The pure-Go JS challenge solver (goja) and external
+   runtimes (Node.js) both lack browser DOM APIs (`location`, `document`, `window`)
+   that Cloudflare challenge scripts require. The DOM shims are fragile and break
+   when Cloudflare updates. FlareSolverr with real Chrome is the only reliable solver.
+3. **Why FlareSolverr at all?** — The first request to a Cloudflare site always gets
+   challenged (no `cf_clearance` cookie yet). FlareSolverr's real Chrome solves it.
+4. **Why a singleton?** — Shared httpcloak session = cookies from FS (solved challenges)
+   benefit all sources. Shared rate limiter prevents hammering the same host.
+5. **Why strip FS wrapper?** — FS routes requests through Chrome, which wraps JSON
+   responses in `<html>…<pre>…</pre>…</html>`. Without stripping, JSON parsers fail.
+6. **Why FS session matters** — With `FLARESOLVERR_SESSION`, FS reuses one Chrome
+   instance. After the first challenge, `cf_clearance` is cached. Subsequent requests
+   to the same domain are near-instant. Without a session, each request spawns a
+   fresh Chrome, solving the challenge from scratch every time (~13s each).
+
+### Known goquery Limitation — `:has()` + Attribute Selectors
+
+goquery (via the cascadia CSS engine) does **not** support `:has()` combined with
+attribute selectors like `a[href*=/video/]`. This works in Jsoup (Kotlin) but silently
+returns 0 matches in Go.
+
+**Wrong** (CSS-only, works in Kotlin, returns 0 in Go):
+```go
+doc.Find("figure:not(:has(a[href*=/video/]))")
+```
+
+**Correct** (programmatic filtering):
+```go
+doc.Find("figure").Each(func(_ int, el *goquery.Selection) {
+	if hasAttr(el, "a", "href", "/video/") { return }
+	// process entry
+})
+
+func hasAttr(el *goquery.Selection, tag, attr, substr string) bool {
+	found := false
+	el.Find(tag).EachWithBreak(func(_ int, a *goquery.Selection) bool {
+		if v, ok := a.Attr(attr); ok && strings.Contains(v, substr) {
+			found = true; return false
+		}
+		return true
+	})
+	return found
+}
+```
+
+Always check Kotlin references for `:has()` selectors and convert them to
+programmatic filtering when porting.

 Detailed implementation notes for complex sources are in the **Notes** section at the bottom.