docs: update HTTP Client Architecture for httpcloak, document gotchas
- Replace old net/http + FS diagram with httpcloak hybrid approach
- Document TLS fingerprint problem and why httpcloak fixes it
- Add FLARESOLVERR_SESSION env var docs
- Add goquery :has() + attribute selector limitation workaround
- Update design decisions to reflect current architecture
+91
-81
@@ -31,105 +31,115 @@ Reference:
 ## HTTP Client Architecture

-### Cookie Jar
+The HTTP client uses a **hybrid** approach: [httpcloak](https://github.com/sardanioss/httpcloak)
+for Chrome TLS fingerprint emulation, with FlareSolverr as a fallback for challenge solving.

-The `httpclient.Client` creates a `cookiejar.Jar` for every instance. All cookies from
-direct HTTP responses and from FlareSolverr are stored in this shared jar. Sources can
-read cookies from the jar using the `Cookie(name, host)` method:
+### TLS Fingerprint Problem

-```go
-// After a request, check the jar for a specific cookie (like Kotlin's
-// client.cookieJar.loadForRequest()):
-value := client.Cookie("mhub_access", "mangahub.io")
-```
+Go's `net/http` has a different TLS fingerprint (JA3/JA4 ciphers, HTTP/2 settings) than Chrome.
+When Go sends a `cf_clearance` cookie obtained from FlareSolverr's Chrome, Cloudflare **rejects**
+it because the TLS fingerprint doesn't match Chrome's. Every direct request gets re-challenged.

-### Response Headers from FlareSolverr
+The fix: use **httpcloak** as the transport, which mimics Chrome's TLS fingerprint perfectly.
+Now when FlareSolverr solves a challenge and returns `cf_clearance`, the cookie is fed into
+httpcloak's session. Subsequent requests via httpcloak (with matching TLS fingerprint + cookie)
+pass Cloudflare without re-challenge.

-When `Do()` falls back to FlareSolverr, the `doFS()` method now properly propagates
-the actual response headers from FlareSolverr (including `Set-Cookie`) instead of
-copying request headers into the response. Cookies from FS are also explicitly added
-as `Set-Cookie` headers and fed into the shared cookie jar.
-
-### Single Unified Client
-
-All sources share one unified HTTP client (`internal/httpclient.Client`) that handles both
-direct requests and Cloudflare/DDoS bypass transparently:
+### Architecture

 ```
 httpclient.DefaultClient()
-  ├── Direct HTTP (net/http + cookie jar + rate limiter)
-  └── FlareSolverr fallback (auto-detected from FLARESOLVERR_URL env var)
+  ├── httpcloak.Session (Chrome TLS fingerprint + cookie jar)
+  │   ├── 200 → return response (fast path, no re-challenge)
+  │   └── 403/503 → FlareSolverr fallback
+  └── FlareSolverr raw mode (auto-detected from FLARESOLVERR_URL env var)
+      └── Cookies fed back into httpcloak session for next request
 ```

-The `Do(req)` method implements **adaptive logic** matching the Kotlin `cloudflareClient`:
+### Flows

-1. **Try direct** — normal HTTP request with shared cookie jar + rate limiting
-2. **If 403/503** (Cloudflare/DDoS challenge) — falls back to **FlareSolverr raw mode**
-3. **FlareSolverr solves the challenge** — returns actual server response + cookies
-4. **Cookies fed into shared jar** — subsequent requests to the same host skip FS
-5. **Chrome HTML wrapper stripped** — FlareSolverr wraps responses in
-   `<html><head>...<meta...></head><body><pre>body</pre></body></html>` — the client
-   strips this wrapper so callers receive the actual server body (JSON or HTML).
+- **Non-Cloudflare site**: httpcloak → 200. Fast, ~0.5s.
+- **Cloudflare site, first request**: httpcloak → 403 → FS solves challenge (~12-60s) →
+  `cf_clearance` cookie stored in httpcloak session.
+- **Cloudflare site, subsequent requests**: httpcloak (with Chrome TLS + clearance cookie) →
+  200. Fast, ~1-2s.

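Reviewer annotation on the three flows listed in this hunk: the following is a minimal Go sketch of the try-direct-then-fall-back logic, not the project's implementation. The `Fetcher` and `Solver` interfaces, the `fakeSite` and `fakeSolver` types, and the flat cookie map are hypothetical stand-ins chosen for illustration. They are not the httpcloak or FlareSolverr APIs, and a real client would scope cookies per host.

```go
package main

import "fmt"

// Response is a simplified stand-in for an HTTP response.
type Response struct {
	Status  int
	Cookies map[string]string
}

// Fetcher models the fingerprinted direct transport (hypothetical interface).
type Fetcher interface {
	Fetch(url string, cookies map[string]string) (Response, error)
}

// Solver models the challenge solver (hypothetical interface).
type Solver interface {
	Solve(url string) (Response, error)
}

// Client tries the direct path first and calls the solver only on a challenge.
type Client struct {
	direct  Fetcher
	solver  Solver            // nil means direct-only mode
	cookies map[string]string // simplification: one flat jar, not per host
}

func NewClient(d Fetcher, s Solver) *Client {
	return &Client{direct: d, solver: s, cookies: map[string]string{}}
}

// isChallenge treats 403 and 503 as challenge responses, as in the flows above.
func isChallenge(status int) bool { return status == 403 || status == 503 }

func (c *Client) Get(url string) (Response, error) {
	resp, err := c.direct.Fetch(url, c.cookies)
	if err != nil {
		return Response{}, err
	}
	if !isChallenge(resp.Status) || c.solver == nil {
		return resp, nil // fast path, or no solver configured
	}
	solved, err := c.solver.Solve(url)
	if err != nil {
		return Response{}, fmt.Errorf("solver fallback: %w", err)
	}
	for name, value := range solved.Cookies {
		c.cookies[name] = value // reuse the clearance on later direct requests
	}
	return solved, nil
}

// fakeSite answers 200 only when the clearance cookie is present.
type fakeSite struct{}

func (fakeSite) Fetch(_ string, cookies map[string]string) (Response, error) {
	if cookies["cf_clearance"] == "token" {
		return Response{Status: 200}, nil
	}
	return Response{Status: 403}, nil
}

// fakeSolver counts how often the slow path runs.
type fakeSolver struct{ calls int }

func (s *fakeSolver) Solve(string) (Response, error) {
	s.calls++
	return Response{Status: 200, Cookies: map[string]string{"cf_clearance": "token"}}, nil
}

func main() {
	solver := &fakeSolver{}
	c := NewClient(fakeSite{}, solver)
	for i := 0; i < 3; i++ {
		resp, err := c.Get("https://example.test/page")
		fmt.Println(resp.Status, err, "solver calls:", solver.calls)
	}
}
```

In this sketch the solver runs once for the three requests: the first request is challenged and solved, and the next two succeed on the direct path because the clearance cookie is already in the jar.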
+### FlareSolverr Integration
+
+FlareSolverr is auto-configured from `FLARESOLVERR_URL` env var (e.g.,
+`http://localhost:8191`). If unset, the client works in direct-only mode.
+
+| Env Var | Default | Description |
+|---|---|---|
+| `FLARESOLVERR_URL` | — | FlareSolverr endpoint |
+| `FLARESOLVERR_SESSION` | `goyomi` | FS browser session ID; reuses one Chrome instance. Set empty for per-request sessions (more parallel, more memory) |
+| `FLARESOLVERR_LOG_LEVEL` | `info` | FS log verbosity |
+
+The `flare` package (`internal/httpclient/flare/`) is a backward-compatible alias for
+`httpclient.Client`. Sources that already import `flare` continue to compile.
+
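Reviewer annotation on the environment variable table added in this hunk: one way the defaults could be resolved, sketched in Go. The `FSConfig` type and `loadFSConfig` function are hypothetical names, and using `os.LookupEnv` to tell "unset" (use the default `goyomi`) apart from "set to empty" (per-request sessions) is an assumption about how the empty value is detected. The diff does not show the real parsing code.

```go
package main

import (
	"fmt"
	"os"
)

// FSConfig holds the three settings from the table (hypothetical type).
type FSConfig struct {
	URL        string // empty means direct-only mode
	Session    string // shared browser session ID
	PerRequest bool   // true when the session variable is set but empty
	LogLevel   string
}

// loadFSConfig resolves the settings through a lookup function with the
// same shape as os.LookupEnv, which keeps the logic easy to exercise.
func loadFSConfig(lookup func(string) (string, bool)) FSConfig {
	cfg := FSConfig{Session: "goyomi", LogLevel: "info"} // documented defaults
	if v, ok := lookup("FLARESOLVERR_URL"); ok {
		cfg.URL = v
	}
	if v, ok := lookup("FLARESOLVERR_SESSION"); ok {
		cfg.Session = v
		cfg.PerRequest = v == "" // assumption: explicitly empty disables reuse
	}
	if v, ok := lookup("FLARESOLVERR_LOG_LEVEL"); ok && v != "" {
		cfg.LogLevel = v
	}
	return cfg
}

// Enabled reports whether a FlareSolverr endpoint is configured.
func (c FSConfig) Enabled() bool { return c.URL != "" }

func main() {
	cfg := loadFSConfig(os.LookupEnv)
	fmt.Printf("%+v enabled=%v\n", cfg, cfg.Enabled())
}
```

Passing the lookup function in, instead of calling `os.Getenv` directly, is only a convenience for testing; `os.Getenv` alone could not distinguish an unset variable from an empty one.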
 ### Kotlin vs Go Mapping

 | Kotlin (`cloudflareClient`) | Go (`httpclient.Client`) |
 |---|---|
-| `OkHttpClient` with Cloudflare interceptor | `net/http.Client` with FlareSolverr fallback |
-| Intercepts 503 → solves via WebView → retries with cookies | Tries direct first → on 403/503 → FS raw → strips wrapper |
-| Returns raw server response body | Returns stripped FS body or direct body |
-| Shared across all sources | `DefaultClient()` singleton shared across all sources |
-
-### How Sources Use the Client
-
-```go
-import "goyomi/internal/httpclient"
-
-// Default: use shared singleton (preferred for most sources)
-func (s *Source) fetch(ctx context.Context) error {
-    req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
-    req.Header.Set("Accept", "application/json")
-    resp, err := httpclient.DefaultClient().Do(req)
-    // ...
-}
-
-// Custom rate limit: create a dedicated client
-func New() *Source {
-    c := httpclient.NewClient(httpclient.WithRateLimit(5, 10))
-    return &Source{client: c}
-}
-
-// Custom headers: set on the request (no need for a custom client)
-func (s *Source) fetch(ctx context.Context) error {
-    req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
-    req.Header.Set("Accept", "text/css") // DDOS-Guard bypass
-    resp, err := httpclient.DefaultClient().Do(req)
-}
-```
-
-### FlareSolverr Integration
-
-FlareSolverr is auto-configured from the `FLARESOLVERR_URL` environment variable.
-If unset, the client works in direct-only mode (no Cloudflare bypass).
-
-The `flare` package (`internal/httpclient/flare/`) is a backward-compatible shim —
-all it does is alias `httpclient.Client` and `httpclient.NewClient`. Sources that
-imported `flare` before continue to compile without changes.
+| Android WebView (Chrome) → solves 503 → retries with cookies | httpcloak (Chrome TLS fingerprint) → on 403 → FS fallback → cookies fed to httpcloak |
+| WebView + OkHttp share Android's TLS stack | httpcloak mimics Chrome's JA3 fingerprint |
+| Challenge solved once, cookies reused | Challenge solved once, cookies reused via httpcloak session |
+| Returns raw server response | Returns FS raw body or httpcloak body |
+| Shared across all extensions | `DefaultClient()` singleton shared across all sources |

 ### Design Decisions

-1. **Why a singleton?** — Shared cookie jar means cookies from FlareSolverr (solved
-   challenges) benefit all sources. Shared rate limiter prevents hammering the same host.
-2. **Why strip FS wrapper?** — FlareSolverr routes requests through Chrome, which wraps
-   JSON responses in `<html><head>...<pre>JSON</pre>...`. Without stripping, JSON parsers
-   fail. For HTML responses, Chrome adds `<meta charset>` tags but the page content
-   remains parseable by goquery.
-3. **Why direct first?** — Most sites don't have Cloudflare. Direct requests are faster
-   (no Chrome overhead) and return clean response bodies. FS is only invoked on actual
-   challenge responses (403/503).
-4. **Why not Chrome rendering?** — The Kotlin `cloudflareClient` does NOT render through
-   Chrome. It intercepts 503, solves the challenge, and returns the raw server response.
-   Our approach matches this behavior.
+1. **Why httpcloak?** — Go's net/http TLS fingerprint doesn't match Chrome's, so
+   Cloudflare clearance cookies from FS are rejected on direct requests. httpcloak
+   emulates Chrome's JA3/JA4 fingerprint, making subsequent requests pass Cloudflare.
+2. **Why not go-cfscraper?** — The pure-Go JS challenge solver (goja) and external
+   runtimes (Node.js) both lack browser DOM APIs (`location`, `document`, `window`)
+   that Cloudflare challenge scripts require. The DOM shims are fragile and break
+   when Cloudflare updates. FlareSolverr with real Chrome is the only reliable solver.
+3. **Why FlareSolverr at all?** — The first request to a Cloudflare site always gets
+   challenged (no `cf_clearance` cookie yet). FlareSolverr's real Chrome solves it.
+4. **Why a singleton?** — Shared httpcloak session = cookies from FS (solved challenges)
+   benefit all sources. Shared rate limiter prevents hammering the same host.
+5. **Why strip FS wrapper?** — FS routes requests through Chrome, which wraps JSON
+   responses in `<html>…<pre>…</pre>…</html>`. Without stripping, JSON parsers fail.
+6. **Why FS session matters** — With `FLARESOLVERR_SESSION`, FS reuses one Chrome
+   instance. After the first challenge, `cf_clearance` is cached. Subsequent requests
+   to the same domain are near-instant. Without a session, each request spawns a
+   fresh Chrome, solving the challenge from scratch every time (~13s each).

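Reviewer annotation on the "Why strip FS wrapper?" decision: a rough Go sketch of how the Chrome `<pre>` wrapper could be removed before JSON parsing. The function name `stripChromeWrapper` and the regular expression are illustrative assumptions and are not taken from the project. The sketch also assumes Chrome HTML-escapes the payload inside `<pre>`, so it unescapes entities after extraction.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// preWrapper matches a document whose body consists of a single <pre> element,
// the shape described in the design decision above. (?s) lets "." span newlines.
var preWrapper = regexp.MustCompile(
	`(?s)^\s*<html[^>]*>\s*<head[^>]*>.*?</head>\s*<body[^>]*>\s*<pre[^>]*>(.*)</pre>\s*</body>\s*</html>\s*$`)

// stripChromeWrapper returns the payload inside the wrapper, or the input
// unchanged when the body is not wrapped (for example a normal HTML page).
func stripChromeWrapper(body string) string {
	m := preWrapper.FindStringSubmatch(body)
	if m == nil {
		return body
	}
	// Assumption: the browser escaped &, < and > when serializing the text.
	return html.UnescapeString(m[1])
}

func main() {
	wrapped := `<html><head><meta name="color-scheme" content="light dark"></head>` +
		`<body><pre style="word-wrap: break-word;">{"title":"A &amp; B"}</pre></body></html>`
	fmt.Println(stripChromeWrapper(wrapped))
}
```

A regular expression is enough for this single fixed shape; anything looser, such as extra elements next to the `<pre>`, would be safer to handle with an HTML parser.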
+### Known goquery Limitation — `:has()` + Attribute Selectors
+
+goquery (via the cascadia CSS engine) does **not** support `:has()` combined with
+attribute selectors like `a[href*=/video/]`. This works in Jsoup (Kotlin) but silently
+returns 0 matches in Go.
+
+**Wrong** (CSS-only, works in Kotlin, returns 0 in Go):
+```go
+doc.Find("figure:not(:has(a[href*=/video/]))")
+```
+
+**Correct** (programmatic filtering):
+```go
+doc.Find("figure").Each(func(_ int, el *goquery.Selection) {
+    if hasAttr(el, "a", "href", "/video/") { return }
+    // process entry
+})
+
+func hasAttr(el *goquery.Selection, tag, attr, substr string) bool {
+    found := false
+    el.Find(tag).EachWithBreak(func(_ int, a *goquery.Selection) bool {
+        if v, ok := a.Attr(attr); ok && strings.Contains(v, substr) {
+            found = true; return false
+        }
+        return true
+    })
+    return found
+}
+```
+
+Always check Kotlin references for `:has()` selectors and convert them to
+programmatic filtering when porting.
+
 Detailed implementation notes for complex sources are in the **Notes** section at the bottom.