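For a local AI assistant built on Hermes Agent, web search and page scraping are essential. SearXNG aggregates results from multiple search engines, Firecrawl scrapes JavaScript-rendered pages, and both are deployed together with Docker Compose. Hermes Agent connects directly to the Firecrawl API at 127.0.0.1:3002.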

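Why two layers?

The short answer: SearXNG alone cannot retrieve the content of dynamically rendered pages, and Firecrawl's own search feature depends on an external service. Combined, they cover each other's gaps: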

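| Capability | SearXNG | Firecrawl |
| --- | --- | --- |
| Aggregated search | ✓ | ✓ (via SearXNG) |
| JS-rendered scraping | ✗ | ✓ (Playwright) |
| Batch crawling | ✗ | ✓ |
| Structured extraction | ✗ | ✓ |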

Architecture

Hermes Agent
├── web_search → Firecrawl API (127.0.0.1:3002/v2/search)
│   └── SearXNG (inside the container network, searxng:8080)
│       ├── Google (via proxy)
│       ├── Bing (direct, cn.bing.com)
│       └── Baidu (direct)
│
└── web_extract → Firecrawl API (127.0.0.1:3002/v1/scrape)
    └── Playwright renders JS pages
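
The two call paths in the diagram can be sketched in Python. This is a minimal illustration, not Hermes Agent's actual client code: the helper names are made up for this example, and the request bodies assume that the search endpoint accepts `query` and `limit` and that the scrape endpoint accepts `url` and `formats`. Check the field names against the Firecrawl version you deploy.

```python
import json
import urllib.request

# Assumption: the Firecrawl API is published on this loopback address,
# as shown in the architecture diagram.
FIRECRAWL_BASE = "http://127.0.0.1:3002"


def build_request(path: str, payload: dict, api_key: str = "") -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a Firecrawl endpoint."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed if the deployment enforces an API key.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        FIRECRAWL_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def search_request(query: str, limit: int = 5) -> urllib.request.Request:
    # web_search path: Firecrawl forwards the query to SearXNG.
    return build_request("/v2/search", {"query": query, "limit": limit})


def scrape_request(url: str) -> urllib.request.Request:
    # web_extract path: Firecrawl fetches the page, using Playwright for JS rendering.
    return build_request("/v1/scrape", {"url": url, "formats": ["markdown"]})


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Once the stack is running, `send(search_request("test"))` exercises the search path and `send(scrape_request("https://example.com"))` exercises the scrape path. Building the request separately from sending it makes the payload easy to inspect before anything goes over the network.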

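Containers involved:

| Component | Role | Port |
| --- | --- | --- |
| Firecrawl API | Web scraping + search proxy | 3002 (localhost) |
| Firecrawl Playwright | JS rendering | 3000 (container-internal) |
| SearXNG | Metasearch engine | 8080 (localhost) |
| Redis | Firecrawl task queue | 6379 (container-internal) |
| RabbitMQ | Firecrawl message queue | 5672 (container-internal) |
| PostgreSQL | Firecrawl data store | 5432 (container-internal) |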

Deployment

1. Create the directories

mkdir -p /vol1/1000/docker/firecrawl
mkdir -p /vol1/1000/docker/searxng

2. docker-compose.yaml

Create /vol1/1000/docker/firecrawl/docker-compose.yaml:

x-common-service: &common-service
  restart: unless-stopped
  networks:
    - backend

x-common-env: &common-env
  REDIS_URL: ${REDIS_URL:-redis://redis:6379}
  REDIS_RATE_LIMIT_URL: ${REDIS_URL:-redis://redis:6379}
  PLAYWRIGHT_MICROSERVICE_URL: ${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000/scrape}
  POSTGRES_USER: ${POSTGRES_USER:-postgres}
  POSTGRES_PASSWORD: "${POSTGRES_PASSWORD:-postgres}"
  POSTGRES_DB: ${POSTGRES_DB:-postgres}
  POSTGRES_HOST: ${POSTGRES_HOST:-nuq-postgres}
  POSTGRES_PORT: ${POSTGRES_PORT:-5432}
  USE_DB_AUTHENTICATION: ${USE_DB_AUTHENTICATION:-false}
  NUM_WORKERS_PER_QUEUE: ${NUM_WORKERS_PER_QUEUE:-8}
  CRAWL_CONCURRENT_REQUESTS: ${CRAWL_CONCURRENT_REQUESTS:-10}
  MAX_CONCURRENT_JOBS: ${MAX_CONCURRENT_JOBS:-5}
  BROWSER_POOL_SIZE: ${BROWSER_POOL_SIZE:-5}
  BULL_AUTH_KEY: ${BULL_AUTH_KEY}
  TEST_API_KEY: ${TEST_API_KEY}
  # Without this entry Firecrawl search returns empty results.
  SEARXNG_ENDPOINT: ${SEARXNG_ENDPOINT}

networks:
  backend:
    driver: bridge

services:
  playwright-service:
    image: ghcr.io/firecrawl/playwright-service:latest
    environment:
      PORT: 3000
      MAX_CONCURRENT_PAGES: ${CRAWL_CONCURRENT_REQUESTS:-10}
    networks:
      - backend
    restart: unless-stopped

  api:
    <<: *common-service
    image: ghcr.io/firecrawl/firecrawl-api:latest
    environment:
      <<: *common-env
      HOST: "0.0.0.0"
      PORT: ${INTERNAL_PORT:-3002}
      EXTRACT_WORKER_PORT: ${EXTRACT_WORKER_PORT:-3004}
      WORKER_PORT: ${WORKER_PORT:-3005}
    ports:
      # Bind to loopback only; Hermes Agent connects here.
      - "127.0.0.1:3002:3002"
    depends_on:
      redis:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy

  redis:
    image: redis:alpine
    restart: unless-stopped
    networks:
      - backend
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  rabbitmq:
    image: rabbitmq:3-management
    restart: unless-stopped
    networks:
      - backend
    healthcheck:
      test: rabbitmq-diagnostics -q ping
      interval: 30s
      timeout: 30s
      retries: 3

  nuq-postgres:
    image: ghcr.io/mendableai/nuq-postgres:latest
    restart: unless-stopped
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
      POSTGRES_USER: ${POSTGRES_USER:-postgres}
      POSTGRES_DB: ${POSTGRES_DB:-postgres}
    networks:
      - backend
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  searxng:
    image: searxng/searxng:latest
    restart: unless-stopped
    # Published on loopback only, so the curl checks against 127.0.0.1:8080 can reach it.
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - /vol1/1000/docker/searxng/settings.yml:/etc/searxng/settings.yml:ro
      - /vol1/1000/docker/searxng/limiter.toml:/etc/searxng/limiter.toml:ro
    networks:
      - backend
    extra_hosts:
      # Lets the container reach a proxy running on the Docker host.
      - "host.docker.internal:host-gateway"
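
The `127.0.0.1:` prefix in the api port mapping is what keeps the service off the LAN: in Compose short syntax, a mapping without a host IP is bound on all interfaces. The helper below is a small sketch (the function name is hypothetical, not part of any Docker tooling) that checks whether a mapping string is restricted to loopback.

```python
import ipaddress


def is_loopback_binding(port_spec: str) -> bool:
    """Return True if a Compose short-syntax port mapping binds a loopback host IP.

    "127.0.0.1:3002:3002" has host IP 127.0.0.1 (loopback only).
    "3002:3002" has no host IP, so Docker binds all interfaces.
    """
    parts = port_spec.split(":")
    if len(parts) < 3:
        return False  # "HOST:CONTAINER" or "CONTAINER": no explicit host IP
    # Everything before the last two fields is the host IP (may be bracketed IPv6).
    host_ip = ":".join(parts[:-2]).strip("[]")
    try:
        return ipaddress.ip_address(host_ip).is_loopback
    except ValueError:
        return False
```

For example, `is_loopback_binding("127.0.0.1:3002:3002")` is true, while `is_loopback_binding("3002:3002")` and `is_loopback_binding("0.0.0.0:3002:3002")` are both false.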

3. Environment variables

cd /vol1/1000/docker/firecrawl

POSTGRES_PASSWORD=$(openssl rand -hex 16)
BULL_AUTH_KEY=$(openssl rand -hex 16)
TEST_API_KEY=$(openssl rand -hex 16)

cat > .env << EOF
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/scrape
POSTGRES_USER=postgres
POSTGRES_PASSWORD=$POSTGRES_PASSWORD
POSTGRES_DB=postgres
POSTGRES_HOST=nuq-postgres
POSTGRES_PORT=5432
USE_DB_AUTHENTICATION=false
NUM_WORKERS_PER_QUEUE=8
CRAWL_CONCURRENT_REQUESTS=10
MAX_CONCURRENT_JOBS=5
BROWSER_POOL_SIZE=5
BULL_AUTH_KEY=$BULL_AUTH_KEY
TEST_API_KEY=$TEST_API_KEY
SEARXNG_ENDPOINT=http://searxng:8080
EOF

echo "TEST_API_KEY: $TEST_API_KEY"
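
Because a missing or empty variable tends to fail silently (an unset SEARXNG_ENDPOINT, for instance, just produces empty search results), it can help to sanity-check the generated file before starting the stack. The sketch below is an illustration only: the list of required keys is this example's choice, and the parser handles plain KEY=VALUE lines, not the full dotenv syntax.

```python
# Keys this example treats as mandatory; adjust to your own setup.
REQUIRED_KEYS = ("POSTGRES_PASSWORD", "BULL_AUTH_KEY", "TEST_API_KEY", "SEARXNG_ENDPOINT")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or have an empty value."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]
```

Running `missing_keys(open(".env").read())` in the firecrawl directory should return an empty list if the heredoc above expanded all three generated secrets.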

4. SearXNG configuration

/vol1/1000/docker/searxng/settings.yml

use_default_settings: true
general:
  instance_name: "Firecrawl SearXNG"

search:
  safe_search: 0
  autocomplete: ""
  default_lang: "auto"
  formats:
    - html
    - json  # required, otherwise format=json requests return 403

server:
  bind_address: "0.0.0.0"
  # YAML is not processed by a shell, so $(...) would stay a literal string.
  # Paste the output of `openssl rand -hex 32` here.
  secret_key: "REPLACE_WITH_OUTPUT_OF_OPENSSL_RAND_HEX_32"
  limiter: false
  image_proxy: true

ui:
  static_use_hash: true

engines:
  # Engines that need a proxy
  - name: google
    disabled: false
    proxies:
      all://:
        - http://host.docker.internal:7890

  # Engines reachable directly
  - name: bing
    disabled: false
    base_url: https://cn.bing.com/
  - name: baidu
    disabled: false

  # Disable engines that are unreachable or unreliable
  - name: duckduckgo
    disabled: true
  - name: brave
    disabled: true
  - name: startpage
    disabled: true
  - name: wikipedia
    disabled: true
  - name: wikidata
    disabled: true
  - name: qwant
    disabled: true
  - name: mojeek
    disabled: true
  - name: yahoo
    disabled: true
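
One detail worth spelling out: settings.yml is read as plain YAML, so a shell expression such as `$(openssl rand -hex 32)` inside the file is never expanded, and secret_key must hold a literal random string. As a rough sketch, the key can be generated and substituted into the file text with Python; the placeholder token passed to `fill_secret_key` is whatever marker you put in your own template, not something SearXNG defines.

```python
import secrets


def new_secret_key(num_bytes: int = 32) -> str:
    """Return a random hex string; 32 bytes gives 64 hex characters."""
    return secrets.token_hex(num_bytes)


def fill_secret_key(settings_text: str, placeholder: str, key: str) -> str:
    """Replace a placeholder in the settings.yml text with a literal key."""
    if placeholder not in settings_text:
        raise ValueError("placeholder not found in settings text")
    return settings_text.replace(placeholder, key)
```

`secrets.token_hex(32)` plays the same role as `openssl rand -hex 32`: both yield 64 hexadecimal characters, comfortably above the 32-character minimum noted in the troubleshooting table.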

/vol1/1000/docker/searxng/limiter.toml

[botdetection]
ipv4_prefix = 32
ipv6_prefix = 48
trusted_proxies = [
  '127.0.0.0/8',
  '::1',
  '172.16.0.0/12',
]

[botdetection.ip_limit]
filter_link_local = false
link_token = false

[botdetection.ip_lists]
block_ip = []
pass_ip = [
  '127.0.0.0/8',
  '::1',
  '172.16.0.0/12',
  '10.0.0.0/8',
]
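
To see what the pass_ip ranges actually cover, the snippet below tests a few addresses against them with Python's `ipaddress` module. It only illustrates the CIDR arithmetic and is not SearXNG's own matching code; the sample addresses are assumptions (Docker bridge networks commonly hand out addresses in the 172.17.x.x to 172.31.x.x range, which is why 172.16.0.0/12 is on the list).

```python
import ipaddress

# The pass_ip ranges from limiter.toml above.
PASS_IP = ["127.0.0.0/8", "::1", "172.16.0.0/12", "10.0.0.0/8"]


def is_passed(client_ip: str, ranges=PASS_IP) -> bool:
    """Return True if client_ip falls inside any of the given networks."""
    address = ipaddress.ip_address(client_ip)
    for cidr in ranges:
        network = ipaddress.ip_network(cidr)
        # Compare only within the same IP version (IPv4 vs IPv6).
        if address.version == network.version and address in network:
            return True
    return False
```

A container address such as 172.18.0.5 matches 172.16.0.0/12, whereas a typical home LAN address like 192.168.1.10 is not covered by any entry and would need its own range if requests ever came from there.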

5. Start the stack

cd /vol1/1000/docker/firecrawl
docker compose up -d
docker compose ps

Verify:

# Firecrawl API
curl -s http://127.0.0.1:3002/

# SearXNG search
curl -s 'http://127.0.0.1:8080/search?q=test&format=json' | head -c 200

6. Configure Hermes Agent

~/.hermes/.env

FIRECRAWL_API_URL=http://127.0.0.1:3002

~/.hermes/config.yaml

web:
  backend: firecrawl

Restart the gateway:

hermes gateway restart

Troubleshooting notes

SearXNG

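| Problem | Cause | Fix |
| --- | --- | --- |
| JSON API returns 403 | format=json is disabled by default | Add json to search.formats |
| Crash at startup: limiter.toml schema invalid | Malformed limiter.toml | Use the format shown above and mount it at /etc/searxng/limiter.toml |
| bing returns 0 results | www.bing.com responds with a 302 to cn.bing.com, and httpx does not follow the redirect | base_url: https://cn.bing.com/ |
| duckduckgo hits a CAPTCHA | DDG is aggressive toward automated requests | Disable it; there is no reliable workaround |
| Many timeouts | use_default_settings: true turns on all of the default engines | Explicitly disable the ones you do not need |
| secret_key too short | At least 32 characters are required | openssl rand -hex 32 |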

Firecrawl

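| Problem | Cause | Fix |
| --- | --- | --- |
| Search returns empty results | SEARXNG_ENDPOINT is not passed into the container | x-common-env must include SEARXNG_ENDPOINT |
| Changed environment variables have no effect | Environment variables are applied only when a container is created; a running container does not reload them | docker compose up -d --force-recreate |
| USE_DB_AUTHENTICATION error | Firecrawl does not recognize the value | Set it to false |
| ghcr.io images pull slowly | Network conditions in mainland China | Configure a registry mirror, then docker tag the pulled image back to its original name |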

Engine availability test

After deployment, check each engine individually:

for engine in google bing baidu; do
  count=$(curl -s "http://127.0.0.1:8080/search?q=test&format=json&engines=$engine" \
    | python3 -c "import sys,json; print(len(json.load(sys.stdin).get('results',[])))")
  echo "$engine: $count results"
done
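
The same check can be written as a small Python script, which makes failures easier to tell apart: a connection error or a non-JSON reply is reported per engine instead of breaking the pipeline. This is a sketch under the same assumption as the curl loop, namely that SearXNG answers on 127.0.0.1:8080; the function names are this example's own.

```python
import json
import urllib.parse
import urllib.request

# Assumption: SearXNG is reachable on this address, as in the curl commands above.
SEARXNG_URL = "http://127.0.0.1:8080/search"


def engine_url(engine: str, query: str = "test") -> str:
    """Build the JSON search URL restricted to a single engine."""
    params = {"q": query, "format": "json", "engines": engine}
    return SEARXNG_URL + "?" + urllib.parse.urlencode(params)


def count_results(body: str) -> int:
    """Count the entries in the 'results' list of a SearXNG JSON response."""
    return len(json.loads(body).get("results", []))


def check_engines(engines=("google", "bing", "baidu"), timeout: float = 30.0) -> dict:
    """Query each engine once and return a result count or an error string."""
    counts = {}
    for engine in engines:
        try:
            with urllib.request.urlopen(engine_url(engine), timeout=timeout) as response:
                counts[engine] = count_results(response.read().decode("utf-8"))
        except (OSError, ValueError) as error:
            # OSError covers connection failures and HTTP errors; ValueError covers bad JSON.
            counts[engine] = f"error: {error}"
    return counts
```

Calling `check_engines()` returns a dictionary keyed by engine name. A count of 0 for bing usually points back to the base_url redirect issue in the table above, while an HTTP 403 error points to the missing json format.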

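Summary

In this setup, SearXNG aggregates search results from Google, Bing, and Baidu, while Firecrawl handles JS rendering and page scraping. The services run as separate containers that talk to each other over the internal Docker network, and Hermes Agent needs only a single endpoint, 127.0.0.1:3002. Most of the deployment pitfalls are in the SearXNG engine configuration: leaving too many default engines enabled causes timeouts, bing needs cn.bing.com to avoid the redirect problem, and DuckDuckGo is best disabled outright.

Nothing in the stack has to be exposed to the public internet. It runs entirely locally, which suits scenarios with data privacy requirements.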