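If you run Hermes Agent as a local AI assistant, web search and content extraction are must-haves. SearXNG aggregates results from multiple search engines, Firecrawl scrapes JS-rendered pages, and both are deployed together with Docker Compose. Hermes Agent then connects to the stack directly at 127.0.0.1:3002.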
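Why two layers?
The short answer: SearXNG alone cannot return the content of dynamically rendered pages, and Firecrawl's own search feature depends on an external service. Combined, they cover each other's gaps: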
| Capability | SearXNG | Firecrawl |
| --- | --- | --- |
| Aggregated search | ✓ | ✓ (via SearXNG) |
| JS-rendered scraping | ✗ | ✓ (Playwright) |
| Batch crawling | ✗ | ✓ |
| Structured extraction | ✗ | ✓ |
Architecture

    Hermes Agent
    ├── web_search → Firecrawl API (127.0.0.1:3002/v2/search)
    │   └── SearXNG (in-container searxng:8080)
    │       ├── Google (via proxy)
    │       ├── Bing (direct, cn.bing.com)
    │       └── Baidu (direct)
    │
    └── web_extract → Firecrawl API (127.0.0.1:3002/v1/scrape)
        └── Playwright renders JS pages

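The two call paths in the diagram can be sketched from the client side. This is a minimal illustration, not Hermes Agent's actual implementation: the endpoint paths come from the diagram, while the payload fields (`query`, `limit`, `url`, `formats`), the Bearer header, and all function names are assumptions made for the example.

```python
import json
import urllib.request

FIRECRAWL_URL = "http://127.0.0.1:3002"  # address taken from the diagram


def build_request(path: str, payload: dict, api_key: str = "") -> urllib.request.Request:
    """Build a JSON POST request for a Firecrawl endpoint without sending it."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: if a key is used at all, it is sent as a Bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        FIRECRAWL_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def search_request(query: str, limit: int = 5) -> urllib.request.Request:
    # web_search path: Firecrawl forwards the query to SearXNG.
    return build_request("/v2/search", {"query": query, "limit": limit})


def scrape_request(url: str) -> urllib.request.Request:
    # web_extract path: Firecrawl renders the page and returns its content.
    return build_request("/v1/scrape", {"url": url, "formats": ["markdown"]})


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the stack running, `send(search_request("test"))` exercises the search path and `send(scrape_request("https://example.com"))` the extraction path. Building and sending are kept separate so the request shape can be inspected without touching the network.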
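Containers involved:
| Component | Role | Port |
| --- | --- | --- |
| Firecrawl API | Web scraping + search proxy | 3002 (localhost) |
| Firecrawl Playwright | JS rendering | container-internal 3000 |
| SearXNG | Metasearch engine | 8080 (container-internal in the compose file below) |
| Redis | Firecrawl job queue | container-internal 6379 |
| RabbitMQ | Firecrawl message queue | container-internal 5672 |
| PostgreSQL | Firecrawl data store | container-internal 5432 |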
Deployment
1. Create directories

    mkdir -p /vol1/1000/docker/firecrawl
    mkdir -p /vol1/1000/docker/searxng

2. docker-compose.yaml
Create /vol1/1000/docker/firecrawl/docker-compose.yaml:

    x-common-service: &common-service
      restart: unless-stopped
      networks:
        - backend

    x-common-env: &common-env
      REDIS_URL: ${REDIS_URL:-redis://redis:6379}
      REDIS_RATE_LIMIT_URL: ${REDIS_URL:-redis://redis:6379}
      PLAYWRIGHT_MICROSERVICE_URL: ${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000/scrape}
      POSTGRES_USER: ${POSTGRES_USER:-postgres}
      POSTGRES_PASSWORD: "${POSTGRES_PASSWORD:-postgres}"
      POSTGRES_DB: ${POSTGRES_DB:-postgres}
      POSTGRES_HOST: ${POSTGRES_HOST:-nuq-postgres}
      POSTGRES_PORT: ${POSTGRES_PORT:-5432}
      USE_DB_AUTHENTICATION: ${USE_DB_AUTHENTICATION:-false}
      NUM_WORKERS_PER_QUEUE: ${NUM_WORKERS_PER_QUEUE:-8}
      CRAWL_CONCURRENT_REQUESTS: ${CRAWL_CONCURRENT_REQUESTS:-10}
      MAX_CONCURRENT_JOBS: ${MAX_CONCURRENT_JOBS:-5}
      BROWSER_POOL_SIZE: ${BROWSER_POOL_SIZE:-5}
      BULL_AUTH_KEY: ${BULL_AUTH_KEY}
      TEST_API_KEY: ${TEST_API_KEY}
      SEARXNG_ENDPOINT: ${SEARXNG_ENDPOINT}

    networks:
      backend:
        driver: bridge

    services:
      playwright-service:
        image: ghcr.io/firecrawl/playwright-service:latest
        environment:
          PORT: 3000
          MAX_CONCURRENT_PAGES: ${CRAWL_CONCURRENT_REQUESTS:-10}
        networks:
          - backend
        restart: unless-stopped

      api:
        <<: *common-service
        image: ghcr.io/firecrawl/firecrawl-api:latest
        environment:
          <<: *common-env
          HOST: "0.0.0.0"
          PORT: ${INTERNAL_PORT:-3002}
          EXTRACT_WORKER_PORT: ${EXTRACT_WORKER_PORT:-3004}
          WORKER_PORT: ${WORKER_PORT:-3005}
        ports:
          - "127.0.0.1:3002:3002"
        depends_on:
          redis:
            condition: service_healthy
          rabbitmq:
            condition: service_healthy

      redis:
        image: redis:alpine
        restart: unless-stopped
        networks:
          - backend
        healthcheck:
          test: ["CMD", "redis-cli", "ping"]
          interval: 10s
          timeout: 5s
          retries: 5

      rabbitmq:
        image: rabbitmq:3-management
        restart: unless-stopped
        networks:
          - backend
        healthcheck:
          test: rabbitmq-diagnostics -q ping
          interval: 30s
          timeout: 30s
          retries: 3

      nuq-postgres:
        image: ghcr.io/mendableai/nuq-postgres:latest
        restart: unless-stopped
        environment:
          POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
          POSTGRES_USER: ${POSTGRES_USER:-postgres}
          POSTGRES_DB: ${POSTGRES_DB:-postgres}
        networks:
          - backend
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U postgres"]
          interval: 10s
          timeout: 5s
          retries: 5

      searxng:
        image: searxng/searxng:latest
        restart: unless-stopped
        volumes:
          - /vol1/1000/docker/searxng/settings.yml:/etc/searxng/settings.yml:ro
          - /vol1/1000/docker/searxng/limiter.toml:/etc/searxng/limiter.toml:ro
        networks:
          - backend
        extra_hosts:
          - "host.docker.internal:host-gateway"

3. Environment variables

    cd /vol1/1000/docker/firecrawl

    POSTGRES_PASSWORD=$(openssl rand -hex 16)
    BULL_AUTH_KEY=$(openssl rand -hex 16)
    TEST_API_KEY=$(openssl rand -hex 16)

    cat > .env << EOF
    PORT=3002
    HOST=0.0.0.0
    REDIS_URL=redis://redis:6379
    PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/scrape
    POSTGRES_USER=postgres
    POSTGRES_PASSWORD=$POSTGRES_PASSWORD
    POSTGRES_DB=postgres
    POSTGRES_HOST=nuq-postgres
    POSTGRES_PORT=5432
    USE_DB_AUTHENTICATION=false
    NUM_WORKERS_PER_QUEUE=8
    CRAWL_CONCURRENT_REQUESTS=10
    MAX_CONCURRENT_JOBS=5
    BROWSER_POOL_SIZE=5
    BULL_AUTH_KEY=$BULL_AUTH_KEY
    TEST_API_KEY=$TEST_API_KEY
    SEARXNG_ENDPOINT=http://searxng:8080
    EOF

    echo "TEST_API_KEY: $TEST_API_KEY"

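Three of the variables in the compose file (`BULL_AUTH_KEY`, `TEST_API_KEY`, `SEARXNG_ENDPOINT`) are interpolated without a default, so a missing line in `.env` silently becomes an empty value. A small pre-flight check along these lines can catch that before starting the stack. It is only a sketch: the function names are made up for the example, and the parser handles plain `KEY=VALUE` lines only, not the full `.env` syntax Docker Compose accepts.

```python
from pathlib import Path

# Keys the compose file interpolates without a default value.
REQUIRED_KEYS = ("BULL_AUTH_KEY", "TEST_API_KEY", "SEARXNG_ENDPOINT")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(env: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or set to an empty string."""
    return [key for key in required if not env.get(key)]


def check_env_file(path: str) -> list[str]:
    """Read a .env file from disk and report its missing required keys."""
    return missing_keys(parse_env(Path(path).read_text(encoding="utf-8")))
```

Calling `check_env_file("/vol1/1000/docker/firecrawl/.env")` should return an empty list for the file generated above.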
4. SearXNG configuration
/vol1/1000/docker/searxng/settings.yml:

    use_default_settings: true

    general:
      instance_name: "Firecrawl SearXNG"

    search:
      safe_search: 0
      autocomplete: ""
      default_lang: "auto"
      formats:
        - html
        - json

    server:
      bind_address: "0.0.0.0"
      # YAML does not run shell commands: paste the output of
      # `openssl rand -hex 32` here, not the command itself.
      secret_key: "<output of openssl rand -hex 32>"
      limiter: false
      image_proxy: true

    ui:
      static_use_hash: true

    engines:
      - name: google
        disabled: false
        # Google goes through an HTTP proxy listening on the host at port 7890
        proxies:
          all://:
            - http://host.docker.internal:7890

      - name: bing
        disabled: false
        base_url: https://cn.bing.com/
      - name: baidu
        disabled: false

      - name: duckduckgo
        disabled: true
      - name: brave
        disabled: true
      - name: startpage
        disabled: true
      - name: wikipedia
        disabled: true
      - name: wikidata
        disabled: true
      - name: qwant
        disabled: true
      - name: mojeek
        disabled: true
      - name: yahoo
        disabled: true

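Because settings.yml is plain YAML, a shell substitution such as `$(openssl rand -hex 32)` inside it is never evaluated; the key has to be generated first and then written into the file. One possible way to do that is sketched below. The helper names are hypothetical, and the regex assumes a single-line `secret_key:` entry like the one in the file above.

```python
import re
import secrets


def new_secret_key() -> str:
    """Return 64 hex characters, the same shape as `openssl rand -hex 32`."""
    return secrets.token_hex(32)


def set_secret_key(settings_text: str, key: str) -> str:
    """Replace the value of the first `secret_key:` line, keeping its indentation."""
    pattern = re.compile(r"^([ \t]*secret_key:[ \t]*).*$", re.MULTILINE)
    # A function replacement avoids any backslash interpretation in the key.
    updated, count = pattern.subn(
        lambda match: f'{match.group(1)}"{key}"', settings_text, count=1
    )
    if count == 0:
        raise ValueError("no secret_key line found")
    return updated
```

A typical use would be to read settings.yml, pass its text through `set_secret_key(text, new_secret_key())`, and write the result back before starting the container.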
/vol1/1000/docker/searxng/limiter.toml:

    [botdetection]
    ipv4_prefix = 32
    ipv6_prefix = 48
    trusted_proxies = [
      '127.0.0.0/8',
      '::1',
      '172.16.0.0/12',
    ]

    [botdetection.ip_limit]
    filter_link_local = false
    link_token = false

    [botdetection.ip_lists]
    block_ip = []
    pass_ip = [
      '127.0.0.0/8',
      '::1',
      '172.16.0.0/12',
      '10.0.0.0/8',
    ]

5. Start

    cd /vol1/1000/docker/firecrawl
    docker compose up -d
    docker compose ps

Verify:

    # Firecrawl API
    curl -s http://127.0.0.1:3002/

    # SearXNG search (requires port 8080 to be published on the host;
    # the compose file above does not map it)
    curl -s 'http://127.0.0.1:8080/search?q=test&format=json' | head -c 200

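`docker compose up -d` returns once the containers have been started, which is not necessarily the moment the API accepts requests, so a check run immediately afterwards can fail spuriously. A minimal polling sketch is shown below; `http_ok` and `wait_until` are illustrative names, and the retry count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, DNS failure, ...


def wait_until(probe, attempts: int = 30, delay: float = 2.0, sleep=time.sleep) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no sleep after the final failed attempt
    return False
```

For example, `wait_until(lambda: http_ok("http://127.0.0.1:3002/"))` blocks for up to about a minute until the Firecrawl API responds. The `sleep` parameter is injectable so the loop can be exercised without real waiting.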
6. Configure Hermes Agent
~/.hermes/.env:

    FIRECRAWL_API_URL=http://127.0.0.1:3002

~/.hermes/config.yaml:

    web:
      backend: firecrawl

Then restart the gateway.
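Pitfalls
SearXNG
| Problem | Cause | Fix |
| --- | --- | --- |
| JSON API returns 403 | format=json is disabled by default | Add json to search.formats |
| Crash at startup: limiter.toml schema invalid | limiter.toml is malformed | Use the format shown above and mount it at /etc/searxng/limiter.toml |
| bing returns 0 results | www.bing.com answers with a 302 to cn.bing.com, and httpx does not follow the redirect | base_url: https://cn.bing.com/ |
| duckduckgo hits a CAPTCHA | DDG is aggressive toward automated requests | Disable it; there is no reliable workaround |
| Many timeouts | use_default_settings: true leaves every default engine enabled | Explicitly disable the engines you do not need |
| secret_key too short | 32+ characters are required | openssl rand -hex 32 |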
Firecrawl
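| Problem | Cause | Fix |
| --- | --- | --- |
| Search returns empty results | SEARXNG_ENDPOINT is not passed into the container | x-common-env must include SEARXNG_ENDPOINT |
| Changed environment variables have no effect | Environment variables are applied only when a container is created; restarting an existing container does not reload them | docker compose up -d --force-recreate |
| USE_DB_AUTHENTICATION error | Firecrawl does not recognize the value | Set it to false |
| ghcr.io images pull slowly | Network conditions in mainland China | Configure a registry mirror, then docker tag the pulled image back to its original name |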
Engine availability test
After deployment, check each engine individually:

    for engine in google bing baidu; do
      count=$(curl -s "http://127.0.0.1:8080/search?q=test&format=json&engines=$engine" \
        | python3 -c "import sys,json; print(len(json.load(sys.stdin).get('results',[])))")
      echo "$engine: $count results"
    done

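The inline `python3 -c` snippet in the loop raises a traceback whenever SearXNG answers with something other than JSON, for instance the 403 page returned when `format=json` is not enabled. A slightly more tolerant variant might look like the sketch below. It assumes the same `127.0.0.1:8080` address as the loop, and the function names and the `-1` sentinel are choices made for this example.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

SEARXNG_URL = "http://127.0.0.1:8080"  # same address as the shell loop


def search_url(engine: str, query: str = "test") -> str:
    """Build the JSON search URL restricted to a single engine."""
    params = urllib.parse.urlencode({"q": query, "format": "json", "engines": engine})
    return f"{SEARXNG_URL}/search?{params}"


def count_results(body: str) -> int:
    """Count results in a SearXNG JSON body; return -1 if it is not a JSON object."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return -1
    if not isinstance(data, dict):
        return -1
    return len(data.get("results", []))


def check_engine(engine: str, timeout: float = 30.0) -> int:
    """Query one engine; return its result count, or -1 on an HTTP error or bad body."""
    try:
        with urllib.request.urlopen(search_url(engine), timeout=timeout) as response:
            return count_results(response.read().decode("utf-8", errors="replace"))
    except urllib.error.HTTPError:
        return -1  # e.g. 403 when the JSON format is not enabled
```

Looping over `("google", "bing", "baidu")` and printing `check_engine(name)` reproduces the shell test; a value of `0` points at the engine itself, while `-1` points at the SearXNG configuration.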
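Summary
In this Firecrawl + SearXNG setup, SearXNG aggregates search results from Google, Bing, and Baidu, while Firecrawl handles JS rendering and page scraping. The two services run in separate containers, talk to each other over the internal Docker network, and expose a single port to the host, 127.0.0.1:3002. Most deployment pitfalls sit in the SearXNG engine configuration: leaving too many default engines enabled causes timeouts, bing needs cn.bing.com to avoid the redirect problem, and DuckDuckGo is best disabled outright.
Nothing in the stack has to be exposed to the public internet. Everything runs on the local machine, which suits scenarios with data-privacy requirements.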