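# Log Center Integration Guide

## Overview

Log Center is a centralized platform for error-log collection and AI-driven automatic repair. It exposes a REST API that any project can integrate with.

Once integrated, three types of error reports are covered:

| Type | `source` value | Description | Trigger |
|------|----------------|-------------|---------|
| Runtime errors | `runtime` | Exceptions raised while the application is running (Python/JS/Dart) | A global exception hook in the code reports them automatically |
| CI/CD errors | `cicd` | Pipeline failures such as build, test, and lint | Reported when a Gitea Actions step fails |
| K8s deployment errors | `deployment` | Abnormal Pod states (CrashLoopBackOff, OOMKilled, etc.) | A K8s CronJob scans periodically and reports them |

**Full integration flow:**

1. **Register project information**: call the API to submit the project metadata (name, repository URL, local path)
2. **Integrate runtime error reporting**: add global exception capture to the application code
3. **Integrate CI/CD error reporting**: add a failure-report step to the Gitea Actions pipeline
4. **Integrate K8s deployment error reporting**: make sure the K8s Pod health monitor can map the Deployment to the project

> **Important**: Step 1 must be completed first; otherwise the Repair Agent cannot locate the code repository or the local path.

---
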
## Service Addresses

| Environment | API URL | Dashboard |
|-------------|---------|-----------|
| Local development | `http://localhost:8002` | `http://localhost:8003` |
| Production | `https://qiyuan-log-center-api.airlabs.art` | `https://qiyuan-log-center.airlabs.art` |

---

## Step 1: Register Project Information

When integrating with Log Center for the first time, **you must register the project information first**. The Repair Agent cannot work without it.

### How to Register

First report an initialization log (this triggers automatic project creation), then call the PUT endpoint to fill in the metadata:

```bash
# 1. Report an initialization log to trigger automatic project creation
curl -X POST "${LOG_CENTER_URL}/api/v1/logs/report" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "your_project_id",
    "environment": "production",
    "level": "WARNING",
    "error": {
      "type": "ProjectInit",
      "message": "Project registered to Log Center",
      "stack_trace": ["Project initialization"]
    },
    "repo_url": "https://gitea.airlabs.art/team/your_project.git"
  }'

# 2. Fill in the project metadata
curl -X PUT "${LOG_CENTER_URL}/api/v1/projects/your_project_id" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Project display name",
    "repo_url": "https://gitea.airlabs.art/team/your_project.git",
    "local_path": "/absolute/path/to/project",
    "description": "Project description"
  }'
```

### Registration Examples by Language

#### Python

```python
import requests
import os

LOG_CENTER_URL = os.getenv("LOG_CENTER_URL", "http://localhost:8002")

def register_project():
    """Call once on first integration to register the project with Log Center."""
    project_id = "your_project_id"

    # 1. Report an initialization log to trigger project creation
    requests.post(f"{LOG_CENTER_URL}/api/v1/logs/report", json={
        "project_id": project_id,
        "environment": os.getenv("ENVIRONMENT", "production"),
        "level": "WARNING",
        "error": {
            "type": "ProjectInit",
            "message": "Project registered to Log Center",
            "stack_trace": ["Project initialization"],
        },
        "repo_url": "https://gitea.airlabs.art/team/your_project.git",
    }, timeout=5)

    # 2. Fill in the project metadata
    requests.put(f"{LOG_CENTER_URL}/api/v1/projects/{project_id}", json={
        "name": "Project display name",
        "repo_url": "https://gitea.airlabs.art/team/your_project.git",
        "local_path": "/absolute/path/to/project",
        "description": "Project description",
    }, timeout=5)
```

#### JavaScript / TypeScript

```typescript
const LOG_CENTER_URL = import.meta.env.VITE_LOG_CENTER_URL || 'http://localhost:8002';

async function registerProject() {
  const projectId = 'your_project_id';

  // 1. Report an initialization log to trigger project creation
  await fetch(`${LOG_CENTER_URL}/api/v1/logs/report`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      project_id: projectId,
      environment: import.meta.env.MODE,
      level: 'WARNING',
      error: {
        type: 'ProjectInit',
        message: 'Project registered to Log Center',
        stack_trace: ['Project initialization'],
      },
      repo_url: 'https://gitea.airlabs.art/team/your_project.git',
    }),
  });

  // 2. Fill in the project metadata
  await fetch(`${LOG_CENTER_URL}/api/v1/projects/${projectId}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      name: 'Project display name',
      repo_url: 'https://gitea.airlabs.art/team/your_project.git',
      local_path: '/absolute/path/to/project',
      description: 'Project description',
    }),
  });
}
```

### Project Metadata Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `project_id` | string | ✅ | Unique project identifier, e.g. `rtc_backend`, `rtc_web` |
| `name` | string | ✅ | Project display name |
| `repo_url` | string | ✅ | Git repository URL (used by the Repair Agent to clone and push code) |
| `local_path` | string | ✅ | Absolute local path of the project (the Repair Agent runs repairs in this directory) |
| `description` | string | ❌ | Project description |

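
The required fields above can be checked locally before the PUT request is sent. The following is a minimal sketch, not part of Log Center: the helper name `build_project_metadata` is hypothetical, and the only rules it applies are the ones the table states (three required fields, and an absolute `local_path`).

```python
import os


def build_project_metadata(name, repo_url, local_path, description=""):
    """Return a PUT body for /api/v1/projects/{project_id}, or raise ValueError.

    Hypothetical helper: it only enforces what the field table states.
    """
    # The table marks these three fields as required.
    for field, value in (("name", name), ("repo_url", repo_url), ("local_path", local_path)):
        if not value:
            raise ValueError(f"{field} is required")
    # The Repair Agent works in this directory, so the path must be absolute.
    if not os.path.isabs(local_path):
        raise ValueError("local_path must be an absolute path")
    return {
        "name": name,
        "repo_url": repo_url,
        "local_path": local_path,
        "description": description,
    }
```

The returned dictionary can be passed as the `json=` argument of the `requests.put` call in the registration example.
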
---

## Step 2: Integrate Runtime Error Reporting

> `source: "runtime"` (the default; may be omitted)

Add global exception capture to the application code so that unhandled runtime exceptions are reported to Log Center automatically.

### Report Format

```json
{
  "project_id": "rtc_backend",
  "environment": "production",
  "level": "ERROR",
  "error": {
    "type": "ValueError",
    "message": "invalid literal for int() with base 10: 'abc'",
    "file_path": "apps/users/views.py",
    "line_number": 42,
    "stack_trace": [
      "Traceback (most recent call last):",
      "  File \"apps/users/views.py\", line 42, in get_user",
      "ValueError: invalid literal for int() with base 10: 'abc'"
    ]
  },
  "context": {
    "url": "/api/users/123",
    "method": "GET",
    "user_id": "u_12345"
  }
}
```

### Runtime Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `project_id` | string | ✅ | Project identifier |
| `environment` | string | ✅ | Environment: `development`, `staging`, `production` |
| `level` | string | ✅ | Log level: `ERROR`, `WARNING`, `CRITICAL` |
| `source` | string | ❌ | Defaults to `runtime`; no need to send it |
| `timestamp` | string | ❌ | ISO 8601 format; server time is used if omitted |
| `version` | string | ❌ | Application version |
| `commit_hash` | string | ❌ | Git commit hash |
| `error.type` | string | ✅ | Exception type, e.g. `ValueError`, `TypeError` |
| `error.message` | string | ✅ | Error message |
| `error.file_path` | string | ✅ | Path of the file where the error occurred |
| `error.line_number` | int | ✅ | Line number where the error occurred |
| `error.stack_trace` | array | ✅ | Stack trace (an array of lines, or a single string) |
| `context` | object | ❌ | Additional context |

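
A client can check a payload against the required fields in the table before sending it. This is a sketch under assumptions: the function name `find_payload_problems` is hypothetical, and the server may apply stricter or looser validation than this.

```python
# Required fields and allowed levels, copied from the runtime field table.
REQUIRED_TOP_LEVEL = ("project_id", "environment", "level", "error")
REQUIRED_ERROR_FIELDS = ("type", "message", "file_path", "line_number", "stack_trace")
ALLOWED_LEVELS = {"ERROR", "WARNING", "CRITICAL"}


def find_payload_problems(payload):
    """Return a list of problems; an empty list means the payload looks complete.

    Hypothetical client-side check, not the server's validation logic.
    """
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in payload:
            problems.append(f"missing field: {key}")
    error = payload.get("error") or {}
    for key in REQUIRED_ERROR_FIELDS:
        if key not in error:
            problems.append(f"missing field: error.{key}")
    level = payload.get("level")
    if level is not None and level not in ALLOWED_LEVELS:
        problems.append(f"unexpected level: {level}")
    return problems
```

Running the check in development or in tests, rather than on the hot reporting path, keeps the reporter itself as small as possible.
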
### Python (Django / FastAPI)

```python
import requests
import traceback
import os

LOG_CENTER_URL = os.getenv("LOG_CENTER_URL", "http://localhost:8002")

def report_error(exc, context=None):
    """Report a runtime error to Log Center."""
    tb = traceback.extract_tb(exc.__traceback__)
    last_frame = tb[-1] if tb else None

    payload = {
        "project_id": "rtc_backend",
        "environment": os.getenv("ENVIRONMENT", "development"),
        "level": "ERROR",
        "error": {
            "type": type(exc).__name__,
            "message": str(exc),
            "file_path": last_frame.filename if last_frame else "unknown",
            "line_number": last_frame.lineno if last_frame else 0,
            # The three-argument form also works on Python versions before 3.10
            "stack_trace": traceback.format_exception(type(exc), exc, exc.__traceback__),
        },
        "context": context or {},
    }

    try:
        requests.post(
            f"{LOG_CENTER_URL}/api/v1/logs/report",
            json=payload,
            timeout=3,
        )
    except Exception:
        pass  # Fail silently so the main business flow is not affected
```

**Django integration point**: modify `custom_exception_handler` in `utils/exceptions.py`:

```python
def custom_exception_handler(exc, context):
    # Report to Log Center
    report_error(exc, {
        "view": str(context.get("view")),
        "request_path": context.get("request").path if context.get("request") else None,
    })
    # ... existing logic unchanged ...
```

**FastAPI integration point**: add a global exception handler:

```python
from fastapi import Request
from fastapi.concurrency import run_in_threadpool
from fastapi.responses import JSONResponse

@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):
    # report_error is synchronous (it uses requests), so it cannot be awaited
    # directly; run it in a worker thread to avoid blocking the event loop.
    await run_in_threadpool(report_error, exc, {
        "url": str(request.url),
        "method": request.method,
    })
    return JSONResponse(status_code=500, content={"detail": "Internal Server Error"})
```

### JavaScript / TypeScript (React / Vue)

```typescript
const LOG_CENTER_URL = import.meta.env.VITE_LOG_CENTER_URL || 'http://localhost:8002';

export function reportError(error: Error, context?: Record<string, unknown>) {
  const stackLines = error.stack?.split('\n') || [];
  // Extract file path and line number from the first stack frame
  const match = stackLines[1]?.match(/at\s+.*\s+\((.+):(\d+):\d+\)/);

  const payload = {
    project_id: 'rtc_web',
    environment: import.meta.env.MODE,
    level: 'ERROR',
    error: {
      type: error.name,
      message: error.message,
      file_path: match?.[1] || 'unknown',
      line_number: parseInt(match?.[2] || '0', 10),
      stack_trace: stackLines,
    },
    context: {
      url: window.location.href,
      userAgent: navigator.userAgent,
      ...context,
    },
  };

  // Prefer sendBeacon so the report survives page unload; fall back to fetch
  const blob = new Blob([JSON.stringify(payload)], { type: 'application/json' });
  if (navigator.sendBeacon) {
    navigator.sendBeacon(`${LOG_CENTER_URL}/api/v1/logs/report`, blob);
  } else {
    fetch(`${LOG_CENTER_URL}/api/v1/logs/report`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
      keepalive: true,
    }).catch(() => {});
  }
}
```

**Global error capture**: in the `main.tsx` / `main.ts` entry file:

```typescript
// JS runtime exceptions
window.onerror = (_message, source, lineno, colno, error) => {
  if (error) reportError(error, { source, lineno, colno });
};

// Unhandled Promise rejections
window.onunhandledrejection = (event: PromiseRejectionEvent) => {
  const error = event.reason instanceof Error
    ? event.reason
    : new Error(String(event.reason));
  reportError(error, { type: 'unhandledrejection' });
};
```

**Axios interceptor**: in `api.ts` / `request.ts` (reports 5xx server errors only):

```typescript
// `api` is the project's existing Axios instance
api.interceptors.response.use(
  (response) => response,
  (error: AxiosError) => {
    if (error.response && error.response.status >= 500) {
      reportError(error, {
        api_url: error.config?.url,
        method: error.config?.method,
        status: error.response.status,
      });
    }
    return Promise.reject(error);
  },
);
```

### Flutter (Dart)

```dart
import 'dart:convert';
import 'package:http/http.dart' as http;

const logCenterUrl = String.fromEnvironment(
  'LOG_CENTER_URL',
  defaultValue: 'http://localhost:8002',
);

Future<void> reportError(dynamic error, StackTrace stackTrace, {Map<String, dynamic>? context}) async {
  final stackLines = stackTrace.toString().split('\n');
  // Extract file path and line number from the top stack frame (#0)
  final match = RegExp(r'#0\s+.*\((.+):(\d+):\d+\)').firstMatch(stackLines.first);

  final payload = {
    'project_id': 'airhub_app',
    'environment': const String.fromEnvironment('ENVIRONMENT', defaultValue: 'development'),
    'level': 'ERROR',
    'error': {
      'type': error.runtimeType.toString(),
      'message': error.toString(),
      'file_path': match?.group(1) ?? 'unknown',
      'line_number': int.tryParse(match?.group(2) ?? '0') ?? 0,
      'stack_trace': stackLines.take(20).toList(),
    },
    'context': context ?? {},
  };

  try {
    await http.post(
      Uri.parse('$logCenterUrl/api/v1/logs/report'),
      headers: {'Content-Type': 'application/json'},
      body: jsonEncode(payload),
    ).timeout(const Duration(seconds: 3));
  } catch (_) {
    // Fail silently
  }
}
```

**Global capture**: in `main.dart`:

```dart
import 'dart:async'; // runZonedGuarded

import 'package:flutter/material.dart';

void main() {
  // Errors caught by the Flutter framework
  FlutterError.onError = (details) {
    reportError(details.exception, details.stack ?? StackTrace.current);
  };

  // Uncaught asynchronous errors
  runZonedGuarded(() {
    runApp(const MyApp());
  }, (error, stack) {
    reportError(error, stack);
  });
}
```

---

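## Step 3: Integrate CI/CD Error Reporting

> `source: "cicd"`

Add failure reporting to the Gitea Actions pipeline so that when a build, test, or deploy fails, the actual error log is captured and reported to Log Center automatically.

### Key Points

1. **Capture logs with `tee`**: the output of build and deploy steps must be captured with `2>&1 | tee /tmp/xxx.log`; otherwise the reported stack_trace is empty
2. **Use `github.run_number`**: URLs must use `${{ github.run_number }}` (a per-repository sequence number); **do not use `github.run_id`** (a global ID, which links to the wrong page)
3. **Use the `${{ }}` template syntax**: it is more reliable than the `$GITHUB_*` environment variables
4. **Use a single combined report step**: one `if: failure()` step determines which stage failed and collects the matching log
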
### Report Format

```json
{
  "project_id": "rtc_backend",
  "environment": "main",
  "level": "ERROR",
  "source": "cicd",
  "commit_hash": "abc1234def5678",
  "repo_url": "https://gitea.airlabs.art/zyc/rtc_backend.git",
  "error": {
    "type": "CICDFailure",
    "message": "[build] Build and Deploy failed on branch main",
    "stack_trace": ["...last 50 lines of the actual build log..."]
  },
  "context": {
    "job_name": "build-and-deploy",
    "step_name": "build",
    "workflow": "Build and Deploy",
    "run_id": "24",
    "branch": "main",
    "actor": "zyc",
    "commit": "abc1234def5678",
    "run_url": "https://gitea.airlabs.art/zyc/rtc_backend/actions/runs/24"
  }
}
```

### CI/CD-Specific Fields

| Field | Description |
|-------|-------------|
| `source` | **Must** be set to `"cicd"` |
| `environment` | Use the branch name `${{ github.ref_name }}`, e.g. `main`, `dev` |
| `repo_url` | Repository URL, so the Repair Agent can associate the error with the repository |
| `error.type` | `CICDFailure` (generic) is recommended, or `DockerBuildError` / `TestFailure` / `DeployError` |
| `error.stack_trace` | The **actual error log** (captured with `tee`); do not hard-code placeholder text |
| `context.run_id` | **Must use `${{ github.run_number }}`** (not `github.run_id`) |
| `context.run_url` | Built as `https://gitea.airlabs.art/${{ github.repository }}/actions/runs/${{ github.run_number }}` |
| `context.step_name` | Name of the failed step |
| `context.actor` | User who triggered the run |
| `context.commit` | Full commit hash |

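
Because `error.stack_trace` carries raw log text, quotes, backslashes, and control characters in the log have to be escaped correctly. One way to avoid manual escaping is to build the body with a JSON library. The sketch below assumes Python is available on the runner; the function name `build_cicd_payload` and its parameters are hypothetical, while the field names follow the table above.

```python
import json


def build_cicd_payload(project_id, failed_step, log_text, branch, commit,
                       run_number, repository, tail_lines=50):
    """Build a CI/CD report body from captured log text (hypothetical helper)."""
    # Keep only the tail of the log, one array element per line.
    stack_trace = log_text.splitlines()[-tail_lines:]
    if not stack_trace:
        stack_trace = ["No captured output. Check Gitea Actions UI for details."]
    return {
        "project_id": project_id,
        "environment": branch,
        "level": "ERROR",
        "source": "cicd",
        "commit_hash": commit,
        "error": {
            "type": "CICDFailure",
            "message": f"[{failed_step}] Build and Deploy failed on branch {branch}",
            "stack_trace": stack_trace,
        },
        "context": {
            "step_name": failed_step,
            "run_id": str(run_number),  # run_number, not run_id
            "branch": branch,
            "commit": commit,
            "run_url": f"https://gitea.airlabs.art/{repository}/actions/runs/{run_number}",
        },
    }


# json.dumps handles all escaping, so arbitrary log content stays valid JSON.
example_body = json.dumps(
    build_cicd_payload("rtc_backend", "build", 'error: "x" not found\n', "main",
                       "abc1234def5678", 24, "zyc/rtc_backend")
)
```

The resulting string can be written to a file and sent with `curl -d @file`, which keeps the shell from touching the log content at all.
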
### Gitea Actions Integration (Recommended)

The complete example below shows the two key points: build steps capture their logs with `tee`, and a single combined report step at the end determines which stage failed.

```yaml
name: Build and Deploy

on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # ===== Build steps: capture logs with tee =====

      - name: Build Docker Image
        id: build
        run: |
          set -o pipefail
          docker buildx build \
            --push \
            --provenance=false \
            --tag your-registry/your-app:latest \
            . 2>&1 | tee /tmp/build.log

      - name: Deploy
        id: deploy
        run: |
          set -o pipefail
          {
            kubectl apply -f k8s/deployment.yaml
            kubectl rollout restart deployment/your-app
          } 2>&1 | tee /tmp/deploy.log

      # ===== Failure report (single combined step) =====

      - name: Report failure to Log Center
        if: failure()
        run: |
          # Determine which step failed and collect its log.
          # The sed pipeline escapes backslashes and double quotes, then joins
          # lines with \n so the text can be embedded in a JSON string.
          BUILD_LOG=""
          DEPLOY_LOG=""
          FAILED_STEP="unknown"

          if [[ "${{ steps.build.outcome }}" == "failure" ]]; then
            FAILED_STEP="build"
            if [ -f /tmp/build.log ]; then
              BUILD_LOG=$(tail -50 /tmp/build.log | sed 's/\\/\\\\/g; s/"/\\"/g' | sed ':a;N;$!ba;s/\n/\\n/g')
            fi
          elif [[ "${{ steps.deploy.outcome }}" == "failure" ]]; then
            FAILED_STEP="deploy"
            if [ -f /tmp/deploy.log ]; then
              DEPLOY_LOG=$(tail -50 /tmp/deploy.log | sed 's/\\/\\\\/g; s/"/\\"/g' | sed ':a;N;$!ba;s/\n/\\n/g')
            fi
          fi

          ERROR_LOG="${BUILD_LOG}${DEPLOY_LOG}"
          if [ -z "$ERROR_LOG" ]; then
            ERROR_LOG="No captured output. Check Gitea Actions UI for details."
          fi

          # Choose the source and error type
          if [[ "$FAILED_STEP" == "deploy" ]]; then
            SOURCE="deployment"
            ERROR_TYPE="DeployError"
          else
            SOURCE="cicd"
            ERROR_TYPE="DockerBuildError"
          fi

          curl -s -X POST "https://qiyuan-log-center-api.airlabs.art/api/v1/logs/report" \
            -H "Content-Type: application/json" \
            -d "{
              \"project_id\": \"your_project_id\",
              \"environment\": \"${{ github.ref_name }}\",
              \"level\": \"ERROR\",
              \"source\": \"${SOURCE}\",
              \"commit_hash\": \"${{ github.sha }}\",
              \"repo_url\": \"https://gitea.airlabs.art/zyc/your_project.git\",
              \"error\": {
                \"type\": \"${ERROR_TYPE}\",
                \"message\": \"[${FAILED_STEP}] Build and Deploy failed on branch ${{ github.ref_name }}\",
                \"stack_trace\": [\"${ERROR_LOG}\"]
              },
              \"context\": {
                \"job_name\": \"build-and-deploy\",
                \"step_name\": \"${FAILED_STEP}\",
                \"workflow\": \"${{ github.workflow }}\",
                \"run_id\": \"${{ github.run_number }}\",
                \"branch\": \"${{ github.ref_name }}\",
                \"actor\": \"${{ github.actor }}\",
                \"commit\": \"${{ github.sha }}\",
                \"run_url\": \"https://gitea.airlabs.art/${{ github.repository }}/actions/runs/${{ github.run_number }}\"
              }
            }" || true
```

### Using the report-cicd-error.sh Script

The project provides a generic reporting script, `scripts/report-cicd-error.sh` (requires `jq`), that can be called from CI steps:

```bash
# Usage: ./scripts/report-cicd-error.sh <project_id> <step_name> <error_message_or_file>
./scripts/report-cicd-error.sh rtc_backend "Build Docker Image" "Docker build failed: exit code 1"
./scripts/report-cicd-error.sh rtc_backend "Run Tests" /tmp/test-output.log
```

The script automatically:

- Infers `error_type` from the step name (DockerBuildError / NpmBuildError / TestFailure / LintError)
- Reads the Gitea Actions environment variables to fill in the context
- Reads the last 100 lines as the stack_trace when a file path is passed

---

## Step 4: Integrate K8s Deployment Error Reporting

> `source: "deployment"`

A K8s Pod health-monitor CronJob periodically scans the cluster for abnormal Pods and reports them to Log Center.

### Report Format

```json
{
  "project_id": "rtc_backend",
  "environment": "production",
  "level": "CRITICAL",
  "source": "deployment",
  "error": {
    "type": "CrashLoopBackOff",
    "message": "CrashLoopBackOff: back-off restarting failed container (pod: rtc-backend-xxx, container: api)",
    "file_path": null,
    "line_number": null,
    "stack_trace": ["...container logs before the crash (last 50 lines)..."]
  },
  "context": {
    "namespace": "default",
    "pod_name": "rtc-backend-xxx-yyy",
    "container_name": "api",
    "deployment_name": "rtc-backend",
    "restart_count": 5,
    "node_name": "node-1"
  }
}
```

### Deployment-Specific Fields

| Field | Description |
|-------|-------------|
| `source` | **Must** be set to `"deployment"` |
| `level` | `"CRITICAL"` is recommended, since Pod failures are usually severe |
| `error.type` | Taken from the K8s status: `CrashLoopBackOff`, `OOMKilled`, `ImagePullBackOff`, `ErrImagePull`, etc. |
| `error.file_path` | May be `null` |
| `error.line_number` | May be `null` |
| `error.stack_trace` | Container log output before the crash |
| `context.namespace` | K8s namespace |
| `context.pod_name` | Pod name |
| `context.deployment_name` | Deployment name (used for fingerprint deduplication) |
| `context.restart_count` | Restart count |
| `context.node_name` | Node name |

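### Monitored Abnormal States

| State | Description |
|-------|-------------|
| `CrashLoopBackOff` | The container crashes and restarts repeatedly |
| `OOMKilled` | The container was killed for exceeding its memory limit |
| `ImagePullBackOff` / `ErrImagePull` | Pulling the image failed |
| `CreateContainerConfigError` | Container configuration error |
| `RunContainerError` | The container failed to start |
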
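
To illustrate where these states appear, the sketch below scans one Pod object in the JSON shape returned by the Kubernetes API (for example from `kubectl get pod -o json`). It is not the Monitor's actual implementation: the function name `find_abnormal_containers` is hypothetical, and only the reasons listed in the table are checked.

```python
# Reasons taken from the table of monitored abnormal states.
ABNORMAL_REASONS = {
    "CrashLoopBackOff", "OOMKilled", "ImagePullBackOff", "ErrImagePull",
    "CreateContainerConfigError", "RunContainerError",
}


def find_abnormal_containers(pod):
    """Return (container_name, reason, restart_count) tuples for one Pod dict.

    Hypothetical helper; `pod` follows the Kubernetes Pod JSON structure.
    """
    found = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        state = status.get("state") or {}
        last_state = status.get("lastState") or {}
        # A reason can sit on the current waiting/terminated state, or on the
        # previous termination (e.g. OOMKilled before the latest restart).
        candidates = (
            (state.get("waiting") or {}).get("reason"),
            (state.get("terminated") or {}).get("reason"),
            (last_state.get("terminated") or {}).get("reason"),
        )
        for reason in candidates:
            if reason in ABNORMAL_REASONS:
                found.append((status.get("name"), reason, status.get("restartCount", 0)))
                break  # report each container at most once
    return found
```

Each returned tuple maps naturally onto `context.container_name`, `error.type`, and `context.restart_count` in the report format.
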
### Integration Method: Automatic Mapping

The K8s Monitor CronJob is already running in the cluster and scans every 5 minutes. On startup, the Monitor fetches the project list from the Log Center API (`GET /api/v1/projects`) and automatically generates the app label -> project_id mapping.

**Mapping rule**: underscores in `project_id` are replaced with hyphens to form the app label, and a variant with the `-dev` suffix is generated as well.

| project_id | Generated app labels |
|---|---|
| `rtc_backend` | `rtc-backend`, `rtc-backend-dev` |
| `rtc_web` | `rtc-web`, `rtc-web-dev` |
| `log_center_api` | `log-center-api`, `log-center-api-dev` |

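
The mapping rule can be written down in a few lines. This is a sketch of the rule as stated above, not the Monitor's source code; the function names `app_labels_for` and `build_app_to_project` are hypothetical.

```python
def app_labels_for(project_id):
    """Return the app labels derived from a project_id: hyphenated, plus a -dev variant."""
    base = project_id.replace("_", "-")
    return [base, f"{base}-dev"]


def build_app_to_project(project_ids):
    """Build the app label -> project_id mapping for a list of project IDs."""
    mapping = {}
    for project_id in project_ids:
        for label in app_labels_for(project_id):
            mapping[label] = project_id
    return mapping
```

For example, `build_app_to_project(["rtc_backend"])` maps both `rtc-backend` and `rtc-backend-dev` to `rtc_backend`, matching the first row of the table.
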
**A new project needs only two steps to join K8s monitoring**:

1. Complete the project registration in Step 1 (make sure the project appears in the Log Center project list)
2. Use the hyphenated form of `project_id` as the `app` label of the K8s Deployment

Make sure your K8s Deployment has an `app` label:

```yaml
metadata:
  labels:
    app: your-app  # must match a key in APP_TO_PROJECT
```

### CronJob Deployment Configuration

If the Monitor is not yet deployed in the cluster, use the following configuration:

```yaml
# k8s/monitor-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-health-monitor
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-monitor
          containers:
            - name: monitor
              image: your-registry/k8s-pod-monitor:latest
              env:
                - name: LOG_CENTER_URL
                  value: "https://qiyuan-log-center-api.airlabs.art"
                - name: MONITOR_NAMESPACE
                  value: "default"
          restartPolicy: OnFailure
```

---

## Error Deduplication

Log Center deduplicates errors with a **fingerprint**. Each of the three sources uses its own fingerprint strategy:

| Source | Fingerprint composition |
|--------|-------------------------|
| `runtime` | `MD5(project_id \| error_type \| file_path \| line_number)` |
| `cicd` | `MD5(project_id \| cicd \| error_type \| job_name \| step_name)` |
| `deployment` | `MD5(project_id \| deployment \| error_type \| namespace \| deployment_name)` |

Errors with the same fingerprint are recorded only once. If an error that was already fixed appears again, it is reopened automatically (regression detection).

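
The table can be read as the following sketch. The field order follows the table, but the exact separator, encoding, and handling of missing values on the server are assumptions, so hashes produced here are not guaranteed to match the ones Log Center stores. The function name `fingerprint` is hypothetical.

```python
import hashlib


def fingerprint(source, project_id, error_type, *, file_path=None, line_number=None,
                job_name=None, step_name=None, namespace=None, deployment_name=None):
    """Compute a dedup fingerprint per source (illustrative only)."""
    if source == "cicd":
        parts = [project_id, "cicd", error_type, job_name, step_name]
    elif source == "deployment":
        parts = [project_id, "deployment", error_type, namespace, deployment_name]
    else:  # runtime
        parts = [project_id, error_type, file_path, line_number]
    # Assumption: parts are joined with "|" and missing values become "".
    joined = "|".join("" if part is None else str(part) for part in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

Note what is absent from each fingerprint: the error message and the Pod name do not take part, so the same exception at the same line, or the same failure across restarted Pods of one Deployment, collapses into a single record.
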
---

## Error Status Flow

```
NEW → VERIFYING → PENDING_FIX → FIXING → FIXED → VERIFIED → DEPLOYED
          ↓                       ↓
  CANNOT_REPRODUCE            FIX_FAILED
```

| Status | Description |
|--------|-------------|
| `NEW` | Newly reported error |
| `VERIFYING` | Reproduction is being verified |
| `CANNOT_REPRODUCE` | Could not be reproduced |
| `PENDING_FIX` | Waiting to be fixed |
| `FIXING` | The AI Agent is working on a fix |
| `FIXED` | Fixed, awaiting verification |
| `VERIFIED` | Fix verified |
| `DEPLOYED` | Deployed to production |
| `FIX_FAILED` | Fix failed |

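
The flow can also be expressed as a transition table. This sketch makes two assumptions that the status list alone does not confirm: `CANNOT_REPRODUCE` branches off from `VERIFYING`, and `FIX_FAILED` branches off from `FIXING`. Reopening on regression is not modeled. The names `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical.

```python
# Forward transitions of the main flow, plus the two assumed failure branches.
ALLOWED_TRANSITIONS = {
    "NEW": {"VERIFYING"},
    "VERIFYING": {"PENDING_FIX", "CANNOT_REPRODUCE"},  # branch is an assumption
    "PENDING_FIX": {"FIXING"},
    "FIXING": {"FIXED", "FIX_FAILED"},  # branch is an assumption
    "FIXED": {"VERIFIED"},
    "VERIFIED": {"DEPLOYED"},
    "DEPLOYED": set(),
    "CANNOT_REPRODUCE": set(),
    "FIX_FAILED": set(),
}


def can_transition(current, target):
    """Return True if `target` directly follows `current` in the sketched flow."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

A dashboard or client that polls error status could use such a table to flag unexpected jumps, for instance a record going from `NEW` straight to `FIXED`.
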
---

## API Reference

### Report an Error Log

**POST** `/api/v1/logs/report`

**Response:**

```json
// New error
{"message": "Log reported", "id": 123}

// Duplicate error (deduplicated)
{"message": "Log deduplicated", "id": 123, "status": "NEW"}

// Regression (a fixed error appeared again)
{"message": "Regression detected, reopened", "id": 123}
```

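
A caller that wants to act on the outcome can branch on the `message` field. The sketch below relies only on the three message strings shown above; whether the API offers a more stable discriminator (such as a status code or a dedicated field) is not stated here, so matching on message text is an assumption. The function name `classify_report_response` is hypothetical.

```python
def classify_report_response(body):
    """Map a report response body to 'new', 'duplicate', 'regression', or 'unknown'."""
    message = body.get("message", "")
    if message.startswith("Regression detected"):
        return "regression"
    if message == "Log deduplicated":
        return "duplicate"
    if message == "Log reported":
        return "new"
    return "unknown"  # any message not listed in the examples above
```

In a CI pipeline, for example, a `regression` result could be surfaced in the job summary, while `duplicate` can usually be ignored.
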
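### Project Management API

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/projects` | List projects |
| GET | `/api/v1/projects/{project_id}` | Get project details |
| PUT | `/api/v1/projects/{project_id}` | Edit project configuration |

---

## Best Practices

1. **Set a timeout**: give report requests a 3-second timeout so they do not affect the main business flow
2. **Fail silently**: a failed report must not affect the user experience; handle every catch block silently
3. **Report asynchronously**: send reports asynchronously so the main flow is not blocked
4. **Add context**: include useful context wherever possible (user ID, request URL, etc.)
5. **Distinguish environments**: set the `environment` field correctly to separate development from production
6. **Use `|| true` in CI/CD**: a failed report step must not block the pipeline
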
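
Practices 1 to 3 can be combined in one small helper. The sketch below uses only the Python standard library: a thread pool makes the call non-blocking, the request carries a 3-second timeout, and every failure is swallowed. The names `report_in_background` and `_post_json` are hypothetical, and the pool size of 2 is an arbitrary choice.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Small shared pool so reporting never blocks the calling thread.
_executor = ThreadPoolExecutor(max_workers=2)


def _post_json(url, payload, timeout=3):
    """POST a JSON payload with a timeout (default 3 seconds)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def report_in_background(url, payload, send=_post_json):
    """Submit a report without blocking; failures are swallowed silently."""
    def task():
        try:
            send(url, payload)
        except Exception:
            pass  # fail silently: reporting must never break the main flow
    return _executor.submit(task)
```

The `send` parameter exists so the transport can be replaced in tests or swapped for another HTTP client; callers normally ignore the returned future.
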
---

## Environment Variable Configuration

### Python Projects
```bash
# .env
LOG_CENTER_URL=http://localhost:8002
ENVIRONMENT=development
```

### JavaScript Projects
```bash
# .env
VITE_LOG_CENTER_URL=http://localhost:8002
```

### Flutter Projects
```bash
# Passed at compile time; both values go into the same command
flutter run \
  --dart-define=LOG_CENTER_URL=http://localhost:8002 \
  --dart-define=ENVIRONMENT=development
```

### Gitea Actions
```yaml
env:
  LOG_CENTER_URL: https://qiyuan-log-center-api.airlabs.art
```

---

## Full API Documentation

Visit: [http://localhost:8002/docs](http://localhost:8002/docs)