n8n workflow automation for building analytics including SkySpark multi-agent systems, FastAPI tool servers, workflow orchestration, and automated building system alert triage
Install via CLI
openskills install mbcoalson/skills---
name: n8n-automation
description: n8n workflow automation for building analytics including SkySpark multi-agent systems, FastAPI tool servers, workflow orchestration, and automated building system alert triage
---
# n8n Automation Skill
This skill provides expertise in building automation workflows using n8n, specifically focused on SkySpark integration and multi-agent systems for building analytics.
## Core Capabilities
### SkySpark Multi-Agent Architecture
- Gatekeeper agent with specialist routing
- Automated spark triage and prioritization
- HVAC specialist for mechanical system analysis
- Energy calculation agent for savings estimates
- Automated report generation workflows
### n8n Workflow Development
- REST API integrations with building systems
- Multi-agent coordination patterns
- Error handling and logging strategies
- Scheduled automation and notifications
### SkySpark Integration
- Haystack REST API connectivity
- ZINC and JSON data format handling
- Spark and trend data extraction
- Authentication and security patterns
## Available Scripts
- scripts/fastapi_tool_server.py: Template for FastAPI tool servers
- scripts/n8n_workflow_templates/: Common workflow patterns
- scripts/skyspark_mock_data.py: Generate test data for development
## Reference Materials
- references/multi_agent_patterns.md: Proven multi-agent architectures
- references/skyspark_api.md: SkySpark Haystack API documentation
- references/n8n_best_practices.md: Workflow development guidelines
- references/roadmap.md: 5-phase project roadmap (Foundation → Production)
## Asset Templates
- assets/workflow_exports/: n8n workflow export files
- assets/api_schemas/: OpenAPI schemas for tool servers
- assets/mock_data/: Sample SkySpark data for testing
## Current Project: SkySpark Multi-Agent System
### 5-Phase Roadmap
1. **Foundation**: FastAPI tool server + basic agent interaction
2. **Mock Integration**: Agent logic with simulated SkySpark data
3. **Real SkySpark**: Live Haystack API connection
4. **Multi-Agent**: Full specialist coordination
5. **Production**: Error handling, scheduling, notifications
### Multi-Agent Pattern
```
User Request → Gatekeeper Agent (routes by intent)
↓
┌───────────┼───────────┬──────────────┐
↓ ↓ ↓ ↓
Triage HVAC Energy Report
Agent Specialist Calc Agent Generator
```
## Agent Architecture for Future RL Orchestration
### Stateless Agent Design Principles
All SkySpark agents should be designed with future dynamic orchestration in mind:
**Statelessness Requirements:**
- Agents receive complete diagnostic context as input (no persistent memory)
- All necessary context passed in, new state returned out
- Enables any agent to follow any agent (maximum routing flexibility)
- Simplifies debugging and testing
**Standardized I/O Schema:**
Input: `DiagnosticState` object containing:
- `alert_context`: Current alert/issue details (Haystack tags, spark data)
- `prior_actions`: Array of prior agent outputs/reasoning
- `available_tools`: List of tool endpoints and capabilities
- `token_budget_remaining`: Tokens available for continued diagnosis
- `confidence_score`: Current diagnostic confidence (0.0-1.0)
- `metadata`: Alert severity, facility info, timestamps
Output: `AgentResponse` object containing:
- `updated_state`: Refreshed DiagnosticState with new findings
- `agent_findings`: This agent's analysis, calculations, or actions
- `recommended_next_agents`: Suggested routing (even if not used in Phase 0-1)
- `token_cost`: Tokens consumed by this agent call
- `confidence_delta`: Change in diagnostic confidence (+/- 0.0-1.0)
- `actions_taken`: List of tools called or queries executed
**Role-Based Architecture:**
Separate agent identity (role prompts) from action logic:
*Tool-Use Agents:*
- `SkySpark_Query_Agent`: Execute Axon queries against SkySpark database
- `Python_Analysis_Agent`: Statistical analysis, calculations, trending
- `File_Reader_Agent`: Parse equipment specs, CSV logs, JSON configs
- `Weather_Data_Agent`: Fetch external weather data via API
- `Documentation_Agent`: Search equipment manuals, ASHRAE standards
*Reasoning Agents:*
- `Diagnostic_Planner_Agent`: Decompose complex issues into sub-diagnostics
- `Root_Cause_Analyzer_Agent`: Synthesize data to identify probable causes
- `Validation_Agent`: Verify diagnostic logic and calculations
- `Reflection_Agent`: Assess diagnostic trajectory, propose refinements
- `Summary_Agent`: Generate concise diagnostic reports
- `Resolution_Agent`: Recommend corrective actions
- `Modification_Agent`: Correct errors in prior reasoning
### Orchestration Evolution Roadmap
**Phase 0-1: Static n8n Workflows (Current)**
- Hardcoded agent routing via n8n nodes is acceptable
- BUT design workflows to call centralized routing functions
- Log every routing decision with full context
- Focus: Get agents working with standardized I/O
**Phase 2-3: Rule-Based Orchestrator**
- Implement `agent_router` function (Python/FastAPI endpoint)
- Route based on: alert type/severity, confidence score, prior sequence
- Manually tune rules based on logged diagnostic patterns
- A/B test against static workflows
- Focus: Data-driven routing without ML complexity
**Phase 4-5: RL-Based Orchestrator**
- Replace rule-based router with learned policy
- Train on logged diagnostic trajectories (minimum 200 episodes)
- Optimize for: `R = accuracy - λ × token_cost - β × time_penalty`
- Enable online learning from production outcomes
- Focus: Continuous optimization and adaptation
### Instrumentation & Logging Requirements
**Critical: Start logging from Day 1, even with static workflows**
Log every diagnostic episode with schema:
```json
{
"episode_id": "alert_123456_2025-12-02",
"facility_name": "Building XYZ",
"alert_context": {
"alert_type": "High AHU static pressure",
"severity": "warning",
"haystack_refs": [...],
"triggered_at": "2025-12-02T10:15:00Z"
},
"agent_sequence": [
{
"timestamp": "2025-12-02T10:30:00Z",
"agent_type": "SkySpark_Query_Agent",
"input_state": {...},
"output_state": {...},
"tokens_used": 1500,
"duration_ms": 3400,
"tools_called": ["read_trends", "eval_axon"]
},
...
],
"final_outcome": {
"diagnostic_correct": true,
"root_cause_identified": "Filter 80% loaded",
"resolution_actions": ["Schedule filter replacement"],
"total_tokens": 8500,
"total_duration_ms": 15200,
"human_intervention_required": false,
"human_feedback": "Diagnosis accurate, action appropriate"
}
}
```
**Storage Strategy:**
- Store in n8n database OR dedicated TimescaleDB
- Retain minimum 200 episodes per alert type for RL training
- Enable easy export for offline analysis and model training
- Version control state schemas alongside agents
**What to Log:**
- Complete agent call sequence per diagnostic episode
- Full input/output states at each step
- Token consumption per agent (measure via API or estimate)
- Human validation/corrections (critical for reward labels)
- Final diagnostic outcome and resolution effectiveness
- Time to resolution relative to SLA thresholds
### Reward Function Design
**Preliminary specification for future RL training:**
```
R = accuracy_score - λ × token_cost - β × time_penalty
```
Where:
- `accuracy_score ∈ [0, 1]` - Validated against human review OR resolution outcome
- `token_cost = tokens_used / max_token_budget` - Normalized efficiency metric
- `time_penalty = max(0, (time_to_resolve / sla_threshold) - 1)` - Timeliness factor
- `λ = 0.1` - Efficiency weight (tunable based on production costs)
- `β = 0.05` - Timeliness weight (varies by alert severity)
**Accuracy Validation:**
- Binary: Did diagnostic identify correct root cause? (0 or 1)
- Graduated: Partial credit for narrowing to subsystem (0.0-1.0 scale)
- Ground truth: FM confirmation or system behavior post-resolution
**Efficiency Considerations:**
- Critical alerts: Prioritize accuracy over token cost (λ → 0)
- Informational alerts: Prioritize efficiency (λ → 0.2)
- Balance prevents both over-analysis and premature conclusions
**Timeliness Thresholds:**
- Critical: 15 min SLA → high β penalty if exceeded
- Warning: 60 min SLA → moderate β penalty
- Info: 4 hour SLA → minimal β penalty
### SkySpark Agent Taxonomy
Agents aligned with Puppeteer research categories:
**Tool-Use Agents (External data access):**
1. SkySpark_Query_Agent - Axon query execution
2. Python_Analysis_Agent - Statistical calculations, trending
3. File_Reader_Agent - Equipment specs, logs, configs
4. Weather_Data_Agent - External weather APIs
5. Documentation_Agent - Manuals, standards, knowledge base
**Reasoning Agents (Internal cognition):**
1. Diagnostic_Planner_Agent - Task decomposition
2. Root_Cause_Analyzer_Agent - Data synthesis
3. Validation_Agent - Logic verification
4. Reflection_Agent - Meta-reasoning
5. Summary_Agent - Report generation
6. Resolution_Agent - Action recommendation
7. Modification_Agent - Error correction
### Critical Don'ts
**Avoid these pitfalls:**
- ❌ Don't jump to RL before having logged data (minimum 200 episodes)
- ❌ Don't hardcode agent sequences into prompts (keep routing external)
- ❌ Don't skip instrumentation "for now" (you'll never add it later)
- ❌ Don't design agents that assume specific predecessors
- ❌ Don't sacrifice debuggability for sophistication
- ❌ Don't optimize prematurely (get working system first)
- ❌ Don't force-fit RL where rules work fine
**Do these instead:**
- ✅ Keep current n8n approach for Phase 0-1 (it's pragmatic)
- ✅ Design agents with standardized I/O from day 1
- ✅ Start logging immediately, even with static workflows
- ✅ Validate every agent output initially (build ground truth dataset)
- ✅ Build tooling to visualize agent sequences (aid debugging)
- ✅ Document reward function assumptions (evolve with data)
- ✅ Plan human override mechanisms (safety net)
### Success Metrics by Phase
**Phase 1 (Static n8n + Logging):**
- [ ] 5+ agents implemented with standardized I/O
- [ ] 100% of diagnostic episodes logged with schema
- [ ] Static workflow resolves 70%+ of test alerts correctly
- [ ] Average 8-12k tokens per diagnostic episode
**Phase 2 (Rule-Based Router):**
- [ ] Rule-based router handles 80%+ of alert types
- [ ] 10% reduction in average tokens vs. static workflow
- [ ] Diagnostic accuracy maintained or improved (≥70%)
- [ ] Routing decisions explainable to FM stakeholders
**Phase 3 (RL Orchestrator):**
- [ ] RL policy trained on 200+ episodes
- [ ] Matches or exceeds rule-based accuracy
- [ ] 20%+ token reduction vs. static baseline
- [ ] Cyclic reasoning patterns emerge for complex diagnostics
**Phase 4 (Production RL):**
- [ ] Online learning maintains 90%+ diagnostic accuracy
- [ ] 30%+ token reduction vs. baseline
- [ ] Handles novel alert types through adaptation
- [ ] Human intervention required <10% of cases
### Integration with Existing Architecture
**Compatible with current project structure:**
- Project instructions: `Project-Specific-Instructions/SkySpark-n8n-Workflow.md`
- n8n workflows: `n8n/`
- Agent design aligns with FastAPI tool server architecture
- State schema accommodates SkySpark Haystack data model
- Reward function maps to ComEd program requirements
**Backward Compatibility:**
- Phase 0-1 static workflows remain fully functional
- Standardized I/O is additive, not breaking
- Logging infrastructure runs alongside existing logic
- Can defer RL indefinitely if rule-based routing suffices
## Usage Examples
- "Build n8n workflow to triage SkySpark alerts"
- "Create FastAPI server for energy calculations"
- "Set up multi-agent routing for HVAC diagnostics"
- "Design workflow for automated building system reports"
- "Integrate n8n with SkySpark Haystack API"
- "Design stateless agent with standardized I/O for future RL"
- "Set up diagnostic episode logging infrastructure"
- "Implement rule-based agent router for Phase 2"
---
## Model Selection Constraints (Added 2025-01-01)
### Hard Requirement: US/EU Origin Only
Company works with US government - **no Chinese-origin models permitted**.
This excludes: Qwen, DeepSeek, Yi, Baichuan, ChatGLM, and derivatives.
### Approved Local Models
| Agent Role | Model | Origin | Size | Rationale |
|------------|-------|--------|------|-----------|
| Classifier | Microsoft Phi-4 | 🇺🇸 US | 14B | Fast, accurate structured output, MIT license |
| Test Designer | Mistral Codestral 22B | 🇫🇷 EU | 22B | Purpose-built for code |
| Physics Explainer | Meta Llama 3.1 70B | 🇺🇸 US | 70B | Best open reasoning model |
| Reporter | Mistral Small 24B | 🇫🇷 EU | 24B | High quality summaries |
| Triage (local option) | Meta Llama 3.1 8B | 🇺🇸 US | 8B | Good balance for routine alerts |
| Coder Agent | Claude API | 🇺🇸 US | - | Best for code generation (keep on API) |
### Alternative Smaller Models (CPU/Light GPU)
- Microsoft Phi-4-mini-reasoning (3.8B) - Classification tasks
- Mistral 7B Instruct v0.3 - General purpose
- Meta CodeLlama-7B-Instruct - Code understanding
---
## Physics Checker Design (Added 2025-01-01)
### Two-Layer Architecture
**Layer 1: Rule Engine (Python, deterministic)**
- ASHRAE bounds checking
- Equipment capacity limits
- Thermodynamic constraints (energy balance)
- Returns: PASS/FAIL + which rule violated
- **No LLM involved** - pure Python logic
**Layer 2: Explanation Agent (local LLM, only if FAIL)**
- Takes: rule violation details + equipment context
- Returns: human-readable explanation of why it's impossible
- Model: Meta Llama 3.1 8B Instruct (US origin)
- Only invoked when Layer 1 fails
### Initial Physics Bounds
```python
PHYSICS_BOUNDS = {
"zone_temp": {"min": 55, "max": 85, "unit": "°F"},
"supply_air_temp": {"min": 50, "max": 65, "unit": "°F"},
"chilled_water_delta_t": {"min": 8, "max": 16, "unit": "°F"},
"hot_water_delta_t": {"min": 15, "max": 40, "unit": "°F"},
"airflow_per_sqft": {"min": 0.5, "max": 3.0, "unit": "CFM/sqft"},
"cooling_efficiency": {"min": 0.3, "max": 1.0, "unit": "kW/ton"},
"boiler_efficiency": {"min": 0.75, "max": 0.98, "unit": "fraction"},
"fan_static_pressure": {"min": 0.5, "max": 6.0, "unit": "inWG"},
}
```
### Integration with Agent Taxonomy
Physics Checker slots into the **Reasoning Agents** category:
- Input: Recommendation from Triage Agent + equipment context
- Output: PASS (proceed) or FAIL (block + explain)
- Invoked before any recommendation reaches the Reporter
---
## Calculation Tool Verification Pipeline (Added 2025-01-01)
### Core Principle: Agents Don't Do Math
- All calculations performed by human-approved Python scripts
- Scripts wrapped as callable tools for agents
- Results are auditable and reproducible
- Minimizes token count and eliminates hallucinated math
### Verification Workflow
```
Coder Agent (Claude) → Test Designer (Codestral) → Static Analyzer (AST)
↓ ↓ ↓
Initial script Edge cases, bounds Security check
Reference values Purity check
Type hints
↓ ↓ ↓
└────────────────────────┴─────────────────────────┘
↓
Test Executor (pytest)
↓
Documentation Generator
↓
Human Review Package
[Approve / Reject / Edit]
```
### Verification Types (Diversity = Trust)
| Type | What It Catches | Implementation |
|------|-----------------|----------------|
| Generated tests | Logic errors, edge cases | Local LLM (Codestral 22B) |
| Reference tests | Wrong formulas | Human-provided ASHRAE examples |
| Static analysis | Security, side effects | Python AST parsing |
| Type checking | Interface mismatches | mypy |
| Unit conversion | Dimensional errors | pint library |
| Bounds checking | Impossible outputs | Physics Checker rules |
### CalcTool Registration Schema
```python
@dataclass
class CalcTool:
id: str
name: str # e.g., "chw_energy"
version: str # Semantic versioning
description: str
script_path: str # Path to .py file
function_name: str # Entry point
# Interface
input_schema: dict # JSON schema for inputs
output_schema: dict # JSON schema for outputs
# Approval chain
approved_by: str # Human who reviewed
approved_at: datetime
test_cases: List[dict] # Input/output pairs that must pass
# Usage tracking
invocation_count: int
last_invoked: datetime
error_count: int
```
---
## Value Economy System (Added 2025-01-01)
### Core Concept
Treat diagnostic strategies as an economy:
- Strategies have **value scores** that change based on outcomes
- Usefulness pays rent; wrongness gets evicted
- High-value strategies get promoted to **Skills**
- Unused strategies decay over time
### Strategy Artifact Schema
```python
@dataclass
class StrategyArtifact:
id: str
created_at: datetime
updated_at: datetime
# What it matches
alert_pattern: dict # Conditions this strategy applies to
building_types: List[str] # Hospital, university, office, etc.
system_types: List[str] # AHU, VAV, chiller, etc.
# What it recommends
diagnosis_template: str
action_template: str
calc_tools: List[str] # Which calculation scripts to invoke
# Value economy
status: Literal["candidate", "validated", "skill"]
value_score: float
confidence: float
# Evidence counters
times_matched: int
times_confirmed: int # Human said "yes, correct"
times_rejected: int # Human said "no, wrong"
energy_saved_kbtu: float # Cumulative attributed savings
last_used: datetime
# Lineage
source: Literal["human", "synthesized", "promoted"]
parent_ids: List[str]
```
### Promotion Gates
**Candidate → Validated:**
```python
VALIDATION_THRESHOLDS = {
"min_matches": 3,
"min_confirmations": 2,
"max_rejection_rate": 0.3,
"min_distinct_buildings": 2,
"physics_check_pass": True,
}
```
**Validated → Skill:**
```python
SKILL_PROMOTION_THRESHOLDS = {
"min_matches": 10,
"min_confirmations": 7,
"max_rejection_rate": 0.15,
"min_distinct_buildings": 3,
"min_age_days": 14,
"attributed_savings_kbtu": 1000,
}
```
### Decay Function
```python
def calculate_decay(artifact: StrategyArtifact, days_since_use: int) -> float:
"""Value decays 2% per day of non-use, floor at 0.1"""
decay_factor = 0.98 ** days_since_use
return max(artifact.value_score * decay_factor, 0.1)
```
### Value Update Logic
```python
REWARDS = {
"confirmed_correct": 1.0,
"energy_saved_per_1000_kbtu": 0.5,
"reused_different_building": 0.3,
}
PENALTIES = {
"rejected_wrong": -2.0,
"physics_violation": -3.0,
"caused_comfort_complaint": -1.5,
}
```
---
## Governance Layer (Added 2026-01-05)
### Core Principle
**Derive "goodness" from observable deltas, not model judgments.**
The governance layer monitors agent and strategy performance through measurable signals, uses a small local LLM for classification only, and modifies routing weights and configuration—never prompts or training data.
### Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ TELEMETRY LAYER │
│ Captures: latency, retries, overrides, tool failures, │
│ user edits, session completion │
└────────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ CLASSIFICATION ENGINE (Ollama) │
│ Model: Llama 3.1 8B Instruct (Q8) │
│ Task: Categorize failure types, edit substantiveness │
│ Output: Strict JSON enums only (no explanatory text) │
└────────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ SCORE CALCULATOR (Deterministic) │
│ Computes health scores from weighted signals │
│ NO LLM INVOLVED - pure Python logic │
└────────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ CONFIG MUTATOR │
│ Writes to: routing weights, canary patterns, deprecation │
│ NEVER modifies: agent prompts, training data │
└─────────────────────────────────────────────────────────────┘
```
### Measurable Signals (Observable Deltas)
**What we measure:**
- **User acceptance:** Did user accept recommendation without edits?
- **Edit size:** How much did user modify agent output? (character delta)
- **Override frequency:** How often did user reject and provide alternative?
- **Session completion:** Did diagnostic episode complete successfully?
- **Latency:** Time to produce recommendation
- **Retry count:** How many attempts before acceptable output?
- **Tool failures:** Number of failed tool calls during episode
**What we DON'T measure:**
- ❌ LLM judgment of "quality"
- ❌ Subjective ratings
- ❌ Vibes-based assessment
### Classification Engine Specs
**Model:** Llama 3.1 8B Instruct (Q8 quantization)
**Deployment:** Ollama (local)
**Temperature:** 0.0 (deterministic)
**Max Tokens:** 50
**Output Format:** Strict JSON schema with enums only
**Example Classification Task:**
```json
{
"edit_type": "substantive" | "formatting" | "none",
"failure_mode": "incorrect_diagnosis" | "missing_context" | "tool_failure" | "physics_violation" | "null",
"confidence": 0.0-1.0
}
```
**Acceptable Failures:**
- Misclassification (wrong enum selected)
- Refusal to classify (null output)
**Unacceptable Failures:**
- Hallucinated categories not in schema
- Explanatory text instead of enum
- Non-JSON output
### Health Score Calculation (Deterministic)
```python
health_score = (
success_rate * 0.40 +
non_replacement_rate * 0.25 +
non_override_rate * 0.20 +
completion_rate * 0.15
)
```
**Where:**
- `success_rate` = accepted recommendations / total recommendations
- `non_replacement_rate` = 1 - (substantive edits / total recommendations)
- `non_override_rate` = 1 - (explicit rejections / total recommendations)
- `completion_rate` = completed episodes / started episodes
**Thresholds:**
- **Reinforce** (≥0.80): Increase routing weight, consider promotion
- **Stable** (0.50-0.80): Maintain current routing
- **Decay Warning** (<0.50): Flag for review, reduce routing weight
- **Deprecation Candidate** (<0.35 for 14+ days): Human review required
### Strategy Lifecycle Management
**Canary Pattern (Gradual Rollout):**
```
New Strategy → 10% of matching alerts
↓ (5 successful uses)
50% of matching alerts
↓ (20 successful uses)
100% of matching alerts (promoted)
```
**Deprecation Workflow:**
1. Automated flag when health <0.35 for 14+ days
2. Generate deprecation report with metrics
3. Human review REQUIRED before archiving
4. Archive strategy with audit log
**Skill Cap:** Maximum 50 active strategies
- Prevents strategy sprawl
- Forces prioritization and consolidation
- Must deprecate before adding when at cap
### Phased Implementation
**Phase 1: Telemetry Foundation** (2-3 weeks)
- [ ] Add telemetry emission endpoints to tool_server.py
- [ ] Define event schema (JSON)
- [ ] Create append-only event store (SQLite or JSONL)
- [ ] Instrument agent call sites with telemetry
- [ ] Verify data collection end-to-end
**Phase 2: Deterministic Scoring** (1 week)
- [ ] Implement health score calculation (pure Python)
- [ ] Create scoring dashboard/report script
- [ ] Test with historical data (if available)
- [ ] Define threshold policies
**Phase 3: Classification Engine** (1-2 weeks)
- [ ] Deploy Ollama locally
- [ ] Install Llama 3.1 8B Instruct
- [ ] Create classification service (FastAPI endpoint)
- [ ] Define JSON schemas for classification tasks
- [ ] Test enum-only output enforcement
- [ ] Integrate with telemetry pipeline
**Phase 4: Config Mutation** (1 week)
- [ ] Implement routing weight adjustment logic
- [ ] Create canary pattern manager
- [ ] Build deprecation workflow with human approval
- [ ] Add audit logging for all config changes
- [ ] Test end-to-end governance loop
### Integration Points
**SkySpark n8n Workflow:**
- Monitor agent tool calls via telemetry
- Capture user acceptance/rejection signals
- Log session completion status
**Tool Server (FastAPI):**
- Add `/telemetry/emit` endpoint
- Add `/governance/health-score` endpoint
- Add `/governance/classify` endpoint (calls Ollama)
**Event Store Schema:**
```json
{
"event_id": "uuid",
"timestamp": "2026-01-05T10:30:00Z",
"event_type": "agent_recommendation" | "user_feedback" | "session_complete",
"agent_id": "triage_agent_v1",
"strategy_id": "ahu_vfd_savings_v2",
"signals": {
"latency_ms": 3400,
"retry_count": 0,
"tool_failures": 0,
"user_accepted": true,
"edit_delta_chars": 0,
"override": false,
"session_completed": true
},
"classification": {
"edit_type": "none",
"failure_mode": null
}
}
```
### Observability Stack (LangSmith + Telemetry)
**Two-Layer Observability:**
1. **LangSmith** - Agent execution tracing and evaluation
- Traces all LangGraph agent calls within n8n workflows
- Captures input/output for each agent step
- Evaluation metrics for agent performance
- Token usage tracking per agent call
2. **Governance Telemetry** - Outcome measurement
- User acceptance signals
- Edit deltas and overrides
- Session completion status
- Feeds into health score calculation
**Integration Architecture:**
```
n8n Workflow
↓
LangGraph Agent Node
├─→ LangSmith (trace execution, token costs)
└─→ Governance Telemetry (emit outcome signals)
↓
Event Store → Health Score → Config Mutation
```
**LangSmith Setup:**
```bash
# Install LangSmith
pip install langsmith langchain-anthropic
# Set API key
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="your-key-here"
export LANGCHAIN_PROJECT="skyspark-alert-triage"
```
**LangGraph Integration in n8n:**
```python
# In n8n Code node or custom LangGraph service
from langchain_anthropic import ChatAnthropic
from langgraph.graph import StateGraph
from langsmith import traceable
@traceable(name="alert_router")
def route_alert(state: DiagnosticState) -> str:
"""Route alert to appropriate specialist agent."""
# LangSmith automatically traces this function
# Governance telemetry emits outcome after completion
router = ChatAnthropic(model="claude-sonnet-4")
routing_decision = router.invoke(state)
# Emit telemetry event
emit_telemetry({
"event_type": "routing_decision",
"decision": routing_decision,
"confidence": state.confidence_score
})
return routing_decision
```
**What LangSmith Captures:**
- Agent call sequences and routing decisions
- Token usage per agent (for cost optimization)
- Latency per agent step
- Input/output schemas validation
- Error traces and retry patterns
**What Governance Telemetry Captures:**
- User accepted/rejected recommendation
- Edit delta size (character count)
- Override frequency
- Session completion status
- Final diagnostic outcome
**Why Both?**
- **LangSmith:** Observability during execution (what agents did)
- **Telemetry:** Observability of outcomes (did it work?)
- Together: Complete feedback loop for continuous improvement
### Ollama Deployment
**Installation:**
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull Llama 3.1 8B Instruct
ollama pull llama3.1:8b-instruct-q8_0
# Verify
ollama run llama3.1:8b-instruct-q8_0 "Classify this edit: {'type': 'formatting'}"
```
**FastAPI Integration:**
```python
import requests
def classify_edit(edit_context: dict) -> dict:
"""Call Ollama for classification."""
prompt = f"""Classify this edit strictly using the schema:
{json.dumps(edit_context)}
Output ONLY valid JSON with these exact enums:
{{"edit_type": "substantive" | "formatting" | "none"}}
"""
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": "llama3.1:8b-instruct-q8_0",
"prompt": prompt,
"temperature": 0.0,
"max_tokens": 50,
"format": "json" # Enforce JSON output
}
)
return response.json()
```
### Success Metrics
**Phase 1 (Telemetry) Complete When:**
- [ ] 100% of agent calls emit telemetry events
- [ ] Event store captures all signals defined in schema
- [ ] No data loss or missing events
- [ ] Dashboard shows real-time event stream
**Phase 2 (Scoring) Complete When:**
- [ ] Health scores calculated for all active strategies
- [ ] Thresholds trigger correct workflow actions
- [ ] Scores correlate with manual assessment
- [ ] Report generation automated
**Phase 3 (Classification) Complete When:**
- [ ] Ollama classification endpoint working
- [ ] 95%+ enum-only outputs (no hallucinated categories)
- [ ] Classification latency <500ms per event
- [ ] Integrates with telemetry pipeline
**Phase 4 (Mutation) Complete When:**
- [ ] Routing weights auto-adjust based on health scores
- [ ] Canary pattern deployed for new strategies
- [ ] Deprecation workflow tested end-to-end with human approval
- [ ] Audit log captures all config changes
### Critical Don'ts
❌ **DON'T** use governance layer to modify agent prompts
❌ **DON'T** let classification LLM make "goodness" judgments
❌ **DON'T** skip human review for deprecation decisions
❌ **DON'T** use cloud-based LLMs for governance (privacy/cost)
❌ **DON'T** allow governance to modify training data directly
✅ **DO** use observable signals only
✅ **DO** keep classification deterministic (temp=0.0)
✅ **DO** require human approval for all deprecations
✅ **DO** use local Ollama for privacy and zero cost
✅ **DO** modify routing config, not agent logic
---
## Related Documentation
- **[NEXT-STEPS.md](./NEXT-STEPS.md)** - Implementation roadmap and action items
- **[PHASE_2B_SUMMARY.md](./PHASE_2B_SUMMARY.md)** - Current SkySpark endpoint status
- **[references/integration-architecture.md](./references/integration-architecture.md)** - n8n + pyHVAC design
- **[../../engineering_calcs/README.md](../../engineering_calcs/README.md)** - Calculation tools project
## Saving Next Steps
When n8n-automation work is complete or paused:
```bash
node .claude/skills/work-command-center/tools/add-skill-next-steps.js \
--skill "n8n-automation" \
--content "## Priority Tasks
1. Build n8n workflow for SkySpark alert triage
2. Test multi-agent system integration
3. Deploy FastAPI tool server"
```
See: `.claude/skills/work-command-center/skill-next-steps-convention.md`
Scanned 2/12/2026
No comments yet. Be the first to comment!