Show HN: Agentic Reliability Framework – Multi-agent AI self-heals failures
I built ARF after seeing the same pattern repeatedly: production AI systems fail silently, humans wake up at 3 AM, take 30-60 minutes to recover, and companies lose $50K-$250K per incident.
ARF uses 3 specialized AI agents:
Detective: Anomaly detection via FAISS vector memory
Diagnostician: Root cause analysis with causal reasoning
Predictive: Forecasts failures before they happen
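To make the "vector memory" idea concrete, here is a minimal sketch of incident recall by cosine similarity. It is not ARF's code: it uses a brute-force pure-Python search as a stand-in for FAISS (an exact inner-product index such as FAISS's IndexFlatIP over normalized vectors would play the same role), and the IncidentMemory class, its method names, and the toy 2-D embeddings are all hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class IncidentMemory:
    """Hypothetical in-memory store of past incident embeddings (stand-in for a FAISS index)."""

    def __init__(self):
        self._vectors = []
        self._labels = []

    def add(self, vector, label):
        # Store the normalized embedding alongside a human-readable label.
        self._vectors.append(normalize(vector))
        self._labels.append(label)

    def recall(self, vector, k=1):
        # Brute-force search: score every stored incident, return the top k
        # as (similarity, label) pairs, most similar first.
        query = normalize(vector)
        scored = [
            (sum(a * b for a, b in zip(query, stored)), label)
            for stored, label in zip(self._vectors, self._labels)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


memory = IncidentMemory()
memory.add([1.0, 0.0], "db timeout")
memory.add([0.0, 1.0], "memory leak")
print(memory.recall([0.9, 0.1], k=1))
```

In a real setup the embeddings would presumably come from a SentenceTransformers model rather than hand-written vectors; that step is left out here to keep the sketch dependency-free.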
Result: 2-minute MTTR (vs 45-minute manual), 15-30% revenue recovery.
Tech stack: Python 3.12, FAISS, SentenceTransformers, Gradio
Tests: 157/158 passing (99.4% pass rate)
Docs: 42,000 words across 8 files
Live demo: https://huggingface.co/spaces/petter2025/agentic-reliability...
The interesting technical challenge was making agents coordinate without tight coupling. Each agent is independently testable but orchestrated for holistic analysis.
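One common way to get that kind of loose coupling is to have every agent implement the same small interface and let an orchestrator depend only on that interface. The sketch below illustrates the pattern under stated assumptions; it is not ARF's actual design. The Finding, Agent, and Orchestrator names, the event fields, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Finding:
    agent: str
    summary: str
    confidence: float


class Agent(Protocol):
    """The only contract the orchestrator relies on (hypothetical)."""

    name: str

    def analyze(self, event: dict) -> Finding: ...


class Detective:
    name = "detective"

    def analyze(self, event: dict) -> Finding:
        # Illustrative threshold: flag latency above 500 ms as anomalous.
        if event.get("latency_ms", 0.0) > 500:
            return Finding(self.name, "latency anomaly", 0.9)
        return Finding(self.name, "normal", 0.1)


class Diagnostician:
    name = "diagnostician"

    def analyze(self, event: dict) -> Finding:
        # Illustrative rule: blame upstream errors when the error rate exceeds 5%.
        if event.get("error_rate", 0.0) > 0.05:
            return Finding(self.name, "likely cause: upstream errors", 0.7)
        return Finding(self.name, "no clear cause", 0.2)


class Orchestrator:
    """Runs each agent on the same event; agents never call each other."""

    def __init__(self, agents):
        self.agents = list(agents)

    def run(self, event: dict) -> list:
        return [agent.analyze(event) for agent in self.agents]


orchestrator = Orchestrator([Detective(), Diagnostician()])
for finding in orchestrator.run({"latency_ms": 900, "error_rate": 0.1}):
    print(finding)
```

Because each agent only needs an event dict and returns a Finding, each one can be unit-tested on its own, and agents can be added or removed without touching the others.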
Happy to answer questions about multi-agent systems, production reliability patterns, or FAISS for incident recall!
GitHub: https://github.com/petterjuan/agentic-reliability-framework
(Also available for consulting if you need this deployed in your infrastructure: https://lgcylabs.vercel.app/)