Journal
Field notes from across the four surfaces.
Practical writing on observability for hardware, software, web, and AI agents. Cited sources, real numbers, no fluff.
- AI agent observability·April 15, 2026·10 min
Anthropic's OpenClaw cutoff — what changes when subscriptions become per-token invoices
On 6 April 2026 Anthropic announced the OpenClaw cutoff: third-party agent tools move off the flat $20–$200/mo Claude subscription tiers and onto per-token billing. The economics, the timing, and what it means for teams who built unit economics on flat fees.
- Alert fatigue·April 8, 2026·10 min
After-hours interruption load — the statistic PagerDuty doesn't publish
50 alerts a week. 2–5% actionable. 71% of SREs respond to dozens or hundreds of unticketed incidents per month. The off-hours dimension of alert fatigue, and why ship-faster culture compounds it.
- AI agent observability·March 28, 2026·11 min
Hamel Husain was right — eval tooling is commodified, and that has implications for vendor selection
The 15 January 2026 update to Hamel Husain's eval FAQ argued prefab evals are the wrong primitive. Custom annotation tools are 10x faster. The implications for which LLM observability vendor you pick — and which you don't.
- Alert fatigue·March 17, 2026·14 min
Alert fatigue is cognitive fragmentation, and it's a top-3 SRE concern in 2026
70% of SREs report alert fatigue as a top-3 concern. The cause isn't volume — it's the cognitive fragmentation of repeated low-grade interruptions. A walk through Google's urgent / actionable / imminent rule and what tuned-by-default alerting looks like.
- AI agent observability·March 12, 2026·12 min
Hard budget caps for AI agents — the architecture options
From the RelayPlane $0.80→$47 stuck loop to the org-wide failure mode of provider spend caps — the four places you can put a budget cap, and why only one of them actually works.
- Status pages·March 4, 2026·10 min
Atlassian Statuspage's 21-day outage — and what it means
From 2 February to 23 February 2026, Statuspage's System Metrics feature was broken because Librato — a deprecated upstream — finally went away. Three weeks. On a paid product whose entire job is honesty about infrastructure.
- AI agent observability·February 18, 2026·14 min
Helicone vs LangSmith vs Langfuse vs Phoenix — what each one actually gets wrong
An honest four-way comparison of the leading LLM observability tools, the gateway-plus-eval hybrid pattern that emerged, and where Sutrace fits.
- ESP32 + hardware·February 18, 2026·9 min
The $99 industrial monitoring bench — full BOM and where to source it
A line-by-line bill of materials for the Sutrace ESP32 industrial monitoring bench. $28 in parts, $99 retail. We tell you exactly what we pay and exactly where we buy from. No mystery, no markup-hiding.
- SCADA & industrial·February 11, 2026·11 min
Rockwell FactoryTalk 2026 pricing decoded — what every tier actually costs
A line-by-line walkthrough of the 2025 Rockwell software price list, what each tier does, and when it's actually required. With the real numbers from the AutomateAmerica investigation.
- Alert fatigue·February 4, 2026·11 min
Tuned by default — the five alerting defaults most observability vendors skip
Why "alert on everything" is the structural cause of alert fatigue, and the five defaults that should ship in the box. With concrete config you can copy.
- ESP32 + hardware·January 30, 2026·11 min
ESP32 + MQTT Sparkplug B — proper industrial payloads, not raw JSON
Why Sparkplug B beats raw MQTT for industrial telemetry, the topic structure that matters, birth and death certificates explained, and a working ESP32 implementation that publishes to any Sparkplug-aware MQTT broker.
- Datadog alternatives·January 22, 2026·11 min
Cardinality cost attribution, before the bill arrives
Why label sprawl is an architectural problem, how Datadog's pricing reacts to it, and what cost attribution before ingest looks like in practice.
- AI agent observability·January 22, 2026·13 min
EchoLeak, CamoLeak, and the GPT-5 7-vuln chain — prompt injection is shipping in named products
The 2025–2026 prompt-injection CVEs in Microsoft 365 Copilot, GitHub Copilot Chat, and ChatGPT. What changed, why "we'll fix it later" is no longer an answer, and what telemetry actually catches it.
- Status pages·January 14, 2026·14 min
Why most status pages lie — the evidence
A pillar piece tracing 2025's biggest cloud outages and the gap between what the status page said and what was actually happening. With timestamps, dashboards, and the case for auto-driven status.
- Status pages·December 8, 2025·11 min
When the status page failed too — Cloudflare, AWS, Azure 2025
A timeline analysis of the 2025 outages where the vendor's own status page went down alongside the production stack. With the relevant Cloudflare admission quote about coincidental dependencies.
- ESP32 + hardware·December 4, 2025·12 min
Modbus RTU over RS-485 from an ESP32 — the 30-minute version
A practical, no-fluff Modbus RTU walkthrough for the ESP32. MAX485 transceiver, UART2 wiring, DE/RE control, 120Ω termination, holding registers, slave addressing. Real Carel chiller, Schneider PowerLogic, Loxone Air examples.
- SCADA & industrial·November 19, 2025·9 min
No per-tag pricing — the buyer's filter most SCADA vendors still fail
Why "no per-tag pricing" has become a literal search filter in SCADA buying decisions, why per-tag pricing exists in the first place, and why it fails for modern OT customers.
- Compliance·November 14, 2025·12 min
EU-resident observability and the Data Privacy Framework — survival strategy
Post-Schrems II, the legal architecture for EU customer data is not "trust the DPF." It's belt-and-braces — EU residency by default, 2021 SCCs as the legal mechanism, UK IDTA and Swiss FDPIC overlay, and zero reliance on a single transfer mechanism.
- Datadog alternatives·November 14, 2025·9 min
Sutrace vs SigNoz vs ClickStack — an honest 3-way take
A direct comparison of three OpenTelemetry-native observability stacks. Where each wins, where each loses, and which one fits your team.
- OpenTelemetry·November 12, 2025·14 min
OpenTelemetry won the protocol war. Now it needs a backend.
OTel adoption is universal — 40% YoY PR growth, 21M monthly Python SDK downloads. The backend war is fragmented. A field guide to who's OTel-native, who's bolted on, and where Sutrace fits.
- AI agent observability·November 4, 2025·11 min
Multi-provider LLM routing — which provider actually served that trace?
OpenRouter, AWS Bedrock, and the gateway pattern made multi-provider routing the default. Without span-level provider attribution, your eval baseline is a coin flip. The OTel GenAI semantic conventions are the answer.
- ESP32 + hardware·October 22, 2025·9 min
4-20 mA into an ESP32 — the 165 Ω trick (and the dead zone you must avoid)
Why 165 Ω is the correct shunt value for reading 4-20 mA industrial sensors on an ESP32, how to wire it, how to calibrate it into NVS, and why you must never use a 250 Ω shunt with a stock ESP32 ADC.
- SCADA & industrial·September 30, 2025·8 min
Sparkplug B without the buzzword soup
A practical explainer of MQTT Sparkplug B — the namespace, device birth/death certificates, the session-state model, and what it's actually solving. For engineers who already know MQTT.
- Status pages·September 22, 2025·13 min
SSL certificate expiry — Microsoft Teams, Bazel, and you
A pillar piece on why expired SSL certificates remain one of the most embarrassing and most preventable outages in 2025. Microsoft Teams in February. Bazel in December. Two Let's Encrypt API outages. Apple's 47-day cert lifespan move. Keyfactor's $2.86M-per-outage number.
- OpenTelemetry·September 19, 2025·13 min
Prometheus at scale, the Cloudflare 200-sample rule, and the day you graduate
Why a single Prometheus instance hits a wall around 1–2M active series, what Cloudflare's sample_limit:200 actually defends against, and how to know when you've graduated to Thanos / Cortex / Mimir.
- ESP32 + hardware·September 18, 2025·18 min
ESP32 industrial monitoring — the 30-minute end-to-end build
Box-open to first datapoint on a real dashboard in half an hour. ESP32-S3 with three input classes — digital GPIO, analog ADC (4-20 mA + 0-10 V), and I2C — over USB-C, WiFi, MQTT to an EU-resident broker. Full BOM, wiring, code, and deployment.
- Datadog alternatives·September 8, 2025·12 min
Migrating from Datadog to OTel — the week-one checklist
A concrete day-by-day plan for moving off the Datadog Agent to a vendor-neutral OpenTelemetry collector. With config, traps, and what to skip.
- Datadog alternatives·August 19, 2025·12 min
Log volume, cardinality, and the 50M time-series budget
A technical deep-dive on the architectural ceiling of single-instance Prometheus, why Cloudflare's sample_limit:200 matters, and how to design a 50M time-series budget.
- Compliance·August 8, 2025·12 min
EU AI Act Article 12 logging — the tooling question for observability vendors
Most observability vendors are neither providers nor deployers of AI systems under the EU AI Act. They are the tools that help deployers meet Article 12 logging and Article 14 human-oversight obligations. Here's the distinction that matters.
- SCADA & industrial·August 7, 2025·9 min
Retrofit vs. rip-and-replace — why phased modernization wins for mid-market plants
Why a 5–100-person plant should retrofit observability and dashboards first, and replace PLCs only when they fail. With the real $230k–$690k SCADA TCO range and the integrator playbooks that back it up.
- OpenTelemetry·August 4, 2025·12 min
OTel Collector to ClickHouse — a quickstart you can run in an hour
The architecture, the YAML, the ClickHouse schema, and the gotchas. A working OpenTelemetry Collector → ClickHouse pipeline you can deploy today.
- Compliance·June 22, 2025·13 min
DORA for ICT third-party observability vendors — what actually changes
EU Regulation 2022/2554 has applied since 17 January 2025. If you sell observability to financial entities (banks, insurers, payment institutions, investment firms), the obligations of Articles 28–30 cascade onto you via contract. Here's the actual addendum.
- Datadog alternatives·June 4, 2025·10 min
The $83,000 Datadog renewal thread — what actually caused it
A line-by-line analysis of HN thread 41357726. Cardinality, custom metrics, log volume, and synthetics — which one actually broke the bill.
- SCADA & industrial·May 22, 2025·8 min
OPC UA explained for people who only know REST
An honest explainer for software engineers — what OPC UA actually is, what an address space is, what a subscription does, and why it's nothing like REST. Practical, not academic.
- OpenTelemetry·May 21, 2025·10 min
Cardinality, explained with examples your finance team will understand
What cardinality actually is, why high-cardinality labels break Prometheus and inflate Datadog bills, and the concrete arithmetic for HTTP method × status × path × user_id × region.
- Datadog alternatives·April 29, 2025·9 min
Datadog synthetics cost calculator — real pricing from Checkly's data
A walkthrough of synthetics pricing using Checkly's real numbers. How 16 routes turn into $66K/year and what to do about it.
- Compliance·April 12, 2025·14 min
NIS2 supplier-cascade — what observability vendors actually have to commit to
NIS2's transposition deadline passed in October 2024, with national transpositions continuing through 2025. Most observability vendors are not directly regulated, but their customers are. The cascade obligations under Article 21(2)(d) and the 24-hour rule of Article 23 hit vendors via contract.