My Debugging Workflow When Everything is on Fire
If you’ve ever walked into a production incident and felt the heat rising faster than your CPU fan, you know that a calm, repeatable debugging workflow is worth more than a spare tire. In Nigeria, where internet speeds wobble, power outages interrupt your best-laid plans, and teams juggle multiple time zones, having a pragmatic approach to debugging isn’t a luxury - it’s a survival skill. Here’s a window into how I troubleshoot when the system is screaming and every minute counts.
start with the human, then the system
The first instinct in chaos is to blame the cloud, the database, or the latest deploy. I’ve learned to flip that around. I start by identifying who is affected and what their pain looks like. In a Lagos-based fintech kiosk, a sudden payment failure wasn’t just a code problem; it meant a merchant’s day stalled, customers grew angry, and the support line backed up with a queue that never seemed to end. I ask quick questions: what error is shown to the user, what did the logs say, and what was the last action before it broke? Framing it as user and business impact keeps you from chasing phantom bugs and helps you decide what to fix first.
In practice, I write down a short triage note: user impact, symptom, suspected component, and a single concrete next step. This keeps everyone aligned when nerves are frayed and the on-call pager is buzzing louder than a drum at a wedding reception.
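For what it’s worth, my own triage note is nothing more than four lines. This is a rough sketch of the template, not a standard; rename the fields to whatever your team already recognises:

```text
User impact:         who is hurting and how badly (merchants, end customers, internal team)
Symptom:             what the user sees, plus the first abnormal timestamp if you have it
Suspected component: your best current guess, clearly marked as a guess
Next step:           one concrete action with one named owner
```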
reproduce the problem in the smallest credible unit
A classic Nigerian telecom outage story: the system looks fine in staging, then fails under real traffic. Reproduction becomes your best friend. I try to reproduce the issue with the smallest possible set of moving parts. If a payment failure happens only for transactions above a certain threshold, I’ll reproduce with a single test account and a small amount to observe the failure mode, then scale up.
For example, in a recent e-commerce outage, the issue appeared only for orders using a specific wallet integration. I created a synthetic order that used the same wallet path and, crucially, ran it through a local mock that mimicked the wallet's responses. If I can reproduce without the entire production stack, I can test hypotheses faster, and I can do it with less risk to customers who are already waiting for a fix.
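Here is a minimal sketch of what that kind of narrow reproduction can look like. Everything in it is hypothetical: the `charge_order` helper, the response codes, and the threshold are stand-ins for whatever your real wallet integration exposes.

```python
import pytest
from unittest.mock import Mock

# Hypothetical threshold (in kobo) above which the failure was reported.
THRESHOLD = 50_000

def charge_order(wallet_client, account_id, amount):
    """The narrow slice of the payment path we want to exercise."""
    response = wallet_client.charge(account_id=account_id, amount=amount)
    if response["status"] != "ok":
        raise RuntimeError(f"wallet charge failed: {response['code']}")
    return response["reference"]

def test_reproduces_failure_above_threshold():
    # Mock the wallet so we never touch the real provider or production data.
    wallet = Mock()
    wallet.charge.return_value = {"status": "error", "code": "POOL_TIMEOUT"}

    with pytest.raises(RuntimeError, match="POOL_TIMEOUT"):
        charge_order(wallet, account_id="test-account-1", amount=THRESHOLD + 1)
```

Once a test like this fails the same way production does, every hypothesis becomes a one-second test run instead of another risky poke at live traffic.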
isolate with confidence, not fear
Isolation is the heart of debugging, especially when everything feels on fire. I start by removing variables one by one. In a Nigerian banking app, I would disable a non-essential feature flag or switch off a non-critical integration to see if the problem persists. If the issue vanishes, I know where to look more deeply. If it remains, I tighten the scope further.
A practical trick is to use feature flags not just for new features but for debugging paths themselves. In a hurry, you can enable a shadow feature flag to route traffic through a safe, parallel path that lets you observe behavior without impacting real users.
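A sketch of what I mean by a debugging path behind a flag. The flag lookup, the `dry_run` parameter, and the order fields here are all assumptions; the point is that the shadow path only observes and logs, while the real path stays exactly as it was.

```python
import logging

logger = logging.getLogger("payments.debug")

def flag_enabled(name: str) -> bool:
    """Hypothetical flag lookup; swap in whatever flag service you already run."""
    return name in {"debug_shadow_wallet_path"}

def process_payment(order, wallet_client):
    if flag_enabled("debug_shadow_wallet_path"):
        # Shadow path: hit a sandbox/dry-run credential purely to observe the
        # wallet's behaviour. Failures here are logged, never shown to the user.
        try:
            shadow = wallet_client.charge(order.account_id, order.amount, dry_run=True)
            logger.info("shadow wallet response for order %s: %s", order.id, shadow)
        except Exception:
            logger.exception("shadow wallet path failed for order %s", order.id)

    # The real path is untouched, so customers never ride the experiment.
    return wallet_client.charge(order.account_id, order.amount)
```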
logs, metrics, and the tiny signals that matter
When the power cuts or internet hiccups hit, you can’t rely on a single log line to tell the whole story. I gather logs from all layers - client, server, and third-party services - and look for the first abnormal timestamp. In a Nigerian API gateway, a spike in latency or a sudden rise in error codes often precedes a full outage. I correlate log events with system metrics: CPU spikes, memory pressure, database queue length, and cache hit rates.
If logs are noisy, I add a minimal, well-scoped log enrichment at the suspect boundary. For instance, in a payments flow, I might log the exact wallet response code for every transaction in a small window to see patterns. The goal is to surface meaningful signals without turning the log into a war zone of noise.
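In code, that narrowly scoped enrichment might look something like the sketch below. The field names and the one-hour window are my own assumptions; the idea is a single structured, greppable line at the wallet boundary that is easy to add and just as easy to delete once the incident is over.

```python
import logging
import time

logger = logging.getLogger("payments.wallet")

# Only enrich logs inside a short window around the incident, then delete this.
ENRICH_UNTIL = time.time() + 60 * 60  # roughly one hour from process start

def log_wallet_response(transaction_id: str, response: dict) -> None:
    if time.time() > ENRICH_UNTIL:
        return  # outside the debugging window, stay quiet
    # One precise line per transaction: id, wallet response code, latency.
    logger.info(
        "wallet_response txn=%s code=%s latency_ms=%s",
        transaction_id,
        response.get("code"),
        response.get("latency_ms"),
    )
```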
communicate clearly, not poetically
As the clock ticks, you’ll be tempted to switch to blunt messages with expletives about the “stupid bug.” I’ve learned that calm, precise communication is more valuable. I write a concise incident update for the on-call rotation and a separate post-mortem-friendly note for after the heat fades. When you’re Nigerian, you’re likely coordinating with teams in Lagos, Abuja, and even diaspora colleagues in the UK. A clear, actionable update - what happened, what we did, what we’re watching next - reduces back-and-forth and speeds resolution.
work with local constraints, don’t pretend they don’t exist
Nigeria’s real-world constraints often shape debugging decisions. If your cloud region is slow or the VPN is flaky, you’ll want to minimize network-bound debugging. I prefer local tests first, then remote checks. Hardware limitations can also influence choices. I’ve had to run lightweight debugging sessions on a laptop with meager RAM, so I structure tests to fit within those constraints. If you’re in a shared office with unstable power, power-saver mode can cause subtle timing issues in surprising ways. Turn off power-saving features during deep debugging, if possible, to minimize timing quirks.
a practical workflow you can actually adopt
1. Triage with a one-sentence impact statement and a single, concrete next step. Share that with the team immediately.
2. Reproduce in the smallest unit possible. If you can reproduce locally, do it; avoid firing up all the microservices at once unless you must.
3. Isolate by removing variables, using feature flags to create safe debugging paths when needed.
4. Collect logs and metrics with a focus on the suspect boundary and time window. Don’t drown in data - use targeted signals.
5. Communicate in plain language, with responsibilities and next steps clearly assigned. Use a quick runbook-style note for the team and a lightweight post-mortem later for learning.
real-world scenario: the flaky payment gateway
We once faced a situation where payments would succeed for some cards but fail for others, and the front-end showed a generic error. The team was scrambling because customer service was fielding calls by the minute. I started with triage: the customers impacted were merchants using a specific card network, and the symptom was a generic “payment failed” message at the final settlement step.
I reproduced with a single test card of that network in a controlled sandbox, then gradually added real traffic in a canary-like fashion. I found a narrow window where the gateway would occasionally time out when the connection pool was saturated. We tightened the pool configuration, added a small retry with exponential backoff for that path, and instrumented the gateway to surface the timeout skew. The fix was small, but the impact was huge: outages dropped to near-zero, and customer calls fell dramatically.
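In spirit, a retry with exponential backoff on that narrow path looks something like the sketch below. The timings, the exception type, and the `gateway.settle` call are placeholders for whatever your gateway client actually exposes, not the exact code we shipped.

```python
import random
import time

class GatewayTimeout(Exception):
    """Stand-in for whatever timeout error your gateway client raises."""

def settle_with_backoff(gateway, payment, max_attempts=3, base_delay=0.2):
    """Retry only the narrow timeout case, with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return gateway.settle(payment)
        except GatewayTimeout:
            if attempt == max_attempts:
                raise  # give up and let the caller surface the failure
            # 0.2s, 0.4s, 0.8s... plus jitter so retries don't pile onto a saturated pool.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The key design choice is to retry only the timeout case: retrying every error would hide real failures and put even more pressure on the saturated connection pool.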
wrap-up and takeaways
Debugging when everything is on fire isn’t about heroic feats or clever hacks. It’s about keeping a steady head, using a disciplined approach, and making decisions that don’t cause more harm. In Nigeria, where environments are often more chaotic than a Monday morning, a repeatable workflow matters more than any fancy tool.
If you take away one idea, let it be this: start with human impact, reproduce in the smallest unit, isolate deliberately, collect meaningful signals, and communicate clearly. With those habits, you’ll find your footing even when the system feels like it’s spluttering under a load that would overwhelm a full stadium.
Practical next steps you can try this week:
- Create a one-page triage note template you can fill in during incidents.
- Build a minimal reproduction for any new bug you encounter, then test your hypothesis against that narrow setup.
- Introduce a safe debugging path via a feature flag in your next major release so you can observe behavior without risking customers.
- Add one new, precise log line at the boundary of the suspect path and monitor the results - logging that matters can save you hours.
- Document a quick post-mortem-friendly summary after the incident, even if it’s a rough draft. You’ll appreciate the clarity later when you revisit the issue for a proper wrap-up.