“When the Ariane 5 fails, make sure it emails the devs.” - Nobody ever
I consider one of the biggest mistakes I made during the construction of an enterprise application framework to be the logging API. Effectively, we were asynchronously logging to the file system, to a database and to email. The API was quite sophisticated and handled redundant exception types, throttling and flooding and the like. Exceptions were (generally) queued, logged to the event log (if possible), written in a canonical form to the file system (if possible) while asynchronously being written to a database (if possible) and then picked up by a messaging service which sent the exceptions to an email distribution list (if possible).
All of the developers were members of the distribution list. We were sent hundreds of messages daily from multiple environments: Dev, QA, UAT, Prod, not to mention the disaster recovery site. Daytime and nighttime processes, startups, shutdowns, pauses and restarts all pumped messages into the queue. Sometimes we left informational or warning messages in code because… why not? Occasionally, when the moon was right and stars aligned, a perfect storm would hit and every environment would spontaneously barf out an overly detailed diatribe like some spastic, drunk, friend-of-a-friend nobody really wanted to invite to the party but there he is, filling the air with hateful, ignorant nonsense.
Our mailboxes would hit their quota.
I now consider logging exceptions to email a cardinal sin.
But… The VISIBILITY!
The nice thing about receiving an exception through email is that the state of your processes is right there in front of you. You see it. If you’re paying attention, you know what’s going on.
But are you really paying attention? You just received 300 email messages in the last ten minutes. Are they all important? Can you successfully dig through the worthless warnings for the real disasters?
No. You can’t.
With such a flood of text constantly filling your inbox, you’re eventually going to miss something critical. Even the best filters, organization and dedication will ultimately fail.
So… what then?
Build the Appropriate Workflow
The architecture we built to email exceptions in a robust fashion took about a week. Maybe two. Maybe three with developer testing, QA, etc. It certainly wasn’t an extraordinary effort. But the effort was misplaced.
We should have spent those one or two or three weeks on building the right tool, an exception workflow.
It looks something like this:
- An exception is raised
- It’s logged to the event log, database, etc.
- An exception workflow process scans incoming exceptions
- Rules determine notification strategy
If you have services, you should have monitors watching those services (which is a no-brainer). Establish well-defined business rules about who should receive which type of urgent message and when: not every single email needs to be sent during off-hours to a 24×7 support channel. Similarly, not every single monitoring app needs to send a text message to every developer at night.
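To make the "who gets what, and when" idea concrete, here is a minimal Python sketch of the rules step. Everything in it is an assumption for illustration: the `ExceptionRecord` fields, the severity names, the 08:00 to 18:00 business hours and the channel labels such as `page:on-call` are all hypothetical, not what we actually built.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class ExceptionRecord:
    """A logged exception, as the workflow process would read it back."""
    environment: str   # hypothetical values: "Dev", "QA", "UAT", "Prod"
    severity: str      # hypothetical values: "info", "warning", "error", "critical"
    message: str
    raised_at: datetime


@dataclass(frozen=True)
class Rule:
    """One business rule: a predicate plus the channels to notify."""
    name: str
    matches: Callable[[ExceptionRecord], bool]
    channels: Tuple[str, ...]


def is_business_hours(moment: datetime) -> bool:
    # Assumed business hours of 08:00 to 18:00; adjust to your own support model.
    return time(8, 0) <= moment.time() < time(18, 0)


# Rules are checked in order and the first match wins.
RULES = (
    Rule("prod-critical",
         lambda e: e.environment == "Prod" and e.severity == "critical",
         ("page:on-call",)),
    Rule("prod-error-daytime",
         lambda e: e.environment == "Prod" and e.severity == "error"
         and is_business_hours(e.raised_at),
         ("ticket:support",)),
    Rule("prod-error-off-hours",
         lambda e: e.environment == "Prod" and e.severity == "error",
         ("digest:morning",)),
)


def route(record: ExceptionRecord, rules: Sequence[Rule] = RULES) -> Tuple[str, ...]:
    """Return the channels to notify; an empty tuple means 'logged only'."""
    for rule in rules:
        if rule.matches(record):
            return rule.channels
    return ()
```

With these rules, a critical Prod failure at 3 a.m. pages whoever is on call, a Prod error at night waits for the morning digest, and a QA warning goes nowhere except the log. Nobody's inbox is involved.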
The worst thing you could do is blindly dump every email you get to every developer on your team.
You’re creating unnecessary noise.
Noise isn’t actionable.
How do you address rapid changes in environment, business and code? One aspect of your notification strategy should be some sort of rules-based system. How you implement your rules-based processing is up to you, and I’m going to spend some cycles talking about it later, but the point is that you should be able to plug in new notification handlers at any time.
Similarly, you should be able to disable notifications or redirect notifications quickly.
Without a release.
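One way to get there, sketched in Python under some loud assumptions: handlers register themselves by name, and the rules live in data (JSON here) that the workflow process can reload at runtime. The handler names, the rule fields (`match`, `handler`, `enabled`) and the in-memory `SENT` list standing in for real pager and ticket integrations are all hypothetical.

```python
import json
from typing import Callable, Dict, List

Handler = Callable[[dict], None]

# Registry of notification handlers, keyed by the name used in the rules.
HANDLERS: Dict[str, Handler] = {}

# Stand-in for real delivery; a real handler would call a pager or ticket API.
SENT: List[tuple] = []


def register_handler(name: str) -> Callable[[Handler], Handler]:
    """Decorator that plugs a new notification handler into the registry."""
    def decorator(func: Handler) -> Handler:
        HANDLERS[name] = func
        return func
    return decorator


@register_handler("pager")
def page_on_call(exc: dict) -> None:
    SENT.append(("pager", exc["message"]))


@register_handler("ticket")
def open_ticket(exc: dict) -> None:
    SENT.append(("ticket", exc["message"]))


def load_rules(text: str) -> List[dict]:
    """Parse rules from JSON text, e.g. read from a file or a database row."""
    return json.loads(text)


def dispatch(exc: dict, rules: List[dict]) -> List[str]:
    """Run every enabled rule whose 'match' fields equal the exception's fields.

    Returns the names of the handlers that were invoked.
    """
    notified = []
    for rule in rules:
        if not rule.get("enabled", True):
            continue  # switched off in the rules data, no code change needed
        if all(exc.get(key) == value for key, value in rule["match"].items()):
            handler = HANDLERS.get(rule["handler"])
            if handler is not None:  # unknown handler names are skipped
                handler(exc)
                notified.append(rule["handler"])
    return notified


# Original rules: critical Prod exceptions page the on-call developer.
RULES_V1 = """[
  {"match": {"environment": "Prod", "severity": "critical"}, "handler": "pager"}
]"""

# Edited rules: paging is disabled and the same exceptions open a ticket instead.
RULES_V2 = """[
  {"match": {"environment": "Prod", "severity": "critical"}, "handler": "pager", "enabled": false},
  {"match": {"environment": "Prod", "severity": "critical"}, "handler": "ticket"}
]"""
```

Swapping `RULES_V1` for `RULES_V2` redirects the notification by editing data alone. Adding a brand-new handler still means deploying code once, but after that, turning it on, off or pointing it elsewhere is a rules change.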
Change your exception workflow. Email is not an acceptable exception management system.