Error Monitoring in Elm
renee balmert - july 30, 2020
bugsnag-elm has just been released! This package will let you report errors/logs to bugsnag from within your Elm code. Check out the docs and example app to see how it works!
After working as a Customer Engineer at bugsnag, helping devs configure error monitoring tools on dozens of platforms, it was quite a shock to land a job on an engineering team that believed it was possible to build an app that simply does not error in production (and therefore doesn’t really need in-depth error monitoring).
And yet… this engineering team at NoRedInk had written a package to report Elm “errors” to Rollbar. Clearly even the most confident of Elm programmers was aware of “things” (not errors!) that would benefit from being monitored, so I kept digging and trying to wrap my head around the whole concept.
And this is the beauty of working in Elm. Its creators put a lot of time and thought into its compiler error messages. The compiler gives you all the info you would need to identify the error (file, line number, relevant code snippet, what’s wrong) and a hint at what might solve the problem. It captures everything a developer could possibly ask for in an error monitoring tool ♥️, except that it only exists at compile time.
And yet… I couldn’t help but notice that our front-end code was far noisier than our backend code in our Rollbar dashboard. (“Oh, that’s just from legacy React code”) And I certainly noticed that Rollbar’s dashboard UI, search capabilities, grouping algorithms and traffic control features were not as robust as bugsnag’s, so I started spending my free time on a POC to see what our front-end error traffic would look like in bugsnag.
And guess what: there were errors being reported from our compiled Elm code! 😱 unpossible!!!
Despite the gargantuan strides that Elm has made towards reducing errors in production code, it isn’t perfect. Errors are possible, although exceedingly rare. So, it would make sense to monitor your compiled Elm code to maybe catch a new, true error which the Elm community would be very keen to hear about.
Taking a step back, let’s again examine the “no runtime errors in Elm” idea. As mentioned, our team was already using elm-rollbar to report… something to an error monitoring service. What’s that about?
Sometimes you are building something in Elm and you think you’ve got it, but the compiler warns you of a case you haven’t addressed. Usually it’s super helpful: you had genuinely forgotten a valid case, and now the compiler has saved you. (There was much rejoicing! 🎉) But - sometimes you read the error message and can only scratch your head. “That state would NEVER happen!” you cry. “Just let the code run!” The compiler looks down, and whispers “No.”
And this is the agony of working in Elm. If you are making a small change in one specific area of the code, but your change touches a module used throughout the codebase - well, you have to account for that change EVERYWHERE. Sometimes the work is just mildly tedious and repetitive. Sometimes you start wandering into areas of the code where you have no context and the right fix becomes murky. Sometimes your first round of fixes triggers a second wave of needed fixes and you fall into a rabbit/worm hole and begin to question everything.
What do? 🤷🏼‍♀️ Well, one school of thought - and often the correct path - is to truly fix it. Follow where it leads, maybe rethink the way your code is structured and how the dependencies flow. Make that state, which is impossible in the real world, impossible in your Elm code. I have seen some breathtaking examples where following the threads of confusion led to an elegant refactor that left the code in a much more coherent, readable state. yay! 🍾
But, sometimes, that work is not feasible. Not that it can’t possibly be done, but is the product team going to be happy you spent 3 code days doing a refactor which ultimately only helped a teeny tiny feature change come to fruition? Are your colleagues going to appreciate a swarm of code conflicts on their PRs when you touched code way outside your area of ownership? Ironically, I’m currently reading Sandi Metz’s Practical Object Oriented Design, and am struck by her discussion of when the right time to refactor is. As an engineer, it’s helpful to develop an instinct for when code has “smell”, but that isn’t the end of the story. Having identified a problematic area in the codebase, you then need to evaluate whether it is worth fixing right now. This is a conversation I have had over and over again on my team. Is the fix valuable to the engineering team? Is it valuable to the product team? How valuable (in code hours)? One of the hardest choices I’ve had to make as an engineer is to walk away.
So, on the occasions when it isn’t practical to rewrite the entire universe, we’re back to square one — what do we do with this really weird case that should never happen? Log it. Write the code that will report its occurrence to your error monitor and then you can keep tabs on it. If your instincts are right, it will never happen and never be logged. Maybe it happens once in a blue moon. Or maybe, as time goes by, the ground shifts and someone’s work in a different part of the code suddenly makes your edge case happen A LOT. Well, you are monitoring the situation, and at that point your product team can decide if it’s now worth the time to fix. Voila!
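Here is a minimal sketch of what “log it” can look like in an update function. Every name in it (Model, Msg, the quiz scenario, the report argument) is made up for illustration and is not taken from bugsnag-elm or elm-rollbar. The reporter is passed in as a plain function so the sketch does not assume any particular package’s API.

```elm
module Quiz exposing (Model, Msg(..), update)

-- Hypothetical example: none of these names come from bugsnag-elm.


type alias Model =
    { selectedAnswer : Maybe Int
    , submitted : Bool
    }


type Msg
    = SelectAnswer Int
    | Submit


-- `report` stands in for whatever sends a message to your error monitor.
-- It is an argument here so the sketch assumes no specific notifier API.
update : (String -> Cmd Msg) -> Msg -> Model -> ( Model, Cmd Msg )
update report msg model =
    case msg of
        SelectAnswer index ->
            ( { model | selectedAnswer = Just index }, Cmd.none )

        Submit ->
            case model.selectedAnswer of
                Just _ ->
                    ( { model | submitted = True }, Cmd.none )

                Nothing ->
                    -- The Submit button is only shown after an answer is
                    -- selected, so this "should never happen". The compiler
                    -- still wants the branch, so report it and carry on.
                    ( model, report "Submit received with no answer selected" )
```

The Nothing branch leaves the model untouched and only emits the report, so the user sees nothing odd while you get a data point every time the “impossible” state turns up.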
All of which is to say, there is value in monitoring error-like things in your Elm code.
Now, let’s loop back to what a typical error report looks like. Namely: a stacktrace, a code snippet, etc. None of these existed in the elm-rollbar package. I had - naively - thought I could get those working in my bugsnag-elm package. Hoooo boy. 😬
In speaking with Elm engineers, many mentioned that, rather than a traditional stacktrace, what would be most helpful would be to know the last few messages that occurred. Or the current state of the model. In this notifier I poked around with trying to automatically include the page’s model state in each bugsnag report, but it didn’t go very well. Debug has some lovely tools that can convert any Elm record into a nice string, but it is VERY explicitly not intended for use in production code. And it seems like one could accumulate an array of all messages called in an app session, but again -- the question, on both model and "message-trace", is how could something helpful be accomplished in a package, rather than asking engineers to explicitly write this code in each of their update functions? 🤔
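To make that tradeoff concrete, here is a rough sketch of the hand-written version: a “message trace” kept by the app itself. It is not something bugsnag-elm does, and the Msg type, msgName and record are hypothetical names. Because Debug is off the table, each message needs its own hand-written label.

```elm
module MessageTrace exposing (Msg(..), msgName, record)

-- Hypothetical example: a hand-rolled trace of recent messages.


type Msg
    = ClickedSave
    | GotResponse
    | ChangedInput String


-- Written by hand for every constructor, since Debug.toString
-- is not meant for production code.
msgName : Msg -> String
msgName msg =
    case msg of
        ClickedSave ->
            "ClickedSave"

        GotResponse ->
            "GotResponse"

        ChangedInput _ ->
            "ChangedInput"


-- Add the newest message name to the front of the trace and
-- keep only the most recent `limit` entries.
record : Int -> Msg -> List String -> List String
record limit msg trace =
    List.take limit (msgName msg :: trace)
```

An app would call something like record 5 msg model.trace at the top of its update function and attach the resulting list to each report as metadata. That is exactly the per-app boilerplate a package cannot write on your behalf.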
So, for now, in Elm errors, there is no meaningful stacktrace. Most of the time, this is ok. We have a few crucial pieces of information that we can report:
- error - a string explaining what happened
- context - a string of the name of the module where the error happened
- metadata - a record of any other details you'd like to attach
- severity - error, warning or info
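As a rough picture, those four pieces could be modelled like this. This is an illustrative sketch only: the Report alias, the Severity type and the example values are assumptions, not the package’s real types, and metadata is simplified to string values.

```elm
module ReportSketch exposing (Report, Severity(..), example, severityToString)

import Dict exposing (Dict)

-- Hypothetical shapes for illustration; see the package docs for the real API.


type Severity
    = Error
    | Warning
    | Info


type alias Report =
    { error : String -- what happened
    , context : String -- module where it happened
    , metadata : Dict String String -- any other details (simplified)
    , severity : Severity
    }


severityToString : Severity -> String
severityToString severity =
    case severity of
        Error ->
            "error"

        Warning ->
            "warning"

        Info ->
            "info"


example : Report
example =
    { error = "Submit received with no answer selected"
    , context = "Page.Quiz"
    , metadata = Dict.fromList [ ( "quizId", "42" ) ]
    , severity = Warning
    }
```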
Also, there aren’t even any error classes in Elm. 🤷🏼‍♀️ Makes sense if no errors are expected to occur, but within the usual debugging world, an error class is important. In fact bugsnag uses error classes as part of its grouping algorithm. In this notifier, we are reporting what is really the error message (a human-friendly statement) as the error class (should be something like: “ReferenceError”) so that we can take advantage of its visibility in the dashboard and use it for grouping, but - as a future improvement on this package - I wonder if it would be worth trying to define some general Elm (not-error) error classes.
Another common debugging tool is source maps. During my time at bugsnag, by far the most frequent questions users asked involved configuring source maps correctly. Most frontend production code is minified in some way to take up less space and load faster. A source map, generated at the time of minification, is literally a map that lets you take a known spot in the minified code, and then point to the original, un-minified code which will always be easier to read and debug.
Although mapping was an area I was keen on pursuing at the beginning of my project, after many conversations and explorations, it’s clear that this would require a heck of a lot of work for only a small benefit. I don’t plan on digging any further into this, but I bring it up here to show another place where Elm deviates from the typical tooling of a front-end language.
In conclusion, bugsnag-elm doesn’t fully succeed as a bugsnag notifier, in that it can’t report all the expected datapoints (because they don’t exist in Elm). And certainly it’s an odd bird in the Elm world, where production errors are never expected to happen. But I had loads of fun exploring this strange little niche. As it is, this is still a very powerful tool that will let you observe your Elm code's behavior in its production environment.
Writing bugsnag-elm was quite a journey for me! As I was learning Elm for the first time at NoRedInk, I kept noodling on this project. The kinds of questions it brought up really helped me understand the underlying mechanisms behind Elm, in a way that typical feature-building wouldn’t have. What a unique opportunity, as a newbie, to ask my bizarre questions directly to the core developers of this amazing language. 🙏 And although I wasn’t able to build the perfect solution I imagined, I understand the tradeoffs being made and am happy with the result. More importantly — it works! All of NoRedInk’s Elm code (the largest codebase in production) is now using bugsnag-elm to report errors, and so can you! 🎉🐛