Error Monitoring in Elm

renee balmert - july 30, 2020

----------

TLDR:

bugsnag-elm has just been released! This package will let you report errors/logs to bugsnag from within your Elm code. Check out the docs and example app to see how it works!

----------

After working as a Customer Engineer at bugsnag, helping devs configure error monitoring tools on dozens of platforms, it was quite a shock to land a job on an engineering team that believed it’s possible to build an app that simply does not error in production (and therefore doesn’t really need in-depth error monitoring).

How could a small edtech startup achieve what every other tech company failed to do? Elm. Elm is an up-and-coming frontend programming language that is statically typed and functional. The idea is that by switching from loosey-goosey JavaScript/React to Elm, a perfectly stable codebase can be built and deployed. “One of the guarantees of Elm is that you will not see runtime errors in practice.”

Elm!

And yet… this engineering team at NoRedInk had written a package to report Elm “errors” to Rollbar. So clearly even the most confident of Elm programmers was aware of “things” (not errors!) that would benefit from being monitored. I kept digging and trying to wrap my head around the whole concept.

If you’ve never worked in Elm, it is a language that compiles to JavaScript. So if you introduce a breaking change to the code, running the compiler will rebuild the entire ecosystem in JavaScript and notice any place within the entire Elm codebase where this new change has introduced a potential problem. The compiler checks not only that the given code works, but that every possible state for that code is handled, whether or not that state ever actually occurs. (E.g. if a value can be null, which Elm models as a Maybe that might be Nothing, you need to handle that possible null state even if nothing in your code ever lets it actually be null.) So the idea with Elm is not that it magically prevents developers from making mistakes; it’s that you should be able to catch and fix every possible error in the development phase, so that by the time you deploy, everything is copacetic in every possible state.
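To make "every possible state must be handled" a bit more concrete, here is a rough sketch. It is written in TypeScript rather than Elm, purely as an analogue, and the LoadState type and describeState function are made up for illustration. Elm's compiler enforces this kind of check on its own; in TypeScript you have to opt in with the never trick shown below.

```typescript
// Hypothetical state type, invented for this example.
type LoadState =
  | { kind: "loading" }
  | { kind: "loaded"; userName: string | null }
  | { kind: "failed"; reason: string };

function describeState(state: LoadState): string {
  switch (state.kind) {
    case "loading":
      return "Loading...";
    case "loaded":
      // userName may be null, so the null case is handled explicitly,
      // even if no caller ever passes null in practice.
      return state.userName === null
        ? "Hello, guest"
        : `Hello, ${state.userName}`;
    case "failed":
      return `Something went wrong: ${state.reason}`;
    default: {
      // If a new variant is added to LoadState and not handled above,
      // this assignment no longer type-checks, so the build fails.
      const unreachable: never = state;
      return unreachable;
    }
  }
}
```

Adding a fourth variant to LoadState without a matching case turns into a compile-time error instead of a surprise in production, which is roughly the experience the Elm compiler gives you everywhere by default.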

And this is the beauty of working in Elm. Its creators put a lot of time and thought into its compiler error messages. The compiler gives you all the info you would need to identify the error (file, line number, relevant code snippet, what’s wrong) and a hint at what might solve the problem. It captures everything a developer could possibly ask for in an error monitoring tool ♥️, except that it only exists at compile time.

Error messages should always be so kind.

And yet… I couldn’t help but notice that our front-end code was far noisier than our backend code in our Rollbar dashboard. (“Oh, that’s just from legacy React code.”) And I certainly noticed that Rollbar’s dashboard UI, search capabilities, grouping algorithms and traffic control features were not as robust as bugsnag’s, so I started spending my free time on a POC to see what our front-end error traffic would look like in bugsnag.

Now bugsnag supports almost every single platform under the sun. Almost. It does not have a notifier for Elm. Moreover, no one in the Elm community had built their own (because production errors just aren’t that much of a priority). I knew I’d have to write my own Elm notifier, but in the meantime I opted to set up our JavaScript code to report to bugsnag. No reason not to capture our janky React errors while they’re still happening. But also: Elm compiles to JavaScript, so I made sure our bugsnag was set up to cover ALL our JavaScript code, just in case.

And guess what — there were errors reporting from our compiled Elm code! 😱 unpossible!!!

yup - bugs

Despite the gargantuan strides that Elm has made towards reducing errors in production code, it isn’t perfect. Errors are possible, although exceedingly rare. So, it would make sense to monitor your compiled Elm code to maybe catch a new, true error which the Elm community would be very keen to hear about.

But, this elusive creature was not what we found in our bugsnag dashboard. 🦄 What we did find was frequent occurrences where a user’s browser extension had manipulated the compiled Elm code in ways that caused an unhandled exception to be thrown. In this case, it wouldn’t be fair to say that this was an error in the Elm code per se, but it’s still something that negatively impacts our users. This knowledge sparked some interesting discussions within the company. Any team has to make decisions about which exact browser versions are worth the effort to support, and in this case we can make a decision around whether or not we want to support usage of a specific browser extension that seems to be pretty popular with our users. It also suggests that compiled Elm code may need additional “protection” from third-party JavaScript that can cause unpleasant interactions. Stuff like this is why I love data transparency in general, and robust error monitoring in particular: you deserve to be able to make an informed decision about what things your engineers should be working on.

So the JavaScript portion of my experiment proved very fruitful, and I continued work on an Elm notifier that could report to bugsnag. This is where we circle back to the original conundrum of “Elm doesn’t error!” and me wanting to build a tool that would… report Elm errors. 🤦🏼‍♀️ Let's take a moment here to clarify the idea of "errors". Most platforms have tools to report both handled (logging activity) and unhandled (real crashes/problems) errors. To replicate this coverage on an Elm project, you need two tools. The first is bugsnag-js, watching your compiled Elm code; this is where the truly bizarre and unpredictable (unhandled) things will be discovered. The second, now available, is bugsnag-elm, which captures handled "errors". In these instances there will be no crash, no runtime exception - it's not that kind of error. These are handled errors: your code logs that an unpleasant thing happened, but because the Elm compiler required you to handle the state in some way, the user will be oblivious to any shenanigans that occur.

So the goal here is to report handled errors/problems that occur in an Elm app. In my first efforts I had imagined writing a notifier that would produce error data similar to all the other frontend notifiers I had worked with at bugsnag. Like a stacktrace. A stacktrace is very helpful in debugging a problem in your code. And Elm compiles to JavaScript, and JavaScript has lovely, built-in error objects with stacktraces so - piece of cake, right?

Nope. 😬

Taking a step back, let’s again examine the “no runtime errors in Elm” idea. As mentioned, our team was already using elm-rollbar to report… something to an error monitoring service. What’s that about?

Sometimes you are building something in Elm, and you think you’ve got it, but the compiler warns you of a case you haven’t addressed. Usually it’s super helpful: you had genuinely forgotten a valid case, and now the compiler has saved you. (There was much rejoicing! 🎉) But - sometimes you read the error message and can only scratch your head. “That state would NEVER happen!” you cry. “Just let the code run!” The compiler looks down, and whispers “No.”

And this is the agony of working in Elm. If you are making a small change in one specific area of the code, but your change touches a module used throughout the codebase - well, you have to account for that change EVERYWHERE. Sometimes the work is just mildly tedious and repetitive. Sometimes you start wandering into areas of the code where you have no context, and how to fix things becomes murky. Sometimes your first round of fixes triggers a second wave of needed fixes and you fall into a rabbit/worm hole and begin to question everything.

Elm is like the time-knife sometimes

What do? 🤷🏼‍♀️ Well, one school of thought - and often the correct path - is to truly fix it. Follow where it leads, maybe rethink the way your code is structured and how the dependencies flow. Make that state, which is impossible in the real world, impossible in your Elm code. I have seen some breathtaking examples where following the threads of confusion led to an elegant refactor that left the code in a much more coherent, readable state. yay! 🍾

But, sometimes, that work is not feasible. Not that it can’t possibly be done, but is the product team going to be happy you spent 3 code days doing a refactor which ultimately only helped a teeny tiny feature change come to fruition? Are your colleagues going to appreciate a swarm of code conflicts on their PRs when you touched code way outside your area of ownership? Ironically, I’m currently reading Sandi Metz’s Practical Object-Oriented Design, and am struck by her discussion of when the right time to refactor is. As an engineer, it’s helpful to develop an instinct for when code has “smell”, but that isn’t the end of the story. Having identified a problematic area in the codebase, you then need to evaluate whether it is worth fixing right now. This is a conversation I have had over and over again on my team. Is the fix valuable to the engineering team? Is it valuable to the product team? How valuable (in code hours)? One of the hardest choices I’ve had to make as an engineer is to walk away.

If you wish to make an apple pie from scratch, you must first invent the universe -- Carl Sagan

um... we have a deadline?

So, on the occasions when it isn’t practical to rewrite the entire universe, we’re back to square one — what do we do with this really weird case that should never happen? Log it. Write the code that will report its occurrence to your error monitor and then you can keep tabs on it. If your instincts are right, it will never happen and never be logged. Maybe it happens once in a blue moon. Or maybe, as time goes by, the ground shifts and someone’s work in a different part of the code suddenly makes your edge case happen A LOT. Well, you are monitoring the situation, and at that point your product team can decide if it’s now worth the time to fix. Voila!
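Here is a minimal sketch of the "log it and move on" pattern. It is TypeScript rather than Elm, and everything in it is hypothetical: logUnexpected is a stand-in that just collects events in memory (a real one would call your error monitor), and the Quiz type and firstQuestion function are invented for the example.

```typescript
// Stand-in for an error-monitoring client. A real one would send these
// events over the network; this one only keeps them in memory.
const loggedEvents: { message: string; context: string }[] = [];

function logUnexpected(message: string, context: string): void {
  loggedEvents.push({ message, context });
}

// Hypothetical domain type, for illustration only.
interface Quiz {
  id: string;
  questions: string[];
}

function firstQuestion(quiz: Quiz): string {
  const first: string | undefined = quiz.questions[0];
  if (first === undefined) {
    // "This should never happen": quizzes are supposed to be created with
    // at least one question. Instead of refactoring the universe, record
    // the occurrence and give the user a harmless fallback.
    logUnexpected(`Quiz ${quiz.id} has no questions`, "QuizPage");
    return "This quiz is not available right now.";
  }
  return first;
}
```

The user never sees a crash either way. The only difference is that the team now has a count of how often the "impossible" branch actually runs.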

All of which is to say, there is value in monitoring error-like things in your Elm code.

Now, let’s loop back to what a typical error report looks like. Namely: a stacktrace, a code snippet, etc. None of these existed in the elm-rollbar package. I had - naively - thought I could get those working in my bugsnag-elm package. Hoooo boy. 😬

First, stacktrace. This is a very common attribute of error reports in almost any programming language. It tells the story of how the heck we got into this predicament. Time and again I spoke to Elm engineers about my stacktrace dreams, but it just isn’t feasible within the Elm architecture. From the compiled JavaScript's perspective, all errors will occur within the update function -- i.e. they'll all have the same, unhelpful stacktrace.

In speaking with Elm engineers, many mentioned that, rather than a traditional stacktrace, what would be most helpful would be to know the last few messages that occurred. Or the current state of the model. In this notifier I poked around with trying to automatically include the page’s model state in each bugsnag report, but it didn’t go very well. Debug has some lovely tools that can convert any Elm record into a nice string, but it is VERY explicitly not intended for use in production code. And it seems like one could accumulate an array of all messages called in an app session, but again -- the question, on both model and "message-trace", is how could something helpful be accomplished in a package, rather than asking engineers to explicitly write this code in each of their update functions? 🤔
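For what it's worth, here is roughly what the hand-written version of a "message-trace" could look like. This is a TypeScript sketch, not Elm and not part of bugsnag-elm; the Msg and Model types, pushTrace, and TRACE_LIMIT are all invented for illustration. It also shows the catch: the bookkeeping lives inside the app's own update function, which is exactly the part that is hard to move into a package.

```typescript
// Hypothetical message type for a small app; names are illustrative only.
type Msg =
  | { kind: "ClickedSave" }
  | { kind: "TypedAnswer"; text: string }
  | { kind: "GotResponse"; ok: boolean };

interface Model {
  answer: string;
  recentMessages: string[]; // names of the latest messages, oldest first
}

const TRACE_LIMIT = 3;

// Append a message name and keep only the last `limit` entries.
function pushTrace(trace: string[], name: string, limit: number): string[] {
  if (limit <= 0) return [];
  return [...trace, name].slice(-limit);
}

function update(msg: Msg, model: Model): Model {
  // Every update function would have to repeat this line by hand.
  const recentMessages = pushTrace(model.recentMessages, msg.kind, TRACE_LIMIT);
  if (msg.kind === "TypedAnswer") {
    return { ...model, answer: msg.text, recentMessages };
  }
  return { ...model, recentMessages };
}
```

When something odd happens, recentMessages could be attached to the report as extra detail, giving a rough "how did we get here" in place of a stacktrace.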

So, for now, in Elm errors, there is no meaningful stacktrace. Most of the time, this is ok. We have a few crucial pieces of information that we can report:

error - a string explaining what happened
context - a string of the name of the module where the error happened
metadata - a record of any other details you'd like to attach
severity - error, warning or info

From the error message and module name, usually we can assess what happened and get to work. But sometimes, if a module is imported extensively throughout the code base and can be approached from many different angles, it can be frustrating to see an error popping up, but with no clue as to which exact workflow ended up in that error state. 😞
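The four fields listed above can be pictured as a small record. The sketch below is TypeScript, not Elm, and it is not the actual bugsnag-elm API or bugsnag's payload format; ReportSeverity, ElmStyleReport and summarize are names I made up to show the shape of the data.

```typescript
type ReportSeverity = "error" | "warning" | "info";

// Hypothetical shape mirroring the four fields listed above.
interface ElmStyleReport {
  error: string; // a string explaining what happened
  context: string; // the name of the module where it happened
  metadata: Record<string, string>; // any other details worth attaching
  severity: ReportSeverity;
}

// One-line summary of the kind you might scan in a dashboard list.
function summarize(report: ElmStyleReport): string {
  return `[${report.severity}] ${report.context}: ${report.error}`;
}

const exampleReport: ElmStyleReport = {
  error: "Quiz has no questions",
  context: "QuizPage",
  metadata: { quizId: "q2" },
  severity: "warning",
};
```

Notice what is missing: nothing here says which workflow led to QuizPage, which is the gap a stacktrace would normally fill.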

folks who understand how important it is to trace your stack

Also, there aren’t even any error classes in Elm. 🤷🏼‍♀️ Makes sense if no errors are expected to occur, but within the usual debugging world, an error class is important. In fact bugsnag uses error classes as part of its grouping algorithm. In this notifier, we are reporting what is really the error message (a human-friendly statement) as the error class (should be something like: “ReferenceError”) so that we can take advantage of its visibility in the dashboard and use it for grouping, but - as a future improvement on this package - I wonder if it would be worth trying to define some general Elm (not-error) error classes.

Another common debugging tool is source maps. During my time at bugsnag, by far the most frequent questions users asked involved configuring source maps correctly. Most frontend production code is minified in some way to take up less space and load faster. A source map, generated at the time of minification, is literally a map that lets you take a known spot in the minified code and then point to the original, un-minified code, which will always be easier to read and debug.

source maps

Elm, again, sees no need for source maps, since it has no runtime errors. Moreover, even if there suddenly were an interest in creating this sort of map for Elm compiled code, what we’re really talking about is transpiling Elm to JavaScript, which is not exactly the same thing as minifying. For example at NoRedInk, our Elm code is first transpiled into JavaScript, and then later that JavaScript is further minified. So even if we somehow did have a stacktrace of an Elm error, we would still have to make two leaps - from minified to original JavaScript via source map, and then from original JavaScript to Elm via a mapping tool that doesn’t exist yet and wouldn’t even be supported by services like bugsnag. 😬

Although mapping was an area I was keen on pursuing at the beginning of my project, after many conversations and explorations, it’s clear that this would require a heck of a lot of work for only a small benefit. I don’t plan on digging any further into this, but I bring it up here to show another place where Elm deviates from the typical tooling of a front-end language.

In conclusion, bugsnag-elm doesn’t fully succeed as a bugsnag notifier, in that it can’t report all the expected datapoints (because they don’t exist in Elm). And certainly it’s an odd bird in the Elm world, where production errors are never expected to happen. But I had loads of fun exploring this strange little niche. As it is, this is still a very powerful tool that will let you observe your Elm code's behavior in its production environment.

Writing bugsnag-elm was quite a journey for me! As I was learning Elm for the first time at NoRedInk, I kept noodling on this project. The kinds of questions it brought up really helped me understand the underlying mechanisms behind Elm, in a way that typical feature-building wouldn’t have. What a unique opportunity, as a newbie, to ask my bizarre questions directly to the core developers of this amazing language. 🙏 And although I wasn’t able to build the perfect solution I imagined, I understand the tradeoffs being made and am happy with the result. More importantly — it works! All of NoRedInk’s Elm code (the largest codebase in production) is now using bugsnag-elm to report errors, and so can you! 🎉🐛

Rosy Maple Moth - this is a real bug that exists! It's pink and fluffy.

Bugs can be beautiful!