Preventing Massive Data Messes

I’ve been hired several times to clean up a database that has become a big mess. I really enjoy these projects, and I’m fascinated by the patterns that show up across them.

Unsurprisingly, one thing most of these projects share is that work was done in the past without a well thought-out plan. The internet is full of explanations of how a good, written plan pays off for everyone involved, so I won’t repeat them here. They’re true, though! Without a plan, everyone is just a six-year-old throwing their foot as high in the air as they can to try to kick the soccer ball. With a plan, any of us has the potential to be the older, better player who can stop the ball and decide where to kick it.

Another thing these projects often have in common is that the person closest to the data has developed a focus on edge cases. I suspect I’ll never figure out whether that person shows up with a focus on edge cases as part of their personality or whether the environment creates it in them. However it happens, once the person closest to the data has this focus, things begin to warp around it. Part of cleaning up the system is refocusing the whole team.

Most systems I’ve worked in have a lot of fairly typical data – for instance, online donations. These records get created automatically and are generally consistent. There can be all sorts of issues with this kind of data, but that’s a topic for another day. Most systems also have a small set of records that are truly atypical and need handling that the typical records don’t.

A team needs to recognize that the first priority is to get the typical data right, and that this should happen with as little human intervention as possible. A team also needs to be able to talk about typical vs. atypical data and recognize that there’s a difference! I’ve had several experiences where people have simply stopped seeing the typical data. I don’t necessarily want people to spend more time on the typical data, but I do want them to understand it and see how it is being handled automatically.

Saying that people need to recognize something and talk about it sounds pretty squishy, but here’s why it matters: once we start talking about atypical data, the easiest thing in the world is for the team to ask for a change based on that data and, in the process, break the things that are working for the typical data.

If teams were incredibly precise in their requests, perhaps this wouldn’t be an issue. If the people implementing the changes were fully aware of the typical and atypical data, understood how the team worked day to day, and felt comfortable saying, “actually, your request isn’t possible without breaking something more important,” perhaps this wouldn’t be an issue. But usually a team is making somewhat imprecise requests and handing them off to implementers who are missing a lot of context and who are incentivized to just complete the change rather than to figure out whether it is a good idea.

If you’re stuck in this cycle, there are a few questions you can ask and answer that may help people refocus.

  • Can you personally explain typical vs. atypical data in your system?
  • What works for typical data?
  • What doesn’t work for typical data?
  • What kinds of data cause your team the most problems? For each kind of problem you’ve discussed recently, do you know what percentage of your records are like that? (A few lines of code against an export of your data can answer this – see the sketch after this list.) I once worked with a team that was spending a massive amount of time on records that felt very common to them because they spent so much time on them, but that were in fact vanishingly rare in their system. Just pointing out how few of them there were changed the conversation entirely.
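
When the percentage question comes up, the answer doesn’t have to stay a matter of opinion. Here is a minimal sketch in Python that counts typical vs. atypical records in an export; the record shape and the needs_manual_review flag are assumptions, stand-ins for whatever marks a record as atypical in your system.

```python
from collections import Counter

# Hypothetical records; the field names are invented for illustration.
# Assume needs_manual_review is whatever flag (or rule) marks a record
# as atypical in your system.
records = [
    {"id": 1, "source": "online_form", "needs_manual_review": False},
    {"id": 2, "source": "online_form", "needs_manual_review": False},
    {"id": 3, "source": "paper_check", "needs_manual_review": True},
    # ...imagine thousands more rows from a real export
]

# Bucket each record, then report each bucket's share of the total.
counts = Counter(
    "atypical" if r["needs_manual_review"] else "typical" for r in records
)
total = sum(counts.values())
for kind, n in counts.most_common():
    print(f"{kind}: {n} records ({n / total:.1%} of total)")
```

Even a rough version of this, run once, can change the conversation the way it did for that team.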

When cleaning up a troubled system, I like to start with the typical data and get those integrations and automations working really, really well so that people can stop working on them and just use the data. Doing this usually turns up contortions in the automations made for atypical data, and those contortions seem to always cause problems for all the data – not to mention how difficult they are to maintain over the long term.
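
To make “contortion” concrete, here is a hypothetical sketch – the field names are invented, and this isn’t any particular client’s code. In the tangled version, a rare foundation-pledge case is handled inline, in the path every record takes, so a change made for the edge case can break every donation. The alternative routes atypical records to their own clearly named handler so the typical path stays boring.

```python
# Hypothetical sketch; all field names are invented for illustration.

def process_donation_tangled(record: dict) -> dict:
    """The contortion: a rare case handled inline, in the path every
    record takes. Editing this branch puts all records at risk."""
    amount = record["amount"]
    if record.get("donor_type") == "foundation" and record.get("pledge"):
        # Edge case bolted on years ago for a handful of pledges.
        amount = amount / record.get("installments", 1)
    return {"donor": record["donor"], "amount": amount}


def process_donation(record: dict) -> dict:
    """The typical path, kept boring on purpose."""
    return {"donor": record["donor"], "amount": record["amount"]}


def process_foundation_pledge(record: dict) -> dict:
    """Separate, documented handling for the rare case."""
    return {"donor": record["donor"],
            "amount": record["amount"] / record.get("installments", 1)}


def route(record: dict) -> dict:
    """Atypical records go to their own handler, so the typical path
    never changes on their account."""
    if record.get("donor_type") == "foundation" and record.get("pledge"):
        return process_foundation_pledge(record)
    return process_donation(record)
```

Untangling like this isn’t free, but it makes the typical path safe to change and gives the atypical case a place where its rules can be documented.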

Once the typical data is in order, we can work out how to handle the atypical data in the simplest possible way, with a plan and documentation so that we can maintain it as atypical things evolve. When dealing with atypical data, I like to start with the highest volume issues and move toward the lowest volume ones. If you’ve got some type of atypical data that someone has to spend 10 minutes a quarter updating because there’s no automated way to handle it, that’s probably not even a problem we should solve – and once we’ve gone through this process, nobody even has to say that out loud because it is so obvious. When that 10 minutes is just part of a constant avalanche of manual updates, it seems like a bug.
