
How to build a data governance tool

Published on 5 November 2020

In my last article, I suggested that I would put forward a model which would support nearly all foreseeable data governance workloads.

While at DocAuthority we focus solely on unstructured data, most of the principles below apply just as well to structured data.

Data governance workloads

Before we begin, let’s just reflect briefly on what those data governance workloads could comprise.

That’s a non-exhaustive list, and doubtless there are some crucial omissions. These, however, are the workloads that come up time and time again in discussions with DocAuthority customers. There’s a broad variety of workloads here, and with it the potential for unmanageable complexity.

Finding common ground

How might we address all of that within one simple and durable model?

First, let’s start to think about what actions it is actually possible to take on a file or document. I’m not sure I have them all – suggestions welcome.

To achieve any data governance outcome, we have only the set of operations above. Applying one or more of these operations, singly or in combination, is how we effect the outcomes we want.

Because it’s simpler, I’m already feeling a lot better about one unified approach. I have a file or document. I can take a fixed set of actions on that file. Those actions (and only those actions) are the tools we have available to achieve our data governance objectives.
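The idea of a fixed action vocabulary can be made concrete in code. A minimal sketch follows; the specific operations (move, delete, encrypt, and so on) are my own illustrative assumptions, not the article's actual list, and `apply_action` is a hypothetical stub rather than a real DocAuthority API:

```python
from enum import Enum, auto
from pathlib import Path

class FileAction(Enum):
    """Hypothetical closed vocabulary of governance actions on a file."""
    MOVE = auto()
    DELETE = auto()
    ENCRYPT = auto()
    RESTRICT_ACCESS = auto()
    TAG = auto()
    ARCHIVE = auto()

def apply_action(path: Path, action: FileAction) -> str:
    """Dispatch a single action against a single file (stub).

    A real tool would call storage or security APIs here; this sketch
    just returns a description of what would happen.
    """
    return f"{action.name} applied to {path}"

print(apply_action(Path("report.docx"), FileAction.ENCRYPT))
```

Constraining the model to an enumerable set of actions is what makes it possible to reason about every workload with one unified approach.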

Scaling up

I stated in my last post that data governance needed to be economic, efficient and effective. Our model is effective, but it isn’t yet economic or efficient. Because we are talking about millions (or billions) of files, we can’t simply process one file at a time through one workflow or another. The cost and resource requirements are simply unaffordable.

What we need to focus on next are our inputs. We need to industrialize this. One file, two files, ten files just isn’t going to cut it. We need a thousand files at a time and ideally, even bigger volumes. Then, we need to scale our processes similarly to ensure that we don’t have any bottlenecks.

We do that by collating a set of files which all share one key characteristic, or several. An example would be collating a set of files which all contain personally identifiable information (PII). That’s not going to be just 1 file or 1,000 files. It’s going to be tens of thousands of files. Possibly millions. We’re going to determine an appropriate way to classify those files, or flag them in some way as containing PII. Then, we’re going to process the whole lot in one go. Ideally, we want any process to be evergreen. So rather than having to go through this exercise repeatedly, we have some automation (or better yet, orchestration) so that all files which fulfill the defined criteria are automatically processed appropriately.
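The collate-classify-process loop above can be sketched as follows. This is a toy illustration under my own assumptions: the PII check is reduced to a naive email-address regex, the `File` record and tag set are invented for the example, and the "action" is just a label rather than a real operation:

```python
import re
from dataclasses import dataclass

@dataclass
class File:
    """Toy stand-in for a file in a governed corpus."""
    name: str
    text: str
    tags: set

# Hypothetical PII detector: flags anything resembling an email address.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def contains_pii(f: File) -> bool:
    return bool(EMAIL_RE.search(f.text))

def classify(corpus):
    """Tag every file meeting the criterion; re-running stays 'evergreen'."""
    for f in corpus:
        if contains_pii(f):
            f.tags.add("PII")
    return [f for f in corpus if "PII" in f.tags]

def process_batch(batch, action):
    """Apply one governance action to the whole collated set in one pass."""
    return [f"{action}:{f.name}" for f in batch]

corpus = [
    File("hr.csv", "contact jane@example.com", set()),
    File("notes.txt", "no sensitive content", set()),
]
flagged = classify(corpus)
print(process_batch(flagged, "restrict-access"))
```

The point of the sketch is the shape of the pipeline: a criterion selects the batch, and one action is applied to the whole batch, rather than handling files one at a time.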

Next time…

My next post will be on the topic of making the case for an investment in data governance.

Talking points:

  1. Can you think of any data governance workloads this general approach won’t support?
  2. Does this approach have anything to offer in regard to data governance maturity?
  3. What does this approach mean in terms of stakeholder engagement?


DocAuthority tackles challenges associated with unstructured data: files, documents and emails which, over decades, may have become a substantial business problem. We run demos and insight sessions for anyone who’s interested in knowing more. For those who are considering a technology investment, we’ll run an end-to-end ‘try before you buy’ implementation in your business, on your hardware, with your data, delivering your stated objectives. We do all this for free!

