Garbage in, Garbage out: how to find treasure among the trash

Congratulations! You've identified your first data and AI use cases and want to start creating value for your business as quickly as possible. You've followed the steps from my previous posts and have now reached the point where you want to evaluate your data sources: do they contribute to your use case, and can you actually use them?

In this post, we'll cover several concepts around data quality, and I'll explain how to map out your organization's data!

Garbage Everywhere

"Garbage in, garbage out" - this well-known expression in the data world perfectly summarizes why data evaluation is so important. It's virtually impossible to make optimal decisions based on poor data. But what actually makes data "bad"? And more importantly: how do you evaluate data quality in a systematic way?

Now that AI and ML are increasingly common within organizations, it is even more crucial that the data we feed into these models is of high quality. If the input is poor, we can't expect the models to provide good advice or predictions based on it.

These models are typically trained on high-quality data that reflects reality as closely as possible. They cannot be expected to handle every anomaly that poor-quality input introduces.

To be completely honest, almost all data I encounter in my work as a Data Architect is initially "messy". The real question is how we make sure this data becomes valuable!

Low-hanging Fruit

The most obvious examples of bad data are often related to incorrect registration or entry.

Take, for example, an inventory system that shows six items in stock while in reality there are only five. Something may have gone wrong with the (manual) entry of products in the ERP. The result? Stock is reordered too late, which leads to lost sales when an item runs out or to delayed deliveries, and ultimately to a negative customer experience.

We want to prevent this!

Solving these types of problems is relatively simple: we have a clear picture of where the error lies and can therefore make targeted adjustments. In the example above, additional verification of the input, or registering products with a scanner, would likely bring improvement. It might even be possible to automate the process by processing purchase orders using computer vision.
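
The additional verification mentioned above could be as simple as comparing the quantities recorded in the ERP with a physical count. Below is a minimal sketch in Python, assuming both are available as plain dictionaries keyed by a product code. The function name `find_stock_mismatches` and the example SKUs are hypothetical and not tied to any specific ERP.

```python
def find_stock_mismatches(erp_stock, counted_stock):
    """Return {sku: (erp_qty, counted_qty)} for every SKU where the two disagree.

    A SKU that is missing on one side is treated as quantity 0 on that side.
    """
    mismatches = {}
    for sku in sorted(set(erp_stock) | set(counted_stock)):
        erp_qty = erp_stock.get(sku, 0)
        counted_qty = counted_stock.get(sku, 0)
        if erp_qty != counted_qty:
            mismatches[sku] = (erp_qty, counted_qty)
    return mismatches


# Hypothetical example data: the ERP shows six items, the shelf holds five.
erp_stock = {"SKU-001": 6, "SKU-002": 12}
counted_stock = {"SKU-001": 5, "SKU-002": 12}

for sku, (erp_qty, counted_qty) in find_stock_mismatches(erp_stock, counted_stock).items():
    print(f"{sku}: ERP says {erp_qty}, counted {counted_qty}")
```

A check like this does not fix the entry process itself, but it tells you where and how often the registration drifts from reality.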

Although this example clearly shows what we mean by bad data, reality is (unfortunately) often more complex.

Looking for the Source

In most organizations, decisions are not based directly on raw data from a single source. They are based on multiple data sources, often modified in intermediate steps, which together form a picture of the reality in which the organization operates.

Not all data is accessible, but there are a few sources that are typically present in most organizations and are crucial for business operations:

A Financial System
A Customer Relationship Management (CRM) System
An Enterprise Resource Planning (ERP) System

With just these sources, we can already paint a more complex picture of how decisions can be made within organizations and how data quality influences decisions.

You can probably think of several other sources within your organization that are of great importance in decision-making. Below, I've given an example of how an organization determines its purchasing budget (among other things) based on the aforementioned data sources.

A look at this process shows that a small error in inventory numbers can have a major impact on other processes. The same applies to Financial data and CRM data.

It's also important to realize that each step in this process adds value to the decision but can also certainly have a negative influence. If an incorrect calculation occurs in one of the steps, this may have an even greater impact than an error in the source. One person's "valuable" output can become another's "garbage" input.
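
To make this knock-on effect tangible, you could write the process down as a small dependency graph and ask which steps are affected when one source is wrong. The sketch below is illustrative only: the intermediate steps (`inventory_report`, `sales_forecast`) and the way they connect are assumptions made up for this example, not a description of any real organization.

```python
from collections import deque

# Hypothetical lineage: each key feeds the steps listed as its value.
LINEAGE = {
    "erp_inventory": ["inventory_report"],
    "crm_orders": ["sales_forecast"],
    "financial_system": ["purchasing_budget"],
    "inventory_report": ["purchasing_budget"],
    "sales_forecast": ["purchasing_budget"],
    "purchasing_budget": [],
}


def downstream_of(node, lineage):
    """Return every step that directly or indirectly depends on `node`.

    Uses a breadth-first walk, so closer steps are listed first.
    """
    affected = []
    queue = deque(lineage.get(node, []))
    while queue:
        current = queue.popleft()
        if current in affected:
            continue  # already visited via another path
        affected.append(current)
        queue.extend(lineage.get(current, []))
    return affected


# Which steps are touched by an error in the ERP inventory numbers?
print(downstream_of("erp_inventory", LINEAGE))
```

Even a toy graph like this makes it easier to discuss which decisions sit downstream of a questionable source.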

You can imagine that as an organization grows, more explicit and implicit steps are added to this graph. I've seen plenty of organizations where a lot of data is still manually copy-pasted and entire departments are run based on Excel files shared via email - those are many implicit processes that are difficult to control.

That's all fine, as long as the organization continues to achieve its goals and remains competitive. However, it's important to understand that this way of working is very error-prone, so there's a high chance that "garbage" is inadvertently introduced somewhere in the process, with all the consequences that entails.

Start by Drawing

One way to get a grip on this process is to do a similar exercise to what I've shown above. To understand which data influences decisions for your specific use case, it's important to create a data model and uncover the properties of this data.

I personally like to start with a high-level process model, which primarily describes processes, and then create a detailed data model describing specific entities and their relationships. For example, create a BPMN diagram of the process, or draw the enterprise data model in UML.

If you want to approach it less theoretically, a drawing like I made above is also perfectly fine. The main thing is that you do it!

Dare to Ask!

Below I describe several questions you can ask about each data process within your organization, whether it concerns storage, processing (by human or machine), or decisions made based on data:

Data Input

Who or what enters the data?
Who is responsible for this input?
At what frequency is the data entered?
Does the data remain the same once entered, or does it change over time?
What validations take place here?
Can we formalize these validations?
How consistent is the input process? (automatic versus manual)
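
As an illustration of what formalizing validations could look like, the rules can be written down as named checks that run on every record at input time. This is a minimal sketch under assumptions: the field names (`sku`, `quantity`, `received_on`) and the rules themselves are hypothetical, and your own rules would come from the answers to the questions above.

```python
from datetime import date

# Hypothetical rules: each field maps to (description, check function).
RULES = {
    "sku": (
        "must be a non-empty string",
        lambda v: isinstance(v, str) and v.strip() != "",
    ),
    "quantity": (
        "must be a non-negative whole number",
        lambda v: type(v) is int and v >= 0,
    ),
    "received_on": (
        "must be a date that is not in the future",
        lambda v: type(v) is date and v <= date.today(),
    ),
}


def validate(record, rules=RULES):
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    for field, (description, check) in rules.items():
        if field not in record:
            problems.append(f"{field}: missing")
        elif not check(record[field]):
            problems.append(f"{field}: {description} (got {record[field]!r})")
    return problems


# Example: a manually entered record with a negative quantity.
for problem in validate({"sku": "SKU-001", "quantity": -1, "received_on": date(2024, 1, 15)}):
    print(problem)
```

Keeping the description next to each check means the same rule set doubles as documentation of what "valid" input means.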

Data Processing

What transformations does the data undergo?
Who is responsible for these transformations?
Where is data combined?
What analyses are performed?

Data Consumption

Who uses the data for decision-making?
Which systems depend on this data?

The data model and the questions above will help you (among other things) understand where value is added, provide documentation about how the organization works, and will help you have conversations about data with stakeholders in the organization.

If it's difficult to create a (basic) data model for your organization, or to formulate or get clear answers to the above questions, this might be a sign of fluctuating or poor data quality and data management. For your first use case, be sure to choose a source and processes where clarity already exists!

Pick a data use case where data quality and processes are relatively well organized; this keeps complexity low.

Get Started

In this post, I've given you some tools to get started with data lineage and data quality for your data and AI use cases. Analyze data sources and processes together rather than in isolation: draw the flow of your data from source to insight, and ask questions about how data is produced and used. After that, you can proceed with your data use case knowing which sources are involved and what their quality is.

Need help turning your garbage into valuable insights so you can make better decisions about your organization? Contact me via LinkedIn or send me a message. In a free initial consultation, we can sharpen your first use case together!
