
Sex? Lots of it! And other data quality mistakes that ruin your AI.

When we talk about data quality, we tend to think of something technical or distant. But sometimes, the problem blows up in your face… and even makes you laugh.

A few years ago, working with data from a chatbot, I came across this gem:

  • Name: Juan
  • Age: 25
  • Sex: Very much

At first I laughed, of course. Then I got worried: how much more data like that was out there? And how does it affect any analysis or AI model built on it?

In a world increasingly driven by artificial intelligence, the quality of the data we use is not a technical detail: it is the first step to good results.

Where errors come from: design vs. use

Over time I came to understand that mistakes usually come from two sides:

  • Design errors ➔ as in the chatbot, which allowed typing anything in the “sex” field.
  • Usage errors ➔ such as when a bakery always rings up its sales under the same product or customer code.

In both cases, we end up with data that is useless for any analysis. Worse, it leads us to wrong conclusions.

What if AI learns from dirty data?

Going back to the bakery example: if every recorded sale says “sourdough bread”, the algorithm will only ever recommend making more of it. Not because it sells more, but because it is the only data you gave it.
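To make the bakery case concrete, here is a minimal sketch in Python. It assumes a deliberately naive "recommender" that simply suggests the most frequently recorded product; the function name `recommend_product`, the product names, and the sales figures are all hypothetical, invented for illustration.

```python
from collections import Counter


def recommend_product(sales):
    """Suggest the product that appears most often in the sales records."""
    counts = Counter(sale["product"] for sale in sales)
    product, _ = counts.most_common(1)[0]
    return product


# What customers actually bought (hypothetical figures).
actual_products = ["croissant"] * 6 + ["baguette"] * 3 + ["sourdough bread"] * 1

# Clean records: each sale is keyed under the product really sold.
true_sales = [{"product": p} for p in actual_products]

# Dirty records: the cashier keys every sale under the same code.
recorded_sales = [{"product": "sourdough bread"} for _ in actual_products]

print(recommend_product(true_sales))      # croissant
print(recommend_product(recorded_sales))  # sourdough bread
```

The logic is identical in both calls. Only the input changed, and with it the recommendation.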

📊 Fact: According to Harvard Business Review, data scientists spend up to 80% of their time cleaning and preparing data before they can analyze it or train models.

Checklist: The most common data errors (and how to prevent them).

  • Duplicate data: it ruins any analysis.
  • Inconsistencies: “ Juan” vs “JUAN”, and from there it is one step to chaos.
  • Incomplete fields: ever forget to ask for the email?
  • Outdated data: make decisions with old information and you’re headed for failure.
  • Typos: the classic misspelled “sourdough bread”.
  • Outliers: ages of 150 years or sales of -10 products.
  • Encoding problems: those weird characters that break reports.
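Several items on this checklist can be detected automatically. The sketch below is one possible audit pass using only the Python standard library. The field names (`name`, `email`, `age`), the 0 to 120 age range, and the encoding heuristic are assumptions made for illustration, not a standard. Outdated data and typos are left out because they usually need a reference date or a reference list of valid values.

```python
from collections import Counter

REQUIRED_FIELDS = ("name", "email", "age")  # hypothetical schema


def normalize(text):
    """Collapse whitespace and ignore case so ' Juan' and 'JUAN' compare equal."""
    return " ".join(text.split()).casefold()


def audit(records):
    """Return a list of (row_index, problem) tuples found in the records."""
    problems = []
    seen = Counter()
    for i, rec in enumerate(records):
        # Incomplete fields: missing or blank values.
        for field in REQUIRED_FIELDS:
            if rec.get(field) in (None, ""):
                problems.append((i, f"missing {field}"))
        name = rec.get("name") or ""
        # Inconsistencies: stray spaces, all-caps or all-lowercase names.
        if name and (name != name.strip() or name.isupper() or name.islower()):
            problems.append((i, "inconsistent name format"))
        # Encoding problems: crude heuristic that looks for U+FFFD or a
        # stray U+00C3, which often appears when UTF-8 is read as Latin-1.
        if "\ufffd" in name or "\u00c3" in name:
            problems.append((i, "encoding problem"))
        # Outliers: ages outside an assumed plausible range.
        age = rec.get("age")
        if isinstance(age, int) and not 0 <= age <= 120:
            problems.append((i, "age out of range"))
        # Duplicates: same normalized name and email seen before.
        key = (normalize(name), normalize(rec.get("email") or ""))
        seen[key] += 1
        if seen[key] > 1:
            problems.append((i, "duplicate"))
    return problems


records = [
    {"name": "Juan", "email": "juan@example.com", "age": 25},
    {"name": " Juan", "email": "juan@example.com", "age": 25},
    {"name": "JUAN", "email": "", "age": 150},
    {"name": "Jos\u00c3\u00a9", "email": "jose@example.com", "age": 40},
]

for row, problem in audit(records):
    print(row, problem)
```

Note that duplicates are compared after normalization. Otherwise “ Juan” and “Juan” would count as two different customers and the duplicate would go unnoticed.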

How to improve the quality of your data?

  • Define which fields should be checked (drop-down lists, formats).
  • Validate the information at the time of entry.
  • Perform periodic audits and database cleanups.
  • Train your team: quality starts with whoever enters the data.
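The first two points can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the option list in `ALLOWED_SEX`, the 0 to 120 age range, and the function name `validate_entry` are hypothetical, and a real form would take its options from the business's own requirements.

```python
# Hypothetical closed list, playing the role of a drop-down.
ALLOWED_SEX = {"female", "male", "other", "prefer not to say"}


def validate_entry(form):
    """Check raw form input; return (clean_record, errors)."""
    clean, errors = {}, {}

    # Name: required, with surrounding and repeated spaces removed.
    name = " ".join(str(form.get("name", "")).split())
    if name:
        clean["name"] = name
    else:
        errors["name"] = "required"

    # Sex: only values from the closed list are accepted.
    sex = str(form.get("sex", "")).strip().lower()
    if sex in ALLOWED_SEX:
        clean["sex"] = sex
    else:
        errors["sex"] = "choose one of the listed options"

    # Age: must be a whole number within an assumed plausible range.
    try:
        age = int(str(form.get("age", "")).strip())
    except ValueError:
        errors["age"] = "must be a whole number"
    else:
        if 0 <= age <= 120:
            clean["age"] = age
        else:
            errors["age"] = "must be between 0 and 120"

    return clean, errors


clean, errors = validate_entry({"name": "Juan", "age": "25", "sex": "Very much"})
print(clean)   # {'name': 'Juan', 'age': 25}
print(errors)  # {'sex': 'choose one of the listed options'}
```

With a check like this in place, the chatbot answer from the opening anecdote would have been rejected at entry instead of landing in the dataset.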

At SMS Sudamérica we help companies transform dirty data into smart decisions. We love to ask the questions you may not have asked yourself yet.

Want to get the most out of your data? Write to us and we will schedule a meeting.

Note by: Lautaro Cantar