What Is Data Profiling? Types and Why It Matters


As a business owner, you understand the importance of collecting data to learn more about your customers. Maybe you’ve implemented post-checkout surveys or customer registration on your website, or you’re tracking transactional data through your point-of-sale (POS) system.

But it isn’t enough to simply gather this data. You need to aggregate and analyze it, then determine what’s useful and what to toss. This process is called data profiling. It can help increase revenue, capitalize on leads, and improve operational efficiency. Here’s what you need to know.

What is data profiling?

Data profiling is the process of consolidating your existing data, removing errors and inconsistencies, and analyzing it to better understand its structure, content, and quality. Data profiling might also involve enriching the data with additional information, like geographic or demographic data.

3 types of data profiling

  1. Structure discovery
  2. Content discovery
  3. Relationship discovery

Data profiling comes in several different flavors. Specific data profiling techniques may be more or less appropriate depending on your business’s industry, size, and needs. Here are three common types of data profiling used by ecommerce businesses.

1. Structure discovery

Structure discovery, also known as structure analysis, involves vetting your data to ensure it’s formatted correctly and consistently. For example, do all of your customer phone numbers have the correct number of digits, and do your customer email addresses all have an “@” symbol? Structure discovery can provide a basic statistical evaluation of your data, giving you values like mean, median, mode, frequency distribution, or standard deviation.
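As a rough illustration, the format checks and summary statistics described above can be sketched in a few lines of Python. The sample records, field names, and the 10-digit phone rule are assumptions made for this example, not a standard every store should adopt.

```python
import re
import statistics
from collections import Counter

# Hypothetical sample records; the field names are assumptions for illustration.
customers = [
    {"phone": "415-555-0132", "email": "ana@example.com", "order_total": 42.50},
    {"phone": "555-0199", "email": "ben.example.com", "order_total": 18.00},
    {"phone": "(212) 555-0187", "email": "cy@example.com", "order_total": 42.50},
]

def phone_is_valid(phone: str) -> bool:
    """Assume a well-formed phone number contains exactly 10 digits."""
    return len(re.sub(r"\D", "", phone)) == 10

def email_is_valid(email: str) -> bool:
    """A deliberately loose check: exactly one "@" with text on both sides."""
    local, at, domain = email.partition("@")
    return bool(local) and bool(at) and bool(domain) and "@" not in domain

# Collect the values that fail each structural check.
bad_phones = [c["phone"] for c in customers if not phone_is_valid(c["phone"])]
bad_emails = [c["email"] for c in customers if not email_is_valid(c["email"])]

# Basic statistical evaluation of one numeric column.
totals = [c["order_total"] for c in customers]
summary = {
    "mean": statistics.mean(totals),
    "median": statistics.median(totals),
    "mode": statistics.mode(totals),
    "stdev": statistics.stdev(totals),
    "frequency": Counter(totals),
}

print("Malformed phones:", bad_phones)
print("Malformed emails:", bad_emails)
print("Order total summary:", summary)
```

A dedicated profiling tool would run the same kind of checks across every column at once. The point here is only to show what "correctly and consistently formatted" looks like in practice.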

2. Content discovery

Content discovery is the process of identifying the content or context of data elements. It’s also the practice of combing through data to find errors, inaccuracies, and other data quality problems like null values (i.e., unknown or missing values).

Whereas structure discovery is more quantitative, content discovery is more qualitative—its goal is to ensure accuracy, clarity, and consistency in the data set.

Say you have a database of the shipping addresses for every order you’ve shipped in the past year, which you want to analyze to determine where your highest concentration of customers lives. At the same time, you’re planning to launch an ad campaign on social media to reach more people in those areas. You need to ensure that every element in the address is formatted consistently: for example, use nine-digit ZIP codes everywhere rather than mixing them with five-digit ones, and use two-letter state abbreviations rather than spelled-out state names. Inconsistencies, improper formatting, and misspellings could lead to records being missed and an incomplete data set.
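The address example can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the records are made up, `None` stands in for a null value, and the state lookup table is deliberately partial.

```python
import re
from collections import Counter

# Hypothetical shipping records; None stands in for a null (missing) value.
addresses = [
    {"city": "Austin", "state": "Texas", "zip": "78701"},
    {"city": "Austin", "state": "TX", "zip": "78701-1234"},
    {"city": None, "state": "tx", "zip": "7870"},
]

# Partial lookup for illustration only; a real table would cover every state.
STATE_ABBREVIATIONS = {"texas": "TX", "california": "CA", "new york": "NY"}

def normalize_state(state):
    """Return a two-letter uppercase abbreviation, or None if unrecognized."""
    if state is None:
        return None
    cleaned = state.strip().lower()
    if cleaned in STATE_ABBREVIATIONS:
        return STATE_ABBREVIATIONS[cleaned]
    if len(cleaned) == 2 and cleaned.upper() in STATE_ABBREVIATIONS.values():
        return cleaned.upper()
    return None

def zip_format(zip_code):
    """Classify a ZIP code as 'zip5', 'zip9', or 'invalid'."""
    if zip_code is None:
        return "invalid"
    if re.fullmatch(r"\d{5}", zip_code):
        return "zip5"
    if re.fullmatch(r"\d{5}-\d{4}", zip_code):
        return "zip9"
    return "invalid"

# Count null values per field.
null_counts = Counter(
    field for row in addresses for field, value in row.items() if value is None
)
# Normalize state names and see which ZIP formats are mixed together.
states = [normalize_state(row["state"]) for row in addresses]
zip_formats = Counter(zip_format(row["zip"]) for row in addresses)

print("Nulls per field:", dict(null_counts))
print("Normalized states:", states)
print("ZIP formats found:", dict(zip_formats))
```

A report like this tells you how many records need attention before you count customers by region. It does not fix the records, and it cannot fill in the missing four digits of a five-digit ZIP code.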

3. Relationship discovery

Relationship discovery, as the name suggests, is the process of analyzing connections between different variables or data elements in a dataset to surface patterns or trends. If you collect purchase history and customer location, you may find a correlation between the two that informs how you advertise in specific locations or how you stock products. For example, sales of a particular SKU might be higher in one area code than in another.
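One simple way to look for the SKU and area code pattern mentioned above is to total units sold per SKU within each area code. The sketch below uses invented order lines, and the tuple layout is an assumption for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical order lines: (area_code, sku, units_sold).
orders = [
    ("415", "SKU-A", 3), ("415", "SKU-B", 1), ("415", "SKU-A", 2),
    ("212", "SKU-A", 1), ("212", "SKU-B", 4),
]

# Total units sold per SKU within each area code.
units_by_area = defaultdict(Counter)
for area_code, sku, units in orders:
    units_by_area[area_code][sku] += units

# The best-selling SKU in each area code.
top_sku_by_area = {
    area: counts.most_common(1)[0][0] for area, counts in units_by_area.items()
}

for area, counts in units_by_area.items():
    print(area, dict(counts), "top:", top_sku_by_area[area])
```

A table like this only suggests a relationship. With small samples, a difference between two area codes may be noise rather than a real trend.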

Benefits of data profiling

High-quality data is an essential ingredient for success in ecommerce. Without it, your business will be flying blind. Benefits include:

  • Increased revenue. While good data profiling can help your company’s bottom line, data quality issues can be costly, with estimates suggesting that businesses spend 10% to 30% of their revenue on addressing these problems. When marketing campaigns or analyses are based on inaccurate data, the result can be missed opportunities and lost conversions.
  • Improved decision-making. Making choices for your business—how to advertise, when to expand, who to target with marketing—based on flawed or incomplete data is a recipe for disaster. Data profiling can help ensure the accuracy and legitimacy of your data and provide precious insights into the quality, relationships, patterns, and gaps in the data.
  • Better organization. Data profiling includes consolidating your information in one easy-to-navigate system called a data warehouse. Bringing all the data you collect—from social media to surveys—into a single system can give you a more complete view of customer perceptions.

Charlie Gower, co-founder of supplement company The Nue Co., sums up the benefits of high-quality data: “If you’re an early stage ecommerce business and you can start to capture data in an innovative way, it’s really going to help you as you build and scale.”

How to conduct data profiling

  1. Identify what data you want to use
  2. Identify issues in the data
  3. Use tools to discover beneficial relationships in the data

While some parts of the data profiling process—such as calculating metrics like standard deviation—might be automated, it’s essential to devise a plan of attack to make the most efficient use of your time and labor. Here are the three main steps involved:

1. Identify what data you want to use

Just because you collected the data doesn’t mean you need to profile it or use it. Specific datasets may be incomplete or irrelevant to your current interests and not worth spending time and energy combing through. Before you start profiling, determine what you hope to achieve and what data will help you meet these goals.

2. Identify issues in the data

Organizing the information from your data warehouse into spreadsheets or searchable databases, deleting duplicates, filling in null values, completing fields, and searching for major issues often represents the bulk of the data profiling workflow. This cleanup, which resembles the transform step of an extract, transform, load (ETL) process, prepares your data for analysis with tools like machine learning (programs that provide automated insights) or chart and graphic creation platforms.
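A minimal sketch of the deduplication and null-filling work described above might look like this in Python. The rows, the choice of email as the duplicate key, and the "UNKNOWN" placeholder are all assumptions for illustration; your own rules will depend on your data.

```python
# Hypothetical raw rows exported from a store; None marks a missing value.
raw_rows = [
    {"email": "Ana@Example.com ", "country": "US"},
    {"email": "ana@example.com", "country": None},
    {"email": "ben@example.com", "country": None},
]

def clean(rows, default_country="UNKNOWN"):
    """Normalize emails, drop duplicate emails, and fill missing countries."""
    seen = {}
    for row in rows:
        # Normalize the key so trivially different spellings match.
        email = row["email"].strip().lower()
        country = row["country"]
        if email in seen:
            # Keep the first record, but backfill a missing country
            # from the duplicate when the duplicate has one.
            if seen[email]["country"] is None and country is not None:
                seen[email]["country"] = country
            continue
        seen[email] = {"email": email, "country": country}
    # Replace any remaining nulls with an explicit placeholder.
    for row in seen.values():
        if row["country"] is None:
            row["country"] = default_country
    return list(seen.values())

cleaned = clean(raw_rows)
print(cleaned)
```

Filling nulls with an explicit placeholder, rather than a guessed value, keeps it obvious later which records were incomplete.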

3. Use tools to discover beneficial relationships in the data

Manual data analysis takes time that could be better spent improving other aspects of your business. Software tools and apps can automate the data profiling process, streamline structure, content, and relationship discovery, and provide quick, cost-effective insights to guide your decision-making.

4 data profiling tools for ecommerce

  1. Talend Open Studio
  2. Informatica
  3. IBM InfoSphere Information Analyzer
  4. Clear Analytics

Software platforms can help your business profile data, analyze metadata, and generate useful data quality assessment reports and graphics. Before choosing one, decide what capabilities you need (such as historical data storage, integration among multiple data sources, and data encryption) and how much you’re willing to spend. Here are a few options to get you started:

1. Talend Open Studio

Open Studio, Talend’s data quality tool, integrates with several SaaS (software-as-a-service) platforms, including Marketo, Salesforce, and NetSuite, and offers relationship and metadata analytics from an easy-to-use, searchable interface. Talend offers a free version of its tool, with a paid version available for more complex data sets and further integrations.

2. Informatica

Informatica offers several data profiling capabilities, including continual analysis, giving you up-to-the-minute insights as you collect and import new information. It also includes address verification (essential for any business gathering customer email data) and cloud storage to access your data on the go. Informatica offers a free 30-day trial, with prices starting at around 19¢ per hour of server time.

3. IBM InfoSphere Information Analyzer

IBM’s InfoSphere Information Analyzer can deliver 80 different types of reports for visualizing trends within your data. InfoSphere also includes a browser version for small-scale profiling or off-premises analysis. It starts at $16,500 per month.

4. Clear Analytics

Clear Analytics offers an affordable, Excel-based data analysis tool for smaller or less-technical ecommerce businesses. The software includes a complete audit trail that tracks where data came from, when it was imported, and who handled it. It also integrates with Microsoft Power BI to create interactive data visualizations like graphs and charts. It starts at $29 per month.

How Shopify can help with data profiling

Shopify offers integrations with third-party apps and tools that can help with data profiling, including:

  • Google Analytics: When you integrate your Shopify store with Google Analytics, you can track and analyze customer behavior including demographics, location, and device usage.
  • Klaviyo: Klaviyo is an email marketing platform that can help you collect customer data, segment your audience, and create targeted email campaigns based on customer behavior.
  • Yotpo: Yotpo is a user-generated content platform that integrates with Shopify and can help you collect customer reviews and other user-generated content—valuable insights into customer preferences and behavior.
  • Recharge: Recharge is a subscription billing and recurring payments platform that integrates with Shopify and can help you track customer subscription data, including purchase history and frequency.

Shopify also offers a robust API that can be used to integrate with other third-party tools. Developers can build a custom integration that pulls data from Shopify’s API and feeds it into a data profiling tool like Talend Open Studio or IBM InfoSphere Information Analyzer. The result can be deeper insights into customer behavior, along with improved operational efficiency and revenue growth.

Data profiling FAQ

How does data profiling make big data easier?

Modern data profiling uses automation to organize, analyze, and provide valuable insights into large and complex data sets, saving time and money.

What are some common data profiling techniques?

Data profiling techniques include column profiling (counting how often each value appears in a single column of a table), cross-column profiling (analyzing keys and dependencies between columns in the same table), and cross-table profiling (comparing columns across tables to find overlapping, mismatched, or orphaned values).
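To make the three techniques concrete, here is a small Python sketch that applies each one to two made-up tables. The table layouts and column names are hypothetical, and real profiling tools run these checks at a much larger scale.

```python
from collections import Counter

# Hypothetical tables represented as lists of rows.
customers = [
    {"customer_id": 1, "zip": "78701", "city": "Austin"},
    {"customer_id": 2, "zip": "78701", "city": "Austin"},
    {"customer_id": 3, "zip": "10001", "city": "New York"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 12, "customer_id": 9},
]

# Column profiling: value frequencies and uniqueness within one column.
zip_frequency = Counter(row["zip"] for row in customers)
customer_id_is_unique = (
    len({row["customer_id"] for row in customers}) == len(customers)
)

# Cross-column profiling: does each ZIP code map to exactly one city?
cities_per_zip = {}
for row in customers:
    cities_per_zip.setdefault(row["zip"], set()).add(row["city"])
zip_determines_city = all(len(c) == 1 for c in cities_per_zip.values())

# Cross-table profiling: orders that reference a customer that doesn't exist.
known_ids = {row["customer_id"] for row in customers}
orphan_orders = [
    o["order_id"] for o in orders if o["customer_id"] not in known_ids
]

print("ZIP frequency:", dict(zip_frequency))
print("customer_id unique:", customer_id_is_unique)
print("ZIP determines city:", zip_determines_city)
print("Orphan orders:", orphan_orders)
```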

What are some challenges of data profiling?

Automated software can assist with cleaning data, but manual intervention may still be necessary. Data privacy is paramount and requires safeguards such as encryption and server backups to protect customer information.