Not All Data is Created Equal. Here are 4 Things to Look for at All Times
It goes without saying that data is virtually everywhere. Unfortunately, a lot of it is bad or unusable. Not every dataset has the same capacity to spark new innovations, drive critical insights for timely and life-changing solutions, and answer tough or complex questions.
If you rely on data to do your work well, it’s important to have a foolproof way to distinguish one data source from another—in essence, the “good” versus the “bad”—so you know whether it’s worth your time and investment. Here are a few tips to get you moving in the right direction.
4 Key Steps for Evaluating Data Sources
We’ve discussed at length why maintaining data standards is so critical for businesses today. For starters, some data sources can be messy, inaccurate, or simply hard to join with other datasets. This forces data scientists to take the extra step of scrubbing and cleaning the data before they can extract any real value from it. And because accurate data is an essential element in today’s analytics-obsessed world, having objective criteria in place to assess whether a given dataset is fit for use is more important now than ever before.
Then, of course, there’s also the question of cost. Although data is everywhere, good and clean data often comes with a price tag—and for good reason. But there are also times when even subpar data lives behind a paywall. So, it’s important to know what kind of data you’re actually purchasing before going too far down the data “rabbit hole.”
Good news for you: If you follow these four steps, you can avoid falling into the bad data trap:
Step 1: Determine if the data comes from credible sources
First, verify how dependable and trustworthy the data is based on its original source, and be prepared to defend it should your findings ever come under scrutiny:
What is the true source of the data? Some data intermediaries take source data and process it to add value. You need to know this, as it could influence your findings or even be used against you should others challenge your conclusions.
What assumptions are being applied? Datasets are often filtered to fit a set of assumptions, which can alter your results if left unchecked.
What is the depth, breadth, and cadence of the data? Some datasets are aggregated while others represent individual transactions. Some datasets are captured from a specific moment in time while others span over a time period. Some datasets are generated from large panel sizes while others reflect only a small subset. Either way, you need to know this information up front to be able to defend and justify your results.
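The depth, breadth, and cadence questions above can be answered quickly with a small profiling pass. Here is a minimal sketch in Python with pandas; the column names ("panel_id", "date") and the sample data are hypothetical stand-ins for whatever identifiers your dataset actually uses.

```python
import pandas as pd

def profile_coverage(df: pd.DataFrame, id_col: str, date_col: str) -> dict:
    """Summarize a dataset's depth, breadth, and time coverage."""
    dates = pd.to_datetime(df[date_col])
    return {
        "rows": len(df),                     # depth: number of observations
        "panel_size": df[id_col].nunique(),  # breadth: distinct entities in the panel
        "first_date": dates.min().date(),    # cadence: start of the covered window
        "last_date": dates.max().date(),     # cadence: end of the covered window
    }

# Tiny in-memory sample standing in for a real data feed
sample = pd.DataFrame({
    "panel_id": ["a", "a", "b", "c"],
    "date": ["2023-01-01", "2023-02-01", "2023-01-15", "2023-03-01"],
})
print(profile_coverage(sample, "panel_id", "date"))
```

A summary like this is easy to attach to your findings later, which helps when you need to justify the panel size or time window behind a result.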
Step 2: Establish what the data can (and can’t) tell you
Next, determine the limits of the data, so you can put it to good use:
What does the data represent? Depending on the source, data can shine a spotlight on transactions, consumer intent, foot traffic, or behavioral patterns.
What observations can the data allow? Data can reveal explicit relationships or implied behavioral patterns just as much as it can leave gaps to be filled by assumptions. But it can’t do everything. You need to be aware of a dataset’s constraints at all times.
What are the unique characteristics of the data? Some data providers may be the only source for a particular type of dataset. Or they may treat the data in a unique way that renders a dataset more immediately usable or joinable to other datasets. Or the data may serve as a proxy for other data that isn’t readily available or accessible in the market but can nonetheless be used to infer important insights.
Step 3: Assess the genuine usability of the data
Not all data is immediately usable without a little scrubbing. To establish how much cleaning, sorting, or processing may be required to make a dataset useful, ask the following questions:
How is the data presented? The data may be available via GUIs and charts or packaged up as raw data. This can influence the amount of processing required.
How easy is it to work with the data? Complex datasets often require advanced expertise to work with. Others are more plug-and-play and integrate seamlessly with data analytics platforms via APIs, making it easier for even a non-technical user to analyze the data and glean immediate insights.
How much additional work is needed to make the data usable? If a lot of extra work is required to expand, clean, or sort a dataset, that could quickly become a roadblock for driving meaningful, actionable, or timely insights.
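One way to size up the cleaning effort before committing to a dataset is a quick usability audit. Below is a rough sketch in Python with pandas; the metrics chosen (missing cells, duplicate rows, untyped text columns) and the sample data are illustrative assumptions, not an exhaustive checklist.

```python
import pandas as pd

def usability_report(df: pd.DataFrame) -> dict:
    """Rough estimate of how much scrubbing a dataset will need."""
    return {
        # share of cells that are empty, across the whole table
        "missing_pct": round(df.isna().mean().mean() * 100, 1),
        # exact duplicate rows that would need de-duplication
        "duplicate_rows": int(df.duplicated().sum()),
        # free-text columns likely needing type coercion before analysis
        "object_columns": int((df.dtypes == "object").sum()),
    }

# Hypothetical sample with one missing value and one duplicate row
sample = pd.DataFrame({
    "price": [10.0, None, 10.0],
    "sku": ["A1", "B2", "A1"],
})
print(usability_report(sample))
```

If numbers like these come back high, factor the extra scrubbing time into the dataset’s real cost before you buy.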
Step 4: Be clear about how you plan to use the data
Finally, you need to define how you intend to put this data to work:
How many companies, metrics, or regions does the data apply to? A deep, dependable, and highly accurate dataset can allow you to answer a wider range of questions that could ultimately be applicable to multiple businesses, sectors, and beyond. This is less a limitation than a way of creating new value out of existing data sources.
Can the data be combined with other datasets? Joining one set of data with another can reveal unexpected, yet incredibly valuable insights that you couldn’t have gained from analyzing those two datasets independently. Unfortunately, depending on how clean or organized those datasets are, joining them can become a feat in and of itself.
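Much of the joining pain comes down to inconsistent keys. As a minimal sketch in Python with pandas, the example below normalizes a shared key before merging; the datasets, column names, and company names are all hypothetical.

```python
import pandas as pd

def join_on_normalized_key(left: pd.DataFrame, right: pd.DataFrame, key: str) -> pd.DataFrame:
    """Join two datasets after normalizing a shared key column."""
    for df in (left, right):
        # Strip whitespace and lowercase so "Acme " and "acme" match
        df[key] = df[key].str.strip().str.lower()
    # An outer join with indicator=True keeps unmatched rows visible
    # instead of silently dropping them
    return left.merge(right, on=key, how="outer", indicator=True)

transactions = pd.DataFrame({"company": ["Acme ", "Globex"], "spend": [100, 250]})
foot_traffic = pd.DataFrame({"company": ["acme", "Initech"], "visits": [40, 12]})
joined = join_on_normalized_key(transactions, foot_traffic, "company")
print(joined)
```

The "_merge" column that indicator=True adds is worth inspecting before trusting the combined data: rows marked "left_only" or "right_only" tell you exactly where the two sources fail to line up.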
Are there opportunities to get creative? Some datasets can be put to use in ways you might not expect, and those uses often generate the most immediate and long-term value. Think about it like this: Your imagination is your canvas, while data is your paint; not all “color” combinations will yield the results you anticipate, but when all the pieces fall into the right places, incredible things can happen. You just have to be creative.
Don’t Ever Be Duped Into Using Subpar Data Again
Every dataset is inherently different. Some are good while others are just not worth anyone’s time. Before deciding to invest in any data source for your own research or analysis, you must ask yourself the questions above to avoid potentially paying a lot for nothing at all.
Being crystal clear upfront about the credibility, usability, and applicability of any given dataset—including its limitations—will help you determine what you can get out of it before spinning too many wheels. This is a foolproof way to ensure you always get the data you need.