At SafeGraph, we are entirely focused on curating the highest-quality location data to serve as a source of truth for what is happening in the real world. We ingest raw data from a variety of sources, then clean, de-dupe, and standardize it so data scientists can spend their time on actual analysis. We curate our Places data monthly to ensure it is a fresh and accurate representation of the real world. We include open/close dates to provide context on when a place began or stopped operating. This empowers data scientists to perform market analysis to see exactly how certain brands, business categories, or regions are expanding or contracting.
But in such a dynamic world, where businesses close, change names, and otherwise change at millions of places around the globe, we must constantly vet our data to make sure it is a true representation of reality.
We’re not the only provider of point-of-interest (POI) data out there, so we often hear of organizations comparing our data to similar datasets to assess coverage and quality. We decided to perform a similar analysis comparing the quality and coverage of SafeGraph Places and another leading POI dataset in the UK.
TL;DR: We found that 17% of the other provider’s POIs were invalid.
We first matched SafeGraph Places data to the other provider’s data. For POIs that did not match, we began a thorough investigation to determine whether we were indeed missing these places (no one is perfect), or whether the other provider’s data was inaccurate.
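As a rough illustration of this matching step, the sketch below splits one provider's POIs into those with a counterpart in a reference dataset and those without. It is a minimal sketch, not a description of how SafeGraph actually matched the data: the `name` and `postcode` fields, the normalization rules, and the sample records are all assumptions made up for the example.

```python
import re


def match_key(poi):
    """Build a crude comparison key from a POI's name and postcode.

    Assumption: each POI is a dict with hypothetical "name" and
    "postcode" fields. Real matching would likely need fuzzier logic.
    """
    # Lowercase the name and collapse punctuation/whitespace runs to one space.
    name = re.sub(r"[^a-z0-9]+", " ", poi["name"].lower()).strip()
    # Drop spaces and uppercase the postcode so formatting differences vanish.
    postcode = poi["postcode"].replace(" ", "").upper()
    return (name, postcode)


def split_by_match(other_pois, reference_pois):
    """Return (matched, unmatched) POIs from other_pois."""
    reference_keys = {match_key(p) for p in reference_pois}
    matched, unmatched = [], []
    for poi in other_pois:
        if match_key(poi) in reference_keys:
            matched.append(poi)
        else:
            unmatched.append(poi)
    return matched, unmatched


# Illustrative records only; these are not real rows from either dataset.
reference = [{"name": "Sunglass Hut", "postcode": "AB1 2CD"}]
other = [
    {"name": "SUNGLASS HUT", "postcode": "ab12cd"},
    {"name": "MK One", "postcode": "EF3 4GH"},
]

matched, unmatched = split_by_match(other, reference)
```

The `unmatched` list is what would then go to manual investigation, since an unmatched record can mean either a gap in the reference data or an invalid record in the other dataset.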
Some of the real-world factors that can affect POI counts include business closures and acquisitions. To determine whether the businesses showing up in the other dataset but not in SafeGraph's fell into either of these categories, we looked for relevant news articles and checked website domains.
For example, we found that the Debenhams department store purchased the Principles brand, which was not properly reflected in the other provider’s dataset and so produced inaccurate brand counts. Similarly, the brand MK One was still present in the other dataset even though it has been sold and rebranded multiple times.
There are also some discrepancies between how SafeGraph classifies brands and the methodology of other providers. To be a valid brand in SafeGraph data, a brand must have more than one location and its own storefront or dedicated space in a store (e.g., a Sunglass Hut inside Macy's is still a valid brand, but a Clinique makeup counter is not). We classify brands this way based on feedback from data scientists about how they think of brands and separate store locations. For example, if a kiosk inside a store has its own phone number and hours of operation, it is considered its own brand location.
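The brand rule above can be written as a simple predicate. This is only a sketch of the two stated criteria; the `BrandCandidate` type, its field names, and the location counts are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class BrandCandidate:
    # All fields are hypothetical names invented for this example.
    name: str
    location_count: int        # number of known physical locations
    has_dedicated_space: bool  # own storefront or dedicated space in a store


def is_valid_brand(candidate):
    """Apply the two criteria: more than one location AND a dedicated space."""
    return candidate.location_count > 1 and candidate.has_dedicated_space


# Location counts below are placeholders, not real figures.
store_in_store = BrandCandidate("Sunglass Hut", location_count=2, has_dedicated_space=True)
makeup_counter = BrandCandidate("Clinique counter", location_count=2, has_dedicated_space=False)

store_in_store_valid = is_valid_brand(store_in_store)
makeup_counter_valid = is_valid_brand(makeup_counter)
```

Under these assumptions the store-in-store passes and the makeup counter fails, mirroring the Sunglass Hut and Clinique examples.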
Many brands included in the other provider’s data do not meet this standard of accuracy. For instance, the other dataset includes POI data for Envy, an online-only shoe store, as well as Carlton Sports, a brand that is only sold in other stores and does not have its own store locations. Another example is Connections, an educational consulting service for boarding schools that does not actually own or operate the schools themselves (the other dataset associated the brand with the school locations).
Businesses open and close frequently, and to stay on top of these changes, data providers need to be thorough in their research and data curation. SafeGraph ensures data accuracy and freshness by thoroughly researching and vetting the raw data we ingest. Our sole focus is curating location data of the highest quality so data scientists can spend their time analyzing, not cleaning, their data. We deliver monthly updates of our data to provide an up-to-date view of the real world, unlike other providers whose quarterly or annually updated data quickly becomes stale.