Note: Will Lansing is the CEO of FICO and Auren Hoffman is the CEO of SafeGraph. This piece is a follow-up to Auren’s 2019 piece: The Data-As-A-Service Bible -- the most widely read piece on the business of selling data.
Data is all the rage these days. Yes, we’ve heard that it is the new oil.
But a single dataset on its own has limited value. The real value from data comes from connecting it across multiple disparate datasets. And to accelerate the connecting of data, it is really helpful if data producers and data consumers agree on a common standard.
In this piece, we will dive into:
- How to make data more valuable
- What makes a good standard
- What standards have worked well in the past
- How new standards in the future can accelerate collaboration around data
If you want to stop reading right now, the tl;dr is:
- Linking data to other data makes all the data more valuable
- Standards (which act as join keys) are the most valuable way to link data together
- Good standards are platforms that create value for everyone (because everyone uses the standard)
- Successful standards have some common traits both in product design and go-to-market execution
- Perfect is the enemy of the standard -- it is better to focus on something that is good-enough
- Metcalfe's Law also applies to standards: the value of a standard grows with the square of its adoption
- Non-openness and collecting rents impede the success of a standard, because they impede adoption
- Standards should be SIMPLE
The easiest way to increase data’s value is by linking it together.
Metcalfe’s Law shows that the value of a network grows in proportion to the square of the number of users of the system (n²).
What most people don’t realize is that Metcalfe’s Law applies to data too. The more connected a dataset is to other data elements, the more valuable it is. And the easier it is to join to other data, the more it will be joined.
The reason for this is simple: data is only as useful as the questions it can help answer. Joining, connecting, and linking different datasets exponentially increases the number of potential questions that can be addressed.
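One loose way to see this: with n datasets that share join keys, the number of distinct pairwise joins grows quadratically (and the number of possible multi-dataset combinations grows faster still). A quick sketch:

```python
from math import comb

# With n joinable datasets, the number of distinct pairwise joins is
# n-choose-2, which grows quadratically with n. (The number of possible
# multi-dataset combinations grows exponentially: 2^n subsets.)
for n in [2, 5, 10, 20]:
    print(f"{n} datasets -> {comb(n, 2)} pairwise joins")
```

So doubling the number of joinable datasets roughly quadruples the number of questions you can ask across pairs of them.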
No one company or organization has a monopoly on data. Even mighty Amazon knows less than 0.1% of the facts about its own customers (Amazon may know what they buy on Amazon, but it doesn’t know where they went to dinner last night, what they are saying on Slack, or who they voted for).
So, to truly understand something, you need to bring together data from as many different sources as possible.
Join keys are the secret to connecting datasets.
Join keys are really valuable. They are just simple connectors that make it super easy to take many disparate datasets and connect them together.
If you are an investor and you are trying to value a dataset, the easiest thing you can do is first find out how many other datasets it can be easily joined with. The value of the dataset is highly correlated with its “join-ability.”
By definition, join keys are derived. They’re also fairly simple and, as a consequence, imperfect.
Data is most powerful when it’s standardized.
Case study: Unix time as a standard.
One great join key is time. Unix time (along with standards like UTC) gives everyone a common reference for when something happened, regardless of time zone.
Unix time is a standard convention around time, but it’s not perfect. Unix time might say it is Tuesday, but depending on where you are in the world, the sun may be up or it may be down. And Unix time doesn’t perfectly account for leap seconds.
Unix time is represented by a simple integer: the number of seconds since January 1, 1970 (UTC). It is very simple to calculate and very simple to store.
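As a sketch, converting a calendar moment to Unix time is a one-liner in most languages (Python shown here):

```python
from datetime import datetime, timezone

# Unix time: integer seconds since 1970-01-01T00:00:00 UTC (the epoch).
def to_unix(dt: datetime) -> int:
    return int(dt.timestamp())

# The same instant yields the same integer no matter where the computer
# computing it lives -- which is what makes it a good join key.
new_year_2020 = datetime(2020, 1, 1, tzinfo=timezone.utc)
print(to_unix(new_year_2020))  # 1577836800
```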
Unix time’s main power is that it is accepted as the convention to measure time. This means applications and databases can all easily talk to each other. A database created in Japan can be easily joined with a database created in Germany because they both likely use the same standard for time.
One of the nice things about Unix time is that it can be represented as a string of numbers -- which means it is easy to store and easy to communicate.
To reiterate, the power of Unix time is that everyone else uses it. Yes, it is clever. Yes, it is useful. But it is a standard because of its widespread adoption.
Case study: the Meter and measuring distance
A long time ago, people measured how big their farms were by pacing them out. This was obviously an imperfect and non-standard way of measuring distance (some people take bigger steps than others).
Today we have the meter as a standard.
Developed in post-revolutionary France in the 1790s, the meter has conquered the world (at least everywhere except the U.S., Liberia, and Myanmar).
Like all standards, the meter isn’t perfect. Why should the meter be the length it is? Would it be better if it was 5% longer or 10% shorter?
The meter is clever. It can be easily subdivided (centimeters, etc.) and expanded (kilometers, etc.).
But the cleverness of the meter is only a small part of its success. Its main reason for success is that people agreed to use it.
The meter is successful because enough people think it will be successful. All great standards are built on a similar type of trust.
To standardize a data set, it helps to be Free, Open, and Usable.
One of the advantages of Unix time and the meter is that the standards are free and open. In fact, the meter was given to the world by the people of France.
It is doubtful the meter would have taken over the world if Napoleon decided to charge a small tax for every time someone used a meter. That tax (no matter how small) would have been a friction to adoption.
It is also easier for a standard to be adopted if it is locally storable under a simple license.
Some data might seem open but there can be a hidden tax that can impede wide adoption. Data licenses that do not allow for the data to be stored or that do not allow for the data to be used for commercial purposes are examples of taxes that can impede adoption.
A better open-source license for standards is the MIT license which allows commercial and non-commercial use with no strings attached.
Case study: FICO® Score as standardized data.
The FICO® Score has become a standard to measure the overall likelihood that someone will repay a loan. The score is used by the vast majority of banks and credit granters in the U.S.
Typically, the higher your score, the lower the risk and the more likely creditors are to lend to you. Because almost all lenders use the same score, it has become a standard.
The nice thing about the FICO Score is that it is simple and storable. It is a three-digit number between 300 and 850. It is easy to understand and easy to communicate.
Of course, the FICO Score is far from perfect. Two people with the exact same score may end up having very different repayment behaviors.
Another way the FICO Score isn’t perfect is that it is not free. Lenders need to pay to get the FICO Score. But despite not being free, it is still a standard because of its widespread adoption and because it is relatively low-cost.
Unix time has gotten a wee bit better over the last 40 years (conventions for handling UTC’s leap seconds have been refined here and there). Similarly, FICO Scores have also improved over time (as more data becomes available and more sophisticated models are developed).
So like all standards, the FICO Score isn’t perfect. But remember, perfect is the enemy of the standard. It is better to have a good-enough standard that everyone uses than to have a perfect standard that no one uses.
Standards unlock massive value for the networks that use them.
Standards are really important because they create a common language to foster communication. If everyone speaks the same language, it is much easier to collaborate.
One of the advantages of Unix time is that it is both a standard and also a useful join key. Let’s say you have a dataset of all the people who visited a store and another dataset of all the people who were in a particular city. If both datasets use Unix time, you can easily join them to see how many people in that city visited that store at a specific time.
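A minimal sketch of that join, with made-up data keyed on Unix timestamps:

```python
# Hypothetical toy datasets, both keyed on Unix time (the data is
# made up for illustration).
store_visits = {1577836800: ["alice", "bob"], 1577840400: ["carol"]}
city_presence = {1577836800: ["alice", "dave"], 1577844000: ["erin"]}

# Because both datasets use the same standard for time, the join is
# just a key intersection -- no format conversion, no fuzzy matching.
overlap = {
    t: sorted(set(store_visits[t]) & set(city_presence[t]))
    for t in store_visits.keys() & city_presence.keys()
}
print(overlap)  # {1577836800: ['alice']}
```

If one dataset had used local time strings and the other Unix time, this two-line join would instead require parsing, time-zone inference, and error handling.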
Another standardized join key that is super useful is the U.S. dollar. While different exchanges and different countries may have their own currencies, the dollar is the standard currency for international trade. This makes it much easier for a company in Japan to sell products to a company in Brazil.
Language itself can be a standard as well. The World Economic Forum brings together leaders from all over the world. While they all speak different languages, they almost all speak English. English has become the standard language for international business.
Standards unlock value in data in three key ways:
- Enables understanding -- standards promote common and clear meanings for data
- Democratizes access and availability -- standards make the exchange, interpretation, and integration of data much more efficient
- Increases use --> which drives access --> which in turn drives more use/reuse of data; the more the data is used, the more it is joined, and the more valuable it becomes
Standards accelerate collaboration around data
The easier it is to join data, the more data will be transacted, moved, and used.
Because it is so easy to join data on price (the dollar is a common-enough measure), it becomes easy to compare the price of a home in Palo Alto with the price of a home in London.
But let’s say there is a world where people get paid in Bitcoin but they buy homes with platinum coins. It would be very difficult to understand the true value of anything because the join key (the currency) is not standardized.
As we said earlier:
The more connected a dataset is to other data elements, the more valuable it is. And the easier it is to join to other data, the more it will be joined.
Even the simplest questions may have very complicated operations to answer. For example, let’s say you want to know which zip codes in the U.S. have the most Starbucks. To answer that, you need a dataset of all the Starbucks and a dataset of all the zip code boundaries.
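As a toy sketch of that question (with made-up data, and skipping the hard part -- the spatial join of store coordinates against zip-code boundary polygons -- by assuming each store record is already tagged with its zip code):

```python
from collections import Counter

# Hypothetical store records. In reality, producing the "zip" field is
# the complicated operation: each store's coordinates must be matched
# against zip-code boundary data. Here we assume that join is done.
starbucks_stores = [
    {"id": 1, "zip": "98101"},
    {"id": 2, "zip": "98101"},
    {"id": 3, "zip": "10001"},
]

# Once the join key exists, the "simple question" really is simple.
counts = Counter(store["zip"] for store in starbucks_stores)
print(counts.most_common(1))  # [('98101', 2)]
```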
The easier it is to join the data, the more the data will be joined … and the more the data will eventually be used to answer important questions.
What makes a great standard?
The very best standards act as join keys that unlock data in multiple datasets.
From the DaaS Bible:
If the value of Dataset A is X and the value of Dataset B is Y, the value of joining the two datasets together is usually much greater than X + Y.
Data becomes much more valuable the more additional datasets it can be joined to. And no, data owners should not be afraid to have their data joined.
This is the #1 thing that most people who work at data companies do not understand. Most people think their data is so special that they should keep it in a silo and charge a lot for it.
Joining your data to other datasets is what makes your data more valuable … and it makes sense to support a standard join key that makes it easy for others to join to your data.
The best join key standards are SIMPLE.
The SIMPLE acronym for data companies helps guide the creation of a universal identifier that is:
- Storable. You should be able to store the ID offline. For instance, I know my SSN and my payroll system stores it too.
- Immutable. It should not change over time. An SSN on a person is usually the same from birth until death (except in very rare cases).
- Meticulous (high precision). The same entity in two different systems should resolve to the same ID. It shouldn't be "close" or "probable."
- Portable. I can easily move my SSN from one payroll system to another.
- Low-cost. The ID needs to be cheap (or even free). If it is too expensive, the transaction costs will make joining data difficult.
- Established (high recall). It needs to cover almost all of its subjects. An SSN covers basically every American.
One example: the Placekey is a join key that has a common identifier for all physical places. Prior to the Placekey, it took a lot of work to join two datasets that had addresses.
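A sketch of why a shared place identifier helps (the identifier values below are made up for illustration, not real Placekeys):

```python
# Hypothetical records: two datasets describing the same physical place.
# Their address strings might be formatted differently ("123 Main St"
# vs "123 Main Street, Ste 4"), but they share a common place ID.
foot_traffic = {"22g-222@5vg-7gq-5mk": {"weekly_visits": 1200}}
property_data = {"22g-222@5vg-7gq-5mk": {"sq_ft": 4500}}

# With a common join key, merging is a dictionary lookup instead of
# error-prone fuzzy address matching.
merged = {
    pk: {**foot_traffic[pk], **property_data[pk]}
    for pk in foot_traffic.keys() & property_data.keys()
}
```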
Some ideas on creating a standard in your industry.
If a standard does not already exist in your industry, it might be a good idea to help create one. Creating a standard is not easy, but if you are successful, you will create a lot of value for everyone.
Your standard should lift all boats.
The definition of a standard is that it lifts all boats. Remember how the U.S. dollar added value to every single person and company that used it?
Even companies can be standards, if they put their customers first.
Insurance Services Office (ISO) (which is now a division of Verisk) was started in the 1970s to help property and casualty insurers standardize their forms and their data. This made it much easier for insurers to share data and to understand risk.
Visa is another example of this. After Visa was created (spun out of Bank of America in 1970), it became a standard way for banks to issue credit cards and for merchants to accept them.
Open-source-first companies are another example of making the product itself a standard. Red Hat was successful because it helped standardize Linux for the enterprise.
Your standard should be low-cost.
One of the best ways to FAIL at creating a standard is to try to take too many rents or make it proprietary. The more you charge for a standard, the less likely it is to be adopted.
Your standard needs the support of industry competitors, regulatory bodies, etc.
Let’s say you run a company FoodDataGraph that has data on what people eat. Collecting this data and organizing it is a lot of work. You might be tempted to keep it all for yourself.
But what if you created a standard for food data? What if you made it easy for every restaurant and every food delivery company to use your standard?
Which path creates more value is not clear. But one thing is for sure: if you want to create a standard, you cannot do it yourself. You need the support of your competitors and you need the support of the industry.
Huge food distributors (like Sysco and U.S. Foods) and delivery companies (like DoorDash, Grubhub, and Uber Eats) would all need to agree to use the standard.
Adoption will be much faster if everyone (including your direct competitors) has open access to the standard.
When standards disappear, entire ecosystems can die.
Once standards get going, it is important that they last. Often billions of dollars are relying on a standard.
We often want a standard for an industry (like Verisk in insurance) but we hesitate to bet on one because we fear it won’t last.
There is power in a standard. But as Voltaire (sometimes attributed to Spiderman’s Uncle) said: “with great power comes great responsibility.”
Adding standards is hard … and humbling
The first thing you’ll find when working to start a standard is that there are 20 other projects that also want to be the standard.
It is humbling to start a standard because the chance of success is low. The more the standard is adopted, the more valuable it becomes. But getting those first few people to adopt it is very difficult.
Your standard should be built to exist forever
The great paradox of standards-building is that organizations and individuals who may benefit from the standard are the least likely to want to invest in it.
Creating a standard is really hard. The chicken-and-egg problem is ten times worse when one is developing a standard.
Your standard should be thought of and built so that it will last forever. There are various ways to do that (e.g. non-profit, open-source, or a company that puts the standard first).
Your standard may need to continually adapt
Some standards are like the meter -- you set it and forget it.
Some standards need to change and evolve over time -- like the FICO Score, or like an operating system such as Linux.
The more your standard is in the “set it and forget it” camp -- the more you need to get it right from the start.
The more your standard needs to evolve, the more it looks like a company or ongoing project. In this case, the standard needs a way to be funded so that it can continue to improve.
Think of your standard as a platform.
Bill Gates’ famous definition: “a platform is when the economic value of everybody that uses it, exceeds the value of the company that creates it.”
A standard is the OG of platforms.
Make your standard SIMPLE.
If you want to create a standard, try to keep it as SIMPLE as possible. The closer it is to SIMPLE, the more likely it will be adopted.
- Storable.
- Immutable.
- Meticulous.
- Portable.
- Low-cost.
- Established.
And heed our advice: don’t try to make the standard perfect. Don’t try to please everyone. That will lead to a standard that is too complex and too difficult to adopt.
The perfect is the enemy of the standard.
Thank you for reading this. We’d love your comments, ideas, and critiques. We also would love to hear from you if you are working on a standard in your industry.
Special thanks to the following people for their help and edits: Lauren Spiegel, Ryan Fox Squire, Russell Jurney, and more.
Note: SafeGraph is hiring. If you want to work at a data company, consider applying for SafeGraph careers.


