
3 Important Reasons Why You Shouldn’t Centralize Your Data

Companies often put all their application data into a single, centralized datastore. But this is a bad idea, and here’s why.

By Lee Atchison | May 18, 2023 | App Architectures, Best Practices | Data Architecture, Developers, Architecture

Modern applications and systems are commonly built using microservice architectures.

Their distinguishing feature is the division of the business responsibility of a complex application into discrete, self-contained units that can be developed, managed, operated, and scaled independently.

Microservice architectures offer a viable approach for scaling an application, enabling larger and less connected development teams to work autonomously on their respective components while contributing to a cohesive application build. Under this type of architecture, individual services are designed to encompass a particular subset of business logic. When combined, the entire set of microservices forms a comprehensive, large-scale application that incorporates the complete business logic.

This model is great for the code, but what about the data? Often, companies that create individual services for specific business logic decide to put all the application data into a single, centralized datastore. The idea is to ensure all the data is available for each service that might need it. Managing a single datastore is easy and convenient, and the data modeling can be consistent for the entire application to use, independent of the service that is using it.

Don’t do this. Here are three important reasons why centralizing your data is a bad idea.

1. It’s harder to scale centralized data

When the data for your entire application resides in a single, centralized datastore, you must scale the entire datastore as your application grows, so that it can meet the needs of all the services in your application. This is illustrated on the left side of Figure 1. If you use a separate datastore for each service, only the services that have increased demand need to scale, and the database being scaled is a smaller database. This is illustrated on the right side of Figure 1.

It’s a lot easier to grow a small database than it is to make an already large database even larger.

Figure 1. Partitioning data by services simplifies scaling.
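To make the comparison in Figure 1 concrete, here is a toy model in Python. The service names and load numbers are invented for illustration, and real capacity planning involves far more than summing queries per second. The point is only that a shared datastore has to absorb the combined load, while per-service datastores isolate the growth.

```python
# Hypothetical peak load per service, in queries per second.
LOAD_BEFORE = {"search": 200, "checkout": 50, "profiles": 30}
LOAD_AFTER = {"search": 800, "checkout": 50, "profiles": 30}


def centralized_capacity(load_by_service: dict[str, int]) -> int:
    """A single shared datastore has to absorb every service's load."""
    return sum(load_by_service.values())


def datastores_to_rescale(before: dict[str, int], after: dict[str, int]) -> list[str]:
    """With one datastore per service, only stores whose load grew need work."""
    return sorted(name for name, load in after.items() if load > before.get(name, 0))


# Centralized: the one big datastore must grow from 280 to 880 qps.
print(centralized_capacity(LOAD_BEFORE), "->", centralized_capacity(LOAD_AFTER))

# Partitioned: only the search datastore (200 -> 800 qps) is touched.
print(datastores_to_rescale(LOAD_BEFORE, LOAD_AFTER))
```

In the partitioned case, the checkout and profiles datastores are left alone, and the database being resized is the smaller one.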

2. It’s more difficult to partition centralized data later

When launching a new app, developers often think, “I don’t need to worry about scaling now; I can worry about it when I need it later.” This viewpoint, while common, is a recipe for scaling issues at the most inopportune time. Just as your application gets popular, you are forced to rethink architectural decisions simply to meet incremental customer demand.

One common architectural change that comes up is the need to split your datastore into smaller datastores. The problem is, this is much easier to do when the application is first created than it is later in the application’s life cycle. When the application has been around for a few years, and all parts of the application have access to all parts of the data, it becomes very difficult to determine what parts of the dataset can be split into a separate datastore without requiring a major rewrite of the code that uses the data. Even simple questions become difficult: Which services are using the Profiles table? Are there any services that need both the Systems and the Projects tables?

And, even worse, is there any service that performs a join using both tables? What is it used for? Where is that done in the code? How can we refactor that code?
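If you can build an inventory of which service touches which table, the first of these questions becomes a simple lookup. The sketch below assumes such an inventory already exists; the service names and the `TABLE_ACCESS` mapping are hypothetical, and in an older application, collecting this information (from query logs or code search, for example) is usually the hard part.

```python
# Hypothetical audit result: the tables each service reads or writes.
TABLE_ACCESS = {
    "billing": {"Profiles", "Projects"},
    "provisioning": {"Systems", "Projects"},
    "notifications": {"Profiles"},
}


def services_using(table: str, access: dict[str, set[str]]) -> list[str]:
    """Which services are using this table?"""
    return sorted(svc for svc, tables in access.items() if table in tables)


def services_using_all(tables: set[str], access: dict[str, set[str]]) -> list[str]:
    """Which services need every one of these tables (join candidates)?"""
    return sorted(svc for svc, used in access.items() if tables <= used)


def shared_tables(access: dict[str, set[str]]) -> dict[str, list[str]]:
    """Tables touched by more than one service, which block a clean split."""
    users: dict[str, list[str]] = {}
    for svc, tables in access.items():
        for table in tables:
            users.setdefault(table, []).append(svc)
    return {t: sorted(s) for t, s in sorted(users.items()) if len(s) > 1}


print(services_using("Profiles", TABLE_ACCESS))
print(services_using_all({"Systems", "Projects"}, TABLE_ACCESS))
print(shared_tables(TABLE_ACCESS))
```

Every entry returned by `shared_tables` is a table that cannot move into a single service's datastore without changing at least one other service.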

The longer a dataset stays in a single datastore, the harder it is to separate that datastore into smaller segments later.

By splitting data into separate datastores by functionality from the start, you avoid having to untangle joined tables later, and you reduce the chance that unexpected dependencies between unrelated datasets creep into your code.

3. Data ownership is impossible with centralized data

One of the big advantages of dividing data into multiple services is the ability to divide application ownership into distinct and separable pieces. Application ownership by individual development teams is a core tenet of modern application development that promotes better organizational scaling and improved responsiveness to problems when they occur. This ownership model is discussed in the Single Team Oriented Service Architecture (STOSA) development model.

This model works great when you have a large number of development teams all contributing to a large application, but even smaller applications with smaller teams can benefit from this model.

The problem is, for a team to have ownership of a service, they must own both the code and the data for the service. This means one service (Service A) should not directly access the data of another service (Service B). If Service A needs something stored in Service B, it must call a service entry point for Service B, rather than accessing the data directly.

Figure 2. Service A should never directly access Service B’s data.

This allows Service B to have complete autonomy over its data, how it’s stored, and how it’s maintained.
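Here is a minimal in-process sketch of that rule. The order data, method names, and invoice format are all hypothetical, and in a real system the call from Service A to Service B would typically cross a network boundary rather than be a direct method call.

```python
class ServiceB:
    """Owns the order data. Nothing outside this class touches _orders."""

    def __init__(self) -> None:
        self._orders: dict[str, int] = {}  # order_id -> total in cents

    def record_order(self, order_id: str, total_cents: int) -> None:
        self._orders[order_id] = total_cents

    def get_order_total(self, order_id: str) -> int:
        """Service entry point: the only supported way to read an order."""
        return self._orders[order_id]


class ServiceA:
    """Needs order totals, but asks Service B instead of reading its data."""

    def __init__(self, service_b: ServiceB) -> None:
        self._service_b = service_b

    def invoice_line(self, order_id: str) -> str:
        total_cents = self._service_b.get_order_total(order_id)
        return f"Order {order_id}: ${total_cents / 100:.2f}"


service_b = ServiceB()
service_b.record_order("o-1", 1999)
print(ServiceA(service_b).invoice_line("o-1"))
```

Service A depends only on `get_order_total`, so it has no knowledge of how or where Service B keeps its orders.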

So, what’s the alternative? When you construct your service-oriented architecture (SOA), each service should own its own data. The data is part of the service and is incorporated into the service.

Figure 3. Each service has its own data.

That way, the owner of the service can manage the data for that service. If a schema change or other structural change to the data is required, the owner of the service can implement the change without the involvement of any other service owner. As an application (and its services) grows, the service owner can make scaling decisions and data refactoring decisions to handle the increased load and the changed requirements, without any involvement of other service owners.
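As an illustration of a schema change that stays inside the owning service, consider the sketch below. `ContactService`, its two storage layouts, and `migrate_to_v2` are made up for this example; a production migration would run against a real database and be rolled out far more carefully.

```python
class ContactService:
    """Owns contact data. Callers only ever see display_name()."""

    def __init__(self) -> None:
        self._schema_version = 1
        self._rows: dict[str, dict[str, str]] = {}

    def add_contact(self, contact_id: str, first_name: str, last_name: str) -> None:
        if self._schema_version == 1:
            # Version 1 layout: a single full_name column.
            self._rows[contact_id] = {"full_name": f"{first_name} {last_name}"}
        else:
            # Version 2 layout: separate first_name and last_name columns.
            self._rows[contact_id] = {"first_name": first_name, "last_name": last_name}

    def display_name(self, contact_id: str) -> str:
        """Public API: behaves the same under either layout."""
        row = self._rows[contact_id]
        if self._schema_version == 1:
            return row["full_name"]
        return f"{row['first_name']} {row['last_name']}"

    def migrate_to_v2(self) -> None:
        """Structural change made by the owning team alone."""
        migrated = {}
        for contact_id, row in self._rows.items():
            first, _, last = row["full_name"].partition(" ")
            migrated[contact_id] = {"first_name": first, "last_name": last}
        self._rows = migrated
        self._schema_version = 2


contacts = ContactService()
contacts.add_contact("c-1", "Ada", "Lovelace")
before = contacts.display_name("c-1")
contacts.migrate_to_v2()
after = contacts.display_name("c-1")
print(before == after)
```

Because other services only call `display_name`, the owning team can ship the migration without coordinating with any of them.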

A question that often comes up is, “What about data that truly needs to be shared between services?” This might be data such as user profile data, or other data commonly used throughout many parts of an application. A tempting, quick solution might be to share only the needed data across multiple services, as shown in Figure 4. Each service might have its own data, and also have access to the shared data.

Figure 4. Sharing data between services is not recommended.

A better approach is to put the shared data into a new service that is consumed by all other services, shown in Figure 5.

Figure 5. Using a service is the proper way to access shared data.

The new service—Service C—should follow STOSA requirements as well. In particular, it should have a single, clear team that owns the service, and hence owns the shared data. If any other service, such as Service A or Service B in this diagram, needs to access the shared data, it must do so via an API provided by Service C. This way, the owner of Service C is the only team responsible for the shared data. They can make appropriate decisions on scaling, refactoring, and updating. As long as they maintain a consistent API for Service A and Service B to use, Service C can make whatever decisions it needs to about updating the data.
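The arrangement in Figure 5 might look something like the following sketch. The `ProfileAPI` protocol, the method names, and the message formats are assumptions made for illustration only; an actual Service C would more likely expose its API over HTTP or RPC.

```python
from typing import Protocol


class ProfileAPI(Protocol):
    """The contract Service C promises to keep stable for its consumers."""

    def get_email(self, user_id: str) -> str: ...


class ServiceC:
    """Single owner of the shared profile data."""

    def __init__(self) -> None:
        self._profiles: dict[str, dict[str, str]] = {}

    def upsert_profile(self, user_id: str, email: str) -> None:
        self._profiles[user_id] = {"email": email}

    def get_email(self, user_id: str) -> str:
        return self._profiles[user_id]["email"]


class ServiceA:
    """Hypothetical consumer: sends welcome messages."""

    def __init__(self, profiles: ProfileAPI) -> None:
        self._profiles = profiles

    def welcome_message(self, user_id: str) -> str:
        return f"Welcome mail queued for {self._profiles.get_email(user_id)}"


class ServiceB:
    """Hypothetical consumer: produces receipts."""

    def __init__(self, profiles: ProfileAPI) -> None:
        self._profiles = profiles

    def receipt_header(self, user_id: str) -> str:
        return f"Receipt for {self._profiles.get_email(user_id)}"


service_c = ServiceC()
service_c.upsert_profile("u-1", "ada@example.com")
print(ServiceA(service_c).welcome_message("u-1"))
print(ServiceB(service_c).receipt_header("u-1"))
```

Neither consumer holds a reference to the profile storage itself. Both depend only on `ProfileAPI`, which is the one thing the team that owns Service C has to keep consistent.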

This is different from the scenario illustrated in Figure 4, where both Service A and Service B access the shared data directly. In this model, no single team can make any decisions about the structure, layout, scaling, or modeling of the data without involving all other teams that access the data directly, thus limiting scalability of the application development process.

Using microservices or other SOAs is a great way to manage big development teams working on large applications. But the service architecture must also encompass the data of the application; otherwise, true service independence (and therefore true scaling independence of the development organization) will not be possible.


This post, written by Lee, first appeared on InfoWorld.

Photo by Joshua Sortino on Unsplash.
