5 Rules for Getting Your Data Architecture Right

To avoid serious problems in the future, be sure to address these key considerations early in the process.

By Lee Atchison | July 27, 2023 | App Architectures | Architecting for Scale, Data Architecture

Architecting modern applications is a tough job, and designing a solid data model is one of the toughest, yet most important, parts of it.

Failure to create a reasonable data architecture can cause your application to fail in many ways, including problems with performance, data integrity, data sovereignty, data safety, and scalability. Poor data architecture can leave your application and your company in bad shape.

Building a proper data architecture is critical to the long-term success of any modern application. To assist in your application modernization process, here are five rules to follow when architecting, or rearchitecting, your application data.

Use the right type of database

The first and most important decision in architecting your data is to understand what type of database you need to store and access your data. Will you need to:

Store highly structured data or simple key-value data?
Persist data permanently or for only a short period of time?
Access the data randomly or sequentially?
Use a fixed schema, a flexible schema, or a simple flat file?
Use a relational database that supports SQL queries?

You need answers to these questions to determine the type of database you need to use. Depending on those answers, you might choose an SQL database, a simple key-value store, a memory-resident cache, a simple object store, or a highly structured data store.

The type of database you select will dictate what your database is ultimately capable of doing and how well it will perform in your application use case. Your database choice significantly affects things as fundamental as how far your application can scale and how available it can be.
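
As a rough illustration, the questions above can be turned into a small decision helper. This is only a sketch: the field names, the order of the checks, and the mapping from answers to database categories are assumptions made for this example, not a definitive selection method.

```python
from dataclasses import dataclass


@dataclass
class DataNeeds:
    """Answers to the questions above (field names are illustrative)."""
    highly_structured: bool   # structured records rather than simple key-value pairs
    short_lived: bool         # data only needs to persist for a short period
    needs_sql: bool           # relational SQL queries are required
    large_opaque_blobs: bool  # files, images, or other unstructured objects


def suggest_store(needs: DataNeeds) -> str:
    """Return a rough database category for a set of answers.

    The ordering of these checks is an assumption for illustration only.
    """
    if needs.large_opaque_blobs:
        return "object store"
    if needs.short_lived and not needs.highly_structured:
        return "memory-resident cache"
    if needs.needs_sql or needs.highly_structured:
        return "SQL database"
    return "key-value store"


# Example: session tokens are simple, short-lived values.
session_needs = DataNeeds(
    highly_structured=False,
    short_lived=True,
    needs_sql=False,
    large_opaque_blobs=False,
)
print(suggest_store(session_needs))
```

A real decision would weigh more factors, such as access patterns and schema flexibility, but writing the answers down explicitly makes the trade-offs easier to discuss.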

Store data in the right location

A deceptively simple but important question is: Where should the data be stored? Depending on the data and your application, should it live in the front end of your application or in the back end? Can you store the data local to the consumer, or do you need to share it with many other consumers?

Most data is stored in the back end. But some data must be stored at the edge or in a client. Storing data in the front end is often needed in order to optimize performance, availability, reliability, and scalability.

Think about scaling from the start

Modern applications must be able to scale to meet the growing needs of a business’s customers. This is true for all businesses and all applications.

The absolute hardest part of building an application that can scale to meet your expanding needs is scaling the data store. Whether it’s scaling to increase the quantity of data you need to store for your growing customer base, or it’s scaling to allow more people to use your application simultaneously without degrading performance, data scaling is hard unless you plan for it from the start.

Yet most application architectures seem to treat data scaling as a side requirement that can be left for later. It’s something the application developers think about only once the main application architecture is established.

Force-fitting scaling into a data architecture later is an extremely difficult task, and it becomes harder as your dataset grows in size. By far, the easiest time to build in scalability is at the start, before your application needs to scale. Waiting until later can make scaling harder, and potentially impossible, without major data refactoring.
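
One common way to build scaling in from the start is to partition data by a key so that it can be spread across several database shards. The sketch below uses hash-based partitioning on a customer ID. It is an illustration under stated assumptions: `shard_for` and `ShardedStore` are hypothetical names, and in-memory dictionaries stand in for real database shards.

```python
import hashlib


def shard_for(customer_id: str, shard_count: int) -> int:
    """Map a customer ID to a shard index using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """In-memory stand-in for a set of database shards (hypothetical)."""

    def __init__(self, shard_count: int) -> None:
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, customer_id: str, record: dict) -> None:
        # Every read and write goes through the same routing function.
        index = shard_for(customer_id, len(self.shards))
        self.shards[index][customer_id] = record

    def get(self, customer_id: str):
        index = shard_for(customer_id, len(self.shards))
        return self.shards[index].get(customer_id)


store = ShardedStore(shard_count=4)
store.put("customer-123", {"plan": "pro"})
print(store.get("customer-123"))
```

Because all access is routed through the partition key from day one, adding shards later becomes a data migration rather than an application rewrite. Note that simple modulo hashing like this moves many keys when the shard count changes, which is one reason the partitioning scheme deserves thought early.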

Distribute your data across services

A number of cloud experts suggest that centralizing your application data is the right model for managing a large dataset for a large application. Centralizing your data, they argue, makes it easier to apply machine learning and other advanced analytics to get more useful information out of your data.

But this strategy is faulty. Centralized data is data that can’t scale easily. The most effective way to scale your data is to decentralize it and store it within the individual service that owns the data. Your application, if composed of dozens or hundreds of distributed services, will store your data in dozens or hundreds of distributed locations.

This model enables easier scaling and supports a full service ownership model. Service ownership enables development teams to work more independently, and encourages more robust SLAs between services. This fosters higher-quality services and makes data changes safer and more efficient, because each change is localized to the single service that owns the data.

But what if your business needs to perform analytics or machine learning on all of this data? I still recommend the distributed data model described here. However, to make your data useful for analytics and machine learning, send a copy of the relevant data to a back-end data warehouse system. In that data warehouse system, structure the data in an appropriate way for your analytics purposes, and use this version for your analytics and machine learning algorithms. This data warehouse version is separate and distinct from your application data of record, which is still stored within the individual services.
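
A minimal sketch of this pattern follows, assuming a hypothetical `OrderService` that owns its order data and a hypothetical `Warehouse` that receives copies. The class names, the fields, and the direct callback used to deliver copies are all assumptions for illustration; a real system would more likely deliver the copies through a message queue or a change-data-capture pipeline.

```python
from typing import Callable


class OrderService:
    """Hypothetical service that owns its order data of record."""

    def __init__(self, publish: Callable[[dict], None]) -> None:
        self._orders = {}        # data of record, private to this service
        self._publish = publish  # delivers copies to the analytics side

    def place_order(self, order_id: str, customer_id: str, total: float) -> None:
        order = {"order_id": order_id, "customer_id": customer_id, "total": total}
        self._orders[order_id] = order
        # Send a copy, so the warehouse never shares the record itself.
        self._publish(dict(order))

    def get_order(self, order_id: str):
        return self._orders.get(order_id)


class Warehouse:
    """Hypothetical warehouse that restructures copies for analytics."""

    def __init__(self) -> None:
        self.revenue_by_customer = {}

    def ingest(self, order: dict) -> None:
        customer = order["customer_id"]
        current = self.revenue_by_customer.get(customer, 0.0)
        self.revenue_by_customer[customer] = current + order["total"]


warehouse = Warehouse()
orders = OrderService(publish=warehouse.ingest)
orders.place_order("o-1", "c-1", 10.0)
orders.place_order("o-2", "c-1", 5.0)
print(warehouse.revenue_by_customer)
```

The warehouse keeps the data in a shape suited to analytics (here, revenue per customer), while the service remains the only place where the order of record can be read or changed.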

Distribute your data geographically

Finally, determine who will use the data, and where they will be located geographically. Determining data and user locations is becoming increasingly important as global commerce introduces increased opportunities while regional data governance restrictions make managing global data more difficult.

Before you create your data architecture, you must answer these key questions:

Is it important that your data be available globally, or will a regional version of data be more important to your business? For example, do you want the same or different data available in the United States and Germany? Many applications find a mixture of both models is important, and this answer is acceptable as long as you know which data must be globalized and which must be regionalized.
Do you have regional restrictions on what data you can store and where you can store it? Some localities have restrictions that prevent customer data from leaving the country where the customer resides. Others have restrictions on what data can be transferred across country and regional borders. Some areas have tighter privacy restrictions than other areas. What data restrictions apply to what parts of your data?
For data that is shared across regions, how important is it that the exact same data be shown in each region? In other words, does the data have to be exactly synchronized between regions? Different models put different burdens on your dataset. An eventual consistency model has very different performance characteristics than an ACID-compliant, transactional integrity model.

The answers to these questions will dictate whether you provide global or regional data, where that data can and cannot be used, and when and how to synchronize data between regions.
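
Once you have those answers, it can help to record them as an explicit policy that the application consults whenever it stores data. The following is a sketch under assumptions: the data classes, the country-to-region mapping, and the rule that pinned data stays in the customer's home region are hypothetical examples, not legal guidance.

```python
# Hypothetical policy: data classes that must stay in the customer's home region.
REGION_PINNED = {"customer_profile", "payment_details"}

# Hypothetical mapping from a customer's country to a storage region.
COUNTRY_TO_REGION = {"US": "us", "DE": "eu", "FR": "eu"}

# Name of the store used for data that may be shared globally.
GLOBAL_REGION = "global"


def storage_region(data_class: str, customer_country: str) -> str:
    """Decide where a piece of data should be stored."""
    if data_class in REGION_PINNED:
        try:
            return COUNTRY_TO_REGION[customer_country]
        except KeyError:
            # Refuse to guess: storing restricted data in the wrong region
            # is worse than failing loudly.
            raise ValueError(
                f"no region configured for country {customer_country!r}"
            ) from None
    return GLOBAL_REGION


print(storage_region("customer_profile", "DE"))
print(storage_region("product_catalog", "DE"))
```

Keeping the policy in one place makes it clear which data is globalized and which is regionalized, and gives you a single point to update when regulations change.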

Data architecture is a critical part of architecting a highly scaled, highly available, globally accessible, modern application. Mistakes in your data architecture can cause issues with scaling, availability, and even legal compliance. Changing your data architecture after your application has grown is difficult and painful. It’s far easier to address your key data requirements up front.

By following these five rules early in your data architecture process, you can avoid serious problems in the future.

An earlier version of this article first appeared on InfoWorld.