"No data yet," he answered. "It is a capital mistake to theorize before you have all the evidence. It biases the judgment."
"You will have your data soon," I remarked, pointing with my finger.
Sir Arthur Conan Doyle, A Study in Scarlet
We shall begin by looking at data-which we originally called uninteresting. So let's see if we can spice it up a little.
Data was once central to many people's view of computing-certainly mine. I was once indoctrinated with Jackson's Structured Design and Programming. His central premise was that computer processing was all about the gap between two different but related data structures. The design process was then a matter of identifying the data structures and the functionality required to transform one into the other. This seemed viable to me when I was working largely in a batch-processing mainframe environment. However, when I moved to working in a client/server model I encountered a dichotomy. Client/server wants to make data services available to all-usually through memory-loaded working copies-but the prime data, the source of these copies, has to have a high level of data integrity. There is a tension here.
I came to see that the location of data, its scope, and how it is used combine to define the kind of data it is and how we ought to work with it. For me, there are four major types of data.
Master or prime data
This is the data that is almost invariably in a shared database server. It is the most current and most highly trusted data in the system, and represents the currently committed instance of each data entity. Usually this sort of data is accessed interactively and concurrently by many data consumers, possibly including batch processes. It is important stuff: it has to be secured properly, backed up, and covered by a restore-from-backup process that actually works. Because it is shared, it is usually placed on a shared server at the same or a higher scope than its usage. Thus, if the system is a departmental one, its prime data will be located on a departmental server, or on one that can be accessed by the whole business. It is possible to split this data across a number of servers, but we'll talk a little more about distributed data later on.
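To make this a little more concrete, here is a minimal sketch of working against prime data, using Python's built-in sqlite3 module to stand in for a shared database server; the client table and its columns are invented purely for the example.

```python
import sqlite3

# Prime data: the committed, authoritative record of each entity.
# SQLite stands in for a shared departmental database server here;
# the "client" table and its columns are purely illustrative.
prime = sqlite3.connect("prime.db")
prime.execute("""
    CREATE TABLE IF NOT EXISTS client (
        client_id    INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        credit_limit INTEGER NOT NULL
    )
""")

# Every change goes through a transaction, so that what other data
# consumers see is always the currently committed instance.
with prime:  # commits on success, rolls back on any exception
    prime.execute(
        "UPDATE client SET credit_limit = ? WHERE client_id = ?",
        (5000, 42),
    )
prime.close()
```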
Management or business copy data
This kind of data is most often used for reference. It is usually a read-only snapshot of prime data. It is frequently frozen in time or includes some derived summary or aggregation data, so there may be redundancy here. It is usually updated by being recreated through a regular refresh process, so there might be versions of the same data kept for a period. It is essentially static, unchanging data, so no problems of synchronization arise. This data could be centralized-there's a current vogue for data warehouses-but it could just as easily be distributed, even to the point of being in individual spreadsheets on particular users' machines.
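As an illustration of the refresh-by-recreation idea, the sketch below (Python with sqlite3 again) rebuilds a summary table from scratch each time it runs; the sales table it reads from and the sales_by_region summary it produces are both assumed for the example.

```python
import sqlite3

def refresh_management_copy(prime_db="prime.db", copy_db="summary.db"):
    """Recreate a read-only, point-in-time summary derived from prime data.

    Assumes the prime database holds a sales(region, amount) table;
    the summary is dropped and rebuilt wholesale on every refresh,
    never updated in place.
    """
    prime = sqlite3.connect(prime_db)
    copy = sqlite3.connect(copy_db)
    rows = prime.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    with copy:
        copy.execute("DROP TABLE IF EXISTS sales_by_region")
        copy.execute(
            "CREATE TABLE sales_by_region ("
            "  region TEXT,"
            "  total  INTEGER,"
            "  snapshot_taken TEXT DEFAULT CURRENT_TIMESTAMP)"
        )
        copy.executemany(
            "INSERT INTO sales_by_region (region, total) VALUES (?, ?)", rows
        )
    prime.close()
    copy.close()
```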
User or operational copy data
In addition to management having copies of the prime data for particular uses, it is quite common for users to have their own copies, usually made with an eye to improving some level of operational efficiency, performance, or service. For instance, a laptop user might carry a copy of his or her own clients' records while on the road, to improve response time rather than having to phone the office or wait until returning before checking something. This data is a duplicate. Whether it will need to be synchronized or simply refreshed periodically, in the same way as management copy data, will depend on its scope and usage. If the responsibility for updating the data is shared, this increases the complexity of synchronization.
However, operational copy data could equally be static data, such as look-up data, which is rarely if ever changed; such data might be copied to a departmental server from a central source, or even down to individual users' PCs, to speed the process of consulting it.
Operational copy data can also be a subset of volatile prime data copied to a server at a specific location. Most updates might be made to this local copy, but some might be made to the prime data, in which case a process of synchronizing and reconciling the local copy and the prime data needs to be set up. This is usually done via some transaction-logging mechanism, but here the issue arises of just how real-time the synchronization has to be.
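A minimal sketch of that kind of arrangement might look like the following, with two sqlite3 connections standing in for the local copy and the prime server. The client and change_log tables are invented for the example, and a real system would also need conflict detection for rows changed on both sides.

```python
import sqlite3

def log_local_update(local, client_id, new_phone):
    # Apply the change to the local operational copy and record it in a
    # simple change log, all within one local transaction.
    with local:
        local.execute(
            "UPDATE client SET phone = ? WHERE client_id = ?",
            (new_phone, client_id),
        )
        local.execute(
            "INSERT INTO change_log (client_id, phone) VALUES (?, ?)",
            (client_id, new_phone),
        )

def synchronize(local, prime):
    # Replay the accumulated log against the prime data as one
    # transaction, then clear the log. How close to real time this
    # needs to run is exactly the design decision discussed above.
    changes = local.execute(
        "SELECT client_id, phone FROM change_log"
    ).fetchall()
    with prime:
        prime.executemany(
            "UPDATE client SET phone = ? WHERE client_id = ?",
            [(phone, client_id) for client_id, phone in changes],
        )
    with local:
        local.execute("DELETE FROM change_log")
```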
Work-in-progress data
This is where (usually, we hope) a copy of a subset of prime data is made in order to carry out some specific task. For instance, imagine a software project manager accessing all the timesheets for a given project in order to update the project's progress, write a weekly report, or reconcile progress in time against progress measured by testing. Here a number of changes might be accumulated against a subset of the prime data, and once the particular series of steps for the task is complete, the changes are all committed by sending a single transactional unit back to update the prime data. Depending on its size, the data can be kept in memory while it is being worked on (an option not without its challenges). However, even if this is the intention, the data can find itself spilling into temporary storage on disk if the machine is under-specified for the task-merely through the ministrations of the operating system. Alternatively, it may be stored on disk anyway, in an attempt to avoid major recovery problems. This is usually the data with the shortest working life.
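The project-manager example might be sketched along these lines, with the working set held in an ordinary in-memory structure and pushed back to the prime database as a single transaction; the timesheet table and the adjustment values are, again, invented for illustration.

```python
import sqlite3

def update_project_progress(prime, project_id, adjustments):
    # adjustments: {timesheet_id: hours} accumulated in memory while the
    # task is carried out. The write-back is all-or-nothing: either every
    # change commits as one transactional unit, or none of them do.
    with prime:
        for timesheet_id, hours in adjustments.items():
            prime.execute(
                "UPDATE timesheet SET hours = ? "
                "WHERE timesheet_id = ? AND project_id = ?",
                (hours, timesheet_id, project_id),
            )

# Usage sketch: gather changes during the task, then commit them in one go.
prime = sqlite3.connect("prime.db")
prime.execute(
    "CREATE TABLE IF NOT EXISTS timesheet "
    "(timesheet_id INTEGER PRIMARY KEY, project_id INTEGER, hours REAL)"
)
update_project_progress(prime, project_id=7, adjustments={1: 6.5, 2: 8.0})
prime.close()
```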