July 30, 2014

From Scratch: A New Definition of Data Integrity

Van Sickle

One purpose of geospatial products is to communicate actual conditions in the real world. Today, this most often involves a digital representation of those conditions. However, there is now and always will be a discrepancy between the real world and any representation of it. There is no perfect position and there is no perfect attribute information. A position statement recognizes this limitation when it is expressed with an uncertainty and a reliability. For example, “The position of this well is 24987910.03 N and 5983475.28 E +/- 10 feet at a 50% reliability,” tells the user that the coordinate is as likely to be within 10 feet of the well's true position as it is to be more than 10 feet from it. Clearly, as the radius of uncertainty grows, the percentage expressing the reliability grows too. Of course, it is impossible to be 100% certain of any position or any information.
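To make the link between radius and reliability concrete, here is a minimal Python sketch. It rests on an assumption the well statement itself does not make: that the horizontal error is circular normal (independent, equal-variance errors in northing and easting), so the radial error follows a Rayleigh distribution. The function names are illustrative only.

```python
import math

# ASSUMPTION: circular normal horizontal error, so the radial error is
# Rayleigh distributed. `sigma` is the per-axis standard deviation.

def reliability_within(radius, sigma):
    """Probability that the true position lies within `radius` of the coordinate."""
    return 1.0 - math.exp(-(radius ** 2) / (2.0 * sigma ** 2))

def radius_for(reliability, sigma):
    """Radius that contains the true position with the given reliability (0 to 1)."""
    return sigma * math.sqrt(-2.0 * math.log(1.0 - reliability))

def sigma_from(radius, reliability):
    """Per-axis standard deviation implied by a stated radius and reliability."""
    return radius / math.sqrt(-2.0 * math.log(1.0 - reliability))

# Start from the statement "+/- 10 feet at a 50% reliability".
sigma = sigma_from(10.0, 0.50)

# Reliability grows as the radius of uncertainty grows.
for r in (5.0, 10.0, 20.0):
    print(f"{r:>5.1f} ft -> {reliability_within(r, sigma):.1%}")

# The radius needed to state the same position at 95% reliability.
print(f"95% radius: {radius_for(0.95, sigma):.2f} ft")
```

Under that assumption, 10 feet at 50% corresponds to roughly 20.8 feet at 95%. Real positional errors are often not circular normal, so the numbers illustrate the relationship rather than describe any particular dataset.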

Given this, here is a provisional definition of data integrity for your consideration: Data integrity is the degree of accuracy and reliability of spatial and informational data. The phrase “The degree of…” is included to recognize that perfect information, positions and reliability are not only impossible, they are not the goal. The goal of this version of data integrity is a set of systems and procedures, standards and processes that provide a solid foundation for statements such as: “The position of this well is 24987910.03 N and 5983475.28 E +/- 30 feet at a 95% reliability.”

It might be worthwhile to step back a moment and ask why there needs to be any discussion of data integrity. After all, there is much geospatial data available at the touch of a button without error or reliability indicators. These datasets sans statistics seem to communicate their representation of the real world very well. Many users seem to find them entirely fit for purpose and frequently assure one another that, while the positions and attributes presented may not be perfect, they are certainly close enough. The question, “How do you know?” is often answered with a comparison. The user introduces data from another source and, finding that its positions and attributes match the base within acceptable limits, declares both proven. The fact that the accuracy and reliability of both are unknown is swept away by their close resemblance to one another.
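A small numerical sketch shows how little such a comparison can reveal. Everything in it is hypothetical: the "true" coordinates, the shared offset and the small per-source differences are made-up values chosen only to illustrate two sources that carry the same unknown error.

```python
import math

def rmse(points_a, points_b):
    """Root-mean-square horizontal distance between paired points."""
    squared = [
        (ax - bx) ** 2 + (ay - by) ** 2
        for (ax, ay), (bx, by) in zip(points_a, points_b)
    ]
    return math.sqrt(sum(squared) / len(squared))

# HYPOTHETICAL true positions (easting, northing), in feet.
truth = [(1000.0, 2000.0), (1500.0, 2500.0), (1800.0, 2100.0)]

# HYPOTHETICAL offset shared by both sources, plus small independent differences.
shared_shift = (40.0, -25.0)
noise_a = [(0.5, -0.3), (-0.4, 0.6), (0.2, 0.1)]
noise_b = [(-0.2, 0.4), (0.3, -0.5), (-0.6, 0.2)]

def build_source(noise):
    """Apply the shared offset and a source-specific difference to each true point."""
    return [
        (x + shared_shift[0] + nx, y + shared_shift[1] + ny)
        for (x, y), (nx, ny) in zip(truth, noise)
    ]

source_a = build_source(noise_a)
source_b = build_source(noise_b)

# The two sources agree closely with each other...
print(f"A vs B:     {rmse(source_a, source_b):.2f} ft")
# ...yet both are far from the positions they claim to represent.
print(f"A vs truth: {rmse(source_a, truth):.2f} ft")
print(f"B vs truth: {rmse(source_b, truth):.2f} ft")
```

The two sources differ from each other by about a foot while each sits tens of feet from the true positions. Agreement measures only how alike the sources are, not how accurate either one is.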

Geospatial professionals know, or should know, that this reasoning is dangerously false. However, pointing it out is hardly useful unless it is followed by a practical alternative. That alternative must start with the writing, implementation and enforcement of standards. In the end, the methods used in the capture, processing and presentation of data are the only grounds on which its accuracy and reliability can be stated. If those methods are unknown, the true quality of the data is unknown and unknowable.