In some ways, data is a very transient thing. The conversation that prompted this post was with a colleague who was concerned that the layout and content of the organisation’s web site was not as ‘pixel perfect’ as they would like. Whilst this was undoubtedly true, the fact that the site is supported by a Content Management System (CMS) means that changes can be made quickly and easily. Unlike a print run of 10,000 leaflets (or similar), where everything must be perfect before going to print or lots of incorrect literature has to be scrapped, a web site can be updated quickly and easily, usually with few people even aware of the change.
Similarly, the constant rate of change of web sites means that it is sometimes difficult to go back and review old items from a corporate web site, or a company’s former brand and web identity. However, the Wayback Machine offers a limited ability to go back in time and see how a web site used to look. For example, take a look at the Microsoft Home Page from 1999 to see how much sparser web sites used to be.
Conversely, much other data is very permanent and/or has a habit of returning long after the event:
- Most transactions are now logged and recorded in audit trails
- Wikis keep details of every update ever made to them, which helps when removing inappropriate edits
- Email servers keep logs of all emails sent and received, even if they appear to have been deleted – useful in fraud investigations etc.
- Contact details stored in caches and mobile phones may keep appearing when you try to contact a work colleague you have not spoken to for some time
So what relevance does this have to data quality?
The permanence of data means that any incorrect data that exists within an organisation can take significant time and effort to resolve. For example, an incorrect data entry that is identified and corrected in one system may have been copied to numerous other systems, spreadsheets and documents.
A perceived difference between the correct data and any of the incorrect copies can lead to the perception that, because two data entries do not agree with each other, both must be wrong.
Even worse, if the incorrect data exists in many places and the correct data in only a few, a less informed user may believe that the data represented in fewer places is the incorrect data and needs changing to match that in most places!
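To make that pitfall concrete, here is a minimal Python sketch (the system names and phone numbers are invented for illustration): a naive ‘majority vote’ across systems picks the stale value simply because it appears in more places, whilst consulting the designated master source does not.

```python
from collections import Counter

# Hypothetical copies of one customer's phone number across four systems.
# Only "crm" is the designated master; the other three hold stale copies.
copies = {
    "crm": "0161 555 0199",        # corrected value (master source)
    "billing": "0161 555 0100",    # stale copy
    "marketing": "0161 555 0100",  # stale copy
    "helpdesk": "0161 555 0100",   # stale copy
}

def majority_value(copies):
    """Naive reconciliation: trust whichever value appears most often."""
    counts = Counter(copies.values())
    return counts.most_common(1)[0][0]

def master_value(copies, master="crm"):
    """Reconciliation against the designated master source."""
    return copies[master]

print(majority_value(copies))  # the stale value wins by sheer numbers
print(master_value(copies))    # the corrected value from the master
```

The less informed user in the paragraph above is, in effect, running `majority_value` in their head.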
So what can be done to counteract this?
- Once incorrect data entries are identified, strenuous efforts should be made to update every copy you can
- Clear identification of the master data source is essential
- Controls on who can update data should be suitable to prevent ‘ill informed’ data corrections
- ‘Kite marking’, or some other method of marking data as verified correct, should be applied
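The last three suggestions can be sketched together in a few lines of Python (a toy illustration – the `DataStore` class, `MASTER_SOURCE` name and values are all assumptions, not a real API): corrections are only accepted from the identified master source, and an accepted record carries a ‘verified’ kite mark.

```python
from dataclasses import dataclass

MASTER_SOURCE = "crm"  # hypothetical designated master data source

@dataclass
class Record:
    value: str
    verified: bool = False  # the 'kite mark': has this value been verified?

class DataStore:
    """Toy store that only lets the master source push corrections."""

    def __init__(self):
        self.records = {}

    def correct(self, key, value, source):
        # Control on who can update data: reject 'ill informed'
        # corrections coming from anywhere other than the master source.
        if source != MASTER_SOURCE:
            return False
        self.records[key] = Record(value, verified=True)
        return True

store = DataStore()
store.correct("phone", "0161 555 0100", source="spreadsheet")  # rejected
store.correct("phone", "0161 555 0199", source="crm")          # accepted
```

In a real organisation the same idea would sit in database permissions and data governance processes rather than application code, but the principle is the same: one authoritative source, controlled updates, and an explicit marker of verified data.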
Any more suggestions to add?