In some ways, data is a very transient thing. The conversation that prompted this post was with a colleague who was concerned that the layout and content of the organisation’s web site was not as ‘pixel perfect’ as they would like. Whilst this was undoubtedly true, the fact that the site is supported by a Content Management System (CMS) means that changes can be made quickly and easily. Unlike a print run of 10,000 leaflets (or similar), where everything must be perfect before going to print or lots of incorrect literature has to be scrapped, a web site can be updated quickly and easily, usually with few people even aware of the change.
Similarly, the constant rate of change of web sites means that it is sometimes difficult to go back and review old items from a corporate web site, or a company’s former brand and web identity. However, the Wayback Machine offers a limited ability to go back in time and see how a web site used to look. For example, take a look at the Microsoft Home Page from 1999 to see how much sparser web sites used to be.
Conversely, much other data is very permanent and/or has a habit of returning long after the event:
- Most transactions are now logged and recorded in audit trails
- Wikis keep details of every update ever made to them, which helps when removing inappropriate edits
- Email servers keep logs of all emails sent and received, even if they appear to have been deleted – useful in fraud investigations etc.
- Contact details stored in caches and mobile phones may keep appearing when you try to contact a work colleague you have not spoken to for some time
So what relevance does this have to data quality?
The permanence of data means that any incorrect data that exists within an organisation can take significant time and effort to resolve. For example, an incorrect data entry that is identified and corrected in one system may have been copied to numerous other systems, spreadsheets and documents.
A perceived difference between the correct data and any of the incorrect copies can lead to the perception that, because two data entries do not agree with each other, both must be wrong.
Even worse, if the incorrect data exists in many places and the correct data in only a few, a less informed user may believe that the data represented in fewer places is the incorrect data and needs changing to match that in most places!
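To make that pitfall concrete, here is a minimal Python sketch (the system names and phone numbers are invented for illustration): a naive ‘majority vote’ across systems picks the stale value simply because it appears in more places, whilst consulting the designated master source does not.

```python
from collections import Counter

# Hypothetical copies of one customer's phone number across four systems.
# Only "crm" is the designated master; the other three hold stale copies.
copies = {
    "crm": "0161 555 0199",        # corrected value (master source)
    "billing": "0161 555 0100",    # stale copy
    "marketing": "0161 555 0100",  # stale copy
    "helpdesk": "0161 555 0100",   # stale copy
}

def majority_value(copies):
    """Naive reconciliation: trust whichever value appears most often."""
    counts = Counter(copies.values())
    return counts.most_common(1)[0][0]

def master_value(copies, master="crm"):
    """Reconciliation against the designated master source."""
    return copies[master]

print(majority_value(copies))  # the stale value wins by sheer numbers
print(master_value(copies))    # the corrected value from the master
```

The less informed user in the paragraph above is, in effect, running `majority_value` in their head.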
So what can be done to counteract this?
- Once incorrect data entries are identified, strenuous efforts should be made to update every copy you can
- Clear identification of the master data source is essential
- Controls on who can update data should be suitable to prevent ‘ill informed’ data corrections
- ‘Kite marking’, or some other method of marking data as verified correct, should be applied
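The last three suggestions can be sketched together in a few lines of Python (a toy illustration – the `DataStore` class, `MASTER_SOURCE` name and values are all assumptions, not a real API): corrections are only accepted from the identified master source, and an accepted record carries a ‘verified’ kite mark.

```python
from dataclasses import dataclass

MASTER_SOURCE = "crm"  # hypothetical designated master data source

@dataclass
class Record:
    value: str
    verified: bool = False  # the 'kite mark': has this value been verified?

class DataStore:
    """Toy store that only lets the master source push corrections."""

    def __init__(self):
        self.records = {}

    def correct(self, key, value, source):
        # Control on who can update data: reject 'ill informed'
        # corrections coming from anywhere other than the master source.
        if source != MASTER_SOURCE:
            return False
        self.records[key] = Record(value, verified=True)
        return True

store = DataStore()
store.correct("phone", "0161 555 0100", source="spreadsheet")  # rejected
store.correct("phone", "0161 555 0199", source="crm")          # accepted
```

In a real organisation the same idea would sit in database permissions and data governance processes rather than application code, but the principle is the same: one authoritative source, controlled updates, and an explicit marker of verified data.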
Any more suggestions to add?