Data quality assessments can provide large amounts of useful information. However, to gain a complete perspective on organisational data quality, it is essential to consider three key perspectives:
- The data itself – entries in databases and spreadsheets
- The requirements for data – arising from processes and organisational objectives
- The data subject – the person, product, activity or event represented by the data
Considering data quality from only one or two of these perspectives is insufficient to understand organisational data quality. Having a lot of information about some aspects of data quality may mask the fact that you are missing key data quality dimensions and insights. Activities to improve data quality may therefore not be correctly targeted.
Looking at each of these perspectives in turn:
There are many data profiling tools available that can rapidly deliver a significant amount of information about a data set. For example, ‘out of the box’ analysis of a data table or data set by a profiling tool will generate many insights:
- The number of records
- Count of unique values (and the number of duplicates)
- Proportion of null values (or blank entries)
- Maximum and minimum values and perhaps the average for numeric values
- Format patterns e.g. CCNN NCC as a format for a UK post code
- When multiple tables in the same database are analysed, the profiling tool can automatically suggest likely relationships between data tables, primary and foreign keys
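The single-column statistics listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a description of how any particular profiling tool works. The function names, the sample values and the choice to treat None and empty strings as null are all assumptions made for the example.

```python
import re
from collections import Counter


def format_pattern(value):
    """Replace letters with C and digits with N; other characters are kept."""
    pattern = re.sub(r"[A-Za-z]", "C", value)
    return re.sub(r"[0-9]", "N", pattern)


def profile_column(values):
    """Return basic profile statistics for one column of values.

    Assumption for this sketch: None and '' both count as null.
    """
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    profile = {
        "records": len(values),
        "unique_values": len(counts),
        "duplicates": len(non_null) - len(counts),
        "null_proportion": (len(values) - len(non_null)) / len(values) if values else 0.0,
    }
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    if numeric:
        profile["min"] = min(numeric)
        profile["max"] = max(numeric)
        profile["mean"] = sum(numeric) / len(numeric)
    text = [v for v in non_null if isinstance(v, str)]
    if text:
        # Count how often each format pattern occurs in the column
        profile["patterns"] = Counter(format_pattern(v) for v in text)
    return profile


# Hypothetical sample data
postcode_profile = profile_column(["SW1A 1AA", "M1 1AE", "", "M1 1AE", None])
weight_profile = profile_column([1.5, 2.0, 4.0])
```

Note that every figure produced here describes the data values only. Nothing in the output says whether the postcodes belong to the right people or whether the weights are correct.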
Dedicated data profiling tools are not essential to measure and understand data quality; you can also use a wide variety of analytic tools for this purpose, including spreadsheets. Whichever technology is used, the downsides are similar from an organisational perspective.
Data profiling drawbacks
Data profiling results can be delivered quickly and with minimal user input, and they can provide a large volume of insights into your data. However, that volume can give a deceptively incomplete view of your data quality. In isolation, these results are of limited value to an organisation, for example:
- The accuracy of your data cannot be assessed without considering the data subject. Just because your data is plausible does not mean that it correctly represents the subject. For example, a car that is actually red may be recorded, quite plausibly, as blue. Only when you compare the data to the car itself will you know that the data is inaccurate
- Profiling a data table or database provides one representation of the data subject but is not an enterprise view of your data quality
- Assessing data quality without linking to business requirements means that there will be no checks that the data is what the organisation needs. You may have millions of rows of data, but may still not be able to meet your organisational objectives if key fields are not recorded
- You cannot identify and deliver appropriate data improvement projects (which could involve improving existing data and/or gathering new data) without considering changes to data requirements to meet organisational objectives
According to ISO 9000 (and echoed in many other standards), quality is a matter of meeting requirements, often summarised as “conformance to requirements”.
Assessing data without considering data requirements only illustrates the characteristics of the data, not its quality.
Data profiling tools can be configured to include rules translating organisational requirements into data quality assessments – a linking of the data values with data requirements. Developing data quality rules requires a degree of domain knowledge to ensure that rules are appropriate for the data and context. Such rules enable assessment of the validity and consistency of your data and identify data that is outside agreed limits. You are now developing a broader view of your data quality.
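As a rough sketch of what such rules might look like when written outside a profiling tool, the Python below expresses two requirements as small check functions. The field names, the 0 to 1000 kg weight limit and the simplified postcode pattern are hypothetical and chosen only for illustration; real rules would come from people with the relevant domain knowledge.

```python
import re

# Hypothetical rules: each maps a field name to a check returning True when valid.
# The weight limit and the simplified postcode pattern are assumptions.
RULES = {
    "weight_kg": lambda v: isinstance(v, (int, float)) and 0 < v <= 1000,
    "postcode": lambda v: isinstance(v, str)
    and re.fullmatch(r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}", v) is not None,
}


def check_record(record, rules=RULES):
    """Return the names of fields that are missing or that break their rule."""
    return [
        field
        for field, rule in rules.items()
        if field not in record or not rule(record[field])
    ]


# Hypothetical records
valid_result = check_record({"weight_kg": 2.5, "postcode": "M1 1AE"})
bad_weight_result = check_record({"weight_kg": -1, "postcode": "M1 1AE"})
missing_field_result = check_record({"postcode": "12345"})
```

A record that passes every rule is valid against the stated requirements. It may still be inaccurate, because the rules never look at the data subject.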
Assessing data alongside data requirements still only provides a partial view of data quality and what is needed to meet the overall requirements of the organisation:
- Again, not comparing the data to the data subject itself stops you gaining any information about the accuracy of your data
- If the weight of a product, a key attribute, does not exist in the relevant data tables, then you will not know that the data is missing if you are only analysing from the perspective of the data values
- Failing to take account of the data subject and its context may mean data requirements are not realistic. They may be impossible to meet, or the data may be inordinately expensive to gather and assess. For example, requiring the location of road signs to the nearest centimetre will require more expensive GPS equipment than ‘standard’ equipment. The measurement process may also take significantly longer, increasing the data acquisition costs and the later costs of checking the data
- Meeting data requirements without considering the data stores existing in an organisation means you may select the wrong data store to record a new data requirement
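The product weight example in the list above can be made concrete with a simple comparison between the attributes the organisation requires and the columns a data store actually holds. This is only a sketch; the attribute and column names are hypothetical.

```python
def missing_attributes(required, columns):
    """Return required attribute names that have no matching column."""
    present = set(columns)
    return [name for name in required if name not in present]


# Hypothetical requirement list and table columns
required_attributes = ["product_id", "description", "weight_kg"]
product_table_columns = ["product_id", "description", "price"]

gaps = missing_attributes(required_attributes, product_table_columns)
```

A check like this is only possible once the requirements have been written down. Profiling the table on its own would never report a column that does not exist.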
A Data Subject is the thing or entity that the data represents and could be a product, service, customer, activity, event etc. Considering data quality from the context of the data subject is essential to fully understand your organisational data quality.
It is extremely difficult to assess the accuracy of data without reference to the data subject itself. For example, a date of birth of 3/6/1981 is valid and plausible, so it appears correct. Nothing in the value itself reveals that it was recorded in US (month/day) format and, read as day/month, should actually be 6/3/1981. Such errors are unlikely to be spotted without reference to the data subject, or another authoritative data source.
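The date example can be illustrated with a short Python sketch that parses the same text under both conventions. The function names are invented for this example. The code can flag a value as ambiguous, but it cannot say which reading is right; only the data subject or another authoritative source can.

```python
from datetime import datetime


def date_readings(text):
    """Parse text as day/month/year and as month/day/year.

    Returns a dict of the readings that are valid calendar dates.
    """
    readings = {}
    for label, fmt in (("day/month/year", "%d/%m/%Y"), ("month/day/year", "%m/%d/%Y")):
        try:
            readings[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # not a valid date under this convention
    return readings


def is_ambiguous(text):
    """True when both conventions give valid but different dates."""
    return len(set(date_readings(text).values())) > 1


readings = date_readings("3/6/1981")
```

Here "3/6/1981" yields 3 June 1981 under one convention and 6 March 1981 under the other, so both pass any validity check. A value such as "25/12/1981" is unambiguous because there is no 25th month.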
Reference to the data subject is essential for ‘inventory accuracy’:
- Is there one, and only one, entry for each data subject? (Avoiding data duplication or missing data entries)
- For each data entry, is there a corresponding data subject? (Avoiding ‘ghost’ data entries)
These key checks ensure, for example:
- That a single product is not listed twice (or more times) against different suppliers, descriptions etc.
- That Jo Smith, who left the organisation five years ago, is not still listed as an ERP system user
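The two inventory checks above can be sketched as a comparison between the identifiers in a data store and a list of data subjects known to exist. In this illustration the list of current staff stands in for the data subjects themselves and is assumed to be authoritative; the user names are hypothetical.

```python
from collections import Counter


def inventory_check(record_ids, subject_ids):
    """Compare data entries against the data subjects known to exist.

    Returns duplicated entries, 'ghost' entries with no data subject,
    and data subjects with no entry at all.
    """
    counts = Counter(record_ids)
    subjects = set(subject_ids)
    return {
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
        "ghost": sorted(i for i in counts if i not in subjects),
        "missing": sorted(subjects - set(counts)),
    }


# Hypothetical data: ERP user accounts versus people currently employed
erp_users = ["asmith", "jsmith", "bkhan", "bkhan"]
current_staff = ["asmith", "bkhan", "clee"]

result = inventory_check(erp_users, current_staff)
```

The check is only as good as the list of data subjects it is compared against, which is why reference to the real world (or a trusted source) matters.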
In many situations the data representing a data subject will be stored in multiple data stores. For example, information about an employee is likely to be recorded in:
- The ERP systems they use for their normal work
- The HR system
- The training system
- Building access control system
Employees may also appear in numerous spreadsheets that are outside the control of IT – for example, details of employees involved in a charity event recorded in a local spreadsheet.
Assessing the data quality dimension of ‘consistency’ requires analysis across all data stores holding information about the data subject. For example, a particular product with a unique ID may appear in the online store, finance system, procurement systems and warehouse/logistics systems. Past updates to the product description may not have been copied across to all data stores, which represents a consistency issue.
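A consistency check of this kind might be sketched as follows. The store names, product ID and descriptions are hypothetical, and each data store is modelled simply as a dictionary keyed by product ID, which is an assumption made to keep the example short.

```python
def inconsistent_values(stores, subject_id, field):
    """Return {store_name: value} when stores disagree on a field, else {}.

    Stores that hold no record for the subject are skipped.
    """
    found = {
        name: records[subject_id].get(field)
        for name, records in stores.items()
        if subject_id in records
    }
    return found if len(set(found.values())) > 1 else {}


# Hypothetical data stores, each keyed by product ID
stores = {
    "online_store": {"P100": {"description": "Blue widget, 10 cm"}},
    "finance": {"P100": {"description": "Blue widget"}},
    "warehouse": {"P100": {"description": "Blue widget, 10 cm"}},
}

conflicts = inconsistent_values(stores, "P100", "description")
```

The result shows which stores disagree, not which value is correct. Deciding that still needs a reference to the product itself or to an agreed master record.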
If the data subject is a person and they understand where their data is stored, they will be better able to understand what needs to be changed if they, for example, move house.
Assessing data quality across these perspectives
The sections above demonstrate how the key perspectives of data values, data requirements and data subjects must all be considered when assessing organisational data quality. However, that is not enough!
All these aspects of data quality should be considered together to understand and define the data architecture of your organisation. This should include all data sources, not just those managed by IT, and yes, that also includes user-created spreadsheets. In turn, this informs the approaches to master data management and data flows across and outside the organisation. Your data governance activities should likewise cover all aspects of data and data quality, not just the data stores managed by the IT department.
Benefits of a complete view of your organisation’s data quality
Why should you be concerned about understanding organisational data quality? As a previous blog post explains, the benefits of improving data quality include:
- The removal of inefficiencies from your organisation
- Improvements to the quality of decision making and effectiveness
- Better performance and greater efficiency
Briefly, poor data quality will lead to adverse financial impacts for your organisation. In today’s challenging times, can you afford for your organisation not to be as efficient and effective as it can be?
Why not get in contact to discuss how we can help you deliver greater performance for your organisation, enabled by better data quality management?