Recently the UK government published extracts from the COINS database. COINS stands for the Combined On-line Information System, and the extract contains details of all expenditure by UK Government Departments over £25K for the years 2008/9 and 2009/10. Expenditure is divided into budget figures and actual and forecast out-turn data. The release of this data comes as a result of the current coalition government's desire for greater openness.
If we take a look at these extracts from a data quality perspective, then this may give an indication of whether any resulting analysis needs to be treated with “a pinch of salt”.
The data extracts contain 24 million entries stored in 4-5 GB text files for 2009/10 alone and, due to their size, are intended for professional analysis organisations and media organisations to analyse (for example, the Guardian newspaper has created an on-line analysis tool).
A comprehensive document has been published which explains the data structures, usage and treatment of the data and coding for the extract in a metadata form. It also contains the statement:
"The data on COINS are quality-assured and complete at the level at which they are required for the following purposes:
- fiscal management;
- operational publications (e.g. Main and Supplementary Estimates); and
- statistical publications (e.g. Public Expenditure Statistical Analyses, the joint ONS/Treasury Public Sector Finances statistical bulletin and the National Accounts).
Lower levels of data are not quality assured by the Treasury. Individual departments can to some extent choose the level of granularity that they use within pre-defined aggregates set by the Treasury. Lower level detailed data may therefore appear incomplete and be inconsistent across departments.”
Analysis by journalists is already identifying interesting areas of expenditure and omissions from the COINS data, for example some defence and security related expenditure (which is understandable); however, there do not appear to be clear guidelines on what has been omitted.
If we look at the different attributes usually associated with data quality metrics, what does that tell us?
- Accuracy – The above QA statement suggests that individual departments are responsible for their own accuracy checking to the level of granularity they choose
- Completeness – Based on assessment by journalists, plus different departments’ interpretations of the £25k granularity cut-off, the completeness of the data is unclear
- Validity – A quick review of some data extracts suggests that there are no obvious validity issues
- Consistency – Similarly, the extensive coding structure suggests that there is likely to be a level of consistency across departments
- Precision – Expenditure is displayed to the nearest £1,000 which, considering the size of some expenditure items, appears reasonable
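The attribute checks above could be sketched programmatically. A minimal Python example of per-record quality checks, assuming hypothetical field names ("Department", "Value", "Year") rather than the real COINS column headings:

```python
# Years covered by the published extracts (labels here are assumptions).
VALID_YEARS = {"2008-09", "2009-10"}

def check_row(row):
    """Return a list of data-quality issues found in one record."""
    issues = []
    if not row.get("Department", "").strip():
        issues.append("missing department code")   # completeness
    try:
        float(row.get("Value", ""))
    except ValueError:
        issues.append("non-numeric value")         # validity
    if row.get("Year") not in VALID_YEARS:
        issues.append("unrecognised year")         # consistency
    return issues
```

Running such checks across the whole extract would give a rough quantitative picture of validity and completeness, though it says nothing about accuracy, which only the originating departments can verify.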
So, will the publication of this data mean that the previous disagreements about the interpretation of data will disappear?
This seems unlikely, for three reasons:
- The overall complexity of the data and coding structures suggests that, without good familiarity with these structures and the business metadata, much early analysis is likely to be inaccurate. Over time the analysis of the published data should improve, but initially any journalist who creates a story based on “The Government spent £X on Y” can expect to be corrected by the Government stating “but you have not taken into account Z”
- Lack of clarity on the completeness of the data, due either to security or other reasons or to the choice of granularity, means that it will be very difficult to assess what has not been included
- Similarly, the statements on the accuracy of the data and its stated purposes suggest that strategic analysis of larger expenditure items will probably be reasonably accurate; however, any analysis relying on large volumes of smaller-value expenditure items may be less accurate
So until there is a better understanding of the data structures and their quality, I suggest that any headline statements derived from COINS data be treated with a pinch of salt.