Thursday, October 09, 2008

Is it Legal to Test on Live Data?

This comes up again and again.  When I was a programmer, many years ago, working for the then Police National Computer Unit and coding in Burroughs Extended Algol, we tested systems in a test harness while coding them, and with heavyweight batches of test data during system integration and acceptance trials.  We did not test on live data because we had no need to.  And we had no need to because there was sufficient funding in place to let us create highly complex test data that tried out every nuance of the system.


But commercial systems often test on live data.  And that, frankly, is unlawful unless there is both a Fair Processing Notice at the point of collection stating that the data may also be used for system testing, and a "purpose" notified to the UK Information Commissioner that the data will be used for testing in addition to all its other uses.

There's an excellent guide here that goes into good detail about this, summarising it as:

To comply with the DPA in the area of application testing and development, the most straight forward solution is to anonymise or de-identify the information. Software products can provide an effective solution to such anonymisation, whilst retaining the integrity and usability of data for the testing and development environment.

In those circumstances where ‘real’ data are used, the quantity of such data should be reduced to the bare minimum needed for application testing. Where testing is to be carried out by a third party supplier, the supplier should be vetted on its security procedures and contractually obliged to ensure that appropriate technical and organisational security measures are in place.

I do not see the need for the first sentence of the second paragraph.  It's a walk in the park to anonymise data, not just to meet the needs of the law, but to meet the very human needs of anyone accidentally identified because their data escapes into the wild!  Here's a simple example:

Take the table with personally identifying information and sort ONLY the field holding the surname into a->z order.  Now sort ONLY the forename field into z->a order.  Split any email address field at the @ sign and trash the portion before the @.  Take the field with the first line of the address and transpose it by 73%, then sort into a->z order.  Your data is now anonymous, but with real, ordinary field contents.
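For anyone who wants to try this, here is a rough sketch of those steps in Python.  It is only an illustration: the field names (surname, forename, email, address1) and the function name anonymise are made up for the example, and a real table would need its own column names.  The "transpose it by 73%" step is left out because its exact mechanics are not spelled out above, and the a->z sort that follows it fixes the final order of that field in any case.

```python
def anonymise(rows):
    """Return copies of `rows` with each identifying column reordered independently.

    `rows` is a list of dicts. The keys used here (surname, forename,
    email, address1) are assumed names for illustration only.
    """
    # Sort each column on its own, so values no longer line up with
    # the record they originally came from.
    surnames = sorted(r["surname"] for r in rows)                    # a->z
    forenames = sorted((r["forename"] for r in rows), reverse=True)  # z->a
    addresses = sorted(r["address1"] for r in rows)                  # a->z

    result = []
    for row, surname, forename, address in zip(rows, surnames, forenames, addresses):
        new_row = dict(row)  # copy, so the input rows are not modified
        new_row["surname"] = surname
        new_row["forename"] = forename
        new_row["address1"] = address
        # Split at the first @ and discard everything before it.
        _, at_sign, domain = row["email"].partition("@")
        new_row["email"] = at_sign + domain
        result.append(new_row)
    return result
```

Any field not named in the sketch is copied through unchanged, so other identifying columns (phone numbers, dates of birth and so on) would need the same kind of treatment.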

If you can use any data record to identify a living individual after that then the file had far too few records in it for testing purposes anyway, and you should have made the data up in the first place.

It's a matter of common sense, isn't it?