OK, some of this material isn’t new but I’ve been asked to edit a special (Information) journal edition on (something like) ‘Will AI, Big Data and the IoT Mean the End of Privacy?’ The plan is to circulate a ‘discussion paper’ to encourage submissions. What follows is an early draft of that (extended from The Prof on a Train Game) so it won’t hurt to get it ‘out there’ as soon as possible. Comments welcome below, by email, message, whatever …
The embodiment of the potential loss of privacy through a combination of AI, big data and IoT technology might be something like an integrated app capable of recognising anyone, anytime, anywhere: a sort of ‘Shazam for People’, but one capable of returning seriously personal material about the individual. How credible is such a system? And what might stop it?
Introduction: A Future Scenario?
It’s 2025 or thereabouts. You meet someone at an international conference. Even before they’ve started to introduce themselves, your IoT augmented reality glasses have told you everything you needed to know … and a lot more you didn’t.
Jerry Gonzales. Born (02/11/1970): Glasgow, UK, dual (plus USA) citizenship; 49 years old. Married 12/12/1994 (Ellen Gonzales, nee Schwartz), divorced 08/06/2003; two daughters (Kate: 23, Sarah: 17); one son (David: 20). Previous employment: Microsoft, IBM, University of Pwllheli; current: unemployed. Health: smoker, heavy drinker, recurrent lung problems, diabetic, depression. Homeowner (previous); now public housing. Credit rating: poor (bankruptcy 10/10/2007); Insurance risk: high. Politics: Republican. etc., …, Sport: supports Boston Red Sox and Manchester United FC. …, Pornography: prefers straight but with mild abuse …, etc., etc.
And that’s the simple basis of this paper, along with the overlapping questions that naturally follow:
- How likely (futurology) is this to happen?
- What’s necessary (technology) to allow it?
- What can be done (legally, politically, morally, etc.) to stop it?
However, to begin this discussion, we consider a comparable, essentially parallel, application of technology: one that already exists and is not merely legal but almost universally considered a positive use of mobile devices and the Internet.
A Theoretical Foundation: Shazam for People?
The music recognition system, Shazam, runs as an app on most mobile phones and tablets. Using the device’s microphone, Shazam ‘listens’ to any (well, nearly any) piece of music for a few seconds, identifies it and informs the curious user. (It perhaps also offers the opportunity to purchase and download the track, which is not irrelevant to the discussion that follows.) How does it do this?
An indication of Shazam’s modus operandi lies in its ability to recognise a single piece of music under diverse conditions. The same track will sound very different when played, with no outside interference, on high-quality equipment, compared with listening to it in (say) a car against some engine rumble, or hearing it as background entertainment in a noisy public place such as a shop, bar or cafe. Converting each recording to a standard format (MP3, for example), then comparing bits, will fail entirely.
Instead, Shazam detects simpler, quality-invariant features of the music such as the tempo, peak rate energy, the number of times the audio signal, across different frequencies, crosses the zero point (the zero-crossing rate), various spectral analyses, etc. Although these invariants prove to be a more effective approach than bitwise comparison, two points are immediately obvious … and important:
- It’s highly unlikely that any of these features can be detected/measured/recorded perfectly
- No single feature, in isolation, is going to be remotely sufficient to identify the piece uniquely
In other words, a simple, one-dimensional approach won’t work; and yet Shazam does. Instead, it combines several of these imperfect invariant features, as best it can, into an ‘acoustic fingerprint’, which – if constructed effectively – may uniquely identify the track. (A fundamental principle of combining datasets in big data analytics is that increased data dimensionality decreases anonymity, whatever the subject.) This acoustic fingerprint can then be sent from the device and queried against an Internet-based lookup (database). Information on the matching music is then returned to the device and offered to the user. The essential components of such a system are therefore:
- As accurately as possible, and yet imperfectly, collect identifying features of the music
- Combine these individual features into a single (hopefully unique) acoustic fingerprint
- Transmit this fingerprint and query against a global database
- Return the matched result and all available information to the user
The mathematics (transforms, etc.) of the construction of the acoustic fingerprint are an unnecessary distraction from our discussion. Suffice to note that it works: the result is sufficiently discriminatory to identify the piece and that this is further made possible by large proprietary databases owned by, or available to, the system as a whole. With this in place, the final step of returning the result and any relevant associated information is trivial.
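Although we set the real mathematics aside, the principle can be illustrated with a deliberately crude sketch. This is not Shazam's actual algorithm: the three features, their values, the error scales and the track names below are all invented for illustration. It only shows how features that are individually ambiguous, and imperfectly measured, can still pick out a single track once combined and matched against a reference database.

```python
import math

# Hypothetical reference database: each track is reduced to three invented
# features (tempo in bpm, zero-crossing rate, spectral centroid in kHz).
DATABASE = {
    "Track A": (120.0, 0.10, 2.0),
    "Track B": (121.0, 0.30, 1.0),
    "Track C": (90.0, 0.11, 2.1),
}

# Assumed typical measurement error for each feature, used to put the
# features on a comparable scale before combining them.
SCALES = (10.0, 0.05, 0.5)


def distance(fp1, fp2, scales=SCALES):
    """Scaled Euclidean distance between two feature tuples."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(fp1, fp2, scales)))


def identify(fingerprint, database=DATABASE):
    """Return the name of the database entry closest to the measured fingerprint."""
    return min(database, key=lambda name: distance(fingerprint, database[name]))


def candidates_by_single_feature(value, index, tolerance, database=DATABASE):
    """Tracks that are consistent with ONE measured feature, within a tolerance."""
    return sorted(
        name for name, fp in database.items() if abs(fp[index] - value) <= tolerance
    )


if __name__ == "__main__":
    # An imperfect measurement of Track A, as if heard in a noisy room.
    heard = (118.0, 0.12, 1.9)
    print(candidates_by_single_feature(heard[0], 0, 5.0))   # tempo alone: ambiguous
    print(candidates_by_single_feature(heard[1], 1, 0.03))  # zero-crossings alone: ambiguous
    print(identify(heard))                                  # combined: one closest match
```

With these invented numbers, tempo alone cannot separate Track A from Track B, and the zero-crossing rate alone cannot separate Track A from Track C, yet the combined comparison is closest to Track A. A real system would use far richer features and a much faster lookup than this linear search.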
The reason for this established comparison should be clear, because an obvious question then is how viable such a system could be for people? At a high level, the conversion of Shazam’s operation to a form of ‘Shazam for People’ (SfP) is simple enough in theory, but each step poses questions and challenges in practice. However, here’s the obvious initial attempt:
- As accurately as possible, collect identifying features of the person in question. How? What features might be available?
- Combine these features into a single ‘personal identification mark’ (PIM). Will the result be sufficiently discriminatory? Can it be unique?
- Transmit the PIM and query against a global database. Is there/can there be such a database for people? (Or is one needed?)
- Return identification and all available information to the user. What parts of this are legal/illegal? Realistically, how effectively could it be prevented?
We now consider each component of the SfP process in detail.
Considering the principles, established above, that no single feature need be captured perfectly, or can be expected to act as a sole means of identification, several techniques are credible as individual contributions to a PIM. Each of the following recognition techniques is, at worst, an area of active research and many are well-developed in military, security, biometric, commercial, etc. spheres. Each potentially serves as a credible ‘identification vector’ (IV).
- Face recognition
- Gait analysis
- Body size/shape/proportion detection
- Voice, pitch, tone, language, dialect, accent, etc.
- Chemical/biological/medical analysis (e.g. breath composition, breathing rate, pulse, blood pressure, electro-galvanic skin properties)
- Special characteristics (e.g. scars, injuries, tattoos, piercings)
- Corrective/enhancement technology (currently glasses, lenses, hearing aids, etc. but more advanced ‘implants’ in time?)
- Unique biometric identification where available (e.g. retina patterns, ‘conventional’ fingerprints, DNA)
Each of these, inaccurate and insufficient individually, may form a useful component of a compound PIM. However, the concept can be taken further: for additional IVs, there may be situational/contextual data available of comparable value:
- Location (where they are, where they’ve come from, where they’re going)
- Association (who they’re with, or talking to)
- Occupation (what they’re doing, reading, watching, saying, using, etc.)
- Appearance (what they’re wearing, carrying, etc.)
- And finally, but potentially very significantly, any technology they may be carrying (or wearing or, in future perhaps, embedded within them). If interaction with any of it is possible then a particularly useful IV or set of IVs follows.
Which (combination) of these IVs could be captured in practice, of course, would depend on both the technology being used and its context. Gait analysis, for example, requires movement. Smart glasses, by comparison, could perhaps detect most visual signals but would require some extension to perform chemical, biological or medical analysis. A more sophisticated contact-based approach could combine more of the latter but might need additional output to convey any returned results to the user. Identification of any indicative technology carried would necessitate IoT-level protocol cooperation. A single device capturing all possible IVs may be unrealistic – at least for the immediate future – so, for any given practical subset, will unique identification be possible?
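The effect of capturing only a subset of IVs can be pictured with a toy example. The sketch below assumes a tiny, entirely fictional population described by four coarse, invented attributes, and counts how many people remain consistent with whatever has been captured. It ignores measurement error altogether, so it illustrates only the narrowing effect of adding dimensions, not a workable matching method.

```python
# Entirely fictional population; the attribute names and values are invented.
POPULATION = [
    {"id": 1, "height_band": "tall", "accent": "scottish", "glasses": True, "gait": "long"},
    {"id": 2, "height_band": "tall", "accent": "scottish", "glasses": False, "gait": "long"},
    {"id": 3, "height_band": "tall", "accent": "english", "glasses": True, "gait": "short"},
    {"id": 4, "height_band": "short", "accent": "scottish", "glasses": True, "gait": "long"},
    {"id": 5, "height_band": "tall", "accent": "scottish", "glasses": True, "gait": "short"},
    {"id": 6, "height_band": "short", "accent": "english", "glasses": False, "gait": "short"},
]


def candidates(captured, population=POPULATION):
    """Return the ids of everyone consistent with the captured subset of IVs."""
    return [
        person["id"]
        for person in population
        if all(person.get(key) == value for key, value in captured.items())
    ]


if __name__ == "__main__":
    captured = {}
    # Add one captured IV at a time and watch the candidate set shrink.
    for key, value in [
        ("height_band", "tall"),
        ("accent", "scottish"),
        ("glasses", True),
        ("gait", "long"),
    ]:
        captured[key] = value
        print(sorted(captured), "->", candidates(captured))
```

In this invented population, no single attribute identifies person 1, and the first three together still leave two candidates; only the fourth makes the match unique. A device limited to a subset of IVs would therefore return a shortlist and not a single name, which may still be enough if contextual IVs can finish the job.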
Currently, Shazam’s effective database (including those components acquired from, or in cooperation with, third parties) runs to around eleven million tracks. Its method for constructing a unique acoustic fingerprint is sufficiently sophisticated to give identification of ‘extremely high but unpublished’ reliability from a 10-second sample. Can the proposed SfP’s PIM, from its available IVs, be expected to make a sufficiently accurate identification from around seven billion people in the world?
Realistically, probably not – at least not yet. The initial problem is less the theoretical numerical challenge; rather the practical one of technological engagement. The world is unequal. Whilst many in its developed regions have already left their digital impression on (say, in simple terms) the Internet and its data, most elsewhere are effectively technologically anonymous. This, however, is both a help and a hindrance to SfP’s chances. Until such time (if ever, of course) as all parts of the planet share the related benefits and perils – uses and abuses – of connective technology, those that are excluded increase the chances of identification for those that remain by reducing the number of potential targets and thus shortening the ‘odds’ of success.
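The numerical side of the challenge can be made a little more concrete with some back-of-envelope arithmetic. The sketch below rests on a strongly idealised assumption: that each IV is independent of the others and splits the population evenly into a known number of classes. Real IVs are correlated, noisy and unevenly distributed, so the figures are optimistic. The class counts and the 'half the world' figure are invented purely for illustration.

```python
import math


def bits_to_single_out(population_size):
    """Bits of information needed to single out one member of a population."""
    return math.log2(population_size)


def expected_candidates(population_size, classes_per_feature):
    """Expected shortlist size after applying independent, evenly split features."""
    remaining = float(population_size)
    for classes in classes_per_feature:
        remaining /= classes
    return remaining


if __name__ == "__main__":
    world = 7_000_000_000      # 'around seven billion people'
    tracks = 11_000_000        # 'around eleven million tracks'
    print(round(bits_to_single_out(tracks), 1), "bits to single out a track")
    print(round(bits_to_single_out(world), 1), "bits to single out a person")
    # Hypothetical: if only half the world were 'connected', one bit is saved.
    print(round(bits_to_single_out(world // 2), 1), "bits within half the world")
    # Hypothetical IVs splitting the population into 1000, 100, 50, 20 and 10 classes.
    print(expected_candidates(world, [1000, 100, 50, 20, 10]), "expected candidates")
```

On these assumptions, going from eleven million tracks to seven billion people costs fewer than ten extra bits of discriminating information, and excluding half the world saves exactly one. Seen this way, the obstacle is less the size of the population than the difficulty of collecting enough reliable, independent IVs.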
Ultimately, however, the mathematical viability of any SfP system will depend on the range and quality of IVs that can be collected and the efficacy of their combination into a PIM, which in turn depends on the underlying technology available. Every aspect of this improves almost daily. Considering the current rate of technological emergence, development and advancement, if a completely reliable universal approach is unrealistic today, it would be brave to insist it will remain so a few years into the future. As an example, Blippar already promote a system, informally described as ‘Shazam for Faces’, capable of identification of around 400,000 ‘celebrities and public figures’ with 99% accuracy, using face recognition alone. There have also been some disturbingly accurate hoaxes.
Once again, for the purposes of this simple discussion paper, we omit the mathematics of transforms, etc. turning individual IVs into a PIM. A more interesting challenge, however, lies in the existence, or otherwise, or even the ultimate necessity of, the global database against which PIMs would be matched.
Central Databases and Querying
This may be the most difficult component of SfP – and the most interesting discussion. The concept of a central, Internet-based, queryable database (of people) trivially requires two things:
- The existence of the database itself, and
- A search/match standard/protocol: presumably the personal identification mark (PIM)
so it may clarify the argument to consider each of these separately. (It doesn’t look like a particularly difficult exercise to join the two together if they exist.)
1. A Central Database of People?
Is a digital database of everyone possible? Can it be?
Well, not yet; that’s for sure. As already mentioned, a large fraction of the world’s population has no Internet presence in any form whatsoever. But either that will change or our SfP will have no interest in them anyway. So it’s reasonable to start with what we’ve got. Where do we already have partial human databases? Could they grow to become what SfP needs?
There certainly are partial DBs already. For ‘notables’, there’s Wikipedia (and worse!); for academics, there’s Google Scholar; and many similar platforms for restricted coverage of other areas. And for everyone else, there are social media (Facebook, Twitter, etc.) profiles. Some of these are public, some private, others configurable to be something in between. Some use direct input from the ‘person of interest’, some don’t. Between them, they pretty much cover all the ground needed; but, by-and-large, they still have one thing in common: they’re all legal.
But there’s another type emerging … Just as an example, try Prabook; then search for the author of this paper. The page contains a complete potted history of the individual’s career and some very personal details too (including parents, spouse, children, key dates, etc.). None of this was supplied by the individual: it’s all been scraped from other Internet sources, including archive documents in several places. This isn’t a Wiki of famous people: it’s potentially the start of a DB of anyone. Prabook claims a mission ‘to record and preserve information on individuals who have made a contribution to their nation, local community or any professional field, and on whom sufficient data can be found in books, magazines, public and private libraries, and archives’ but already its motives are being questioned.
Is Prabook legal? It almost doesn’t matter because similar sites have appeared and disappeared in recent years: as one is ‘taken down’ following complaints, another appears. As ‘personal information density’ increases over the next few years, it’s likely – probably inevitable – that, at any given point in time, there will be something to target, and these DBs will gradually expand to include more and more people. Like bogus, pay-as-you-go ‘Who’s Who’ entries, we can all be famous if someone somewhere profits from it!
2. A Personal Identification Mark (PIM) Standard?
At face value, this could be the trickiest bit of all. SfP’s PIM will require an agreed data standard/protocol: a mechanism for combining the various IVs into a single record, but flexible enough to deal with variation in what features are available in any ‘capture instance’. From a technical perspective, this isn’t hard: separately, and in combination, it’s what Shazam and existing face recognition systems (sort of) already do. But surely, unauthorised use of the PIM could be made illegal? Surely, any websites carrying a PIM and offering SfP matching services could be taken down like a shot? Surely a product or an app offering SfP wouldn’t be allowed on the shelves?
But it’s never as simple as that. Apart from the same problem, as in (1), that chasing down these websites is a battle that may never be won, we have the recurrent issue that such technology would have obvious benefits elsewhere (and presumably with individuals’ consent). [The same arguments as with sex robot technology.] A group of volunteers in an organisation may well want to cooperate in this form of mutual identification and welcome the use of devices, apps and websites using their PIMs for such a purpose. Making a complete data standard/protocol outright illegal will be difficult: it’s been a topic of debate many times before.
The upshot is that, whilst restrictive legislation may be possible, it could be hard to enforce. The combination of technology with legitimate alternative use and the practicalities of identifying and dealing with offenders may be too much. (And we’ve not even mentioned the dark web in any of this!) The fundamental rule of Internet data still applies: if anyone plays free and easy with your personal information, it may be possible to trace the culprits, it may be possible to prosecute, even punish, the wrongdoers … but the damage has already been done!
Returning Personal Information
This remains trivial in a technical sense: once an individual has been identified, any information held in the central database can be returned immediately. But this may be only a fraction of what could be available across the wider Internet. The central record would also contain an acquired (‘learned’) set of effective (combined) search terms that could be used to scrape or mine personal information in real time. Independent data sets could be processed together to de-anonymise material and achieve further identification. If necessary, conflicting records could be ‘data-cleansed’ and the results iterated back to improve system performance. Once the SfP’s skeleton is in place, it will improve rapidly.
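The de-anonymising step, in which independent data sets are processed together, can be sketched as a simple 'linkage' exercise. The example below is a toy under stated assumptions: both data sets, the field names and the choice of birth year, district and sex as shared 'quasi-identifiers' are invented, and every record is fictional. It links an 'anonymised' record to a name only when exactly one named person shares the same combination of values.

```python
from collections import defaultdict

# Fields assumed (for illustration) to appear in both data sets.
QUASI_IDENTIFIERS = ("birth_year", "district", "sex")

# Fictional 'anonymised' records: no names, but quasi-identifiers remain.
ANONYMISED = [
    {"birth_year": 1970, "district": "G12", "sex": "M", "condition": "diabetes"},
    {"birth_year": 1985, "district": "LL53", "sex": "F", "condition": "asthma"},
]

# Fictional public profiles: names plus the same quasi-identifiers.
PUBLIC = [
    {"name": "Person A", "birth_year": 1970, "district": "G12", "sex": "M"},
    {"name": "Person B", "birth_year": 1985, "district": "LL53", "sex": "F"},
    {"name": "Person C", "birth_year": 1985, "district": "LL53", "sex": "F"},
]


def link(anonymous_rows, named_rows, keys=QUASI_IDENTIFIERS):
    """Attach names to anonymous rows whose quasi-identifiers match exactly one person."""
    index = defaultdict(list)
    for row in named_rows:
        index[tuple(row[k] for k in keys)].append(row["name"])

    linked = {}
    for row in anonymous_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # unambiguous match: the record is re-identified
            linked[names[0]] = row["condition"]
    return linked


if __name__ == "__main__":
    print(link(ANONYMISED, PUBLIC))
```

Here the first record is re-identified because only one named person matches it, whilst the second stays ambiguous because two people share the same values. Adding one more shared field would probably separate them, which is the dimensionality principle at work again.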
Once more, of course, a more valid objection relates to legality. Whilst ethical and moral concerns are easily dismissed (they have tended to be historically when there’s profit to be made), the law itself is harder to bypass with impunity. But, for legislation to provide an effective check to our proposed SfP, different questions have to be considered:
- What existing (e.g. GDPR) legislation is in place relating to SfP? Is it sufficient or do parts need to be extended?
- In whose interests would existing/future legislation be applied? Who is perceived as requiring protection?
- Which aspects of SfP are (or could be made) illegal? (Such questions are often complicated by the same technology having beneficial applications elsewhere.) How effectively, in practice, could any legislation be implemented and upheld?
- Can privacy legislation ever cope with situations in which no actual data ever exists: rather everything is constructed, combined and processed in real time, then released when done? Similarly, if different actors were responsible at different stages of the process, which would have broken the law?
We don’t pursue these questions in depth in this paper. Such arguments would be too wide-ranging and/or lengthy for a discussion paper of this nature. However, the special issue that this editorial introduces welcomes contributions containing legal (or political, economic, ethical, etc.) debate as much as technological content.
Conclusions: Putting it All Together (Or Pulling it All Apart?)
Futurology is difficult! It’s not ultimately clear how an SfP system would work, although there are numerous ways in which it might. If a prototype wearable device or mobile app (say) were to become available in five years or so, it could employ any of the techniques for feature (IV) extraction discussed here, the exact combination to be determined by the pace and success of hardware and software evolution in each domain. There might even be a particularly disruptive technology on the horizon that could render several of these redundant and make the whole SfP notion even more realistic. Timescales may be debated but the fundamental principle seems sound.
The message, for now, is that such a system is credible. There are technical, legal and ethical challenges to consider but none appear utterly insurmountable if there’s a will – for good or bad – to make it work. Whether, in practice, such a system does emerge is ultimately not a technological question: those problems are easily solved with time. Instead, whether or not SfP appears – and exactly what it might be used for – are questions whose answers will be determined via the conflicting pressures of profit and public interest. There’s little doubt that it would be commercially viable but who would really benefit (or suffer) from it? Will concerns about individual privacy and exploitation, and their political influence, prove to be sufficient and effective restraints? It could be argued that history warns us against complacency in such matters.