by Michelle Fabienne Bieger
3. Data and privacy concerns for ML tools
Perceptions about data and privacy
In this lesson, we'll be talking about data and privacy as it pertains to both users of machine learning tools, as well as the provenance of machine learning tools.
First--let's examine the distinctions between data and privacy, and how they converge on consumers, the everyday citizens who are using and building these tools.
Data pertains to the data both for and from the user of the tool--the data that a machine learning tool outputs, as well as the marketing and advertising consumer behaviour data that's collected and used to promote the company behind the tool. However, it also pertains to how companies collect or use data in the development of machine learning tools.
Privacy is how you interact with a company's products or services, and the visibility of those interactions.
At the intersection of these lies how your consumer or personal data is collected, used, and bartered--from the company which first collected it, to third-party companies, or even to governments.
We can identify three tenets that can guide the ethical handling of data privacy:
- Consent
- Notice
- Regulatory obligations
In this lesson, we'll be exploring consent and notice quite deeply, and skimming over regulatory obligations. If you want to learn more about regulation specifically, we'll be covering more topics in regulation and liability in a forthcoming Academy course.
A simple example can illustrate how each of these concepts relates to our everyday lives. Imagine we are at home, in private, having a conversation with a trusted friend; we can reasonably expect some level of privacy. For example, we can lock the door, with the understanding that if someone else wants to join our conversation or listen in, they need to knock to gain entry. This is consent--they are required to knock, and wait for us to provide consent, before they join or hear the conversation.
Because we trust our friend, we can also specify--"Please don't repeat our conversation to anyone else!" We expect that this is respected, but also if they accidentally (or purposefully) repeat the tale, we will be informed about it. This illustrates the idea of notice--that we are previously informed or have a reasonable expectation about the level of privacy our conversation has.
Finally, we have regulatory obligations that allow us to have the freedom to have any conversation we would like, in the safety of our homes.
But do we preserve these same expectations for our online communities and spaces? Oftentimes (perhaps particularly amongst younger generations?) we see online jokes and memes that refer to the idea that "your phone is always listening"--an uninvited visitor to conversations we think of as private. After all, I'm messaging my sister on WhatsApp, not my sister and the board of executives at WhatsApp!
A common rebuttal is: "I have nothing to hide, so what does it matter that my data is being scraped or my privacy violated?" In fact, the repercussions are wide-ranging, touching on important topics from national security to climate change. To pull back the curtain on these repercussions, we need to examine the following in a critical manner.
- We need to be aware of the extent to which our data and privacy are either respected or violated.
- We need to be aware of the repercussions of having our data and privacy up for grabs; implications for everything from national security and elections being compromised, all the way down to individual consumer behaviour modification.
- We need to be aware of the provenance of the data behind AI/machine learning tools to ensure that we understand the output of the tools and that we have not exploited anyone in the process of obtaining said data.
We will explore these three core topics through the lens of ethically handled data privacy: consent, notice, and regulatory obligations. In some sense, the second point will pervade each of the case studies that we examine, but we will illustrate points one and three explicitly through our case studies.
Case study: Pokemon Go
❓ Are we aware of the extent to which our data is used, and is our privacy being respected with due notice?
This case study comes from the book The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power by Shoshana Zuboff, which I would highly recommend to anyone interested in learning more about the importance and extent of our data privacy being handled improperly or unethically.
Pokemon Go is an augmented reality game, in which users attempt to "catch them all"--using their smartphone camera and GPS in order to locate Pokemon in the real world. Pokemon Go is a product of Niantic Labs, headed by former Google Maps product vice president and Street View boss John Hanke. Niantic Labs was formed with funding from Google, Nintendo, and The Pokemon Company.
Users who found certain Pokemon were rewarded with in-game currency and rewards such as candies and stardust. There was also a significant social aspect to the gamification of Pokemon Go--friends could encourage each other to catch Pokemon, battle against other Pokemon, and visit Pokemon gyms and Pokestops. But the game didn't just require access to a user's smartphone camera and GPS--Pokemon Go also requested an extensive list of permissions. The app collected not just precise location data (where you were, for how long, and with whom), but also permissions such as reading your phone's contact list, finding accounts associated with your smartphone, and so on. This information was not held by Niantic Labs alone, but was potentially also shared with third parties. As with many other mobile terms of service, users are not supplied with these third parties' privacy policies or terms of service, and thus a layer of transparency is lost.
Niantic went on to sign deals with companies, creating sponsored locations where Pokemon would spawn. In this way, Niantic was able to engage in a technique known as nudging, a form of behaviour modification where a user is ever-so-slightly encouraged to make certain decisions. Placing a Pokemon consistently near a Starbucks might not compel a user to buy a coffee the first time--but what about the second, third, fourth time...?
This is just one example, but in reality we are surrounded by long, complex privacy policies and terms of service that obfuscate and entrap us in deals with companies we might never have anticipated, given the data sharing that often goes on between the original company and third parties. It is the layering of all the privacy policies we have agreed to--across every app, website, and product we use--where the ethical critique really gains momentum, particularly when technology companies are frequently bought out by others and the landscape is dominated by a few very large players.
Case study: Prosecraft
❓ How can we build datasets obtained with informed consent?
For our next case study, we first must look to understanding literary criticism (bear with me!). Literary criticism is an academic field, taught often as part of literature studies. It's all about dissecting stories--what makes one novel popular over another? What was unique about the novel? How did the author set that novel up? When we study literature, we learn about tropes, story structures, and themes--everything that takes "okay" writing to "great" writing. That can expand our appreciation of that work, help provide context to that work, and most of all--the story of stories is the story of us, and can inform our picture of what the world was like for various communities at various points in history.
It is not outside the realm of literary criticism to incorporate elements of computational science to help inform some of its areas. In fact, this has gone on to spawn further areas of research; natural language processing can be thought of as the mash-up between literary criticism or linguistics and computer science.
Author Kurt Vonnegut's entertaining lecture on the "Shapes of Stories" elucidates how a story structure can easily be quantified in such a way as to be machine-readable. This idea spawned a research project by the University of Vermont 'Computational Story Lab.' In their 2016 paper, Reagan et al. use machine learning to analyse roughly 1,300 works from Project Gutenberg and determine the emotional arc of each story based on the "happiness" level within each page. Using these identified story arcs, the group proposed that there are overall six story structures widely used in literature. You can see an example of such a story arc here--though note, you might get spoiled for Harry Potter and the Deathly Hallows!
💡 Though it would be remiss of us not to point out that machine learning tools are not infallible at identifying emotions, and the above paper is part of an active research field.
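To make the idea concrete, here is a minimal sketch of how an "emotional arc" can be computed: score each word against a happiness lexicon and average the scores over consecutive windows of the text. The toy lexicon, window size, and scores below are invented for illustration; this is not the Computational Story Lab's actual pipeline.

```python
# Toy happiness lexicon (real studies use large, crowd-rated lexicons).
HAPPINESS = {"joy": 8.2, "love": 8.4, "friend": 7.7, "death": 1.5, "war": 1.8, "fear": 2.3}

def emotional_arc(text, window=50):
    """Return the mean happiness score for each consecutive window of words."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    arc = []
    for start in range(0, max(len(words) - window + 1, 1), window):
        chunk = words[start:start + window]
        scored = [HAPPINESS[w] for w in chunk if w in HAPPINESS]
        arc.append(sum(scored) / len(scored) if scored else 5.0)  # 5.0 = neutral midpoint
    return arc

# A grim opening followed by a happy ending produces a rising arc of scores.
sample = "war and death gripped the land " * 20 + "then joy and love and a friend returned " * 20
print(emotional_arc(sample))
```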
Benji Smith, an expert in machine learning and computational linguistics, was writing his first book, Abandoned Ship: An Intimate Account of the Costa Concordia Shipwreck, when it struck him that access to similar statistics might answer his questions about how best to go about writing a book. In his own words, he was unsure how long to make the book--ultimately, he pulled a book off his shelf and estimated how many words it contained. He
...kept a little spreadsheet, and it was precious to me... Precious guidance from authors whose books I adored, when I was struggling to tell my own story.
Smith created a software package, Shaxpir, as a tool for aspiring writers--a desktop word processor akin to Scrivener. Meanwhile, his spreadsheet of statistics grew, influenced by the University of Vermont's research.
Again, in his own words:
When I ran out of books on my own shelves, I looked to the internet for more text that I could analyze, and I used web crawlers to find more books. I wanted to be mindful of the diversity of different stories, so I tried to find books by authors of every race and gender, from every different cultural and political background, writing in every different genre and exploring all different kinds of themes. Fiction and nonfiction and philosophy and science and religion and culture and politics.
Smith did consider the prospect of regulatory obligations:
Since I was only publishing summary statistics, and small snippets of text of those books, I believed I was honouring the spirit of the Fair Use doctrine, which doesn't require the consent of the original author.
Eventually he transformed his spreadsheets to an online webpage, Prosecraft, dedicated to linguistic analysis. It had a repository of 25,000 books, and provided statistics on everything from the total word count, to rankings of "vividness" and how much "active voice" a book had. It included: highlights of passages with the "most active" versus "most passive" voice, recreating the passages in full; word clouds for pages that showed how much they contributed to the overall emotion of the story; and word count distribution for the books. From Smith's point of view, he was creating a tool to help writers craft their stories.
(Of course, whether this indeed helps writers is a topic of literary criticism in and of itself!)
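For a sense of what "summary statistics" of a manuscript might look like in practice, the sketch below counts words and applies a deliberately crude passive-voice heuristic (a form of "to be" followed by a word ending in -ed or -en). These heuristics are assumptions made for illustration, not Smith's actual methods.

```python
import re

BE_FORMS = {"is", "are", "was", "were", "am", "be", "been", "being"}

def summary_stats(text):
    """Word count plus a very rough rate of passive constructions per sentence."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    passive_hits = sum(
        1 for prev, curr in zip(words, words[1:])
        if prev in BE_FORMS and (curr.endswith("ed") or curr.endswith("en"))
    )
    sentences = max(len(re.findall(r"[.!?]", text)), 1)
    return {
        "word_count": len(words),
        "passive_per_sentence": passive_hits / sentences,
    }

print(summary_stats("The ship was abandoned by the crew. The captain fled the scene."))
# {'word_count': 12, 'passive_per_sentence': 0.5}
```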
Author Hari Kunzru (of the novel White Tears) was on the subway home when he noticed Prosecraft trending. As Wired reports it, Kunzru checked to see if the website contained his novel--and indeed it did, despite Kunzru knowing his publisher would not have licensed it for use in this way. He tweeted:
This company Prosecraft appears to have stolen a lot of books, trained an AI, and are now offering a service based on that data. I did not consent to this use of my work.
As we can see--the tenets of consent and notice here are clearly violated. Permission was not obtained from each author for the inclusion of their work in training Smith's tool, nor was notice proactively provided to them that their work was involved in said tool. On the tenet of regulatory obligations, Wired reports there are conflicting opinions as to whether Fair Use would indeed have covered this application of machine learning.
Backlash came swiftly for Smith, who eventually took down Prosecraft of his own accord. The backlash mainly consisted of ire over the use of shadow libraries to assemble the training data--though some authors had proactively contacted Smith about including their work. Others felt that whilst it was acceptable for shadow libraries to exist, they should only be used by researchers, for researchers (and not for a for-profit tool).
Others still disapproved of AI writing tools in general--even, for example, Grammarly. Such tools can induce a particular sentence structure or writing style, which can prevent a writer from appropriately developing and exploring their own unique voice. A poet such as e.e. cummings, who famously flouted grammar rules, would have had their work homogenised if passed through such an AI tool. This in turn changes the overall landscape of writing, meaning that even if you as a writer don't engage with these tools, your inspiration and reading material is still limited to this narrower window of style.
💡 Shadow libraries are not just used by smaller machine learning tools; they also form part of the training datasets of some of the biggest names in LLMs--e.g. OpenAI's ChatGPT, Meta's Llama, and Bloomberg's BloombergGPT. Many regulatory and legal issues are currently being explored in the context of these LLMs using shadow libraries, which we will cover in ensuing AI Ethics courses in the Academy.
Case study: Karya
Yet training datasets can also consist of larger swathes of the Internet.
❓ But what does the Internet actually look like? What biases are being encoded in these large training datasets by virtue of their origin?
In fact, the Internet has quite strong biases in and of itself. The predominant language on the Internet is English; though this share is decreasing year on year, the majority of websites are still in English. Another prominent bias is gender bias: less than 15% of Wikipedia contributors are women; only 34% of Twitter users are women; and only 33% of Reddit users are not men.
In this next case study, we tackle some of these biases, and address another aspect of data and privacy: the consent and notice behind the data collection that goes into making machine learning tools.
Importantly, once a set of training data is established, data cleaning and labelling still need to be done. If a user of social media, an LLM, or another machine learning tool requests harmful or abusive content--or is at risk of being served it--companies need to ensure their tools do not provide it. To achieve this, they typically use a machine learning tool that can screen for such content. However, such a tool needs labelled examples of that content, because LLMs do not have contextual or semantic understanding: whilst they can predict the next word in a sentence or the next sentence in a paragraph, they don't have a contextual understanding of that word or sentence. The same goes for images--this is why CAPTCHA forms often ask for a contextual understanding ("Identify the bus in this image of traffic").
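As an illustration of why those human-provided labels matter, the sketch below trains a tiny text classifier that only "knows" what counts as harmful because labelled examples tell it so. The toy data is invented, and scikit-learn is used purely as an example stack; production moderation systems are vastly larger and far more carefully evaluated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data: the labels (0 = safe, 1 = harmful) come from human annotators.
texts = [
    "have a wonderful day",
    "thanks so much for your help",
    "i will hurt you",
    "you deserve to suffer",
]
labels = [0, 0, 1, 1]

# The model only acquires a notion of "harmful" from those human-provided labels.
screen = make_pipeline(TfidfVectorizer(), LogisticRegression())
screen.fit(texts, labels)

print(screen.predict(["hope you suffer"]))  # likely [1]: flagged as harmful
```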
To gather this crucial labelled data, technology companies will often outsource work to data companies to do said labelling. This requires workers to see or read these harmful materials, often in graphic detail.
A large hub for this type of work is located in Kenya, where workers are often underpaid, and often denied therapy to help them cope with the psychological harms such exposure causes. Currently, efforts are underway for workers to unionise, and lawsuits are being brought against companies to hold them accountable for this work.
In stark contrast, non-profit start-up Karya stands as a pillar of aspiration for technology companies. Karya, founded by Vivek Seshadri and Manu Chopra, has signed the Ethical Data Pledge. This consists of three key tenets:
- Paying workers living wages.
- No outsourcing of data; all data is handled and processed by Karya rather than outsourced to another company.
- Data workers' welfare is proactively invested in.
Karya provides workers a living wage (above minimum wage), as well as de facto ownership of the data created or processed whilst working. This means that if and when Karya sells data on to big tech companies or clients, and that data then gets sold on again, workers continue to profit from their work and receive proceeds at every step of the transaction.
There are some limits to the Karya model: workers cannot make more than the annual income equivalent in India. The work is also intended as a supplementary, rather than permanent, job. Finally, some of the workers have a limited understanding or knowledge of AI, which could prevent them from fully recognising their agency.
However, Karya is able to ethically provide high-accuracy datasets in dialects and languages that are typically not well represented on the Internet. Their datasets have also been specifically designed to combat gender biases.
The story of Karya is the story of where start-ups and companies engaged with machine learning and/or machine learning data need to begin: all three tenets of data privacy are largely being followed, with workers providing consent for their training data outputs to be used, and being given notice (and proceeds) every time their data is passed onwards.
Another success story in this realm is that of the speech-to-text systems for te reo Maori (the Maori language), which remain in the community's possession as a community-centred tool. For more information on this project, please see the project landing page.
Conclusion
We can see the importance of maintaining transparency on both the user side and the development side of machine learning tools--the tenets of consent, notice, and regulatory obligations are central to maintaining ethical standards. Maintaining the integrity of a machine learning tool relies on this transparency, as well as the frank and up-to-date acknowledgement of any possible biases in its training data.
💡 As you answer the questions in the recommended reading or viewing for this lesson, remember the previous lessons, in which we discussed stakeholder-dependent views on the interplay between machine learning and artificial intelligence, as well as the discussions we had on the potential creativity and agency embedded within such algorithms.
- An interview with Shoshana Zuboff, by John Naughton, on surveillance capitalism. After reading the article, have your feelings about the nature and amount of data collected on us through technology changed at all? Do you feel any repercussions in society currently, whether small or large, due to this vast amount of data being collected on us in order to behaviourally nudge us towards certain outcomes? How have our expectations regarding data collection and privacy potentially evolved from this, and do you notice any generational differences in this?
- Listen to interviews with the Kenyan workers behind the labelled data OpenAI uses in their services. WARNING: the content in this link can be difficult at times to listen to. Please feel free to skip this exercise and instead go directly to the following questions: in light of lesson one, where we spoke about perceptions associated with AI and machine learning, given the work that must be done in order to bring these tools to a consumer, would you still feel comfortable labelling it "artificial intelligence"? Why or why not?
- We've seen in this lesson how the Internet is already heavily biased towards certain sectors of society, and this will be reflected, as a result, in the AI tools that have been trained on vast swathes of data from the Internet. In the resource above, the Wall Street Journal podcast, it's noted how, for example, ChatGPT has had the most moderation for English, with little available in Swahili. We saw in our lesson's case study that Karya provides the opportunity for a language that's often technologically overlooked, the language Kannada. In 2024, more than 70 countries are due to hold regional or national elections. These could be incredibly influenced by tools such as ChatGPT, and yet how many of these citizens interact with technology exclusively in English? However, neither is it particularly feasible to make every machine learning tool fluent or available in all languages of the world. Given all this, what benchmarks would be appropriate for an LLM or other machine learning tool upon release? How can we mitigate the potential negative impacts and yet not stifle innovation of these tools? What kind of warning label would be best for these kinds of tools or their output?
Recommended books and longer reads:
We recommend either utilising a public library to access these books, or, if buying online, using Bookshop.org, which supports independent bookshops. Some books can also be accessed with JSTOR, if you have access to an educational license through a school, university, or research organisation.
- The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power by Shoshana Zuboff
- Going Dark: The Secret Social Lives of Extremists by Julia Ebner
- The Little Black Book of Data and Democracy by Kyle Taylor
- Privacy is Power: Why and How You Should Take Back Control of Your Data by Carissa Véliz
- For practical, everyday privacy tips, the Instagram account nbtv.media is a valuable resource