GPT AI Privacy in Content Production and Localization
Privacy and AI are both thorny, intricate topics. When coupled together, it's tough to see anything clearly. Is there a real privacy threat when working with GPT, or is it just in our imagination?
Online privacy is in itself a thorny and complex matter. Mixed with the latest breakthroughs in artificial intelligence, it becomes a potent Molotov cocktail. Lack of knowledge and experience, preconceptions, and deep human emotions all merge, turning AI privacy into an impenetrable dark forest. Dedicating time to reading this article will hopefully give you a light and a compass so that you can at least begin to navigate this spooky terrain.
First, some light on the prickly emotional terrain…
Privacy triggers some of the most intense human emotions
The feeling of being invaded. Exposed. When it comes to such a sensitive matter people are quick to jump the gun and make emotionally driven decisions as opposed to slowly and rationally examining what all of this means. From an evolutionary psychology perspective:
Throughout human history, maintaining privacy has been crucial for our survival and reproductive success. Privacy allowed individuals to protect their resources, avoid potential threats, and maintain control over their social environment. As humans are social creatures, we have evolved to be sensitive to our social standing and relationships. Consequently, the desire for privacy is closely linked to our need for control, autonomy, and social reputation.
Privacy can arouse profound emotions because of the role it plays in personal autonomy. Privacy allows us to explore our thoughts, beliefs, and emotions without fear of judgment or external influence. This freedom supports our sense of self and allows us to develop our identity and individuality. When our privacy is threatened, we may feel vulnerable and exposed, leading to anxiety, fear, or even anger.
AI also strikes the human emotional core
ChatGPT brought AI down like a sledgehammer from a distant sci-fi reality to a real-world threat and opportunity matrix in a heartbeat. The rapid dissemination of such a groundbreaking model awakened some of the strongest human emotions such as avarice and fear:
AI rattles deep human emotions due to its potential to disrupt established norms, challenge our sense of control, and evoke existential concerns. The rapid advancement of AI technologies can provoke fear and anxiety about job displacement, loss of privacy, and societal changes. Additionally, AI's increasing autonomy may spark unease about relinquishing control to machines, raising concerns about their decision-making capabilities and ethical implications. Furthermore, AI's potential to surpass human intelligence triggers existential questions about the uniqueness of human consciousness and our place in the world. Overall, AI touches upon deep-rooted emotions as it forces us to confront uncertainties surrounding our future, control, and identity.
It's only natural that it's challenging to keep a cool, rational, cost-benefit perspective when analyzing the potential wins and perils of engaging with such turmoil.
2 + 2 = much more than 4 in the emotional world
These gut-wrenching feelings (even if unconsciously so) quickly merge into an emotional fender bender. When you add all the unresolved feelings, knowledge, and experience around online privacy to the newly introduced advances in AI, there's no other way to say it: it's an emotional shit show. Freud's delight, if he were around to witness it. And although this says nothing about privacy and AI specifically, it's vital to at least partially acknowledge this emotional entanglement in order to set it aside for a minute so that we can focus on the facts.
What do privacy and data even mean?
Privacy
According to the Oxford Dictionary, privacy means: the state or condition of being free from being observed or disturbed by other people. In other words, nobody is watching you, taking note of you, or in any way interacting with you. Metaphorically, privacy is what you would experience in your bedroom; its inverse is what you would feel in a public town square.
Data
Data, the other key component of privacy, means, according to the same dictionary: facts and statistics collected together for reference or analysis. In other words, any information that serves any given purpose.
Privacy + Data
In the Physical World
And merging these two concepts together, it’s easy to show that data and privacy are dynamic principles. My bedroom behavior is clearly private whereas my actions on a town square are clearly public.
What about my personal behavior at a mall store, for instance, observed by someone working there? If they notice that I prefer green things over yellow things, and begin to place more green things on display when I walk in, are they infringing on my privacy or simply being mindful?
Do they need my formal acknowledgment about what they are noticing about my behavior and consent over the actions they are taking? Or can they simply do this covertly without invading my privacy?
What happens if there are countless invisible store workers around me, monitoring my behavior so that I can be best served and consequently buy more? This metaphor takes us to the online realm.
It’s easy to illustrate that when you begin to dissect data and privacy in the physical realm, there’s a ton of gray area. Different cultures, situations, contexts, and behaviors all dance together to create a delicate composition of data and privacy.
In the online realm
When it comes to online behavior these distinctions quickly become less obvious. Is a Facebook post available to all my friends private or public? Is a Google search I run public or private? And if private, to whom? Private, meaning that Google won't share it with third parties? Or private, meaning that Google won't use that information at all to offer me ads based on my search intent? I don't have to go too far down the privacy rabbit hole to make my point:
due to the scale delivered by technology, online privacy is at best a murky concept, persistently exploited by companies that fundamentally require private data to operationalize and leverage their business.
It gets even murkier. According to GDPR, for instance, even a publicly available web page is within the realm of privacy. Here is how Wired summarizes GDPR's treatment of publicly available information:
GDPR’s protections apply if people’s information is freely available online. In short: Just because someone’s information is public doesn’t mean you can vacuum it up and do anything you want with it.
Transporting this idea back to the physical world, it's like saying that if you have a store that's open to the public, the store's contents are still private. If your intent is to buy those contents, you are not infringing on the store's privacy, but if your intent is to covertly commit industrial espionage, then you are. Analogously, on a public website, as a visitor interested in the contents of a blog or a product page you are not infringing on privacy, but if you are crawling that content for ulterior purposes then you do need permission.
A side-note: Beyond Privacy and into Copyright Infringement
This goes beyond privacy and into copyright infringement because if a large language model were built on data generated by others without their consent and approval, it would only seem logical that something is owed to those authors. Likewise, if an AI art model generates expressionist renderings based on the work of Pollock, or surrealist versions tuned by Dali’s paintings, it seems reasonable that something is owed to those artists. As important as it is, we will leave this ethical authorial exploration for a subsequent piece in order to continue to drill down on privacy and data.
Although fundamental, that’s an entirely different matter
This article does not seek to question the merit of what goes into building an AI model but will focus only on what happens to your content, particularly in the context of content production and localization. The key question we want to answer in our exploration is: what is your actual exposure vs. your potential upside?
So far we have established the conceptual framework:
Data and Privacy are dynamic in nature and have to do with intent, just as much as they have to do with the nature of the data itself.
Not only is privacy far from a universal standard from a definitional perspective, but once you factor in the added aggravation of what the rules say, what's enforceable, and what truly gets done with your data, any analysis is far from trivial.
In short, for many, there are legitimate privacy concerns even over the publicly available data used to train GPT, let alone the more sensitive data you decide to feed the model through prompts and completions. But this still does not mean that the content you feed GPT is in any categorical way exposed.
The Illusion of Privacy
The vast network of services that typical users engage with in order to exist online, from searching to shopping, from reading to posting, typically manages in some form to capture things that can easily be understood as private, such as your behavior: things you click on vs. things you scroll past, text you search, and time you spend on pages. As an online person, you generate a continuous stream of information, whether implicit or explicit.
Metaphorically speaking, technology makes it possible for each of us to be surrounded by a myriad of store clerks taking meticulous notes on everything we do, and everything we don't do, to continuously adjust our experience so that we spend more and more time and resources at that store.
And it's no wonder that, even with laws meant to safeguard privacy, and even trying to be careful, I still find myself seeing ads that are uncannily close to something I searched for on a social media platform, or simply something I looked at. What's even harder to detect is the information that's not being displayed to me because of my behavior.
In my opinion, tech evolves faster than legislation by an order of magnitude, allowing companies to continuously find loopholes to use my data in ways far beyond those contemplated by current rules, let alone my imagination. When I crunch all of this together, it's my perspective that:
Our online world in its current shape does not have any real privacy when push comes to shove. Privacy is just a veil, an act that some proportion of tech companies abide by at some level but at other levels exploit because private information is just too valuable to truly protect.
But if you are a privacy purist and still believe in the idea that online things can be kept private, I respect your position and will keep that in mind as we continue to dissect the issue of content processing in GPT.
The privacy beef (or pasta) specifically with OpenAI
What actually happened to spark this most recent privacy concern over OpenAI?
On March 30th, according to OpenAI, ChatGPT had a bug that resulted in an undetermined number of users being able to see other users' prompt titles and new conversations. 1.2% of ChatGPT Plus users also had their payment information exposed. For more information, refer to OpenAI's statement here.
My take on this:
Not acceptable that OpenAI euphemistically terms this an “Outage” as opposed to a “Data Breach”
In the same way that they quantified that 1.2% of users had payment information exposed, they should quantify how many users saw how many chat titles and conversations, for greater transparency and credibility
As far as breaches go, this is quite small
I won’t go into full details, but as far as breaches go (you can read more here), this breach is relatively small and insignificant, not even making it to the top 100 list published here.
The media visibility, amplified by the AI privacy emotional troubles previously discussed, makes people prone to blowing the degree of privacy concerns around ChatGPT out of proportion (either up or down). When a country like Italy abruptly adopts a harsh measure and bans ChatGPT, public perception becomes even more distorted. In a click-driven world, where people seldom dedicate precious minutes to actually understanding anything, that’s all you need for the bias to become set in stone. Now that we can set all of this beef and emotional salad aside, let’s explore the facts.
What happens to translatable content once it is fed into ChatGPT?
As of now, you can engage with ChatGPT either directly through OpenAI or through Microsoft Azure. Let’s explore the ramifications of both methods.
Directly through OpenAI's ChatGPT interface
OpenAI's latest paper on GPT-4 states:
We take a number of steps to reduce the risk that our models are used in a way that could violate a person’s privacy rights. These include fine-tuning models to reject these types of requests, removing personal information from the training dataset where feasible, creating automated model evaluations, monitoring and responding to user attempts to generate this type of information, and restricting this type of use in our terms and policies. Our efforts to expand context length and improve embedding models for retrieval may help further limit privacy risks moving forward by tying task performance more to the information a user brings to the model. We continue to research, develop, and enhance technical and process mitigations in this area.
While the intentions seem noble, the text is purposely authored with loopholes, such as saying that personal information is removed "where feasible". In essence, the text says, "We are the good people and we are doing our best not to violate your privacy." Worthy intentions, but we need a lot more clarity on exactly what happens to data fed into GPT when working through OpenAI's ChatGPT interface.
I was unable to find more specific language regarding usage policies for content submitted via the ChatGPT UI. If you are aware of documentation that sheds light on this, please share it with me and I will update this article accordingly. Thank you in advance.
Directly through OpenAI's API
OpenAI has a different tone and discourse when talking about its updated API policies here. Not only is the policy easy to locate, but the language is crystal clear. The two key points, pulled directly from their API usage policies, are:
OpenAI will not use data submitted by customers via our API to train or improve our models unless you explicitly decide to share your data with us for this purpose. You can opt-in to share data.
Any data sent through the API will be retained for abuse and misuse monitoring purposes for a maximum of 30 days, after which it will be deleted (unless otherwise required by law).
This gives API users a significant degree of confidence that the data will not be used to train or retrain models and that it will be deleted after 30 days. As far as privacy goes, it seems clear that there is little to no significant exposure based on the policies, but you are still taking on the risk of a relatively young tech company that, since its inception, has been far more focused on research and development than on data management.
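To make "data submitted via our API" concrete, here is a minimal sketch of a chat-completion request carrying a single translatable segment. The endpoint and field names follow OpenAI's public API reference; the model name and API key below are placeholders, and the request is only built, never sent:

```python
import json

# Hypothetical helper: build a chat-completion request asking GPT to
# translate a segment. Endpoint and field names follow OpenAI's public
# API documentation; the model name and API key are placeholders.
API_URL = "https://api.openai.com/v1/chat/completions"

def build_translation_request(segment: str, target_lang: str, api_key: str):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "gpt-4",  # placeholder model name
        "messages": [
            {"role": "system",
             "content": f"Translate the user's text into {target_lang}."},
            {"role": "user", "content": segment},
        ],
    }
    return headers, json.dumps(payload)

# Everything in this body (the prompt) plus the model's eventual reply
# (the completion) is the data covered by the retention policy above.
headers, body = build_translation_request(
    "Welcome to our store", "French", "sk-PLACEHOLDER")
print(json.loads(body)["messages"][1]["content"])
```

Under the API terms quoted above, this request body and the model's reply would be retained for at most 30 days and not used for training unless you opt in.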
On Microsoft Azure
Microsoft Azure makes up for this gap in information-security pedigree and provides even more clarity about exactly what happens to your data.
Its full privacy policy can be found here. In my opinion, the most important statement in the policy is:
No prompts or completions are stored in the model during these operations, and prompts and completions are not used to train, retrain or improve the models.
This statement is emphasized through repetition, with slight variation, throughout the policy. The language gets a bit tricky: in certain operations, prompts or completions may be stored for 30 days in encrypted form, accessible only to specifically designated Microsoft employees with that level of clearance:
Prompts and completions. The prompts and completions data may be temporarily stored by the Azure OpenAI Service in the same region as the resource for up to 30 days. This data is encrypted and is only accessible to authorized Microsoft employees for (1) debugging purposes in the event of a failure, and (2) investigating patterns of abuse and misuse to determine if the service is being used in a manner that violates the applicable product terms. Note: When a customer is approved for modified abuse monitoring, prompts and completions data are not stored, and thus Microsoft employees have no access to the data.
In summary, the worst-case scenario through Microsoft Azure is:
The content fed to the Azure-based GPT will be stored for at most 30 days
All data is encrypted at rest
It won’t be shared with anyone other than select Microsoft employees
Data won't be used to train the OpenAI models instantiated in Azure; this is stated flat out in plain English:
No. We do not use customer data to train, retrain or improve the models in the Azure OpenAI Service.
What does this mean as far as privacy for your data goes in each of these scenarios?
Via ChatGPT UI
In short, unless there is documentation out there that I have been unable to locate so far, privacy via ChatGPT is flimsy at best. Since there are no categorical statements about what happens to prompts and completions, it's safer to assume that your conversations with ChatGPT will be used for training purposes. To what degree they may be exposed, and what risk you run, is unknown at this point and tough to determine, since the product and concept are so new.
Via OpenAI API
You are quite protected based on the updated API usage terms published by OpenAI. You can rest assured that your content will not be used for training unless you specifically opt in, and that it will be destroyed after 30 days. The key challenge here is that OpenAI is primarily a research company handling a trove of information rather than an information-management company handling that same information. But other than that organizational risk, it's safe to say your data is protected according to their policies.
Via Microsoft Azure API
This clearly seems to be the scenario in which you reach maximum protection. Your data is backed by Microsoft Azure's brand integrity. The statements in its privacy policy are clear and categorical, and the described data handling follows high-end data-management practices. Not only do those practices safeguard any data shared with GPT through Microsoft Azure, but you are also working with a company that is an industry benchmark for information security, along with AWS and other market leaders.
Conclusions
These are the main concepts I learned while writing this article:
The wide range of emotions rattled by online privacy and artificial intelligence makes it particularly challenging for people to examine these concepts from a neutral, analytical position, which results in lots of bias and preconceived notions as opposed to facts.
There seem to be few hard privacy safeguards when it comes to using ChatGPT's UI. It is of vital strategic value for OpenAI to continue to train GPT models through user behavior. Reinforcement learning from human feedback (RLHF) is one of the key breakthroughs OpenAI used to tune GPT to feel more human and coherent. It seems to me that the data generated by users is just too valuable to pass up entirely.
Open AI’s API and Microsoft’s Azure API both seem to provide privacy safeguards that are good enough for most global organizations.
That being said, when it comes to privacy it's my opinion that if something is truly sensitive, to the point that it needs to remain secret for years, the productivity gains from using GPT are not likely to outweigh the added risk (as small as it may be) of processing that content through such an engine. But for run-of-the-mill content that eventually reaches consumers and is meant to be public, the productivity gains from leveraging GPT seem to outweigh any privacy concerns when working via API. As a matter of personal preference, I treat the UI as a public forum. That is, I expect no privacy whatsoever for content submitted there, even though there is some language in place that offers vague notions of privacy.
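One practical hedge that follows from this stance: before run-of-the-mill content is sent through the API, obvious identifiers can be masked with a simple pre-processing pass. The sketch below is a rough illustration of the idea; the two regexes are assumptions for the example, not production-grade PII detection:

```python
import re

# Crude, illustrative masking of obvious identifiers before content
# leaves your systems. Real PII detection needs far more than two
# regexes; this only sketches the pre-processing idea.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_identifiers(text: str) -> str:
    # Replace each matched identifier with a labeled placeholder.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_identifiers("Contact Ana at ana@example.com or +1 555 123 4567."))
```

A real localization pipeline would pair this with reversible placeholders so the masked tokens can be restored after translation.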
When it comes to privacy and AI in general there are valid concerns that are far from being solved or addressed. But content fed via API is relatively safe when contextualized within the overall architecture of cloud services.
While the complex emotions brought about by privacy and AI often drive people to extremes, either insanely bullish or banning it before any kind of deep analysis, it seems like a fair assessment that your content is protected well enough when processed through OpenAI's or Microsoft's Azure APIs. Unless you have something that must remain airtight for a long period of time, the productivity gains justify harnessing the power of these large language models without becoming overly fearful or concerned.
But these are my thoughts on a truly difficult topic. Anything you would like to add? I would love to hear! Thank you for reading!