
Data Security and the ChatGPT Effect

by   in Cybersecurity

We’re all seeing and experiencing the excitement over the release of large language models (LLMs) such as ChatGPT from OpenAI, and many other LLMs from labs and tech giants such as Google, Meta, and Microsoft. (Quick definition: LLMs are artificial neural networks used to process and understand natural language. Commonly trained on large datasets, they can be used for a variety of tasks such as text generation, text classification, question answering, and machine translation.) It’s not just that people are experimenting and using these AI/ML services for purposes – both appropriate and some inappropriate – but applications and services are embedding links to these services to add intelligence and capabilities that would take years to develop on your own. Just look at Microsoft and their integrations into the Microsoft 365 suite of tools.

The rush to hand data over to these services leaves me wondering: are people “thinking before tinkering”? Sure, it’s exciting to see what these models produce. Sometimes it’s relevant, sometimes it’s funny, but it’s always in the context of the data you provide – be that in the form of a question or a request to do work, such as “write me a blog post about the effects of the new AI/ML large language models and the implications of this technology for data security.”

While even the most basic security standards recommend you protect your data and control access to it, it’s not as clear to people that feeding data into analytics engines and AI models can have unintended consequences.

Doctor, Doctor… It Hurts When I…

One compelling use case I’ve heard demonstrates the potential danger. Organizations are rapidly adopting large language models as a more automated and intelligent way to search and use company data across all their various data sources. There could be a hidden danger in this seemingly innocuous approach to cataloging, indexing, and serving up organizational data. As data sources are crawled and ingested as part of the training set, it’s possible – and some might say highly probable – that sensitive data will be ingested along with everything else. Once the model is trained, it’s almost impossible to know what it has ingested and what it might reveal in response to queries – maybe credit card numbers or customer personal data it saw in documents sitting out on some repository. AI-powered enterprise search is a great example of why security professionals need to be disciplined about scrubbing data and using tokenization, so that high-value and regulated data doesn’t leak into models where it could emerge later in surprising and very public ways.
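To make the scrubbing step concrete, here is a minimal Python sketch of replacing card numbers with opaque tokens before a document reaches an ingestion pipeline. Everything in it is an assumption for illustration: the `SECRET_KEY`, the `<CARD:...>` token format, and the function names are hypothetical, and the keyed hash stands in for a real tokenization service. It only catches card numbers; a production scrubber would cover many more data types.

```python
import hashlib
import hmac
import re

# Hypothetical key for the demo; a real deployment would fetch it from a key manager.
SECRET_KEY = b"demo-key-not-for-production"

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def scrub_card_numbers(text: str) -> str:
    """Replace Luhn-valid card numbers with a keyed, opaque token."""

    def replace(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        if not luhn_valid(digits):
            return match.group()  # not a card number, leave it untouched
        digest = hmac.new(SECRET_KEY, digits.encode(), hashlib.sha256).hexdigest()
        return f"<CARD:{digest[:12]}>"

    return CARD_PATTERN.sub(replace, text)


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely used test card number, not a real account.
    print(scrub_card_numbers("Card 4111 1111 1111 1111 on file."))
```

Because the token is derived from the digits alone, the same card written with spaces or hyphens maps to the same token, so the scrubbed documents remain searchable by token without exposing the number itself.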

It may be possible to cleanly search an enterprise search index for sensitive information, but there is no good way to gain a complete understanding of what a transformer network knows since it’s all encoded in an opaque set of weights and abstract data sets.

Good Data Security Can Help, but It Won’t Absolve You of Previously Committed Sins

As an industry, we’ve been doing data security for some time – and we have well-established best practices. That said, we’ve certainly seen a ramp-up in data collected and stored across a wide range of platforms and services. It’s not always apparent where your data travels once you share it with an application or service, so it’s best to be careful and manage your data as if it should always be protected.

I would also encourage you to establish a data management policy and provide employees with basic data security training that covers the risks of handing your data to any public-facing AI, such as ChatGPT. This is really the minimum organizations should be doing. Tell your employees that providing data to any public service, including large language models and the applications that embed them, is potentially dangerous. It’s best to figure out ways for employees to provide data that has been protected or is representative without being actual, real-world data.

So What Can and Should You Do?

We all know that misuse and exposure of sensitive data can cause significant problems that are difficult to resolve and can damage corporate brand and reputation – and, in the case of regulatory control failure, bring significant financial penalties through fines and consumer litigation. So, given that these large language models (and other AI services) are not going away anytime soon, act with caution. Until we get better control over our data and can guarantee that ingestion and output won’t expose us to a security breach or compliance failure, it’s best to consider services like ChatGPT, DALL-E, and Rytr as danger zones where none of your sensitive data belongs. Here are the best practices I would recommend:

  • Don’t put any sensitive data into AI models if you can avoid it – certainly not any unprotected data.
  • If you must put sensitive or regulated data into AI and analytics applications or services, use format-preserving encryption (FPE). Note that FPE may have unexpected, potentially negative, results on AI models being trained (see my note below on test data).
  • Be sure you understand the goals and potential pitfalls of adding existing data into AI models before you begin to upload any data.
  • Protect your data, govern access to it, and audit everything before uninformed, or unwitting, employees use corporate data without your approval. 

This last point is a general “best practice” rule that applies whether or not you intend to share high-value data assets with any applications or services.

So, all the common-sense, logical security best practices still apply to your data management. Of course, you need to make this work within your company’s security framework and data ecosystem. And you should also be managing identities and access. That includes group membership and guest/contractor access to ensure only those with appropriate rights have access to data – protected or not! You should also be protecting sensitive data as close to the point of origin as possible. This ensures that data flowing through your systems is persistently protected and (if you’re managing authentication and authorization) only those with appropriate rights can decrypt. Finally, data protection can support format preservation – meaning downstream applications and services don’t need to be developed to support protected data. With the right format-preserving, privacy-enhancing technologies, referential integrity is maintained across databases, and text processing for analytics needs no adjustments. Take control of your destiny by getting control over your data!
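Here is a small Python sketch of why format preservation and referential integrity matter downstream. To be clear about what it is: this is deterministic, keyed pseudonymization built on HMAC, not format-preserving encryption. It cannot be reversed, and unlike real FPE it can in principle map two different inputs to the same output. The key, the function name, and the sample rows are all made up for illustration.

```python
import hashlib
import hmac
import string

# Hypothetical key for the demo; a real deployment would fetch it from a key manager.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace digits with digits and letters with same-case letters.

    Separators such as hyphens and spaces stay where they are, so the
    output has the same shape as the input. The same input and key
    always produce the same output.
    """
    stream = hmac.new(key, value.encode(), hashlib.sha256).digest()
    out = []
    for position, char in enumerate(value):
        byte = stream[position % len(stream)]
        if char in string.digits:
            out.append(str(byte % 10))
        elif char in string.ascii_uppercase:
            out.append(chr(ord("A") + byte % 26))
        elif char in string.ascii_lowercase:
            out.append(chr(ord("a") + byte % 26))
        else:
            out.append(char)  # keep punctuation and spacing in place
    return "".join(out)


# Invented sample rows: two tables that share an identifier column.
customers = [{"ssn": "123-45-6789", "name": "Alice"}]
orders = [{"ssn": "123-45-6789", "item": "laptop"}]

protected_customers = [{**row, "ssn": pseudonymize(row["ssn"])} for row in customers]
protected_orders = [{**row, "ssn": pseudonymize(row["ssn"])} for row in orders]

# The protected identifier still looks like NNN-NN-NNNN, and the two tables
# still join on it because the mapping is deterministic.
joined = [
    (c["name"], o["item"])
    for c in protected_customers
    for o in protected_orders
    if c["ssn"] == o["ssn"]
]
```

The join at the end still finds the matching order, and any downstream code that validates the identifier's layout keeps working, which is the property the paragraph above describes.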

More Resources

Want to learn more about protecting your data? Download this paper on Tokenization. For more information on OpenText Voltage, visit our multi-cloud data security information hub. OpenText Voltage is a Leader! Download this free report, The Forrester Wave™: Data Security Platforms, Q1 2023, to see why.

But Wait, There’s More: A Quick Note about Test Data Management / Synthetic Data

Here’s the problem: You have a great set of data that proves some business or scientific theory. You’d love to show everyone, but the data is sensitive. Maybe it has names, birthdays, national IDs, whatever… You can’t share this with anyone. What can you do?

There are products out there (such as Structured Data Manager from OpenText Cybersecurity) that can comb through a data set to understand its structure and typical values. This helps them generate synthetic or fake data that ‘looks like’ it belongs to your data set. It matches the schema; it looks like data that would fit into your tables or spreadsheets or whatever – and downstream applications won’t know the difference.


This is one way to use data that looks like your data, without risking exposure. 
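As a rough illustration of the idea, the Python sketch below profiles the shape of each column in a few sample rows and then generates random values with the same shape. It is a toy under stated assumptions: the function names and the sample rows are invented, and it does not represent how any commercial product works. It copies only the layout of each value (digits, letter case, separators), not realistic names or value distributions.

```python
import random
import string


def profile_value(value: str) -> str:
    """Reduce a value to a shape mask: 9 = digit, A = uppercase, a = lowercase."""
    mask = []
    for char in value:
        if char in string.digits:
            mask.append("9")
        elif char in string.ascii_uppercase:
            mask.append("A")
        elif char in string.ascii_lowercase:
            mask.append("a")
        else:
            mask.append(char)  # separators and spaces are kept as-is
    return "".join(mask)


def synthesize_from_mask(mask: str, rng: random.Random) -> str:
    """Generate a random value that matches the given shape mask."""
    pools = {
        "9": string.digits,
        "A": string.ascii_uppercase,
        "a": string.ascii_lowercase,
    }
    return "".join(rng.choice(pools[c]) if c in pools else c for c in mask)


def synthesize_rows(real_rows, count, seed=0):
    """Build `count` fake rows with the same columns and value shapes."""
    rng = random.Random(seed)
    columns = list(real_rows[0].keys())
    # Collect the shapes observed in each column of the real data.
    masks = {col: [profile_value(row[col]) for row in real_rows] for col in columns}
    return [
        {col: synthesize_from_mask(rng.choice(masks[col]), rng) for col in columns}
        for _ in range(count)
    ]


# Invented sample rows standing in for a sensitive data set.
real_rows = [
    {"name": "Ana Diaz", "national_id": "123-45-6789", "birthday": "1984-03-12"},
    {"name": "Bo Chen", "national_id": "987-65-4321", "birthday": "1990-11-02"},
]

if __name__ == "__main__":
    for row in synthesize_rows(real_rows, count=3):
        print(row)
```

The generated rows fit the same schema and pass the same layout checks as the originals, but none of the values are copied from the real records.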
