What can go wrong?
TL;DR
- Treat anything you paste into a consumer chatbot UI as data that may be logged and used to improve the service. Prefer API/enterprise endpoints or institution-hosted/local models for work with personal data, unpublished results, or proprietary code. See statements from OpenAI (Enterprise Privacy), Azure OpenAI (Data & Privacy), and Google Vertex AI (Data Governance & Zero Data Retention).
- Multi-tenant chat UIs have had cross-user exposure incidents (see OpenAI’s postmortem on the March 20, 2023 ChatGPT outage).
- Independent reporting shows platforms collect broad personal data and may share prompts with service providers and other parties; mobile apps can add precise location, phone numbers, and photos. See: heise developer.
- Safer patterns exist: local (Ollama + Open WebUI / LM Studio / GPT4All / llama.cpp), institution-hosted gateways, or enterprise plans with no-training defaults and retention/residency controls. See the Hamburg DPA checklist.
In more detail:
- Training on your content (consumer front-ends). For consumer (“for individuals”) services, chats may be used to train models unless you opt out. Business/API offerings typically do not train on your inputs by default. See: OpenAI Enterprise Privacy, Azure OpenAI FAQ, Vertex AI data governance.
- Memorization & extraction. LLMs can regurgitate training snippets or leak prompt content. Carlini et al. (2021) demonstrated that GPT-2 had memorized hundreds of verbatim training examples, including individuals’ contact information, which could be extracted with targeted queries.
- Cross-user mix-ups/bugs. Multi-tenant UIs are convenient, but brittle. Example incident: OpenAI’s March 20, 2023 outage postmortem.
- What providers actually collect/share (snapshot). Reports detail data sources (web, partners, brokers) and sharing with service providers, affiliates, research partners, and in some cases ad partners; mobile apps may collect precise location, phone numbers, photos, and telemetry. Read: heise: “LLM-Betreiber sammeln …”.
Note that data shared via APIs is, in contrast to data shared via consumer web front-ends, often not used for training purposes; a minimal sketch of a direct API call follows.
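The sketch below uses Python’s `requests` against OpenAI’s chat completions endpoint. The model name is only an example, and the no-training default is the provider’s stated policy, not something the code itself can enforce:

```python
import os
import requests

# Minimal sketch: call a provider API directly instead of pasting text
# into a consumer chat UI. Per OpenAI's Enterprise Privacy statement,
# API inputs are not used for training by default (policy, not code).
API_KEY = os.environ["OPENAI_API_KEY"]  # keep keys out of source code

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-4o-mini",  # example model name; substitute your own
        "messages": [
            {"role": "user", "content": "Summarize the attached methods section."},
        ],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```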
Safer usage patterns
A) Local-only (no data leaves your machine)
Run models locally with Ollama + Open WebUI (see the Open WebUI docs), LM Studio, GPT4All, or llama.cpp; a minimal call sketch follows below.
Best for: human-subject data, embargoed results, export-controlled material.
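As a concrete example of pattern A, the sketch below queries a locally running Ollama server over its REST API. It assumes `ollama serve` is listening on the default port 11434 and that a model such as `llama3` has already been pulled; nothing leaves localhost:

```python
import requests

# Minimal sketch: query a local Ollama instance. The request stays on
# your machine; no third-party service ever sees the prompt.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # any locally pulled model works here
        "prompt": "De-identify this note: Patient J. Doe, DOB 1970-01-01.",
        "stream": False,    # return one JSON object instead of a stream
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])
```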
B) Institution-hosted (on-prem/HPC/VPN/VPC)
Spin up vLLM/llama.cpp/Ollama behind your VPN. Add a prompt gateway to redact PII, block risky uploads, and log usage for audit. A self-hosted UI like Open WebUI makes this approachable.
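The core of such a gateway is the redaction step. The sketch below uses simple regexes for emails and phone-like numbers and is purely illustrative; a production gateway would use a dedicated PII detector (e.g. NER-based), an endpoint allow-list, and audit logging:

```python
import re

# Minimal sketch of a prompt-gateway redaction step. The patterns are
# illustrative only; real deployments pair a proper PII detector with
# upload blocking and audit logs.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s()/-]{7,}\d")

def redact(prompt: str) -> str:
    """Replace obvious PII with placeholders before forwarding upstream."""
    prompt = EMAIL.sub("[EMAIL]", prompt)
    prompt = PHONE.sub("[PHONE]", prompt)
    return prompt

print(redact("Contact Jane at jane.doe@uni.example or +49 40 1234567."))
# -> Contact Jane at [EMAIL] or [PHONE].
```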
C) Enterprise/per-tenant cloud
Use your lab or university’s tenant with no-training defaults, retention controls, and regional hosting. See: OpenAI Enterprise Privacy, Azure OpenAI data & privacy, Vertex AI data governance.
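As a sketch of pattern C, the snippet below routes a request through an Azure OpenAI resource in your own tenant using the official `openai` Python package. The endpoint, deployment name, and API version are placeholders; retention and residency are configured on the resource, not in code:

```python
import os
from openai import AzureOpenAI

# Minimal sketch: send requests through your institution's Azure OpenAI
# resource instead of a consumer front-end. Data handling (no-training
# default, retention, region) is governed by the tenant configuration.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # example GA version; check your deployment
)

response = client.chat.completions.create(
    model="YOUR-DEPLOYMENT-NAME",  # Azure uses deployment names, not model IDs
    messages=[{"role": "user", "content": "Draft an abstract from these notes."}],
)
print(response.choices[0].message.content)
```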