Key Contacts: Sean McElligott – Partner | Anne Bateman – Partner

The European Data Protection Board (EDPB) recently published an interim report on the work undertaken by the ChatGPT Taskforce, which was established to exchange information on possible enforcement actions concerning the processing of personal data by ChatGPT.

The report is a preliminary view on certain aspects of the investigations by various Supervisory Authorities, but it nonetheless contains some helpful early insights into how European regulators may investigate companies providing Large Language Models (LLMs).

The report helpfully distinguishes different stages in the processing of personal data involved in using an LLM and then assesses the impact of the GDPR on each stage:

  1. collection of training data (including the use of web-scraped data or the reuse of datasets);
  2. pre-processing of the data (including filtering);
  3. training;
  4. prompts and ChatGPT output; and
  5. training ChatGPT with prompts.

One of the first questions controllers of LLMs must wrestle with is: "what is the legal basis for the processing?" Under the GDPR, any processing of personal data must meet at least one of the conditions specified in Article 6(1) and, where applicable, the additional requirements laid out in Article 9(2).

The report notes that ChatGPT seeks to rely on Article 6(1)(f) of the GDPR as the legal basis for the processing of personal data in the context of its web scraping. There is perhaps no great surprise in this approach, as Article 6(1)(f) allows a controller to process data when it is necessary for the purposes of the legitimate interests pursued by the controller, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject. In other words, the fundamental rights and freedoms of data subjects on the one hand and the controller's legitimate interests on the other have to be evaluated and balanced carefully.

If special categories of personal data are processed (which they frequently are), then one of the exceptions in Article 9(2) must also apply for the processing to be lawful. Interestingly, the report notes that, in principle, one of these exceptions can be Article 9(2)(e), although it makes clear that the mere fact that personal data is publicly accessible (and therefore "scrapable") does not imply that "the data subject has manifestly made such data public". Rather, in order to rely on the exception in Article 9(2)(e), the controller will have to ascertain whether the data subject intended, explicitly and by a clear affirmative action, to make the personal data in question accessible to the general public.

Finally, the report acknowledges that where large amounts of personal data are collected via web scraping, a case-by-case examination of each dataset would hardly be possible.

Going forward, it appears that many other GenAI systems and LLMs will rely on the legitimate interest provision of Article 6(1)(f) as a justification for their processing of personal data to train their models. In a statement a few weeks after the EDPB published its report, Meta made clear that it is using Article 6(1)(f) as the legal basis for using personal data to train its Llama models. The US-based start-up Anthropic, which operates a family of LLMs named Claude, cites legitimate interest as a legal basis for the processing of personal data in its privacy policy. Similarly, France-based Mistral AI references Article 6(1)(f)'s legitimate interest in its privacy policy as a justification for processing personal data.

Usefully, the report annexes a questionnaire, developed within the context of the ChatGPT Taskforce, which contains five pages of very detailed questions. This public questionnaire provides a valuable insight into how regulators may engage with companies in the LLM space.

Although an interim report, it nonetheless amounts to a careful and measured view of the use of personal data in the context of LLMs. It is perhaps more accepting than many were expecting and provides valuable insights for AI providers in the context of the inevitable regulatory investigations that will arise in the future, serving as an initial roadmap for compliance with the GDPR and the upcoming AI Act.