Human Rights Here blog

The Implications of Generative AI in the EU Data and Copyright Protection Frameworks

Credits: Taner Kuru

 

By Eleni Kosta, João Pedro Quintais and Taner Kuru

 

Since the launch of ChatGPT in November 2022, Generative AI (GenAI) and Large Language Models (LLMs), a sub-set of GenAI models, have been embraced by millions of users worldwide. ChatGPT became the fastest-growing consumer application in history, reaching 100 million monthly active users just two months after its launch. While several sources and databases are used to train these models, their developers rely to a large extent on scraping publicly accessible data online for training purposes. These practices have led not only to public scrutiny and criticism of GenAI models but also to legal questions about their permissibility, including in light of individuals’ fundamental rights and freedoms, such as those relating to data protection and copyright.

For example, under Article 8 of the Charter of Fundamental Rights of the European Union (Charter), everyone has the right to the protection of personal data concerning them. This requires personal data to be processed in a manner consistent with the data protection principles. However, the techniques used to train GenAI models may infringe on this fundamental right. In the context of LLMs, for instance, developers often rely on content scraped from publicly accessible websites, including news articles and personal blogs. Accordingly, LLMs are trained on data that includes millions of individuals’ (sensitive) personal data. Given the massive scale of the processing activities conducted by the developers of these models, their distance from data subjects, and the fact that data subjects could not have reasonably expected their data to be used for such purposes, the compliance of these processing activities with the data protection principles – which require personal data processing to be, among other things, lawful, fair, and transparent – has been questioned. This is evidenced by several investigations opened by data protection authorities in the EU and, most importantly, by the creation of a dedicated task force on ChatGPT by the European Data Protection Board (EDPB).

A similar discussion is also being held in the copyright protection realm, as GenAI models have often been trained on copyright-protected works and other protected subject matter without the permission of copyright holders. The result has been a flurry of litigation, especially in the US, where there are over 30 cases pending against GenAI companies for copyright infringement. In the EU, this legal friction poses challenges at the constitutional level, considering that Article 17(2) of the Charter protects intellectual property – and thus copyright – as a fundamental right. At the secondary law level, this issue is mostly regulated by the 2019 Copyright in the Digital Single Market (DSM) Directive. Training and developing AI models may involve a number of activities (e.g., web scraping, pre-training, training) that often entail copyright-relevant reproductions. In EU copyright law, many such activities qualify as “text-and-data mining” (TDM) and are mainly regulated by different exclusive rights of reproduction and the two TDM exceptions in the DSM Directive. In simple terms, Articles 3 and 4 of the Directive contain two mandatory TDM-related exceptions, which allow these activities subject to certain requirements, such as lawful access and, under Article 4, the right of the copyright holder to opt out. To make things more complex, the recent AI Act complements these rules with obligations for general-purpose AI (GPAI) model providers to put in place a policy to respect EU copyright law (including the opt-out requirement) and to draw up and make publicly available a sufficiently detailed summary of the content used for training. Still, it is unclear whether this thicket of rules will protect copyright holders or ensure their remuneration.

Therefore, as these models become increasingly integrated into our lives, it is essential to investigate whether and how a fair balance can be struck between these competing rights and interests, so that technological progress proceeds responsibly and fairly.

To that end, Professor Eleni Kosta and Taner Kuru from the Tilburg Institute for Law, Technology, and Society (TILT) of Tilburg University and Dr. João Pedro Quintais from the Institute for Information Law (IViR) of the University of Amsterdam, with the support of the Netherlands Network for Human Rights Research (NNHRR), co-organized a research workshop on the “Implications of Generative AI in the EU Data and Copyright Protection Frameworks” on 19 September 2024. The workshop brought together experts from academia and key human rights institutions, such as the Italian Data Protection Authority, which temporarily blocked ChatGPT in early 2023 over several infringements of EU data protection law.

The workshop started with a session addressing copyright-related issues. The session was opened by Professor Thomas Margoni, who provided an overview of the potential implications of the newly introduced AI Act for GenAI developers, ranging from transparency obligations to the legitimate interest of relevant stakeholders in scrutinizing training datasets. The ensuing discussion revolved around whether and to what extent the relevant provisions of the AI Act are feasible and enforceable, and whether they meaningfully contribute to the protection of copyrighted works against undesired practices. In addition, some currently unclear matters, such as the opt-out regime, were identified as potential causes for concern when applied in practice. Thought-provoking questions were also raised, such as whether the existing copyright regime is equipped to address the emerging concerns related to GenAI models.

The second session focused on data protection concerns specifically related to training LLMs on publicly accessible personal data online. It opened with remarks by Dr. Giuseppe D’Acquisto, Senior Technology Advisor at the Italian Data Protection Authority, who clustered the issues to be addressed under three main questions. First, what is the legal basis for these processing activities? Second, how can the data protection principles be effectively implemented in these processing activities? Third, how can data subject rights be enforced against these processing activities? The discussion among the experts focused on what measures developers of these models could implement to ensure appropriate safeguards against unreasonable interferences with the right to data protection of the individuals concerned. The participants took up these questions by critically reflecting on the guidance provided on these matters so far by several data protection authorities, such as the Discussion Paper on LLMs and Personal Data by the Hamburg Commissioner for Data Protection and Freedom of Information and, most importantly, the Report of the work undertaken by the EDPB’s ChatGPT Taskforce. Furthermore, along with the implications of LLM hallucinations for the freedom of expression and information and for compliance with the accuracy principle, the participants addressed the potential implications of several AI Act provisions at the intersection of GenAI and the EU data protection framework.

The third and final session of the workshop was dedicated to an open debate, in which participants could build on and extend the discussions of the previous sessions and identify common issues and possible solutions by drawing lessons from each framework. In this regard, common elements of both frameworks, such as their territorial scopes and enforcement procedures, were compared, and their advantages and shortcomings were assessed from different perspectives through several case studies from within and outside the EU legal framework. Thought-provoking questions were also posed to stimulate the discussion, such as whether we need a new legal framework to regulate GenAI, or whether we should focus on values rather than rights in doing so. As is often the case in legal debates, the workshop left us with more questions than answers, indicating how much work lies ahead.

 

Bios:

Eleni Kosta is Professor of Technology Law and Human Rights at the Tilburg Institute for Law, Technology and Society (TILT, Tilburg University, the Netherlands). Eleni conducts research on human rights with a focus on privacy and data protection, specializing in electronic communications and new technologies. She has been involved in numerous EU research projects. In 2014, Eleni was awarded a personal research grant for research on privacy and surveillance by the Dutch Research Council (NWO VENI). She is a member of the Advisory Board of the Dutch digital rights organization Bits of Freedom and an observer to the Europol Financial Intelligence Public Private Partnership (EFIPPP). She is a member of the editorial boards of academic journals (EDPL, IRLCT, etc.) and of the scientific and organizing committees of conferences and workshops (CPDP, ISP, etc.). Eleni also collaborates as an associate with timelex (www.timelex.eu).

João Pedro Quintais is Associate Professor at the Institute for Information Law (IViR), Amsterdam Law School. Starting with a focus on copyright law, João’s research agenda has developed along three strands. First, he studies how intellectual property (IP) law applies to new technologies, from peer-to-peer networks to streaming, hyperlinking, blockchain, and Artificial Intelligence (AI). Second, he examines the implications of copyright law and its (algorithmic) enforcement for Internet users’ rights and freedoms, creators’ remuneration, and technological development. Third, he assesses the role and responsibilities of large-scale platforms, especially in the context of algorithmic moderation of illegal or harmful content.

João’s recent and current research includes: an NWO VENI Grant (SSH) for the project “Responsible Algorithms: How to Safeguard Freedom of Expression Online”; leading a Work Package on content moderation on hosting platforms and its impact on access to culture for the Horizon 2020 project ‘reCreating Europe’; interdisciplinary projects for the European Commission tackling the challenges of AI to the IP rights framework and collective rights management in Europe; and an international project on the Right to Research in International Copyright Law.

João is also Co-Managing Editor of the widely read Kluwer Copyright Blog, co-Director of the Glushko & Samuelson Information Law and Policy Lab, a member of the European Copyright Society, a member of the Management Team of the Digital Services Act Observatory, a member of the ALGOSOC project on Public Values in the Algorithmic Society, and a member of the Netherlands Network for Human Rights Research (Working Group Human Rights in the Digital Age).

João has published extensively in the area of information law. His publications can be found on his institutional page or on SSRN.

Taner Kuru is a PhD researcher at the Tilburg Institute for Law, Technology and Society (TILT). His research focuses on the implications of emerging technologies for the EU data protection framework. Before joining TILT, he completed the Advanced LL.M. in Law and Digital Technologies program at Leiden University with cum laude distinction as an awardee of the Jean Monnet Scholarship in 2020. He then received the European Data Protection Law Review’s “Young Scholar Award” for his article “Genetic Data: The Achilles’ Heel of the GDPR?”, based on his master’s thesis. He also worked as an intern at the United Nations Interregional Crime and Justice Research Institute (UNICRI) Centre for Artificial Intelligence and Robotics in 2020. Taner is also a registered lawyer at the Ankara Bar Association and previously worked as an attorney-at-law in Turkey.
