1 Introduction
1.1 Large language models and their use in legal contexts
1.1.1 Introduction to LLMs
1.1.2 OpenAI’s LLMs
gpt-3.5-turbo-0301
(the basis of ChatGPT at the time) and text-davinci-003.
1.1.3 Using OpenAI’s LLMs
When using OpenAI’s models through their API, users can specify several additional parameters. All of these settings are documented at https://platform.openai.com/docs/api-reference; we highlight two. The first is the temperature, which determines how random (as opposed to deterministic) the output of the model is. Possible temperature values range from 0 to 2, with higher values adding more randomness to the outputs. Users must also set the maximum number of “tokens” they will allow the model to use in its output; in natural language processing (NLP), a token is an element of text (e.g., a word or a character). OpenAI charges users of its API based on the number of tokens used in the input and output of a prompt, with different models allowing different volumes of tokens to be used (OpenAI 2023a).
[Figure: example prompt]
Task instructions:
1. Flora native to California (provided): [California native plant 1], [California native plant 2], [California native plant 3]
2. Flora native to Texas (provided): [Texas native plant 1], [Texas native plant 2]
New task: New Mexico native flora
Please use your knowledge of native plants of California and Texas to provide five distinct native plant species in New Mexico.
Desired New Mexico plant species (to be filled in by model): [Plant name], [Plant name], [Plant name], [Plant name], [Plant name]
1.2 LLMs’ use in legal tasks
1.2.1 Legal tasks
text-davinci-003
model legal reasoning tasks outside of an examination setting, asking it to apply statutes to answer questions and evaluating its performance systematically. Nay et al. (2023) used multiple choice questions for this task. Blair-Stanek et al. (2023) used the StAtutory Reasoning Assessment (SARA) data set for their task, finding that GPT-3 performed significantly better than BERT, but “performed at chance (0.5) or worse in the zero-shot tests where there was no statute included.” They noted that, in particular, GPT-3 displayed incorrect knowledge of U.S. tax code. When given synthetic statutes, GPT-3 performed even worse, which they claim raises “doubts about GPT-3’s ability to handle basic legal work” (Blair-Stanek et al. 2023).
1.2.2 Models
text-davinci-003
, as we do in answering our first research question. Nay et al. (2023) used various models and noted that GPT-4 (the state of the art at the time) performed best; they also used GPT-4 to help grade their answers. Yu et al. (2022) used GPT-3.
1.2.3 Areas of law
1.2.4 Prompt engineering and parameter selection
For drafting essays, Choi et al. (2023) asked ChatGPT to write essays section by section; we use a similar prompting approach in this study.
“Draft a legal complaint for a Massachusetts state court by John Doe against Jane Smith for injuries arising out of a car accident on January 1, 2022 caused by Jane Smith at the intersection of Tremont Street and Park Street in Boston. The complaint should specify that Jane Smith failed to stop at a red light and caused John Smith serious injuries.”
1.2.5 Consideration of LLMs’ training data
1.3 Cryptocurrency securities violations and related law
1.3.1 U.S. securities laws and securities class action lawsuits
1.3.2 Introduction to cryptocurrencies
1.3.3 Cryptocurrencies as securities
2 Methods
2.1 GPT-3.5’s ability to discern violations of U.S. law
2.1.1 Choice of model
text-davinci-003
model, which is trained on data collected prior to June 2021. Aside from newer models being more expensive to run, their training data sets are more recent. For this research question, we wanted to ensure that the model had no prior knowledge of our cases (i.e., that these cases could not be present in its training data). We used OpenAI’s native Python wrapper to execute our API calls.
2.1.2 Case selection
text-davinci-003
model is 4,097, we excluded all cases whose facts sections exceeded this number of tokens (using OpenAI’s Tokenizer tool to calculate the number of tokens) (OpenAI 2023c). This resulted in the exclusion of 50 cases. We also excluded any administrative SEC cases (\(n=9\)).
2.1.3 Prompt design
“The following text is from the \“factual allegations\” section of a complaint filed in the [jurisdiction; in the sample case, Eastern District of New York]. Based on the facts in this text, please identify which federal civil law(s) and section thereof the defendant(s) violated. Please use the following method of legal reasoning to come up with the allegations: Issue, Rule (including the specific statute and section thereof), Application, Conclusion: [text from factual allegations section]”
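As a minimal sketch of how this template might be filled for each case, the following Python fills in the jurisdiction and the facts text. The helper names (`build_prompt`, `completion_budget`) are hypothetical and not taken from the study's code; only the template wording and the 4,097-token limit come from the text. The prompt's token count is assumed to be obtained elsewhere (for example, from a tokenizer) and is passed in as a plain integer.

```python
# Hypothetical helpers for illustration; not the authors' actual code.
CONTEXT_LIMIT = 4097  # total tokens the model accepts, as stated in the text

PROMPT_TEMPLATE = (
    'The following text is from the "factual allegations" section of a '
    "complaint filed in the {jurisdiction}. Based on the facts in this text, "
    "please identify which federal civil law(s) and section thereof the "
    "defendant(s) violated. Please use the following method of legal "
    "reasoning to come up with the allegations: Issue, Rule (including the "
    "specific statute and section thereof), Application, Conclusion: {facts}"
)


def build_prompt(jurisdiction: str, facts: str) -> str:
    """Fill the template with a court name and a factual allegations section."""
    return PROMPT_TEMPLATE.format(jurisdiction=jurisdiction, facts=facts.strip())


def completion_budget(prompt_tokens: int, context_limit: int = CONTEXT_LIMIT) -> int:
    """Tokens left for the model's output once the prompt is accounted for."""
    if prompt_tokens >= context_limit:
        raise ValueError("prompt leaves no room for a completion")
    return context_limit - prompt_tokens
```

Passing the token count in as an argument keeps the sketch independent of any particular tokenizer library.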
2.1.4 Input pre-processing and cleaning
2.1.5 Parameters and execution
max_tokens
parameter specifies the maximum length of text to be generated: we set this to the total number of tokens this model accepts (4,097) minus the number of tokens provided in the prompt for each case, thereby allowing the maximum possible size for each output. We also recorded the number of tokens used for completion in each output for each of our cases; this can be found in “Appendix 4”.
2.1.6 Evaluation
- Rule 1: In cases where violations are almost always alleged together (for example, Section 10(b) of the Exchange Act and Rule 10b-5 thereunder), we counted them as a single violation and scored accordingly. However, if GPT-3.5 identified only one of the two (for example, a Rule 10b-5 violation, but not a violation of Section 10(b)), we scored this as 0.5.
- Rule 2: We also awarded 0.5 points if the output included the correct law, but failed to include the specific section the defendants violated. So, for example, if the output suggested violations of the Securities Act, but did not specify that the allegations were of Sections 5(a) and 5(c) thereof, 0.5 points would be awarded. However, if the output merely stated that violations of “federal securities laws” occurred, it was given a score of 0.
- Rule 3: In cases where the complaint charged Sections 5(a) and 5(c) of the Securities Act and the output only included Section 5 thereof (overall), we considered this as a true positive (because Section 5 overall would include Sections 5(a) and 5(c)).
- Rule 4: For the purpose of calculating true positives and false negatives, where the complaint charged different counts of the exact same allegations (usually just for different defendants), we counted them as a single violation. That being said, there was one case (Securities and Exchange Commission v. Arbitrade Ltd., et al., 1:22-cv-23171, S.D. Fla.), where one claim was for Rule 10b-5 under the Exchange Act and another for 10b-5(c) (against a different defendant). However, since one of the charges did include all of Rule 10b-5, we counted these both as a single violation.
- Rule 5: We did not infer any charges if the output failed to reference the appropriate statute (i.e., if it included only “unregistered securities”, we did not assume it meant a violation of Section 5 or Section 12(a)(1) of the Securities Act).
- Rule 6: Finally, the prompt specifically requested violations of federal laws. Some outputs included state law violations. Because this was contrary to the prompt’s instructions, these were automatically considered as false positives for the sake of scoring.
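The partial-credit logic in these rules can be sketched roughly as follows. This is an illustrative encoding only: the scoring in the study was applied by the authors reading each output, and the `Match` structure, its field names, and both functions are hypothetical stand-ins for those judgments.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """How one charged violation shows up in a model output (hypothetical)."""
    statute_named: bool             # e.g. the output names the Securities Act
    section_named: bool             # e.g. "Section 5" or "Sections 5(a) and 5(c)"
    paired_half_only: bool = False  # e.g. Rule 10b-5 without Section 10(b)


def credit(match: Match) -> float:
    """Credit for one charged violation under Rules 1, 2, 3 and 5."""
    if not match.statute_named:
        # Rule 2 (generic "federal securities laws") and Rule 5 (no inference)
        return 0.0
    if not match.section_named:
        # Rule 2: correct law but no specific section
        return 0.5
    if match.paired_half_only:
        # Rule 1: only one half of a pair that is normally alleged together
        return 0.5
    # Full credit, including Rule 3 (Section 5 overall covers 5(a) and 5(c))
    return 1.0


def is_false_positive(is_state_law: bool, charged_in_complaint: bool) -> bool:
    """Rule 6: state-law violations are always false positives."""
    return is_state_law or not charged_in_complaint
```

Rule 4 (duplicate counts against different defendants) would be handled before this step, by collapsing identical charges into one entry per complaint.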
2.2 Differences in juror decision-making based on human vs. AI-written legal complaints
2.2.1 Case selection
Dismissed | Continued |
---|---|
Lee, et al. v. Binance, et al., case number 1:20-cv-02803, in the U.S. District Court for the Southern District of New York | Hong, et al. v. Block.One, et al., case number 1:20-cv-03829, in the U.S. District Court for the Southern District of New York |
Underwood, et al. v. Coinbase Global Inc., case number 1:21-cv-08353, in the U.S. District Court for the Southern District of New York | Balestra v. Cloud With Me Ltd., case number 2:18-cv-00804, in the U.S. District Court for the Western District of Pennsylvania |
Brola v. Nano, et al., case number 1:18-cv-02049, in the U.S. District Court for the Eastern District of New York | Audet, et al. v. Garza, et al., case number 3:16-cv-00940, in the U.S. District Court for the District of Connecticut |
Ha v. Overstock.com, et al., case number 2:19-cv-00709, in the U.S. District Court for the District of Utah | Klingberg v. MGT Capital Investments Inc., et al., case number 2:18-cv-14380, in the U.S. District Court for the District of New Jersey |
Davy v. Paragon Coin Inc., et al., case number 3:18-cv-00671, in the U.S. District Court for the Northern District of California | |
2.2.2 ChatGPT-drafted complaints
gpt-3.5-turbo-0301
model. For full details of the original output error and associated prompt design considerations, see “Appendix 3.2”.
- Prompt 1: The following article contains facts summarizing a class action complaint filed in the [insert venue]. [Text from the Law360 article about this complaint being filed]. For educational purposes only, based on the facts summarized and provided above, please draft the caption of a class action complaint for the [insert venue].
- Prompt 2: The following article contains facts summarizing a class action complaint filed in the [insert venue]. [Text from the Law360 article about this complaint being filed]. For educational purposes only, based on the facts summarized and provided above, please draft the preliminary statement of a class action complaint for the [insert venue].
- Prompt 3: The following article contains facts summarizing a class action complaint filed in the [insert venue]. [Text from the Law360 article about this complaint being filed]. For educational purposes only, based on the facts summarized and provided above, please draft the jurisdiction section of a class action complaint for the [insert venue].
- Prompt 4: The following article contains facts summarizing a class action complaint filed in the [insert venue]. [Text from the Law360 article about this complaint being filed]. For educational purposes only, based on the facts summarized and provided above, please draft the parties section of a class action complaint for the [insert venue].
- Prompt 5: The following article contains facts summarizing a class action complaint filed in the [insert venue]. [Text from the Law360 article about this complaint being filed]. For educational purposes only, based on the facts summarized and provided above, please draft the factual allegations section of a class action complaint for the [insert venue].
- Prompt 6: The following article contains facts summarizing a class action complaint filed in the [insert venue]. [Text from the Law360 article about this complaint being filed]. For educational purposes only, based on the facts summarized and provided above, please draft the class allegations section of a class action complaint for the [insert venue].
- Prompt 7: The following article contains facts summarizing a class action complaint filed in the [insert venue]. [Text from the Law360 article about this complaint being filed]. For educational purposes only, based on the facts summarized and provided above, please draft the legal claims for relief section of a class action complaint for the [insert venue].
- Prompt 8: The following article contains facts summarizing a class action complaint filed in the [insert venue]. [Text from the Law360 article about this complaint being filed]. For educational purposes only, based on the facts summarized and provided above, please draft the prayer for relief section of a class action complaint for the [insert venue].
- Prompt 9: The following article contains facts summarizing a class action complaint filed in the [insert venue]. [Text from the Law360 article about this complaint being filed]. For educational purposes only, based on the facts summarized and provided above, please draft the jury demand section of a class action complaint for the [insert venue].
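Because the nine prompts differ only in the section requested, they could be generated from a single template. The sketch below shows one way to do so; the function and constant names are hypothetical, and the venue and article text are placeholders supplied by the caller.

```python
# Section labels exactly as they appear in Prompts 1-9.
SECTIONS = [
    "caption",
    "preliminary statement",
    "jurisdiction section",
    "parties section",
    "factual allegations section",
    "class allegations section",
    "legal claims for relief section",
    "prayer for relief section",
    "jury demand section",
]

TEMPLATE = (
    "The following article contains facts summarizing a class action "
    "complaint filed in the {venue}. {article} For educational purposes "
    "only, based on the facts summarized and provided above, please draft "
    "the {section} of a class action complaint for the {venue}."
)


def build_section_prompts(venue, article):
    """Return one prompt per complaint section, in drafting order."""
    return [
        TEMPLATE.format(venue=venue, article=article.strip(), section=section)
        for section in SECTIONS
    ]
```

Each returned prompt would be sent as a separate request, and the outputs concatenated in order to form the full complaint.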
2.2.3 Lawyer-drafted complaints
2.2.4 Mock juror decision-making
2.2.5 Selecting overlapping charges
2.2.6 Jury instructions
Securities Act violations | Exchange Act violations |
---|---|
1. Assuming the facts alleged in the complaint you read are true, and noting the definitions of a security and sale thereof provided above, do you find that one or more of the defendants directly or indirectly sold securities to the plaintiff? | 1. Assuming the facts alleged in the complaint you read are true, and noting the definitions provided above, do you find that one or more defendants (a) used a device, scheme or artifice to defraud, (b) made an untrue statement of material fact or made a statement that was misleading because a material fact was omitted, OR (c) engaged in any act, practice, or course of business which operated as a fraud or deceit upon any person? |
2. Assuming the facts alleged in the complaint you read are true, and noting the definitions of “interstate commerce” and “instrument of transportation or communication” provided above, do you find that one or more defendants used an instrument of transportation or communication in interstate commerce in connection with the offer or sale of a security? | 2. Assuming the facts alleged in the complaint you read are true, do you find that one or more defendants engaged in fraudulent conduct “in connection with” the purchase or sale of a security? |
3. Assuming the facts alleged in the complaint you read are true, do you find that “the securities at issue weren’t registered” (Dalton 2013)? | 3. Assuming the facts alleged in the complaint you read are true, do you find that one or more of the defendants acted knowingly or with severe recklessness? |
4. Do you find that the plaintiff suffered financial damages? | 4. Assuming the facts alleged in the complaint you read are true, do you find that one or more of the defendants’ conduct involved interstate commerce, the use of the mails, or a national securities exchange? |
5. Did you answer “yes” to all of questions 1, 2, 3, and 4? | 5. Did you answer “yes” to all of questions 1, 2, 3, and 4? |
6. Thinking about your answers to questions 1–5, how confident are you, overall, that you have made the correct decision? Please rate your confidence on a scale of 1–5, with 5 being extremely confident and 1 being not confident at all | 6. Thinking about your answers to questions 1–5, how confident are you, overall, that you have made the correct decision? Please rate your confidence on a scale of 1–5, with 5 being extremely confident and 1 being not confident at all |
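Question 5 is a logical conjunction of questions 1 to 4, so a respondent's answer to it can be checked against their earlier answers. The sketch below illustrates that check; treating the result as the juror's overall decision on the charge is an assumption made here for illustration, and the function names are hypothetical.

```python
def all_elements_found(answers):
    """True only if questions 1-4 were all answered "yes".

    `answers` maps question number (1-4) to True for "yes", False for "no".
    """
    return all(answers[q] for q in (1, 2, 3, 4))


def q5_is_consistent(answers, q5_answer):
    """Check that the answer to question 5 matches questions 1-4."""
    return q5_answer == all_elements_found(answers)


def valid_confidence(rating):
    """Question 6 uses a 1-5 scale."""
    return 1 <= rating <= 5
```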
2.2.7 Survey
3 Results
3.1 GPT-3.5’s ability to discern violations of federal U.S. law
Metric | Mean | SD |
---|---|---|
Recall | 0.252 | 0.304 |
Precision | 0.658 | 0.459 |
Final score | 0.324 | 0.317 |
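For reference, recall and precision follow the standard definitions, recall = TP / (TP + FN) and precision = TP / (TP + FP), computed per case and then summarized across cases. The sketch below uses made-up tallies, not data from this study. Returning 0 when a denominator is zero and using the sample standard deviation are both assumptions, and how half-point scores enter the tallies is not shown.

```python
from statistics import mean, stdev


def precision(tp, fp):
    """Share of violations in the output that were actually charged."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of charged violations that the output identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up (tp, fp, fn) tallies for three hypothetical cases.
cases = [(1, 0, 3), (0, 1, 2), (2, 1, 2)]

recalls = [recall(tp, fn) for tp, fp, fn in cases]
precisions = [precision(tp, fp) for tp, fp, fn in cases]

# (mean, sample standard deviation) for each metric
summary = {
    "recall": (mean(recalls), stdev(recalls)),
    "precision": (mean(precisions), stdev(precisions)),
}
```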
3.2 ChatGPT’s pleading drafting ability
3.2.1 Respondent agreement
Case | Number of participants | % Yes |
---|---|---|
Lee, et al. v. Binance, et al., GPT-drafted complaint | 4 | 75 |
Lee, et al. v. Binance, et al., lawyer-drafted complaint | 4 | 100 |
Underwood, et al. v. Coinbase, et al., GPT-drafted complaint | 6 | 83.3 |
Underwood, et al. v. Coinbase, et al., lawyer-drafted complaint | 5 | 80 |
Brola v. Nano, et al., GPT-drafted complaint | 4 | 100 |
Brola v. Nano, et al., lawyer-drafted complaint | 4 | 75 |
Ha v. Overstock.com, et al., GPT-drafted complaint | 4 | 75 |
Ha v. Overstock.com, et al., lawyer-drafted complaint | 5 | 100 |
Hong, et al. v. Block.One, et al., GPT-drafted complaint, Securities Act charge | 4 | 100 |
Hong, et al. v. Block.One, et al., lawyer-drafted complaint, Securities Act charge | 4 | 100 |
Hong, et al. v. Block.One, et al., GPT-drafted complaint, Exchange Act charge | 6 | 83.3 |
Hong, et al. v. Block.One, et al., lawyer-drafted complaint, Exchange Act charge | 4 | 50 |
Balestra v. Cloud With Me Ltd., GPT-drafted complaint | 4 | 100 |
Balestra v. Cloud With Me Ltd., lawyer-drafted complaint | 4 | 100 |
Audet, et al. v. Garza, et al., GPT-drafted complaint | 5 | 50 |
Audet, et al. v. Garza, et al., lawyer-drafted complaint | 4 | 75 |
Klingberg v. MGT Capital Investments Inc., et al., GPT-drafted complaint | 4 | 50 |
Klingberg v. MGT Capital Investments Inc., et al., lawyer-drafted complaint | 4 | 100 |
Davy v. Paragon Coin Inc., et al., GPT-drafted complaint | 5 | 80 |
Davy v. Paragon Coin Inc., et al., lawyer-drafted complaint | 4 | 100 |
3.2.2 Juror confidence
3.2.3 Association between author and juror decision
Proven | ChatGPT | Lawyer | Total |
---|---|---|---|
No | 9 (19.6%) | 5 (11.9%) | 14 (15.9%) |
Yes | 37 (80.4%) | 37 (88.1%) | 74 (84.1%) |
Total | 46 | 42 | 88 |
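One common way to test for association in a 2x2 table like this is Fisher's exact test. The sketch below implements a two-sided version using only the standard library and applies it to the counts above. It is offered as an illustration only: the text here does not state which test was used, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * tolerance
    )


# Counts from the table: rows are "No"/"Yes", columns are ChatGPT/Lawyer.
p_value = fisher_exact_two_sided(9, 5, 37, 37)
odds_ratio = (9 * 37) / (5 * 37)
```

The odds ratio compares the odds of a "No" decision for ChatGPT-drafted complaints with the odds for lawyer-drafted ones.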