3.1 Human feedback
The primary challenge in deploying RLHF revolves around obtaining human feedback effectively. Human feedback in RLHF can take various forms, including but not limited to: (a) preference ratings, (b) summarizations, (c) corrections, (d) demonstrations, and (e) specific reward signals. In most domains, obtaining human feedback involves time-consuming manual data labeling. Public patent data and prosecution history, however, present a distinct advantage because they inherently include human feedback or intent. For instance, regarding (a) preference ratings, a granted patent claim is considered preferred, whereas a rejected claim is not. In terms of (b) summarizations, the abstract of a granted patent can be derived from the patent’s description or claims. (c) Corrections can be identified through revised patent claims, which may address issues such as antecedent-basis errors encountered during patent prosecution. Regarding (d) demonstrations, existing dependent claims can serve as examples illustrating how dependent claims are derived from independent claims.
When it comes to patent drafting and human intent, it is desirable to incorporate the drafting intent in controlling patent text generation. For instance, considering (e) specific reward signals, if the goal is to achieve a broader patent scope, shorter patent claims with fewer limitations can represent a higher reward signal. Conversely, when the drafting intent is to avoid anticipation by prior art and make the claim easier to grant (at the cost of potentially lower patent value), longer patent claims might be favored and represent a higher reward signal. In summary, the types of human feedback in (a)–(d) can be derived from patent data and prosecution history. For type (e), it is desirable for generative language models to be controllable and capable of reflecting the intent behind patent drafting. Owing to limitations in resources and the availability of public data, this research concentrates solely on implementing human feedback of types (a) and (e). More details are provided in Sect. 3.3.
3.3 Dataset
This research relies on two primary sources of raw data provided by the USPTO: the AIPD (USPTO 2022) and PatentsView (http://www.patentsview.org/). The AIPD offers information on and categorization of AI patents, while PatentsView provides details pertaining to patent documents, such as patent claims. The AIPD contains a data file that identifies U.S. patents issued between 1976 and 2020, as well as pre-grant publications, containing one or more of eight AI technology components. These components are defined as: machine learning, evolutionary computation, natural language processing, speech, computer vision, knowledge processing, planning and control, and AI hardware. The authors in Giczy et al. (2021) generated this data file using a machine learning approach that analyzed patent text and citations to identify AI components in U.S. patent documents. This research follows the naming convention of the AI components in the data file: ML, EVO, NLP, SPEECH, VISION, KR, PLANNING, and HARDWARE. To conduct the experiments in this research, eight training datasets are created, each corresponding to one of these eight AI components. In the AIPD data file, a document id can take one of two forms: (1) a patent number for granted patents, or (2) a publication number if the document is a published patent application (pre-grant). Additionally, the data file contains an application id, which represents the application number of a patent application. While the AIPD data file is helpful for identifying the AI categories of a patent document, it does not include the actual textual content of the document, such as the title, abstract, description, and claims.
Researchers have two sources of data for accessing the textual content of patent documents: the PatentsView platform and the Bulk Data Storage System (BDSS). One key distinction between PatentsView and BDSS is their data structure: BDSS is document-centric, while PatentsView is database-centric. In BDSS, a single XML file contains all the textual data and metadata for a given patent. In PatentsView, by contrast, the textual data for a patent is spread across multiple database table files. These individual table files can be imported and combined to create a comprehensive database. To achieve quicker iterations and facilitate model training, this research concentrates on the textual data of claim one. Consequently, the PatentsView platform is the better option for accessing the patent claims and integrating them with the AIPD data. PatentsView provides downloadable table files for granted patents and for pre-grant applications, released on a yearly basis. The relation between granted patents and pre-grant applications can be identified through another table file, called pg_granted_pgpubs_crosswalk, which maps patent application numbers to their corresponding granted patent numbers.
It is worth mentioning that a database dump from PatentsView is accessible upon request, which can simplify the process of creating a database. However, upon inspection, the patent claim text is not included in the database dump because of the substantial volume of patent claims. The inspected database dump is dated March 30, 2023. It would be beneficial if a newer version of the database dump released by the USPTO could include patent claims in the future. By integrating the AIPD data file covering the eight AI technology components with the database tables from PatentsView (granted patents, pre-grant applications, and the crosswalk), the datasets needed in this research are created, encompassing all patents in the AIPD along with the corresponding text of patent claim one. The training datasets are given the prefix AIPCO. Since the AIPD comprises eight AI technology components, eight individual datasets are constructed for the experiments in this research, each representing a specific component along with its corresponding text of patent claim one. These datasets are named as follows: AIPCO-ML, AIPCO-EVO, AIPCO-NLP, AIPCO-SPEECH, AIPCO-VISION, AIPCO-KR, AIPCO-PLANNING, and AIPCO-HARDWARE. In the context of the methodology described in Sect. 3.2, these eight datasets are used for fine-tuning during the SFT stage. Subsequently, in the RM stage, the same eight datasets are used for training reward models. The eight SFT models and reward models are then used in the PPO-stage experiment in Sect. 4.3.
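To make the dataset construction concrete, the following is a minimal sketch (not the exact pipeline used in this research) of joining the AIPD component labels with the PatentsView claim tables through the crosswalk using pandas. The file names and column names (e.g., doc_id, patent_id, pgpub_id, claim_sequence, ml_flag) are illustrative assumptions.

```python
# Minimal sketch (not the exact pipeline used here) of assembling one AIPCO dataset:
# join the AIPD component labels with PatentsView claim-one text via the crosswalk.
# File names and column names (doc_id, patent_id, pgpub_id, claim_sequence, ml_flag)
# are illustrative assumptions.
import pandas as pd

aipd = pd.read_csv("aipd.csv", dtype=str)                                  # AIPD labels
granted_claims = pd.read_csv("g_claims.tsv", sep="\t", dtype=str)          # granted patents
pgpub_claims = pd.read_csv("pg_claims.tsv", sep="\t", dtype=str)           # pre-grant pubs
crosswalk = pd.read_csv("pg_granted_pgpubs_crosswalk.tsv", sep="\t", dtype=str)

# Keep only claim one for quicker iterations, as described above.
granted_c1 = granted_claims[granted_claims["claim_sequence"] == "1"]
pgpub_c1 = pgpub_claims[pgpub_claims["claim_sequence"] == "1"]

# Example: documents flagged with the machine learning (ML) component.
ml_docs = aipd[aipd["ml_flag"] == "1"]

# Granted side: AIPD document ids matched against granted-patent claim text.
ml_granted = ml_docs.merge(granted_c1, left_on="doc_id", right_on="patent_id")

# Pre-grant side: map granted patents to their publications via the crosswalk,
# then attach the pre-grant claim text.
ml_pregrant = (ml_docs.merge(crosswalk, left_on="doc_id", right_on="patent_id")
                      .merge(pgpub_c1, on="pgpub_id"))

aipco_ml = pd.concat([ml_granted.assign(source="granted"),
                      ml_pregrant.assign(source="pre-grant")], ignore_index=True)
aipco_ml.to_csv("AIPCO-ML.csv", index=False)
```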
Table 1 presents the statistics for the eight datasets. The first column displays the names of the datasets, followed by the total number of rows in the second column. The third column represents the total count of granted patents, and the fourth column is the average length of those patents. The fifth column represents the total count of pre-grant applications, and the sixth column is the average length of those applications. It is worth mentioning that the total count of pre-grant applications being less than the count of granted patents is attributed to the crosswalk data. For some granted patents in the data, such as reissued patents, the pre-grant application number is empty. For some other rows, the reasons for this emptiness are less evident. While a further investigation might be needed to understand these reasons, for the purposes of this research, the numerical difference in rows between pre-grant and granted documents is not a major concern for training models.
Table 1
Datasets of Patent Claim Ones
Dataset | Total rows | Granted patents | Avg. claim length (granted) | Pre-grant applications | Avg. claim length (pre-grant) |
AIPCO-ML | 61,136 | 31,792 | 1359 | 29,344 | 864 |
AIPCO-EVO | 16,274 | 8476 | 1412 | 7798 | 863 |
AIPCO-NLP | 57,629 | 30,746 | 1438 | 26,883 | 828 |
AIPCO-SPEECH | 32,824 | 17,324 | 1336 | 15,500 | 805 |
AIPCO-VISION | 145,162 | 74,378 | 1280 | 70,784 | 838 |
AIPCO-KR | 297,289 | 156,648 | 1404 | 140,641 | 852 |
AIPCO-PLANNING | 317,442 | 168,297 | 1444 | 149,145 | 853 |
AIPCO-HARDWARE | 183,224 | 95,988 | 1372 | 87,236 | 819 |
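For reference, statistics like those in Table 1 can be derived with a short snippet. The following sketch assumes the column layout produced by the earlier example and measures claim length in characters, which is an assumption about the unit used.

```python
# Sketch of deriving the Table 1 statistics for one dataset, assuming the column
# layout produced above; claim length is measured in characters, an assumption.
import pandas as pd

df = pd.read_csv("AIPCO-ML.csv")
df["claim_len"] = df["claim_text"].str.len()

granted = df[df["source"] == "granted"]
pregrant = df[df["source"] == "pre-grant"]
print(f"rows={len(df)}, "
      f"granted={len(granted)} (avg len {granted['claim_len'].mean():.0f}), "
      f"pre-grant={len(pregrant)} (avg len {pregrant['claim_len'].mean():.0f})")
```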
It is noted that the Office Action Research Dataset for Patents (https://www.uspto.gov/ip-policy/economic-research/research-datasets/office-action-research-dataset-patents), provided among the USPTO Research Datasets, has the potential to enhance this research in the future. The dataset marks the first time that comprehensive data on examiner-issued rejections are available to the research community. As previously stated, an office action is a written notice to the patent applicant of the patent examiner’s decision on patentability. The notice generally discloses information such as the grounds for a rejection, the claims affected, and the pertinent prior art. According to Lu et al. (2017), the relative inaccessibility of office actions has prevented researchers from fully exploiting valuable information generated during patent prosecution. The authors in Lu et al. (2017) aim to rectify this situation by using natural language processing and machine learning techniques to systematically extract information from office actions and construct a relational database of key data elements. The dataset covers 4.4 million office actions mailed between 2008 and mid-2017 from USPTO examiners to the applicants of 2.2 million unique patent applications.
From the perspective of Sect. 3.1, office actions encompass various types of human feedback. For instance, a rejection can be categorized as (a) a preference rating as described in Sect. 3.1. How the claims are affected and revised can be considered (c) corrections. Furthermore, the pertinent prior art can play a role in (b) summarizations, which summarize the basis of the rejections in office actions. Ideally, office actions would serve as the most valuable source of the human feedback needed for RLHF. However, upon close inspection, the dataset’s coverage does not align with the primary data source (AIPD) in this research. In addition, the rejections in the dataset lack the patent text essential for training language models. As a result, for training reward models in the experiment of Sect. 4.3, pre-grant applications are chosen as the source of negative samples, and granted patents serve as positive samples. Further elaboration will be provided in that section. It is also noted that the Patent Examination Data System (https://ped.uspto.gov/peds), another data source from the USPTO, has comparable limitations and does not fulfill the requirements of this research. If either the Office Action Research Dataset or the Patent Examination Data System could offer more structured and comprehensive data in the future, a greater quantity of human feedback from patent prosecution might become available for RLHF in the patent domain. In summary, eight AIPCO datasets are prepared, one for each AI component, each containing the text of patent claim one.
3.4 Library
The implementation in this research leverages several open-source Python libraries, particularly the TRL (Transformer Reinforcement Learning) library (von Werra 2020) and its examples. TRL is a full-stack library providing a set of tools to train transformer language models with reinforcement learning. The library covers all three stages described in Sect. 3.2: SFT, RM, and PPO. The TRL library is built on top of the transformers library by Hugging Face. Therefore, pre-trained language models, such as PatentGPT-J-6B (Lee 2023) and DistilBERT (https://huggingface.co/distilbert-base-uncased), can be loaded directly. Throughout the research, TRL version 0.4.2.dev0 was used while the library was still under intensive development. Other options for implementing reinforcement learning in the language domain include TRLX (Transformer Reinforcement Learning X) (von Werra 2020), TextRL (Text Generation with Reinforcement Learning) (Lam 2023), and RL4LMs (Ramamurthy 2022). Amidst the fast-paced development in applying RLHF techniques to language models, it remains to be seen which library will stand out as the most promising in the future.
To enable the fine-tuning of language models on a single GPU with only 16 GB of VRAM, this research utilizes the LoRA (low-rank adaptation) method from the PEFT library (Sourab Mangrulkar and Sylvain Gugger 2022). The LoRA method proves effective in reducing the number of trainable parameters through parameter-efficient fine-tuning. As shown in Sourab Mangrulkar and Sylvain Gugger (2022), using LoRA on consumer hardware yields performance comparable to that of full fine-tuning, which demands high-end hardware. Furthermore, this research reduces the model size by loading the model in 8-bit precision through the bitsandbytes library (https://github.com/TimDettmers/bitsandbytes). The library is a lightweight wrapper around CUDA (Compute Unified Device Architecture) custom functions, covering 8-bit optimizers, matrix multiplication, and quantization functions. CUDA is a proprietary software layer developed by NVIDIA that gives direct access to the GPU’s virtual instruction set and parallel computational elements.
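The following sketch illustrates how a model can be loaded in 8-bit precision with bitsandbytes and wrapped with a LoRA adapter through PEFT. The model path and the LoRA hyperparameters are illustrative assumptions rather than the exact settings used in this research.

```python
# Sketch of loading a causal LM in 8-bit precision (bitsandbytes) and attaching a
# LoRA adapter (PEFT) for fine-tuning on a 16 GB GPU. The model path and the LoRA
# hyperparameters are illustrative assumptions, not the exact settings used here.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model_id = "path/to/PatentGPT-J-6B"   # placeholder; substitute the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,     # 8-bit quantization via bitsandbytes
    device_map="auto",
)
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(
    r=16,                  # rank of the decomposition matrices
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the LoRA weights are trainable
```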
3.5 Training
Based on the methodology described in Sect. 3.2, the first stage (SFT) involves fine-tuning the pre-trained language model PatentGPT-J-6B on the eight AIPCO datasets from Sect. 3.3. During fine-tuning, the PatentGPT-J-6B model is loaded in 8-bit precision together with the low-rank adapter of the LoRA method. The adapter adds pairs of rank-decomposition weight matrices to the 8-bit model, and only these newly added weights are fine-tuned. After fine-tuning, an additional step merges the adapter weights into the original model, producing the SFT model. Each of the eight AIPCO datasets is used to individually fine-tune the PatentGPT-J-6B model, resulting in eight domain-specific SFT models for the subsequent stage. Each model is fine-tuned for a single epoch. The perplexity values at the end of fine-tuning for each dataset are shown in Table 2. Perplexity is a statistical measure of how confidently a language model predicts a text sample: the lower the perplexity, the better the model predicts the next word or sequence of words in a given text. The perplexity values below are low, which suggests that the SFT models have effective predictive capabilities.
Table 2
Perplexity of supervised fine-tuning (SFT)
Dataset | AIPCO-ML | AIPCO-EVO | AIPCO-NLP | AIPCO-SPEECH | AIPCO-VISION | AIPCO-KR | AIPCO-PLANNING | AIPCO-HARDWARE |
Perplexity | 6.82 | 6.82 | 6.24 | 6.17 | 6.00 | 11.77 | 6.09 | 6.04 |
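As noted above, perplexity reflects how confidently the model predicts held-out text. It is conventionally computed as the exponential of the average cross-entropy loss, as in the minimal sketch below; the exact evaluation code used in this research is an assumption.

```python
# Perplexity as the exponential of the mean cross-entropy loss over held-out text.
# A minimal sketch; averaging per-example losses is a simplification, and the exact
# evaluation code used in this research is an assumption.
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts, device="cuda"):
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
        out = model(**enc, labels=enc["input_ids"])   # language-modeling loss
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))
```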
The second stage (RM) involves training a reward model using human feedback. This research explores two separate implementation approaches for this stage. The first approach pertains to (a) preference ratings in Sect. 3.1. In this approach, the categorization of claims as granted or pre-grant represents a form of human feedback. This implicit human feedback already exists in the data and requires no further labeling effort. The base model used for training in this stage is the distilbert-base model (https://huggingface.co/distilbert-base-uncased). The downstream task for the base model is binary classification (granted or pre-grant). A reward of 1 is assigned to instances classified as granted, while a reward of 0 is given to instances classified as pre-grant. More information about how this reward model is used in the subsequent reinforcement learning (PPO) experiments can be found in Sect. 4.4.
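A minimal sketch of this first RM approach is given below: the claims are labeled 1 (granted) or 0 (pre-grant) and distilbert-base-uncased is fine-tuned as a binary classifier. The column names, the split, and the hyperparameters are illustrative assumptions.

```python
# Sketch of the first RM approach: fine-tune distilbert-base-uncased as a binary
# classifier (granted vs. pre-grant). Column names, the split, and hyperparameters
# are illustrative assumptions.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("AIPCO-ML.csv")
df["label"] = (df["source"] == "granted").astype(int)   # granted -> 1, pre-grant -> 0
ds = Dataset.from_pandas(df[["claim_text", "label"]]).train_test_split(test_size=0.1)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ds = ds.map(lambda b: tokenizer(b["claim_text"], truncation=True, max_length=512),
            batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments("rm-aipco-ml", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tokenizer,   # enables dynamic padding via the default data collator
)
trainer.train()
```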
Returning to the base model, the distilbert-base model is a distilled version of the BERT base model. It reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster, according to Sanh et al. (2019). With only 66 M parameters, it avoids out-of-memory (OOM) issues and makes it practical in this research to run the reward model alongside the SFT model during the subsequent reinforcement learning stage on a consumer-grade GPU. The same eight AIPCO datasets used in the SFT stage are also employed in the RM stage. As a result, eight reward models are created by fine-tuning the distilbert-base model with the AIPCO datasets individually. The accuracy results of these eight reward models are presented in Table 3. Each reward model was trained for one epoch, with each AIPCO dataset split into 90% for training, 5% for validation, and 5% for testing. For demonstrative purposes, the performance of these reward models is considered adequate for building a prototype of RLHF. Assuming GPU VRAM were not a limiting factor, it is speculated that using the PatentGPT-J-6B model as the base model might yield improved performance. Confirming this assumption, however, will require additional resources and further investigation in the future.
Table 3
Performance of reward models (RM)
Dataset | AIPCO-ML | AIPCO-EVO | AIPCO-NLP | AIPCO-SPEECH | AIPCO-VISION | AIPCO-KR | AIPCO-PLANNING | AIPCO-HARDWARE |
Accuracy | 0.718 | 0.703 | 0.764 | 0.727 | 0.715 | 0.749 | 0.768 | 0.742 |
In the RM stage, the second implementation approach focuses on (e) specific reward signals as discussed in Sect. 3.1. This approach expands upon the concept of the reward model by substituting the model with different reward functions. The reward functions are designed to reflect the underlying intent behind patent claim drafting and to determine the reward accordingly. In this research, three reward functions are implemented and tested. Generally, patent practitioners prefer shorter patent claims because they offer a broader scope for the patent. In contrast, longer patent claims may increase the likelihood of obtaining a patent allowance. During patent prosecution, there are two main ways a patent examiner might reject a patent application over prior art: anticipation and obviousness. It is generally easier for longer claims to be granted because they are more likely to avoid anticipation by prior art and to reduce the likelihood of an obviousness rejection. Nevertheless, as a patent claim becomes longer, its scope tends to become narrower. From this practical perspective, the ability to control the length or scope of generated patent claims is highly desirable. To the author’s knowledge, no previous attempt had been made to control the length of patent text generation prior to this research.
In Sect. 4, the first reward function depends on the length of patent claims and computes the reward value relative to a designated maximum length. If the generated patent claim exceeds the maximum length, the reward is set to zero. Within the specified maximum length, longer patent claims receive higher rewards. This reward function guides the subsequent PPO algorithm to learn to generate longer patent claims while attempting to abide by the maximum-length constraint. The main objective of the experiment in Sect. 4.1 is to assess the viability of controlling patent text generation using RLHF; additional details concerning this reward function can be found in that section. The second reward function focuses on controllability based on patent scope. When drafting patents, the inclusion of limiting terms (e.g., “wherein”) serves to narrow the scope of a patent claim. A patent with more limiting terms and limiting clauses generally has a narrower scope, which may increase the chance of the patent being allowed while reducing the likelihood of patent infringement. In the experiment detailed in Sect. 4.2, the reward function counts the occurrences of such limiting terms, and a higher reward value is assigned when there are more limiting terms. Further information about this reward function is provided in the same section. The third reward function combines the previous two to calculate a joint reward value based on both the length of patent claims and the count of limiting terms. The specifics of this joint reward function, along with the experimental details and results, can be found in Sect. 4.3.
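A sketch of these three rule-based reward functions is shown below. The maximum length, the scaling, and the list of limiting terms are illustrative assumptions; the exact definitions appear in Sects. 4.1, 4.2, and 4.3.

```python
# Sketch of the three rule-based reward functions described above. The maximum
# length, the scaling, and the list of limiting terms are illustrative assumptions;
# the exact definitions appear in Sects. 4.1-4.3.
MAX_LEN = 512                   # designated maximum length (illustrative)
LIMITING_TERMS = ("wherein",)   # illustrative; further limiting terms could be added

def length_reward(claim: str) -> float:
    """Reward 1: longer claims score higher, but zero beyond the maximum length."""
    n = len(claim)
    return 0.0 if n > MAX_LEN else n / MAX_LEN

def scope_reward(claim: str) -> float:
    """Reward 2: count occurrences of limiting terms; more terms, higher reward."""
    text = claim.lower()
    return float(sum(text.count(term) for term in LIMITING_TERMS))

def joint_reward(claim: str, w_len: float = 1.0, w_scope: float = 1.0) -> float:
    """Reward 3: weighted combination of the length and scope rewards."""
    return w_len * length_reward(claim) + w_scope * scope_reward(claim)
```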
Returning to the three-stage methodology, the third stage involves training with PPO as the reinforcement learning algorithm to optimize the SFT model from the first stage. This optimization is done against either a reward model obtained in the second stage or a defined reward function. It is noted that a reward model requires training and data, whereas a reward function does not: reward models are neural-based and obtained through training, while reward functions are rule-based and defined directly in source code. Despite these differences, both types of reward yield numeric values, which allows them to be combined mathematically during training. This amalgamation of neural-based and rule-based rewards is expected to have broad applicability, encompassing a wider range of use cases in the future.
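To illustrate how a neural-based reward and a rule-based reward can be combined in the PPO stage, the following sketch uses TRL's PPOTrainer with the DistilBERT classifier's “granted” probability plus a simple length-based rule. The paths, prompts, and generation and PPO settings are illustrative assumptions written against the 0.4.x-style API mentioned in Sect. 3.4, not the exact experimental setup.

```python
# Sketch of the PPO stage: optimize an SFT model against a combined reward made of
# a neural reward (DistilBERT "granted" probability) and a rule-based length reward.
# Paths, prompts, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

sft_path = "path/to/merged-sft-model"   # placeholder for a merged SFT checkpoint
rm_path = "path/to/reward-model"        # placeholder for the fine-tuned DistilBERT RM

tokenizer = AutoTokenizer.from_pretrained(sft_path)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_path)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_path)
ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1),
                         model, ref_model, tokenizer)

rm = AutoModelForSequenceClassification.from_pretrained(rm_path)
rm_tok = AutoTokenizer.from_pretrained(rm_path)

def neural_reward(text: str) -> float:
    """Probability of the 'granted' class from the reward model."""
    enc = rm_tok(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = rm(**enc).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def rule_reward(text: str, max_len: int = 2048) -> float:
    """Rule-based reward: longer claims score higher, zero beyond max_len."""
    return 0.0 if len(text) > max_len else len(text) / max_len

prompts = ["1. A computer-implemented method comprising"]   # illustrative claim prefixes
for prompt in prompts:
    query = tokenizer(prompt, return_tensors="pt").input_ids.squeeze(0)
    output = ppo_trainer.generate(query, max_new_tokens=200, do_sample=True)
    response = output.squeeze(0)[query.shape[0]:]            # keep the generated tokens
    text = prompt + tokenizer.decode(response, skip_special_tokens=True)

    reward = neural_reward(text) + rule_reward(text)          # combine the two rewards
    ppo_trainer.step([query], [response], [torch.tensor(reward)])
```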
For example, to be patentable, a patent claim must fulfill at least three essential requirements: novelty, non-obviousness, and utility. Theoretically, training three reward models, each corresponding to one requirement, along with defining a reward function to assess the patent claim’s length, would make it possible to train a policy model that generates patentable patent claims within a predefined length. In this research, a reward model is trained in Sect. 4.4, and the joint reward for Sect. 4.3 is composed of the reward functions in Sects. 4.1 and 4.2. Technically, it is also feasible to build a joint reward from the reward model in Sect. 4.4 and the reward functions in Sects. 4.1, 4.2, or 4.3. Nonetheless, before proceeding with any further joint reward, it is crucial to validate separately the quality of patent text generation achieved with these reward models and functions. Future research should explore combining the aforementioned reward model and reward functions.