Abstract:
The introduction of large language models (LLMs) has drawn global attention to several high-stakes ethical issues, not least that these models amplify and propagate the social prejudices embedded in their training data. This work addresses the pressing need for practical methods that reduce biased output and thereby make LLM-based generative systems fairer and safer. We present a fine-tuning approach based on Reinforcement Learning from Human Feedback (RLHF) for debiasing a pre-trained causal language model. We first train the base model with a supervised fine-tuning objective on a custom dataset and then apply a multi-step reinforcement learning stage in which a reward signal penalizes biased generations, steering the final model toward bias-free language. The base (supervised) and final (RLHF-tuned) models were evaluated extensively with a classifier on a held-out test set. Overall, the final model generates neutral text far more reliably than the base model: its classification report shows a significant increase in precision and recall for the "Unbiased" class and a corresponding decrease for the "Biased" class. These results confirm that our RLHF-based fine-tuning can effectively mitigate harmful bias in practice, and they suggest a scalable and robust method for building fair and responsible AI. This work contributes to the growing literature on producing robust, ethical generative models suitable for large-scale public use.
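The RL stage described above rests on a reward signal that penalizes biased generations. A minimal sketch of that interface is below; the function names (`toy_bias_score`, `reward`) and the keyword-based scorer are hypothetical placeholders standing in for the paper's trained bias classifier, not its actual implementation.

```python
def toy_bias_score(text: str) -> float:
    """Placeholder bias classifier: fraction of flagged tokens.

    A real pipeline would use a trained classifier over the generation;
    this keyword list only illustrates the interface (assumption)."""
    flagged = {"always", "never", "all"}  # stand-in for biased markers
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in flagged for t in tokens) / len(tokens)


def reward(text: str, penalty: float = 1.0) -> float:
    """Reward used during the RL step: neutral text scores near 1.0,
    while text the classifier flags is penalized proportionally."""
    return 1.0 - penalty * toy_bias_score(text)
```

In an RLHF loop this scalar would be fed to a policy-gradient optimizer (e.g. PPO), so the model is updated toward generations that the classifier labels "Unbiased".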