With the recent explosion of interest in language modelling brought on by the release of ChatGPT, there are now thousands of options to choose from when it comes to training large language models (LLMs).
I was driven to build my database after discovering that there were no open datasets of Australian law I could train an LLM on. The truth is, ever since I started working on the database, creating an LLM has been the last thing on my mind. I have been much more concerned with negotiating with government stakeholders to construct a comprehensive, high-quality dataset for Australian legal AI enthusiasts to build upon, a task I am still occupied with today.
There are hundreds of tutorials for training GPT2 out there, and I can guarantee that almost every single one will employ a different method. There were so many options that I entered a state of decision paralysis, resolving at many points to simply give up.
Until I landed on this Hugging Face tutorial for causal language modelling with DistilGPT2. Although the tutorial was poorly designed and downright incorrect at times, with some key modifications I was able to adapt it to my use case, producing a training dataset of 110 GB.
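For readers who want to follow the same path, here is a minimal sketch of the approach the Hugging Face causal language modelling tutorial takes: tokenise the corpus, pack it into fixed-length blocks, and finetune DistilGPT2 with the Trainer API. The file name, block size, output directory and hyperparameters below are placeholders for illustration, not the settings I actually used.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Load a plain-text corpus, one document per line ('corpus.txt' is a placeholder).
dataset = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenise(batch):
    return tokenizer(batch["text"])

tokenised = dataset.map(tokenise, batched=True, remove_columns=["text"])

# Concatenate all the tokens and split them into fixed-length blocks so that
# no space is wasted on padding.
block_size = 512  # placeholder; DistilGPT2 supports up to 1,024 tokens

def group_texts(examples):
    concatenated = {k: sum(examples[k], []) for k in examples}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total, block_size)]
        for k, t in concatenated.items()
    }

lm_dataset = tokenised.map(group_texts, batched=True)

# GPT-2 has no padding token, so reuse the end-of-text token.
# mlm=False gives the causal (next-token prediction) objective.
tokenizer.pad_token = tokenizer.eos_token
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-finetuned-law", num_train_epochs=1),
    train_dataset=lm_dataset["train"],
    data_collator=collator,
)
trainer.train()
```

Packing the corpus into contiguous blocks rather than padding each document individually is the point of the tutorial's grouping step, and it matters all the more for legislation and judgments that run far beyond the model's 1,024-token context window.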
With my model finetuned and a loss of 0.61 achieved, I published it on Hugging Face. Although it may not be as large as I had originally hoped, I’m still quite proud of it.
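Publishing the finished model comes down to pushing the weights and tokeniser to the Hub. Something along the lines of the snippet below, with a placeholder repository id, is all it takes, provided you have authenticated first with `huggingface-cli login` or an access token.

```python
# Placeholder repository id; substitute your own Hugging Face namespace.
repo_id = "your-username/gpt2-finetuned-law"

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```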
Source: Umar Butler