How I built the largest open database of Australian law

Late last year, while attempting to train a large language model (LLM) to solve legal problems, Umar Butler made a surprising discovery: there weren’t any open databases of Australian law to train his model on.

While there were certainly a few free-to-access legal databases, none were truly open, at least not in the sense of being able to just download their data and start training models without fear of infringing on anyone’s copyright. They all had policies against web scraping, and they were all either unable or unwilling to license their content.

So, before Umar could start training an LLM on Australian legal data, he first needed to get his hands on that data. As with most of his projects, this sounded much easier than it would actually turn out to be. Almost a year later, he is still hard at work expanding the database to encompass all of Australia’s legal code.

In this article, Umar walks us through the entire process of how he built the Open Australian Legal Corpus, the largest open database of Australian law: months-long negotiations with governments, reverse engineering ancient web technologies, and hacking together a multitude of different solutions for extracting text from documents.

Source: Umar Butler

