Legal reasoning and decision making rely heavily on information stored in text. Tasks such as due diligence, contract review, and legal discovery, which are traditionally time-consuming, can be automated, saving a considerable amount of time. This makes the development of approaches that apply natural language processing (NLP) and machine learning (ML) to the analysis of legal documents an important topic for both business and government.
Bardess, in collaboration with a research group from the Government of Ontario, Canada, has developed a multi-stage approach to the analysis of legislative texts aimed at extracting meaningful information for the public and creating links with other legal and non-legal documents. The focus of the analysis is the Accessibility for Ontarians with Disabilities Act (AODA) and its Regulation, though the approach can be generalized to other texts. AODA is a law passed in 2005 that provides a process for developing and enforcing accessibility standards in Ontario. Its goal is to ensure that all organizations develop policies to prevent and remove barriers for people with disabilities. Obvious examples are physical and architectural requirements, but burdens can also be related to documentation, reporting, and training.
As a result, AODA defines a series of burdens for both public and private entities, where burdens are obligations and requirements that organizations have to comply with. In this context, the goals of this analysis are:
- Automate the extraction of knowledge from legislative texts; in the context of AODA, this means the responsibilities set out by the law.
- Understand which entities are affected by the legislation.
- Provide a framework for efficiently representing the burdens extracted and facilitate searching for relevant information.
The analysis of legal texts presents a few unique challenges:
- Language parsing and tokenization are made harder by the use of formatting, abbreviations, and references that are specific to legal documents.
- The lexicon is relatively limited and very specialized, but the interpretation is highly sensitive to the context and there are no industry-specific pre-trained models that incorporate semantic analysis.
- Information extraction is further complicated by the syntactic complexity of sentences, which is often non-linear.
- Context sensitivity also affects supervised learning: training sets drawn from one context don’t generalize well to others, e.g. training on legislation from England in order to analyze the legislation of the United States.
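To make the first challenge concrete, here is a minimal Python sketch of how citation abbreviations can break a naive sentence splitter, together with one possible workaround. The sample sentence, the abbreviation list, and both function names are made up for illustration; this is not the parser used in the project.

```python
import re

# Hypothetical, deliberately incomplete list of abbreviations seen in legal citations.
LEGAL_ABBREVIATIONS = {"s.", "ss.", "O.", "Reg.", "c.", "para."}

# Made-up sentence in the style of a regulation, used only for illustration.
SAMPLE_TEXT = (
    "Every obligated organization shall comply with s. 3 of O. Reg. 191/11. "
    "The policy must be documented."
)


def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return [piece for piece in re.split(r"(?<=\.)\s+", text) if piece]


def legal_aware_split(text):
    """Re-join naive pieces whose last token is a known legal abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}" if buffer else piece
        last_token = buffer.split()[-1]
        if last_token not in LEGAL_ABBREVIATIONS:
            # The period ends a real sentence, so flush the buffer.
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


if __name__ == "__main__":
    print(naive_split(SAMPLE_TEXT))
    print(legal_aware_split(SAMPLE_TEXT))
```

On this sample the naive splitter cuts the citation into fragments, while the abbreviation-aware version keeps it inside its sentence. A fixed abbreviation list is brittle, which is part of why domain-specific rules are needed on top of general-purpose tools.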
All of these challenges make it almost impossible to rely purely on existing open-source models. However, a combination of ontologies, rule-based systems, existing NLP tools, and unsupervised learning can achieve important results.
In the absence of labeled data, the first stage of the analysis, where the legislation is parsed and the burdens are extracted, relies on a lightweight ontology and some business rules to extract potential burdens from the text. The second stage, which aims to extract substrings that identify the entities responsible for complying with the regulation, employs a combination of pre-trained NLP models available in spaCy and graph theory. Finally, unsupervised methods like clustering and topic modeling can be used to partition the burdens into homogeneous groups that help to clarify the effect of AODA on different industries.
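As a rough illustration of the first stage, the sketch below flags candidate burdens with a handful of obligation cues and tags them with a coarse burden type. The cue list, the type keywords, the `Burden` class, and the sample sentences are all assumptions made for this example; the actual ontology and business rules used in the project are not described here and are likely richer.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical obligation cues; a real rule set would be larger and more careful.
OBLIGATION_CUES = ("shall", "must", "is required to", "are required to")

# Hypothetical mini-ontology mapping a burden type to trigger keywords.
BURDEN_TYPES = {
    "training": ("training", "train"),
    "reporting": ("report", "file"),
    "documentation": ("document", "policy", "policies", "record"),
}

CUE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(cue) for cue in OBLIGATION_CUES) + r")\b",
    re.IGNORECASE,
)


@dataclass
class Burden:
    sentence: str      # the sentence flagged as a potential burden
    cue: str           # the obligation cue that triggered the rule
    types: List[str]   # coarse burden types matched by keyword


def extract_burdens(sentences):
    """Return a Burden for every sentence containing an obligation cue."""
    burdens = []
    for sentence in sentences:
        match = CUE_PATTERN.search(sentence)
        if match is None:
            continue  # no obligation language, so not a candidate burden
        lowered = sentence.lower()
        # Crude substring matching; good enough for a sketch, not for production.
        types = [
            name
            for name, keywords in BURDEN_TYPES.items()
            if any(keyword in lowered for keyword in keywords)
        ]
        burdens.append(Burden(sentence, match.group(1).lower(), types))
    return burdens


# Made-up sentences in the style of a regulation, for illustration only.
SAMPLE_SENTENCES = [
    "Every obligated organization shall provide training on the requirements "
    "of the accessibility standards.",
    "This section applies to large organizations.",
    "The organization must file an accessibility report annually.",
]

if __name__ == "__main__":
    for burden in extract_burdens(SAMPLE_SENTENCES):
        print(burden.cue, burden.types, burden.sentence)
```

Rules like these need no labeled data and are easy to audit, at the cost of missing obligations phrased without one of the listed cues. The candidate burdens they produce could then feed the entity-extraction and clustering stages.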