Creating your own machine-readable dataset

The documents that you want to analyze are often not available as a ready-made machine-readable dataset, but only as a collection of human-readable documents (such as PDF, doc/docx, or odt) or web content (HTML or possibly other formats). In that case, you will probably need to first collect the documents and then transform them into a machine-readable dataset before you can effectively use NLP tools for your research.

Collecting documents

There are generally two approaches to collecting the documents that you want to put in your dataset: requesting them or scraping website(s). The first option, requesting the documents from the administrator, is generally the preferred method (for you and for the administrator). If an API exists for the information you need, you can retrieve the documents through that API; a minimal sketch follows below. If there is no API or data dump available, reach out to the administrator to request the documents you are looking for. Depending on the administrator, there may be legislation that gives you an enforceable right of access to this information. Specifically for the Dutch setting, we have general guidelines for finding an appropriate dataset and leveraging relevant freedom of information and open data legislation.
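
As an illustration of the API route, the sketch below requests a page of results from a hypothetical JSON API with the requests library. The endpoint URL, query parameters, and response layout are assumptions for this example; adapt them to the documentation of the API you are actually using.

    # Minimal sketch: collecting documents through a (hypothetical) JSON API.
    import requests

    BASE_URL = "https://api.example.org/documents"  # placeholder endpoint

    response = requests.get(BASE_URL, params={"query": "wet", "page": 1}, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors

    for document in response.json()["results"]:  # assumed response layout
        print(document["id"], document["title"])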

When it is, for whatever reason, not possible to request the documents from the administrator, and the documents are available on a public website, you can also scrape that website to collect them. This is generally not the preferred method, as you can never be completely sure that the scraped dataset is actually complete and consistent. Furthermore, the administrator of the website in question might not appreciate a barrage of automated requests. There can also be technical constraints (scraping all required documents may take a long time) as well as legal constraints. Scraping generally requires tailor-made software for your specific case, although software libraries exist to make things easier (such as the popular Beautiful Soup 4 for Python; see the sketch below). In the data-collection repository of WetSuite you can find some examples of crawlers for Dutch legal documents.
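
To give an idea of what such a crawler looks like, here is a minimal sketch that downloads every PDF linked from a single (placeholder) index page, using requests and Beautiful Soup 4. A real crawler typically also needs pagination, retries, and deduplication, and you should check the website's terms of use and robots.txt before running anything like this.

    # Minimal scraping sketch using requests and Beautiful Soup 4.
    # The index URL is a placeholder for this example.
    import time
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    INDEX_URL = "https://www.example.org/publications"  # placeholder

    html = requests.get(INDEX_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    for link in soup.find_all("a", href=True):
        if link["href"].lower().endswith(".pdf"):
            pdf_url = urljoin(INDEX_URL, link["href"])  # resolve relative links
            filename = pdf_url.rsplit("/", 1)[-1]
            with open(filename, "wb") as f:
                f.write(requests.get(pdf_url, timeout=60).content)
            time.sleep(1)  # be polite: rate-limit the requests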

Transforming a collection of documents into a machine-readable dataset

Once you have collected your documents, you will probably need to perform further processing before you can effectively apply NLP techniques. A pile of documents in formats such as PDF, Word (.doc, .docx), downloaded web content (HTML or possibly other formats), or other document formats (.odt) is hard to analyze automatically with software, because such formats are designed for use by humans.

Before you can perform any automatic analysis, you will need to convert your collection into a machine-readable format. A wide variety of machine-readable formats exists, but for now three are the most relevant: plaintext, JSON and XML. Plaintext files (.txt) are unstructured files containing only text (and no formatting like in Word). JSON and XML are both structured file formats which allow for the easy inclusion of metadata and (for XML) the structure of your documents. JSON and XML should therefore be your preferred options if you want to enrich your collected documents with metadata.

But how to convert your PDF, docx, odt, etc. documents to machine-readable files? For Python and most other programming languages, many libraries exist to open such files and extract the text [TODO: link to relevant example notebook]; the first sketch below shows the basic idea.

However, if your documents are images or scans of text, your PDF files might not contain a text layer that can easily be extracted. In that case, you will need to perform Optical Character Recognition (OCR) to extract the text from your files; see the second sketch below. While good software libraries exist and OCR can yield good results, it is not straightforward: there can be quite some caveats for your specific collection.
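
As a first sketch, the code below extracts the text layer from a PDF with the pypdf library (python-docx plays a similar role for .docx files) and stores the result as a JSON record, so that metadata can travel together with the text. The file name and metadata fields are placeholders for this example.

    # Sketch: extract text from a PDF and store it as a JSON record.
    # Requires: pip install pypdf
    import json

    from pypdf import PdfReader

    reader = PdfReader("example.pdf")  # placeholder file name
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    record = {
        "source_file": "example.pdf",  # placeholder metadata
        "retrieved": "2024-11-21",
        "text": text,
    }
    with open("example.json", "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)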
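
If there is no text layer, a common route is to render each page as an image and run it through OCR software such as Tesseract. The second sketch does this with the pdf2image and pytesseract libraries, assuming a Dutch-language scan; note that these libraries require the poppler and Tesseract system packages to be installed, and that output quality depends heavily on the quality of the scan.

    # Sketch: OCR on a scanned PDF with pdf2image and pytesseract.
    # Requires: pip install pdf2image pytesseract
    # (plus the poppler and Tesseract system packages)
    import pytesseract
    from pdf2image import convert_from_path

    pages = convert_from_path("scanned.pdf", dpi=300)  # render pages as images
    text = "\n".join(pytesseract.image_to_string(page, lang="nld") for page in pages)
    print(text)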

Last updated: 21-Nov-2024