Extracting HTML Content with MLE JavaScript
In the past, when I've seen people build out processes to extract a piece of data from an HTML document, it has usually involved crafting some RegEx statements or building a program that runs outside the database. With MLE JavaScript available in 23c, I wanted to explore the possibility of using JavaScript in the database to leverage the same functions we may have used in the past - namely, document.querySelectorAll.
Choosing a Module
Unlike in the browser, in server-side JavaScript we don't have access to these document constructs, so we need to choose a module to provide them. Previously I have used the jsdom module, but unfortunately that project doesn't support the ESM module format. With a bit of searching, I came across another library, linkedom, which I will be using for this example.
Most examples on the web of using third-party modules to enhance your applications involve modules that are self-contained and have no dependencies. Linkedom is a little more complex, with a bit of a dependency chain. To make things more interesting, its import statements use this syntax: "/npm/htmlparser2@9.1.0/+esm".
To handle these two issues, I decided to write a little program that produces some scripts, performing the following actions:
- Get the dependency chain
- Download the modules
- Substitute the references with the simple name of the module
- Produce SQL scripts to load the modules into the database
- Create an MLE environment with import references for all the modules
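The specifier-substitution step can be sketched as a small pure function. The function name and regex below are my own illustration, not code from mle-module-loader: the idea is to turn a CDN-style specifier such as "/npm/htmlparser2@9.1.0/+esm" into the bare module name that the MLE environment's import list will resolve.

```javascript
// Rewrite a jsdelivr-style ESM specifier ("/npm/<name>@<version>/+esm")
// to the bare module name; anything else passes through unchanged.
// This is a sketch of the substitution step, not the tool's actual code.
function rewriteImportSpecifier(specifier) {
  const match = specifier.match(/^\/npm\/((?:@[^/@]+\/)?[^/@]+)@[^/]+\/\+esm$/);
  return match ? match[1] : specifier;
}
```

Running each downloaded module's source through a rewrite like this is what lets the import statements line up with the simple names used when loading the modules into the database.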
I published this project at https://github.com/tschf/mle-module-loader. One caveat is that it uses an npm package that is no longer maintained, but it served its purpose for this little exercise. With this program, a sample run looks like this:
This tells us that the full dependency chain for linkedom is the following additional modules we'll need to compile into the database.
- cssom
- html-escaper
- uhyphen
- css-select
- htmlparser2
- domhandler
- css-what
- boolbase
- nth-check
- domutils
- domelementtype
- entities
- dom-serializer
If using my program, after it downloads and generates the SQL scripts, the file structure will resemble this (note: I didn't include the full file-set in this screenshot; the real output does include the full module list specified above).
And after running install.sql in SQLcl, you should see those objects compiled into the database (the image below is a screenshot from the APEX Object Browser).
Conceptualizing the Extraction
As a first step, you probably want to play around in your browser with the site you want to extract data from. For demonstration purposes, I picked the Victorian Crime Statistics website, which publishes XLSX spreadsheets of the latest crime data once a year and provides the links. To keep things simple for the demo, the objective will be to pull that data (link text and URL) and put it into a table.
The page I'm referring to is here: https://www.crimestatistics.vic.gov.au/crime-statistics/latest-victorian-crime-data/download-data.
From a quick inspection of the page, I can identify that each link is contained within a span, the span has a class of "file", and the nested element is the anchor tag with the link. With that in mind, I can come up with the selector span.file a. This gives me the following code:
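A sketch of that console experiment follows; the helper name is my own, and the original post's snippet may have been shaped differently. Given any DOM-like document, it collects the text and URL of each anchor nested in a span with the "file" class.

```javascript
// Collect { text, url } pairs for every anchor matching "span.file a".
// Works against any object exposing a DOM-style querySelectorAll.
function extractFileLinks(doc) {
  const links = [];
  for (const anchor of doc.querySelectorAll('span.file a')) {
    links.push({ text: anchor.textContent.trim(), url: anchor.href });
  }
  return links;
}

// In the browser devtools console on the page itself:
// extractFileLinks(document).forEach(l => console.log(l.text, l.url));
```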
Indeed, this produces the expected outcome:
Figuring Out Linkedom
If you consult the documentation for linkedom on its npm page, https://www.npmjs.com/package/linkedom, you'll find some basic usage examples. The gist for our use case is that we grab the parseHTML function from the module, call it with the HTML text, and one of the properties it returns is document, which is what we need to do our interactions with the page (via the querySelectorAll function referenced earlier on).
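In isolation, that usage pattern looks something like the following sketch. It assumes linkedom is resolvable through the MLE environment, so it only runs in that context (or in Node with linkedom installed):

```javascript
import { parseHTML } from "linkedom";

// parseHTML returns several properties; document is the one we need
// for our querySelectorAll-style interactions.
const { document } = parseHTML(
  '<span class="file"><a href="https://example.com/data.xlsx">Data</a></span>'
);

const anchors = document.querySelectorAll("span.file a");
// anchors[0].textContent is "Data"
```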
When we define our database objects to leverage this module, we need an object on which we can specify the environment (linkedom_env). This can be either a PL/SQL block using the DBMS_MLE package (where we specify the environment on the context), or the call specification of a stored procedure or function. That is to say, it doesn't appear as though you can specify an environment on an inline MLE call specification, according to the documentation.
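As a sketch of the second option, a call specification can pin the environment with the ENV clause. The module name and exported function here are assumptions for illustration; only the procedure name and environment name come from this post:

```sql
-- Call specification naming the environment explicitly
-- (crime_stat_module and loadLinks are assumed names).
create or replace procedure load_victoria_crime_links
as mle module crime_stat_module
env linkedom_env
signature 'loadLinks()';
/
```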
Putting It All Together
Now that we have all the necessary bits, we can start to put it all together. First, we need a table to store the data in. I'll create a table named CRIME_STAT_LINK.
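A minimal version of that table could look like this; the column names and sizes are my assumptions, since the post only fixes the table name:

```sql
-- Minimal sketch of the link-storage table (columns assumed).
create table crime_stat_link (
    id        number generated by default on null as identity
              constraint crime_stat_link_pk primary key,
    link_text varchar2(400 char),
    link_url  varchar2(400 char)
);
```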
Next, we use the built-in mle-js-fetch module (which provides the fetch function) to pull down the document. This particular site rejects the request if a User-Agent header is not provided, so we also need to send one with our request. The documentation for the fetch module can be found here: https://oracle-samples.github.io/mle-modules/docs/mle-js-fetch/23c/index.html
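The fetch call could be shaped like this sketch. In the database, the module would first import "mle-js-fetch" to make fetch available; the call itself is standard Fetch API, so the same shape works in Node 18+. The helper name and User-Agent value are my own:

```javascript
// Request options: the site rejects requests carrying no User-Agent header,
// so we always send one (value here is an arbitrary placeholder).
const FETCH_OPTIONS = {
  headers: {
    "User-Agent": "Mozilla/5.0 (compatible; mle-demo)"
  }
};

// Fetch a page and return its body as text, failing loudly on bad statuses.
async function fetchPage(url) {
  const response = await fetch(url, FETCH_OPTIONS);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}
```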
Next, we load parseHTML function from the linkedom module and transition our demo code from earlier.
The final part is to set up the script to load the data into our table. For that, there is a session global that refers to the SQL driver, and we can use it to execute statements - it exposes an execute function that we pass our SQL to. So we update our previous loop to insert the data from our links into our table rather than logging it.
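One way to sketch that update is to split out a hypothetical helper (name and column list are mine) that turns an anchor into the statement and binds handed to session.execute:

```javascript
// Build the insert statement and bind values for one anchor element.
function buildInsert(anchor) {
  return {
    sql: "insert into crime_stat_link (link_text, link_url) values (:text, :url)",
    binds: [anchor.textContent.trim(), anchor.href]
  };
}

// Inside the MLE module, the loop becomes (session only exists in-database):
// for (const anchor of document.querySelectorAll("span.file a")) {
//   const { sql, binds } = buildInsert(anchor);
//   session.execute(sql, binds);
// }
```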
Just to confirm it worked, we query that table (and yep, all looks good).
The final, full working DDL for module and call specification looks like the following code snippet (execution is via the procedure load_victoria_crime_links):
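Assembled from the steps described in this section, the overall shape is roughly as follows. This is a sketch under stated assumptions, not the post's exact DDL: the module name, table columns, and User-Agent value are mine, and the environment's import list would carry one entry per module in the dependency chain listed earlier:

```sql
-- Environment mapping bare import names to compiled modules
-- (abbreviated: one entry per dependency is required).
create mle env linkedom_env
    imports ('linkedom' module linkedom /*, ... dependency chain ... */);

create or replace mle module crime_stat_module
language javascript as

import "mle-js-fetch";
import { parseHTML } from "linkedom";

export async function loadLinks() {
    const response = await fetch(
        "https://www.crimestatistics.vic.gov.au/crime-statistics/latest-victorian-crime-data/download-data",
        { headers: { "User-Agent": "Mozilla/5.0 (compatible; mle-demo)" } }
    );
    const { document } = parseHTML(await response.text());

    for (const anchor of document.querySelectorAll("span.file a")) {
        session.execute(
            "insert into crime_stat_link (link_text, link_url) values (:text, :url)",
            [anchor.textContent.trim(), anchor.href]
        );
    }
}
/

create or replace procedure load_victoria_crime_links
as mle module crime_stat_module
env linkedom_env
signature 'loadLinks()';
/
```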
If you'd like to try out some other exercises using third-party modules that don't have a dependency chain like this example's, I'd recommend doing the APEX + Server-Side JavaScript (MLE) Live Lab.
Credits
Opening Splash Photo by Edu Grande on Unsplash