A colleague came to me with a request. He wanted to save a local copy of a technical manual from a website. It seemed simple enough, but there was a catch... the site used long, obfuscated URL document names.
If you're not familiar with the components of a web-specific Uniform Resource Identifier (URI), aka the Uniform Resource Locator (URL), here are the basic parts. The path and document are most applicable for this post.
https://www.example.com/the/path/index.html
protocol:    https
domain/host: www.example.com
path:        /the/path/
document:    index.html
For reference, here's a link to my complete PowerShell code; I'll be referencing it throughout.
My colleague said he could usually download a website's content locally when the site had a well-defined URL path (directory) structure. This site, however, did no such thing. It used extremely long document names made of seemingly random hex values. The values might not actually be random, but after spending a few minutes with CyberChef I couldn't decipher anything meaningful.
URL Document Pattern
Further investigation showed that all the URL document names for this specific manual shared the same 170-character prefix. Each new parent-to-child link then appended additional hex characters to that prefix. Here's an example:
Parent document name and the base path prefix (170 chars total):
80a2354fa0510d9ecf99fa293f4cdf6a795d6d75325f33bfdd835fbe500ae33d51e2c05cb206ec1287eee8e95c99be81d6a974998a1802ec4bf70e5aeb85086bf684d3846726fb144ac5c048a1516760f23ceb194c

Child document name of the parent (186 chars total, 16 appended chars):
80a2354fa0510d9ecf99fa293f4cdf6a795d6d75325f33bfdd835fbe500ae33d51e2c05cb206ec1287eee8e95c99be81d6a974998a1802ec4bf70e5aeb85086bf684d3846726fb144ac5c048a1516760f23ceb194c3e2d56b2c926e7c9

Child of the child document name (202 chars total, 16 appended chars):
80a2354fa0510d9ecf99fa293f4cdf6a795d6d75325f33bfdd835fbe500ae33d51e2c05cb206ec1287eee8e95c99be81d6a974998a1802ec4bf70e5aeb85086bf684d3846726fb144ac5c048a1516760f23ceb194c3e2d56b2c926e7c92799851c95f82c33

Child of the child of the child document name (276 chars total, 74 appended chars):
80a2354fa0510d9ecf99fa293f4cdf6a795d6d75325f33bfdd835fbe500ae33d51e2c05cb206ec1287eee8e95c99be81d6a974998a1802ec4bf70e5aeb85086bf684d3846726fb144ac5c048a1516760f23ceb194c3e2d56b2c926e7c92799851c95f82c336d02fe0c373f4350a3d3e91909c26eb20262906b489d3db1c260d57a59cbdc57bd99e3c3b7
Without this structure I'm not sure how I could have solved this problem. I spent some time looking at the documentation for curl, wget, and Python's Beautiful Soup library to check if an option existed for this type of scenario. I admittedly didn't look that hard, so maybe a simpler solution already exists, but it wasn't readily apparent. Plus, the problem seemed interesting to solve on my own.
Methodology
It might not be the most performant option, but PowerShell is readily available in Windows environments and I'm comfortable with it, so I went that route. My basic algorithm performs the following steps:
- Use a seed URL (i.e., the main page of the technical doc), download it, and then extract any embedded links that contain the 170-character hex prefix. Save those links to a hashtable for tracking.
- After that first iteration, download any unvisited links in the hashtable. Search those for embedded links with the matching prefix and, if found, add them to the queue.
- Repeat until no new embedded links of interest are discovered (see the sketch after this list).
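Here's a minimal sketch of that first pass. The seed URL is hypothetical and the prefix is a shortened stand-in for the real 170-character value; the full script repeats this pass over every key still marked false.

# One crawl pass, sketched with placeholder names (hypothetical seed URL, shortened prefix)
$prefix  = '80a2354fa0510d9ecf99fa293f4cdf6a'      # stand-in for the full 170-char prefix
$seedUrl = "https://www.example.com/docs/$prefix"  # hypothetical seed page

$links = @{ $seedUrl = $false }    # key = URL, value = downloaded yet?

$page = Invoke-WebRequest -Uri $seedUrl -UseBasicParsing
$links[$seedUrl] = $true

# Queue any embedded link that shares the prefix and hasn't been seen before
foreach ($href in $page.Links.href) {
    if ($href -like "*$prefix*" -and -not $links.ContainsKey($href)) {
        $links[$href] = $false
    }
}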
First Draft
When I originally wrote this script I created a list of unique URLs containing this prefix and then used wget to download them. I'm guessing the website author did this on purpose, but all the links with actual content of interest exceeded the standard 255-character filename limit used on Linux and Windows. Downloading them that way truncated the file names and also broke the links in the local copy. My way around this was to use wget's -O option and specify the output file name. The problem with that is it appended all the site documents into a single 33MB file. This worked, but it wasn't pretty or ordered properly. It's also slow to open in a browser, and links within the page don't work.
wget -i links.txt -O consolidated_output.html
I know I could have scripted this in bash to give each download a unique filename. However, my bash scripting is infrequent and every time I write a bash loop I have to google it. Even then, all the links would still be broken. I moved back to PowerShell so I could contain the project in a single script. That would also give me an efficiency benefit: each link could be downloaded, saved, and renamed exactly once. My first-draft method was hugely inefficient because it downloaded each link twice, once while crawling for new links and again with wget to save it. I'm not dealing with a ton of data here, but with 125 unique links it wasn't fast either.
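The single-download idea looks roughly like this; the output folder and the page-NNN naming scheme are placeholders of mine, not the actual names in the script.

# Fetch a page once, then reuse the same response for saving and for link extraction
# (the output folder and the pageNNN naming scheme are placeholders)
$outDir = 'C:\temp\manual'
New-Item -ItemType Directory -Path $outDir -Force | Out-Null

$counter   = 0
$shortName = 'page{0:d3}.html' -f $counter     # e.g. page000.html
$counter++

$page = Invoke-WebRequest -Uri $seedUrl -UseBasicParsing
$page.Content | Set-Content -Path (Join-Path $outDir $shortName)
# ...embedded links come from $page.Links as before, so no second download is needed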
Linking It Together
I could have stopped with my progress here. I met my colleague's requirement of downloading all the tech doc data, which was now saved across 125 different HTML files. The downloaded version of this document was never going to be as pretty as the online version because it lacked the interactivity, CSS, backend functionality, and whatever else was used that I don't know about. However, the links were broken, and fixing them would improve the document's navigation. Absent that, you'd have to click through all 125 HTML files to find the one of interest. I knew I could patch this too.
Having all the data, I created another hashtable. I used the URL path and document name (i.e., the "random" hex values) as the key, just as I did before. However, this time I also created a short, unique name for each download to be saved under locally. This hashtable gave me a mapping between the two that I could use after everything was acquired. Once that was complete, I then used PowerShell's grep-like and regular expression functionality to search each document for applicable links and replace them with their new file names. This step undid the long-filename technique that had broken the links in the local copy.
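In sketch form, the mapping looked something like this. The hex keys are shortened with "..." and the short names are made up; the actual replacement happens line by line, as described in the next section.

# Long hex document name -> short local file name (keys shortened here for readability)
$nameMap = @{
    '80a2354fa0510d9e...0f23ceb194c'        = 'page000.html'   # parent
    '80a2354fa0510d9e...3e2d56b2c926e7c9'   = 'page001.html'   # child
    '80a2354fa0510d9e...2799851c95f82c33'   = 'page002.html'   # grandchild
}

# Look up the short local name for a link found inside a downloaded page
$hexDoc = ($href -split '/')[-1]
if ($nameMap.ContainsKey($hexDoc)) {
    $localLink = $nameMap[$hexDoc]
}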
Dev Hiccups
In this section I'll review some development roadblocks I encountered. This is nuanced PowerShell stuff in my opinion, so only the interested should proceed.
The go-to PowerShell cmdlet for web queries is Invoke-WebRequest. Using PowerShell v5 I ran into a snag where the cmdlet would work once but then hang on all subsequent queries. I don't know the exact reason, but it turns out that before PowerShell v7 this cmdlet defaulted to using Internet Explorer as its parsing engine. That can cause problems and/or inconsistencies, and in my case it did. The -UseBasicParsing option uses .NET functionality directly, and with it all was well.
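For example (the URL here is a stand-in):

# Under Windows PowerShell 5.x, -UseBasicParsing skips the Internet Explorer parsing engine
$response = Invoke-WebRequest -Uri 'https://www.example.com/the/path/index.html' -UseBasicParsing

# The response object still exposes the page's links for extraction
$response.Links | Select-Object -ExpandProperty href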
It turns out that if you're enumerating a PowerShell hashtable (i.e., a .NET collection) you can only read from it, not write to it. The Microsoft documentation spells it out clearly:
Enumerators can be used to read the data in the collection, but they cannot be used to modify the underlying collection. An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call to MoveNext or Reset throws an InvalidOperationException.
This caused issues because I continually looped through my hashtable of URL keys to determine whether each URL had already been downloaded, based on the key's value of true/false. If false, the link had yet to be downloaded, so it would be downloaded, marked as true, and any newly discovered links would be added as new keys to the hashtable. Except that obviously didn't work, thanks to the "enumerator is irrecoverably invalidated" rule.
I instead took the approach of enumerating my main hashtable while saving all the results to a second hashtable. After that round of searching was complete, I would then mark all existing keys in my main hashtable as true and copy the newly discovered URL data over from the temporary hashtable.
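A condensed sketch of that pattern, using my own placeholder variable names ($links maps URL -> already downloaded?):

# Two-hashtable workaround: never add keys to the collection being enumerated
$newLinks = @{}

foreach ($entry in $links.GetEnumerator()) {
    if ($entry.Value) { continue }                 # already downloaded

    $page = Invoke-WebRequest -Uri $entry.Key -UseBasicParsing
    foreach ($href in $page.Links.href) {
        if ($href -like "*$prefix*" -and
            -not $links.ContainsKey($href) -and
            -not $newLinks.ContainsKey($href)) {
            $newLinks[$href] = $false              # safe: $links itself is untouched
        }
    }
}

# Only after the enumeration finishes do we modify the main hashtable
foreach ($key in @($links.Keys)) { $links[$key] = $true }
foreach ($key in $newLinks.Keys) { $links[$key] = $false }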
Admittedly, I struggled for a while replacing all the long hex-based links with the new, short file name links. The process was simple: search each HTML file for applicable links, look up the mapping I saved, and replace. My issues lay with various PowerShell-isms I wasn't accustomed to. A lot of this boiled down to my (lack of) understanding of the -match and -replace operators as well as the $Matches automatic variable. For one, I didn't know that -match is actually a different operator when used against a single string versus a collection. The latter doesn't fill the $Matches variable, which baffled me for a while. I found a Stack Overflow answer from Roman Kuzmin that explained the issue nicely:
Strictly speaking string -match ... and collection -match ... are two different operators. The first gets a Boolean value and fills $Matches. The second gets each collection item that matches a pattern and apparently does not fill $Matches.
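A quick way to see the difference (toy strings and pattern):

# Scalar -match: returns a Boolean and fills the $Matches automatic variable
'page42.html' -match '\d+'        # True
$Matches[0]                        # 42

# Collection -match: returns the matching elements and does not refill $Matches
@('page42.html', 'index.html', 'page7.html') -match '\d+'
# -> page42.html, page7.html   ($Matches still holds the result of the scalar match above)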
I also tried for what I thought was a more "elegant" solution. In hindsight it wasn't actually so: it would have been less readable and, most importantly, I never got it to work. I'm not completely satisfied with my solution as it still has inefficiencies, but I had to finish this extracurricular project. After downloading all the files I searched each file line by line. If a given line had a matching link, it was rewritten with its new, shorter link. The whole file was rebuilt in a temporary StringBuilder buffer, and once the scan was complete the buffer was written back to disk. I would have preferred to replace only the lines of interest in the original file, but that proved to be something I didn't solve. This was due to there being an unknown number of links (usually more than one) per file and how PowerShell handles that when using the -match and -replace operators. I finally made the decision to work with each file on a line-by-line basis. Though my solution wasn't exactly how I initially envisioned it, I was happy with the result: a single script to find, download, rename, and then update all embedded links in one shot.
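The per-file rewrite ended up looking roughly like this. The regex, folder, and variable names are simplified placeholders of mine, and the sketch assumes at most one long hex link per line:

# Rewrite one downloaded file: swap each long hex link for its short local name
# (simplified; the real prefix is 170 chars and $nameMap is the mapping built earlier)
foreach ($file in Get-ChildItem -Path $outDir -Filter *.html) {
    $buffer = [System.Text.StringBuilder]::new()

    foreach ($line in Get-Content -Path $file.FullName) {
        if ($line -match '[0-9a-f]{170,}') {
            $hexDoc = $Matches[0]
            if ($nameMap.ContainsKey($hexDoc)) {
                $line = $line -replace $hexDoc, $nameMap[$hexDoc]
            }
        }
        [void]$buffer.AppendLine($line)
    }

    # Write the rebuilt content back to disk in one shot
    Set-Content -Path $file.FullName -Value $buffer.ToString()
}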