Friday, March 9, 2012

A Quick Download Assist

We bought the "I Love Lucy" complete episode collection.  Included with the TV episodes are some radio broadcasts of "My Favorite Husband", where Lucille Ball has a role very similar to the Lucy character that she later played (the link in the script will explain if you are interested).  The broadcasts are now in the public domain. My wife and I are fans of radio shows in general, and she asked that I find these.  From a Google search and also via Wikipedia I found a few collections of these shows at Archive.org.  The problem was that I could download one massive MP3 or manually download 109 files.  I tried the quick and dirty way: Click Click Click...  But only about 30 or so made it down, due to the simultaneous download limits in IE.

So I pulled up my PowerShell 3 console and started hacking:

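# WebClient handles both the page fetch and the file downloads
$clnt = New-Object System.Net.WebClient
$URL = "http://www.archive.org/details/MyFavoriteHusband_866"
$path = "$home\Downloads\MFH\"
# DownloadFile needs the target folder to exist, so create it if it is missing
if (-not (Test-Path $path)) { New-Item -ItemType Directory -Path $path | Out-Null }
$t = $clnt.DownloadString($URL)
# Capture group 1 holds the link text between the quotes
$regex = [regex] '"([^"]*[.]mp3)">'
$fileList = $regex.Matches($t)
$URI = [System.Uri] $URL
Clear
$i = 0
$fileList | % { $_.Groups[1].Value } | % {
      # Combine the page URI with the relative link to get an absolute URI
      $URI2 = New-Object "System.Uri" -ArgumentList $URI,$_
      # The last segment of the URI is the file name
      $DownLoadName = $path + $URI2.Segments[-1]; $DownLoadName
      $i++; $i
      $clnt.DownloadFile($URI2.AbsoluteUri, $DownLoadName)
}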
The website returns download links relative to the domain, as /Download/....  Fortunately the URI class is built for this.  If you New-Object a URI and pass it an existing URI and a relative URL, the returned URI is the fully qualified (absolute) version of the relative URL.  URI also has a Segments property that splits the URL into its parts, so I can directly access the download name.
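
As a small sketch of that URI behavior, the snippet below combines the page address with the sample link shown later in this post. The variable names ($base, $relative, $full) are only for illustration, and nothing is downloaded:

```powershell
# The page URL from the script, used as the base URI
$base = [System.Uri] "http://www.archive.org/details/MyFavoriteHusband_866"
# A link relative to the domain, like the ones the page returns
$relative = "/download/MyFavoriteHusband_866/Mfh1951-03-24124IrisLizsEaster.mp3"

# The Uri(Uri, String) constructor resolves the relative link against the base
$full = New-Object System.Uri -ArgumentList $base, $relative

$full.AbsoluteUri    # the fully qualified address
$full.Segments[-1]   # the last path segment, which is the file name
```

Because the relative link starts with a slash, it replaces the whole path of the base URI rather than being appended to it.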

The site contained 2 links for every MP3, so I needed to make the Regex pick up only one of them.  This meant that I had to grab extra data, which made my For-Each a bit more complicated, as each match looked something like this (the quotes were in the result):
"/download/MyFavoriteHusband_866/Mfh1951-03-24124IrisLizsEaster.mp3">

Had I my trusty cheat sheet, I would have done the Regex as:
$regex = [regex] '(?<=")([^"]*[.]mp3)(?=">)'
The changes are a "Zero-width positive lookbehind assertion" and a "Zero-width positive lookahead assertion".  In short, they say: look for this, but don't return it as part of the match. http://msdn.microsoft.com/en-us/library/bs2twtah(v=vs.85).aspx
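
To compare the two patterns side by side, here is a rough sketch that runs both against one line of markup. The $sample anchor tag is my own stand-in for what the page contains, not copied from the site, and the other variable names are illustrative:

```powershell
# Hypothetical markup; the real page HTML may differ
$sample = '<a href="/download/MyFavoriteHusband_866/Mfh1951-03-24124IrisLizsEaster.mp3">'

# Original pattern: the quotes and > are consumed, so they end up in the match
$first = ([regex] '"([^"]*[.]mp3)">').Match($sample)
$first.Value             # includes the surrounding "..."> characters
$first.Groups[1].Value   # the capture group holds just the link

# Lookaround pattern: the quotes and > are checked but not consumed
$second = ([regex] '(?<=")([^"]*[.]mp3)(?=">)').Match($sample)
$second.Value            # just the link, no capture group needed
```

With the second pattern the whole match is already the link, which is what lets the For-Each use $_.Value directly.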

Then I could have done my For-Each as

$fileList | %{
      $URI2 = New-Object "System.Uri" -ArgumentList $URI,$_.Value
      # ...the rest of the loop body stays the same
}
It is amazing, but not surprising, how often a wrong Regex may still work but make everything that follows more complicated.


To make sure I could "see it work", I added an $i counter and printed the name of each file as it came down.  Yes, I should have made this a progress bar.  Maybe later!
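
For the record, a progress bar version might look roughly like the sketch below, using the built-in Write-Progress cmdlet. The file names in $names are made-up placeholders for the Regex matches, and the actual download call is left as a comment so the snippet runs without touching the network:

```powershell
# Hypothetical file names standing in for the regex matches
$names = "Episode1.mp3", "Episode2.mp3", "Episode3.mp3"
$i = 0
foreach ($name in $names) {
    $i++
    # Percentage of files handled so far, rounded to a whole number
    $percent = [int](100 * $i / $names.Count)
    Write-Progress -Activity "Downloading MP3s" -Status "$i of $($names.Count): $name" -PercentComplete $percent
    # The real loop would call $clnt.DownloadFile(...) here
}
# Remove the progress bar once the loop is done
Write-Progress -Activity "Downloading MP3s" -Completed
```

This only reports progress per file; it does not show progress within a single download.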

- Josh
-----------
Update: Fixed Typos
Update2: Added note on the Regex Change
