SharePoint File Extractor

Mike Berryman

Do not try and find SharePoint files, that’s impossible.  Instead, only try to realize the truth… there are no SharePoint files.

In SharePoint, most (not all) files don’t actually exist in the conventional way.  What I mean by this is that you can’t just browse to a directory and open a SharePoint page in your favorite text editor and make changes.  Instead, most files in SharePoint exist as BLOB data in a SQL database and the only way to actually edit these files is through SharePoint Designer.

So what if you want to have any of these files in source control?  Of course you could use SharePoint Designer to copy/paste the contents of each target file to your file system, but that is extremely cumbersome (you can’t just copy/paste a file itself; you need to actually open it and copy the contents).  As our clients have become more and more proficient with using SharePoint Designer to make their own changes, we have had an increasing need to get these changes into source control.

In order to do that, I recently created a program that would extract files from SharePoint.

To do this, I started with what I knew about SharePoint Designer:

  1. SharePoint Designer interfaces with a SharePoint site via web service calls
  2. These web service calls obviously can return a listing of files when the user is browsing a directory in SharePoint Designer, as well as the contents of a file once it’s opened
  3. SharePoint Designer was built from FrontPage

Because SharePoint Designer uses web services to interact with SharePoint, I could use an HTTP packet sniffer (Fiddler is my tool of choice here) to see what the requests and responses looked like when getting a list of files or getting the contents of a file.

Knowing this, what I wanted to accomplish was to replicate the web service calls that SharePoint Designer makes in order to “extract” files from SharePoint and save them to the file system, maintaining the directory structure of the files.  With the files on the file system, they could then be put into source control.

The first thing Fiddler showed me was that every web service call SharePoint Designer makes (that I’m interested in at least) was a POST to the “/_vti_bin/_vti_aut/author.dll” endpoint, with the body of the POST containing a string looking something like this:

method=list+documents&service%5fname=%2f&listHiddenDocs=true&listExplorerDocs=false&listRecurse=true&listFiles=true&listFolders=true&listLinkInfo=false&listIncludeParent=true&listDerived=false&listBorders=false&listChildWebs=false&listThickets=false&initialUrl=some_directory&folderList=%5bsome_directory%3bTW%7c01+Jan+2016+00%3a00%3a00+%2d0000%5d

It looks very similar to a URL with query string parameters, where each parameter is separated with a ‘&’ character.  The “method” parameter lets the endpoint know what to do, and some quick googling led me to the MSDN reference article for this web service endpoint: FrontPage Server Extensions RPC Protocol

The example above fetches a listing of files within a given directory.  Because the listRecurse parameter is true, it’ll include files from all sub-directories as well.  The other method, which fetches the contents of a requested file, is called “get document”.
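As a rough sketch of how such request bodies could be put together in code, the helper below URL-encodes a set of name/value pairs and joins them with ‘&’.  The RpcBody class and its method names are hypothetical.  The “list documents” parameters are a subset of the captured body above (folderList is left out, on the assumption that the server does not require it), and the document_name parameter for “get document” comes from the FrontPage RPC reference rather than from a capture shown here, so check both against your own Fiddler traces.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net;

// Hypothetical helper: builds form-encoded bodies for author.dll RPC calls.
public static class RpcBody
{
    // Shorthand for one name/value pair.
    private static KeyValuePair<string, string> P(string name, string value)
    {
        return new KeyValuePair<string, string>(name, value);
    }

    // URL-encodes every name and value, then joins the pairs with '&'.
    public static string Build(IEnumerable<KeyValuePair<string, string>> parameters)
    {
        return string.Join("&", parameters.Select(
            p => WebUtility.UrlEncode(p.Key) + "=" + WebUtility.UrlEncode(p.Value)));
    }

    // Recursive listing of everything under the given folder.
    // Flags mirror the captured request; folderList is omitted (assumption).
    public static string ListDocuments(string folder)
    {
        return Build(new[]
        {
            P("method", "list documents"),
            P("service_name", "/"),
            P("listHiddenDocs", "true"),
            P("listExplorerDocs", "false"),
            P("listRecurse", "true"),
            P("listFiles", "true"),
            P("listFolders", "true"),
            P("listLinkInfo", "false"),
            P("listIncludeParent", "true"),
            P("listDerived", "false"),
            P("listBorders", "false"),
            P("listChildWebs", "false"),
            P("listThickets", "false"),
            P("initialUrl", folder)
        });
    }

    // Request for the contents of one file; parameter set is an assumption.
    public static string GetDocument(string documentName)
    {
        return Build(new[]
        {
            P("method", "get document"),
            P("service_name", "/"),
            P("document_name", documentName)
        });
    }
}
```

For example, RpcBody.GetDocument("Pages/home.aspx") yields method=get+document&service_name=%2F&document_name=Pages%2Fhome.aspx.  WebUtility.UrlEncode leaves underscores unescaped, whereas the captured body shows them as %5f; both forms decode to the same value.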

So at this point, I know the web service endpoint to hit and the data to POST to that endpoint to get a listing of files, or the contents of a file.  I can use the listing to replicate the directory structure on the file system, and then fetch the contents of each requested file one-by-one.

Of course, you can’t just make this web service call anonymously.  I needed to authenticate in order to successfully make these calls.  How you pass the authentication details with the POST depends on whether you’re working against SharePoint Online or SharePoint 2007/2010/2013 on-premises:

// Encode the RPC body and build the endpoint URL.
byte[] bodyData = System.Text.Encoding.UTF8.GetBytes(requestBody);
string url = _siteUrl + "/_vti_bin/_vti_aut/author.dll";
byte[] responseBody = null;

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
if (_sharePointOnline)
{
    // SharePoint Online: exchange the username and password for an auth cookie.
    using (var secPassword = new System.Security.SecureString())
    {
        foreach (char c in _password)
        {
            secPassword.AppendChar(c);
        }
        var creds = new Microsoft.SharePoint.Client.SharePointOnlineCredentials(_username, secPassword);
        var authCookie = creds.GetAuthenticationCookie(new Uri(_siteUrl));
        request.Headers.Add(HttpRequestHeader.Cookie, authCookie);
    }
}
else
{
    // On-premises: Windows credentials on the request are enough.
    request.Credentials = new NetworkCredential(_username, _password, _domain);
}
request.Method = "POST";
request.Accept = "auth/sicily";
request.ContentType = "application/x-www-form-urlencoded";
request.Headers.Add("X-FORMS_BASED_AUTH_ACCEPTED", "T");
request.Headers.Add("X-Vermeer-Content-Type", "application/x-www-form-urlencoded");
request.ContentLength = bodyData.Length;

// Send the body; the using block closes the stream even if Write throws.
using (var requestStream = request.GetRequestStream())
{
    requestStream.Write(bodyData, 0, bodyData.Length);
}

// Read the raw response bytes without decoding them as text.
using (var response = (HttpWebResponse)request.GetResponse())
using (var responseStream = response.GetResponseStream())
using (var buffer = new System.IO.MemoryStream())
{
    responseStream.CopyTo(buffer);
    responseBody = buffer.ToArray();
}

You’ll notice that if you’re trying to get files out of SharePoint Online, you need a SharePointOnlineCredentials object in order to generate a cookie to pass along with your request.  On-premises SharePoint only requires the Credentials property of the request to be populated with a NetworkCredential object.

With authentication out of the way, knowing the web service endpoint and knowing the necessary methods to get what I wanted, I now had all the necessary pieces to get files out of SharePoint.  Of course, nothing is ever easy – it turns out the responses to these web service calls contain a lot of extraneous information.

In order to keep the response “pure” until it was done being manipulated, I didn’t read the response as a string.  Instead, each response to the web service was read as a byte array (more on why in a moment).  However, the documents listing response is basically just a bunch of text outlining the metadata for each file so for that method call I immediately converted the response to a string for ease of use.  Among each file’s metadata text is a line that contains “document_name=<relative path of file>”, which was exactly what I needed.
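Pulling the paths out of the listing could look something like the sketch below.  It assumes the response is UTF-8 text and that each path runs from “document_name=” to the end of its line (or to the next ‘<’); the ListingParser class is a hypothetical name, and the exact layout of the response should be confirmed against a real capture.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts relative file paths from a "list documents" response.
public static class ListingParser
{
    public static List<string> ExtractDocumentNames(byte[] listingResponse)
    {
        // The listing is plain text, so decoding it here is safe (UTF-8 assumed).
        string text = Encoding.UTF8.GetString(listingResponse);

        var names = new List<string>();
        // Capture everything after "document_name=" up to a line break or '<'.
        foreach (Match match in Regex.Matches(text, @"document_name=([^\r\n<]+)"))
        {
            names.Add(match.Groups[1].Value.Trim());
        }
        return names;
    }
}
```

Each returned entry is a path relative to the site, which is what both the “get document” call and the save step need.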

With the listing of files, and the directory path location of each file now known, the call to the “get document” method can be made for each file to get the file contents.  This is where the byte array response is important: it’s not guaranteed that every file extracted will be text (images, for example) and frankly, we don’t really care if it’s text or an image or a PDF or what.  It will all be saved to the file system from the byte array the same way.  What is interesting is that every file content request’s response contains a chunk of metadata as the first thing in the response, which needs to be stripped out.  This metadata, for whatever reason, is in HTML, so I basically just needed to find the first “</html>” instance in the response and know that everything after that is the actual contents of the file.
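Stripping that metadata while staying in bytes might be sketched as follows.  DocumentResponse is a hypothetical name, and the code keeps every byte after the first “</html>”; whether a line break follows the marker and also needs to be skipped is not covered here and is worth checking against a real response.

```csharp
using System;
using System.Text;

// Hypothetical helper: removes the leading HTML metadata from a "get document" response.
public static class DocumentResponse
{
    private static readonly byte[] Marker = Encoding.ASCII.GetBytes("</html>");

    // Returns the bytes after the first "</html>", or the input unchanged if no marker is found.
    public static byte[] StripMetadata(byte[] response)
    {
        int index = IndexOf(response, Marker);
        if (index < 0)
        {
            return response;
        }

        int start = index + Marker.Length;
        var content = new byte[response.Length - start];
        Buffer.BlockCopy(response, start, content, 0, content.Length);
        return content;
    }

    // Naive byte-sequence search; returns the first match position or -1.
    private static int IndexOf(byte[] haystack, byte[] needle)
    {
        for (int i = 0; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
            {
                j++;
            }
            if (j == needle.Length)
            {
                return i;
            }
        }
        return -1;
    }
}
```

Searching the raw bytes rather than a decoded string means binary files such as images come through without any encoding round trip.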

From there it’s just a matter of saving each file to the file system, using the relative path of each file in order to maintain the directory structure.  And now I can extract files from SharePoint to do with as I wish.
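The save step could be as small as the sketch below.  FileSaver and its parameter names are hypothetical, and it assumes the relative paths use ‘/’ as the separator, as URLs do.

```csharp
using System.IO;

// Hypothetical helper: writes extracted bytes under a local root, mirroring the site's folders.
public static class FileSaver
{
    public static string Save(string rootDirectory, string relativePath, byte[] contents)
    {
        // Drop any leading slash so Path.Combine does not treat the path as rooted,
        // then switch to the local directory separator.
        string localRelative = relativePath
            .TrimStart('/')
            .Replace('/', Path.DirectorySeparatorChar);
        string fullPath = Path.GetFullPath(Path.Combine(rootDirectory, localRelative));

        // Create any missing parent folders before writing the file.
        Directory.CreateDirectory(Path.GetDirectoryName(fullPath));
        File.WriteAllBytes(fullPath, contents);
        return fullPath;
    }
}
```

Because the bytes are written as-is, the same call handles pages, images and any other file type.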
