I recently needed to verify the integrity of a large number of XML files stored in Azure Blob Storage. These files were essentially hand-written and contained numerous typos, errors and other inconsistencies. I needed not only to identify files that did not contain well-formed XML, but also to find files containing nodes with names that were not on a list of known good names.
I started off with an empty Console Application in Visual Studio 2013. Knowing that I would be accessing a Cloud Storage account, I added the Windows Azure Storage NuGet package to my solution. I first added the boilerplate code to identify the blobs in my Storage account that contained XML.
private static CloudBlobContainer _rootContainer;

static void Main()
{
    // Access the Storage account
    CloudStorageAccount storageAccount = CloudStorageAccount.Parse(ConnectionString);
    CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
    _rootContainer = blobClient.GetContainerReference(RootContainerName);

    // Retrieve all XML and TXT blobs under this root as a flat list,
    // ignoring the "virtual" directory structure.
    IEnumerable<CloudBlockBlob> xmlBlobs = _rootContainer
        .ListBlobs(null, true)
        .OfType<CloudBlockBlob>()
        .Where(blob => blob.Name.EndsWith(".xml") || blob.Name.EndsWith(".txt"));

    // Find the erroneous XML element names in each blob and merge them
    // with the "global" dictionary.
    foreach (var blob in xmlBlobs)
    {
        XDocument xml = GetXDocumentFromBlob(blob);
        if (xml != null)
        {
            IEnumerable<string> unexpected = FindUnexpectedNames(xml);
            MergeUnexpectedNames(blob.Name, unexpected);
        }
    }

    OutputNames();
}
In the above code, ConnectionString is the connection string for the Azure Storage account and RootContainerName is the name of the Blob container I wanted to access. In my case, I knew that any blob with an “extension” of .txt or .xml would contain XML, so the LINQ query filters on that criterion.
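The snippets throughout this post also assume a handful of static fields that aren't shown above. A minimal sketch of what those declarations might look like (the names match the code, but the values and sample element names here are hypothetical):

```csharp
// Hypothetical configuration values; substitute your own account details.
private const string ConnectionString = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...";
private const string RootContainerName = "my-container";

// Element names we expect to see; anything else is flagged as unexpected.
private static readonly HashSet<string> KnownGoodList =
    new HashSet<string> { "catalog", "book", "title", "author" };

// Maps each unexpected element name to the set of blobs it appears in.
private static readonly Dictionary<string, HashSet<string>> UnexpectedDictionary =
    new Dictionary<string, HashSet<string>>();

// Maps blob names to the parse error encountered, for later reporting.
private static readonly Dictionary<string, string> ParseErrorDictionary =
    new Dictionary<string, string>();
```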
The next step was to try to parse the blob contents and move them into an XDocument object so I could work with them easily. This had the beneficial side effect of identifying files containing invalid XML, which were set aside in another dictionary for reporting purposes.
static XDocument GetXDocumentFromBlob(CloudBlockBlob blob)
{
    // Open a read stream over the blob's contents. The using block
    // ensures the stream is disposed once the blob has been read.
    using (var stream = blob.OpenRead())
    {
        // Attempt to parse the entire file to identify invalid XML.
        try
        {
            return XDocument.Load(stream);
        }
        catch (System.Xml.XmlException e)
        {
            // If there are any errors loading/parsing the XML, store a
            // reference to the blob in a secondary dictionary.
            Console.WriteLine(e.Message);
            ParseErrorDictionary.Add(blob.Name, e.Message);
            return null;
        }
    }
}
I open a Stream for the Blob reference and read it into an XDocument object, which will throw an exception if the XML is not well-formed. In the case of bad XML, I add the blob name and error message to a dictionary for later output and move on to the next blob. If the blob loads successfully, I sift through it in the FindUnexpectedNames function and return the unexpected XML element names.
static IEnumerable<string> FindUnexpectedNames(XDocument xml)
{
    var output = new HashSet<string>();

    // Iterate over the xml elements and ingest them into the output collection.
    foreach (XElement descendant in xml.Descendants())
    {
        // Retrieve the node name, including prefix if applicable.
        var nodeName = descendant.Name.NamespaceName == string.Empty
            ? descendant.Name.LocalName
            : string.Format("{0}:{1}",
                descendant.GetPrefixOfNamespace(descendant.Name.Namespace),
                descendant.Name.LocalName);

        if (!KnownGoodList.Contains(nodeName))
        {
            // Add the name to the output; HashSet.Add is a no-op if we
            // are already tracking this name for this file.
            output.Add(nodeName);
        }
    }

    return output;
}
It is worth noting that the function builds its output in a HashSet. In the process of developing this utility, I encountered serious performance issues due to the sheer number of XML files I needed to parse, compounded by the number of tags within each file. One of the optimizations I made to combat this was to use a HashSet in this function for the list of unexpected tags, as its Contains and Add operations have an average time complexity of O(1). This means that the lookup time to search for a value in a HashSet is extremely fast and (most importantly) does not grow as a function of N (the number of elements in the HashSet). I also use a Dictionary<string, HashSet<string>> globally to map unexpected names to the files they occur in, for the same reason.
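To make the difference concrete, here is a minimal sketch (with made-up element names) contrasting the HashSet lookup with the List lookup it replaced in spirit. List<T>.Contains scans every element, so its cost grows linearly with the collection size; HashSet<T>.Contains hashes the value once and probes a bucket:

```csharp
// O(n): every Contains call walks the list until it finds a match.
var knownGoodList = new List<string> { "catalog", "book", "title" };
bool inList = knownGoodList.Contains("title");

// O(1) on average: Contains hashes "title" and checks one bucket,
// regardless of how many names the set holds.
var knownGoodSet = new HashSet<string> { "catalog", "book", "title" };
bool inSet = knownGoodSet.Contains("title");
```

With thousands of files and thousands of tags per file, that per-lookup difference compounds into the overall speedup described above.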
The remainder of the FindUnexpectedNames function iterates over the XDocument’s descendant nodes, comparing each node’s name against KnownGoodList (another HashSet) and adding the name to the output if the name was unexpected.
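The prefix handling deserves a quick illustration. In this contrived example (not from the original project), the root element has no namespace while its child carries a pub: prefix, so the two recorded names differ in form:

```csharp
var doc = XDocument.Parse(
    "<catalog xmlns:pub=\"urn:publisher\"><pub:Isbn /></catalog>");

foreach (XElement element in doc.Descendants())
{
    var name = element.Name.NamespaceName == string.Empty
        ? element.Name.LocalName
        : string.Format("{0}:{1}",
            element.GetPrefixOfNamespace(element.Name.Namespace),
            element.Name.LocalName);
    Console.WriteLine(name); // prints "catalog", then "pub:Isbn"
}
```

Recording the prefixed form means a misspelled prefix shows up as its own unexpected name rather than silently colliding with the unprefixed local name.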
For the final piece of the puzzle, I have a method that takes the unexpected tags from the previous step and merges them into a global unexpected tag dictionary, where each key is an unexpected tag name and the value is a HashSet of the file(s) the tag was found in. This information is necessary in order to triage the erroneous files later.
static void MergeUnexpectedNames(string blobName, IEnumerable<string> names)
{
    // Merge the erroneous names into the UnexpectedDictionary.
    foreach (var name in names)
    {
        if (!UnexpectedDictionary.ContainsKey(name))
        {
            // We aren't already tracking this name, so add it along with
            // the file it was found in.
            UnexpectedDictionary.Add(name, new HashSet<string> { blobName });
        }
        else
        {
            // We are already tracking this name, so just record the file
            // it was found in.
            UnexpectedDictionary[name].Add(blobName);
        }
    }
}
On the final line of Main, I call an OutputNames function which outputs the results to an output.txt file in a readable format.
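The original OutputNames implementation isn't shown here, but a plausible sketch (assuming the field names from earlier and using System.IO and System.Linq) would write each unexpected tag followed by the files it appeared in, then list the files that failed to parse:

```csharp
// A hypothetical OutputNames; the original's exact format may differ.
static void OutputNames()
{
    using (var writer = new StreamWriter("output.txt"))
    {
        foreach (var pair in UnexpectedDictionary.OrderBy(p => p.Key))
        {
            writer.WriteLine(pair.Key);
            foreach (var file in pair.Value)
                writer.WriteLine("    " + file);
        }

        writer.WriteLine();
        writer.WriteLine("Files with parse errors:");
        foreach (var pair in ParseErrorDictionary)
            writer.WriteLine("{0}: {1}", pair.Key, pair.Value);
    }
}
```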
The full source can be found on CodePlex and is available for use under the MIT license.