How to do a join between two XML files with Linq

Linq is still a relatively new concept, but with the release of Visual Studio 2008 beta 2 and talk of a release candidate version it is about to become available for production use.

I’ve been knocking around with Linq for a little while now, and so far I really like what I see. There is plenty of talk about it around the internet and some excellent tutorials kicking around also. For a great look into Linq to SQL and also an overview of lambda expressions and extension methods check out Scott Guthrie’s blog.

First let me begin by saying that I make no pretences to being any form of Linq expert. My linq skills are still very much in their infancy… but I’m going to have a go at talking you through a little problem I had to solve the other day.

The Problem

You have two XML files, or a single XML file with two distinct tree’s. The first file/tree contains some records with primary keys of sorts. The second file/tree contains some joining information. The example I will use here is a list of words and synonyms.

File A may look like this:

<words>       <word>          <id>1</id>
         <text>Accident</text>      </word>      <word>          <id>2</id>
         <text>Mistake</text>      </word>   </words>

File B may look like this (linking the word and its synonyms):

<links>       <link>          <id>1</id>
         <refId>2</refId>      </link>   </links>

A little disclaimer here: Obviously given the chance you would use the hierarchical nature of XML to store the relationships, but because the source of the data I had to deal with was flat I had no choice.

When using linq to SQL you get lovely typed data and all the relationships are mapped out for you (or you map them out using the linq to SQL designer if they are not present in the source data). Unfortunately with linq to XML we can’t explicitly map these relationships *yet*.

Enter stage left the linq join functionality. The join keyword available in linq basically lets you write similar joins to what you might be used to with SQL (left, inner, outer etc).

An SQL query to select a word and find all the synonyms using the ref table might look something like this:

SELECT words_1.word
FROM links INNER JOIN
words ON links.id = words.id INNER JOIN
words AS words_1 ON links.refId = words_1.id

Pretty simple in SQL… but in linq to XML how to you achieve the same result? Using the join keyword of course.

Start off by loading the two XDocuments – one for the words and the other for the linking information:

XDocument wordDoc = XDocument.Load(AppDomain.CurrentDomain.BaseDirectory + "\\words.xml");
XDocument linkDoc = XDocument.Load(AppDomain.CurrentDomain.BaseDirectory + "\\links.xml");

Then start the linkq query in the usual way.

 var wordMatched = from w in wordDoc.Elements("words").Elements("word")

This will iterate all the elements in words/word, and place them into the w variable.

Next we can start populating the return var with information on the currently selected word element.

 select new
 {
 id = w.Element("id").Value,
 term = w.Element("text").Value,

Basically each item in the wordMatched IEnumerable object will contain .id and .term when you iterate them. That’s all really interesting but it doesn’t really do much yet, we want each wordMatched element to contain multiple child synonym words.

 syns = from refIds in linkDoc.Elements("links").Elements("link")
        join synLinkedWord in wordDoc.Elements("words").Elements("word") on refIds.Element("refId").Value equals synLinkedWord.Element("id").Value
        where refIds.Element("id").Value == w.Element("id").Value
        select new
        {
           id = synLinkedWord.Element("id").Value,
           term = synLinkedWord.Element("text").Value
        }

The join here is selecting words from the same XML file first up… this is because the final data we want resides in the same XML file as the words (i.e. words and their synonyms are identical, its the linking table that joins them up). The next part of the join says: we want to make the refIds in the other XML file match the ids of words in the original table.

Next is the where clause, which places a restraint on the query, or the join will return all items that are referenced in the links table… so only return items from the links table that have an id the same as the id on the original “w” object.

Finally we select some data from the original table again (this time using the word that our linking XML gave us).

To read this data is quite simple:

 foreach (var word in wordMatched)
 {
     txtOutput.AppendText("\r\nWord: " + word.term);
     foreach (var syn in word.syns)
     {
          txtOutput.AppendText("\r\n\t->Syn: " + syn.term);
     }
 }

As you can see each word contains the id and term properties, but also contains another IEnumerable properly called syns, each one containing the details of the syns!

Problems

There is one problem which I cannot solve as yet, and this is due to my total lack of depth in linq… Iterating through the wordMatch enumeable will return words which were listed as syns in other words… i.e. in our example mistake will be listed as a syn under accident, but it will also be listed by itself as a main word. This can probably be solved with some more where clauses and possibly another join… comments would be greatly appreciated on this!

Edit 9 Movember (sic) 2007

I solved my own problem whilst working on another problem… Insert the following line below the first from w in wordDoc xxxxx line:

where linkDoc.Elements("links").Elements("link").Where(p=>p.Element("refId").Value==w.Element("id").Value).Any() == false

This line adds a where clause that runs a scan on the links XML document searching for items that have a refId that is the same as the current id. The lambda expression in the .Where checks all elements in the linkDoc.Elements(“links”).Elements(“link”) enumerable to see if their ids match. The .Any() method returns true when the returned enumerable contains one or more items. So if an items id is a refId in the links table we assume that it is a synonym.

Full Listing

XDocument wordDoc = XDocument.Load(AppDomain.CurrentDomain.BaseDirectory + "\\words.xml");
            XDocument linkDoc = XDocument.Load(AppDomain.CurrentDomain.BaseDirectory + "\\links.xml");

            var wordMatched = from w in wordDoc.Elements("words").Elements("word")
                              where linkDoc.Elements("links").Elements("link").Where(p=>p.Element("refId").Value==w.Element("id").Value).Any() == false 			      select new
                              {
                                  id = w.Element("id").Value,
                                  term = w.Element("text").Value,
                                  syns = from refIds in linkDoc.Elements("links").Elements("link")
                                         join synLinkedWord in wordDoc.Elements("words").Elements("word") on refIds.Element("refId").Value equals synLinkedWord.Element("id").Value
                                         where refIds.Element("id").Value == w.Element("id").Value
                                         select new
                                         {
                                             id = synLinkedWord.Element("id").Value,
                                             term = synLinkedWord.Element("text").Value
                                         }
                              };

            foreach (var word in wordMatched)
            {
                txtOutput.AppendText("\r\nWord: " + word.term);
                foreach (var syn in word.syns)
                {
                    txtOutput.AppendText("\r\n\t->Syn: " + syn.term);
                }
            }