PDA

View Full Version : How to merge many XML files into one




mrindia
Jul 15, 2013, 09:15 AM
Hi: I got a small project to combine many XML files into one and convert the combined XML file in Excel using AppleScript. My XML files look like this:

<?xml version="1.0" encoding="UTF-8"?>
<Metadataobject>
<from>horsten.jock@edienpartner-int.com</from>
<jobname>B3_IM09MBDUF</jobname>
<pages>2</pages>
<priority>3</priority>
<timezone>CEST</timezone>
<year>2013</year>
<month>7</month>
<day>15</day>
<hour>11</hour>
</Metadataobject>

and like this...

<?xml version="1.0" encoding="UTF-8"?>
<Metadataobject>
<from>frank.k@mrtner-int.com</from>
<jobname>P1_FR1330G006007_Kate_van der Vaart</jobname>
<pages>2</pages>
<priority>1</priority>
<timezone>CEST</timezone>
<year>2013</year>
<month>7</month>
<day>12</day>
<hour>16</hour>
</Metadataobject>

I get many XML files like this. And I want them to be combined and shown like this:

<?xml version="1.0" encoding="UTF-8"?>
<Metadataobject>
<job id="1">
<from>thck@rtner-int.com</from>
<jobname>B3_IM09MBDUF</jobname>
<pages>2</pages>
<priority>3</priority>
<timezone>CEST</timezone>
<year>2013</year>
<month>7</month>
<day>15</day>
<hour>11</hour>
</job>
<job id="1">
<from>flk@mrtner-int.com</from>
<jobname>P1_FR1330G006007_Kate_van der Vaart</jobname>
<pages>2</pages>
<priority>1</priority>
<timezone>CEST</timezone>
<year>2013</year>
<month>7</month>
<day>12</day>
<hour>16</hour>
</job>
</Metadataobject>

And finally the combined XML file converts in Excel sheet with column headings "Job ID", "From", "Job Name" and so on...

Thanks



r0k
Jul 16, 2013, 08:19 AM
First of all, please go back and edit your post to remove "real" email addresses. We don't need those to help you but the spambots are happy to find those email addresses "in the clear."

If I were doing this, I would do it in terminal, with something like this:

I would cd to the folder with all the xml files,then I would...

cat *.xml | egrep -v "Metadataobject" | egrep -v "xml version" > new.xml
cat firstfilename.xml | grep "xml version" > alldone.xml
cat firstfilename.xml | grep "<Metadataobject>" >> alldone.xml
cat new.xml >> alldone.xml
cat firstfilename.xml | grep "</Metadataobject>" >> alldone.xml


This is using standard Unix command line tools.
The cat command simply sends all the files onto the pipe.
The pipe command, | says take the output of the previous step and pass it as input to the next step.
The egrep -v command says omit all lines containing Metadataobject (both the one before and after the stuff you want).
The next egrep -v command gets rid of xml version lines
The resulting file now has everything in one long block and you can prepend the header stuff before it and the /Metadata object line after it.

The next lines are more of the same so you wind up with the file looking like your example. Of course you know the filenames so you would change firstfilename.xml in my example above to one of the files you are using.

One thing to consider. Maybe you should keep the Metadataobject lines. You may only need to get rid of the "xml version..." headers, so the following command would work...

cat *.xml | egrep -v "xml version" > new.xml
cat firstfilename.xml | grep "xml version" > alldone.xml
cat new.xml >> alldone.xml

Again you know the filenames so you would change firstfilename.xml in my example above to one of the files you are using.

Hope this helps.

jahala
Jul 17, 2013, 02:00 AM
Good first shot, but you failed to notice that he added <job> and </job> to around the contents of each file. The following works to get a combined xml file.

head -n 2 firstfile.xml > test.res; for i in *.xml; do echo '<job id="1">' >> test.res; tail -n +3 $i | grep -v 'Metadataobject' >> test.res; echo '</job>' >> test.res; done; echo '</Metadataobject>' >> test.res

The output file is test.res. I am not yet sure what to do to make it open in excel. I imagine it has something to do with the first line.