Generally speaking, when I need to do a little file manipulation, I fire up .NET, whip up a little VB.NET command-line app to do the trick, and off I go.
However, a few nights ago, I needed to do some manipulation on a largish (30+ MB) XML file. The manipulation itself was fairly simple:
- Find a tag in the file
- Insert the contents of other files into the target file, right before the tag
However, it was late, and I was feeling a bit lazy, so I googled it.
Almost all of the first-page results pointed me to SED or AWK.
What’s that?
SED is short for Stream EDitor. Essentially, it’s an app for running a text file through a set of regular expressions and outputting the results.
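To give a feel for that, here is a minimal sketch. The sample text and the substitution are made up for illustration, and the single quotes assume a Unix-style shell; under cmd.exe the SED script would need double quotes instead.

```shell
# Hypothetical input: two lines of text fed to sed on standard input.
# The s/old/new/ command replaces the first match of the regular
# expression "hello" on each line with "goodbye".
result=$(printf 'hello world\nhello again\n' | sed 's/hello/goodbye/')
printf '%s\n' "$result"
```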
AWK is named for Aho, Weinberger and Kernighan, the three programmers who originally came up with it. It’s actually a language for processing text, but these days the name generally refers to the command-line application that applies that language to an input file and generates output from it.
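As a small taste of the language side, this sketch sums a column of numbers. The input data is invented for the example, and again the single quotes assume a Unix-style shell rather than cmd.exe.

```shell
# Hypothetical input: a name and a count on each line, separated by a space.
# AWK splits each line into fields ($1, $2, ...), runs the first block once
# per line, and runs the END block after the last line has been read.
total=$(printf 'apples 3\npears 4\n' | awk '{ sum += $2 } END { print sum }')
printf '%s\n' "$total"
```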
Not big on UNIX
Now, I’ve been around long enough to know what SED and AWK are, but I’ve really never actually used them. However, with all these search results pointing that direction, I had to poke around a little more.
You can grab a version of SED for Windows here:
http://gnuwin32.sourceforge.net/packages/sed.htm
And AWK (or GAWK, the GNU version of AWK, get it? <g>) here:
http://gnuwin32.sourceforge.net/packages/gawk.htm
Those pages have tons of excellent resources, as well as examples, all the docs you’d ever want to read, etc.
And these two apps have been around for so long that a quick Google search will turn up an example of just about anything you’d need to do with them. So I’m not going to muddy up search results any more than to say that they are really handy tools, especially if you know a little bit about regular expressions.
A Windows Observation
However, I would point out one fairly minor nit that I ran into, at least with the above two ports that I tried.
Both work just fine, but I found SED a tad more troublesome to install. The main problem was that it relies on several external DLLs. You can see these dependencies using Dependency Walker.
These files need to be in the same folder as SED.EXE, and they’re all available at the above link. I guess my feeling is that for such a single-purpose tool, these kinds of dependencies should be compiled in. At one point, many many moons ago, it made at least a little sense to reduce your app’s disk space requirements by relying on shared DLLs and such. But these days, no one cares if an app like this is 150k vs 500k with all the dependencies compiled in.
AWK (or GAWK), on the other hand, has NO dependencies. None. I copied it to my TOOLS folder, which is on my PATH, and voilà! Worked right off. Truly a zero-hassle installation.
They both work very similarly, though SED relies mostly on regular expressions, whereas AWK can be used with regular expressions alone but also has the full AWK language behind it to boot.
Speed
One note about speed. There’s nothing to note!
Both of these apps were so fast, even against a 33 MB input file, that I didn’t even notice they took any time at all. Running them against this file took about the same time as actually copying the file.
Granted, my needs were simple, and I’m sure more complex expressions would slow things down. But still. That was refreshing.
And that thing I needed it for?
Removing a single tag from a large XML file automatically:
awk "!/<\/tag\x3E/" File1.xml >output.xml
Most of the weird look comes from:
- Having to escape the “/” with a “\”, because “/” is what marks the start and end of the regular expression
- Not being able to use a “>” in a batch file command line, because it’s interpreted as the “redirect output into a file” operator (which I’ve used at the end of the command with “>output.xml”), so I have to escape it as “\x3E”
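The other task mentioned at the top, inserting the contents of another file right before a tag, can be sketched along the same lines. This is only a rough sketch under some assumptions: the file names (target.xml, fragment.xml, output.xml), the "items" tag, and the sample contents are all made up, the tag is assumed to sit on its own line, and the quoting assumes a Unix-style shell.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical target file and the fragment to insert into it.
printf '<root>\n<items>\n<item>1</item>\n</items>\n</root>\n' > target.xml
printf '<item>2</item>\n<item>3</item>\n' > fragment.xml

# On the first line that contains the closing tag, print every line of the
# fragment file first. The second rule then prints the current line, so the
# fragment lands right before the tag and all other lines pass through.
awk -v frag="fragment.xml" '
    /<\/items>/ && !done {
        while ((getline line < frag) > 0) print line
        close(frag)
        done = 1
    }
    { print }
' target.xml > output.xml

cat output.xml
```

Under cmd.exe, a multi-line program like this is probably easiest to keep in its own file and run with AWK's -f option, which also sidesteps the quoting and escaping issues above.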
I suspect I’ll be using it considerably more in my future!