Useful Perl Scripts With Regular Expressions
Most computer users, especially software engineers, have had a need to modify multiple files to either add a line of text, modify a line of text, or completely remove a line of text. The problem is that there never seems to be a piece of software out there that can help you with this problem. Some programs let you get close to doing what you want but in my experience none ever let you do exactly what you want; so a few hours are spent opening each file and editing them manually.
Well, I finally had enough and wrote a couple of little Perl scripts that use regular expressions to edit only the selected file types. The scripts will modify, add, or remove any text you want. Since these are Perl scripts you cannot just type what you want to do into a pretty text box and have the program do it; rather, you will have to make some small modifications to the scripts themselves so that they change your files the way you would like.
We will walk through how to modify the Perl script to make it do a few different things that most people would want and then discuss other possible uses for the scripts. We will even discuss how to get the script to traverse directories, since that is usually where the biggest issues arise when you need to modify multiple files.
Beyond being able to edit this script so that you can use it to parse and edit your files, you should learn a little bit about regular expressions, which is always something that people dislike. Most programmers who have used regular expressions have run into cases where the expression grabs more than they wanted. In most languages regular expressions are greedy by default (they do a maximum munch), meaning they take as much as possible instead of stopping at the first spot where stopping would be permitted by the expression. We will discuss how to make the regular expression grab a smaller portion, because this will help when we want to modify simple HTML pages and, more specifically, a tag that has attributes in it.
Open A File Using Perl
In Perl there are a lot of ways to do the same thing, so if you open your files differently feel free to continue to do so. We still need to discuss how to open files for those who are reading this just to get a rather large number of files modified quickly.
In the code above we see an example of opening a file located at "/home/directory/file.txt", which is a Unix path. To make this work on a Windows machine you just have to put in a Windows path like "C:\\my documents\\file.txt". While it might seem funny to see the double backslashes, they are required inside a double-quoted string because a single backslash introduces an escape sequence, so Perl needs two backslashes to produce one. You could always use forward slashes instead of backslashes, and then you would only need one, but let's not get confused here.
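The listing this section refers to is not reproduced here, so the following is only a minimal sketch of a read-only open, not the original code. It assumes the three-argument form of open with a lexical filehandle; the older two-argument form, where the mode and path share one string, is shown in a comment. The helper name read_lines is made up for illustration.

```perl
use strict;
use warnings;

# Read every line of a text file and return the lines as a list.
# The '<' mode means read only. The older two-argument equivalent
# would be: open(IN, "/home/directory/file.txt")
sub read_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path for reading: $!";
    my @lines = <$in>;    # list context: one element per line
    close($in);
    return @lines;
}

# Example calls, using the paths discussed in the text:
# my @lines = read_lines("/home/directory/file.txt");
# my @lines = read_lines("C:\\my documents\\file.txt");
```

Checking the return value of open and printing $! is worth the extra few characters, since a mistyped path otherwise fails silently.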
There is one more type of open that we will want to do and that is an open where we are allowed to then write to the file. While it might seem odd to have to open a file differently when you want to write to it there is a good reason for it. You do not want to open a file that is meant just to be read from and then start writing to it by accident in your code. Since Perl requires you to open the file a little differently as we can see in the code sample below, there will be no chance of us opening a file we only want to read from and then writing to it by accident.
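As a hedged sketch of an open for writing (again assuming the three-argument form, with the hypothetical helper name write_lines), the only change from a read-only open is the mode:

```perl
use strict;
use warnings;

# Write a list of lines to a file, replacing whatever it held before.
# The '>' mode creates the file if needed and empties it if it exists.
# The two-argument equivalent would be: open(OUT, ">/home/directory/file.txt")
sub write_lines {
    my ($path, @lines) = @_;
    open(my $out, '>', $path) or die "Cannot open $path for writing: $!";
    print {$out} @lines;
    close($out) or die "Cannot close $path: $!";
    return scalar @lines;    # number of lines written
}

# Example call (the content is made up for illustration):
# write_lines("/home/directory/file.txt", "first line\n", "second line\n");
```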
As you can see, all we did was add a greater-than sign (>) in front of the path name to make it so we can write to the file. It is very simple to do, but keep in mind that opening a file with > empties it immediately, before you write anything. By leaving out the greater-than sign we make it so we cannot write to the file, and will probably end up saving ourselves from overwriting files that we forgot to back up and that would take hours to recreate.
Being that we are good programmers we have to close the file once we are done with it so that we do not run into issues and cause the file to become corrupted. To close the file is simple as can be seen in the code below.
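A minimal sketch of closing files, assuming lexical filehandles and a made-up helper named copy_lines. It opens one file for reading and one for writing, then closes both. The return value of close is checked on the output handle because output is buffered, so a write problem may only show up at that point.

```perl
use strict;
use warnings;

# Copy a text file line by line and close both handles when done.
sub copy_lines {
    my ($from, $to) = @_;
    open(my $in,  '<', $from) or die "Cannot open $from: $!";
    open(my $out, '>', $to)   or die "Cannot open $to: $!";

    my $count = 0;
    while (my $line = <$in>) {
        print {$out} $line;
        $count++;
    }

    close($in);
    # Buffered output is flushed here, so check for failure.
    close($out) or die "Cannot close $to: $!";
    return $count;    # number of lines copied
}
```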
Replace On A Single File
Here we will hard-code our script to edit a single file for certain words. You could set the script up to prompt the user for the file, but I figured that was overkill since we would probably have to go in and make some minor changes to the script anyway. If you want the script to prompt, you can use the code below.
The code above prints out Please enter dir name: and then we use chomp, which we will discuss later, to remove the newline that comes in when the user hits enter.
The code below will parse a file that contains a <body …> tag. We want to remove all the attributes in the body tag because we are starting to use cascading style sheets (CSS) and will define everything in our .css file instead. Our file will start out with <body bgcolor="green"> and we will end up with <body>.
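A small sketch of such a prompt, under the assumption that the answer is read from standard input. The helper name ask and its optional second argument (a filehandle to read from, which exists only to make the helper easy to test) are made up for illustration.

```perl
use strict;
use warnings;

# Print a question and return the answer without its trailing newline.
sub ask {
    my ($question, $in) = @_;
    $in = \*STDIN unless defined $in;    # default: read from the keyboard
    print $question;
    my $answer = <$in>;
    die "No input received\n" unless defined $answer;
    chomp($answer);                      # drop the newline from the Enter key
    return $answer;
}

# Example call:
# my $directory = ask("Please enter dir name: ");
```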
It would probably not make much sense to use this script to edit just one file because it would be faster to open the HTML document and make the change manually than it would be to edit this script and then run it. We will be building upon this script so it is not a waste of time.
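The following is a minimal sketch of a single-file version, not the original listing. It assumes the file is read line by line into an array called @outLines and then written back to the same path; the helper name strip_body_attributes is hypothetical. Because it works one line at a time, a body tag that is split across several lines would not be matched.

```perl
use strict;
use warnings;

# Remove every attribute from <body ...> tags in one file, in place.
sub strip_body_attributes {
    my ($path) = @_;

    # First open: read the file and collect the (possibly changed) lines.
    open(my $in, '<', $path) or die "Cannot open $path for reading: $!";
    my @outLines;
    while (my $line = <$in>) {
        # [^>]* cannot match '>', so the match ends at the tag's own '>'.
        $line =~ s/<body\b[^>]*>/<body>/gi;
        push @outLines, $line;
    }
    close($in);

    # Second open: write everything back. Use a different path here
    # if the original file should be left untouched.
    open(my $out, '>', $path) or die "Cannot open $path for writing: $!";
    print {$out} @outLines;
    close($out) or die "Cannot close $path: $!";
    return;
}

# Example call (hypothetical path):
# strip_body_attributes("/home/directory/index.html");
```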
You will notice above that we open the file once to read it and then open it again to write to it. I did this because you might want to write the output to a different file than the original, and with two separate opens the code to do so already exists.
In the above code you will also notice the regular expression ([^>]*), which keeps the match from running too far. It is still greedy, but because the character class cannot match a greater-than sign, the match has to stop at the first one. If you used .* instead, and feel free to give this a try, the expression would run to the last greater-than sign it can reach: the last one on the line when the file is read line by line, or the last one in the file when the whole file is read into one string and the /s modifier is used. In a well-formatted HTML document that second case removes everything from the body tag to the closing html tag and replaces it with just a simple <body> tag.
We are going to use File::Find, a Perl module, to parse all the files in a directory and its subdirectories. This module will work on Unix and Windows machines as well as Mac OS machines, but Mac users will want to consult the File::Find documentation to see a few of the issues that Macs have with it and their workarounds.
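To see the difference on a single line, here is a short sketch that runs three variants of the substitution on the same made-up string. The third variant uses the non-greedy quantifier .*?, which the text above does not cover but which is another common way to get the same result.

```perl
use strict;
use warnings;

my $line = '<body bgcolor="green"><p>Hello <b>world</b></p>';

# Greedy: .* runs to the LAST '>' on the line, swallowing the content.
(my $greedy = $line) =~ s/<body.*>/<body>/;

# Negated class: [^>]* cannot match '>', so it stops at the first one.
(my $negated = $line) =~ s/<body[^>]*>/<body>/;

# Non-greedy quantifier: .*? takes as little as possible.
(my $lazy = $line) =~ s/<body.*?>/<body>/;

print "greedy:  $greedy\n";     # greedy:  <body>
print "negated: $negated\n";    # negated: <body><p>Hello <b>world</b></p>
print "lazy:    $lazy\n";       # lazy:    <body><p>Hello <b>world</b></p>
```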
The code below will traverse directories but not symbolic links. This means that if there is a real subdirectory in the directory that you tell it to run on, the script will parse all the files in that subdirectory and all of its subdirectories, but it will not follow symbolic links. You can make it follow symbolic links by using the follow option, but you will want to read the documentation on that.
In the code above we first start out by doing a use File::Find; which allows us to use the find function. We then define my $directory and set it to the path of the directory we want to parse. The last thing we do in the main part of the code is call the find function. Its first argument is a reference to the processing function, a callback that will be called with each file and directory found within the main directory. The second argument is the actual directory or directories we want to search.
The most complex part of this script is the actual processing subroutine, which is called with each file and directory found within the main directory. There is no way to tell find to only select certain types of files, so our processing code will even try to run on directories, and if we try to open a directory, at least in Windows, the script will crash. We also do not want to be parsing image files or other binary files for the body tag: first, because we could certainly mess them up, and second, because there is nothing in them we want to change.
Since we know we only want to parse HTML documents and change the body tags, we can easily just add an if statement that says if the file ends with .html then let's parse the file. From here, since we know we have an HTML file, we open the file and then search the whole file for the body tag. When we find the body tag we replace it and keep searching. To be more efficient we could have stopped our searching, but I will leave that little modification up to you.
The next thing we do is close the current file and then reopen the file in write mode. We then write everything out, whether we made a change or not, to the output file. We then clean up by closing the output file and we do an undef, just to be clean, on @outLines, which is the array that holds all the data we are going to write out.
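What follows is a hedged sketch of such a traversal, not the original listing. It assumes the directory is taken from the command line rather than hard-coded, and the callback name process_file is made up. The find function, the $_ variable holding the bare file name, and $File::Find::name holding the full path are all part of File::Find.

```perl
use strict;
use warnings;
use File::Find;

# Callback for find(): called once for every file and directory.
# File::Find changes into each directory first, so $_ is the bare
# file name and $File::Find::name is the full path.
sub process_file {
    return unless -f $_;         # skip directories
    return unless /\.html$/i;    # only touch HTML documents

    my $file = $_;
    open(my $in, '<', $file) or die "Cannot read $File::Find::name: $!";
    my @outLines;
    while (my $line = <$in>) {
        $line =~ s/<body\b[^>]*>/<body>/gi;
        push @outLines, $line;
    }
    close($in);

    # Reopen in write mode and write every line back, changed or not.
    open(my $out, '>', $file) or die "Cannot write $File::Find::name: $!";
    print {$out} @outLines;
    close($out) or die "Cannot close $File::Find::name: $!";
    undef @outLines;
    return;
}

# Assumption: the directory to process is the first command-line argument.
my $directory = shift @ARGV;
find(\&process_file, $directory) if defined $directory;
```

Run it as, for example, perl script.pl /home/directory (the script name is hypothetical), and try it on a copy of your files first since it rewrites them in place.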
Converting From Unix Files To Windows Files
Lots of people seem to be moving from the Windows world to the Unix world, from the Unix world to the Linux world, or back and forth between operating systems for many reasons. The problem is that operating systems end lines differently: Unix uses a line feed (\n), Windows uses a carriage return plus a line feed (\r\n), and the Mac prior to OS X uses just a carriage return (\r). So when moving files between these operating systems there can be some issues, and some weird characters show up and you might not know why.
The previous script is easily modified to remove the line termination string and add in a new one. If you are moving from a Unix system to a Windows-based system you would want to remove every \n and add \r\n in its place. This would allow the file to be read in Windows-based applications like Notepad. If you have ever opened a file in Notepad and seen everything on one line with weird boxes, that is because the lines are not terminated the way Notepad expects.
There is a program out there called flip that can convert a single file at a time, but it is not as easy to use when you need to convert many files, including files in subdirectories.
The code below will go through each line and strip the terminator at the end, be it \n on Unix, \r\n on Windows, or just \r on the Mac, and then add the line terminator that we want. I did it with chomp because it seemed like it would be a lot cleaner code, but please note I did not get to test chomp on Windows or Mac. chomp only removes the current value of $/, which is a single \n by default, so on a Unix machine it leaves the \r of a Windows file behind and does nothing to a \r-only Mac file. If chomp does not do the job for your files, you can easily just do a replace that names all three terminators.
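Here is a minimal sketch of the conversion for one file. It swaps chomp for a substitution, for the reason that chomp only strips the value of $/, and it reads the whole file as one string so that \r-only files are handled too. The helper name convert_line_endings and its default of "\r\n" are assumptions for illustration; binmode is used so that Perl on Windows does not translate line endings behind our back.

```perl
use strict;
use warnings;

# Rewrite one file so that every line ends with $eol (default "\r\n").
sub convert_line_endings {
    my ($path, $eol) = @_;
    $eol = "\r\n" unless defined $eol;

    open(my $in, '<', $path) or die "Cannot open $path for reading: $!";
    binmode($in);                         # read the bytes exactly as stored
    my $text = do { local $/; <$in> };    # slurp the whole file
    close($in);
    $text = '' unless defined $text;

    # Match CRLF first, then a lone CR or a lone LF, and replace each
    # with the requested terminator.
    $text =~ s/\r\n|\r|\n/$eol/g;

    open(my $out, '>', $path) or die "Cannot open $path for writing: $!";
    binmode($out);                        # write the bytes exactly as given
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";
    return;
}

# Example calls (hypothetical path):
# convert_line_endings("/home/directory/file.txt");          # to Windows
# convert_line_endings("/home/directory/file.txt", "\n");    # to Unix
```

To run this over a whole tree, call it from a File::Find callback in place of the body tag substitution.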
The code above is just like the code we used to change the body tag, so it should be pretty straightforward. I have used scripts like this often to move code that was created on a Unix machine to a Windows machine or the other way around. I have also used this code to move major things like Perforce or CVS versioning files from one operating system to another, so hopefully this proves to be as useful to you as it has been to me.
The first thing that you should have gotten from this tutorial is how to stop a regular expression from grabbing more information than you wanted it to. There are many regular expression tutorials out there but many overlook this, so hopefully you now have some idea of how to make a regular expression grab only the exact amount of information you want and nothing more.
The other important thing you should have gathered from this article is how to use File::Find to parse all files in a directory and its subdirectories. The File::Find module works on most operating systems, so it is much better than doing a system call to ls or dir, which is what most people did before File::Find was created.
Copyright 2004 Matthew Drouin. All rights reserved.
Matthew Drouin is a published author with his book entitled Web Hosting and Web Site Development: A Guide to Opportunities. A graduate of the University of Hartford, Matthew received a BS in Computer Science and then started to chase all his dot com dreams. His love for many different programming languages has recently led him to start spending the majority of his time writing open source tutorials, which are published on his site OpenSourceTutorials.com.