Very simple(?) question from newbie about perl/textedit




sun surfer
Jun 20, 2013, 09:04 PM
Hi, this may sound extremely easy but I can't figure out what to do (and I'm a total non-techy). I have some long text codes saved in multiple textedit files and need to edit multiple items at once. Doing it manually would take a long, long time.

Someone told me that I can just use a perl regular expression, pop it in and it will automatically delete all the text parts I need deleted. The parts I need deleted all start and end with the same text but have different middles. So they not only said it was possible but said they'd done it and it worked great (they were working on the same type of thing I am), and gave me the code to use.

So it all sounds great and I have the code but I have no idea where to put it or what to do with it! Once I know what to do with it, it should just pop out new textedit files with the parts I need deleted taken out. I can't get in touch with the person who gave me the perl code now, and I can't figure out an answer on google, so I'm asking here.

Where do I go to put a perl regular expression so that it will alter a textedit file? Thanks for any help.



jhiesey
Jun 21, 2013, 02:24 AM
You will need to run perl from the terminal (Terminal.app), but exactly how you do it depends on what your friend gave you.

Could you post the code you were given?

sun surfer
Jun 21, 2013, 01:17 PM
Hi, thanks; I had no idea I'd need to use the terminal.

Here's the basic code with "x" substituting for the beginning and end of the data it's looking for to erase (the beginning and end of the targeted text are always the same; the middle is always different hence needing this code to do the job).

<x(.+\n)*.+<\/x>

They said this code worked perfectly for them in the same situation (although to a non-tech person like me it looks like hieroglyphics). But they said just substitute in for "x" and plug it in. I've rarely used the terminal but I know what it is, but how will this code know to target a particular file or particular group of files?

jhiesey
Jun 21, 2013, 06:17 PM
Well, you don't necessarily need to use the terminal. In fact, since your regular expression tries to match across multiple lines (there can be returns/enters in the middle part), I wouldn't really recommend the terminal, since multi-line matching is a bit harder to do that way.

Many text editors have built-in support for regular expressions, but TextEdit doesn't. If you don't want to use the terminal, you can download an editor like Sublime Text 2, which nominally costs $70, but has a trial you can use for free forever if you ignore the prompts that nag you once in a while (it's not really that annoying actually). Other text editors like TextMate and BBEdit would work as well, but aren't free either. If you aren't familiar with the terminal, this is a much easier approach.

This regular expression you gave looks a bit more specific than one that just finds a string with constant beginning and end and a varying middle. It looks like it was intended to match opening and closing tags in HTML or XML specifically. For example, if x is span, it will match <span>hi</span> or <span style="color:blue">some text</span>. It will even match where the middle part goes across multiple lines, as long as none of the lines are blank.

If you just want to match something more general, without the angle brackets (<>) and such, this isn't the right regular expression.
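To make that concrete, here is a small sketch in Python (not Perl or Sublime, just a handy way to try a pattern out; Python's re module treats this particular expression the same way as far as I know). It substitutes span for x and checks four made-up sample strings, including one with a blank line in the middle. The names pattern, samples and results are only for the illustration.

```python
import re

# The expression from the thread, with "x" replaced by "span".
pattern = re.compile(r"<span(.+\n)*.+<\/span>")

# Hypothetical sample inputs, made up for this illustration.
samples = {
    "simple": "<span>hi</span>",
    "with attribute": '<span style="color:blue">some text</span>',
    "multi-line": "<span>first line\nsecond line</span>",
    "blank line inside": "<span>first line\n\nsecond line</span>",
}

# True when the whole sample is matched by the pattern.
results = {name: bool(pattern.fullmatch(text)) for name, text in samples.items()}

for name, matched in results.items():
    print(name, "->", matched)
# simple -> True
# with attribute -> True
# multi-line -> True
# blank line inside -> False
```

The last case fails because .+ needs at least one character on every line, and a blank line has none.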

If you use Sublime Text 2, you can just open the folder containing the files you want to edit (it lets you open whole folders), open global find and replace (command-shift-f), turn on regular expression mode (there's a button on the far left for that, labeled .*), put your expression into the "Find" box, and leave the "Replace" box empty. If you click "Replace", all of the files will be changed, and you can inspect each file before saving, or you can just go to File->Save All to save everything. Other text editors will be similar.

If you still want to use the terminal instead of another text editor, I can give you instructions for that too, but it might be a little tricky if you are totally unfamiliar with it.

sun surfer
Jun 24, 2013, 12:14 PM
Thanks! I downloaded Sublime Text 2 as you suggested, and am trying to make it work. This expression is intended to match opening and closing tags, and that's what encloses each group of text I need to remove - the text inside is different but the tags are the same, so that's why this person made this expression to make it easier.

I've done exactly as you said (and thanks for the clear, precise directions that made it very easy) but something goes wrong - it deletes everything from the start of the first tag to the end of the last tag, so basically it deletes almost everything. What I need to happen is for it to delete each text inside the tags, but leave the text in between unaffected. Is it maybe something I'm still not doing right, or is the expression flawed? The person said the expression worked perfectly for them.


ETA - I think I have solved the problem. I thought to try a different text editor and tried "Ultraedit". I wouldn't have been able to figure out what to do there if not for your directions for Sublime Text 2, but with fumbling around it wasn't so different to figure it out in the "find and replace" window, and it worked perfectly! So, somehow this regular expression doesn't work properly in sublime text 2 but will work properly in ultraedit. Either way, it's solved now and thanks so much for your help. I wouldn't have been able to do it without your help. :)

jhiesey
Jun 24, 2013, 02:11 PM
That behavior is indeed what the expression you posted will do. It doesn't make any distinction between what's inside the opening tag and what's between the tags. For example, if you have This is <span style="color:blue">in blue</span>! then after you do the replacement you will end up with This is !

Instead, it sounds like you want the result to be This is in blue! To do that, you would need to fill in the "Find:" box with <span[^>]*>((.+\n)*.+)<\/span> and the "Replace:" box with \1 (that's a backslash and the number one). This last part specifies that you want to replace it with what is inside, which is the part matched inside the outer set of parentheses.
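For comparison, here are the two replacements side by side as a minimal Python sketch (Python's re.sub standing in for the editor's Find/Replace boxes, which is an assumption on my part; the \1 backreference is written the same way). The variable names are only for the example.

```python
import re

text = 'This is <span style="color:blue">in blue</span>!'

# Original expression, empty replacement: the tags AND their contents go.
strip_all = re.sub(r"<span(.+\n)*.+<\/span>", "", text)

# Expression with an outer capture group, replaced by \1: only the tags go.
keep_inner = re.sub(r"<span[^>]*>((.+\n)*.+)<\/span>", r"\1", text)

print(strip_all)   # This is !
print(keep_inner)  # This is in blue!
```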

Let me know if that does what you want.

----------

Maybe I didn't quite understand what you are trying to do, since I really don't see why it does what you want in Ultraedit.

It is true that regular expressions aren't particularly well standardized, however, so differing behavior isn't much of a surprise.

sun surfer
Jun 24, 2013, 02:17 PM
No, it's a bit different. Here's an example:

<x>A
B
C</x>
D
E
F
<x>G
H
I</x>
J
K
L
<x>M
N
O</x>

In this case, I would want D, E, F, J, K, L to not be deleted, but with Sublime Text 2 it does delete those as well as the text in the tags, because it deletes everything from the first start tag to the last end tag, even text not in tags, as long as the text is after the first tag and before the last tag.

However, the perl regular expression does work properly in Ultraedit, so it's all good now and I just used that once I tried it and realised it works there. I'm not sure why the perl regular expression doesn't work the same way in Sublime Text 2?

NeverhadaPC
Jun 24, 2013, 03:25 PM
Surprised no one recommended TextWrangler. It's free and supports regular expressions.

I've used it many times to do what you seek. Just go to Find/Replace and check the "Grep" box.

Persifleur
Jun 24, 2013, 03:27 PM
By default, the '+' pattern is "greedy", meaning it'll match as many characters as possible while still matching the overall pattern. Sublime is doing what I would consider standard, and UltraEdit's implementation I would consider non-standard. (Insofar as regular expressions can be standard.)

From UltraEdit's non-greedy tutorial (http://www.ultraedit.com/support/tutorials_power_tips/ultraedit/non-greedy-perl-regex.html):

By default, Perl regular expressions are "greedy", meaning they will match as much data as possible before a new line. Even if the conditions of the regular expression have been met, but a line break has not yet occurred, the regular expression will continue searching for data that satisfies the search criteria.
(emphasis mine)

UltraEdit checks after every line break whether it's found a match, and if so, it stops. Sublime is doing what I would consider "standard": continuing to check whether there is a match until it gets to the end of the file.

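It just so happens you want the non-greedy behaviour. Thus the solution is to make the pattern "non-greedy" by putting a ? after each quantifier, the * as well as the final + (with only the + changed, the (.+\n)* part is still greedy and can run on to the last closing tag):
<x(.+\n)*?.+?<\/x>
You should then get the same behaviour in both applications.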
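To see the difference on the A to O example from earlier in the thread, here is a small Python sketch. It uses Python's re module rather than Sublime or UltraEdit, so treat it as an illustration of ordinary greedy/lazy matching, not of either editor. It compares the expression as originally posted, a version with only the final + made lazy, and a version with both the * and the + made lazy. The variable names are made up for the example.

```python
import re

# The sample from the thread: three tagged blocks with plain lines in between.
text = (
    "<x>A\nB\nC</x>\n"
    "D\nE\nF\n"
    "<x>G\nH\nI</x>\n"
    "J\nK\nL\n"
    "<x>M\nN\nO</x>\n"
)

greedy = r"<x(.+\n)*.+<\/x>"        # as originally posted
lazy_plus = r"<x(.+\n)*.+?<\/x>"    # only the final + made lazy
lazy_both = r"<x(.+\n)*?.+?<\/x>"   # both the * and the + made lazy

after_greedy = re.sub(greedy, "", text)
after_lazy_plus = re.sub(lazy_plus, "", text)
after_lazy_both = re.sub(lazy_both, "", text)

print(repr(after_greedy))     # '\n'
print(repr(after_lazy_plus))  # '\n'
print(repr(after_lazy_both))  # '\nD\nE\nF\n\nJ\nK\nL\n\n'
```

In Python at least, only the last form leaves D to L alone. With just the final + made lazy, (.+\n)* still grabs as many whole lines as it can, so the match runs from the first opening tag to the last closing tag.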