
View Full Version : Need help altering a .txt file....




IndianMojo
Mar 19, 2008, 05:40 PM
Hello all...

I have an issue that I hope someone may help me resolve. If this is not the correct forum for me to be posting to, please point me in the direction I need to go.

I am working for a company that produces telephone directories. We purchase listings from other companies and then reformat them to work for us in our book. The problem I am having is that one company has sent listings that are formatted using what appears to be a proprietary coding.

What I need help with is deleting the useless characters in a string while leaving the listing intact.

Here is an example of what the file looks like:

S01016100532473#######1020906080213 918
S1301610 R RES
S140161001ILN 0 LAST NAME, FIRST NAME 0
S140161002ILA 0ADDRESS 0

I have formatted in bold the information that I need to keep. All of the rest can disappear.

Is it even possible to write a script that would do what I need? And again, if I am in the wrong forum, please let me know where I need to go.

BTW, I am totally clueless about writing scripts or code, so please use layman's terms, if possible.

Thank you in advance......



toddburch
Mar 19, 2008, 05:45 PM
How much data are we talking about? A city the size of New York or a town the size of Luckenbach, Texas?

While you show an example, I'm going to assume there is more detail you are not telling us, like the useless information will vary, and the bold items might or might not be in fixed locations?

With all the details, I could write this reformatter, as I would also imagine this is worth something to your company to get done.

Feel free to PM me.

Todd

IndianMojo
Mar 19, 2008, 09:32 PM
Todd,

Thank you for your response. I will PM you with a specific sample, but am responding here to inform everyone who may read this about the issue.

There are roughly 2000 listings, and most are formatted exactly as I gave in the example. The info before and after the telephone number are a specific number of characters and do not vary. The same is true with the residential, name and address lines.

The only variance is when a business or individual has multiple numbers listed and this would look something like this:

S01005920532473#######910926080213 918
S1300592 R RES
S140059201ICAP 0LAST NAME, FIRST NAME 0
S140059202ILN 1(OLN) 0
S140059203ILA 1,ADDRESS 0
S140059204ITN 1###-###-#### 0
S140059205IAL 12ND LINE 0
S140059206ILA 1(OAD) 0
S140059207ITN 1###-###-#### 0

Again, needed info is in bold.

As you can see this is different than the previous example. These extra line listings are a rarity, however. I could extract these manually and be left with a uniform file to run a script on.

Thanks again in advance for any help you may provide.
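
The manual extraction step described above could also be scripted. Here is a minimal Python sketch, assuming every listing starts with an S01 line and that a "simple" listing is exactly four lines (S01, S13, and two S14 lines) as in the first example. The function name `split_records` is made up for illustration and is not part of any tool mentioned in this thread.

```python
def split_records(lines):
    """Group lines into listings and separate simple ones from the rest.

    Assumptions (not confirmed by the data vendor): a listing begins with
    an S01 line, and a simple listing has exactly four lines.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines between listings
        if line.startswith("S01"):
            records.append([line])      # start a new listing
        elif records:
            records[-1].append(line)    # continuation of the current listing
    simple = [r for r in records if len(r) == 4]
    irregular = [r for r in records if len(r) != 4]
    return simple, irregular


sample = [
    "S01016100532473#######1020906080213 918",
    "S1301610 R RES",
    "S140161001ILN 0 LAST NAME, FIRST NAME 0",
    "S140161002ILA 0ADDRESS 0",
    "",
    "S01005920532473#######910926080213 918",
    "S1300592 R RES",
    "S140059201ICAP 0LAST NAME, FIRST NAME 0",
    "S140059202ILN 1(OLN) 0",
    "S140059203ILA 1,ADDRESS 0",
    "S140059204ITN 1###-###-#### 0",
    "S140059205IAL 12ND LINE 0",
    "S140059206ILA 1(OAD) 0",
    "S140059207ITN 1###-###-#### 0",
]
simple, irregular = split_records(sample)
print(len(simple), "simple listing(s),", len(irregular), "to review by hand")
```

The irregular listings can then be written to a separate file and handled manually, leaving a uniform set for any later script.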

toddburch
Mar 19, 2008, 10:14 PM
Great. What's the output format need to look like?

Todd

needlnerdz
Mar 19, 2008, 11:05 PM
isn't this what interns are for?

Minkintosh
Mar 20, 2008, 12:07 AM
I would write a Perl script for this if I were you. Perl has great regular expression support. Pick up a book on Perl or just search the web for a Perl tutorial.

motulist
Mar 20, 2008, 12:15 AM
What I need help with is deleting the useless characters in a string while leaving the listing intact.
S01016100532473#######1020906080213 918
S1301610 R RES
S140161001ILN 0 LAST NAME, FIRST NAME 0
S140161002ILA 0ADDRESS 0

I have formatted in bold the information that I need to keep. All of the rest can disappear.


Should be possible with even a simple automator workflow as long as the formatting is consistently the way you wrote in your sample. However, for the phone number I don't see what delineates the phone number from the garbage numbers. Are the garbage numbers before or after the phone number always the same or something? Maybe they're always the same amount of garbage characters before the real phone number?

To make it easier to figure out a way to help you, why don't you post your previous sample followed by a new second sample so we can see what stays constant and what changes in each entry.

toddburch
Mar 20, 2008, 08:36 AM
I'm guessing that by "disappear", you are merely wanting the info shifted left, like this?


#######
R
LAST NAME, FIRST NAME
ADDRESS


Todd

Flynnstone
Mar 20, 2008, 08:48 AM
sed, awk ...
Perl
For 2000 listings ... Perl or might be able to just use a text editor with search and replace. Then tidy the file up for exceptions.

IndianMojo
Mar 20, 2008, 04:05 PM
Thank you everybody for your responses. I am going to answer some of the posts.

I can't use a find and replace, because the lines of code change for each entry.

To me...a perl is a small white, sometimes black, thing that costs lots of money and are made into necklaces and earrings....however, I am willing to learn, and will definitely look into buying a book on the subject.

The telephone number always starts at the same character position on the line starting with S01. The last three characters before the telephone number are always 473. The number of characters after the phone number is also constant.

And finally, yes this is what interns are for...unfortunately we are too small a company to have interns....so the select, delete, select, delete, tab, select, delete, tab, job would fall to me......(I feel like such an intern).

which reminds me....it would be helpful if this were tab delimited to fit into Excel easily.

Thanks, everyone.

Flynnstone
Mar 20, 2008, 04:21 PM
If you do this every so often, I recommend Perl (not the necklace).
I still think you should be able to do it with a text editor like TextWrangler.
You will need to learn and use "regular expressions".
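
To give a rough idea of what a regular expression for this data might look like, here is a small sketch in Python. It assumes the layout seen in the posted sample: "S01", nine characters, the constant "473", then a 7-character phone number. The names `PHONE_PATTERN` and `extract_phone` are made up for illustration.

```python
import re

# Assumed layout of an S01 line, based only on the sample in this thread:
# "S01" + 9 characters + "473" + the 7-character phone number.
PHONE_PATTERN = re.compile(r"^S01.{9}473(.{7})")


def extract_phone(line):
    """Return the 7 characters after '473', or None if the line does not fit."""
    match = PHONE_PATTERN.match(line)
    return match.group(1) if match else None


# The sample uses '#' in place of real digits.
print(extract_phone("S01016100532473#######1020906080213 918"))
```

Lines that do not start with S01, or that lack the 473 marker in the expected position, return None, so they can be skipped or flagged rather than silently mangled.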

motulist
Mar 20, 2008, 05:00 PM
The telephone number always starts at the same character position on the line starting with S01. The last three characters before the telephone number are always 473. The number of characters after the phone number is also constant.

An Automator workflow would work perfectly in that case. Create a blank Automator workflow and do the following:

add 'launch application: TextEdit'
record user's action that selects and deletes the garbage characters. For instance:
-hold down shift and hit the right arrow until all the garbage characters are selected
-hit delete
-hit the right arrow 7 times to get to the end of the phone number
-hold down shift and hit the goto end of line key command
-hit return to delete that garbage and go to the next line

etc.

That makes it sound more complex than it is. It'll take some tweaking to get it just right, but it won't be too hard.

EDIT:

No matter what solution you attempt, make sure you build your solution using a sample file that only has like 5 entries in it first before applying the finished solution to your actual entire big list.

amnorvend
Mar 21, 2008, 02:52 AM
There are a lot of options. Perl is a good language for writing scripts that do what you want. I personally prefer Python. It's a very simple language to learn.

This is a good tutorial (http://diveintopython.org/toc/index.html)

Also pay attention to the section on regular expressions: http://diveintopython.org/regular_expressions/index.html
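
As a rough illustration only, a Python script for the simple four-line listings might look like the sketch below. It is built on guesses from the samples posted in this thread: the phone number is the 7 characters after "473" on the S01 line, the S13 line carries the listing type as its second field, and S14 lines carry a letter code (ILN for the name, ILA for the address) followed by a one-digit flag, the text, and a trailing one-digit field. The function name `convert` is made up, and multi-number listings are assumed to have been pulled out beforehand.

```python
import re

# Assumed S14 layout: "S14", 7 digits (record id + sequence number),
# a letter code such as ILN or ILA, spaces, a one-digit flag, the text,
# then a trailing one-digit field.
S14_PATTERN = re.compile(r"^S14\d{7}([A-Z]+)\s+\d(.*?)\s+\d\s*$")


def convert(lines):
    """Return one tab-delimited row per listing: phone, type, name, address."""
    rows = []
    record = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("S01"):
            # New listing: the phone number sits right after "473".
            record = {"phone": line[15:22], "type": "", "name": "", "address": ""}
            rows.append(record)
        elif record is None:
            continue  # ignore anything before the first S01 line
        elif line.startswith("S13"):
            fields = line.split()
            if len(fields) > 1:
                record["type"] = fields[1]
        elif line.startswith("S14"):
            match = S14_PATTERN.match(line)
            if match:
                code, text = match.group(1), match.group(2).strip()
                if code == "ILN":
                    record["name"] = text
                elif code == "ILA":
                    record["address"] = text
    return ["\t".join([r["phone"], r["type"], r["name"], r["address"]]) for r in rows]


sample = [
    "S01016100532473#######1020906080213 918",
    "S1301610 R RES",
    "S140161001ILN 0 LAST NAME, FIRST NAME 0",
    "S140161002ILA 0ADDRESS 0",
]
for row in convert(sample):
    print(row)
```

To run it on a real file, the lines could be read with `open()` and the rows written to a new text file that Excel opens as tab-delimited. The real file may use different spacing than what the forum shows, so it is worth checking the result on a handful of listings first.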

motulist
Mar 21, 2008, 02:57 AM
Learning a whole scripting language just to do this one particular pretty simple task seems like a very inefficient solution when it could be done much quicker and more easily using automator.