
Andrew936

macrumors newbie
Original poster
Mar 6, 2011
Hello MacRumors,
I have a large number (several hundred to several thousand) of downloaded HTML pages with images and links that I need to convert to plain text files (no images, links, or formatting at all). I've been using both Automator and a batch file-rename application to change the file extensions to .txt, but the end product still isn't quite what I want.

What I end up with is a .txt file that still has formatting, links, etc. in it. When I look at the file info, it says it's a "plain text document", but the "Make Plain Text" action in TextEdit (Format > Make Plain Text) is still available. When I run that action, I get exactly what I want: a bare-bones, text-only document. I just need to do that a few thousand more times. How can I do that?

Is this an issue of text encoding? I should also say that my end product also has to be UTF-8, according to the documentation for another application that these text files are ultimately going to be put into.

So in short, I need a way to perform the TextEdit action "Make Plain Text" en masse. I think there's a way to do it with the terminal, but I am fairly clueless as to how bash/unix commands work, so please hold my hand with any instructions involving that sort of thing. I'm running Tiger on a 2007 MacBook Pro (I'm not sure if you needed to know that, but I figured it couldn't hurt).

Thank you so much for any help!
 
You can install "lynx" through MacPorts (see http://www.macports.org for more info).

Here is a generic example:

Code:
$ lynx -dump "https://forums.macrumors.com" > plaintext.txt

This also works with local HTML files and can be automated with a short shell script.

For example, "cd" into the directory that contains the HTML files and run:

Code:
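#!/bin/bash
# Convert every .html file in the current directory to plain text.
# Globbing directly (instead of parsing "ls") and quoting the variables
# keeps file names that contain spaces intact.
for i in *.html; do
    lynx -dump "$PWD/$i" > "plaintext_$i.txt"
done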

This was the UNIX/Open Source way.
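
In case the placeholders in the loop above are unclear, here is the same idea as a commented sketch. The names convert_all and CONVERTER are made up for this example and are not part of lynx. It assumes the pages end in .html, and it writes page.txt next to page.html rather than using the plaintext_ prefix from the script above.

```shell
#!/bin/bash
# CONVERTER is a hypothetical variable for this sketch: the command that turns
# one HTML file into plain text on standard output. It defaults to the
# "lynx -dump" command used earlier in this post.
CONVERTER="${CONVERTER:-lynx -dump}"

# convert_all DIR : convert every .html file in DIR (default: current directory)
convert_all() {
    local dir="${1:-.}"
    local f out
    for f in "$dir"/*.html; do      # "f" takes each file name in turn
        [ -e "$f" ] || continue     # nothing matched the pattern: skip
        out="${f%.html}.txt"        # strip ".html", append ".txt"
        # $CONVERTER is deliberately unquoted so that "lynx -dump" is split
        # into the command and its option.
        $CONVERTER "$f" > "$out"
    done
}
```

Usage would be something like saving this to a file, loading it with "source", and then running convert_all /path/to/your/html/folder. If lynx will not run on your machine, any other command that prints plain text for one HTML file could be swapped in through CONVERTER. For instance, if your OS X version ships the textutil tool, CONVERTER="textutil -convert txt -stdout" may be worth a try, but I have not tested that on Tiger.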
 
Thank you for your response.

Unfortunately, I'm having some trouble with lynx and MacPorts. I'm pretty bad at figuring out open-source software, and I keep getting "bus errors" in the terminal when I try to run or install lynx. I also have to be honest and say that I don't understand what lynx does in any case. Can't you just use the terminal to give commands?

So, is it possible to give a more detailed/dumbed-down walkthrough? Or does someone know of another application that can automate the "Make Plain Text" operation?

Thank you again for any help!
 
I figured I'd give this one last bump. Does anyone have any ideas?

Any help would be truly appreciated! Either a new suggestion, or help getting lynx to work and an explanation of what the placeholders in the above code stand for. In any case, this is the message I get when I start up lynx:

Code:
Last login: Thu Mar 10 00:14:36 on ttyp1
/Applications/Lynx/lynx; exit
Welcome to Darwin!
ip-90-142:~ [username]$ /Applications/Lynx/lynx; exit
Bus error
logout
[Process completed]

Thanks!
 