Batch conversion to plain text? (modest reward offered for help!)

Discussion in 'Mac Basics and Help' started by Andrew936, Mar 6, 2011.

  1. Andrew936 macrumors newbie

    Joined:
    Mar 6, 2011
    #1
    Hello MacRumors,
    I have a large number (several hundred to several thousand) of downloaded HTML pages with images and links that I need to convert to plain text files (no images, links, or formatting at all). I've been using both Automator and a batch file-renaming application to change the file extensions to .txt, but the end product still isn't quite what I want.

    What I end up with is a .txt file that still has formatting, links, etc. in it. When I look at the file info, it says it's a "plain text document", but the "Make Plain Text" action in TextEdit (Format > Make Plain Text) is still available. When I use it, I get exactly what I want: a bare-bones, text-only document. I just need to do that a few thousand more times. How can I do that?

    Is this an issue of text encoding? I should also say that my end product also has to be UTF-8, according to the documentation for another application that these text files are ultimately going to be put into.

    So in short, I need a way to perform the TextEdit action "Make Plain Text" en masse. I think there's a way to do it with the Terminal, but I am fairly clueless as to how bash/Unix commands work, so please hold my hand with any instructions involving that sort of thing. I'm running Tiger on a 2007 MacBook Pro (I'm not sure if you needed to know that, but I figured it couldn't hurt).

    Thank you so much for any help!
     
  2. Quotenfrau macrumors 6502

    Joined:
    Mar 6, 2011
    #2
    You can install "lynx" through MacPorts (see http://www.macports.org for more info).

    Here is a generic example:

    Code:
    $ lynx -dump "http://forums.macrumors.com" > plaintext.txt
    This also works with local HTML files and can be automated with a short shell script.

    For example, "cd" into the directory with the HTML files and run:

    Code:
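    #!/bin/bash
    # Loop over the glob directly so file names containing spaces are not split.
    for i in *.html; do
        # Dump each page as text; the quotes keep the path and output name intact.
        lynx -dump "$PWD/$i" > "plaintext_$i.txt"
    done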
    
    This was the UNIX/Open Source way.
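
    A slightly more defensive version of the loop above might look like the sketch below. It assumes lynx is installed and working; the function name html_to_text is made up for illustration. The -nolist option should drop the numbered list of link URLs that -dump normally appends, and -display_charset=UTF-8 asks for UTF-8 output, but check "man lynx" on your system before relying on either.

```shell
#!/bin/bash
# Sketch only: html_to_text is a hypothetical helper, not part of lynx.
# Converts every .html file in a directory to a .txt file with the same base name.
html_to_text() {
    local dir="${1:-.}"    # directory to process; defaults to the current one
    local f
    for f in "$dir"/*.html; do
        # If no .html files exist, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        # -dump renders the page as text, -nolist omits the link reference list,
        # -display_charset=UTF-8 requests UTF-8 output.
        lynx -dump -nolist -display_charset=UTF-8 "$f" > "${f%.html}.txt"
    done
}
```

    After pasting the function into a Terminal window, something like html_to_text ~/Desktop/pages (a placeholder path) would write page.txt next to each page.html and leave the originals untouched.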
     
  3. Andrew936 thread starter macrumors newbie

    Joined:
    Mar 6, 2011
    #3
    Thank you for your response.

    Unfortunately, I'm having some trouble with Lynx and MacPorts. I'm pretty bad at figuring out open-source software, and I keep getting "bus errors" in the Terminal when I try to run/install lynx. I also have to be honest and say that I don't understand what lynx does, in any case. Can't you just use the Terminal to give commands?

    So, is it possible to give a more detailed/dumbed-down walkthrough? Or does someone know of another application that can automate the "make plain text" operation?

    Thank you again for any help!
     
  4. Andrew936 thread starter macrumors newbie

    Joined:
    Mar 6, 2011
    #4
    I figured I'd give this one last bump. Does anyone have any ideas?

    Any help would be truly appreciated! Either a new suggestion, or help getting lynx to work/explaining what the placeholders in the above code stand for. This is the message I get when I start up lynx, in any case:

    Code:
    Last login: Thu Mar 10 00:14:36 on ttyp1
    /Applications/Lynx/lynx; exit
    Welcome to Darwin!
    ip-90-142:~ [username]$ /Applications/Lynx/lynx; exit
    Bus error
    logout
    [Process completed]
    Thanks!
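
    Since lynx dies with a bus error here, one possible alternative is the textutil command that comes with Mac OS X, which performs document conversions similar to what TextEdit does. The sketch below assumes textutil is present on this system (as far as I know it has shipped since 10.4, but confirm with "man textutil"); the function name make_plain_text is made up for illustration, and the result may not match TextEdit's "Make Plain Text" output exactly.

```shell
#!/bin/bash
# Sketch only: make_plain_text is a hypothetical helper, not an Apple tool.
# Converts every .html file in a directory to a UTF-8 plain text file.
make_plain_text() {
    local dir="${1:-.}"    # directory to process; defaults to the current one
    local f
    for f in "$dir"/*.html; do
        # If no .html files exist, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        # -convert txt selects plain text output, -encoding UTF-8 sets its encoding,
        # -output names the result so it sits next to the original page.
        textutil -convert txt -encoding UTF-8 -output "${f%.html}.txt" "$f"
    done
}
```

    To try it, paste the function into a Terminal window and then run something like make_plain_text ~/Desktop/pages (a placeholder path). Testing on a copy of a few files first seems wise.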
     
