
Web page (text only?) searchable archive




chesterville
Nov 1, 2008, 07:08 PM
Does anyone know if there is a program (or a way to develop a program other than a cumbersome Access database) to easily archive and search web pages - specifically the text? An RSS feed has been suggested, but will this allow me to archive and search the text for keywords?

My question probably isn't all that clear, so please allow me to elaborate: Basically, I'm a finance geek. I read 20+ finance blogs on a regular basis and several newspapers (online editions). I frequently wish I could type in a keyword for information that I want (say, "foreclosures" or "US GDP") and then see the text of the blog posts or newspaper articles on that subject that I found interesting. This is particularly tricky when I'm looking for specific information from a blog post or newspaper article but don't remember which one it came from. Also, my dream (twisted, I know) is to eventually build up an archive spanning years of information so that I can refer back to it in the future (something about not knowing history dooms you to repeat it). So if the database actually stored the text on my computer, or on some "cloud computer" where it doesn't depend on the blog continuing to exist, that would be nice too (I'm not looking to break copyright laws, though - it would be strictly for personal use). If you made it to this part of the thread, wow! Thanks for reading all of this.

Thanks in advance for any assistance.

Cheers,

Andrew



chilipie
Nov 1, 2008, 07:11 PM
Most RSS feed readers will archive the data on your computer, so you won't be relying on the sites to still be there. I think using feeds would be the most sensible way to do it, unless you want to manually copy and paste the relevant text from every new item.
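
For anyone who would rather roll their own than rely on a reader, here is a minimal sketch of the feed idea in Python, using only the standard library. It assumes plain RSS 2.0 feeds and a single SQLite file as the archive; the function names, table layout, and sample feed are all made up for illustration, and the keyword search is a simple substring match rather than a real full-text index. Keep in mind that many feeds only carry an excerpt of each post, so this stores whatever text the feed itself provides.

```python
import html
import re
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only to demonstrate the functions below.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example finance blog</title>
<item><title>Foreclosures rise again</title><link>http://example.com/a</link>
<pubDate>Sat, 01 Nov 2008 12:00:00 GMT</pubDate>
<description>&lt;p&gt;Filings rose in the third quarter.&lt;/p&gt;</description></item>
<item><title>Quarterly outlook</title><link>http://example.com/b</link>
<pubDate>Sat, 01 Nov 2008 13:00:00 GMT</pubDate>
<description>US GDP shrank last quarter.</description></item>
</channel></rss>"""


def open_archive(path=":memory:"):
    """Open (or create) the local archive; pass a file path to keep it on disk."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "link TEXT PRIMARY KEY, title TEXT, published TEXT, body TEXT)"
    )
    return db


def strip_markup(fragment):
    """Reduce an HTML fragment to plain text (crude, regex-based)."""
    text = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(html.unescape(text).split())


def archive_feed(db, rss_xml):
    """Store every <item> of an RSS 2.0 document; return how many were new."""
    added = 0
    for item in ET.fromstring(rss_xml).iter("item"):
        row = (
            item.findtext("link", default=""),
            item.findtext("title", default=""),
            item.findtext("pubDate", default=""),
            strip_markup(item.findtext("description", default="")),
        )
        # INSERT OR IGNORE skips posts already archived under the same link.
        cur = db.execute("INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?)", row)
        added += cur.rowcount
    db.commit()
    return added


def search(db, keyword):
    """Return (title, link) pairs whose title or body contains the keyword."""
    pattern = f"%{keyword}%"
    return db.execute(
        "SELECT title, link FROM posts "
        "WHERE title LIKE ? OR body LIKE ? ORDER BY link",
        (pattern, pattern),
    ).fetchall()


if __name__ == "__main__":
    db = open_archive()
    archive_feed(db, SAMPLE_FEED)
    print(search(db, "foreclosures"))
```

To use it on real blogs, you would download each feed's XML on a schedule (for example with `urllib.request.urlopen`) and pass the text to `archive_feed`; because posts are keyed on their link, re-running it over the same feed doesn't create duplicates.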

Nugget
Nov 1, 2008, 07:16 PM
You might want to take a look at BrowseBack (http://www.smileonmymac.com/BrowseBack/index.html), which is more or less designed to do what you describe (although it stores pages as PDF so that you retain formatting and images as well).

It's really slick, but can be a bit resource intensive.

angelwatt
Nov 1, 2008, 10:56 PM
Google Notebook (http://www.google.com/notebook/) comes to mind as a possibility. It's not the greatest, but may be something to look into. What you essentially want, though, is a knowledge base (http://en.wikipedia.org/wiki/Knowledge_base). It's something I like to do as well, but I haven't come up with a perfect solution. Here's another page to look at for some wiki-based solutions. (http://www.dotnetjunkies.com/WebLog/mlevison/archive/2004/11/13/32040.aspx)

chesterville
Nov 1, 2008, 11:47 PM
Thanks everyone for the responses. I'll be looking into all of your suggestions.