# Text Parser: Text to MySQL Database

Discussion in 'Mac Programming' started by barr08, Jul 26, 2007.

1. ### barr08 macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#1
Oh hey,

I need to parse text from a text file in this form:

06/07/07 02:44:29.966 INFO Servlet.Engine.Transports : 0 accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2

To be split accordingly:

06/07/07 / 02:44:29.966 / INFO /Servlet.Engine.Transports : 0 / accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2

Into a 5 column HTML table with the fields Date, Time, Level, ID, and Info. These columns are seperated above with the bold shashies.

Any ideas? PERL isn't neccesary, it is just what I was thinking of using. Even a start would be apprectiated, you don't need to give me a full working script or anything.

Thanks for looking!

2. ### ATG macrumors regular

Joined:
Aug 7, 2005
#2
Use regexes. AGRegex is the best I've found (and it has my initials )

EDIT: Oh, sorry, I didn't ask which language you are using :\

3. ### barr08 thread starter macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#3
I am trying to use PERL, but really at this point I would try anything. After I do this, my next step is to search these fields, so regexes will deffinitely be useful.

4. ### stndn macrumors member

Joined:
Oct 22, 2006
Location:
earth
#4
Here's something to get you started:

Code:
#!/usr/bin/perl -w

my $sampleText = "06/07/07 02:44:29.966 INFO Servlet.Engine.Transports : 0 accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2";$sampleText =~ m|^([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+(.+?)(accept.*)$|i; print "$1 / $2 /$3 / $4 /$5\n";

Not the prettiest sight and most likely unefficient, but at least it should give you some ideas.

-stndn.

5. ### barr08 thread starter macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#5
Thanks for the sample. I should probably admit I am a complete PERL noob, so how would I go by making this work. Do I just make a text file and ad ".pl" to the end of it? What program would I use to run the script?

If PERL isn't the best choice for this task, could someone reccomend a better language to do this in?

6. ### iBert macrumors regular

Joined:
Jul 14, 2004
#6
Perl is a pretty good scripting language specially for parsing text. At least in my opinion.

When are you going to need the parser to be called? Is it like a web service (just to say something) or are you manipulating something in C, C++, Java, etc and need to call this parser?

7. ### stndn macrumors member

Joined:
Oct 22, 2006
Location:
earth
#7
I thought you wanted to have something in Perl, and then I realized you mentioned that Perl is not necessary in your initial post ,p

iBert has the right questions, which I'm going to add to: Is this text parsing a one-time deal? Or are you going to run the script every time you want to show the output in other format?

If it's a one time deal, then you can take the Perl script as your base, make some changes, save it as filename.pl, and run the script through perl parser (one of which can be done from command line / terminal by typing: perl filename.pl).

If it's an ongoing parsing (done per request), maybe change it to something like PHP or whatnot. Not that Perl can't be used for web, but PHP can be mixed with HTML more easily / intuitively than Perl, which seems like what you're looking for (add HTML to the output).

-stndn.

8. ### barr08 thread starter macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#8
This will be ongoing, per user request, so maybe PHP would be better? I am thinking of parsing the data to a MySQl database, though I am not sure right now if my company supports it.

So should I ditch PERL and go with PHP? I do know that PHP and MySQL work well together (as far as I can remember)

9. ### pengu macrumors 6502a

Joined:
Mar 20, 2005
Location:
#9
PHP won't have an issue doing it (neither will Perl) but as stated, PHP is more aimed at mixing with/generating HTML.

You didn't really specify what parts (other than date/time) are likely to change? it APPEARS to be some kind of serverlog, relating to maybe Tomcat or BES or some other JSP/J2EE related server?

If you could give some examples of what other values you might get, we can probably help with a more accurate/specific/efficient method.

10. ### barr08 thread starter macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#10
Here is a chunk of a log file:

Code:
06/07/07 02:44:27.585 INFO Servlet.Engine.Transports : 0 Type: TEST
06/07/07 02:44:27.586 INFO Servlet.Engine.Transports : 0 Site User: QA_BPORTAL
06/07/07 02:44:27.586 INFO Servlet.Engine.Transports : 0 Product:ProductId: Broker Portal Name: Broker Portal
06/07/07 02:44:27.588 INFO Servlet.Engine.Transports : 0 Retrieving user ent list
06/07/07 02:44:27.619 INFO Servlet.Engine.Transports : 0 Retrieve ent names for user: 3771000016, contract: qa_bportal
06/07/07 02:44:28.064 INFO Servlet.Engine.Transports : 0 ibg.menu.db.UserEntListDBBuilder statistics:  proc = pk_portal.fn_get_ent_nm_by_usr_ent_typ( ?,?,?,?,? ) connMillis = 7 prepareMillis = 16 executeMillis = 75 processRowsMillis = 117 postProcessMillis = 0 subClassLogInfo =
06/07/07 02:44:28.067 INFO Servlet.Engine.Transports : 0 Linking site attributes
06/07/07 02:44:28.075 INFO Servlet.Engine.Transports : 0 Storing ent set for BROK back to user
06/07/07 02:44:28.832 INFO Servlet.Engine.Transports : 0 ibg.db.user.UserProductDBBuilder statistics:  proc = ESDBO.PK_PORTAL.FN_GET_USER_PRODUCT_INFO( ?,?,?,? ) connMillis = 7 prepareMillis = 1 executeMillis = 5 processRowsMillis = 1 postProcessMillis = 0 subClassLogInfo =
06/07/07 02:44:28.887 INFO Servlet.Engine.Transports : 0 ibg.db.user.UserCompCheckDBBuilder statistics:  proc = ESDBO.PK_ADMIN_USER.FN_CHECK_PARENT_COMPANY(?,?,?) connMillis = 4 prepareMillis = 1 executeMillis = 14 processRowsMillis = 0 postProcessMillis = 33 subClassLogInfo =
06/07/07 02:44:28.902 INFO Servlet.Engine.Transports : 0 ibg.db.user.UserCompCheckDBBuilder statistics:  proc = ESDBO.PK_ADMIN_USER.FN_CHECK_PARENT_COMPANY(?,?,?) connMillis = 6 prepareMillis = 1 executeMillis = 4 processRowsMillis = 0 postProcessMillis = 1 subClassLogInfo =
06/07/07 02:44:28.910 INFO Servlet.Engine.Transports : 0 POST /XML_IBGZP_A HTTP/1.1
06/07/07 02:44:28.911 INFO Servlet.Engine.Transports : 0 content-type: application/xml
06/07/07 02:44:28.911 INFO Servlet.Engine.Transports : 0 connection: keep-alive
06/07/07 02:44:28.912 INFO Servlet.Engine.Transports : 0 accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
06/07/07 02:44:28.912 INFO Servlet.Engine.Transports : 0 host: Unknown
06/07/07 02:44:28.913 INFO Servlet.Engine.Transports : 0 content-length: 302
06/07/07 02:44:28.913 INFO Servlet.Engine.Transports : 0 user-agent: HttpConnection
06/07/07 02:44:28.914 INFO Servlet.Engine.Transports : 0
06/07/07 02:44:28.915 INFO Servlet.Engine.Transports : 0 <?xml version="1.0" ?>
06/07/07 02:44:28.915 INFO Servlet.Engine.Transports : 0 <FBIBLV_REQUEST>
06/07/07 02:44:28.916 INFO Servlet.Engine.Transports : 0     <PSL_VIEW>FBIBLV.VIEW</PSL_VIEW>
06/07/07 02:44:28.917 INFO Servlet.Engine.Transports : 0     <FBIBLVI>
06/07/07 02:44:28.917 INFO Servlet.Engine.Transports : 0         <PRODUCT_ID>Z</PRODUCT_ID>
06/07/07 02:44:28.918 INFO Servlet.Engine.Transports : 0         <client_group_id>xxxxxxxxxx</client_group_id>
06/07/07 02:44:28.918 INFO Servlet.Engine.Transports : 0         <portal_id>xxxxxxxxxx</portal_id>
06/07/07 02:44:28.919 INFO Servlet.Engine.Transports : 0         <incl_rules_table>N</incl_rules_table>
06/07/07 02:44:28.920 INFO Servlet.Engine.Transports : 0     </FBIBLVI>
06/07/07 02:44:28.920 INFO Servlet.Engine.Transports : 0 </FBIBLV_REQUEST>

Basically this program I am writing is going to be a search. Each day, this log file will be parsed somewhere to a database, so a user can log on to a site, search via a field (such as date, time, etc.) and get a list of results. The database does not need to be viewable to the user, only their search results.

At this point, I am thinking of parsing with PHP, making a MySQL database to hold the data, and searching the database with PHP.

Any ideas, help, guidance, whatever is greatly appreciated.

Edit: If PHP and MySQL end up being the languages I use, I would need a testing server, right? I can't just test these scripts on my computer?

11. ### bobber205 macrumors 68020

Joined:
Nov 15, 2005
Location:
Oregon
#11
MySql and PHP can be installed locally. You can reach your server by going to locahost:80

I just checked my apps folder and I can't seem to find the app that does this server. Do a quick "PHP and MYSQL server for OS X" search on google.

EDIT: I tried that search and I can't find it. Damn.

12. ### barr08 thread starter macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#12
I am actually stuck at work on this XP box, which might make looking for something like that easier. I'll poke around google for something.
Does anyone know how I could do this on a windows machine? I am coming up short here.

13. ### macfaninpdx macrumors regular

Joined:
Mar 6, 2007
#13
You could either install them separately, by going to their respective sites (mySQL and PHP), or you could use packaged solution such as XAMPP, which will install Apache, MySQL, PHP & PEAR, and PERL along with several other useful utilities.

The advantage to installing them individually is that you only install the pieces you need. The advantage to using something like XAMPP is that it gives you administration utilities as well.

14. ### barr08 thread starter macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#14
Awesome, thanks. That program is exactly what I needed.

Now back to the original problem...

15. ### macfaninpdx macrumors regular

Joined:
Mar 6, 2007
#15
If you use PHP for you original question, you could try something like the following:
PHP:
 $logfile = file("\\path\\to\\your\\file"); foreach ($logfile AS $linenum =>$line) {     $logdate = substr ($line, 0, 8);     $logtime = substr ($line, 9, 12);     $level = substr ($line, 22, 4);     $id = substr ($line, 27, 29);     $info = rtrim (substr ($line, 57) ); } 
Doing it this way (with substr) means that the date, time, level and id must be the same exact length every time, or it will break the line in the wrong spot. So a RegEx expression would definitely be more robust, but since I do not know regex expressions very well , this is another solution.

16. ### pengu macrumors 6502a

Joined:
Mar 20, 2005
Location:
#16
firstly. i suggest you install apache2, php5 and mysql5 seperately. there are very easy to follow tutorials, and you aren't likely to need to mess with them much once its set up.

as for the log. i'll have a look @ it tomorrow while im at work (its 1.30am here and wayyy too late to be thinking technically) and see if i can offer a basic solution for extracting/splitting the string values.

17. ### barr08 thread starter macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#17
I would do this, but getting programs installed on these work computers is a huge hassle. The all in one solution would be much easier and less time consuming to get installed on this box.

18. ### macfaninpdx macrumors regular

Joined:
Mar 6, 2007
#18
I just thought of one more solution to splitting the log file entries, although it still may not be as efficient as a properly written regex:
PHP:
 $logfile = file("\\path\\to\\your\\file"); foreach ($logfile AS $linenum =>$line) {      //explode line using space delimiter     list ($logdate,$logtime, $level,$id1, $id2,$id3, $info) = explode (" ", rtrim($line ), 7);     //combine id variables into one     $id = implode (" ", array ($id1, $id2,$id3) ); } 
Since the explode function is using a space character to split the line, the ID field was split into three parts. I used the implode line to combine them again. Also, since there is a limit of 7 on the explode function, the remaining info in the INFO field will all be kept together (not separated).

19. ### barr08 thread starter macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#19
Thanks for these! I think I am going to try to use the second as a starting point for now. Unfortunately I can't test anything here at work because they have yet to supply me with a server or let me install any programs, and I can't access my school's testing servers because work blocks them.

20. ### stndn macrumors member

Joined:
Oct 22, 2006
Location:
earth
#20
I've just seen the sample log files you provided, and it's safe to say that the example code with regex won't work because I used the keyword 'accept' as one of the separators towards the end.

The key with text parsing is identifying consistent section(s) of the texts you are trying to parse.

From your example log, it's safe to assume that the part that will be consistent is the timestamp. It's not clear yet (to us) whether the texts 'INFO Servlet.Engine.Transports : 0' will always be there as well. If yes, then it can be used as separator if you are going through the regex way.

If you are using the substr () or explode () way per what's given by macfaninpdx, then you need to make sure that the texts and spacings are consistent among different parts of the log.

Otherwise, you might consider mixing regex and substr/explode as well.

-stndn.

21. ### barr08 thread starter macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#21
I don't know anything about regex, can you give an example?

22. ### krunk macrumors regular

Joined:
Jan 29, 2004
#22
Regex is extremely useful, but if your learning can be very time consuming ironing out the special cases and honing your regex.

If your doing this to learn how to process files, regex is a great thing to pick up toward that end. If your doing it out of some need and getting it done, done right, and done quickly is your primary concern. There are other methods with a lower learning curve...like below.

========================Ruby Code================
file = File.new('/path/to/file')
html = Fie.new('/path/to/html')

html.puts "<table>"
html.puts "<tr>"
html.puts "<th>DATE</th>"
html.puts "<th>TIME</th>"
html.puts "<th>LEVEL</th>"
html.puts "<th>ID</th>"
html.puts "<th>INFO</th>"
html.puts "</tr>"

l = line.split # l => ["06/07/07", "02:44:29.966", "INFO", "Servlet.Engine.Transports", ":", "0", "accept:", "text/html,", "image/gif,", "image/jpeg,", "*;", "q=.2,", "*/*; q=.2"]

date = l[0]
time = l[1]
level = l[2]
id = l[3] << l[3] << l[4]
info = l[4..l.length]

html.puts "<tr>"
html.puts "<td>#{date}</td>"
html.puts "<td>#{time}</td>"
html.puts "<td>#{level}</td>"
html.puts "<td>#{id}</td>"
html.puts "<td>#{info}</td>"
html.puts "</tr>"
end

html.puts "</table>"
html.close

==============================

That's ruby, the # beside the line.split is just a comment showing contents of the returned array. By default, split() separates a line by spaces into a new array.

23. ### barr08 thread starter macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#23
I am unclear on this also. Let me look into it and get back to you.

24. ### stndn macrumors member

Joined:
Oct 22, 2006
Location:
earth
#24
The first thing to note before using regex for anything is to make sure you can make a pattern out of what you are trying to process. Without a fixed pattern (regardless of how simple or complex it is), regex won't work. Actually, this is true for any form of (mostly text) parsing.

It's not an easy learning experience, but it's worth it if you have time. Otherwise, start with simple (temporary) solution using explode() or other methods.

-stndn.

25. ### barr08 thread starter macrumors 65816

Joined:
Aug 9, 2006
Location:
Boston, MA
#25
OK I looked into this further, the parts with a constant size include the date, time, and the "Servlet.Engine.Transports : ". The number after that can be anywhere from one to three digits. Every log file will have those three things being the same number of digits. Does that help at all?

I put another chunk for you to take a look at for easy reference.

Code:
06/07/07 02:44:27.585 INFO Servlet.Engine.Transports : 0 Type: TEST
06/07/07 02:44:27.586 INFO Servlet.Engine.Transports : 0 Site User: QA_BPORTAL
06/07/07 02:44:27.586 INFO Servlet.Engine.Transports : 0 Product:ProductId: Broker Portal Name: Broker Portal
06/07/07 02:44:27.588 INFO Servlet.Engine.Transports : 0 Retrieving user ent list
06/07/07 02:44:27.619 INFO Servlet.Engine.Transports : 0 Retrieve ent names for user: 3771000016, contract: qa_bportal
The red parts are of constant size, the size of the part in bold blue ranges largely, but is always the end of the string:

06/07/07 02:44:27.619 INFO Servlet.Engine.Transports : 0 Retrieve ent names for user: 3771000016, contract: qa_bportal