
View Full Version : reading strings from a file.




farmerdoug
Apr 13, 2011, 10:11 AM
simple file

UT_DATE 2011_04_13 s UT date at start of night
TSCOPE 200"_HALE s 200" Hale Telescope at Mt Palomar

simple code




#include <stdio.h>
#include <stdlib.h>
#include "/Users/p1640/1640/C/FITS/include/fitsio.h"

int main (int argc, const char * argv[]) // argv[1] = fits file. argv[2] = header file
{
    fitsfile *fptr;
    FILE *header;
    char *line;
    int status;

    line = (char *)calloc(60, sizeof(char));
    if ((header = fopen(argv[2],"r")) == NULL)
        printf("couldn't open header file\n");
    //if (fits_open_file(&fptr, argv[1], READWRITE, &status)) {
    //    printf("load_simple_fits_float_data: fits_open_file: ", status);
    //    return (1);
    //}

    while (fgets(line,60,header))
    //while (fscanf(header,"%s",line) != EOF)
    {
        printf("%s\n",line);
        //fscanf(header,"%s",line);
        //printf("%s",line);
        //fscanf(header,"%s",line);
        //printf("%s",line);
        //fgets(line,60,header);
        //printf("%s\n",line);
    }

    fclose(header);
    return 0;
}


not so simple results

When I run the code using the commented out fscanf, I get a list of all the strings in the file but when I run the code using fgets I get


TSCOPE 200"___04_13 s UT date at start of night
HALE s 200" Hale Telescope at Mt Palomar



chown33
Apr 13, 2011, 10:59 AM
Describe what you expected to happen.
Post an example of the output you want.

I can't tell from anything you posted, what it is you want or expect.


Refer to the man pages for fgets() and fscanf().
fgets() stops at a newline, or when its max-count is exhausted.
fscanf() with %s does not do that: %s stops at whitespace.

Since whitespace is a larger class of characters than newline (e.g. whitespace includes spaces and tabs), I would expect %s to return "words" as delimited by whitespace, while fgets() will return "lines" as delimited by newlines. If you have some other expectation, please explain what it is.


Also, fgets() respects a max count, while fscanf() will not (at least not as coded). This means a string of 60 or more characters will overflow the buffer with fscanf(), but not with fgets().
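To make the difference concrete, here is a minimal sketch, not taken from the thread. The helper names count_words and count_fgets_reads are hypothetical. The first reads whitespace-delimited words with a width-limited %59s so the 60-byte buffer cannot overflow; the second counts how many fgets() calls it takes to consume the same stream.

```c
#include <stdio.h>

/* Hypothetical helper: counts whitespace-delimited words.
   The width in "%59s" limits each token to 59 characters plus the '\0',
   so the 60-byte buffer cannot overflow. */
int count_words(FILE *fp)
{
    char word[60];
    int n = 0;

    while (fscanf(fp, "%59s", word) == 1)
        n++;
    return n;
}

/* Hypothetical helper: counts successful fgets() calls.
   Each call stops after a '\n', at end of file, or after 59 characters. */
int count_fgets_reads(FILE *fp)
{
    char line[60];
    int n = 0;

    while (fgets(line, sizeof line, fp) != NULL)
        n++;
    return n;
}
```

On an LF-terminated copy of the two-line sample file, the first helper should see one item per word and the second one item per line, provided each line fits in the buffer.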

farmerdoug
Apr 13, 2011, 11:01 AM
I need the output to mirror the input one line at a time for further parsing.

chown33
Apr 13, 2011, 11:05 AM
I need the output to mirror the input one line at a time for further parsing.

Then use fgets().

Be sure to check the buffer for a newline in the last position. If it's not a newline, you exhausted the count, i.e. your buffer length was less than the line length.
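The newline check described above could look roughly like this. The function name got_whole_line is made up for illustration; it only inspects the character just before the terminating '\0'.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if buf (as filled by fgets) ends in '\n',
   meaning a whole line was read, and 0 otherwise. */
int got_whole_line(const char *buf)
{
    size_t len = strlen(buf);

    return len > 0 && buf[len - 1] == '\n';
}
```

A result of 0 means either that the line was longer than the buffer or that the file's last line has no trailing newline.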

farmerdoug
Apr 13, 2011, 11:16 AM
The file displays correctly in TextEdit; doesn't that imply the existence of newline characters?

Bill McEnaney
Apr 13, 2011, 11:36 AM
Then use fgets().

Be sure to check the buffer for a newline in the last position. If it's not a newline, you exhausted the count, i.e. your buffer length was less than the line length.
Keep in mind that the strrchr function treats the terminating '\0' as part of the string when it searches for the last instance of the character you give it.

farmerdoug
Apr 13, 2011, 11:42 AM
You are suggesting that I use strrchr to check what the last character is?

subsonix
Apr 13, 2011, 11:47 AM
Use something more generous than 60 characters, is my suggestion. I usually go for BUFSIZ, an implementation-defined constant from <stdio.h> (1024 on Mac OS X). You also print a '\n', but if fgets reads a line that fits in the buffer, the line's own newline will already be part of the string. That might mess up your output.

chown33
Apr 13, 2011, 11:48 AM
The file displays correctly in TextEdit; doesn't that imply the existence of newline characters?

A. Not necessarily.
B. Apropos of what?


A. TextEdit.app will display lines that are terminated with CRs alone. It will also display CR-LF terminated lines. fgets() doesn't necessarily recognize a CR as a line-ending. It does recognize LF (i.e. the classic Unix newline character).

If you don't know how your lines are terminated, you need to look at the binary data, not the text interpretation that TextEdit.app shows you. There can be several possible interpretations for some given data, and if TextEdit is set to automatically choose one, then what it shows you may not be the exact same as what's in the file.

Google Hex Fiend and download it. Use it to tell you what's in your file. Or read the man page for the hexdump command.

Even if TextEdit shows lines correctly, and lines are terminated by newlines, this is no guarantee that every line is less than some arbitrary number like 60. In short, if you don't sanitize your input data, your parser might misinterpret the data.


B. What is the relevance of this question to your previous posts? You hadn't previously mentioned a problem with detecting line-endings. In fact, you haven't really described what the problem is at all. Basically all you've said is that using fscanf() with %s doesn't produce the same output as fgets(), to which I have basically answered "No, they stop on different things, so the output won't be the same".

So please take a little time and describe exactly what you're trying to accomplish, post the code you expect to accomplish this with, then describe what the code produces that fails to meet your expectation.
1. Post your code and your actual data.
2. Describe what you expected to happen.
3. Describe what actually happened.

Post a zip file containing the actual data. If it contains CRs or CRLFs, then pasting it into a post will translate line endings. We need to see the actual data being read and parsed.
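As a programmatic alternative to a hex editor, the raw bytes can be tallied by terminator style. This is only a sketch: the struct eol_counts and the function count_line_endings are hypothetical names, and the data is assumed to have been read with fopen(path, "rb") and fread() so that nothing is translated on the way in.

```c
#include <stddef.h>

/* Hypothetical struct: tallies of each line-terminator style. */
struct eol_counts {
    size_t lf;    /* bare LF  ("\n")   */
    size_t crlf;  /* CR-LF    ("\r\n") */
    size_t cr;    /* bare CR  ("\r")   */
};

/* Hypothetical helper: scans n raw bytes and counts the terminators. */
struct eol_counts count_line_endings(const unsigned char *data, size_t n)
{
    struct eol_counts c = {0, 0, 0};
    size_t i;

    for (i = 0; i < n; i++) {
        if (data[i] == '\r') {
            if (i + 1 < n && data[i + 1] == '\n') {
                c.crlf++;
                i++;            /* the LF belongs to this CR-LF pair */
            } else {
                c.cr++;
            }
        } else if (data[i] == '\n') {
            c.lf++;
        }
    }
    return c;
}
```

If the cr count is nonzero while lf and crlf are both zero, fgets() will not see any line breaks in the file at all.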

balamw
Apr 13, 2011, 11:49 AM
Use something more generous than 60 characters, is my suggestion. I usually go for BUFSIZ, an implementation-defined constant from <stdio.h> (1024 on Mac OS X).

This.

Plus, if you are using fgets, you should pair it with fputs instead of printf("%s\n") to avoid the same kind of termination issues chown33 is referring to. (puts would append a newline of its own, just like that printf.)

As it stands your code is a poor man's clone of "cat" that will only work properly if each line of the input file is guaranteed to fit in the 60-byte buffer (59 characters, newline included).

B

Bill McEnaney
Apr 13, 2011, 12:00 PM
You are suggesting that I use strrchr to check what the last character is?
I'd use it or the rindex function. My point is that in a null-terminated string, you need to check the character that's to the immediate left of the null character if there's any character there to check. In any null-terminated string, the physically last character is the null character, the '\0'.

farmerdoug
Apr 13, 2011, 01:04 PM
I recreated the file without a carriage return and then put it back. The file looks ok. Increasing the buffer size did not help; in fact, it made things worse. strrchr told me that there was at least one "\0" in the file.

subsonix
Apr 13, 2011, 02:02 PM
I recreated the file without a carriage return and then put it back. The file looks ok. Increasing the buffer size did not help; in fact, it made things worse. strrchr told me that there was at least one "\0" in the file.

Well, the point of it is that fgets() reads until it has stored a '\n', reached end of file, or filled the buffer (59 characters plus the terminating '\0' here). Meaning, if your lines are not exactly 59 characters long, the end of each line will move relative to your fgets calls. Having the buffer "large enough" means that you will have one line per fgets call.

If fgets reads a line that is shorter than the buffer and ends in a newline, the newline will be contained in the string. I usually create a strip_newline function to deal with that.

balamw
Apr 13, 2011, 02:08 PM
Well, the point of it is that fgets() reads until it has stored a '\n', reached end of file, or filled the buffer (59 characters plus the terminating '\0' here). Meaning, if your lines are not exactly 59 characters long, the end of each line will move relative to your fgets calls. Having the buffer "large enough" means that you will have one line per fgets call.

If fgets reads a line that is shorter than the buffer and ends in a newline, the newline will be contained in the string. I usually create a strip_newline function to deal with that.

Explicitly: The string read in by fgets will include both the \n and the \0 when a complete line has been read. Since you reuse the buffer, reading in a shorter line will leave the previous \n and \0 further along in the buffer. The first \0 tells you where the last read ended. Increasing the buffer size should just mean potentially fewer reads. If there is no \n immediately before that first \0, the line was longer than the buffer (or the file ended without a final newline).

So, you either want to strip off the \n as subsonix says, or adapt your code to handle the fact that \n is included. e.g. by using fputs instead of printf("%s\n");

The code below is basically "cat".

#include <stdio.h>
#include <stdlib.h>

int main (int argc, const char * argv[])
{
    FILE *header;
    char *line;

    line = (char *)calloc(60, sizeof(char));
    if (line == NULL)
        return 1;
    if ((header = fopen("testfile.txt","r")) == NULL) {
        printf("couldn't open header file\n");
        free(line);
        return 1; // don't hand a NULL stream to fgets
    }

    // fgets keeps the newline, so fputs reproduces each line unchanged
    while (fgets(line,60,header))
    {
        fputs(line,stdout);
    }

    fclose(header);
    free(line);
    return 0;
}


B
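If one call per line is needed regardless of line length, one option is to keep calling fgets() into a growing buffer until a newline shows up. The following is a rough sketch under that assumption; read_whole_line is a hypothetical name, and the starting size of 60 is arbitrary.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one complete line of any length from fp.
   Returns a malloc'd string (with its '\n' if the file had one) that the
   caller must free, or NULL at end of file or on allocation failure. */
char *read_whole_line(FILE *fp)
{
    size_t cap = 60, len = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;                  /* complete line */
        if (cap - len <= 1) {            /* buffer is full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (len > 0)
        return buf;                      /* last line had no newline */
    free(buf);
    return NULL;
}
```

A caller would loop with while ((line = read_whole_line(header)) != NULL), parse the line, and free it before the next iteration.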

farmerdoug
Apr 13, 2011, 02:30 PM
According to LabView, which made the file, EOF on a Windows machine is cr/lf while it is just lf on a Mac. It seems that fgets in Xcode looks for cr/lf and therefore isn't any good unless you specifically tell LabView how to terminate a line.

balamw
Apr 13, 2011, 02:34 PM
According to LabView, which made the file, EOF on a Windows machine is cr/lf while it is just lf on a Mac. It seems that fgets in Xcode looks for cr/lf and therefore isn't any good unless you specifically tell LabView how to terminate a line.

So strip one or both terminators off and replace them, if needed, with the one you want.

The easiest way to do this is to find the first occurrence of \n or \r and replace it with \0. Basically what subsonix was suggesting with strip_newline. (example http://www.cprogramming.com/tutorial/c/lesson9.html)

The only challenge to this is if you get \n\r instead of \r\n.

B
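One way to strip "one or both" terminators is to trim from the end of the string rather than search from the front. This is a sketch only, and strip_eol is a hypothetical name. It handles a trailing "\n", "\r\n", or lone "\r", but it does nothing about a '\r' that ends up at the start of the next line.

```c
#include <string.h>

/* Hypothetical helper: removes every trailing '\r' or '\n' from str
   and returns the new length. */
size_t strip_eol(char *str)
{
    size_t len = strlen(str);

    while (len > 0 && (str[len - 1] == '\n' || str[len - 1] == '\r'))
        str[--len] = '\0';
    return len;
}
```

Returning the new length lets the caller spot empty lines without a second strlen() call.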

subsonix
Apr 13, 2011, 02:37 PM
But EOF doesn't mean end of line but end of file, and EOF in this case is only the terminating condition of the loop. It doesn't affect what fgets does, only when to stop calling it. That is, keep calling fgets until the entire file is read.


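#include <stdio.h>
#include <string.h>

void strip_newline(char *str) {
    size_t len = strlen(str);
    /* the length check keeps us from reading before an empty string */
    if( len > 0 && str[len - 1] == '\n' )
        str[len - 1] = 0;
}

int main()
{
    char buf[BUFSIZ] = {0};

    while( fgets(buf, BUFSIZ, stdin) ) {
        strip_newline(buf);
        puts(buf);   /* puts adds back the newline that was stripped */
    }

    return 0;
}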

balamw
Apr 13, 2011, 02:43 PM
void strip_newline(char *str) {
    char *nl = (str + strlen(str) -1);
    if( *nl == '\n' )
        *nl = '\0';
}


Would have to be extended to support \r\n in this case.

B

Bill McEnaney
Apr 13, 2011, 03:26 PM
void strip_newline(char *str) {
    char *nl = (str + strlen(str) -1);
    if( *nl == '\n' )
        *nl = '\0';
}


Would have to be extended to support \r\n in this case.

B

How about this if we can assume that either '\n' or '\r' will always be one place to the left of '\0' when either '\n' or '\r' occurs in str? If you need to know what character you've stripped, you can return it.
/* If you find a '\n' or a '\r', replace it with a '\0'. */

void strip_line_terminator(char *str)
{
    char *place = strpbrk(str, "\n\r");

    if (place != NULL)
        *place = '\0';
}
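For comparison, the same "cut at the first terminator" idea is often written with the standard strcspn() function, which returns the length of the initial run containing none of the listed characters. The name strip_at_first_eol below is hypothetical. If no terminator is present, the index is the position of the existing '\0', so the assignment changes nothing.

```c
#include <string.h>

/* Hypothetical helper: truncates str at the first '\r' or '\n'.
   If neither occurs, strcspn returns strlen(str) and the string is unchanged. */
void strip_at_first_eol(char *str)
{
    str[strcspn(str, "\r\n")] = '\0';
}
```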

subsonix
Apr 13, 2011, 03:30 PM
But that will only take care of either '\n' or '\r', unless you call it twice or put the strpbrk call in a while loop.

chown33
Apr 13, 2011, 03:46 PM
But that will only take care of either '\n' or '\r', unless you call it twice or put the strpbrk call in a while loop.

It's not necessary to find more than one line terminator. The first one found terminates the string. Any remainder of the original string is ignored/discarded, regardless of what it contains.

subsonix
Apr 13, 2011, 03:52 PM
Yes, good point.

Bill McEnaney
Apr 13, 2011, 03:55 PM
But that will only take care of either '\n' or '\r', unless you call it twice or put the strpbrk call in a while loop.
Good point.

void strip_line_enders(char *str)
{
    char *place;

    while ((place = strpbrk(str, "\n\r")) != NULL)
        *place = '\0';
}

Bill McEnaney
Apr 13, 2011, 04:04 PM
It's not necessary to find more than one line terminator. The first one found terminates the string. Any remainder of the original string is ignored/discarded, regardless of what it contains.
Oh goodie, that means I don't need my while-loop. I love to decrease overhead.

balamw
Apr 13, 2011, 04:09 PM
The easiest way to do this is to find the first occurrence of \n or \r and replace it with \0.

It's not necessary to find more than one line terminator. The first one found terminates the string.

Isn't that what I said? :p

Since fgets will terminate on \n, the risk you run is that the foreign code generating the file puts out \n\r instead of \r\n. That would give you a \r at the beginning of the second and every later line, and stripping at the first \r would give you zero-length strings.

You might want to check if the first character of the string is \r and there are other characters before the \n\0.

If you control the LabView code and can make sure it uses CR/LF this is a non-issue.

B
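A sketch of the leading-\r check, assuming the file really does use \n\r so that each fgets() result after the first starts with a stray '\r'. The name clean_line is hypothetical. It skips any leading '\r' and then cuts the string at the first remaining terminator.

```c
#include <string.h>

/* Hypothetical helper: returns a pointer into buf where the line's text
   starts (past any leading '\r') and cuts the text at the first
   '\r' or '\n' that follows. */
char *clean_line(char *buf)
{
    while (*buf == '\r')
        buf++;
    buf[strcspn(buf, "\r\n")] = '\0';
    return buf;
}
```

The returned pointer is inside the original buffer, so it must not be passed to free() and stays valid only until the buffer is reused.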

subsonix
Apr 13, 2011, 04:37 PM
If you control the LabView code and can make sure it uses CR/LF this is a non-issue.

B

I think that's a safe assumption unless it's text from Acorn BBC. :D

http://en.wikipedia.org/wiki/Newline#Representations

balamw
Apr 13, 2011, 05:05 PM
I think that's a safe assumption unless it's text from Acorn BBC. :D

http://en.wikipedia.org/wiki/Newline#Representations

I've run into such files from instruments where someone made a mistake at setup and entered \n\r instead of \r\n.

PEBCAK.

B