Become a MacRumors Supporter for $50/year with no ads, ability to filter front page stories, and private forums.

dougphd

macrumors member
Original poster
Apr 28, 2016
70
0
If the while loop is not enter then all is well. If the while loop is entered then get_date is called a second time. No line is read in get_date and the program fails.

Code:
/
char* get_date(FILE *fp)
{
   char  *start_date, *line;
   start_date = (char *)calloc(14, sizeof(char));
   line = (char *)calloc(100, sizeof(char));
   fgets(line,100,fp);
   strtok(line,",");
   strtok(NULL,",");
   strcpy(start_date,strtok(NULL,","));
   free(line);
   return(start_date);
};
char * find_start_date( int i, int j, char **etf_list)
{
   char *start_date1,*start_date2, *full_name, *line;
   FILE *etf1,*etf2;
   int comp, k;
   start_date1 = (char *)calloc(14, sizeof(char));
   start_date2 = (char *)calloc(14, sizeof(char));
   line = (char *)calloc(100, sizeof(char));
   full_name = (char *)calloc(100, sizeof(char));
   strcpy(full_name,home);
   strcat(full_name,etf_list[i]);
   if ( (etf1 = fopen(full_name,"r") ) == NULL)
      printf("did not open %s\n",full_name);
   strcpy(full_name,home);
   strcat(full_name,etf_list[j]);
   if ((etf2 = fopen(full_name,"r")) == NULL)
        printf("did not open %s\n",full_name);
    // read first line in etf1
    fgets(line,100,etf1);
    start_date1 =  get_date(etf1);
   // read first line in etf2
    fgets(line,100,etf2);
    start_date2 =  get_date(etf2);
    fclose(etf1);
    fclose(etf2);
    while((comp = strcmp(start_date1,start_date2))!= 0)
         if (comp > 0)
             start_date2 =  get_date(etf2);
         else if (comp < 0)
             start_date1 =  get_date(etf1);
    free(line);
    free(full_name);
    free(start_date2);
    return(start_date1);
};

/[code]
 
You call fclose on both etf1 and etf2 just before the while loop. Inside the loop, you then try to read from the closed files, which will fail.

Edit: And even if you moved the fclose to after the loop, you'd leak every start_date other than the last ones used. Based on your other posts, you are going to be dealing with a lot of data, so avoiding leaks is more important than usual. In particular, avoiding leaks in a loop.
 
Last edited:
Yes,that makes sense and I should have caught it. However, with that fixed, it still fails the same way just after running more pairs of files. I will look at the code again probably tomorrow but if you can take another look it would be appreciated.
 
How does it fail? Any error message? Post it exactly. How many pairs does it process? Without all your code and representative data, I can only really guess.

Off the top of my head:
  • If you are able to process a large number of files, the repeated leaking might eventually cause your process to run out of memory, and cause calloc to fail.
  • Is every single line in your files formatted properly? Blank lines, or lines subtly different than expected, could cause issues. Also, do you intend to completely ignore the first line in every file? Without seeing your data, I just have to guess.
  • Do all your paths fit inside the 100-char buffer you assign for file names?
  • I honestly never remember if fgets consumes the newline or leaves it in the file (causing the next invocation to see an empty line and do nothing) and the man page is less than clear. Given that you even got dates to begin with, I assume it must be consuming it since there's an fgets on both files, and if that left them with a newline at the read head, the first calls to get_date would fail. But I'd take a look here, too.
  • If you reach EOF in a file, your fgets will come up null, and your get_date will fail. You don't seem to check for this.

Suggestions:
  • Check return values of every fgets. If it returns NULL, that means that either you hit the end of a file or had an error.
  • Fix the leaks. Every single calloc needs to be paired with a free. You can remove the callocs for both start_dates inside find_start_date, since they get over-written with calloced buffers from get_date. Then, in the loop, free the date before assigning to the return of get_date. Also, whatever calls find_start_date needs to (eventually) free the date it gets back.
  • Confirm what fgets does with the newline character, and make sure it is what you expect.
  • Turn on all the compiler warnings and see what it says. (-Wall is the option you want). Paste the output here if you want, I'm happy to look at it and let you know what I see.
  • If you are still stuck, post a full, compilable program and enough data to demonstrate the issue (the data can be fake as long as it matches your format). I'm happy to take a look, but I can't debug code I can't compile.
 
Last edited:
  • I honestly never remember if fgets consumes the newline or leaves it in the file (causing the next invocation to see an empty line and do nothing) and the man page is less than clear. Given that you even got dates to begin with, I assume it must be consuming it since there's an fgets on both files, and if that left them with a newline at the read head, the first calls to get_date would fail. But I'd take a look here, too.

It will consume it. This is one of the things the man page means when it says "The newline, if any, is retained.". The other thing being that the \n is left in str.
[doublepost=1463775302][/doublepost]Another generic suggestion - if you don't already know how to use a debugger then learn to do so. Sometimes single stepping through your code, checking the value of things as you go, is enough for that head slapping moment.

I usually use gdb but under OSX, unless you install it, you will only have lldb (not that there is anything wrong with lldb - just not what I am used to as sadly I don't work on OSX). Of course if you are using Xcode there is a nice GUI over lldb.
 
It will consume it. This is one of the things the man page means when it says "The newline, if any, is retained.". The other thing being that the \n is left in str.

That's what I thought, but I couldn't tell if that meant retained in the string you get back (rather than tossed out), or retained just after the current read position in the file (i.e. the read position isn't advanced past it). I should have read farther, since the next paragraph on gets makes it clear that it tosses the newline, unlike fgets.
 
Thanks. I cleaned up some leaks but the real problem is that the while condition is never = 0 for some pairs of files.
 
That would mean those pairs never have the same date in them. Have you confirmed that that is true by inspecting the data for pairs that exhibit this behavior?

Also, the way you go through files is a little odd. It looks like the code assumes the files are sorted lexicographically by date (all we can tell about date is that it is up to 14 chars long and somewhere in the middle of the line, after a couple commas. Many date systems do not chronologically when sorted lexicographically, which could also be a problem. 11/1/xx sorts before 2/1/xx lexicographically, for example. And that's true whether that is Nov 1 and Feb 1, or Jan 11 and Jan 2). If you do not think your code should be doing this, can you share a handful of representive lines of your data, so that we can see how it looks? If the files are not sorted lexicographically, then due to the way you alternate reading files, you could very easily skip over a matching set of dates the way it is coded.

Making up data, if your dates are as follows (just using integers here for simplicity)
Code:
File A  File B
  1       2
  3       6
  5       5
  7       6


You start like this, where > # < are the relevant read positions in each file:

Code:
File A  File B
> 1 <   > 2 <
  3       6
  5       5
  7       6

strcmp will return negative, so you move ahead in file A

Code:
File A  File B
  1     > 2 <
> 3 <     6
  5       5
  7       6

Now strcmp will return positive, so you advance b

Code:
File A  File B
  1       2 
> 3 <   > 6 <
  5       5
  7       6

negative again, so A

Code:
File A  File B
  1       2 
  3     > 6 <
> 5 <     5
  7       6

Still negative, so:

Code:
File A  File B
  1       2 
  3     > 6 <
  5       5
> 7 <     6

Positive:
Code:
File A  File B
  1       2 
  3       6  
  5     > 5 <
> 7 <     6

Positive:

Code:
File A  File B
  1       2 
  3       6  
  5       5 
> 7 <   > 6 <

and now we are at EOF in both files and we skipped the 5/5 match because the two files were not sorted properly and the code stair-stepped over it


If the files really do not match, then adding EOF handling to your fgets calls should at least allow the program to keep running: if you run out of file, just make a note of it and return cleanly.
 
I know the files are chronological and now know that I have to be careful to account for missing dates.
 
They may be chronological, but as I said, your code relies on them being sorted lexicographically/ASCIIbetically. Again, I don't know what your dates look like, but in general there is no guarantee that a chronological sort produces a lexicographical sort (or vice versa) on dates. Pulling a few date systems out of thin air:

2016-05-20
2016-05-21
2016-05-22
2016-12-25

Those are sorted chronologically and lexicographically. Your code will work on dates like that.


But these dates:

2016-5-20
2016-5-21
2016-5-22
2016-12-25

are still sorted chronologically, but now they are not sorted lexicographically (1 sorts before 5 lexicographically). Your code will fail with dates like this.

And then there are US-style dates:

05/05/2016
04/04/2017
03/03/2018

even the padding 0s can't save those dates.


If your times are unix-style timestamps (and now that i think about it, adding ms, as many languages do, makes a unix timestamp take 13 characters, or 14 with the nul), those will sort properly (unless you have dates before early Sept 2001 that are not 0-padded on the left)



As your code stands, if the dates are sorted lexicographically, then gaps don't matter. It will run properly and find matches (or fall off a cliff at EOF if there are no matches, but that is fixable). If they are not lexicographically sorted, you can easily skip a match as one file "gets stuck" on a larger value before a match, while the other shoots past the match.
 
Here are some lines from the file. The ordering has never been a problem.

.10GSPC,D,20151224,000000,8057.3804,8057.3804,8057.3804,8057.3804,0,0
.10GSPC,D,20151228,000000,8038.2915,8038.2915,8038.2915,8038.2915,0,0
.10GSPC,D,20151229,000000,8105.3369,8105.3369,8105.3369,8105.3369,0,0
.10GSPC,D,20151230,000000,8036.7295,8036.7295,8036.7295,8036.7295,0,0
.10GSPC,D,20151231,000000,7946.9121,7946.9121,7946.9121,7946.9121,0,0
.10GSPC,D,20160104,000000,7813.2612,7813.2612,7813.2612,7813.2612,0,0
.10GSPC,D,20160105,000000,7824.7964,7824.7964,7824.7964,7824.7964,0,0
.10GSPC,D,20160106,000000,7720.4253,7720.4253,7720.4253,7720.4253,0,0
.10GSPC,D,20160107,000000,7526.0396,7526.0396,7526.0396,7526.0396,0,0
.10GSPC,D,20160108,000000,7417.1660,7417.1660,7417.1660,7417.1660,0,0
.10GSPC,D,20160111,000000,7398.1880,7398.1880,7398.1880,7398.1880,0,0
 
Here are some lines from the file. The ordering has never been a problem

Yeah, those dates will order properly if they are chronological, so that isn't the problem. So, your above code should be able to find the first match in a pair of files like that, assuming a match exists (and that the match doesn't include the first line of each file, which I assume is a header, since you skip that line).

If it isn't finding a match, find a pair of files that exhibit the problem (print file names when comparing to help) and manually verify that there is a match to be found. Since the files are long, the following command should be able to tell you if there are matches (this is very quick and dirty, but it should suffice)

Code:
grep -F $(awk 'BEGIN{FS=","};{print $3}' ONE_FILE.txt) OTHER_FILE.txt

it'll print the lines in OTHER_FILE.txt that have corresponding dates in ONE_FILE.txt (replace names with appropriate paths).

If your program isn't finding matches but that shell script says there are some, I'll take a closer look at your code (posting the current state of these functions would help in that case). At first glance, it looks like it should work (with the changes that have been mentioned). If there truly aren't matches, then your current code will have issues once it hits EOF, which might be what you are seeing.
 
I check the pair of files that fail and yes there is a gap in dates that causes the code to fail.
 
I check the pair of files that fail and yes there is a gap in dates that causes the code to fail.

I'm not sure what you expect to do if there is no match, but I can help detect it. On all your fgets calls, check the return value. If it is NULL, then you've (probably) hit EOF in that file (it could be a read error but you probably don't need to differentiate). If it is NULL in get_date, free your buffer and return NULL. Also check the return values of get_date, and if that ever is NULL, free any buffers, close your files, and do whatever you want to do when you don't have a match. This should work, since once either file hits EOF, you know you can't find any matches with that pair of files, so even if the other file isn't at EOF yet, there's no need to finish scanning it.
 
What a thoughtful gesture in what you've done!

I was once stuck in a more complex situation developing something for work and someone came to my assistance in the dead of night from afar.

Could not thank the guy enough.
[doublepost=1464012813][/doublepost]BTW sed, awk, grep are my favorite tools (30 + years) for dealing with this sort of thing.
 
Register on MacRumors! This sidebar will go away, and you'll see fewer ads.