Become a MacRumors Supporter for $50/year with no ads, ability to filter front page stories, and private forums.

ArtOfWarfare

macrumors G3
Original poster
Nov 26, 2007
9,699
6,268
As a homework assignment, I'm making a C++ program that reads a text file and reports how many times each word appears.

Two requirements for the assignment are that we have to use hashes, and we have to use a vector with the indexes being the hashes.

So my program reads each word, then computes a hash (which will be a number less than 4096), then passes the hash and the word (as a string) to this function:

Code:
#include <iostream>
#include <fstream>
#include <vector>
#include <string>

using namespace std;

typedef struct wordRec {
	string word;
	int occurances;
};

vector<wordRec *> words;

void foundWord (string word, unsigned int hash)
{
	cout << "Over here! I have picked up " << word << endl;
	if ((hash <= words.capacity()) && words[hash])
	{
		if (words[hash]->word == word)
		{
			words[hash]->occurances++;
			cout << words[hash]->word << " has now been found " << words[hash]->occurances << " times." << endl;
		}
		
		else
		{
			foundWord (word, hash+4096);
		}
	}
	
	else
	{
		struct wordRec * newWord = new wordRec();
		newWord->word = word;
		newWord->occurances = 1;
		cout << "First time " << newWord->word << " has been found. Hash is " << hash << endl;
		if (words.capacity() < hash)
		{
			cout << "I came here..." << endl;
			words.reserve(hash+2);
			cout << "And I won. Capcity is now " << words.capacity() << endl;
		}
		words.insert(words.begin()+hash, newWord);
		cout << "It has been inserted." << endl;
	}
}

This function first checks whether the vector uses the hash yet. If it does, it checks to see if the word at that hash is the same as the one that was passed to the function. If it is, then it just increments the occurrence counter. If it isn't the same word, then it calls itself but with a hash 4096 greater.

If it doesn't find that the vector has used the hash, it inserts itself. That's what I would like it to do, anyways. Here's what I get for output:

(alice is the first word in the text, which is alice's adventures in wonderland.)

Code:
Over here! I have picked up alice
First time alice has been found. Hash is 1504
I came here...
And I won. Capcity is now 1506
Segmentation fault: 11

So I'm getting segmentation fault after it increases the capacity, but before it inserts the word, because it gets a segmentation fault instead.

Any suggestions as to why or how to fix it?
 
Last edited:
I hate code fragment as they give an incomplete picture of a problem.

Haven't given the code segment more than a quick glance but I suspect problems here …

if ( (hash <= words.capacity()) && words[hash] )

… and, here …

words.insert(words.begin()+hash, newWord);

'vector' array index should probably be 'capacity()' - 1.
Expanding the vector anywhere in the middle will of course throw off the location of all previous contents.

As well as the boundary conditions of your recursive function
 
I hate code fragment as they give an incomplete picture of a problem.

Haven't given the code segment more than a quick glance but I suspect problems here …

if ( (hash <= words.capacity()) && words[hash] )

… and, here …

words.insert(words.begin()+hash, newWord);

'vector' array index should probably be 'capacity()' - 1.
Expanding the vector anywhere in the middle will of course throw off the location of all previous contents.

As well as the boundary conditions of your recursive function

Changing words.capacity() to words.capacity()-1 causes it to crash with a segmentation fault, right on the if() line...
 
Like everything else in software development: Break It Down.

Write a simple test program. It should test whether you can insert items at the beginning, and only at the beginning. If that works, check insertion of items at the end. At every point in the test, make sure you can iterate over the items already in the vector, i.e. demonstrate their accessibility.

If all that works, then proceed to check insertion of items between beginning and end.

Find known working examples of vector, and compile and test them.

Scale up from the simplest working test until something breaks. If the simplest thing breaks, debug that first. Maybe the problem lies in the object you're inserting, and not in the vector at all.

Learn to use the debugger, so you can step into methods and see exactly where they fail.


And as lloyddean pointed out:
Expanding the vector anywhere in the middle will of course throw off the location of all previous contents.
This seems to undermine your basic strategy, unless you do something to avoid it. I see nothing in your code to avoid this problem, which means that even if insertion didn't crash, the code would fail because of the design flaw (failure to account for insertions pushing up all subsequent items).

A vector is not exactly like a C array, nor is it exactly like NSArray. You need to understand exactly what it means to insert into a vector. The posted code suggests you don't understand this yet.
 
likely issue

I have not used C++ since 1996 (been using Java since then), but I think I see your problem. words.begin() is going to give you an iterator to the first element populated in the vector (if any).

Don't you want to get to the element at index hash? Suppose there is already an element at index X. You are positioning the new element at X + hash, as opposed to just hash. This may be extending beyond the bounds of the array.

If this is the very first element to be added to the vector, then words.begin() should not be pointing at anything.
 
At the moment I don't even think this code could compile.

It won't; I left off function main.

chown33 said:
Write a simple test program. It should test whether you can insert items at the beginning, and only at the beginning. If that works, check insertion of items at the end. At every point in the test, make sure you can iterate over the items already in the vector, i.e. demonstrate their accessibility.

If all that works, then proceed to check insertion of items between beginning and end.

Find known working examples of vector, and compile and test them.

Scale up from the simplest working test until something breaks. If the simplest thing breaks, debug that first. Maybe the problem lies in the object you're inserting, and not in the vector at all.

Excellent advice; thank you.

Expanding the vector anywhere in the middle will of course throw off the location of all previous contents.
I don't know how I ended up looking over that entirely the first time I read through your post.

Hmmm... I need to seriously reconsider how I'm making this whole thing work. Maybe I'll start by having 4096 blanks, add new things directly to the end, then swap the new things with the blank that is in the correct location? Wait, that makes no sense. I can just write over whatever is at that index; it's just silly to bother with making new ones and swapping around... I'll post when I have results,

Thanks all for the input.
 
Hmmm... I need to seriously reconsider how I'm making this whole thing work. Maybe I'll start by having 4096 blanks, add new things directly to the end, then swap the new things with the blank that is in the correct location? Wait, that makes no sense. I can just write over whatever is at that index; it's just silly to bother with making new ones and swapping around...
What's silly is using the wrong data structure in the wrong place.

Start with an array. It has 4096 slots. Each slot starts empty (null). When the first hash appears for a slot, make a vector. Put the vector in the array slot. Add the word to the vector.

FWIW, this is one of the classical approaches to hash things. The truly classical way is an array of length N, where N is the number of hashes, or modulo a power of 2. Each entry in the array is the head of a linked list. New items are then linked at the head of the list residing at the index corresponding to the hash.

Another classical approach is to have an array that's several times larger than the hash count. The hash gives the starting point for a sequential search of the array, modulo the current array size (wraps around at end). This approach only works well for well-distributed hashes, where the max number of possible entries is less than the array size.

The easy part in both cases is the basic data structure and the basic search. The tricky part is expanding the array when the density gets so high that performance starts degrading to approximate a linear search.
 
What's silly is using the wrong data structure in the wrong place.

I would, but the assignment was to use vectors.

Start with an array. It has 4096 slots. Each slot starts empty (null). When the first hash appears for a slot, make a vector. Put the vector in the array slot. Add the word to the vector.

Alright. I like that idea.
 
Last edited:
I would, but the assignment was to use vectors.

But it didn't say where to use them, did it?

One of the real keys to effective use data structures is recognizing when and where each one is appropriate. Sometimes it doesn't matter much. Sometimes it matters a lot.

Trying to make a vector work like a much simpler array is silly: just use an array.

Making the choice between a vector and a linked list, well, that's a lot more subtle, since it's possible the vector might be implemented as a linked list (or a variant of linked list, like skip-list, depending on criteria like sortedness).
 
Now the we know that this is homework perhaps your next post should be the assignment word-for-word?

Hint - it makes giving targeted advice possible vs shotgun suggestions.

Oh, and of course I knew 'main' was missing so that is obviously NOT what I was referring to.
 
use size/resize instead of capacity/reserve, and here is the slightly modified code. Notice when re-sizing, I use NULL to fill the new vector elements, though it is not necessary.

Code:
#include <iostream>
#include <fstream>
#include <vector>
#include <string>

using namespace std;

typedef struct wordRec {
    string word;
    int occurances;
};

vector<wordRec *> words;

void foundWord (string word, unsigned int hash)
{
    cout << "Over here! I have picked up " << word << endl;
    if ((hash <= words.capacity()) && words[hash])
    {
        if (words[hash]->word == word)
        {
            words[hash]->occurances++;
            cout << words[hash]->word << " has now been found " << words[hash]->occurances << " times." << endl;
        }

        else
        {
            foundWord (word, hash+4096);
        }
    }

    else
    {
        struct wordRec * newWord = new wordRec();
        newWord->word = word;
        newWord->occurances = 1;
        cout << "First time " << newWord->word << " has been found. Hash is " << hash << endl;
        if (words.size() < hash)
        {
            cout << "I came here..." << endl;
            words.resize(hash+2, NULL);
            cout << "And I won. Capcity is now " << words.capacity() << endl;
        }
        words[hash] = newWord;
        cout << "It has been inserted." << endl;
    }
}
 
After rechecking the assignment, I found that the instructor specified that the indices of the vector have to match up with the hash. So using a C array, for this assignment, isn't allowed.

I ended up doing this:

Code:
typedef struct wordRec {
	string word;
	int occurances;
};

unsigned int foundWord (vector<wordRec *> hashVector, string word, unsigned int hash)
{
	if (hashVector[hash]->occurances == 0)
	{
		hashVector[hash]->word = word;
		hashVector[hash]->occurances = 1;
		return hash;
	}
	
	else if (hashVector[hash]->word == word)
	{
		hashVector[hash]->occurances++;
		return hash;
	}
	
	else
	{
		if (hash+4096 > 65535)
		{
			// Allow a collision to occur, and report it to the user.
			cout << "A collision has occured. Hash " << hash << " contains both ";
			cout << hashVector[hash]->word << " and " << word << "." << endl;
			hashVector[hash]->occurances++;
			return hash;
		}
		
		else
		{
			return foundWord (hashVector, word, hash+4096);
		}
	}
}

To actually create the vector in the beginning, I use this:
Code:
		// Set up the vector that will be used.
		vector<wordRec *> hashVector;
		for (int i = 0; i < 65536; i++)
		{
			struct wordRec *blankWord = new wordRec();
			blankWord->occurances = 0;
			hashVector.push_back(blankWord);
		}

It runs like a charm.

It does have the potential for collisions (if more than 16 unique words have the same hash, which is very possible if someone wanted to deliberately mess with the program, but in reading Alice's Adventures of Wonderland, the most words assigned to a single hash was 4,) but it shouldn't be possible to crash the program. Reading over the assignment, it looks like collisions are expected, and we'll be learning about linked lists on our next assignment, and we'll be revising this one to use those to avoid collisions.
 
Last edited:
Register on MacRumors! This sidebar will go away, and you'll see fewer ads.