java string tokenizing question

Discussion in 'Mac Programming' started by prostuff1, Apr 29, 2008.

  1. prostuff1 macrumors 65816

    prostuff1

    Joined:
    Jul 29, 2005
    Location:
    Don't step into the kawoosh...
    #1
    I have most of this working, and I will add the code to the end of the post for you all to look at.

    Essentially what I am trying to do is tokenize a string that looks like:

    Code:
    program int ABC, D;
    	begin read ABC; read D;
    		while (ABC != D) begin
    							if (ABC > D) then ABC = ABC - D;
    							else D = D - ABC;
    						end;
    					end;
    		write D;
    	end
    
    
    My tokenizing works just fine for simpler versions of this, but it screws up a little when it gets to the "special symbols" that are a combination of more than one special token. The special symbols are:

    Code:
    ; , = ! [ ] && or ( ) + - * != == < > <= >=
    

    Now I obviously have to look for those as tokenizing points, but the problem comes when I get to the two-character ones like &&, or, !=, etc.

    I was wondering if there is a way to make it look for the combination != instead of just the ! and then the =.

    If there is not an easy way to do it, what would be the best and easiest way to go about parsing that string into tokens? I am kind of leaning towards String.split but am not quite sure how to set it up. Some examples or pointers would be welcome!

    Thanks and here is the code of the relevant part:

    Code:
    import java.util.ArrayList;
    import java.util.StringTokenizer;
    import java.io.*;
     
    /**
     * 
     * @author Kyle Hiltner
     *
     */
    public class KHTokenizer implements KHTokenizerInterface
    {
    	private String current_token; //used to specify current token
    	private int token_count=0; //used to keep track of which token is being asked for
    	private ArrayList<String> file = new ArrayList<String>(); //stores the parsed input file
     
    	/**
    	 * Creates a new KHTokenizer with the name of the file as input
    	 * 
    	 * @param inputFileName the specified file to be read from
    	 * @throws IOException
    	 */
    	KHTokenizer(String inputFileName) throws IOException
    	{
    		FileReader freader = new FileReader(inputFileName); //create a FileReader for reading
    		BufferedReader inputFile = new BufferedReader(freader); //pass that FileReader to a BufferedReader
     
    		String theFile = Create_String_From_File(inputFile); //create a space separated string for easier tokenizing
    		StringTokenizer tokenized_input_file = new StringTokenizer(theFile, ";=,()[] ", true); //tokenize the string using ; = , ( ) [ ] and " " as delimiters, returning the delimiters as tokens
    		String_Tokenizer(tokenized_input_file, file); //create the array by adding tokens
     
    		this.current_token = file.get(this.token_count); //set the current token to the first in the array
    	}
     
    	//--------------------------//
    	//----Private Operations----//
    	//-------------------------//
     
    	/**
    	 * Determines if the specified word is a special Reserved word
    	 * 
    	 * @param reserved_word the current token
    	 * @return true if and only if the reserved_word is a Reserved Word
    	 */
    	private static Boolean Is_Reserved_Word(String reserved_word)
    	{
    		//determine if reserved_word is one of the established Reserved Words
    		return ((reserved_word.equals("program")) || (reserved_word.equals("begin")) || 
    				(reserved_word.equals("end")) || (reserved_word.equals("int")) ||
    				(reserved_word.equals("if")) || (reserved_word.equals("then")) ||
    				(reserved_word.equals("else")) || (reserved_word.equals("while")) ||
    				(reserved_word.equals("read")) || (reserved_word.equals("write")));
    	}
     
    	/**
    	 * Determines if the specified word is a Special Symbol
    	 * 
    	 * @param special_symbol the current token
    	 * @return true if and only if the special_symbol is a Special Symbol
    	 */
    	private static Boolean Is_Special_Symbol(String special_symbol)
    	{
    		//determines if special_symbol is one of the established Special Symbols
    		return ((special_symbol.equals(";")) || (special_symbol.equals(",")) ||
    				(special_symbol.equals("=")) || (special_symbol.equals("!")) ||
    				(special_symbol.equals("[")) || (special_symbol.equals("]")) ||
    				(special_symbol.equals("&&")) || (special_symbol.equals("or")) ||
    				(special_symbol.equals("(")) || (special_symbol.equals(")")) ||
    				(special_symbol.equals("+")) || (special_symbol.equals("-")) ||
    				(special_symbol.equals("*")) || (special_symbol.equals("!=")) ||
    				(special_symbol.equals("==")) || (special_symbol.equals("<")) ||
    				(special_symbol.equals(">")) || (special_symbol.equals("<=")) ||
    				(special_symbol.equals(">=")));
    	}
     
    	/**
    	 * Determines if the specified token is an integer
    	 * 
    	 * @param integer_token the current token to be converted to an integer
    	 * @return true if and only if integer_token is an integer
    	 */
    	private static Boolean Is_Integer(String integer_token)
    	{
    		Boolean is_integer=false; //set up boolean for check
     
    		//try to convert the specified string to an integer
    		try
    		{
    			int integer_token_value = Integer.parseInt(integer_token); //convert the string to an integer
    			is_integer = true; //set is_integer to true
    		}
    		catch(NumberFormatException e) //if unable to parse the string to an integer set is_integer to false
    		{
    			is_integer = false; //set is_integer to false
    		}
     
    		return is_integer; //return whether the token is an integer
    	}
     
    	/**
    	 * Determines if the specified token is an Identifier
    	 * 
    	 * @param identifier_token the current token
    	 * @return true if and only if the identifier_token is an identifier
    	 */
    	private static Boolean Is_Identifier(String identifier_token)
    	{
    		//rule out that it is a Reserved Word, Special Symbol, or integer so then it must be an Identifier; so return true or false
    		return ((!Is_Reserved_Word(identifier_token)) && (!Is_Special_Symbol(identifier_token)) && (!Is_Integer(identifier_token)));
    	}
     
    	/**
    	 * Determines which value to assign to the specified token
    	 * 
    	 * @param which_reserved_word_token the current token
    	 * @return token_value the integer value relating to the Reserved Word token
    	 */
    	private static int Which_Reserved_Word(String which_reserved_word_token)
    	{
    		int token_value=0; //set initial token_value
     
    		//run through and check which Reserved word it is and then set it to the correct value
    		if(which_reserved_word_token.equals("program"))
    		{
    			token_value = ReservedWords.PROGRAM.ordinal()+1;
    		}
    		else if(which_reserved_word_token.equals("begin"))
    		{
    			token_value = ReservedWords.BEGIN.ordinal()+1;
    		}
    		else if(which_reserved_word_token.equals("end"))
    		{
    			token_value = ReservedWords.END.ordinal()+1;
    		}
    		else if(which_reserved_word_token.equals("int"))
    		{
    			token_value = ReservedWords.INT.ordinal()+1;
    		}
    		else if(which_reserved_word_token.equals("if"))
    		{
    			token_value = ReservedWords.IF.ordinal()+1;
    		}
    		else if(which_reserved_word_token.equals("then"))
    		{
    			token_value = ReservedWords.THEN.ordinal()+1;
    		}
    		else if(which_reserved_word_token.equals("else"))
    		{
    			token_value = ReservedWords.ELSE.ordinal()+1;
    		}
    		else if(which_reserved_word_token.equals("while"))
    		{
    			token_value = ReservedWords.WHILE.ordinal()+1;
    		}
    		else if(which_reserved_word_token.equals("read"))
    		{
    			token_value = ReservedWords.READ.ordinal()+1;
    		}
    		else
    		{
    			token_value = ReservedWords.WRITE.ordinal()+1;
    		}
     
    		return token_value; //return the token_value
    	}
     
    	/**
    	 * Determines which value to assign to the specified token
    	 * 
    	 * @param which_special_symbol_token the current token
    	 * @return special_symbol_token_value the integer value relating to the Special Symbol token
    	 */
    	private static int Which_Special_Symbol(String which_special_symbol_token)
    	{
    		int special_symbol_token_value=0; //set initial value
     
    		//check to figure out which Special Symbol it is and assign the correct value
    		if(which_special_symbol_token.equals(";"))
    		{
    			special_symbol_token_value = SpecialSymbols.SEMICOLON.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals(","))
    		{
    			special_symbol_token_value = SpecialSymbols.COMMA.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("="))
    		{
    			special_symbol_token_value = SpecialSymbols.EQUALS.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("!"))
    		{
    			special_symbol_token_value = SpecialSymbols.EXCLAMATION_MARK.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("["))
    		{
    			special_symbol_token_value = SpecialSymbols.LEFT_BRACKET.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("]"))
    		{
    			special_symbol_token_value = SpecialSymbols.RIGHT_BRACKET.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("&&"))
    		{
    			special_symbol_token_value = SpecialSymbols.AND.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("or"))
    		{
    			special_symbol_token_value = SpecialSymbols.OR.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("("))
    		{
    			special_symbol_token_value = SpecialSymbols.LEFT_PARENTHESIS.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals(")"))
    		{
    			special_symbol_token_value = SpecialSymbols.RIGHT_PARENTHESIS.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("+"))
    		{
    			special_symbol_token_value = SpecialSymbols.PLUS.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("-"))
    		{
    			special_symbol_token_value = SpecialSymbols.MINUS.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("*"))
    		{
    			special_symbol_token_value = SpecialSymbols.MULTIPLY.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("!="))
    		{
    			special_symbol_token_value = SpecialSymbols.NOT_EQUALS.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("=="))
    		{
    			special_symbol_token_value = SpecialSymbols.EQUALS_EQUALS.ordinal()+11;
    		}
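    		else if(which_special_symbol_token.equals("<"))
    		{
    			special_symbol_token_value = SpecialSymbols.LESS_THAN.ordinal()+11; //constant name LESS_THAN is assumed by analogy with GREATER_THAN
    		}
    		else if(which_special_symbol_token.equals(">"))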
    		{
    			special_symbol_token_value = SpecialSymbols.GREATER_THAN.ordinal()+11;
    		}
    		else if(which_special_symbol_token.equals("<="))
    		{
    			special_symbol_token_value = SpecialSymbols.LESS_THAN_OR_EQUAL_TO.ordinal()+11;
    		}
    		else
    		{
    			special_symbol_token_value = SpecialSymbols.GREATER_THAN_OR_EQUAL_TO.ordinal()+11;
    		}		
     
    		return special_symbol_token_value; //return the correct value
    	}
     
    	/**
    	 * Creates the string separated by white spaces to be read by the String Tokenizer
    	 * 
    	 * @param input_file the stream to be converted into a string
    	 * @return theFile the inputFile converted to a string
    	 * @throws IOException
    	 */
    	private static String Create_String_From_File(BufferedReader input_file) throws IOException
    	{
    		String theFile="", keepReadingFromFile=""; //set initial value of the strings
     
    		//run through the stream and create a file
    		while(keepReadingFromFile != null)
    		{
    			keepReadingFromFile = input_file.readLine(); //read one line at a time
     
    			//if the line is null stop and break
    			if(keepReadingFromFile == null)
    			{
    				break;
    			}
    			else //keep reading from the file and make it into a string
    			{
    				theFile = theFile + keepReadingFromFile + " "; //add a space between lines so tokens on adjacent lines do not run together
    			}			
    		}
     
    		theFile = theFile.replaceAll("\\t", " "); //remove any tabs from the string and replace with spaces so it is easier to Tokenize
     
    		return theFile; //return the newly created string
    	}
     
    	/**
    	 * Creates the array of tokens by tokenizing based on the given parameters
    	 * 
    	 * @param theInputFile
    	 * @param file to store the individual tokens in
    	 */
    	private void String_Tokenizer(StringTokenizer theInputFile, ArrayList<String> file)
    	{
    		String token=""; //set up the initial token
     
    		//keep reading while there is still more in the token stream
    		while (theInputFile.hasMoreTokens()) 
    		{
    			token = theInputFile.nextToken(); //set token to the next token
     
    			//if the token is not a white space then add it to the array
    			if(!token.equals(" "))
    			{		
    				file.add(token); //add token to the array
    			}
    		}
    		file.add("nill"); //add a final spot to designate the end of the file
    	}
     
    	//--------------------------//
    	//----Public Operations-----//
    	//--------------------------//
    	
    	/**
    	 * Returns the integer value of the current token
    	 * 
    	 * @return the integer value of the current token
    	 */
    	public int getToken()
    	{
    		int token_number=0; //set initial value
     
    		//determine if the current token is a Reserved Word, Special Symbol, Identifier, or nill (for end of file)
    		if(Is_Reserved_Word(this.current_token))
    		{
    			token_number = Which_Reserved_Word(this.current_token); //determine the correct value for the Reserved Word
    		}
    		else if(Is_Special_Symbol(this.current_token))
    		{
    			token_number = Which_Special_Symbol(this.current_token); //determine the correct value for the Special Symbol
    		}
    		else if(Is_Integer(this.current_token))
    		{
    			token_number = 30; //the current token is an integer so set it to 30
    		}
    		else if(this.current_token.equals("nill"))
    		{
    			token_number = 32; //the current token is nill so set it to 32
    		}
    		else//(Is_Identifier(this.current_token))
    		{
    			token_number = 31; //the token is an identifier so set it to 31
    		}
     
    		return token_number; //return the token_number
    	}
     
    	/**
    	 * Sets the current token as the next one in line
    	 */
    	public void skipToken()
    	{
    		//keep getting the next token as long as token_count is less than the size of the array
    		if(this.token_count < file.size()-1)
    		{
    			this.token_count++; //increase token_count
    			this.current_token = file.get(token_count); // get the new token
    		}
    	}
     
    	/**
    	 * This method can only be called to convert an integer in string form to its integer value.
    	 * If called on a non-integer token, an error is printed to the screen and execution of the Tokenizer is stopped.
    	 * 
    	 * @return integer value of the specified token assuming the token is an integer
    	 */
    	public int intVal()
    	{
    		int integer_token_value=0; //set the initial value
     
    		//if true is returned then go ahead and convert
    		if(Is_Integer(this.current_token))
    		{
    			integer_token_value = Integer.parseInt(this.current_token); //parse the current_token string and get an integer value
    		}
    		else // print the error message and exit tokenizing
    		{
    			System.out.print("You called intVal() on a non-integer token. You tried to convert the " );
    			if(Is_Reserved_Word(this.current_token))
    			{
    				System.out.print("reserved word " + "\"" + this.current_token +"\"" + " to an integer");
    			}
    			else if(Is_Special_Symbol(this.current_token))
    			{
    				System.out.print("special symbol " + "\"" + this.current_token +"\"" + " to an integer");
    			}
    			else
    			{
    				System.out.print("identifier " + "\"" + this.current_token +"\"" + " to an integer");
     
    			}
    			System.exit(1); //exit the system and quit tokenizing
    		}
     
    		return integer_token_value; //return the current_token integer value
    	}
     
    	/**
    	 * Returns a string if and only if the token is of the id type.
    	 * 
    	 * @return the name of the id token
    	 */
    	public String idName()
    	{
    		String id_token_name=""; //setup the initial value
     
    		//if the current_token is an Identifier then set it so and return it.
    		if(Is_Identifier(this.current_token))
    		{
    			id_token_name = this.current_token;
    		}
    		else // print message and quit tokenizing 
    		{
    			System.out.print("You called idName() on ");
    			if(Is_Reserved_Word(this.current_token))
    			{
    				System.out.print("a reserved word, ");
    			}
    			else if(Is_Special_Symbol(this.current_token))
    			{
    				System.out.print("a special symbol, ");
    			}
    			else
    			{
    				System.out.print("an integer, ");
     
    			}
    			System.out.println("which is not an identifier token.");
    			System.exit(1); //exit and quit tokenizing
    		}
     
    		return id_token_name; //return the id_token_name if possible
     
    	}
    }
    
     
  2. HiRez macrumors 603

    HiRez

    Joined:
    Jan 6, 2004
    Location:
    Western US
    #2
    You might want to use regular expressions for this, using the Pattern and Matcher classes from java.util.regex. It's probably much more powerful than using the StringTokenizer class. I haven't personally used this in Java, but using regular expressions it's very easy to find, for example, one and only one character from a certain set. Or exactly two adjacent copies of the same character at the end of a line, or whatever.
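    As a rough illustration of the regular-expression idea for the "!=" problem, here is a minimal sketch. The class name, method name, and the reduced symbol set are made up for this example and are not from the posted code. It relies on one documented behavior of java.util.regex: alternatives separated by | are tried left to right, so the two-character symbols need to be listed before their one-character prefixes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the reduced symbol set are illustrative only.
class AlternationOrderDemo {
    // Two-character symbols are listed before the single characters they start with,
    // because alternatives are tried left to right and the first one that matches wins.
    static final Pattern LONGEST_FIRST = Pattern.compile("!=|==|<=|>=|&&|[!=<>]|\\w+");

    // The same alternatives with the single characters first, for comparison.
    static final Pattern SHORTEST_FIRST = Pattern.compile("[!=<>]|!=|==|<=|>=|&&|\\w+");

    static List<String> tokens(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        // find() moves past characters that no alternative matches (here: the spaces).
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens(LONGEST_FIRST, "ABC != D"));   // prints [ABC, !=, D]
        System.out.println(tokens(SHORTEST_FIRST, "ABC != D"));  // prints [ABC, !, =, D]
    }
}
```

    The second pattern shows the failure mode from the first post: with "!" tried first, "!=" comes out as two separate tokens.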
     
  3. lee1210 macrumors 68040

    lee1210

    Joined:
    Jan 10, 2005
    Location:
    Dallas, TX
    #3
    http://www.javaworld.com/jw-01-1997/jw-01-indepth.html
    The above is an article on lexical analysis in Java. It has examples of using the StringTokenizer to perform lexical analysis. There are also two lexical analyzer generators for Java, JLex and JFlex, though this is probably for a school project so you might not be able to leverage those.

    I would also recommend having a list that has all of your reserved words, another with operators, etc. Then your is_reserved_word routine, etc. can essentially just be a call to the list's contains method. It may be slightly slower, but I think it will be clearer and easier to add/remove reserved words. You can then use the indexOf method of the list to find the position in the list, and use that to get your token value.
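    The list-based lookup could look something like the sketch below. The class and method names are hypothetical, and the word order is an assumption: it is chosen to match the order of the if/else chain in the first post, on the guess that the ReservedWords enum (which was not posted) is declared in the same order.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the class and method names are not from the posted code.
class ReservedWordTable {
    // Assumed to be in the same order as the ReservedWords enum,
    // so that index + 1 equals ordinal() + 1.
    static final List<String> RESERVED_WORDS = Arrays.asList(
            "program", "begin", "end", "int", "if",
            "then", "else", "while", "read", "write");

    static boolean isReservedWord(String token) {
        return RESERVED_WORDS.contains(token);
    }

    // Returns the 1-based token value, or 0 when the token is not a reserved word
    // (indexOf returns -1 for a missing element).
    static int reservedWordValue(String token) {
        return RESERVED_WORDS.indexOf(token) + 1;
    }

    public static void main(String[] args) {
        System.out.println(isReservedWord("while"));     // prints true
        System.out.println(reservedWordValue("while"));  // prints 8
        System.out.println(reservedWordValue("ABC"));    // prints 0
    }
}
```

    Adding or removing a reserved word then means editing one list rather than two if/else chains. The same layout would work for the special symbols with a second list and an offset of 11.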

    -Lee

    P.S. Last edit, I promise. I don't think you want to tokenize on your operators. They just get eaten by tokenization, but you need them. I guess you'd know where they were if you tokenized on them, but that seems more confusing than just locating them.
     
  4. prostuff1 thread starter macrumors 65816

    prostuff1

    Joined:
    Jul 29, 2005
    Location:
    Don't step into the kawoosh...
    #4
    Thanks for the pointers you two.

    I did end up figuring it out. I had to use a statement like:
    Code:
    Pattern p = Pattern.compile("\\w+|\\s+|[;,()-]|\\+|\\*|\\!=|\\!|\\<=|\\>=|\\==|\\=|\\<|\\>|\\[|\\]|\\&&");
    
    		Matcher matcher = p.matcher(theFile);
    to get it to work the way I needed it to, but everything is tokenizing now and working peachy.

    I figured out that I could use split and got it to work, but lee1210 is right that it eats the delimiters. The Matcher class gets around that and it is exactly what I needed.
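    For anyone reading later, the Matcher approach described above can be sketched roughly as follows. This is not the exact code from the assignment: the class name and method name are invented, and the pattern is a simplified stand-in that lists the two-character symbols first and groups the single characters into one character class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the class name, method name, and simplified pattern are illustrative.
class MatcherTokenizerSketch {
    // Words, runs of whitespace, two-character symbols, then one-character symbols.
    // The two-character symbols come before the character class so that "!=" is not
    // read as "!" followed by "=".
    private static final Pattern TOKEN = Pattern.compile(
            "\\w+|\\s+|!=|==|<=|>=|&&|[;,()\\[\\]+*!=<>-]");

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(source);
        while (matcher.find()) {
            String token = matcher.group();
            // Whitespace runs are matched but not stored; symbols are kept as tokens.
            if (!token.trim().isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("while (ABC != D) begin"));
        // prints [while, (, ABC, !=, D, ), begin]
        System.out.println(tokenize("if (ABC > D) then ABC = ABC - D;"));
        // prints [if, (, ABC, >, D, ), then, ABC, =, ABC, -, D, ;]
    }
}
```

    Because every symbol is matched rather than split on, nothing is eaten: each operator shows up in the list as its own token.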

    The only thing I did end up running into was replacing tabs with spaces. That was easy enough to fix once I figured out where I made my mistake.

    Thanks again
     
  5. RaceTripper macrumors 68030

    Joined:
    May 29, 2007
    #5
    StringTokenizer is awful and always has been. Even the class Javadoc calls it a legacy class and discourages its use in new code.

    Use the split(regex) method. It's in the Pattern and String classes. It works like the Perl split function.
     
  6. prostuff1 thread starter macrumors 65816

    prostuff1

    Joined:
    Jul 29, 2005
    Location:
    Don't step into the kawoosh...
    #6
    StringTokenizer is fine if you are doing something simple, and it makes defining the tokens to split by easy. There is just no flexibility with it.

    I was originally working on split, but I figured out that the tokens to split by are not kept, which I needed (except for ones like spaces and tabs). So I had to use the Pattern and Matcher classes to split by the specified tokens but keep them.
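    A small sketch of the split behavior described above, using both String.split and Pattern.split. The class name, method names, and separator set are hypothetical and chosen only to make the point visible.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; names and the separator set are illustrative only.
class SplitDropsDelimiters {
    // One or more whitespace characters, commas, or semicolons act as a single separator.
    private static final String SEPARATORS = "[\\s,;]+";

    static String[] viaString(String input) {
        return input.split(SEPARATORS);
    }

    static String[] viaPattern(String input) {
        return Pattern.compile(SEPARATORS).split(input);
    }

    public static void main(String[] args) {
        // The comma and the semicolon are consumed as separators and never returned.
        System.out.println(Arrays.toString(viaString("int ABC, D;")));   // prints [int, ABC, D]
        System.out.println(Arrays.toString(viaPattern("int ABC, D;")));  // prints [int, ABC, D]
    }
}
```

    Both calls give the same result, and in both the "," and ";" are gone, which is why split alone does not work when the symbols themselves are tokens.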
     
  7. RaceTripper macrumors 68030

    Joined:
    May 29, 2007
    #7
    OK. I missed that point. Using the regex package sounds like your answer.

    FYI: one big problem with StringTokenizer is it doesn't support empty tokens (unless you consume delimiters and manage it yourself). If it finds successive delimiters it just throws the extra ones away, instead of giving you empty strings as tokens.
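    The empty-token difference can be seen with a short comparison. This is only a sketch with made-up class and method names; it uses the two-argument StringTokenizer constructor (delimiters not returned) and String.split with a limit of -1 so that trailing empty strings are kept as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; names are illustrative only.
class EmptyTokenDemo {
    static List<String> withStringTokenizer(String input) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, ",");
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    static List<String> withSplit(String input) {
        // A negative limit keeps every empty string, including trailing ones.
        return Arrays.asList(input.split(",", -1));
    }

    public static void main(String[] args) {
        System.out.println(withStringTokenizer("a,,b"));  // prints [a, b]
        System.out.println(withSplit("a,,b"));            // prints [a, , b]
    }
}
```

    StringTokenizer treats the two adjacent commas as one gap and returns two tokens, while split returns three, with an empty string in the middle.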
     
