Generating data with "regex" in Java

Discussion in 'Mac Programming' started by SilentPanda, Apr 14, 2010.

  1. SilentPanda Moderator emeritus

    Joined:
    Oct 8, 2002
    Location:
    The Bamboo Forest
    #1
    I need a way for users to specify how data is generated. For instance, they might need a last name made up of 4 to 16 alphabetic characters, or an internal number that has to match "X\d{4}400".

    My thought was to write a reverse-regex type of thing. They would define the data with a regex and it would generate data from it. Some aspects would not be supported, such as +, *, and ?, mostly because they wouldn't make sense in the application. Primarily it would support (|), {x[,y]}, [], that kind of stuff. I know it's not really regular expressions, but it uses that notation as a base.

    Is this the wrong way to go about it? Honestly the users probably won't use anything super in depth as none of them know much about regex. However they do need to be somewhat specific in their data definitions and I felt this would be a way they could do that.

    Just kinda doing a sanity check before I drive myself insane... :p Heck there might already be an alternative library out there that does what I need... I just can't find one.
     
  2. mags631 Guest

    Joined:
    Mar 6, 2007
    #2
    Is the only input the string format? Or will last names, numeric ids, etc. be passed?
    Is the function to generate the next literal?
    Do you need to guarantee uniqueness?

    My immediate thought was regex substitution... but it will depend on the requirements.
     
  3. SilentPanda thread starter Moderator emeritus

    Joined:
    Oct 8, 2002
    Location:
    The Bamboo Forest
    #3
    The users will need to come up with the notation and then my class will generate a random sampling of data based on the notation. The data will be alpha, numeric, alphanumeric, and mixed (symbols). The data may need to be unique, for instance if they want 500 pieces of data generated.
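
    A minimal sketch of the uniqueness requirement, assuming any function that returns one random value: keep drawing until enough distinct values are collected, and stop after a bounded number of attempts so a pattern that is too narrow cannot loop forever. The names `generate_unique`, `make_value`, `max_attempts`, and `random_ssn` are made up for illustration.

```python
import random


def generate_unique(make_value, count, max_attempts=None):
    """Collect `count` distinct values by calling `make_value` repeatedly.

    Raises ValueError if the attempt budget runs out first, which happens
    when the generator cannot produce `count` different strings.
    """
    if max_attempts is None:
        max_attempts = count * 100  # arbitrary budget, an assumption
    seen = set()
    ordered = []
    for _ in range(max_attempts):
        if len(ordered) == count:
            break
        value = make_value()
        if value not in seen:
            seen.add(value)
            ordered.append(value)
    if len(ordered) < count:
        raise ValueError("could not generate %d unique values" % count)
    return ordered


def random_ssn():
    # Stand-in generator for illustration: nine random digits.
    return "".join(random.choice("0123456789") for _ in range(9))


ids = generate_unique(random_ssn, 500)
```

    Tracking seen values in a set keeps each duplicate check cheap, which matters once the request grows toward 100,000 records.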

    Right now they have their data definitions written in a Word document as "I need a last name, it's only alpha characters, and the min length is 5 and the max length is 14". Or, "I need a social security number, it's 9 numbers long". Or, "I need a unique internal ID which is 5 alpha characters followed by 3 numbers and ends with a Z".

    So when I make thousands of records for them, it would be much easier for them to say:

    Last Name - [A-Za-z]{5,14}
    SSN - [0-9]{9}
    Internal ID - [A-Z]{5}[0-9]{3}Z

    and do that 100,000 times.
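
    The three definitions above only need character classes, literal characters, and a {n} or {m,n} count. A hedged sketch of a generator for just that subset could look like the following; `TOKEN`, `expand_class`, and `generate` are hypothetical names, not part of any existing library.

```python
import random
import re

# One token: a [...] class or a single literal character, optionally
# followed by {n} or {m,n}.
TOKEN = re.compile(r"(\[[^\]]+\]|[^\[\]{}])(?:\{(\d+)(?:,(\d+))?\})?")


def expand_class(body):
    """Turn the inside of a class such as 'A-Za-z' into a list of characters."""
    chars = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range like A-Z: add every character from start to end.
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def generate(pattern):
    """Generate one random string for the restricted notation."""
    out = []
    pos = 0
    while pos < len(pattern):
        match = TOKEN.match(pattern, pos)
        if match is None:
            raise ValueError("unsupported notation at position %d" % pos)
        atom, low, high = match.groups()
        choices = expand_class(atom[1:-1]) if atom.startswith("[") else [atom]
        low = int(low) if low is not None else 1
        high = int(high) if high is not None else low
        # Repeat the atom a random number of times within its bounds.
        for _ in range(random.randint(low, high)):
            out.append(random.choice(choices))
        pos = match.end()
    return "".join(out)


print(generate("[A-Z]{5}[0-9]{3}Z"))
```

    Grouping, alternation, and escapes are deliberately left out of this sketch.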
     
  4. lee1210 macrumors 68040

    Joined:
    Jan 10, 2005
    Location:
    Dallas, TX
    #4
    This may be a little more difficult, but I'd be inclined to have them pick the specification from a series of dropdowns, etc. that show them something in English. For example:
    Dropdown 1:
    Alpha
    Numerical
    Symbol
    Alphanumerical
    Alpha and Symbol
    Numerical and Symbol
    Any
    Specified Set

    If specified set is chosen, have a text entry Field 1 where they can enter the characters.

    Dropdown 2, minimum number of this character type

    Dropdown 3, maximum number of this character type

    Field 2, description of this data

    They pick, then choose "Add to specification", and you build up a full specification from any number of these individual character groupings. You can generate a description in English of what they've chosen, with a small (5ish) sample of what will be generated.

    Once they're ready to submit, you can store this however you want in the background. If you really want to, you could display this to the user and allow a "shortcut entry" if they know the syntax you're using in the background.
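
    As a rough sketch of that flow, assuming each "Add to specification" click yields a (kind, minimum, maximum) selection, the stored notation and the English preview can both be derived from the same list. `CLASSES`, `to_notation`, and `to_english` are hypothetical names, and only three of the dropdown entries are mapped here.

```python
# Character sets behind some of the dropdown entries (an assumption).
CLASSES = {
    "Alpha": "A-Za-z",
    "Numerical": "0-9",
    "Alphanumerical": "A-Za-z0-9",
}


def to_notation(groups):
    """Build the stored notation from (kind, minimum, maximum) selections."""
    parts = []
    for kind, low, high in groups:
        count = "{%d}" % low if low == high else "{%d,%d}" % (low, high)
        parts.append("[%s]%s" % (CLASSES[kind], count))
    return "".join(parts)


def to_english(groups):
    """Describe the same selections in plain English for the preview."""
    phrases = []
    for kind, low, high in groups:
        size = str(low) if low == high else "%d to %d" % (low, high)
        phrases.append("%s %s characters" % (size, kind.lower()))
    return ", then ".join(phrases)


internal_id = [("Alpha", 5, 5), ("Numerical", 3, 3)]
print(to_notation(internal_id))  # [A-Za-z]{5}[0-9]{3}
print(to_english(internal_id))   # 5 alpha characters, then 3 numerical characters
```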

    I guess if your users are super-technical you could make them enter a seemingly random string of gibberish, but that seems pretty mean if they are not also programmers.

    -Lee
     
  5. SilentPanda thread starter Moderator emeritus

    Joined:
    Oct 8, 2002
    Location:
    The Bamboo Forest
    #5
    That's a good point. They'll primarily be putting the data notation into an Excel spreadsheet, which my application will then interpret. But I could at least offer a UI for some of the easier and more common things they will be doing. It would then build the encoded string for them to paste into their Excel document. Most of their data is probably going to be [A-Z]{x,y} and [0-9]{x,y} anyway. The reason we're going a bit further is for those fields that do require a little bit more... oomph. There will probably be a few of these per spreadsheet.
     
  6. mags631 Guest

    Joined:
    Mar 6, 2007
    #6
    I don't think you should use regex (well maybe to parse the rules). The output function should decompose a literal into stems, with individual rules for stems. E.g., here is a simple Python version:
    Code:
    import random
    
    
    class Stem:
        def __init__(self, constant_stem=None, valid_chars=None, min_length=0, max_length=0):
            '''valid_chars is a string of the characters a random stem may contain'''
            self.constant_stem = constant_stem
            self.valid_chars = valid_chars
            self.min_length = min_length
            self.max_length = max_length
    
        def generate(self):
            # A constant stem is returned unchanged.
            if self.constant_stem is not None:
                return self.constant_stem
            # Otherwise pick a random length, then a random character for each position.
            length = random.randint(self.min_length, self.max_length)
            return u''.join(random.choice(self.valid_chars) for i in range(length))
    
    
    class Literal:
        def __init__(self, stem_defs):
            # Each definition is a tuple of Stem constructor arguments.
            self.stems = [Stem(*stem_def) for stem_def in stem_defs]
    
        def generate(self):
            # Concatenate one generated value from each stem, in order.
            return u''.join(stem.generate() for stem in self.stems)
    
        def generateTimes(self, number):
            # Generate `number` independent literals (duplicates are possible).
            return [self.generate() for i in range(number)]
    
    And it generates:
    Code:
    >>> reload(RandomLiteral)
    <module 'RandomLiteral' from 'RandomLiteral.py'>
    >>> ssn_literal = RandomLiteral.Literal([
    ... (None, "0123456789", 3, 3),
    ... ("-",),
    ... (None, "0123456789", 2, 2),
    ... ("-",),
    ... (None, "0123456789", 4, 4)
    ... ])
    >>> ssn_literal.generate()
    u'972-59-7621'
    >>> ssn_literal.generateTimes(100)
    [u'383-48-5897', u'249-65-8404', u'709-43-4150',  ....]
    >>> 
    
     
  7. mrbash macrumors 6502

    Joined:
    Aug 10, 2008
    #7
    Panda: I don't believe there is an easy way for you to do this. The reverse process, pattern -> string, is generally non-deterministic. Even a simple pattern like .3* matches any number of different strings.

    I think you'll probably have to start off with some simplifying assumptions.
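
    One such simplifying assumption is to allow only bounded repetition. The set of matching strings is then finite, and its size can be worked out ahead of time, which also shows whether a request for N unique values can be satisfied at all. The sketch below uses a hypothetical `count_strings` helper; the result is exact when every stem has a fixed length and is otherwise an upper bound, because different stem lengths can add up to the same string.

```python
def count_strings(stems):
    """Count the strings a sequence of (alphabet_size, min, max) stems can produce."""
    total = 1
    for alphabet_size, low, high in stems:
        # A stem of length n contributes alphabet_size ** n possibilities.
        total *= sum(alphabet_size ** n for n in range(low, high + 1))
    return total


# [A-Z]{5}[0-9]{3}Z: five of 26 letters, three of 10 digits, one fixed character.
internal_id = [(26, 5, 5), (10, 3, 3), (1, 1, 1)]
print(count_strings(internal_id))  # 11881376000, i.e. 26**5 * 10**3
```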
     
  8. SilentPanda thread starter Moderator emeritus

    Joined:
    Oct 8, 2002
    Location:
    The Bamboo Forest
    #8
    Well I finished this up yesterday. It works pretty well for my purposes. It supports escaping certain characters with \, nested parenthesis grouping mostly for "OR" statements, "OR" statements with the |, character classes with the [], and explicit ranges with {}.
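
    The actual class was never posted, so the following is only a hedged sketch of how those features could fit together in a small recursive-descent parser. It is not the author's implementation, the `PatternGenerator` name and its methods are made up, and treating \d as "any digit" is an assumption based on the X\d{4}400 example.

```python
import random


class PatternGenerator:
    """Random strings for escapes, (a|b) groups, [classes], and {m,n} counts."""

    def __init__(self, pattern):
        self.pattern, self.pos = pattern, 0
        self.root = self._alternation()
        if self.pos != len(pattern):
            raise ValueError("unexpected %r at %d" % (pattern[self.pos], self.pos))

    def generate(self):
        return self.root()

    def _peek(self):
        return self.pattern[self.pos] if self.pos < len(self.pattern) else ""

    def _next(self):
        if self.pos >= len(self.pattern):
            raise ValueError("unexpected end of pattern")
        self.pos += 1
        return self.pattern[self.pos - 1]

    def _alternation(self):
        # branch ('|' branch)*: pick one branch at random per generated string.
        branches = [self._sequence()]
        while self._peek() == "|":
            self._next()
            branches.append(self._sequence())
        return lambda: random.choice(branches)()

    def _sequence(self):
        items = []
        while self._peek() not in ("", "|", ")"):
            atom = self._atom()
            items.append((atom,) + self._quantifier())
        return lambda: "".join(
            atom() for atom, low, high in items
            for _ in range(random.randint(low, high)))

    def _atom(self):
        char = self._next()
        if char == "(":
            inner = self._alternation()  # recursion handles nested groups
            if self._next() != ")":
                raise ValueError("expected ')'")
            return inner
        if char == "[":
            chars = self._char_class()
            return lambda: random.choice(chars)
        if char == "\\":
            char = self._next()
            if char == "d":  # assumption: \d means any digit
                return lambda: random.choice("0123456789")
            return lambda: char
        if char in "+*?{}]":
            raise ValueError("unsupported %r" % char)
        return lambda: char

    def _char_class(self):
        chars = []
        while self._peek() != "]":
            char = self._next()
            if char == "\\":
                char = self._next()
            elif self._peek() == "-" and self.pattern[self.pos + 1:self.pos + 2] not in ("", "]"):
                self._next()  # skip the '-' of a range such as A-Z
                chars.extend(chr(c) for c in range(ord(char), ord(self._next()) + 1))
                continue
            chars.append(char)
        self._next()  # consume ']'
        if not chars:
            raise ValueError("empty character class")
        return chars

    def _quantifier(self):
        if self._peek() != "{":
            return 1, 1
        self._next()
        text = ""
        while self._peek() != "}":
            text += self._next()
        self._next()  # consume '}'
        low, _, high = text.partition(",")
        return int(low), int(high) if high else int(low)


print(PatternGenerator(r"(Mr|Ms)\. [A-Z][a-z]{3,8}").generate())
```

    Parsing once into nested closures means a quantified group such as (a|b){3} makes a fresh random choice on every repetition.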

    I coded things fairly close to "spec" where possible, even when it wasn't needed. For example, I had no real need to support escaping the + operator, since + itself isn't supported. But in the odd event that I do need to implement it in the future and it makes sense, it shouldn't be as big of a deal...

    Came out to about 300ish lines of code for the class and about 350 for all my JUnit tests... first time I've used JUnit but I'm very happy with it. I had to overhaul something in the middle of coding and it was nice to be able to run my tests to ensure I hadn't broken anything!
     
  9. lee1210 macrumors 68040

    Joined:
    Jan 10, 2005
    Location:
    Dallas, TX
    #9
    Does this code belong to your employer? If not, can you post it for the benefit of others?

    -Lee
     
  10. SilentPanda thread starter Moderator emeritus

    Joined:
    Oct 8, 2002
    Location:
    The Bamboo Forest
    #10
    It does... I had thought about posting it but it's not "mine"... bleah. It's not terribly complex and I would actually be up for posting it otherwise for people to beat up on how inefficient it is and how they'd do it this other way instead... :p

    Actually I like that kind of stuff as it lets you learn... :(
     
  11. macsmurf macrumors 65816

    Joined:
    Aug 3, 2007
    #11
    One way of doing it would be to translate the regexp to a finite automaton (directed graph) and then travel through it backwards from an accept state.
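
    A minimal sketch of that idea, assuming the automaton has already been built, contains no cycles (no * or +), and every state is reachable from a start state with no incoming edges. The automaton below is written by hand for the hypothetical pattern (ab|c)[0-9]Z, and `generate_backwards` is an illustrative name.

```python
import random

# Each edge is (source state, allowed characters, target state).
EDGES = [
    (0, "a", 1),
    (1, "b", 2),
    (0, "c", 2),
    (2, "0123456789", 3),
    (3, "Z", 4),
]
START, ACCEPT = 0, 4


def generate_backwards(edges, start, accept):
    """Walk from the accept state back to the start state, then reverse."""
    # Index the edges by target so each state knows how it can be entered.
    incoming = {}
    for source, chars, target in edges:
        incoming.setdefault(target, []).append((source, chars))
    state = accept
    collected = []
    while state != start:
        source, chars = random.choice(incoming[state])
        collected.append(random.choice(chars))
        state = source
    # Characters were collected last to first, so reverse them.
    return "".join(reversed(collected))


print(generate_backwards(EDGES, START, ACCEPT))
```

    Picking edges uniformly does not make every matching string equally likely; weighting each edge by the number of strings behind it would be needed for that.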

    I don't know if that is what you have done.
     
