python optimization

Discussion in 'Mac Programming' started by samwich, Nov 4, 2013.

  1. samwich, Nov 4, 2013
    Last edited: Nov 4, 2013

    samwich macrumors regular

    Joined:
    Aug 5, 2007
    #1
    Hey there, looking for some ideas to optimize this script. It recursively searches in the specified directory and prints out the most recent accessed files and most recent modified files. Please _don't_ tell me to write it in C or use pypy or something, I'd like to try to keep this in plain python.

    Code:
    import os
    import stat
    import datetime as dt
    import argparse
    from pprint import pprint
     
     
    def print_files(num_files, directory):
        """
        gets a list of files sorted by modified time
     
        keyword args:
        num_files -- the n number of files you want to print
        directory -- the starting root directory of the search
     
        """
        modified = []
        accessed = []
        rootdir = os.path.join(os.getcwd(), directory)
     
        for root, sub_folders, files in os.walk(rootdir):
            for file in files:
                try:
                    unix_modified_time = os.stat(os.path.join(root, file))[stat.ST_MTIME]
                    unix_accessed_time = os.stat(os.path.join(root, file))[stat.ST_ATIME]
                    human_modified_time = dt.datetime.fromtimestamp(unix_modified_time).strftime('%Y-%m-%d %H:%M:%S')
                    human_accessed_time = dt.datetime.fromtimestamp(unix_accessed_time).strftime('%Y-%m-%d %H:%M:%S')
                    filename = os.path.join(root, file)
                    modified.append((human_modified_time, filename))
                    accessed.append((human_accessed_time, filename))
                except:
                    pass
     
        modified.sort(key=lambda a: a[0], reverse=True)
        accessed.sort(key=lambda a: a[0], reverse=True)
        print('Modified')
        pprint(modified[:num_files])
        print('Accessed')
        pprint(accessed[:num_files])
     
     
    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument('-n',
                            '--number',
                            help='number of items to return',
                            type=int,
                            default=1)
        parser.add_argument('-d',
                            '--directory',
                            help='specify a directory to search in',
                            default='./')
     
        args = parser.parse_args()
     
        print_files(args.number, args.directory)
     
    if __name__ == '__main__':
        main()
     
  2. mrichmon, Nov 5, 2013
    Last edited: Nov 5, 2013

    mrichmon macrumors 6502a

    Joined:
    Jun 17, 2003
    #2
    You don't state what you are looking to optimize for. I assume you mean execution time, but many times people need to optimize for things other than execution time. (eg, memory usage)

    In general for time-optimization, you should consider profiling your code to determine where the bulk of the time is spent. You can easily profile by introducing time calculations and then gradually narrowing down to the main time costs.

    Based on my experience, your primary time costs are likely due to:

    • sorting your lists
    • the memory allocation while building the lists

    But since you are only interested in the most recent time, why are you building the list and sorting them at all? How about an approach that just uses two variables:

    • most recent modification time found so far (most_recent_time)
    • modification time of current file (current_time)

    If the most_recent_time is None, then most_recent_time = current time.
    Else if current_time > most_recent_time, then current_time = most_recent_time and most_recent_file = current_filename.

    Also, there is no reason to convert a time to human-readable format until you've found the time that you want to report. Just by moving the time conversion out of the loop and eliminating the conversion until immediately before the print statement will significantly reduce the number of computations required.
     
  3. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #3
    A common optimization is possible here. Instead of a special "None" initial value, initialize most_recent_time with an "impossible" value that compares less than any possible value that will be encountered during the scan. For Unix-times, a common starting value is 0, which represents the Unix epoch time of midnight 01 Jan 1970 GMT.

    The result is that a separate test for "None" is no longer necessary, so the conditional in the loop simplifies to:
    Code:
    If current_time > most_recent_time, then current_time = most_recent_time and most_recent_file = current_filename.
    
    I don't recall the name of this technique, but it's kind of the opposite of a sentinel value:
    http://en.wikipedia.org/wiki/Sentinel_value
     
  4. mrichmon macrumors 6502a

    Joined:
    Jun 17, 2003
    #4
    Fair enough. Since python ints are unsigned it can be easily initialized to an invalid time value of "-1". For the same effect.

    In my text I was aiming for clarity since the original poster has implemented code where a whole lot of unnecessary work is performed repeatedly in the loop. But I am certainly making many assumptions here.

    Also, the original code is printing the n most recently modified files. So a single variable approach will not solve the stated problem. But, an extension to the single variable approach that only maintains a list of at most n timestamps is possible at the cost of a significant increase in the amount of code needed.
     
  5. subsonix macrumors 68040

    Joined:
    Feb 2, 2008
    #5
    Why? It seems to be the same amount of computations, you have just postponed them to a later time.
     
  6. mrichmon macrumors 6502a

    Joined:
    Jun 17, 2003
    #6
    Multiple reasons:
    • The code as posted builds up a list of times and then outputs only a portion of that list in the case where the number of files in the folders is larger than num_files. So, for all the files that are not output to the user a relatively expensive conversion from an integer to a DateTime object is performed but does not contribute to the program output.
    • Comparison of complex data types such as the DateTime objects being compared in the sort operations is generally more expensive than comparison of integers. One of the reasons why Unix epoc times are recorded as integers is to lower the cost of time comparisons.

    By postponing the operations you can reduce the overall number that are necessary. For example, assume 1000 files in the subfolders where n is 10. In the original posted code there are 2000 calls to instantiate a DateTime object. If the lists are built using the integer time value and the DateTime objects are instantiated after the sort operations during the output processing, then only the 20 output time values need to be instantiated as DateTime objects. That's a reduction of 1980 object instantiation operations in this example.
     
  7. subsonix macrumors 68040

    Joined:
    Feb 2, 2008
    #7
    Right, I missed that num_files was an argument to the print function.


    Yes but that is not a direct function of moving the conversions out of the loop but storing the dates as unix time stamps, which I agree with.
     
  8. samwich thread starter macrumors regular

    Joined:
    Aug 5, 2007
    #8
    Thanks! Yes, I was wanting to optimize for execution time. Good catch - I will definitely keep instantiate the datetime objects outside the loop.

    I'll also look into maintaing a compact array instead of one large array
     

Share This Page