Java Notes: Example - WordFrequency

This program reads files of words and lists their frequencies. A file of words to ignore can also be supplied. This example illustrates the use of Java 5's generic data structures (Sets, Maps, and ArrayList), regular expressions (as used by Scanner), and Comparator.

Exercises. A command-line interface is wrtten, but an interesting exercise is to write a GUI interface. You could also think about sorting alphabetically.

A text command-line interface

This command line program can be run with the following command.

   java WordFrequencyCmd textFile ignoreFile
  1 
  2 
  3 
  4 
  5 
  6 
  7 
  8 
  9 
 10 
 11 
 12 
 13 
 14 
 15 
 16 
 17 
 18 
 19 
 20 
 21 
 22 
 23 
 24 
 25 
 26 
 27 
 28 
 29 
 30 
 31 
 32 
 33 
 34 
 35 
 36 
 37 
 38 
 39 
 40 
 41 
 42 
 43 
 44 
 45 
 46 
 47 
 48 
// File   : wordfreq/WordFrequencyCmd.java
// Purpose: Main program to test WordFrequency
// Author : Fred Swartz
// Date   : March 2005

import java.io.*;
import java.util.*;

/** Prints word frequency in source file. Ignores words in ignore file.
 * Uses Sets, Maps, ArrayList, regular expressions, BufferedReader.
 * @author Fred Swartz
 * @version 2005-03-02
 */
class WordFrequencyCmd {
    //========================================================= main
    /** All code is put into main */
    public static void main(String[] args) {
        //-- Expects two file names on run command.
        if (args.length != 2) {
            System.out.println("Usage: java WordFrequency inputFile ignoreFile");
            System.exit(1);
        }
        
        try {
            //-- Supply two files to WordFrequency constructor..
            WordFrequency wf = new WordFrequency();
            wf.ignoreFile(new File(args[1]));
            wf.processFile(new File(args[0]));
            
            //-- Get the results.
            int n = wf.getEntryCount();
            ArrayList<String>  wrds      = new ArrayList<String>(n);
            ArrayList<Integer> frequency = new ArrayList<Integer>(n);
            wf.getWordFrequency(wrds, frequency);
            
            //-- Print the results.
            for (int i=0; i<n; i++) {
                System.out.println(frequency.get(i) + " " + wrds.get(i));
            }
            
            System.out.println("\nNumber of source words: " + wf.getWordCount());
            System.out.println("\nNumber of unique words: " + n);
            
        } catch (IOException iox) {
            System.out.println(iox);
        }
    }
}//end class

The "Model" -- Word frequency without any user interface

The model (the logic of the program without any user interface) is implemented primarily in the WordFrequency class, which uses two utility classes: CompareByFrequency to use in sorting, and MutableInteger to record the frequency counts.

  1 
  2 
  3 
  4 
  5 
  6 
  7 
  8 
  9 
 10 
 11 
 12 
 13 
 14 
 15 
 16 
 17 
 18 
 19 
 20 
 21 
 22 
 23 
 24 
 25 
 26 
 27 
 28 
 29 
 30 
 31 
 32 
 33 
 34 
 35 
 36 
 37 
 38 
 39 
 40 
 41 
 42 
 43 
 44 
 45 
 46 
 47 
 48 
 49 
 50 
 51 
 52 
 53 
 54 
 55 
 56 
 57 
 58 
 59 
 60 
 61 
 62 
 63 
 64 
 65 
 66 
 67 
 68 
 69 
 70 
 71 
 72 
 73 
 74 
 75 
 76 
 77 
 78 
 79 
 80 
 81 
 82 
 83 
 84 
 85 
 86 
 87 
 88 
 89 
 90 
 91 
 92 
 93 
 94 
 95 
 96 
 97 
 98 
 99 
100 
101 
102 
103 
104 
105 
106 
107 
108 
// File   : wordfreq/WordFrequency.java
// Purpose: Provides methods to read ingore file, input file, get results.
//          Computes the frequency for each word.
// Author : Fred Swartz
// Date   : 2005-03-02

import java.io.*;
import java.util.*;
import java.util.regex.*;

/** Computes word frequency in source file; ignores words in ignore file.
 * Uses generic Sets, Maps, ArrayLists, regular expressions, Scanner.
 * @author Fred Swartz
 * @version 2005-03-02
 */
public class WordFrequency {
    //-- Instance variables.
    Set<String>                 m_ignoreWords;   // Words to ignore.
    Map<String, MutableInteger> m_wordFrequency; // Words -> frequency
    int                         m_totalWords;    // Total source words.
    
    
    //============================================================== constructor
    /** Constructor */
    public WordFrequency() {
        m_ignoreWords   = new HashSet<String>();
        m_wordFrequency = new HashMap<String, MutableInteger>();
        m_totalWords = 0;
    }
    
    //=============================================================== ignoreFile
    /** Reads file of words to ignore. Ignore words are stored in a Set.
     *  The IOException is passed to caller because we certinaly don't
     *  know what the user interface issue is.
     *@param ignoreFile File of words to ignore.
     */
    public void ignoreFile(File ignoreFile) throws IOException {
        Scanner ignoreScanner = new Scanner(ignoreFile);
        ignoreScanner.useDelimiter("[^A-Za-z]+");
        
        while (ignoreScanner.hasNext()) {
            m_ignoreWords.add(ignoreScanner.next());
        }
        ignoreScanner.close();  // Close underlying file.
    }
    
    
    //============================================================== processFile
    /** Record the frequency of words in the source file.
     *  May be called more than once.
     *  IOException is passed to caller, who might know what to do with it.
     *@param File of words to process.
     */
    public void processFile(File sourceFile) throws IOException {
        Scanner wordScanner = new Scanner(sourceFile);
        wordScanner.useDelimiter("[^A-Za-z]+");
        
        while (wordScanner.hasNext()) {
            String word = wordScanner.next();
            m_totalWords++;
            if (!m_ignoreWords.contains(word)) {
                MutableInteger value = m_wordFrequency.get(word);
                if (value == null) {    // Create new entry with count of 1.
                    m_wordFrequency.put(word, new MutableInteger(1));
                } else {                // Increment existing count by 1.
                    value.inc();
                }
            }
        }
        wordScanner.close();  // Close underlying file.
    }
        
    //============================================================= getWordCount
    /** Returns number of words in the soure file(s).
     *@return total number of words proccessed in all source files.
     */
    public int getWordCount() {
        return m_totalWords;
    }
        
    /** Returns the number of unique, non-ignored words, in the source file(s).
     *  This number should be used to for the size of the arrays that are
     *  passed to getWordFrequency.
     *@return Number of unique non-ignored source words.
     */
    public int getEntryCount() {
        return m_wordFrequency.size();
    }
        
    /** Stores words and their corresponding frequencies in the parallel array
     *  parameters.  The frequencies are sorted from low to high.
     *  The size of the arrays must be at least getEntryCount().
     * @param words Unique words that were found in the source file(s).
     * @param counts Frequency of words at corresponding index in words array.
     */
    public void getWordFrequency(ArrayList<String> out_words, ArrayList<Integer> out_counts) {
        //... Put in ArrayList so sort entries by frequency
        ArrayList<Map.Entry<String, MutableInteger>> entries;
        entries = new ArrayList<Map.Entry<String, MutableInteger>>(m_wordFrequency.entrySet());
        Collections.sort(entries, new CompareByFrequency());
        
        //... Add word and frequency to parallel output ArrayLists.
        for (Map.Entry<String, MutableInteger> ent : entries) {
            out_words.add(ent.getKey());
            out_counts.add(ent.getValue().intValue());
        }
    }
}

Comparator

The key-value (Map.Entry) pairs in a HashMap are not sorted, but they need to be sorted based on the frequency. Because sorts are based on comparison of two values at a time, all that is needed is to supply a way to compare two values. That's what a Comparator does -- it defines a compareTo() method that tells how two values compare.

  1 
  2 
  3 
  4 
  5 
  6 
  7 
  8 
  9 
 10 
 11 
 12 
 13 
 14 
 15 
 16 
 17 
 18 
 19 
 20 
 21 
 22 
 23 
 24 
 25 
 26 
 27 
// Fred   : wordfreq/CompareByFrequency.java
// Purpose: A comparator to sort Map.Entries by frequency.
// Author : Fred Swartz
// Date   : 2005-03-02

import java.util.*;

/////////////////////////////////////////////// class CompareByFrequency
/** For ordering words from least to most freqent.
 * If frequency is equal, words are put in alphabetical order.
 */
class CompareByFrequency implements Comparator<Map.Entry<String, MutableInteger>> {
    public int compare(Map.Entry<String, MutableInteger> obj1
                     , Map.Entry<String, MutableInteger> obj2) {
        int c1 = obj1.getValue().intValue();
        int c2 = obj2.getValue().intValue();
        if (c1 < c2) {
            return -1;
            
        } else if (c1 > c2) {
            return 1;
            
        } else { // If counts are equal, compare keys alphabetically.
            return obj1.getKey().compareTo(obj2.getKey());
        }
    }
}

MutableInteger

Because only objects (not primitives) can be stored in Collections data structures, the integer count must be in an object. The natural choice would be the Integer wrapper type, but it is immutable, which means that once an Integer object is created, its value can never be changed. But here we need to update the value. So here's a class that stores an int that can be changed.

  1 
  2 
  3 
  4 
  5 
  6 
  7 
  8 
  9 
 10 
 11 
 12 
 13 
 14 
 15 
 16 
 17 
 18 
 19 
 20 
 21 
 22 
 23 
 24 
 25 
 26 
 27 
// File   : wordfreq/MutableInteger.java
// Purpose: To hold a mutable int in a class.
// Author : Fred Swartz
// Date   : 2005-03-02

/////////////////////////////////////////////////// utility class MutableInteger
/** Utility class to keep int count because Java's data
 * structures hold only Objects, not basic types and Integer is immutable.
 */
class MutableInteger {
    private int m_value;
    
    /** Constructor */
    public MutableInteger(int value) {
        m_value = value;
    }
    
    /** Return int value. */
    public int intValue() {
        return m_value;
    }
    
    /** Increment value */
    public void inc() {
        m_value++;
    }
}