stat

This is a quick program I wrote to test the various algorithms I've written for this page. It is not a real statistical analysis program, in fact, it only checks the distribution of bits. It is enough to make sure your program is producing a decent pseudo-random sequence which does not tend to produce more 1's, for instance, or has a tendency to produce long strings of 0's, but certainly does not guarantee that your generator is secure.

Running

    This program takes ascii text input, which should consist entirely of the characters '1' and '0'. It also needs a command line argument specifying the 'dimension' -- the length of bit sequences to do analysis on. For instance, a dimension of 1 would give statistics on the number of 1's versus the number of 0's, a dimension of 2 would give information on all of the possible two-bit combinations 00, 01, 10, 11. Dimension 3 is 000 001 010 011 100 101 110 111, etc.

    Since it uses the console for input, this program is best used using your operating system's redirection. In DOS and unix, the pipe is handy:

    fselect microbiologist 10000 | stat 2

    Though redirection is useful as well:

    fselect microbiologist 1000000 > test
    stat 6 < test | more

    The program will produce output like this:

    00: 12534 :*******************
    01: 12465 :*******************
    10: 12453 :*******************
    11: 12548 :********************
    

    Or maybe this:

    000000: 276 :*****************
    000001: 269 :*****************
    000010: 281 :*****************
    000011: 249 :***************
    000100: 285 :******************
    000101: 274 :*****************
    000110: 258 :****************
    000111: 260 :****************
    001000: 282 :******************
    001001: 231 :**************
    001010: 226 :**************
    001011: 265 :****************
    001100: 237 :***************
    001101: 294 :******************
    001110: 264 :****************
    001111: 252 :****************
    010000: 242 :***************
    010001: 235 :***************
    010010: 217 :*************
    010011: 264 :****************
    010100: 279 :*****************
    010101: 271 :*****************
    010110: 237 :***************
    010111: 275 :*****************
    011000: 253 :****************
    011001: 263 :****************
    011010: 313 :********************
    011011: 240 :***************
    011100: 261 :****************
    011101: 261 :****************
    011110: 268 :*****************
    011111: 263 :****************
    100000: 247 :***************
    100001: 250 :***************
    100010: 274 :*****************
    100011: 288 :******************
    100100: 255 :****************
    100101: 227 :**************
    100110: 243 :***************
    100111: 240 :***************
    101000: 255 :****************
    101001: 277 :*****************
    101010: 274 :*****************
    101011: 251 :****************
    101100: 260 :****************
    101101: 235 :***************
    101110: 262 :****************
    101111: 271 :*****************
    110000: 261 :****************
    110001: 282 :******************
    110010: 241 :***************
    110011: 237 :***************
    110100: 268 :*****************
    110101: 267 :*****************
    110110: 271 :*****************
    110111: 272 :*****************
    111000: 294 :******************
    111001: 254 :****************
    111010: 262 :****************
    111011: 253 :****************
    111100: 262 :****************
    111101: 251 :****************
    111110: 260 :****************
    111111: 277 :*****************
    

    Or, if your algorithm is poor, something like this:

    00: 310 :*******************
    01: 50 :***
    10: 60 :***
    11: 323 :********************
    

    Programmers should have no difficulty converting this program to take other sorts of input, indeed, they should have no difficulty rewriting the program entirely.

Improvements

    When doing bit lengths of more than 1, it does not overlap the sequences, that is:
    00101011001101001010111010
    Is broken into the 3-bit segments
    001 010 110 011 010 010 101 110
    and not:
    001 010 101 010 101 011 110 100 001 011 110 101 010 100 001 010 101 010 101 011 111 110 101 010
    I think this is a good idea, since the second repeats bits and seems like it normally wouldn't have a regular distribution anyway. However, adding an option for this might help catch some undesirable patterns.

    This program could calculate a standard deviation and check to see if the distribution is within bounds. This may come in a future version.

Code

This is the code, which you can also just download here. It should have no difficulty compiling on any ansi capable compiler.

/*
   C Implementation of Tom 7's 'stat' program.
   This code is distributed under the GNU public license;
   see http://www.gnu.org/copyleft/gpl.html.

   This takes one parameter on the command line which is the 'dimension'
   of bits to check for distribution in.

     Tom 7

   http://tom7.org/

*/

#include <stdio.h>

main (int argc, char ** argv) {

     int * data, len,n,score,x,c;

     if (argc<2) {
          printf("%s length\n",argv[0]);
          exit(-1);
     }
     len = atoi(argv[1]);

     if (!(data = (int*)malloc((1<<len)*sizeof (int)))) {
          printf("No memory.\n");
          exit(-1);
     }

     for (n=0;n<(1<<len);n++) data[n]=0;

     while (1) {
          n=len;
          score=0;
          while (n--) {
                    if (EOF == (c=getc(stdin))) goto gone;
                    score |= (c-'0')<<n;
                    }
          data[score]++;
     }
     gone:
     score=0;
     for (n=0;n<(1<<len);n++) { if (score < data[n]) score=data[n]; }
     for (n=0;n<(1<<len);n++) {
          for (x=len;x--;) putchar ('0' + ((n&(1<<x)?1:0)   ));
          printf (": %d :", data[n]);
          for (x=(20.*(data[n]/(float)score));x--;) putchar('*');
          putchar('\n');
     }
     free (data);
}

Back to Tom's Cryptography Thingie.
Back to Tom's CMU Page.