compare text file app - global

lgnome · September 9, 2004 4:01PM

hi,

i just had a tab-delimited address list run through the postal service's address-check software.

what i received back from the program was a 'new' list of known good addresses.

i need to know if there is a piece of software that can compare the two docs and then output which items do not exist in both files... with this info i'll be able to delete the 'bad' addresses from my database.

i've tried bbedit's compare doc function, but it looks like it compares the docs line number by line number and not globally..

at the very least, if the program says, "found 'John' in file A 10x and 9x in file B", or "Found 'Peterson' in file A but not in file B".. i would have a good place to start.

thanks..

amorya · September 9, 2004 7:00PM

So you're saying that the addresses aren't in the same order? If that's true, then it could be tricky.

If they are in the same order, then there's an app that comes with Apple's developer tools called FileMerge that's quite good - although I've only ever used it for code...

Amorya

towel · September 9, 2004 7:23PM

Do you know any Perl? You could code a script in just a few minutes to do what you want.

Quote:

Originally posted by LGnome

at the very least, if the program says, "found 'John' in file A 10x and 9x in file B", or "Found 'Peterson' in file A but not in file B".. i would have a good place to start.

Grep does this. Open up Terminal, cd to wherever the two files are, and type "grep John file1". It will spit out all lines in file1 in which 'John' is found.
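To get closer to the "found 'John' in file A 10x and 9x in file B" style of report, grep's -c option prints a count of matching lines instead of the lines themselves (a line containing 'John' twice still counts once). A rough sketch; the count_in helper, the file names, and the sample addresses are all made up for illustration:

```shell
# Hypothetical helper: count the lines in a file that contain a fixed string.
#   -F  treat the search term as plain text, not a regular expression
#   -c  print the number of matching lines instead of the lines
# grep exits non-zero when the count is 0, so "|| true" keeps the exit status clean.
count_in() {
    grep -F -c -- "$1" "$2" || true
}

# Made-up sample data so the sketch runs on its own.
workdir=$(mktemp -d)
printf 'John Smith\t345 Another Way\nJohn Doe\t1 Main St\nJane Doe\t123 Some Street\n' > "$workdir/file1"
printf 'John Smith\t345 Another Way\nJane Doe\t123 Some Street\n' > "$workdir/file2"

echo "found 'John' in file1 $(count_in John "$workdir/file1")x and in file2 $(count_in John "$workdir/file2")x"
rm -rf "$workdir"
```

A count of 0 for one file and more than 0 for the other is the "in file A but not in file B" case.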

What's the exact format of your tab-delimited files? Each address on a different line, with components of the address separated by tabs? Like this?

Code:

Jane Doe<tab>123 Some Street<tab>SomeCity, NY 10001

John Smith<tab>345 Another Way<tab>AnotherCity, CA 99991

mcq · September 9, 2004 9:55PM

Maybe a combination of sort and diff (FileMerge if using Dev Tools) would work, but I don't know how quickly sort would work on such a large file.

Sort function:

http://www.ncl.ac.uk/ucs/unix/unixhelp/sort.html
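A minimal sketch of the sort-then-compare idea, assuming each address sits on its own line. It swaps in comm for diff, since comm on two sorted files can print just the lines unique to the first one; running diff on the same two sorted files would show similar information in diff's own format. The only_in_first helper and the sample data are made up for illustration:

```shell
# Hypothetical helper: print lines that are in the first file but not in the second.
# Both files are sorted first so that comm can line them up regardless of original order.
only_in_first() {
    sorted1=$(mktemp)
    sorted2=$(mktemp)
    # LC_ALL=C makes sort and comm agree on a plain byte-by-byte ordering.
    LC_ALL=C sort "$1" > "$sorted1"
    LC_ALL=C sort "$2" > "$sorted2"
    # comm -23 hides column 2 (lines only in the second file) and column 3 (lines in both).
    LC_ALL=C comm -23 "$sorted1" "$sorted2"
    rm -f "$sorted1" "$sorted2"
}

# Made-up sample data so the sketch runs on its own.
workdir=$(mktemp -d)
printf 'John Smith\t345 Another Way\nJohn Doe\t1 Main St\nJane Doe\t123 Some Street\n' > "$workdir/before"
printf 'Jane Doe\t123 Some Street\nJohn Smith\t345 Another Way\n' > "$workdir/after"

only_in_first "$workdir/before" "$workdir/after"
rm -rf "$workdir"
```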

towel · September 9, 2004 10:56PM

I timed myself; it took eight minutes, and that included typing up sample data.

This is a tiny Perl program that takes two files as inputs, finds any lines in file1 that aren't in file2, and prints the non-matching lines to the output (on screen and to a file).

Code:

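#!/usr/bin/perl
# comparelines.pl: print the lines of file1 that do not appear in file2.
use strict;
use warnings;

die "Usage: $0 file1 file2 outputfile\n" unless @ARGV == 3;
my ($inputfile1, $inputfile2, $outputfile) = @ARGV;

# Read a file and return a hash whose keys are its lines (newlines removed).
sub read_lines {
    my ($path) = @_;
    my %lines;
    open(my $in, '<', $path) or die "Can't open $path: $!\n";
    while (<$in>) {
        chomp;
        $lines{$_} = 0;
    }
    close $in;
    return %lines;
}

my %input1lines = read_lines($inputfile1);
my %input2lines = read_lines($inputfile2);

# '>>' appends, so delete an old output file before re-running.
open(my $out, '>>', $outputfile) or die "Can't open output file $outputfile: $!\n";

# Hash keys come back in no particular order, so the output is unordered too.
foreach (keys %input1lines) {
    unless (exists $input2lines{$_}) {
        print $out "$_\n";
        print STDOUT "No match for $_\n";
    }
}
close $out;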

First, copy the above code into a text file (in BBEdit, for example) and save it with some name. I called it "comparelines.pl". Next you have to make it executable. Open the Terminal, cd to the folder in which you saved the program, and type:

Code:

chmod u+x comparelines.pl

Finally, put your "before" and "after" address files into the same folder. Two points of caution: first, make sure all your files have Unix line endings, including the program - you can verify this in BBEdit under Save As/Options, or in SubEthaEdit under Format/Line Endings. Second, this will only match lines if they are exactly alike. So if the address-checking software rearranged the formats or expanded abbreviations, for example, you'd need a more complicated program to find the matches (probably by matching only names, instead of the whole address).
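Both cautions can be handled from the Terminal as well. The sketch below is only an illustration, not part of comparelines.pl: to_unix_endings and names_missing are made-up helper names, it assumes the name is the first tab-separated field on each line, and it assumes blank lines carry no meaning in an address list (they get dropped during the line-ending conversion).

```shell
# Hypothetical helper: write a file to stdout with Unix (LF) line endings.
# Every CR becomes LF, which turns CRLF into an empty line; sed then drops empty lines.
to_unix_endings() {
    tr '\r' '\n' < "$1" | sed '/^$/d'
}

# Hypothetical helper: print the names (first tab-separated field) that appear
# in the first file but not in the second, ignoring the rest of the address.
names_missing() {
    awk -F '\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # first argument: the "after" file
        !($1 in seen)       { print $1 }             # second argument: the "before" file
    ' "$2" "$1"
}

# Made-up sample data: Jane's address was reformatted, John was dropped.
workdir=$(mktemp -d)
printf 'Jane Doe\t123 Some Street\tSomeCity, NY 10001\r\nJohn Smith\t345 Another Way\tAnotherCity, CA 99991\r\n' > "$workdir/before.dos"
printf 'Jane Doe\t123 Some St\tSomeCity, NY 10001\n' > "$workdir/after"

to_unix_endings "$workdir/before.dos" > "$workdir/before"
names_missing "$workdir/before" "$workdir/after"
rm -rf "$workdir"
```

Matching on the name alone will treat two different people with the same name as one, so it is only a starting point.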

You run the program by typing its name in the Terminal, along with three file names - the before and after files, and whatever name you want to give the output. Like so:

Code:

./comparelines.pl BeforeAddresses AfterAddresses OutputFile

That's it. It'll spit out on screen any non-matching lines, as well as writing them to the output file.
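For comparison, roughly the same exact-line check can be done with grep alone. This is a different technique from the Perl script, shown only as a sketch; lines_not_in and the sample files are made up for illustration. Unlike the script, it keeps the original order of the first file and prints repeated lines each time they occur.

```shell
# Hypothetical helper: print every line of the first file that has no exact twin in the second.
#   -F  patterns are plain strings, not regular expressions
#   -x  a pattern must match the whole line
#   -v  print the lines that do NOT match
#   -f  read the patterns, one per line, from the named file
# grep exits non-zero when nothing is printed, so "|| true" keeps the exit status clean.
lines_not_in() {
    grep -F -x -v -f "$2" "$1" || true
}

# Made-up sample data so the sketch runs on its own.
workdir=$(mktemp -d)
printf 'Jane Doe\t123 Some Street\nJohn Smith\t345 Another Way\n' > "$workdir/BeforeAddresses"
printf 'Jane Doe\t123 Some Street\n' > "$workdir/AfterAddresses"

lines_not_in "$workdir/BeforeAddresses" "$workdir/AfterAddresses"
rm -rf "$workdir"
```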

I love Perl. "Make the easy things easy and the hard things possible." This is definitely an example of making the easy things easy.

lgnome · September 10, 2004 4:45PM

Quote:

Originally posted by Towel

I timed myself; it took eight minutes, and that included typing up sample data.

YIZERS!! ask and ye shall receive..

super cool.. well thanks for all the help.. i'm going to try this out Monday when i get back to work.. i'll let you all know the progress..
