How to edit a **LARGE** data file ?

Posted:
in Mac Software edited January 2014
I have a large text file containing about 1.5 million of lines, tabulated into 8 columns of integers. I can't open this file in Excel or OpenOffice. I simply want to erase the first two columns and the last three ones. How can I do that ? I suspect I could do it with the Terminal app, using some UNIX commands, but I don't know which ones. Any idea ?



The data is typicaly of this type :



1 31 182 207782 251702 497161 155968 177129

2 37 242 179448 251159 456757 482649 51000

3 33 263 58414 183974 286273 394303 359364

4 150 435 62905 383035 150833 39433 65673

5 59 226 126530 17941 368627 127627 20542

6 71 298 113108 307803 62563 184296 540070



As I said, simply want to remove the first two columns, and the last three columns of data, keeping only the third, fourth and fifth columns. I can't do this in Excel since the limit is about 65000 lines or so. My file contains 1.5 million of lines !

Comments

  • Reply 1 of 14
    MarvinMarvin Posts: 15,326moderator
    You can use a hex editor to do this if it's just plain formatting.



    Hexfiend



    http://ridiculousfish.com/hexfiend/



    or 0xED



    http://www.suavetech.com/0xed/0xed.html
  • Reply 2 of 14
    kalikali Posts: 634member
    Quote:
    Originally Posted by Marvin View Post


    You can use a hex editor to do this if it's just plain formatting.



    Hexfiend



    http://ridiculousfish.com/hexfiend/



    or 0xED



    http://www.suavetech.com/0xed/0xed.html



    Hmmm, this is interesting. Thanks for the suggestion. But how do I use the HexEditor to remove the first and second columns, for example ? This isn't obvious to me.
  • Reply 3 of 14
    MarvinMarvin Posts: 15,326moderator
    Quote:
    Originally Posted by Kali View Post


    Hmmm, this is interesting. Thanks for the suggestion. But how do I use the HexEditor to remove the first and second columns, for example ? This isn't obvious to me.



    Ah I see what you're trying to do, The Hex editor won't be good for doing that. You can use a pretty basic script loop that will do this though:



    Code:


    import sys

    import os

    import string



    if (len(sys.argv)<3):

    >print "error"

    >sys.exit()



    filein = open(sys.argv[1],'r')

    fileout = open(sys.argv[2],'w')



    for line in filein:

    >lnsplit = line.split(" ")

    >output = " ".join(lnsplit[2:-3])

    >fileout.write(output+"\

    ")



    filein.close()

    fileout.close()









    Paste that into Textedit - you will need to indent the code correctly by replacing the > at the beginning of the lines with a tab. Save it as columns.py and don't append .txt on the end.



    Open up the terminal and type:



    python <drag columns.py in here> <drag data file here> <drag data file in again but change the name>



    When you drag the files in, you will have something like:



    python /downloads/columns.py /documents/data.txt /documents/data_modified.txt



    Then hit return and wait for the output.
  • Reply 4 of 14
    Here is a one-liner that you could also use:

    Code:


    cat path_to_data | awk '{ print $3 "\" $4 "\" $5 }' > path_to_putput



  • Reply 5 of 14
    kalikali Posts: 634member
    Quote:
    Originally Posted by Karl Kuehn View Post


    Here is a one-liner that you could also use:

    Code:


    cat path_to_data | awk '{ print $3 "\" $4 "\" $5 }' > path_to_putput





    OH GOD ! This worked like a charm ! Simply perfect ! This is what I was looking for. Thanks a LOT !
  • Reply 6 of 14
    floorjackfloorjack Posts: 2,726member
    I was going to call for sed



    >man sed
  • Reply 7 of 14
    kickahakickaha Posts: 8,760member
    Quote:
    Originally Posted by FloorJack View Post


    I was going to call for sed



    >man sed



    You *sadist*!
  • Reply 8 of 14
    kalikali Posts: 634member
    Okay, this brings me to some other question with UNIX in the Terminal app. Is it possible to use the output file and compute some maths like sinus and cosinus ? For example, suppose the data looks like this (columns 1 and 2 are now angles in degrees) :



    3.2380553818938\t-0.1145133041870\t0.00386

    3.2380553818938\t-0.1145133041870\t0.00386

    3.2380553818938\t-0.1145133041870\t0.00386

    3.2380553818938\t-0.1145133041870\t0.00386

    3.2380553818938\t-0.1145133041870\t0.00386



    Then can we compute sin(column1)*cos(column2) and make a new file with the output as a new column ?
  • Reply 9 of 14
    I'd advise you to take a look at Igor pro. It's really great for that kind of data-crunching.
  • Reply 10 of 14
    kalikali Posts: 634member
    More questions on editing ASCII files using UNIX commands in the Terminal app :



    How can we merge the data from two files with several columns and the same number of lines ?
  • Reply 11 of 14
    For that sort of thing you are going to have to create a small program. The languages I would use would be either perl or python. The former was written for this sort of thing, but the latter has become my language of choice, generically. Depending on exactly what you want done this could be a little as 10 lines or so, and would lay out about this way:
    • read in the first source file, line by line, storing the data in a hash keyed on the column you are merging on

    • read in the second file, line by line, and write each one out to our output file with the additional information from the first file

    • if you need to know the unused lines from the first file remove them from the hash when you use them

    • if needed, write out anything left in the hash

  • Reply 12 of 14
    Quote:
    Originally Posted by Kali View Post


    Okay, this brings me to some other question with UNIX in the Terminal app. Is it possible to use the output file and compute some maths like sinus and cosinus ? For example, suppose the data looks like this (columns 1 and 2 are now angles in degrees) :



    Then can we compute sin(column1)*cos(column2) and make a new file with the output as a new column ?



    This too can be tackled by awk:

    Code:


    cat path_to_input | awk '{ print $1 "\" $2 "\" $3 "\" sin($1*atan2(0, -1)/180)*cos($2*atan2(0, -1)/180) }' > path_to_output







    This is about as complicated as I would want to go with awk, it does extend further than even this. I would usually go with a small perl or python program at this point because it is easier to understand 3 months later.
  • Reply 13 of 14
    kalikali Posts: 634member
    Quote:
    Originally Posted by Karl Kuehn View Post


    This too can be tackled by awk:

    Code:


    cat path_to_input | awk '{ print $1 "\" $2 "\" $3 "\" sin($1*atan2(0, -1)/180)*cos($2*atan2(0, -1)/180) }' > path_to_output







    Thanks a lot again !



    And do you know how to add a column from another file, using awk ? I mean suppose you have a file "File1" with three columns, and want to add a fourth column from the data in column 6 of file "File2" ? Is this possible ? (I'm not talking about a simple concatenation of both files, since this operation wont add a fourth column).
  • Reply 14 of 14
    kalikali Posts: 634member
    Another trick with the Terminal app :



    Suppose we have a large data file containing three columns of numbers, and we want to keep one line of data after each three lines. How can we do that ? More explicitely, we have something like this :



    12 32 45

    11 33 55

    22 21 25

    45 43 67

    48 43 56

    33 47 78

    32 45 98

    32 47 88

    21 23 67

    ... etc ...



    and we want to keep these lines only :



    12 32 45

    45 43 67

    32 45 98

    ... etc ...



    Is there a way, using the Terminal app only ?
Sign In or Register to comment.