Asked Saturday, January 7, 2023, 10:20:58

I have a big txt file in which values repeat many times. Is there a command that will go through the file and, once a value has appeared, not print it again?



SO4
HOH
CL
BME
HOH
SO4
HOH
CL
BME
HOH
SO4
HOH
SO4
HOH
CL
BME
HOH
SO4
HOH
CL
BME
HOH
CL


So it should look something like this:



SO4
HOH
CL
BME


The thing is that I have a huge number of different values, so I can't do it manually like here.



 Answers

You could use the command sort with the option --unique:



sort -u input-file


If you want to write the result to FILE instead of standard output, use the option --output=FILE (short form -o FILE):



sort -u input-file -o output-file
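If the deduplicated result should replace the original file, the same option can point back at the input. The following is only a sketch on a throwaway file: the temporary file and its contents are made up for illustration.

```shell
# Work on a throwaway file so nothing real is overwritten (hypothetical data).
tmpfile=$(mktemp)
printf 'SO4\nHOH\nCL\nHOH\nSO4\n' > "$tmpfile"

# -o may name the input file itself: sort reads all of its input before it
# opens the output file. Placing -o before the file name keeps the command
# portable beyond GNU sort.
sort -u -o "$tmpfile" "$tmpfile"

inplace_result=$(cat "$tmpfile")
printf '%s\n' "$inplace_result"
rm -f "$tmpfile"
```

Note that a plain shell redirection to the same file (sort -u file > file) does not work this way, because the shell truncates the file before sort gets to read it.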





The command uniq could also be used. In this case identical lines must be consecutive, so the input must be sorted first (thanks to @RonJohn for this note):



sort input-file | uniq > output-file
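To see why the sorting step matters, here is a small sketch with a made-up four-line sample in which the duplicates are never adjacent. uniq on its own compares each line only with the one before it.

```shell
# Hypothetical sample: SO4 and HOH each appear twice, but never on adjacent lines.
sample='SO4
HOH
SO4
HOH'

# uniq alone only collapses adjacent duplicates, so nothing is removed here.
unsorted_result=$(printf '%s\n' "$sample" | uniq)

# Sorting first makes identical lines adjacent, so uniq can collapse them.
sorted_result=$(printf '%s\n' "$sample" | sort | uniq)

printf 'uniq only:\n%s\n\nsort | uniq:\n%s\n' "$unsorted_result" "$sorted_result"
```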





I like the sort command for cases like this because of its simplicity, but if you work with large files the awk approach from John1024's answer can be considerably faster. Here is a time comparison of the approaches mentioned, applied to a large file (based on the above example) whose line count is printed by the first command:



$ cat input-file | wc -l
20000000

$ TIMEFORMAT=%R
$ time sort -u input-file | wc -l
64
7.495

$ time sort input-file | uniq | wc -l
64
7.703

$ time awk '!a[$0]++' input-file | wc -l # from John1024's answer
64
1.271

$ time datamash rmdup 1 < input-file | wc -l # from αғsнιη's answer
64
0.770
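The awk one-liner is terse, so here is a sketch of what it does, spelled out in a longer form. The array name seen and the shortened sample are only illustrative. It also shows a difference the timings do not: awk keeps lines in the order of their first appearance, which matches the desired output in the question, while sort -u returns them in sorted order.

```shell
# First lines of the question's sample data.
sample='SO4
HOH
CL
BME
HOH
SO4'

# Long form of awk '!a[$0]++': print a line only the first time it is seen.
# seen[$0] is unset (compares equal to 0) on first sight, so the line is
# printed; the increment afterwards makes every later occurrence fail the test.
awk_result=$(printf '%s\n' "$sample" | awk '{ if (seen[$0] == 0) print; seen[$0]++ }')

# sort -u yields the same set of lines, but in sorted order.
sort_result=$(printf '%s\n' "$sample" | sort -u)

printf 'awk:\n%s\n\nsort -u:\n%s\n' "$awk_result" "$sort_result"
```

If the original order of the values matters, the awk form is likely the better fit; if it does not, either approach gives the same set of lines.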


Another significant difference is the one mentioned by @Ruslan:




sort -u will only print the result once the input has ended, while
this awk command will print each new result line on the fly (this
may be more important for piped input than for a file).




As an illustration, the loop shown below generates 500 random combinations, each three characters long, of the letters A-D. These combinations are piped to awk or sort.



for i in {1..500}; do cat /dev/urandom | tr -dc A-D | head -c 3; echo; done

Answered Monday, January 9, 2023
wheance
