
I want to find the number of unique words in my file named cdj.tsv. I can use head -n 1 cdj.tsv to get the first line. Now I want the number of unique words in that line. How can I get that?
The result of the command head -n 1 cdj.tsv looks like:


Country China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   Denmark Denmark Denmark Denmark Denmark DenmarkDenmark  Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark DenmarkDenmark  Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark DenmarkDenmark  Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark DenmarkDenmark  Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark DenmarkDenmark  Denmark

So, I want the output to be 3 (for Country, China and Denmark).


Thanks



Answers

One simple way:



  • Get the first line from a file with head -n 1 cdj.tsv (you already know that), or from multiple files by name like this: head -q -n 1 cdj.tsv file2.tsv file3.tsv; the -q suppresses printing the extra headers/file names so that only the lines from the files are printed. You can use globbing (*) with the input file names as well, like this: head -q -n 1 *.tsv, to process all files with a .tsv extension in the current directory. For example:
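
    head -n 1 cdj.tsv                          # first line of one file
    head -q -n 1 cdj.tsv file2.tsv file3.tsv   # first lines of several named files, no headers
    head -q -n 1 *.tsv                         # first lines of all .tsv files here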



  • Then, pipe (|) that to tr -s ' ' '\n' to put each word on its own line, i.e. process them one at a time, as in the example below. (Notice: many alternative tools can do the same thing in this step, even the much less efficient xargs -n 1; the answer by @Peter Cordes is worth reading in this regard.)
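
    head -n 1 cdj.tsv | tr -s ' ' '\n'   # squeeze each run of spaces into a single newline

    Since the file is tab-separated, the first line may well contain tabs rather than
    spaces; in that case widen the first set, e.g. tr -s ' \t' '\n' (GNU tr pads the
    second set by repeating its last character).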



  • Then, pipe that to sort -u to sort the words and print each unique word only once, like so:
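
    head -n 1 cdj.tsv | tr -s ' ' '\n' | sort -u

    As a variation (not part of this answer's pipeline), sort | uniq -c also shows how
    many times each word occurs, with the count prefixed to each line:

    head -n 1 cdj.tsv | tr -s ' ' '\n' | sort | uniq -c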



  • Then, to get both the words themselves and their count, pipe that to tee with wc -l in process substitution syntax >(wc -l), and put both in a subshell (...) to group the output, like so:


    head -q -n 1 *.tsv | tr -s ' ' '\n' | sort -u | (tee >(wc -l))


  • The result from your example will look like this (the run-together DenmarkDenmark token in your pasted sample counts as a word of its own, which is why the total here is 4 rather than the expected 3):


    China
    Country
    Denmark
    DenmarkDenmark
    4
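
    If you only need the number itself, drop the tee and pipe straight into wc -l:

    head -q -n 1 *.tsv | tr -s ' ' '\n' | sort -u | wc -l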



Another, faster way, with awk or gawk:



  • Start a word (field) counter with an initial value of 1 (i=1), limit its maximum value to the number of available fields (i<=NF), and increment it by 1 for each new field (i++), putting all of that in an awk for control statement: for ( i=1; i<=NF; i++ ).



  • Then, for each field, check whether the line being processed is the first line in the file (NR==1) and, if so, whether the current field (word) has not occurred before, adding it to an array as a side effect of the test (!seen[$i]++); when both are true, print it with print $i. Put all of that in an awk if control statement inside an action group {...}: { if ( NR==1 && !seen[$i]++ ) print $i }.



  • Then, print the total number of unique words (fields) with print length(seen), placed in a separate action group after the conditional END pattern: END { print length(seen) }.



  • Then, use it on a single input file like so:


    awk '{ for ( i=1; i<=NF; i++ ) { if ( NR==1 && !seen[$i]++ ) print $i }} END { print length(seen) }' cdj.tsv


  • Or use it on multiple input files with FNR==1 instead of NR==1 like so:


    awk '{ for ( i=1; i<=NF; i++ ) { if ( FNR==1 && !seen[$i]++ ) print $i }} END { print length(seen) }' *.tsv


  • Or use it on multiple input files with <(head -q -n 1 *.tsv) and without NR==1 or FNR==1 like so:


    awk '{ for ( i=1; i<=NF; i++ ) { if ( !seen[$i]++ ) print $i }} END { print length(seen) }' <(head -q -n 1 *.tsv)
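
    Two caveats worth hedging: length(seen) on an array works in gawk and mawk but is
    an extension that some awk implementations lack, and the commands above keep
    scanning past the first line even though it is the only one that matters. A sketch
    of a portable single-file variant that keeps its own counter and exits early (the
    END block still runs after exit):

    awk 'NR==1 { for ( i=1; i<=NF; i++ ) if ( !seen[$i]++ ) { print $i; n++ } }
         NR>1  { exit }                    # nothing after line 1 matters here
         END   { print n }' cdj.tsv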


