
I want to find the number of unique words in my file named cdj.tsv. I can use head -n 1 cdj.tsv to get the first line. Now I want the number of unique words in that line. How can I get that?
The result of the command head -n 1 cdj.tsv looks like:


Country China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   China   Denmark Denmark Denmark Denmark Denmark DenmarkDenmark  Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark DenmarkDenmark  Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark DenmarkDenmark  Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark DenmarkDenmark  Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark Denmark DenmarkDenmark  Denmark

So, I want the output to be 3 (for Country, China and Denmark).


Thanks



Answers

One simple way:



  • Get the first line from a file with head -n 1 cdj.tsv (you already know that), or from multiple files by name like this: head -q -n 1 cdj.tsv file2.tsv file3.tsv; the -q suppresses printing the extra headers/file names so that only the lines from the files are printed. You can use globbing (*) with the input file names as well, like this: head -q -n 1 *.tsv, to process all files with a .tsv extension in the current directory. For example:
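
    head -n 1 cdj.tsv                          # first line of one file
    head -q -n 1 cdj.tsv file2.tsv file3.tsv   # first lines of several named files, no headers
    head -q -n 1 *.tsv                         # first lines of all .tsv files here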



  • Then, pipe (|) that to tr -s ' ' '\n' to put each word on its own line, i.e. process them one at a time, as in the example below. (Notice: many alternative tools can do the same thing in this step, even the much less efficient xargs -n 1; the answer by @Peter Cordes is worth reading in this regard.)
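
    head -n 1 cdj.tsv | tr -s ' ' '\n'   # squeeze each run of spaces into a single newline

    Since the file is tab-separated, the first line may well contain tabs rather than
    spaces; in that case widen the first set, e.g. tr -s ' \t' '\n' (GNU tr pads the
    second set by repeating its last character).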



  • Then, pipe that to sort -u to sort the words and print each unique word only once, like so:
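
    head -n 1 cdj.tsv | tr -s ' ' '\n' | sort -u

    As a variation (not part of this answer's pipeline), sort | uniq -c also shows how
    many times each word occurs, with the count prefixed to each line:

    head -n 1 cdj.tsv | tr -s ' ' '\n' | sort | uniq -c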



  • Then, to get both the words themselves and their count, pipe that to tee with wc -l in process substitution syntax >(wc -l), and put both in a subshell (...) to group the output, like so:


    head -q -n 1 *.tsv | tr -s ' ' '\n' | sort -u | (tee >(wc -l))


  • The result from your example will look like this (the run-together DenmarkDenmark token in your pasted sample counts as a word of its own, which is why the total here is 4 rather than the expected 3):


    China
    Country
    Denmark
    DenmarkDenmark
    4
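
    If you only need the number itself, drop the tee and pipe straight into wc -l:

    head -q -n 1 *.tsv | tr -s ' ' '\n' | sort -u | wc -l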



Another, faster way, with awk or gawk:



  • Start a word (field) counter with an initial value of 1 (i=1), limit its maximum value to the number of available fields (i<=NF), and increment it by 1 for each new field (i++), putting all of that in an awk for control statement: for ( i=1; i<=NF; i++ ).



  • Then, for each field, check whether the line being processed is the first line in the file (NR==1) and, if so, whether the current field (word) has not occurred before, adding it to an array as a side effect of the test (!seen[$i]++); when both are true, print it with print $i. Put all of that in an awk if control statement inside an action group {...}: { if ( NR==1 && !seen[$i]++ ) print $i }.



  • Then, print the total number of unique words (fields) with print length(seen), placed in a separate action group after the conditional END pattern: END { print length(seen) }.



  • Then, use it on a single input file like so:


    awk '{ for ( i=1; i<=NF; i++ ) { if ( NR==1 && !seen[$i]++ ) print $i }} END { print length(seen) }' cdj.tsv


  • Or use it on multiple input files with FNR==1 instead of NR==1 like so:


    awk '{ for ( i=1; i<=NF; i++ ) { if ( FNR==1 && !seen[$i]++ ) print $i }} END { print length(seen) }' *.tsv


  • Or use it on multiple input files with <(head -q -n 1 *.tsv) and without NR==1 or FNR==1 like so:


    awk '{ for ( i=1; i<=NF; i++ ) { if ( !seen[$i]++ ) print $i }} END { print length(seen) }' <(head -q -n 1 *.tsv)
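
    Two caveats worth hedging: length(seen) on an array works in gawk and mawk but is
    an extension that some awk implementations lack, and the commands above keep
    scanning past the first line even though it is the only one that matters. A sketch
    of a portable single-file variant that keeps its own counter and exits early (the
    END block still runs after exit):

    awk 'NR==1 { for ( i=1; i<=NF; i++ ) if ( !seen[$i]++ ) { print $i; n++ } }
         NR>1  { exit }                    # nothing after line 1 matters here
         END   { print n }' cdj.tsv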


