Friday, May 17, 2024
Asked Thursday, August 12, 2021, 7:29:07

I need to count the number of unique values based on two columns in a spreadsheet.



Suppose the file looks like this, with the columns in the order name, surname, company:



joe allen ibm
joe smith ibm
joe allen google
joe smith google
rachel allen google


And I need to count the number of unique first names for each company while ignoring the surname:



joe ibm 2
joe google 2
rachel google 1


I have this code:



sort file.tsv | uniq -ci | awk '{print $2,$1}'


If I simply delete the surname column, that code works. But what if I don't want to delete that column, and instead want awk to ignore it and save the output to a new file?



The data is separated by tabs



 Answers
1

A GNU awk solution using two-dimensional arrays:



gawk -F $'\t' '{a[$1][$3]++} END {for (i in a) for (j in a[i]) print i, j, a[i][j]}' foo.txt



  • a[$1][$3]++: for each combination of first name ($1) and company ($3), increment the count. The surname ($2) is never referenced, so it is ignored.

  • Then loop through the first names and the company names associated with each first name.
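
As a minimal sketch of the array-of-arrays approach applied to the sample rows from the question, the following keeps the surname column in the input, writes tab-separated output, and saves it to a new file. It assumes gawk 4.0 or later is installed; the file names sample.tsv and counts_gawk.tsv are made up for illustration.

```shell
# Recreate the question's sample rows as a tab-separated file (hypothetical name).
printf 'joe\tallen\tibm\njoe\tsmith\tibm\njoe\tallen\tgoogle\njoe\tsmith\tgoogle\nrachel\tallen\tgoogle\n' > sample.tsv

# Arrays of arrays are a gawk extension, so only run this when gawk is available.
if command -v gawk >/dev/null 2>&1; then
    # Column 2 (surname) is never referenced, so it is ignored without being deleted.
    # OFS is set to a tab so the output stays tab-separated.
    gawk -F '\t' -v OFS='\t' '
        { count[$1][$3]++ }
        END {
            for (name in count)
                for (company in count[name])
                    print name, company, count[name][company]
        }
    ' sample.tsv > counts_gawk.tsv
fi
```

The order in which "for (... in ...)" visits the elements is not specified, so pipe the result through sort if a stable order matters.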



Another way, which also works with other awks, uses the older form of multidimensional arrays:



awk -F $'\t' '{a[$1, $3]++} END{for (i in a) {split (i, sep, SUBSEP); print sep[1], sep[2], a[i]}}' foo.txt



  • Since the old method actually uses a concatenation of the indices separated by SUBSEP, we have to split on SUBSEP to get back the original indices.
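
To make the SUBSEP point concrete, here is a small sketch that should run in any POSIX awk. It first checks that a[$1, $3] and a[$1 SUBSEP $3] refer to the same element, then runs the portable count with tab-separated output, sorted and saved to a new file. The file names (sample.tsv, key_check.txt, counts.tsv) are hypothetical.

```shell
# Recreate the question's sample rows as a tab-separated file (hypothetical name).
printf 'joe\tallen\tibm\njoe\tsmith\tibm\njoe\tallen\tgoogle\njoe\tsmith\tgoogle\nrachel\tallen\tgoogle\n' > sample.tsv

# a[x, y] is stored under the single string key x SUBSEP y,
# so the membership test below succeeds after the increment.
printf 'joe\tallen\tibm\n' |
    awk -F '\t' '{ a[$1, $3]++; if (($1 SUBSEP $3) in a) print "same key" }' > key_check.txt

# Portable count: split each combined key back into first name and company.
# LC_ALL=C makes the sort order independent of the current locale.
awk -F '\t' -v OFS='\t' '
    { count[$1, $3]++ }
    END {
        for (key in count) {
            split(key, part, SUBSEP)
            print part[1], part[2], count[key]
        }
    }
' sample.tsv | LC_ALL=C sort > counts.tsv
```

With the sample rows, counts.tsv should end up with three tab-separated lines: joe/google/2, joe/ibm/2 and rachel/google/1.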


Answered Friday, August 13, 2021
ainubt
