Question

Count the number of unique values based on two columns in a spreadsheet

Asked Thu, August 12, 2021, 7:29:07

I need to count the number of unique values based on two columns in a spreadsheet.

Suppose the file looks like this, with the columns in the order name, surname, company:

joe allen ibm

joe smith ibm

joe allen google

joe smith google

rachel allen google

And I need to count how many times each first name occurs for each company, ignoring the surname:

joe ibm 2

joe google 2

rachel google 1

I have this code:

sort file.tsv | uniq -ci | awk '{print $2,$1}'

If I simply delete the surname column, that code works. But what if I don't want to delete that column, and instead want awk to ignore it and save the output to a new file?

The data is separated by tabs.

Answers

answered 3 Years ago eatack · Accepted Answer

A GNU awk solution using two-dimensional arrays:

gawk -F $'\t' '{a[$1][$3]++} END {for (i in a) for (j in a[i]) print i, j, a[i][j]}' foo.txt

a[$1][$3]++ increments the count for each combination of first name ($1) and company ($3); the surname ($2) is never referenced, so it is ignored.

Then loop through the first names and the company names associated with each first name.

Another way, which works in other awks, uses the older form of multidimensional arrays:

awk -F $'\t' '{a[$1, $3]++} END{for (i in a) {split (i, sep, SUBSEP); print sep[1], sep[2], a[i]}}' foo.txt

Since the old method actually uses a concatenation of the indices separated by SUBSEP, we have to split on SUBSEP to get back the original indices.
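
To try the portable form on the sample data from the question and save the result to a new file, a minimal sketch could look like this. The file names people.tsv and counts.tsv are placeholders, not names from the question. The sort step and the tab output separator (OFS) are additions: awk visits the keys in a for (key in array) loop in an unspecified order, so sorting makes the output stable.

```shell
# Build the sample input from the question (tab-separated).
# people.tsv and counts.tsv are placeholder file names.
printf 'joe\tallen\tibm\njoe\tsmith\tibm\njoe\tallen\tgoogle\njoe\tsmith\tgoogle\nrachel\tallen\tgoogle\n' > people.tsv

# Count rows per (first name, company); column 2 (surname) is never used.
# -F '\t' splits input on tabs, OFS='\t' keeps the output tab-separated.
awk -F '\t' -v OFS='\t' '
    { count[$1, $3]++ }
    END {
        for (key in count) {
            split(key, part, SUBSEP)   # recover the two original indices
            print part[1], part[2], count[key]
        }
    }
' people.tsv | LC_ALL=C sort > counts.tsv   # sort only to get a stable order

cat counts.tsv
```

Redirecting with > at the end of the pipeline is what writes the counts to the new file; the input file is left untouched.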