Asked: Thursday, June 9, 2022

I keep my references in a text file: a long list of entries, each with two (or more) fields.

The first field is the reference's URL; the second is the title, which may vary slightly depending on how the entry was made. The same goes for the third field, which may or may not be present.

I want to identify, but not remove, entries whose first field (the reference URL) is identical. I know about sort -k1,1 -u, but that will automatically (non-interactively) remove all but the first hit. Is there a way to just flag them so I can choose which to retain?
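For instance, on a made-up three-line file (the names here are purely illustrative), sort -k1,1 -u quietly drops all but one line per first field:

```shell
# Sample data (hypothetical): two entries share the first field "url-a".
printf '%s\n' \
  'url-a Title one' \
  'url-a Title one sort, CLI' \
  'url-b Title two' > refs.txt

# -k1,1 compares only the first field; -u then keeps a single line per key,
# without asking which of the url-a lines you would rather retain.
sort -k1,1 -u refs.txt
```

Only one of the two url-a lines survives, and you do not get to pick which one.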

In the extract below, three lines have the same first field. I would like to keep line 2, because it has additional tags (sort, CLI), and delete lines #1 and #3:

unique-lines-based-on-the-first-field Unique lines based on the first field
unique-lines-based-on-the-first-field Unique lines based on the first field sort, CLI
unique-lines-based-on-the-first-field Unique lines based on the first field

Is there a program to help identify such "duplicates"? Then I can clean up manually by deleting lines #1 and #3.

Tagged: command-line


If I understand your question, you need something like:

for dup in $(cut -d ' ' -f1 file.txt | sort | uniq -d); do grep -n -- "^$dup " file.txt; done

where file.txt is the file containing the data you are interested in. Note that uniq -d only reports adjacent duplicates, which is why the list of first fields is piped through sort first; anchoring the grep pattern with ^ and a trailing space keeps one URL from matching inside a longer one.

In the output you will see the line number and the full text of every line whose first field occurs two or more times, so you can decide for yourself which copies to keep.
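If the file is large, a two-pass awk one-liner does the same job without a shell loop. This is a sketch using a hypothetical file.txt: the first pass counts first fields, the second prints only lines whose first field occurs more than once, with their line numbers:

```shell
# Sample data (hypothetical): lines 1 and 3 share the same first field.
printf '%s\n' \
  'url-a Title one' \
  'url-b Title two sort, CLI' \
  'url-a Title one extra' > file.txt

# Pass 1 (NR==FNR): count occurrences of field 1 in an array.
# Pass 2: print line number (FNR) and text for duplicated first fields.
awk 'NR==FNR {c[$1]++; next} c[$1] > 1 {print FNR": "$0}' file.txt file.txt
# → 1: url-a Title one
# → 3: url-a Title one extra
```

Unlike uniq -d, this does not require the file to be sorted, and it preserves the original line order and numbering.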

[#26498] Answered: Saturday, June 11, 2022