Asked Thursday, June 30, 2022, 8:22:01

It's admission time in India, and I am trying my best to get into the best engineering college I can.



I have a PDF file containing a table that looks like this:



[Image: screenshot of the table]



It contains more than 2500 entries, and I have three days.



To work smart and pick out the right colleges for me, I need to match the contents against multiple regexps. A matching line:




  1. Should contain either of the words "computer" or "information"

  2. Should contain both GE and FALSE

  3. Should match the regexp [0-9]{5,}



I first tried opening it in LibreOffice Calc, but it opens in LibreOffice Draw instead. I also tried pdftohtml and pdftotext, but both mangle the table badly.



Finally I came across pdfgrep, but it does not work in combination with grep, as in:



pdfgrep regexp1 ./locn to file|grep regexp2|grep regexp3


This gives the error:



Binary file (standard input) matches


So whatever I do has to be done with a single regexp passed to pdfgrep, one that covers all the regexps I need.




EDIT: You can download the pdf here.




 Answers

pdfgrep works on pages, not lines, so instead of .* to match anything, you need [^\n]* to match anything but a newline and so ensure that you're matching within the same line. For some reason, [\n] is treated as an n (the \ is ignored) by pdfgrep, so some trickery is needed. Try this:



pdfgrep '(Computer|Information)'[^$'\n']*'GE'[^$'\n']*'FALSE'[^$'\n']*'[0-9]{5,}' Closing_Rank_After_Round_III.pdf


On my system, this returns 82 lines:



17 DBLCSE4B Computer Science & Engineering GE FALSE OTHERSTATE 33161
33 DBLITY4B Information Technology GE FALSE OTHERSTATE 38913
74 DHACSE4B Computer Science & Engineering GE FALSE OTHERSTATE 36528
97 DJKCSE4B Computer Science & Engineering GE FALSE OTHERSTATE 22030
108 DJTCSE4B Shri Mata Vaishno Devi University, J&K Computer Science & Engineering GE FALSE OTHERSTATE 41598
112 DMUCOE4B Mizoram University, Aizawl Computer Engineering GE FALSE OTHERSTATE 39759
124 DMUITY4B Mizoram University, Aizawl Information Technology GE FALSE OTHERSTATE 41723
132 DTUCSE4B Tezpur University, Tezpur Computer Science & Engineering GE FALSE OTHERSTATE 36567
161 IAAITY4B Information Technology GE FALSE OTHERSTATE 19303
173 IALITR5M M.Tech Information Technology GE FALSE OTHERSTATE 12723
181 IALITY4B Information Technology GE FALSE OTHERSTATE 10649
187 IGHCSE4B Computer Science & Engineering GE FALSE OTHERSTATE 20054
195 IGWITY4B Information Technology & Management, Information Technology GE FALSE OTHERSTATE 18357
195 IGWITY4B Information Technology & Management, Information Technology GE FALSE OTHERSTATE 18357
200 IJLCSE4B of Information Technology Design & Computer Science & Engineering GE FALSE OTHERSTATE 19427
200 IJLCSE4B of Information Technology Design & Computer Science & Engineering GE FALSE OTHERSTATE 19427
206 IJLECE4B of Information Technology Design & GE FALSE OTHERSTATE 21863
211 IJLMEC4B of Information Technology Design & Mechanical Engineering GE FALSE OTHERSTATE 22433
217 IKOCOE4B Computer Engineering GE FALSE OTHERSTATE 16837
223 IKPCOE4B Design & Manufacturing, Kancheepuram, Computer Engineering GE FALSE OTHERSTATE 14202
247 IVDCOS4B Computer Science GE FALSE OTHERSTATE 18374
252 IVDITY4B Information Technology GE FALSE OTHERSTATE 19973
284 NAGCSE4B National Institute of Technology, Agartala Computer Science & Engineering GE FALSE HOMESTATE 252288
285 NAGCSE4B National Institute of Technology, Agartala Computer Science & Engineering GE FALSE OTHERSTATE 27007
443 NAPCSE4B Computer Science & Engineering GE FALSE HOMESTATE 338141
444 NAPCSE4B Computer Science & Engineering GE FALSE OTHERSTATE 26762
505 NBHCSE4B Computer Science & Engineering GE FALSE HOMESTATE 11495
608 NCACSE4B National Institute of Technology, Calicut Computer Science & Engineering GE FALSE HOMESTATE LD 657523
735 NDUCSE4B National Institute of Technology, Durgapur Computer Science & Engineering GE FALSE HOMESTATE AN 80861
736 NDUCSE4B National Institute of Technology, Durgapur Computer Science & Engineering GE FALSE HOMESTATE WB 19088
737 NDUCSE4B National Institute of Technology, Durgapur Computer Science & Engineering GE FALSE OTHERSTATE 11900
772 NDUITY4B National Institute of Technology, Durgapur Information Technology GE FALSE HOMESTATE AN 95756
773 NDUITY4B National Institute of Technology, Durgapur Information Technology GE FALSE HOMESTATE WB 26872
774 NDUITY4B National Institute of Technology, Durgapur Information Technology GE FALSE OTHERSTATE 16715
811 NGOCSE4B National Institute of Technology, Goa Computer Science & Engineering GE FALSE HOMESTATE 102938
812 NGOCSE4B National Institute of Technology, Goa Computer Science & Engineering GE FALSE OTHERSTATE 13100
862 NHACSE4B National Institute of Technology, Hamirpur Computer Science & Engineering GE FALSE HOMESTATE 34510
863 NHACSE4B National Institute of Technology, Hamirpur Computer Science & Engineering GE FALSE OTHERSTATE 13867
933 NITCSE4B Birla Institute of Technology, Mesra Ranchi Computer Science & Engineering GE FALSE HOMESTATE 10898
955 NITITY4B Birla Institute of Technology, Mesra Ranchi Information Technology GE FALSE HOMESTATE 23647
956 NITITY4B Birla Institute of Technology, Mesra Ranchi Information Technology GE FALSE OTHERSTATE 14055
1080 NJLCSE4B Computer Science & Engineering GE FALSE HOMESTATE 13424
1081 NJLCSE4B Computer Science & Engineering GE FALSE OTHERSTATE 12160
1129 NJLITY4B Information Technology GE FALSE HOMESTATE 20270
1130 NJLITY4B Information Technology GE FALSE OTHERSTATE 14973
1172 NJMCSE4B Computer Science & Engineering GE FALSE HOMESTATE 22151
1173 NJMCSE4B Computer Science & Engineering GE FALSE OTHERSTATE 13379
1280 NKUITY4B Information Technology GE FALSE HOMESTATE 14993
1281 NKUITY4B Information Technology GE FALSE OTHERSTATE 12373
1317 NMGCSE4B National Institute of Technology, Meghalaya Computer Science & Engineering GE FALSE HOMESTATE 66882
1318 NMGCSE4B National Institute of Technology, Meghalaya Computer Science & Engineering GE FALSE OTHERSTATE 30457
1354 NMRCSE4B National Institute of Technology, Manipur Computer Science & Engineering GE FALSE HOMESTATE 335104
1355 NMRCSE4B National Institute of Technology, Manipur Computer Science & Engineering GE FALSE OTHERSTATE 29987
1386 NMZCSE4B National Institute of Technology, Mizoram Computer Science & Engineering GE FALSE HOMESTATE 780732
1387 NMZCSE4B National Institute of Technology, Mizoram Computer Science & Engineering GE FALSE OTHERSTATE 33351
1500 NNGCSE4B National Institute of Technology, Nagaland Computer Science & Engineering GE FALSE OTHERSTATE 32788
1538 NPACSE4B National Institute of Technology, Patna Computer Science & Engineering GE FALSE HOMESTATE 26912
1539 NPACSE4B National Institute of Technology, Patna Computer Science & Engineering GE FALSE OTHERSTATE 17852
1569 NPAITY4B National Institute of Technology, Patna Information Technology GE FALSE HOMESTATE 31050
1570 NPAITY4B National Institute of Technology, Patna Information Technology GE FALSE OTHERSTATE 21633
1588 NPYCSE4B National Institute of Technology, Puducherry Computer Science & Engineering GE FALSE HOMESTATE 212537
1589 NPYCSE4B National Institute of Technology, Puducherry Computer Science & Engineering GE FALSE OTHERSTATE 13738
1655 NRACSE4B National Institute of Technology, Raipur Computer Science & Engineering GE FALSE HOMESTATE 30599
1656 NRACSE4B National Institute of Technology, Raipur Computer Science & Engineering GE FALSE OTHERSTATE 16002
1686 NRAITY4B National Institute of Technology, Raipur Information Technology GE FALSE HOMESTATE 54124
1687 NRAITY4B National Institute of Technology, Raipur Information Technology GE FALSE OTHERSTATE 20012
1746 NROCEC5M National Institute of Technology, Rourkela and M.Tech. Computer Science 5- GE FALSE HOMESTATE 16014
1812 NROCSE4B National Institute of Technology, Rourkela Computer Science & Engineering GE FALSE HOMESTATE 12845
1821 NROCSS5M National Institute of Technology, Rourkela M.Tech. Information Security 5- GE FALSE HOMESTATE 16350
1822 NROCSS5M National Institute of Technology, Rourkela M.Tech. Information Security 5- GE FALSE OTHERSTATE 10803
1986 NSICSE4B National Institute of Technology, Silchar Computer Science & Engineering GE FALSE HOMESTATE 50138
1987 NSICSE4B National Institute of Technology, Silchar Computer Science & Engineering GE FALSE OTHERSTATE 22448
2044 NSKCSE4B National Institute of Technology, Sikkim Computer Science & Engineering GE FALSE HOMESTATE 353234
2045 NSKCSE4B National Institute of Technology, Sikkim Computer Science & Engineering GE FALSE OTHERSTATE 24788
2173 NSRCSE4B National Institution of Technology, Srinagar Computer Science & Engineering GE FALSE HOMESTATE 39818
2174 NSRCSE4B National Institution of Technology, Srinagar Computer Science & Engineering GE FALSE OTHERSTATE 22786
2259 NSTCOE4B Computer Engineering GE FALSE HOMESTATE DD 173213
2260 NSTCOE4B Computer Engineering GE FALSE HOMESTATE GJ 10724
2427 NUDCSE4B Computer Science & Engineering GE FALSE HOMESTATE 46818
2428 NUDCSE4B Computer Science & Engineering GE FALSE OTHERSTATE 18978
2478 NUSITY4B Assam University, Silchar Information Technology GE FALSE HOMESTATE 107749
2479 NUSITY4B Assam University, Silchar Information Technology GE FALSE OTHERSTATE 38122
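
As for the pipeline in the question: grep prints "Binary file (standard input) matches" instead of the matching lines when it decides that its input is binary, for instance because the input contains a NUL byte. Assuming that is what happens to pdfgrep's output here (I have not verified this against the PDF), grep's -a (--text) option forces it to treat the input as text, so the filters can be chained. The sketch below uses printf as a stand-in for pdfgrep: the first row is copied from the output above with a NUL byte appended, and the other two rows are made up.

```bash
# Stand-in for pdfgrep output (hypothetical): the first row carries a stray
# NUL byte, which is enough for grep to classify the whole stream as binary.
sample() {
  printf '17 DBLCSE4B Computer Science & Engineering GE FALSE OTHERSTATE 33161\0\n'
  printf '18 XXXMEC4B Mechanical Engineering GE FALSE OTHERSTATE 44444\n'
  printf '19 XXXITY4B Information Technology GE TRUE OTHERSTATE 55555\n'
}

# -a: treat input as text; -E: extended regexps; -w: match whole words only.
sample | grep -aE 'Computer|Information' \
       | grep -aw 'GE' \
       | grep -aw 'FALSE' \
       | grep -aE '[0-9]{5,}' \
       | tr -d '\0'   # drop the NUL byte before printing
```

Only the first row survives all four filters. Unlike the single regexp above, this does not care about the order in which the words appear on the line.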


The $'\n' is called an ANSI C escape sequence (bash calls the $'...' syntax ANSI-C quoting). These are a robust method of specifying certain problematic characters (such as non-printing characters and quotes) to programs that cannot recognize them in other ways: the shell replaces $'\n' with a real newline character before pdfgrep ever sees the pattern. In this case, I am using it in a character class. When a character class begins with ^, it means "match anything except the characters in this class". So, [^$'\n'] means "match anything except a newline character". This ensures that the matches we are looking for are all on the same line.
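
To see the effect without the PDF, here is a minimal sketch that uses bash's own [[ =~ ]] matching as a stand-in for pdfgrep (so pdfgrep itself may differ in details). The two-line "page" and all variable names are made up for illustration.

```bash
#!/usr/bin/env bash
# A made-up two-line "page" (not taken from the PDF): "Computer" is on the
# first line, while "FALSE" and a five-digit number are only on the second.
page=$'1 AAAA Computer Science GE TRUE 123\n2 BBBB Mechanical GE FALSE 45678'

nl=$'\n'   # ANSI C quoting: a single, real newline character
lit='\n'   # ordinary single quotes: a backslash followed by the letter n
echo "lengths: ${#nl} and ${#lit}"   # prints "lengths: 1 and 2"

# In bash's [[ =~ ]], "." also matches a newline, so ".*" can cross lines.
across='Computer.*FALSE.*[0-9]{5,}'
# The negated class stops at the end of the line, as in the pdfgrep command.
same_line="Computer[^${nl}]*FALSE[^${nl}]*[0-9]{5,}"

if [[ $page =~ $across ]]; then echo "across: match"; else echo "across: no match"; fi
if [[ $page =~ $same_line ]]; then echo "same_line: match"; else echo "same_line: no match"; fi
```

The first pattern matches because it is allowed to start on one line and finish on the next; the second reports no match because no single line contains all the pieces.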


Answered Friday, July 1, 2022