Asked Wednesday, October 27, 2021, 6:19:12

I have large text files in Arabic with Tashkil (diacritical marks), and I'm trying to find lines that contain a specific word that is vocalized (mashkula) with marks such as َ ً ُ ٌ ّ ْ ٍ. I've tried the following grep command:


cat file.txt | grep "اهلا"

This returns nothing until I insert Tashkil marks:


cat file.txt | grep "أهْلاً"

Then I get the correct output:


أهْلاً


I also tried


grep -P "[ُ ّ َ ً ِ ٍ ٌ ْ ~]|[اهلا]" file.txt

And this returns all matching characters in different patterns:


أهْلاً أ ... هْ.. لًا أنْتَ لَيْلاً ..

How can I match Arabic diacritical marks with grep?
Is it possible to remove Tashkil marks from text before using grep?
My OS is Ubuntu 18.04


UPDATE: At the moment, I remove the Tashkil marks from the text with:
sed "s/[ُ ّ َ ً ِ ٍ ٌ ْ]//g", and then I can grep for what I want. But with this approach, the sed command also removes the spaces from the whole text!



 Answers

Assuming UTF-8 source and locale, removing U+064B-U+065B range using Perl:


$ echo "أَهْلاً وَ سَهْلاً" | perl -CSAD -pe 's/[\x{064B}-\x{065B}]//g'

أهلا و سهلا

This works because vowel diacritics in Arabic are combining characters, so simply searching for them and removing them should be enough.


GNU sed also seems to work (note that, according to those answers, there are other diacritics as well):


$ echo "أَهْلاً وَ سَهْلاً" | sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g'

أهلا و سهلا

uconv might also work.


Check the comments area of this and s3idani's post for more info.





Answered Wednesday, October 27, 2021