
I have large TXT files of Arabic text with Tashkil (diacritical marks), and I'm trying to find lines that contain a specific vocalized (mashkula) pattern using marks such as َ ً ُ ٌ ّ ْ ٍ. I've tried the following grep syntax:

cat file.txt | grep "اهلا"

This returns nothing until I insert Tashkil marks:

cat file.txt | grep "أهْلاً"

Then I get the correct output:

أهْلاً

I also tried

grep -P "[ُ\ ّ\ َ\ ً\ ِ\ ٍ\ ٌ\ ْ\ \~]|[اهلا]" file.txt

But this matches every occurrence of any of those characters, in whatever arrangement:

أهْلاً أ ... هْ.. لًا أنْتَ لَيْلاً ..

How can I match Arabic diacritical marks with grep? Is it possible to remove the Tashkil marks from the text before using grep? My OS is Ubuntu 18.04.

UPDATE: For now, I remove the Tashkil marks from the text with: sed "s/[ُ ّ َ ً ِ ٍ ٌ ْ]//g", and then I can grep for what I want. But with this approach, the sed command also removes all the spaces from the text!

Pablo Bianchi
  • 17,371
s3idani
  • 423

2 Answers


Assuming a UTF-8 source and locale, you can remove the U+064B to U+065B range using Perl:

$ echo "أَهْلاً وَ سَهْلاً" | perl -CSAD -pe 's/[\x{064B}-\x{065B}]//g'

أهلا و سهلا

Source: This works because vowel diacritics in Arabic are combining characters, meaning that a simple search and remove of these should be enough.

GNU sed also seems to work (note that based on these answers, there are other diacritics):

$ echo "أَهْلاً وَ سَهْلاً" | sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g'

أهلا و سهلا

uconv might also work.

Check the comments area of this and s3idani's answer for more info.


Pablo Bianchi
  • 17,371

Based on Pablo Bianchi's answer, here's the workaround:

Text: أَهْلاً وَ سَهْلاً

Command: cat Text | sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g;s/أ/ا/g;s/آ/ا/g;s/إ/ا/g' | grep -o "اهلا"

Output: اهلا

s3idani
  • 423