How do I grep for Arabic characters with diacritical marks?

Question

I have large text files in Arabic with Tashkil (diacritical marks), and I'm trying to find lines that contain a specific vocalized (mashkula) pattern, i.e. one written with marks such as َ ً ُ ٌ ّ ْ ٍ . I've tried the following grep command:

cat file.txt | grep "اهلا"

This returns nothing until I insert Tashkil marks:

cat file.txt | grep "أهْلاً"

Then I get the correct output:

أهْلاً

I also tried

grep -P "[ُ\ ّ\ َ\ ً\ ِ\ ٍ\ ٌ\ ْ\ \~]|[اهلا]" file.txt

But this matches every word that contains any of these characters, in any combination:

أهْلاً أ ... هْ.. لًا أنْتَ لَيْلاً ..

How can I match Arabic text regardless of its diacritical marks with grep? Is it possible to remove the Tashkil marks from the text before using grep? My OS is Ubuntu 18.04.

UPDATE: For now, I remove the Tashkil marks from the text with: sed "s/[ُ ّ َ ً ِ ٍ ٌ ْ]//g", and then I can grep for what I want. But with this approach, the sed command also removes the spaces from the whole text!

Pablo Bianchi · Accepted Answer · 2025-06-16T21:59:09.043

Assuming a UTF-8 source file and locale, you can remove the U+064B-U+065B range using Perl:

$ echo "أَهْلاً وَ سَهْلاً" | perl -CSAD -pe 's/[\x{064B}-\x{065B}]//g'
أهلا و سهلا

Source: This works because vowel diacritics in Arabic are combining characters, i.e. separate code points that follow the base letter, so simply searching for them and removing them should be enough.

GNU sed also seems to work (note that based on these answers, there are other diacritics):

$ echo "أَهْلاً وَ سَهْلاً" | sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g'
أهلا و سهلا

uconv might also work.

Check the comments area of this and s3idani's answer for more info.

Other sources

s3idani · Answer 2 · 2022-04-25T21:23:49.640


Based on Pablo Bianchi's answer, here's a workaround that also normalizes the alef variants (أ, آ, إ) to a plain ا before grepping:

Text: أَهْلاً وَ سَهْلاً

Command: cat Text | sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g;s/أ/ا/g;s/آ/ا/g;s/إ/ا/g' | grep -o "اهلا"

Output: اهلا

edited Apr 25 '22 at 21:23

answered Apr 17 '22 at 00:17
