ASCII source file checker

Question

The official Ubuntu documentation, whose English source files are DocBook XML, is required to contain only ASCII characters. We check this with a "checker" command line (see here):

grep --color='auto' -P -n "[\x80-\xFF]" *.xml

However, the command has a flaw: on some computers, though apparently not all, it misses some lines with non-ASCII characters, which can produce a false O.K. result.

Does anyone have a better suggestion for an ASCII checker command line?

Interested persons might consider using this file (a text file, not a DocBook XML file) as a test case. The first three lines with non-ASCII characters are lines 9, 14 and 18. Lines 14 and 18 were missed in the check:

$ grep --color='auto' -P -n "[\x80-\xFF]" install.en.txt | head -13
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
520:1.2.1.4. Freedom and Philosophy
534:1.2.1.5. Ubuntu and other Debian derivatives
555:1.3. What is GNU/Linux?

muru · Accepted Answer · 2016-02-07T08:34:14.733

If you want to look for non-ASCII characters, perhaps you should invert the search to exclude ASCII characters:

grep -Pn '[^\x00-\x7F]'

For example:

$ curl https://help.ubuntu.com/16.04/installation-guide/amd64/install.en.txt -s | grep -nP '[^\x00-\x7F]' | head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

In lines 9, 330, 337 and 359, Unicode non-breaking space characters are present.
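
If grep's locale handling is a concern, the same "anything outside 0x00-0x7F" test can be done directly on bytes. The following is only a minimal sketch in Python 3.7 or later, not part of the command above; the function name non_ascii_lines and the sample data are made up for illustration:

```python
def non_ascii_lines(data: bytes):
    """Yield (line_number, raw_line) for each line holding a byte above 0x7F."""
    # Split on b"\n" only, so line numbers agree with grep -n.
    for number, line in enumerate(data.split(b"\n"), start=1):
        # bytes.isascii() is True only if every byte is in 0x00-0x7F.
        if not line.isascii():
            yield number, line


# Hypothetical sample: line 2 has a UTF-8 non-breaking space, line 3 a bullet.
sample = "plain\nAppendix\u00a0F\n\u2022 item\n".encode("utf-8")
for number, line in non_ascii_lines(sample):
    print(f"{number}:{line.decode('utf-8')}")
```

For a real file, pass the result of open(path, "rb").read() instead of the sample. Because the file is read in binary mode, the result does not depend on the locale.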

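The particular output you get may be due to grep's support for UTF-8. In a UTF-8 locale, grep -P most likely treats \x80-\xFF as the code points U+0080 to U+00FF rather than as raw bytes, so it matches the non-breaking space (U+00A0) but not characters such as ‘ (U+2018) or ━ (U+2501). Forcing the C locale makes grep match byte by byte and shows the expected results: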

$ LANG=C grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

$ LANG=en_GB.UTF-8 grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
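
The difference between the two runs can be imitated in Python, assuming the explanation is that the UTF-8 locale compares code points while the C locale compares bytes. This is only an analogy to illustrate that assumption, not a description of grep's internals; the sample strings are invented:

```python
import re

# str pattern: matches the code points U+0080 to U+00FF (like the UTF-8 locale run)
pattern_text = re.compile(r"[\x80-\xff]")
# bytes pattern: matches the raw bytes 0x80 to 0xFF (like the C locale run)
pattern_bytes = re.compile(rb"[\x80-\xff]")

samples = {
    "non-breaking space U+00A0": "Appendix\u00a0F",
    "left quote U+2018": "\u2018Xenial",
    "box drawing U+2501": "\u2501\u2501",
    "plain ASCII": "Ubuntu",
}

for label, text in samples.items():
    as_code_points = bool(pattern_text.search(text))
    as_bytes = bool(pattern_bytes.search(text.encode("utf-8")))
    print(f"{label}: code points={as_code_points}, bytes={as_bytes}")
```

Only the non-breaking space is found by both patterns. The quote and the box-drawing character lie above U+00FF, so the code point pattern skips them, although their UTF-8 encodings consist entirely of bytes at or above 0x80.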

Byte Commander · Answer 2 · 2016-07-10T13:50:36.283

You can print all non-ASCII lines of a file using my Python 3 script that I am hosting on GitHub here:

GitHub: ByteCommander/encoding-check

You can either clone or download the entire repository or simply save the file encoding-check and make it executable using chmod +x encoding-check.

Then you can run it like this, with the file to check as the only argument:

./encoding-check FILENAME if it's located in your current working directory, or...
/path/to/encoding-check FILENAME if it's located in /path/to/, or...
encoding-check FILENAME if it's located in a directory that is part of the $PATH environment variable, e.g. /usr/local/bin or ~/bin.

Without any optional arguments, it will print each line and its number where it found non-ASCII characters. Finally, there's a summary line that tells you how many lines the file had in total and how many of them contained non-ASCII characters.

This method is guaranteed to properly decode all ASCII characters and detect everything that is definitely not ASCII.

Here's an example run on a file containing the first 20 lines of your given install.en.txt:

$ ./encoding-check install-first20.en.txt
     9: Appendix��F, GNU General Public License.
    14: (codename "���Xenial Xerus���"), for the 64-bit PC ("amd64") architecture. It also
    18: ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
--------------------------------------------------------------------------------
20 lines in 'install-first20.en.txt', thereof 3 lines with non-ASCII characters.
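
The general idea of a decode-based check can be sketched as follows. This is not the code of encoding-check itself, just a minimal illustration of the approach; the function name check_lines and the sample data are assumptions made for the example:

```python
def check_lines(data: bytes, encoding: str = "ascii"):
    """Return (total_lines, flagged), where flagged lists (number, text) pairs."""
    flagged = []
    lines = data.splitlines()
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            # errors="replace" puts U+FFFD in place of each undecodable byte
            flagged.append((number, raw.decode(encoding, errors="replace")))
    return len(lines), flagged


# Hypothetical sample: line 2 contains a UTF-8 non-breaking space (2 bytes).
sample = "ok\nAppendix\u00a0F\n".encode("utf-8")
total, flagged = check_lines(sample)
for number, text in flagged:
    print(f"{number:6}: {text}")
print(f"{total} lines, thereof {len(flagged)} lines with non-ASCII characters.")
```

With the ascii codec, every byte above 0x7F is replaced separately, which would explain why a two-byte non-breaking space shows up as two replacement characters in the output above.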

But the script has some additional arguments to tweak the checked encoding and the output format. View the help and try them:

$ encoding-check -h
usage: encoding-check [-h] [-e ENCODING] [-s | -c | -l] [-m] [-w] [-n] [-f N]
                     [-t]
                     FILE [FILE ...]

Show all lines of a FILE containing characters that don't match the selected
ENCODING.

positional arguments:
  FILE                  the file to be examined

optional arguments:
  -h, --help            show this help message and exit
  -e ENCODING, --encoding ENCODING
                        file encoding to test (default 'ascii')
  -s, --summary         only print the summary
  -c, --count           only print the detected line count
  -l, --lines           only print the detected lines
  -m, --only-matching   hide files without matching lines from output
  -w, --no-warnings     hide warnings from output
  -n, --no-numbers      do not show line numbers in output
  -f N, --fit-width N   trim lines to N characters, or terminal width if N=0;
                        non-printable characters like tabs will be removed
  -t, --title           print title line above each file

Any codec that Python 3 knows is valid for --encoding. Just try one; in the worst case you get a short error message.
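
To find out beforehand whether Python 3 knows a codec name, the standard codecs module can be asked directly. A small sketch, independent of the script; the helper name is_known_codec is made up:

```python
import codecs


def is_known_codec(name: str) -> bool:
    """Report whether Python can resolve the given codec name."""
    try:
        codecs.lookup(name)
    except LookupError:
        # codecs.lookup raises LookupError for names it cannot resolve
        return False
    return True


for name in ("ascii", "utf-8", "latin-1", "no-such-codec"):
    print(name, is_known_codec(name))
```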

score 3 · Answer 3 · answered Feb 07 '16 at 07:16

This Perl command mostly replaces that grep command (the thing missing being the colors):

perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' *.xml

-n: causes Perl to assume the following loop around your program, which makes it iterate over filename arguments somewhat like sed -n or awk:
```
LINE:
  while (<>) {
      ...             # your program goes here
  }
```
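-e: may be used to enter one line of program.
/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_): if the line contains a byte in the range \x80-\xFF, prints the current file's name, the current line number in parentheses, the string :\t^ and the line's content. Note that $. is not reset between files read through <>, so with several files the line numbers keep counting up; appending ;close ARGV if eof to the program resets them for each file.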

Output on a sample directory containing the sample file in the question and a file containing only ààààà and a newline character:

% perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' file | head -n 10
file(9):    ^AppendixÂ F, GNU General Public License.
file(14):   ^(codename "â€˜Xenial Xerusâ€™"), for the 64-bit PC ("amd64") architecture. It also
file(18):   ^â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”
file(330):  ^when things go wrong. The Installation Howto can be found in AppendixÂ A, 
file(337):  ^ChapterÂ 1.Â Welcome to Ubuntu
file(359):  ^1.1.Â What is Ubuntu?
file(368):  ^  â€¢ Ubuntu will always be free of charge, and there is no extra fee for the "
file(372):  ^  â€¢ Ubuntu includes the very best in translations and accessibility
file(376):  ^  â€¢ Ubuntu is shipped in stable and regular release cycles; a new release will
file(380):  ^  â€¢ Ubuntu is entirely committed to the principles of open source software
% perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' file1
file1(1):   ^ààààà
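
For comparison, roughly the same loop can be written in Python with the standard fileinput module, which iterates over a list of files much like perl -n. This is only a sketch with a similar output format, not a drop-in replacement; the names HIGH_BYTE and report are invented for the example:

```python
import fileinput
import os
import re
import tempfile

HIGH_BYTE = re.compile(rb"[\x80-\xFF]")


def report(paths):
    """Yield 'name(line):<TAB>^content' for each line containing a high byte."""
    # mode="rb" reads raw bytes, so the match does not depend on the locale
    with fileinput.FileInput(files=paths, mode="rb") as stream:
        for line in stream:
            if HIGH_BYTE.search(line):
                text = line.decode("utf-8", errors="replace").rstrip("\n")
                yield f"{stream.filename()}({stream.filelineno()}):\t^{text}"


# Demo on a temporary file that mimics the second sample file above.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "file1")
    with open(path, "wb") as handle:
        handle.write("\u00e0\u00e0\u00e0\u00e0\u00e0\n".encode("utf-8"))
    for entry in report([path]):
        print(entry)
```

filelineno() restarts at 1 for every file, so the reported numbers are always relative to the file being read.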
