How can I count all the python and shell scripts in my whole system?

2 Answers

1

Quick overview

Here is a guideline on how to do it.

$ for f in * ; do file "$f" ; done

aptfielout: ASCII text, with very long lines
aptfilein: ASCII text, with very long lines
aptfileout: ASCII text
aptfileparse.sh: Bourne-Again shell script, ASCII text executable, with very long lines
aptfileparse.sh~: ASCII text, with very long lines
calc.py: Python script, UTF-8 Unicode text executable
catall.sh: Bourne-Again shell script, ASCII text executable

Strip out all the files that don't say "Bourne-Again shell script," or "Python script,". Add to the list POSIX shell scripts:

$ file /bin/zgrep
/bin/zgrep: POSIX shell script, ASCII text executable
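The filtering step can be sketched as a small helper. This is only an illustration: the function name filter_scripts is made up here, and it assumes the descriptions read exactly as in the sample output above ("Bourne-Again shell script,", "POSIX shell script,", "Python script,"). Other versions of file may word them differently.

```shell
# Hypothetical helper: keep only the lines of `file` output whose
# description names a shell or Python script. Matching after ": "
# (file may pad with extra spaces) keeps ordinary file names from
# triggering a match.
filter_scripts() {
    grep -E ': +(Bourne-Again shell|POSIX shell|Python) script,'
}
```

It could be used as: for f in * ; do file "$f" ; done | filter_scripts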

A complete answer

/$ time find * -type f -print0 2>/dev/null | xargs -0 -P 8 file | \
sed 's/.*: //g' | sed 's/^ *//g' | \
grep -Eio 'shell script,|Python script,' | sort | uniq -c

19151 Python script,
127 python script,
18420 shell script,

real 16m14.939s
user 54m7.355s
sys 2m33.238s

Starting from the root (/), find lists all regular files and pipes their names to the xargs command, terminated by zero bytes so that unusual file names are handled safely.

The xargs command runs up to 8 processes in parallel (-P 8), one per CPU in my case, for faster processing. Each parallel process calls the file command, which prints a description of each file as shown in the previous section.

The sed commands strip the file name and any leading spaces from each line, and the grep command then selects shell scripts and Python scripts.

The sort command sorts shell scripts together and Python scripts together.

The uniq command counts the occurrences of each group.
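In the output above, "Python script," and "python script," are counted separately because grep -o prints each match in its original case. If a single total per type is preferred, one option is to fold the case before sorting. A minimal sketch, where the function name count_script_types is invented for illustration and the input is assumed to be the output of file:

```shell
# Hypothetical helper: read `file` output on stdin and print one
# count per script type, treating "Python" and "python" as the same.
count_script_types() {
    grep -Eio '(shell|python) script,' |
        tr '[:upper:]' '[:lower:]' |
        sort |
        uniq -c
}
```

It would take the place of the grep, sort and uniq stages at the end of the pipeline.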


Fun facts

You can really tax your system running all 8 CPUs (in my case) at once:

[Animated screenshot: find xargs 8 cores.gif]

The beauty of Linux shines through because other jobs such as the screen recorder making the .gif and a video running on the third monitor (big screen TV) continue to function normally. Linux doesn't let the xargs file command bog down the system.

1

In the absence of a more specific goal, this will be approximate no matter how you do it, because of ambiguities about what constitutes a shell script and what constitutes a Python script. That doesn't make the problem too ill-defined, so long as an approximation is what you want. And you can get a good approximation.

Given that, I suggest this command to list shell and Python scripts:

find . -type f -executable -exec file {} + | grep -Ei '(python|shell) script,'

If the output looks reasonable for your needs, you can run it again, modified to count the number of results:

find . -type f -executable -exec file {} + | grep -Ei '(python|shell) script,' | wc -l

You may get some "Permission denied" errors. That's okay. I don't recommend attempting to suppress those error messages, because you should read or at least scan through them to see if it looks like you were unable to access any files or locations that were of interest to you. You can run the find command as root with sudo if you really want to.
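If the error messages scroll past too quickly to read, one option is to save them to a file and review them afterwards instead of discarding them. A rough sketch, assuming GNU find; the function name count_scripts_logged and its two arguments are made up for this example:

```shell
# Hypothetical helper: count scripts under a directory while saving
# error messages (such as "Permission denied") to a log file so they
# can be read later rather than suppressed.
count_scripts_logged() {
    dir=$1   # where to start searching
    log=$2   # file that receives the error messages
    find "$dir" -type f -executable -exec file {} + 2>"$log" |
        grep -Ei '(python|shell) script,' |
        wc -l
}
```

For example, count_scripts_logged . errors.log prints the count, and errors.log can then be scanned for locations that could not be searched.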

  • -type f makes it find only regular files. Usually it's better to use -xtype f to include symbolic links that resolve to regular files, but in this case that would result in overcounting.
  • -executable makes it find only files that are executable by the user who runs find. Looking at non-executable files to see if they appear to be shell or Python scripts would make the command take considerably longer. You may also get more false positives that way, because files that aren't executable may be "libraries" rather than scripts: they may consist of shell commands intended for sourcing with . or source into shell scripts, or they may be Python modules that one would import with import or from into Python programs. (You might think this would not happen, since such files generally do not have a shebang, but file looks for more than a shebang.) However, you can omit -executable if you like, provided you are willing to wait as your command attempts to open and read the beginning of every regular file on your system.
  • -exec ... + runs a command ... with the found files as its command-line arguments. It runs the command as many times as necessary to process all the files. Often this is just once; for all the executable files on your whole system, it will likely be more than once, but many fewer times than if you ran it once per file (as -exec ... \; would do). Even on the same number of files, running a command fewer times tends to be notably faster than running it more times, because there is lower associated overhead.
  • The file command looks at the beginning of a file and guesses, usually pretty well, what kind of file it is. It outputs in a two-column format, with the path or filename on the left and a summary of what kind of file it appears to be on the right.
  • The grep command filters its input and outputs only lines that case-insensitively (-i) match the extended regular expression (-E) (python|shell) script,. Those are the lines that contain the text python script,, shell script,, or any case variant thereof. Files that file identifies as those types of scripts will show this.
  • wc -l, which appears in the second of two commands shown above, counts lines.
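The difference between -exec ... + and -exec ... \; can be seen by counting how often the command is started. In this sketch the two function names are invented, and a trivial sh -c 'echo run' stands in for file so that each invocation prints exactly one line:

```shell
# Each invocation of the command prints one line, so the line count
# equals the number of times find started the command.
batched_runs() {
    find "$1" -type f -exec sh -c 'echo run' sh {} + | wc -l
}

per_file_runs() {
    find "$1" -type f -exec sh -c 'echo run' sh {} \; | wc -l
}
```

On a directory holding a handful of files, batched_runs reports a single invocation, while per_file_runs reports one per file.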

As shown, this technique is wholly unsuitable for many tasks that involve discerning what type of files one has. The reason is that a file can have text like python script, in its name, as well as newline characters in its name that would cause the output of file not to be one-per-line. It is usually important, and often even vital, to account for such things, and it can be done. In this case, however, you're just going for an estimate (due to the fuzzy nature of the problem itself) and it appears you're not renaming, modifying, deleting, or even creating anything based directly on the result, so I don't think it's worthwhile to worry about that. If you end up iterating on this and defining the problem more strictly, then it could be worthwhile to address that.
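If those cases do need to be handled, one possible approach is to keep file names out of the matched text entirely. The sketch below assumes GNU find and a file command that supports -b (brief mode, which omits the file name); the function name count_scripts_robust is hypothetical. It prints one character per matching file and counts the characters, so nothing in a name can add or split a line. It runs file once per file, which makes it slower than the commands above.

```shell
# Hypothetical helper: count shell and Python scripts without letting
# file names influence the result.
count_scripts_robust() {
    find "${1:-.}" -type f -executable -exec sh -c '
        for f do
            # -b prints only the description, never the file name
            if file -b "$f" | grep -Eiq "(python|shell) script,"; then
                printf x    # one character per matching file
            fi
        done' sh {} + | wc -c
}
```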

Note that there is one major case where you might wish to consider non-executable files to be scripts: if you have many Python scripts brought over from a system like Windows where they are not marked executable. In that case, you can search for .py files, though be aware that many of them are likely to be Python modules rather than Python scripts. If the good Python practice of putting a hashbang at the top of the script has been followed (this is useful even in Windows, because py.exe and pyw.exe recognize them, though unfortunately it's not always done), then a technique that looks just for hashbangs but ignores whether a file is executable may be more suited to your needs.
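A hashbang-only check of that kind might look like the following. This is a sketch under assumptions: the function name count_python_hashbangs is invented, only files named *.py are examined, and a first line is treated as a Python hashbang if it starts with #! and contains the word python anywhere after that.

```shell
# Hypothetical helper: count .py files whose first line is a hashbang
# that mentions python, regardless of whether they are executable.
count_python_hashbangs() {
    find "${1:-.}" -type f -name '*.py' -exec sh -c '
        for f do
            first=                          # reset in case the file is unreadable
            IFS= read -r first < "$f" || :  # read only the first line
            case $first in
                "#!"*python*) printf x ;;   # one character per match
            esac
        done' sh {} + | wc -c
}
```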

There is also a minor but significant case where you might wish to consider non-executable files to be scripts of any kind--or, more precisely, where you might wish to test for executability differently. If you have a drive mounted noexec, then no file on it will pass find's -executable test. Note that this is a different problem from running find as a user who doesn't have permissions to execute some files--like the problem of running it as a user who doesn't have permissions to look in some directories, this can be solved by running it as a sufficiently privileged user.
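For the noexec case, one alternative is to test the permission bits directly instead of asking whether the current user could run the file. A minimal sketch, assuming GNU find, whose -perm /111 test matches files with any of the user, group, or other execute bits set; the function name list_marked_executable is made up here:

```shell
# Hypothetical helper: list regular files that have at least one
# execute bit set, without checking whether they can actually be run.
list_marked_executable() {
    find "${1:-.}" -type f -perm /111
}
```

Replacing -executable with -perm /111 in the commands above would apply the same idea to the script search.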


This problem, as you've posed it, is unusual--ordinarily one would want to find scripts of a specific language or small family of closely related languages. But for the benefit of future readers, note that finding all the (for example) shell scripts in a single, perhaps large, directory can also be accomplished with a slight modification of the above commands. (The same holds for the technique presented in WinEunuuchs2Unix's answer--it is useful for that, too.)

For example, to find all the shell scripts in the current directory and its subdirectories:

find . -type f -executable -exec file {} + | grep -Fi 'shell script,'
Eliah Kagan