Files
pkgsrc-ng/pkgtools/pkglint/files/doc/chap.code.xml
2013-09-26 17:14:40 +02:00

308 lines
13 KiB
XML

<!-- $NetBSD: chap.code.xml,v 1.5 2006/07/21 05:11:34 rillig Exp $ -->
<chapter id="code">
<title>Code structure</title>
<para>In this chapter, I give an overview of how the &pkglint;
code is organized, starting with the <function>main</function>
function, passing the functions that check a single line and
finally arriving at the infrastructure that makes writing the
other functions easier.</para>
<sect1 id="code.overview">
<title>Overview</title>
<para>The &pkglint; code is structured in modular, easy to
understand procedures. These procedures can be further
classified with respect to what they do. There are procedures
that check a file, others check the lines of a file, again
others check a single line. These classes of procedures are
described in the following sections in a top-down
fashion.</para>
<para>If nothing special is said about which procedures call
which others, you may assume that procedures of a certain rank
only call procedures that are of a strictly lower rank. For
example, no <function>checkline_*</function> will ever call
<function>checkfile_*</function>. Sometimes, functions of the
same rank are called, but these cases are documented
explicitly.</para>
</sect1>
<sect1 id="code.select">
<title>Selecting the proper checking function</title>
<para>The <function>main</function> procedure of &pkglint; is a
simple loop around a TODO list containing pathnames of items (I
couldn't think of a better name here). The decision of which
checks to apply to a given item is done in
<function>checkitem</function>, which checks whether the item is
a file or a directory and dispatches the actual checking to
specialized procedures.</para>
</sect1>
<sect1 id="code.dir">
<title>Checking a directory</title>
<para>The procedures that check a directory are
<function>checkdir_root</function> for the pkgsrc root
directory, <function>checkdir_category</function> for a category
of packages and <function>checkdir_package</function> for a
single package.</para>
</sect1>
<sect1 id="code.file">
<title>Checking a file</title>
<para>Since the dispatching for files requires much code, it has
been put into a separate procedure called
<function>checkfile</function>, which further dispatches the
call to the other procedures.</para>
<para>The procedures that check a specific file are
<function>checkfile_ALTERNATIVES</function>,
<function>checkfile_DESCR</function>,
<function>checkfile_distinfo</function>,
<function>checkfile_extra</function>,
<function>checkfile_INSTALL</function>,
<function>checkfile_MESSAGE</function>,
<function>checkfile_mk</function>,
<function>checkfile_patch</function> and
<function>checkfile_PLIST</function>. For most of the
procedures, it should be obvious to which files they are
applied. A distinction is made between buildlink3 files and
other <filename>Makefiles</filename>, as some additional checks
apply to buildlink3 files. Of course, these procedures use
pretty much the same code for checking, and this is where the
<function>checklines_*</function> functions step in.</para>
<para>The <function>checkfile_package_Makefile</function>
function is somewhat special in that it expects four parameters
instead of only one. This is because loading the package data
has been separated from the actual checking.</para>
</sect1>
<sect1 id="code.lines">
<title>Checking the lines in a file</title>
<para>This class of procedures consists of
<function>checklines_trailing_empty_lines</function>,
<function>checklines_package_Makefile_varorder</function> and
<function>checklines_mk</function>. The middle one is too
complex to be included in
<function>checkfile_package_Makefile</function>, and the other
ones are of so generic use that they deserved to be procedures
of their own.</para>
<para>The <function>checklines_mk</function> makes heavy use of
the various <function>checkline_*</function> functions that are
explained in the next chapter.</para>
</sect1>
<sect1 id="code.line">
<title>Checking a single line in a file</title>
<para>This class of procedures checks a single line of a file.
The number of parameters differs for most of these procedures,
as some need more context information and others don't.</para>
<para>The procedures that are applicable to any file type are
<function>checkline_length</function>,
<function>checkline_valid_characters</function>,
<function>checkline_valid_characters_in_variable</function>,
<function>checkline_trailing_whitespace</function>,
<function>checkline_rcsid_regex</function>,
<function>checkline_rcsid</function>,
<function>checkline_relative_path</function>,
<function>checkline_relative_pkgdir</function>,
<function>checkline_spellcheck</function> and
<function>checkline_cpp_macro_names</function>.</para>
<para>The rest of the procedures is specific to
<filename>Makefile</filename>s:
<function>checkline_mk_text</function>,
<function>checkline_mk_shellword</function>,
<function>checkline_mk_shelltext</function>,
<function>checkline_mk_shellcmd</function>,
<function>checkline_mk_vartype_basic</function>,
<function>checkline_mk_vartype_basic</function>,
<function>checkline_mk_vartype</function> and
<function>checkline_mk_varassign</function>.</para>
<para>This class of procedures contains the most code in
&pkglint;. The procedures that check shell commands and shell
words both have around 200 lines, and the largest procedure is
the check for predefined variable types, which has almost 500
lines. But the code is not complex at all, since this procedure
contains a large switch for all the predefined types. The checks
for a single type usually fit on a single screen.</para>
</sect1>
<sect1 id="code.infrastructure">
<title>The &pkglint; infrastructure</title>
<para>To keep the code in the checking procedures small and
legible, an additional layer of procedures is needed that
provides basic operations and abstractions for handling files as
a collection of lines and to print all diagnostics in a common
format that is suitable to further processing by software
tools.</para>
<para>Since October 2004, this part of &pkglint; makes use of
some of the object oriented features of the Perl programming
language. It has worked quite well upto now, but it has not been
fun to write object-oriented code in Perl. The most basic
feature I am missing is that the compiler checks whether an
object has a specific method or not, as I have often written
<code>$line->warning()</code> instead of
<code>$line->log_warning()</code>. This makes refacturing quite
difficult if you don't have a 100&nbsp;% coverage test, and I
don't have that.</para>
<para>The classes are all defined in the
<varname>PkgLint</varname> namespace.</para>
<para>The traditional class is <classname>Line</classname>,
which represents a logical line of a file. In case of
<filename>Makefile</filename>s, line continuations are parsed
properly and combined into a single line. For all other files,
each logical line corresponds to a physical line. The
<classname>Line</classname> class has accessor methods to its
fields <methodname>fname</methodname>,
<methodname>lines</methodname> and
<methodname>text</methodname>. It also has the methods
<methodname>log_fatal</methodname>,
<methodname>log_error</methodname>,
<methodname>log_warning</methodname>,
<methodname>log_info</methodname> and
<methodname>log_debug</methodname> that all have one parameter,
the diagnostics message. The other methods are used less
often.</para>
<para>In January 2006, the logging has been improved in
functionality. Before that, a logical line could well consist of
300 physical lines, so a diagnostic would say <quote>you have a
bug somewhere between line 100 and 400</quote>. This is not
helpful. Therefore, a new class has been invented that allows to
map each character of a logical line to its corresponding
physical location in the file. The new representation of a
logical line is called a <classname>String</classname>. This
feature is still experimental, since the only method for logging
a string is <methodname>log_warning</methodname>. The others are
still missing. It is also completely unclear how lines that have
been fixed by &pkglint; are represented since this moves
characters around in the physical lines.</para>
<para>To make pattern matching with the new
<classname>String</classname> easy to use, the additional class
<classname>StringMatch</classname> has been created. It saves
the result of a <classname>String</classname> that is matched
against a regular expression. The canonical way to get such a
<classname>StringMatch</classname> is to call the
<methodname>String::match</methodname> method.</para>
<para>Since the <classname>StringMatch</classname> was
convenient to use, the <classname>SimpleMatch</classname> class
represents the result of matching a Perl string against a
regular expression. The class <classname>Location</classname> is
currently unused.</para>
</sect1>
<sect1 id="code.style">
<title>Perl programming style</title>
<para>The &pkglint; source code has evolved from FreeBSD's portlint,
which has been written in Perl, and up to now, &pkglint; is written
in Perl. Since one of the main ingredients to &pkglint; are regular
expressions, this choice seems natural, and indeed the Perl regular
expressions are a great help to keep the code short. But &pkglint;
is more than just throwing regular expressions at the package
files.</para>
<para>In 2004, when the &pkglint; source code comprised about
40&nbsp;kilobytes, this was quite appropriate. Since then, the code
has become much more structured and various abstraction layers have
been inserted. It became more and more clear that the Perl
programming language has not been designed with nice-looking source
code in mind.</para>
<para>The first example are subroutines and their parameters. In
most other languages, the names of the parameters are mentioned in
the subroutine definition. Not so in Perl. The parameters to each
subroutine are passed in the <literal>@_</literal> array. The usual
way to get named parameters is to write assign the parameter array
to a list of local variables. This extra statement is a nuisance,
but it is merely syntactical.</para>
<para>More serious is the way the arguments are passed to a
subroutine. Perl allows the programmer to define subroutines with a
weak form of prototypes, which helps to catch calls to subroutines
that provide a wrong number of arguments. This feature catches many
bugs that are easily overlooked. The downside is that anything
besides using scalars as parameter types is difficult to understand
and quickly leads to unexpected behavior. Therefore the subroutines
in &pkglint; only use this style for parameter passing. Oh, and by
the way, the subroutine prototypes are only checked for in certain
situations like direct calls. In method calls, nothing is checked at
all. Since almost all diagnostics are produced by calling
<code>$line->log_warning()</code> or
<code>$line->log_error()</code>, most of the subroutine calls in
&pkglint; go unchecked.</para>
<para>Instead of using magic numbers, well written code defines
named constants for these numbers and then refers to them using
their names, giving the reader extra information that plain numbers
could not give. Although the constant definitions look quite good in
&pkglint; there is one big caveat. The Perl programming language
does not know constants. So these definitions are rather shortcuts
for defining functions that return the value of the constant. And as
functions in Perl have package-wide scope, so have these constants.
This is why the namespace prefixes like <varname>SWST_</varname> are
necessary to avoid name clashes.</para>
<para>Most of the constants would be written as an enumeration data
type if Perl had one. The same limitation applies for many of the
classes (implemented as packages in Perl) that are simply structs.
The typical Perl implementation of structs are classes, er, packages
which then use methods for accessing the fields. Again, the names of
these methods are only checked at runtime, so there is no language
support for detecting spelling mistakes in field names.</para>
<para>Another area where Perl fails to detect many errors is the
loose type system. You can apply almost every operator to almost
every data type, and the Perl language will give you more or less
what you want. Especially it does not prevent you from matching
a regular expression against a reference. It will simply compute
a string representation of the reference and match the regular
expression against that.</para>
<para>The current Perl interpreter is very inefficient when
copying strings. This happens really often in pkglint, for
example when passing arguments to functions or saving the result
of a regular expression match in <quote>real</quote> variables.
For a great speed-up, an implementation that handles string
objects by reference-counting them would be better. (Lua comes
to mind.)</para>
</sect1>
<sect1 id="code.lang">
<title>Switching to another language</title>
<para>Switching to C++ is not an option, since the typing
overhead would be more than twice the current amount. As a
consequence the code would become much less readable.</para>
<para>Switching to OCaml looks nice (because of the type
inference), but the regular expressions that are provided by the
system are by no means sufficient. On the other hand, since
today there is a PCRE package for OCaml in pkgsrc.</para>
</sect1>
</chapter>