Code structure

In this chapter, I give an overview of how the pkglint code is organized, starting with the main function, passing through the functions that check a single line and finally arriving at the infrastructure that makes writing the other functions easier.

Overview

The pkglint code is structured in modular, easy-to-understand procedures. These procedures can be further classified with respect to what they do: some check a file, others check the lines of a file, and yet others check a single line. These classes of procedures are described in the following sections in a top-down fashion.

If nothing special is said about which procedures call which others, you may assume that procedures of a certain rank only call procedures of a strictly lower rank. For example, no checkline_* will ever call checkfile_*. Sometimes, functions of the same rank are called, but these cases are documented explicitly.

Selecting the proper checking function

The main procedure of pkglint is a simple loop around a TODO list containing pathnames of items (I couldn't think of a better name here). The decision of which checks to apply to a given item is made in checkitem, which checks whether the item is a file or a directory and dispatches the actual checking to specialized procedures.

Checking a directory

The procedures that check a directory are checkdir_root for the pkgsrc root directory, checkdir_category for a category of packages and checkdir_package for a single package.

Checking a file

Since the dispatching for files requires much code, it has been put into a separate procedure called checkfile, which further dispatches the call to the other procedures.

The procedures that check a specific file are checkfile_ALTERNATIVES, checkfile_DESCR, checkfile_distinfo, checkfile_extra, checkfile_INSTALL, checkfile_MESSAGE, checkfile_mk, checkfile_patch and checkfile_PLIST. For most of the procedures, it should be obvious to which files they are applied.
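The dispatching described so far can be sketched as follows. This is only an illustration under assumptions, not the actual pkglint code: checkdir is a hypothetical stand-in for the three checkdir_* procedures, both stubs merely record that they were called, and the name main_loop is made up for this example.

```perl
use strict;
use warnings;

my @calls;   # records what was dispatched, in order

# Hypothetical stand-ins: the real procedures perform the actual checks.
sub checkdir  { my ($dir)   = @_; push @calls, "checkdir $dir" }
sub checkfile { my ($fname) = @_; push @calls, "checkfile $fname" }

# Decide by the kind of item which specialized procedure to call.
sub checkitem {
    my ($item) = @_;
    if (-d $item) {
        checkdir($item);
    } elsif (-f $item) {
        checkfile($item);
    } else {
        push @calls, "error $item";   # neither a file nor a directory
    }
}

# The main procedure: a simple loop around the TODO list.
sub main_loop {
    my (@todo) = @_;
    while (@todo) {
        checkitem(shift @todo);
    }
}

main_loop(@ARGV);
print "$_\n" foreach @calls;
```

Keeping the decision in one place means that the specialized procedures never need to test what kind of item they were given.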
A distinction is made between buildlink3 files and other Makefiles, as some additional checks apply to buildlink3 files. Of course, these procedures use pretty much the same code for checking, and this is where the checklines_* functions step in.

The checkfile_package_Makefile function is somewhat special in that it expects four parameters instead of only one. This is because loading the package data has been separated from the actual checking.

Checking the lines in a file

This class of procedures consists of checklines_trailing_empty_lines, checklines_package_Makefile_varorder and checklines_mk. The middle one is too complex to be included in checkfile_package_Makefile, and the other two are so generally useful that they deserve to be procedures of their own. The checklines_mk procedure makes heavy use of the various checkline_* functions that are explained in the next section.

Checking a single line in a file

This class of procedures checks a single line of a file. The number of parameters differs between these procedures, as some need more context information than others.

The procedures that are applicable to any file type are checkline_length, checkline_valid_characters, checkline_valid_characters_in_variable, checkline_trailing_whitespace, checkline_rcsid_regex, checkline_rcsid, checkline_relative_path, checkline_relative_pkgdir, checkline_spellcheck and checkline_cpp_macro_names.

The rest of the procedures are specific to Makefiles: checkline_mk_text, checkline_mk_shellword, checkline_mk_shelltext, checkline_mk_shellcmd, checkline_mk_vartype_basic, checkline_mk_vartype and checkline_mk_varassign.

This class of procedures contains the most code in pkglint. The procedures that check shell commands and shell words both have around 200 lines, and the largest procedure is the check for predefined variable types, which has almost 500 lines.
But the code is not complex at all, since this procedure contains a large switch over all the predefined types. The checks for a single type usually fit on a single screen.

The pkglint infrastructure

To keep the code in the checking procedures small and legible, an additional layer of procedures is needed that provides basic operations and abstractions for handling files as a collection of lines and for printing all diagnostics in a common format that is suitable for further processing by software tools.

Since October 2004, this part of pkglint makes use of some of the object-oriented features of the Perl programming language. It has worked quite well up to now, but it has not been fun to write object-oriented code in Perl. The most basic feature I am missing is that the compiler checks whether an object has a specific method or not, as I have often written $line->warning() instead of $line->log_warning(). This makes refactoring quite difficult if you don't have a test suite with 100% coverage, and I don't have that.

The classes are all defined in the PkgLint namespace. The traditional class is Line, which represents a logical line of a file. In the case of Makefiles, line continuations are parsed properly and combined into a single logical line. For all other files, each logical line corresponds to a physical line. The Line class has accessor methods for its fields fname, lines and text. It also has the methods log_fatal, log_error, log_warning, log_info and log_debug, which all take one parameter, the diagnostic message. The other methods are used less often.

In January 2006, the logging was improved. Before that, a logical line could well consist of 300 physical lines, so a diagnostic would say you have a bug somewhere between line 100 and 400. This is not helpful. Therefore, a new class has been invented that maps each character of a logical line to its corresponding physical location in the file. The new representation of a logical line is called a String.
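The idea of mapping each character back to its physical location can be sketched like this. It is not the PkgLint::String implementation; the function name join_continuation_lines and the shape of the returned data are assumptions made for this example. The joining is also simplified: it only drops the trailing backslash, whereas make additionally collapses the whitespace around a continuation.

```perl
use strict;
use warnings;

# Join physical lines into one logical line and remember, for every
# character of the result, the physical line and column it came from.
# Returns the text and a reference to the list of [lineno, column] pairs.
sub join_continuation_lines {
    my ($first_lineno, @physlines) = @_;
    my $text = "";
    my @origin;
    my $lineno = $first_lineno;
    foreach my $phys (@physlines) {
        (my $part = $phys) =~ s/\\$//;   # drop the continuation backslash
        foreach my $col (1 .. length($part)) {
            push @origin, [$lineno, $col];
        }
        $text .= $part;
        $lineno++;
    }
    return ($text, \@origin);
}

my ($text, $origin) = join_continuation_lines(100,
    'CONFIGURE_ARGS+= --prefix=${PREFIX} \\',
    '    --enable-foo \\',
    '    --with-bar=/usr/local');

if ($text =~ m{/usr/local}) {
    # $-[0] is the offset in $text at which the match starts.
    my ($lineno, $col) = @{ $origin->[$-[0]] };
    print "WARNING: line $lineno, column $col: hardcoded path.\n";
}
```

With such a table, a diagnostic can name the exact physical line instead of the whole range covered by the logical line.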
The String class is still experimental, since the only method for logging a string is log_warning. The others are still missing. It is also completely unclear how lines that have been fixed by pkglint are represented, since fixing moves characters around in the physical lines.

To make pattern matching with the new String easy to use, the additional class StringMatch has been created. It saves the result of matching a String against a regular expression. The canonical way to get such a StringMatch is to call the String::match method.

Since StringMatch proved convenient to use, the SimpleMatch class was added to represent the result of matching a plain Perl string against a regular expression.

The class Location is currently unused.

Perl programming style

The pkglint source code has evolved from FreeBSD's portlint, which has been written in Perl, and up to now, pkglint is written in Perl. Since regular expressions are one of the main ingredients of pkglint, this choice seems natural, and indeed the Perl regular expressions are a great help in keeping the code short. But pkglint is more than just throwing regular expressions at the package files.

In 2004, when the pkglint source code comprised about 40 kilobytes, this was quite appropriate. Since then, the code has become much more structured and various abstraction layers have been inserted. It became more and more clear that the Perl programming language has not been designed with nice-looking source code in mind.

The first example is subroutines and their parameters. In most other languages, the names of the parameters are mentioned in the subroutine definition. Not so in Perl. The parameters to each subroutine are passed in the @_ array. The usual way to get named parameters is to assign the parameter array to a list of local variables. This extra statement is a nuisance, but it is merely syntactical. More serious is the way the arguments are passed to a subroutine.
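Both points can be illustrated with a short sketch. The subroutine format_diagnostic is hypothetical and not taken from pkglint. It shows the idiom for naming the parameters and what happens when the caller passes the wrong number of arguments to a subroutine that has no prototype.

```perl
use strict;
use warnings;

# Hypothetical helper: formats a diagnostic for a given file and line.
sub format_diagnostic {
    my ($fname, $lineno, $message) = @_;   # the extra statement that names the parameters
    $lineno  = "?"            unless defined $lineno;
    $message = "(no message)" unless defined $message;
    return "$fname:$lineno: $message";
}

# All three calls compile and run; Perl does not check the argument count.
print format_diagnostic("Makefile", 12, "Unknown variable."), "\n";
print format_diagnostic("Makefile"), "\n";                            # missing arguments are undef
print format_diagnostic("Makefile", 12, "Unknown.", "extra"), "\n";   # the surplus argument is ignored
```

Without the two defined checks, the second call would only produce "uninitialized value" warnings at runtime, and only because warnings are enabled.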
Perl allows the programmer to define subroutines with a weak form of prototypes, which helps to catch calls to subroutines that provide a wrong number of arguments. This feature catches many bugs that are easily overlooked. The downside is that anything besides scalars as parameter types is difficult to understand and quickly leads to unexpected behavior. Therefore the subroutines in pkglint only use scalar parameters in their prototypes.

Oh, and by the way, the subroutine prototypes are only checked in certain situations like direct calls. In method calls, nothing is checked at all. Since almost all diagnostics are produced by calling $line->log_warning() or $line->log_error(), most of the subroutine calls in pkglint go unchecked.

Instead of using magic numbers, well-written code defines named constants for these numbers and then refers to them by name, giving the reader extra information that plain numbers could not give. Although the constant definitions look quite good in pkglint, there is one big caveat: the Perl programming language has no real constants. These definitions are rather shortcuts for defining functions that return the value of the constant. And as functions in Perl have package-wide scope, so have these constants. This is why namespace prefixes like SWST_ are necessary to avoid name clashes. Most of the constants would be written as an enumeration data type if Perl had one.

The same limitation applies to many of the classes (implemented as packages in Perl) that are simply structs. The typical Perl implementation of structs is classes, er, packages, which then use methods for accessing the fields. Again, the names of these methods are only checked at runtime, so there is no language support for detecting spelling mistakes in field names.

Another area where Perl fails to detect many errors is the loose type system.
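A small illustration of this looseness, as a sketch that is not taken from pkglint: the hash $line below is a hypothetical line record, and the second match uses the reference itself where its text field was meant.

```perl
use strict;
use warnings;

# Hypothetical line record; in pkglint this would be a Line object.
my $line = { fname => "Makefile", text => "PKGNAME=foo-1.0" };

# Intended: match against the text of the line.
my $intended = ($line->{text} =~ /^PKGNAME=/) ? 1 : 0;

# Mistake: match against the reference. Perl converts it to a string
# of the form "HASH(0x...)" and matches that, without any warning.
my $mistake = ($line =~ /^PKGNAME=/) ? 1 : 0;

my $stringified = "$line";
print "intended=$intended mistake=$mistake stringified=$stringified\n";
```

The mistaken match simply fails, so a check written this way never reports anything, and neither the compiler nor the warnings pragma points at the cause.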
You can apply almost every operator to almost every data type, and the Perl language will give you more or less what you want. In particular, it does not prevent you from matching a regular expression against a reference. It will simply compute a string representation of the reference and match the regular expression against that.

The current Perl interpreter is very inefficient when copying strings. This happens really often in pkglint, for example when passing arguments to functions or saving the result of a regular expression match in real variables. For a great speed-up, an implementation that handles string objects by reference-counting them would be better. (Lua comes to mind.)

Switching to another language

Switching to C++ is not an option, since the typing overhead would be more than twice the current amount. As a consequence, the code would become much less readable.

Switching to OCaml looks nice (because of the type inference), but the regular expressions that are provided by the system are by no means sufficient. On the other hand, as of today there is a PCRE package for OCaml in pkgsrc.