pkgsrc-ng/pkgtools/pkglint/files/doc/chap.code.xml

<!-- $NetBSD: chap.code.xml,v 1.5 2006/07/21 05:11:34 rillig Exp $ -->

<chapter id="code">
<title>Code structure</title>

	<para>In this chapter, I give an overview of how the &pkglint;
	code is organized, starting with the <function>main</function>
	function, passing the functions that check a single line and
	finally arriving at the infrastructure that makes writing the
	other functions easier.</para>

<sect1 id="code.overview">
<title>Overview</title>

	<para>The &pkglint; code is structured in modular, easy to
	understand procedures. These procedures can be further
	classified with respect to what they do. There are procedures
	that check a file, others check the lines of a file, again
	others check a single line. These classes of procedures are
	described in the following sections in a top-down
	fashion.</para>

	<para>If nothing special is said about which procedures call
	which others, you may assume that procedures of a certain rank
	only call procedures that are of a strictly lower rank. For
	example, no <function>checkline_*</function> will ever call
	<function>checkfile_*</function>. Sometimes, functions of the
	same rank are called, but these cases are documented
	explicitly.</para>

</sect1>

<sect1 id="code.select">
<title>Selecting the proper checking function</title>

	<para>The <function>main</function> procedure of &pkglint; is a
	simple loop around a TODO list containing pathnames of items (I
	couldn't think of a better name here). The decision of which
	checks to apply to a given item is done in
	<function>checkitem</function>, which checks whether the item is
	a file or a directory and dispatches the actual checking to
	specialized procedures.</para>

</sect1>

<sect1 id="code.dir">
<title>Checking a directory</title>

	<para>The procedures that check a directory are
	<function>checkdir_root</function> for the pkgsrc root
	directory, <function>checkdir_category</function> for a category
	of packages and <function>checkdir_package</function> for a
	single package.</para>

</sect1>

<sect1 id="code.file">
<title>Checking a file</title>

	<para>Since the dispatching for files requires much code, it has
	been put into a separate procedure called
	<function>checkfile</function>, which further dispatches the
	call to the other procedures.</para>

	<para>The procedures that check a specific file are
	<function>checkfile_ALTERNATIVES</function>,
	<function>checkfile_DESCR</function>,
	<function>checkfile_distinfo</function>,
	<function>checkfile_extra</function>,
	<function>checkfile_INSTALL</function>,
	<function>checkfile_MESSAGE</function>,
	<function>checkfile_mk</function>,
	<function>checkfile_patch</function> and
	<function>checkfile_PLIST</function>. For most of the
	procedures, it should be obvious to which files they are
	applied. A distinction is made between buildlink3 files and
	other <filename>Makefiles</filename>, as some additional checks
	apply to buildlink3 files. Of course, these procedures use
	pretty much the same code for checking, and this is where the
	<function>checklines_*</function> functions step in.</para>

	<para>The <function>checkfile_package_Makefile</function>
	function is somewhat special in that it expects four parameters
	instead of only one. This is because loading the package data
	has been separated from the actual checking.</para>

</sect1>

<sect1 id="code.lines">
<title>Checking the lines in a file</title>

	<para>This class of procedures consists of
	<function>checklines_trailing_empty_lines</function>,
	<function>checklines_package_Makefile_varorder</function> and
	<function>checklines_mk</function>. The middle one is too
	complex to be included in
	<function>checkfile_package_Makefile</function>, and the other
	ones are of so generic use that they deserved to be procedures
	of their own.</para>

	<para>The <function>checklines_mk</function> makes heavy use of
	the various <function>checkline_*</function> functions that are
	explained in the next chapter.</para>

</sect1>

<sect1 id="code.line">
<title>Checking a single line in a file</title>

	<para>This class of procedures checks a single line of a file.
	The number of parameters differs for most of these procedures,
	as some need more context information and others don't.</para>

	<para>The procedures that are applicable to any file type are
	<function>checkline_length</function>,
	<function>checkline_valid_characters</function>,
	<function>checkline_valid_characters_in_variable</function>,
	<function>checkline_trailing_whitespace</function>,
	<function>checkline_rcsid_regex</function>,
	<function>checkline_rcsid</function>,
	<function>checkline_relative_path</function>,
	<function>checkline_relative_pkgdir</function>,
	<function>checkline_spellcheck</function> and
	<function>checkline_cpp_macro_names</function>.</para>

	<para>The rest of the procedures is specific to
	<filename>Makefile</filename>s:
	<function>checkline_mk_text</function>,
	<function>checkline_mk_shellword</function>,
	<function>checkline_mk_shelltext</function>,
	<function>checkline_mk_shellcmd</function>,
	<function>checkline_mk_vartype_basic</function>,
	<function>checkline_mk_vartype_basic</function>,
	<function>checkline_mk_vartype</function> and
	<function>checkline_mk_varassign</function>.</para>

	<para>This class of procedures contains the most code in
	&pkglint;. The procedures that check shell commands and shell
	words both have around 200 lines, and the largest procedure is
	the check for predefined variable types, which has almost 500
	lines. But the code is not complex at all, since this procedure
	contains a large switch for all the predefined types. The checks
	for a single type usually fit on a single screen.</para>

</sect1>

<sect1 id="code.infrastructure">
<title>The &pkglint; infrastructure</title>

	<para>To keep the code in the checking procedures small and
	legible, an additional layer of procedures is needed that
	provides basic operations and abstractions for handling files as
	a collection of lines and to print all diagnostics in a common
	format that is suitable to further processing by software
	tools.</para>

	<para>Since October 2004, this part of &pkglint; makes use of
	some of the object oriented features of the Perl programming
	language. It has worked quite well upto now, but it has not been
	fun to write object-oriented code in Perl. The most basic
	feature I am missing is that the compiler checks whether an
	object has a specific method or not, as I have often written
	<code>$line->warning()</code> instead of
	<code>$line->log_warning()</code>. This makes refacturing quite
	difficult if you don't have a 100&nbsp;% coverage test, and I
	don't have that.</para>

	<para>The classes are all defined in the
	<varname>PkgLint</varname> namespace.</para>

	<para>The traditional class is <classname>Line</classname>,
	which represents a logical line of a file. In case of
	<filename>Makefile</filename>s, line continuations are parsed
	properly and combined into a single line. For all other files,
	each logical line corresponds to a physical line. The
	<classname>Line</classname> class has accessor methods to its
	fields <methodname>fname</methodname>,
	<methodname>lines</methodname> and
	<methodname>text</methodname>. It also has the methods
	<methodname>log_fatal</methodname>,
	<methodname>log_error</methodname>,
	<methodname>log_warning</methodname>,
	<methodname>log_info</methodname> and
	<methodname>log_debug</methodname> that all have one parameter,
	the diagnostics message. The other methods are used less
	often.</para>

	<para>In January 2006, the logging has been improved in
	functionality. Before that, a logical line could well consist of
	300 physical lines, so a diagnostic would say <quote>you have a
	bug somewhere between line 100 and 400</quote>. This is not
	helpful. Therefore, a new class has been invented that allows to
	map each character of a logical line to its corresponding
	physical location in the file. The new representation of a
	logical line is called a <classname>String</classname>. This
	feature is still experimental, since the only method for logging
	a string is <methodname>log_warning</methodname>. The others are
	still missing. It is also completely unclear how lines that have
	been fixed by &pkglint; are represented since this moves
	characters around in the physical lines.</para>

	<para>To make pattern matching with the new
	<classname>String</classname> easy to use, the additional class
	<classname>StringMatch</classname> has been created. It saves
	the result of a <classname>String</classname> that is matched
	against a regular expression. The canonical way to get such a
	<classname>StringMatch</classname> is to call the
	<methodname>String::match</methodname> method.</para>

	<para>Since the <classname>StringMatch</classname> was
	convenient to use, the <classname>SimpleMatch</classname> class
	represents the result of matching a Perl string against a
	regular expression. The class <classname>Location</classname> is
	currently unused.</para>

</sect1>
<sect1 id="code.style">
<title>Perl programming style</title>

	<para>The &pkglint; source code has evolved from FreeBSD's portlint,
	which has been written in Perl, and up to now, &pkglint; is written
	in Perl. Since one of the main ingredients to &pkglint; are regular
	expressions, this choice seems natural, and indeed the Perl regular
	expressions are a great help to keep the code short. But &pkglint;
	is more than just throwing regular expressions at the package
	files.</para>

	<para>In 2004, when the &pkglint; source code comprised about
	40&nbsp;kilobytes, this was quite appropriate. Since then, the code
	has become much more structured and various abstraction layers have
	been inserted. It became more and more clear that the Perl
	programming language has not been designed with nice-looking source
	code in mind.</para>

	<para>The first example are subroutines and their parameters. In
	most other languages, the names of the parameters are mentioned in
	the subroutine definition. Not so in Perl. The parameters to each
	subroutine are passed in the <literal>@_</literal> array. The usual
	way to get named parameters is to write assign the parameter array
	to a list of local variables. This extra statement is a nuisance,
	but it is merely syntactical.</para>

	<para>More serious is the way the arguments are passed to a
	subroutine. Perl allows the programmer to define subroutines with a
	weak form of prototypes, which helps to catch calls to subroutines
	that provide a wrong number of arguments. This feature catches many
	bugs that are easily overlooked. The downside is that anything
	besides using scalars as parameter types is difficult to understand
	and quickly leads to unexpected behavior. Therefore the subroutines
	in &pkglint; only use this style for parameter passing. Oh, and by
	the way, the subroutine prototypes are only checked for in certain
	situations like direct calls. In method calls, nothing is checked at
	all. Since almost all diagnostics are produced by calling
	<code>$line->log_warning()</code> or
	<code>$line->log_error()</code>, most of the subroutine calls in
	&pkglint; go unchecked.</para>

	<para>Instead of using magic numbers, well written code defines
	named constants for these numbers and then refers to them using
	their names, giving the reader extra information that plain numbers
	could not give. Although the constant definitions look quite good in
	&pkglint; there is one big caveat. The Perl programming language
	does not know constants. So these definitions are rather shortcuts
	for defining functions that return the value of the constant. And as
	functions in Perl have package-wide scope, so have these constants.
	This is why the namespace prefixes like <varname>SWST_</varname> are
	necessary to avoid name clashes.</para>

	<para>Most of the constants would be written as an enumeration data
	type if Perl had one. The same limitation applies for many of the
	classes (implemented as packages in Perl) that are simply structs.
	The typical Perl implementation of structs are classes, er, packages
	which then use methods for accessing the fields. Again, the names of
	these methods are only checked at runtime, so there is no language
	support for detecting spelling mistakes in field names.</para>

	<para>Another area where Perl fails to detect many errors is the
	loose type system. You can apply almost every operator to almost
	every data type, and the Perl language will give you more or less
	what you want. Especially it does not prevent you from matching
	a regular expression against a reference. It will simply compute
	a string representation of the reference and match the regular
	expression against that.</para>

	<para>The current Perl interpreter is very inefficient when
	copying strings. This happens really often in pkglint, for
	example when passing arguments to functions or saving the result
	of a regular expression match in <quote>real</quote> variables.
	For a great speed-up, an implementation that handles string
	objects by reference-counting them would be better. (Lua comes
	to mind.)</para>

</sect1>
<sect1 id="code.lang">
<title>Switching to another language</title>

	<para>Switching to C++ is not an option, since the typing
	overhead would be more than twice the current amount. As a
	consequence the code would become much less readable.</para>

	<para>Switching to OCaml looks nice (because of the type
	inference), but the regular expressions that are provided by the
	system are by no means sufficient. On the other hand, since
	today there is a PCRE package for OCaml in pkgsrc.</para>

</sect1>
</chapter>