
The IPTC (news) daemon

Overview

This program reads and stores IPTC messages as defined by "The IPTC Recommended Message Format", 1995 (TEC 7901 R5). This format is used, for example, by news agencies to distribute their news to their customers, i.e. newspapers, radio, TV and others.

According to that document such a message is composed of four sections:

  1. pre-header information
  2. message header
  3. message text
  4. post-text information

The format elements are separated by the special control characters SOH, STX or ETX and terminated by EOT. So in theory a message is built from the following parts:

  MESSAGE HEADER
  Start of Header         SOH             SOH
  Source Identification   byn             one, two or three alphas
  Message Number          0178            three or four numerals
  Field Separator         SP              SP
  Priority of Story       2               one numeral (1-6)
  Field Separator         SP              SP
  Category of Story       pol             one, two or three alphas
  Field Separator         SP              SP
  Word Count              195             one to four numerals
  Field Separator         SP              SP
  Optional Information    any characters  up to 50 characters (optional)
  Field Separator         CR LF           CR LF (additional LF optional)
  Keyword/Catch-Line      any characters  up to 69 characters
  Field Separator         CR LF           CR LF (additional LF optional)
  Start of Text           STX             STX

  MESSAGE TEXT            Text            Text

  POST-TEXT INFORMATION
  End of Text             CR LF ETX       CR LF ETX
  Date and Time           071045          six numerals (two each for day, hour, minute)
  Field Separator         SP              SP (optional)
  Time Zone               GMT             three alphas (optional)
  Field Separator         SP              SP (optional)
  Month of Transmission   jan             three alphas (optional)
  Field Separator         SP              SP (optional)
  Year of Transmission    91              two numerals (optional)
  Msg Separation Pattern  CR LF LF LF     up to 32 characters (optional)
  End of Transmission     EOT             EOT
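
The layout above can be turned into a small parser. The following is only a minimal sketch in Python, not the daemon's actual IptcReader logic: the names MESSAGE_RE and parse_message are made up for this illustration, the ASCII decoding is an assumption, and the optional post-text fields (time zone, month, year) are simply ignored.

```python
import re

# One message, following the field table above.
# Control characters: SOH = 0x01, STX = 0x02, ETX = 0x03.
MESSAGE_RE = re.compile(
    rb"\x01"                                  # Start of Header
    rb"(?P<source>[A-Za-z]{1,3})"             # Source Identification
    rb"(?P<number>[0-9]{3,4}) "               # Message Number + SP
    rb"(?P<priority>[1-6]) "                  # Priority of Story + SP
    rb"(?P<category>[A-Za-z]{1,3}) "          # Category of Story + SP
    rb"(?P<words>[0-9]{1,4}) "                # Word Count + SP
    rb"(?P<optinfo>[^\r\n]{0,50})\r\n\n?"     # Optional Information + CR LF (LF)
    rb"(?P<catchline>[^\r\n]{0,69})\r\n\n?"   # Keyword/Catch-Line + CR LF (LF)
    rb"\x02"                                  # Start of Text
    rb"(?P<text>.*?)\r\n\x03"                 # Message Text + CR LF ETX
    rb"(?P<daytime>[0-9]{6})",                # day, hour, minute
    re.DOTALL,
)


def parse_message(raw):
    """Return the fields of the first well-formed message in raw, or None."""
    match = MESSAGE_RE.search(raw)
    if match is None:
        return None
    # Assumption: fields are decoded as ASCII; unknown bytes are replaced.
    fields = {key: value.decode("ascii", "replace")
              for key, value in match.groupdict().items()}
    for key in ("number", "priority", "words"):
        fields[key] = int(fields[key])
    return fields
```

Because the pattern is searched for rather than anchored at the start of the input, garbage in front of the SOH character is skipped. Real-world feeds deviate from the table, so a strict pattern like this one would reject many messages that a more tolerant reader accepts.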

"In theory" means, that in reality (a) lots of, hmm, garbage can be found on line and (b) especially the fields marked as optional (but not only them) are used in, well, different ways by different sources.

Startup

After being started, the program first parses the optional command-line arguments before detaching from the controlling terminal. Then it writes a PID file to support the usual start/stop mechanism of SYSV-like systems. That done, signal handlers are configured and set up and the program's main thread is created. It in turn reads all its config data from the config file and creates and configures the worker threads accordingly. If this step succeeds, the program's main() function actually starts the main thread, which in turn starts all the workers. If no errors occur from now on the program runs forever, well ..., at least as long as it's not aborted, hmm, externally.

The config file

A stripped down version of the configuration file looks like this (see comments below), where all section names and keywords as well as file- and pathnames or configuration values are case-sensitive:

    [general]
    loglevel = 8
    minmsgsize = 32

    [backup]
    indir = /var/opt/iptcd/backin
    outdir = /var/opt/iptcd/backout
    capture = /var/opt/iptcd/old

    [sql0]
    hostname = localhost
    database = IptcDB
    username = iptcw
    password = iptcw
    port = 0

    [sql1]
    hostname = host2.my.domain
    database = IptcDB
    username = iptcw
    password = iptcw
    port = 0

    [sql2]
    hostname = host3.my.domain
    database = IptcDB
    username = iptcw
    password = iptcw
    port = 0

    [sql3]
    hostname = host4.my.domain
    database = IptcDB
    username = iptcw
    password = iptcw
    port = 0

    [port0]
    device = /dev/ttyC0
    baudrate = 300
    databits = 7
    stopbits = 1
    parity = e
    flowcontrol = n
    capturedir = /var/opt/iptcd/capture

    [port1]
    device = /dev/ttyC1
    baudrate = 4800
    databits = 8
    stopbits = 1
    parity = n
    flowcontrol = n
    capturedir = /var/opt/iptcd/capture

    [port2]
    device = /dev/ttyC2
    baudrate = 4800
    databits = 8
    stopbits = 1
    parity = n
    flowcontrol = n
    capturedir = /var/opt/iptcd/capture

    [port3]
    device = /dev/ttyC3
    baudrate = 4800
    databits = 7
    stopbits = 1
    parity = e
    flowcontrol = n
    capturedir = /var/opt/iptcd/capture

Some explanations of the settings:

[general]

This section contains the following entries:

loglevel =

This setting decides how many status/info/error messages are sent to syslog. One should start with 8 and, after watching the syslogs for some days, slowly decrease it down to 3.

Note: Values smaller than 3 make you miss even serious error messages - not recommended!

minmsgsize =

This is the minimum number of characters an incoming message must have in its body text to be processed as usual. Setting this to a reasonable value avoids processing all the no-news "control messages" some agencies regularly send.

Note that this is not a "technical" setting but a kind of "political" decision: While one customer may be interested in all messages, even control messages, another might not. Consider as well that sometimes there is real, valid news that is really short, like the result of a sports match. So be careful not to set this value too high.

[backup]

This section contains the following entries:

indir =

This is the directory where "simple" files are read from by the Backup Reader.

Note: Only files with a .msg extension are read and parsed. Note too, that these files will be deleted once they are completely read, hence only copy files to this directory instead of moving them. For further information about this format refer to the dfg::Iptc2SimpleThread class documentation.

outdir =

This is the directory where "simple" files are written to by the Backup Writer.

The files are stored with a .msg name extension. The writer thread will create subdirectories with a YYYY/MM/DD structure so each "leaf" directory will hold the incoming messages of exactly one day. For further information about this format refer to the dfg::Iptc2SimpleThread class documentation.

Either this feature or the capturing of incoming data (or both) should be enabled as a kind of safety belt.

capture =

This is a directory containing original port captured IPTC data files to be read by the Capture File Reader.

Only files with a .iptc extension are read and parsed! Note too, that these files will be deleted once they're read, hence only copy files to this directory instead of moving them.

This is a kind of "backward compatibility" setting that's normally not used. It's provided to allow for reading IPTC captures that some old DOS-based software produced. In such an environment this setting (i.e. the associated reader thread) allows for softly migrating from such software to this SQL-based solution.

Starting with iptcd version 1.2 the serial ports may be configured to capture all incoming data to a file. Those files can be read as well by the Capture File Reader (after copying them into this directory and renaming them to carry the .iptc extension).

[sqlX]

These sections (where X refers to the number 0, 1, 2 or 3) each contain the following entries:

hostname =

The hostname/IP-address of the database server to contact.

Note that the daemon does not check this name in any way but simply passes it to the SQL Writer thread when it is created. There the database connection object will try to resolve this name. If you haven't set up a proper DNS system (or if it's down), resolving (i.e. getting the IP address of the named host) may fail, resulting in error messages sent to syslog(3). The only workaround (apart from correcting typos) is to use the IP address of the destination host instead of its unresolvable name.

database =

Name of database to use for storing the messages.

Note that this database must already exist on the given host (see above) and the database server must be up and running when the iptcd daemon starts.

password =

The password to use for database connection.

Note that this password (for the given user, see below) must be already configured at the database server when the iptcd daemon starts.

username =

The username to use with the database connection.

Note that this user (for the given password, see above) must be already configured at the database server when the iptcd daemon starts.

port =

The IP port to use for database connection (0 == default port).

Please note that it's up to the DBMS administrator to set up the database(s) including all needed permissions (GRANT etc., see the Database chapter below). This program simply assumes the named database(s) to be usable for the given name/password pair(s). In case this condition isn't met, there will show up, er, error messages in syslog(3) ...

For backward compatibility with older versions (where there was only a single SQL writer) it's still possible to use a single [sql] section (i.e. without a trailing number). In this case only this section is used to set up just one SQL writer. But note that this feature will be removed in future versions. Therefore one should use the [sql0] section to set up a single writer and remove the other [sql{1,2,3}] sections if they're not needed.
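
How the SQL sections might be selected can be sketched with Python's configparser. This is an illustration only, not the daemon's own config reader: the function name sql_writer_sections is invented here, and giving the legacy [sql] section precedence over the numbered ones is merely one reading of "only this section is used".

```python
import configparser


def sql_writer_sections(config_text):
    """Return one settings dict per SQL writer described in config_text."""
    parser = configparser.ConfigParser()
    # configparser lower-cases keywords by default; the iptcd config is
    # case-sensitive, so keep the keywords exactly as written.
    parser.optionxform = str
    parser.read_string(config_text)

    if parser.has_section("sql"):
        # Legacy layout: a single unnumbered section, one writer only.
        names = ["sql"]
    else:
        names = [f"sql{i}" for i in range(4) if parser.has_section(f"sql{i}")]
    return [dict(parser[name]) for name in names]
```

All values come back as strings, so an entry like port = 0 would still have to be converted before use.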

[portX]

These sections (where X refers to a digit between 0 and 7 inclusive) each contain the following entries:

device =

The filename of the serial port to read, e.g. /dev/ttyS0.

If this entry is missing, the given name is invalid or can't be opened or is already locked by another running process, the current [portX] config section is skipped.

baudrate =

The transmission speed to use e.g. 9600 baud.

Possible values are: 50, 75, 110, 150, 300, 600, 1200, 2400, 4800, 9600, 19200. Default (if entry is missing or empty) is 9600. Please note, that speeds like 19200 may cause problems when used w/o handshake (see flowcontrol below).

databits =

The number of databits to expect: 5, 6, 7 or 8.

stopbits =

The number of stop bits to use: 1 or 2.

parity =

The kind of parity bit to use: e (even), o (odd) or n (none).

flowcontrol =

The handshake to use: h (hardware), s (software) or n (none).

Note: All devices (I've seen so far) writing out IPTC data via serial port do not use any kind of flowcontrol, neither CTS/RTS (i.e. hardware) nor XON/XOFF (i.e. software). So as long as this program's setup doesn't match the other device's exactly, only garbage (and lots of error messages) will be received. But this does not mean that the database(s) will be flooded with garbage: The SQL writer thread(s) won't notice this situation at all since they'll get only those messages that the IPTC reader threads consider valid (i.e. for which they received at least a parseable msg-header and msg-body).
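
The value ranges listed for the serial settings can be checked up front. The sketch below is hypothetical (check_port_section is not part of iptcd): it applies the documented baudrate default of 9600 and, as an assumption, raises ValueError for any value outside the documented sets, since this page does not say how the daemon reacts to such values.

```python
VALID_BAUDRATES = {50, 75, 110, 150, 300, 600, 1200, 2400, 4800, 9600, 19200}
VALID_CHOICES = {
    "databits": {"5", "6", "7", "8"},
    "stopbits": {"1", "2"},
    "parity": {"e", "o", "n"},
    "flowcontrol": {"h", "s", "n"},
}


def check_port_section(section):
    """Validate a [portX] dict of strings; return a copy with an int baudrate."""
    checked = dict(section)

    # Documented default: 9600 if the entry is missing or empty.
    raw_baud = section.get("baudrate", "").strip()
    baudrate = int(raw_baud) if raw_baud else 9600
    if baudrate not in VALID_BAUDRATES:
        raise ValueError(f"unsupported baudrate: {baudrate}")
    checked["baudrate"] = baudrate

    # Only entries that are present are checked against the documented values.
    for key, allowed in VALID_CHOICES.items():
        if key in section and section[key] not in allowed:
            raise ValueError(f"invalid {key}: {section[key]!r}")
    return checked
```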

capturedir =

This optional keyword allows specifying a directory that's used to save all characters read from the serial port to a file. If this entry is missing or empty, this feature is disabled.

The name given here is used as a base directory; the current section name (e.g. port0) is internally appended and subdirectories YYYY/MM/ (year/month) are created. There the data is stored day-wise in files named after the current day of reading (which may differ from the date given within an IPTC message read) with a ".capture" extension. This results in filenames like /var/opt/iptcd/capture/port1/2003/12/25.capture (the data that were read on 2003-12-25 by the reader configured in section [port1] with a capturedir value of /var/opt/iptcd/capture).
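
The naming scheme can be expressed in a few lines. This is just an illustration of the path layout described above; capture_path is a made-up helper, not a function of the daemon.

```python
from datetime import date
from pathlib import PurePosixPath


def capture_path(capturedir, section, day):
    """Build <capturedir>/<section>/YYYY/MM/DD.capture for the given day."""
    return (PurePosixPath(capturedir) / section
            / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}.capture")


# The example from the text: data read on 2003-12-25 by the [port1] reader.
example = capture_path("/var/opt/iptcd/capture", "port1", date(2003, 12, 25))
```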

The possibility to capture all incoming data to a file may be used for archival purposes. The capture files may be used as well by the Capture File Reader to (re-)read data that didn't make its way to its final destination (e.g. a database).

Either this feature or the Backup Writer (or both) should be enabled as a kind of safety belt.

Please note, that the section name (e.g. [port0]) is just that: a name. There's no relation with the serial device name (e.g. /dev/ttyC6 for the seventh port of a Cyclades multiport card). In other words: You may configure e.g. the fourth serial device in the [port1] section or any other combination/assignment.

The Program's main() Function

The main() function is the program's entry point, i.e. it's the point where the program's execution begins when it is started. This function has few things to do:

  1. First it checks for options given on the command line, namely:

        -c | --config {/path/name of configfile} (default: '/etc/opt/iptcd/config')
        -p | --pidfile {/path/name of PID file} (default: '/var/run/iptcd.pid')
        -v | --verbose   show start messages
        -h | --help      this little help
    

    None of these options needs to be given when starting the program, but they may be if the compiled-in defaults don't match the administrator's taste ...

  2. The second step is detaching from the controlling terminal. This is what "transforms" a normal program into a daemon program. To document its running state, a PID file is created (default: /var/run/iptcd.pid) which can be used by a SYSV-like start/stop mechanism.

    If a PID file already exists and holds the process ID of a running process, the (newly started) program writes an error message to STDERR and terminates.

  3. In the next step it creates a main thread instance and lets that object read all the configuration data. If that fails (e.g. as a consequence of a missing or incomplete config file) the whole program terminates with an errorlevel of -1.

  4. Otherwise (i.e. the setup is OK) the main() function creates and installs some signal handlers for all "the usual suspects".

  5. For the rest of its life, this function only waits for signals (such as SIGTERM) to arrive. In case this happens, the signal is caught and the main thread is informed (by calling its Abort() method).

  6. Finally the main thread instance is deleted and the PID file is removed. With returning from the main() function the whole program is terminated. - Finis ...
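
The PID file test from step 2 is commonly done by sending signal 0 to the recorded process. The following Python sketch shows that technique on a POSIX system; it is an assumption about how such a check can be written, not a transcription of what iptcd does, and pid_file_is_live is a hypothetical name.

```python
import os


def pid_file_is_live(path):
    """Return True if path holds the PID of a currently running process."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable or garbled PID file: nothing is running.
        return False
    if pid <= 0:
        return False
    try:
        # Signal 0 performs only the existence/permission check (POSIX).
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # stale PID file, the process is gone
    except PermissionError:
        return True    # process exists but belongs to another user
    return True
```

A stale file (one whose process no longer exists) yields False, so a freshly started daemon could safely overwrite it.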

The Main Thread

This thread, started by the program's main() function, has only two jobs to deal with:

  1. At startup it reads all config data from the config file, then creates and configures the various worker threads. First the writers (see SQL Writer and Backup Writer) are created, since they internally create the IptcMessageQueue instances which are used as message queues. After getting those queues (by way of having the workers call a special method of their parent) the main thread creates the Port Readers, the Backup Reader and the Capture File Reader as far as they're configured in the config file.

    Each of the readers gets passed the IptcMessageQueue instances (acquired before from the writers) to store the incoming messages. After some validations (e.g.: is at least one reader and one writer completely configured and successfully created?) the worker threads are actually started, i.e. they begin doing their respective jobs.

  2. After that stage the main thread sleeps most of its time, only waiting for an abort signal. If such a signal arrives (by a call to the thread's Abort() method), all worker threads are informed at once to abort their respective work. Depending on the point of execution they have reached at this moment and on the processor speed of the machine the daemon is running on, it may take some hundred milliseconds for each worker to clean up and terminate.

    Once the workers have terminated (notifying their parent), the main thread itself terminates.
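
The reader/writer wiring described above can be modelled with standard Python threads and queues. Everything in this sketch is a stand-in: queue.Queue plays the role of IptcMessageQueue, the Writer and Reader classes are invented for the example, and draining the queue before exiting on abort is an assumption, not documented behaviour of the daemon.

```python
import queue
import threading


class Writer(threading.Thread):
    """Owns its inbox queue and stores every message it receives."""

    def __init__(self, name):
        super().__init__(name=name)
        self.inbox = queue.Queue()      # stand-in for IptcMessageQueue
        self.abort = threading.Event()
        self.stored = []

    def run(self):
        # Keep going until aborted AND the inbox has been drained.
        while not self.abort.is_set() or not self.inbox.empty():
            try:
                message = self.inbox.get(timeout=0.05)
            except queue.Empty:
                continue
            self.stored.append(message)


class Reader(threading.Thread):
    """Hands every incoming message to all writer queues."""

    def __init__(self, messages, queues):
        super().__init__()
        self.messages = messages
        self.queues = queues

    def run(self):
        for message in self.messages:
            for inbox in self.queues:
                inbox.put(message)


def run_once(messages):
    # Writers first, because they own the queues the readers need.
    writers = [Writer("sql"), Writer("backup")]
    reader = Reader(messages, [w.inbox for w in writers])
    for worker in writers:
        worker.start()
    reader.start()
    reader.join()
    for worker in writers:      # inform all workers at once
        worker.abort.set()
    for worker in writers:
        worker.join()
    return {w.name: w.stored for w in writers}
```

Calling run_once(["a", "b"]) hands both messages to each of the two writers, which mirrors the idea that a reader can forget about a message as soon as it has been queued.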

The Worker Threads

Depending on the actual configuration (the config file mentioned above) there can be several men at work:

Up to 8 Serial Port Readers

These worker threads represent the usual input sources. They open the serial device file according to the config file setting (discussed above) and wait for data to come in. To actually read and parse the incoming data they internally use an IptcReader instance which implements all the logic to handle the IPTC data. Once a message is read it gets stored in an IptcMessage instance which is then added to the IptcMessageQueue instances that are provided by the configured writer threads (mainly the SQL Writer and the Backup Writer). As far as the reader thread is concerned, it can now forget about this message and go on reading, parsing and storing the next incoming news.

Backup Reader

This reader thread works the same way as the Port Reader. The first difference is that it gets its data from a normal text file instead of a serial port. The other difference is the format of the file: It's a simple format designed to be easy to parse even for third-party tools (such as PHP scripts). For further information refer to the dfg::Iptc2SimpleThread class documentation.

This facility is meant for situations where messages read from a serial port didn't make it into the database (e.g. because of a system or maintenance shutdown of the database server). In such a situation the administrator may copy the files written by the Backup Writer (see below) into the configured directory (see [backup] indir= above).

While not essential for the daemon to work, this thread should not (but may) be disabled, just for safety. Otherwise the daemon would have to be restarted in the situations mentioned, which would most probably result in additional data loss. Just let it sleep, but at least let it live (for your own sake)!

Capture File Reader

As mentioned earlier this is mainly an aid for migrating from older software which writes IPTC data captures to disc.

Like the Backup Reader this thread reads files from a directory configured in the config file (see [backup] capture= above). And like the Port Reader it uses an IptcReader instance to parse the news messages before storing them in the IptcMessageQueue instances it got at startup. So as far as the writer threads (see SQL Writer and Backup Writer below) are concerned, it makes no difference where the messages originated in the first place, a serial port or a file.

During normal operation this reader thread will usually be disabled (by commenting out, renaming or removing the config entry), but it doesn't hurt to have this thread activated: It will just spend most of its life sleeping, only occasionally interrupted by a quick look at the directory. What a life!

Backup Writer

This worker is the counterpart of the Backup Reader discussed above. It writes all incoming messages (which it gets through the internal message queue) to the configured directory.

While this thread is not essential, it should not get disabled since it's meant as a kind of safeguard against unforeseen failures; see the discussion above.

Up to 4 SQL Writers

Since the main goal in developing this daemon was to provide a database connection, this worker thread is kind of the "heart" - well, at least a, hmm, hand - of the whole program. It reads one message at a time from its queue and sends it, embedded in SQL statements, to the configured database server.

It checks for duplicates (e.g. as a consequence of processing old messages: see Backup Reader and Capture File Reader above) and then inserts the data into the different tables, working hard to maintain referential integrity between the tables. That done, it picks up the next message and, well, so on ... In case there's no message ready to send to the database, the thread - like its siblings - goes, er, sleeping ...
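
The check-then-insert pattern can be sketched as follows. To keep the example self-contained it uses Python's built-in sqlite3 module instead of a MySQL connection, and a cut-down version of the tNews and tText tables from the Database chapter; the function names and the way the message ID is supplied are assumptions made for illustration, not the SQL Writer's real code.

```python
import sqlite3


def demo_connection():
    """In-memory stand-in database with reduced tNews/tText tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tNews (fMID TEXT PRIMARY KEY, fCatchline TEXT)")
    conn.execute("CREATE TABLE tText (fMID TEXT PRIMARY KEY, fText TEXT NOT NULL)")
    return conn


def store_message(conn, mid, catchline, text):
    """Insert a message unless its ID is already known; return True if stored."""
    known = conn.execute("SELECT 1 FROM tNews WHERE fMID = ?", (mid,)).fetchone()
    if known is not None:
        return False    # duplicate, e.g. re-read from a backup or capture file
    # Both inserts run in one transaction so head and body stay consistent.
    with conn:
        conn.execute("INSERT INTO tNews (fMID, fCatchline) VALUES (?, ?)",
                     (mid, catchline))
        conn.execute("INSERT INTO tText (fMID, fText) VALUES (?, ?)",
                     (mid, text))
    return True
```

Note that the MyISAM tables created by the CreateDB script do not support transactions, so on such a setup the consistency between the tables has to be maintained by the writer itself.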

The Database

The daemon (i.e. the SQL Writer) expects a database with a given structure to add the incoming messages to. Such a database may be created with the following CreateDB SQL script:

CREATE DATABASE IF NOT EXISTS IptcDB;

USE IptcDB;

DROP TABLE IF EXISTS tNews;
CREATE TABLE tNews (
    fMID VARCHAR(28) NOT NULL,
    fDateTime DATETIME NULL default '0000-00-00 00:00:00',
    fPriority ENUM('0', '1', '2', '3', '4', '5', '6') NOT NULL,
    fMsgNum SMALLINT(4) UNSIGNED ZEROFILL NOT NULL,
    fWordCount SMALLINT(5) UNSIGNED ZEROFILL NOT NULL,
    fCatchline VARCHAR(255) default '', -- optional field
    fOptInfo VARCHAR(50) default '',    -- optional field
    PRIMARY KEY (fMID),
    FULLTEXT ixNewsCatchline (fCatchline),
    KEY ixNewsDateTime (fDateTime),
    KEY ixNewsPriority (fPriority),
    KEY ixNewsMsgNum (fMsgNum),
    FULLTEXT ixNewsOptInfo (fOptInfo),
    KEY ixNewsWordCount (fWordCount)
) TYPE=MyISAM PACK_KEYS=1 COMMENT='IPTC messages (head/foot)';

DROP TABLE IF EXISTS tText;
CREATE TABLE tText (
    fMID VARCHAR(28) NOT NULL,
    fText LONGTEXT NOT NULL,
    PRIMARY KEY (fMID),
    FULLTEXT ixTextText (fText)
) TYPE=MyISAM PACK_KEYS=DEFAULT COMMENT='IPTC messages (body)';

DROP TABLE IF EXISTS tCategory;
CREATE TABLE tCategory (
    fCID CHAR(3) NOT NULL,
    fCName VARCHAR(255) default NULL,
    PRIMARY KEY (fCID),
    UNIQUE KEY ixCategoryCName (fCName)
) TYPE=MyISAM PACK_KEYS=1 COMMENT='used categories';

DROP TABLE IF EXISTS tNewsCategory;
CREATE TABLE tNewsCategory (
    fMID VARCHAR(28) NOT NULL,
    fCID CHAR(3) NOT NULL,
    KEY ixNewsCategoryMID (fMID),
    KEY ixNewsCategoryCID (fCID)
) TYPE=MyISAM PACK_KEYS=1 COMMENT='relation tNews<=>tCategory';

DROP TABLE IF EXISTS tSender;
CREATE TABLE tSender (
    fSID CHAR(3) NOT NULL,
    fSName VARCHAR(255) default NULL,
    PRIMARY KEY (fSID),
    KEY ixSenderSName (fSName)
) TYPE=MyISAM PACK_KEYS=1 COMMENT='sender (news agency)';

DROP TABLE IF EXISTS tNewsSender;
CREATE TABLE tNewsSender (
    fMID VARCHAR(28) NOT NULL,
    fSID CHAR(3) NOT NULL,
    KEY ixNewsSenderMID (fMID),
    KEY ixNewsSenderSID (fSID)
) TYPE=MyISAM PACK_KEYS=1 COMMENT='relation tNews<=>tSender';

DROP TABLE IF EXISTS tKeyWords;
CREATE TABLE tKeyWords (
    fWID BIGINT UNSIGNED NOT NULL auto_increment,
    fWord VARCHAR(64) NOT NULL,
    PRIMARY KEY (fWID),
    UNIQUE KEY ixKeyWordsWord (fWord)
) TYPE=MyISAM PACK_KEYS=DEFAULT COMMENT='used keywords';

DROP TABLE IF EXISTS tNewsKeys;
CREATE TABLE tNewsKeys (
    fMID VARCHAR(28) NOT NULL,
    fWID BIGINT UNSIGNED NOT NULL,
    KEY ixNewsKeysMID (fMID),
    KEY ixNewsKeysWID (fWID)
) TYPE=MyISAM PACK_KEYS=DEFAULT COMMENT='relation tNews<=>tKeyWords';

FLUSH TABLES;   -- write all to disk

SET sql_log_off=1;  -- disable logging of username/passwords

-- set new privileges of default users for this DB
-- (Note that these are database users, not system/shell users!)
-- writing user:
GRANT Select on IptcDB.* to 'iptcw'@'localhost' IDENTIFIED BY 'iptcw';  -- set dummy
REVOKE ALL on IptcDB.* from 'iptcw'@'localhost';    -- clear all previous privileges
GRANT Select,Insert,Update,Delete,Lock Tables
    on IptcDB.* to 'iptcw'@'localhost' IDENTIFIED BY 'iptcw';   -- set 'real'

-- reading user:
GRANT Select on IptcDB.* to 'iptcr' IDENTIFIED BY 'iptcr';
REVOKE ALL on IptcDB.* from 'iptcr';
GRANT Select on IptcDB.* to 'iptcr' IDENTIFIED BY 'iptcr';

FLUSH PRIVILEGES;   -- write all to disk

SET sql_log_off=0;  -- enable logging again

Calling this script at a shell prompt like

    $> mysql <CreateDB.sql

causes a MySQL server to create all tables and indices. If you're not familiar with SQL please refer to the MySQL documentation for details.

As can be seen, there are two users configured to access the newly created database: iptcw and iptcr. While the latter has only read permissions (and is meant to be used for data retrieval), the former may also update the tables. For security reasons this writing user account may be used only from localhost. It's up to your DBMS administrator to change these settings, but make sure the username/password pair of the writing database user matches the one given in the iptcd daemon's config file.

ChangeLog

Author:
Matthias Watermann
See also:
dfg::IptcReader , dfg::IptcMessage , dfg::IptcMessageQueue

dfg::Iptc2SimpleThread , dfg::Iptc2SqlThread

Generated on 17 Jun 2005 for project iptcd+ with
Doxygen 1.3.7 corrected by sed and HTMLtidy  -x-